This chapter discusses some key properties of the Ordinary Least Squares (OLS) estimator, focusing on those most relevant for understanding its behavior in practice. Furthermore, this chapter introduces one of the most important theorems in Econometrics, the Frisch-Waugh-Lovell theorem, which provides a useful way to employ OLS in practice.
4.2 Orthogonal Projections
The OLS estimator can be interpreted in terms of orthogonal projections. OLS decomposes the dependent variable \(Y\) into two components: one that is explained by the regressors \(X\) and another that is not. We can write this decomposition as:
\[
Y = P_XY + (I - P_X)Y = P_X Y + M_X Y = X\hat{\beta} + \hat{U},
\tag{4.1}\]
where \(I\) is the identity matrix, \(P_XY = X\hat{\beta}\) is the part of \(Y\) that is explained by \(X\), and \((I - P_X)Y = \hat{U}\) is the part of \(Y\) that is not explained by \(X\).
Above, the matrix \(P_X\) is given by: \[
P_X = X(X'X)^{-1}X',
\tag{4.2}\]
which is called the projection matrix onto the column space of \(X\). The matrix \(P_X\) projects \(Y\) into the space spanned by the columns of \(X\).
Furthermore, we define the matrix \(M_X\), or maker of residuals, as:
\[
M_X := (I - P_X),
\tag{4.3}\]
which is a projection matrix onto the orthogonal space to the columns of \(X\).
4.2.1 Properties of Projection Matrices
As the descriptions above suggest, \(P_X\) and \(M_X\) are projection matrices. They are symmetric and idempotent, as the reader is asked to verify in Exercise 4.1.
Moreover, they are complementary in the sense that \(P_XM_X = 0\) and \(P_X + M_X = I\). Hence, they are orthogonal projections.
Finally, note that \(P_XX = X\) and \(M_XX = 0\). The last two properties are intuitive, as \(P_X\) projects into the space spanned by the columns of \(X\), while \(M_X\) projects into the orthogonal space to the columns of \(X\). We can explain \(X\) perfectly with itself, and there is no part of \(X\) that is orthogonal to itself.
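These properties are easy to verify numerically. The following is a minimal check, assuming an arbitrary simulated design matrix; all displayed norms should be zero up to floating-point error.

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N, K = 10, 2
X = rand(Normal(0, 1), N, K)   # arbitrary full-rank design matrix
P = X * ((X'X) \ X')           # projection matrix P_X
M = I - P                      # residual maker M_X

# Symmetry, idempotency, complementarity, P_X X = X, and M_X X = 0
display(["‖P - P'‖ = $(norm(P - P'))";
         "‖P*P - P‖ = $(norm(P*P - P))";
         "‖P*M‖ = $(norm(P*M))";
         "‖P*X - X‖ = $(norm(P*X - X))";
         "‖M*X‖ = $(norm(M*X))"])
```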
4.2.2 Consequences of the Decomposition
From Equation 4.1 note that the \(P_XY\) and \(M_XY\) projections can be represented by a right-angled triangle, where the hypotenuse is \(Y\), the projection of \(Y\) onto \(X\) is \(P_XY\), and the projection of \(Y\) onto the orthogonal space to \(X\) is \(M_XY\).
By Pythagoras’ Theorem, a direct consequence of this decomposition is that the length of the explained part is smaller than or equal to the length of the dependent variable:
\[
||P_XY||^2 \leq ||Y||^2.
\tag{4.4}\]
The inequality above is the basis for the coefficient of determination, which we will discuss later.
Moreover, by construction, since the residuals minimize Equation 3.5, their length is smaller than or equal to the length of the error term: \[
||\hat{U}||^2 \leq ||U||^2.
\tag{4.5}\]
This difference is further explored when finding an unbiased estimator for the variance of the error term.
A final property of the projections is that the residuals are orthogonal to the regressors:
\[
X'\hat{U} = X'M_XY = 0,
\tag{4.6}\]
where the last equality follows from the fact that \(X'M_X = 0\).
The orthogonality of the residuals to the regressors is a key property of the OLS estimator and does not depend on the true error term being orthogonal to the regressors. That is, Equation 4.6 holds regardless of whether the regressors are exogenous or not.
A direct consequence of this is that if the regressors include a constant, then the residuals sum to zero, \(\sum_{t=1}^N \hat{U}_t = 0\). Hence, the residuals have mean zero, regardless of the expected value of the error term.
Example 4.1 The following example illustrates the mean-zero property of the residuals. We generate data from a simple regression model where the error term has a non-zero mean. We then estimate the model using OLS and show that the residuals have a mean of zero. In this example, the errors are generated from a normal distribution with a mean of 1 and a standard deviation of 1.
```julia
using Random, Distributions, Statistics

Random.seed!(123)
N = 100
U = rand(Normal(1, 1), N)   # error term
X = rand(Normal(10, 1), N)  # regressor
Y = X + U                   # regressand
β = (X'*X) \ (X'*Y)         # OLS estimator
resid = Y - X*β             # residuals
display(["Mean of the errors: $(mean(U))";
         "Mean of the residuals: $(mean(resid))"])
```
2-element Vector{String}:
"Mean of the errors: 0.951722851246017"
"Mean of the residuals: -0.0014347501176984246"
Mean-zero property of the residuals
In the output, we can see that the mean of the errors is approximately 1, while the mean of the residuals is approximately 0, illustrating the mean-zero property of the residuals.
In the code, we have used the $ sign to interpolate the values of the means into the displayed strings.
4.3 The Frisch-Waugh-Lovell Theorem
The Frisch-Waugh-Lovell (FWL) theorem provides a useful way to understand the OLS estimator in the presence of multiple regressors. The theorem states that the OLS estimator of a subset of the regressors can be obtained by first regressing the dependent variable and that subset of regressors on the remaining regressors, and then regressing the residuals of the dependent variable on the residuals of the subset of regressors. It is a powerful result that allows us to understand the OLS estimator in terms of partial correlations.
4.3.1 Orthogonal Regressors
We are interested in analyzing the effect that partitioning the regressors has on the estimators. Assume we split the regressors into two groups, \(X = [X_1\ X_2]\), so that \[
Y = X_1\beta_1 + X_2\beta_2 + U.
\]
In general, the OLS estimator of \(\beta_1\) depends on \(X_2\). In fact, the OLS estimator of \(\beta_1\) when \(X_2\) is not included in the regression may be biased and inconsistent if \(X_1\) and \(X_2\) are correlated, see XXXX.
Nevertheless, we can show that, in the special case where \(X_1\) is orthogonal to \(X_2\), we obtain the same OLS estimate for \(\beta_1\) from the complete specification as from the reduced specification.
Lemma 4.1 Let \(X = [X_1\ X_2]\) be a partition of the regressors such that \(X_1'X_2 = 0\); that is, \(X_1\) is orthogonal to \(X_2\). Then, the OLS estimator of \(\beta_1\) in the regression: \[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] is identical to the OLS estimator of \(\beta_1\) in the regression: \[
Y = X_1\beta_1 + V.
\]
Proof. Since \(X_1'X_2 = 0\), the matrix \(X'X\) is block diagonal: \[
X'X = \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} = \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}.
\]
Now, using the formula for the inverse of a block diagonal matrix we have that, \[
(X'X)^{-1} = \begin{bmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{bmatrix}.
\]
Hence, \[
\hat{\beta} = (X'X)^{-1}X'Y = \begin{bmatrix} (X_1'X_1)^{-1}X_1'Y \\ (X_2'X_2)^{-1}X_2'Y \end{bmatrix},
\]
so that \(\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'Y = \hat{\beta}_1^{(r)}\), where \(\hat{\beta}_1^{(r)}\) is the OLS estimator of \(\beta_1\) in the regression \(Y = X_1\beta_1 + V\). \(\square\)
Corollary 4.1 Under the same conditions as in Lemma 4.1, the OLS estimator of \(\beta_2\) in the regression:
\[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] is identical to the OLS estimator of \(\beta_2\) in the regression:
\[
Y = X_2\beta_2 + V.
\]
Example 4.2 The following example illustrates the result in Lemma 4.1. We generate data from a regression model where the regressors are orthogonal. We then estimate the model using OLS and show that the estimates of \(\beta_1\) are the same whether we include \(X_2\) in the regression or not.
```julia
using Random, Distributions, Statistics, LinearAlgebra

# Set seed for reproducibility
Random.seed!(123)

# Generate data
N = 100
X1 = rand(Normal(1, 1), N)              # regressor
M = hcat(X1, Matrix{Float64}(I, N, N))  # We orthogonalize X2 with respect to X1 using QR decomposition
Q = qr(M).Q
X2 = Q[:, 2]                            # orthogonal regressor
Y = 2 * X1 + 3 * X2 + rand(Normal(0, 1), N)  # regressand

# Fit OLS models
X_full = [X1 X2]
β_full = (X_full'X_full) \ (X_full'Y)
X_reduced = X1
β_reduced = (X_reduced'X_reduced) \ (X_reduced'Y)

# Display estimates
display(["Estimate of β1 (full): $(β_full[1])";
         "Estimate of β1 (reduced): $(β_reduced[1])"])
```
2-element Vector{String}:
"Estimate of β1 (full): 2.0434141520775038"
"Estimate of β1 (reduced): 2.0434141520775047"
In the output, we can see that the estimates of \(\beta_1\) are the same whether we include \(X_2\) in the regression or not, illustrating the result in Lemma 4.1.
In the code, we used QR decomposition to orthogonalize \(X_2\) with respect to \(X_1\). This ensures that \(X_1\) and \(X_2\) are orthogonal, satisfying the conditions of the lemma. Note that we could have generated \(X_2\) in many other ways, as long as it is orthogonal to \(X_1\).
4.3.2 Partialling Out Regressors
As a consequence of Lemma 4.1, if we could somehow remove from \(X_2\) the part that is correlated with \(X_1\), then we could estimate \(\beta_2\) without including \(X_1\) in the regression. This is accomplished by partialling out \(X_1\) from \(X_2\) using the projection matrix \(M_{1} = M_{X_1} = I - P_{X_1}\).
Hence, the estimator of \(\beta_2\) from the regression:
\[
Y = M_{1}X_2\beta_2 + U,
\]
is numerically identical to the estimator of \(\beta_2\) from the complete regression.
Nonetheless, the residuals from this regression are not the same as the residuals from the full regression \(Y = X_1\beta_1 + X_2\beta_2 + U\). The residuals from the full regression are orthogonal to both \(X_1\) and \(X_2\), while the residuals from the regression on \(M_1X_2\) alone still contain the part of \(Y\) that lies in the space spanned by \(X_1\).
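A minimal sketch of this point, assuming a simple simulated data-generating process (a constant and one regressor in \(X_1\), plus a correlated \(X_2\)): the estimate of \(\beta_2\) from regressing \(Y\) on \(M_1X_2\) alone matches the full-regression estimate, while the residuals do not.

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N = 100
X1 = [ones(N) rand(Normal(0, 1), N)]         # first block: constant and one regressor
X2 = 0.5 * X1[:, 2] + rand(Normal(0, 1), N)  # second regressor, correlated with X1
Y = X1 * [1.0, 2.0] + 3.0 * X2 + rand(Normal(0, 1), N)

M1 = I - X1 * ((X1'X1) \ X1')  # residual maker of X1

# Full regression
X = [X1 X2]
β_full = (X'X) \ (X'Y)
resid_full = Y - X * β_full

# Regression of Y on the partialled-out X2 only
Z = M1 * X2
β2_partial = (Z'Z) \ (Z'Y)
resid_partial = Y - Z * β2_partial

display(["β2 (full): $(β_full[3])";
         "β2 (partialled-out X2): $(β2_partial)";
         "Max. difference in residuals: $(maximum(abs.(resid_full - resid_partial)))"])
```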
4.3.3 The FWL Theorem
To recover the same residuals as in the full regression, we need to partial out \(X_1\) from \(Y\) as well. This gives rise to the Frisch-Waugh-Lovell theorem.
Theorem 4.1 (Frisch-Waugh-Lovell) The OLS estimates of \(\beta_2\) in the regressions \[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] and \[
M_{1}Y = M_{1}X_2\beta_2 + U,
\] are numerically identical.
Moreover, the residuals in both regressions are numerically identical.
Proof. The estimate of \(\beta_2\) in \(M_{1}Y = M_{1}X_2\beta_2 + U\) is given by \[
\hat{\beta}_2 = (X_2'M_{1}X_2)^{-1}(X_2'M_{1}Y).
\]
On the other hand, OLS in the full regression gives: \[
Y = P_X Y+M_X Y = X_1\hat{\beta}_1+X_2\hat{\beta}_2+M_X Y.
\]
Premultiplying by \(X_2' M_{1}\) we obtain: \[
X_2' M_{1}Y = X_2' M_{1}X_2\hat{\beta}_2,
\] where we have used that \(X_2'M_{1}X_1 = 0\) and \(X_2'M_{1}M_{X} = X_2' M_X = 0\). Solving for \(\hat{\beta}_2\) we obtain the same estimate as before.
Note that we have used the fact that \(M_1M_X = M_X\). This follows from the fact that \(M_1\) is a projection onto the orthogonal space to \(X_1\), while \(M_X\) is a projection onto the orthogonal space to both \(X_1\) and \(X_2\). Hence, \(M_X\) projects into a smaller space than \(M_1\), and applying \(M_1\) after \(M_X\) does not change anything.
To prove the second part of the theorem, we premultiply the fitted decomposition \(Y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + M_XY\) by \(M_{1}\) to obtain \(M_{1}Y = M_{1}X_2\hat{\beta}_2 + M_X Y\), where we use the facts that \(M_{1}X_1 = 0\) and \(M_{1}M_X = M_X\). Hence, the residuals are \(M_X Y\) in both regressions, where we have already shown that the \(\hat{\beta}_2\) estimates are the same. \(\square\)
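A minimal numerical check of Theorem 4.1, again assuming a simple simulated data-generating process: partialling \(X_1\) out of both \(Y\) and \(X_2\) reproduces both the estimate of \(\beta_2\) and the residuals of the full regression.

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N = 100
X1 = [ones(N) rand(Normal(0, 1), N)]
X2 = 0.5 * X1[:, 2] + rand(Normal(0, 1), N)
Y = X1 * [1.0, 2.0] + 3.0 * X2 + rand(Normal(0, 1), N)

M1 = I - X1 * ((X1'X1) \ X1')  # residual maker of X1

# Full regression
X = [X1 X2]
β_full = (X'X) \ (X'Y)
resid_full = Y - X * β_full

# FWL regression: partial X1 out of both Y and X2
Y_p, X2_p = M1 * Y, M1 * X2
β2_fwl = (X2_p'X2_p) \ (X2_p'Y_p)
resid_fwl = Y_p - X2_p * β2_fwl

display(["β2 (full): $(β_full[3])";
         "β2 (FWL): $(β2_fwl)";
         "Max. difference in residuals: $(maximum(abs.(resid_full - resid_fwl)))"])
```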
4.4 Applications of the FWL Theorem: Demeaning, Detrending, and Deseasonalizing
The FWL theorem has several applications in practice. We discuss three of the most used ones: demeaning, detrending, and deseasonalizing data. They are all special cases of the FWL theorem and sometimes are used without knowing it.
4.4.1 Demeaning Data
Suppose the regression includes a constant, where \(\iota\) denotes a vector of ones: \[
Y = \iota\beta_0 + X\beta_1 + U.
\]
The FWL theorem shows that the estimator for \(\beta_1\) is the same if we instead run the regression: \[
M_{\iota}Y = M_{\iota}X\beta_1 + U.
\]
Hence, we obtain the same estimates by demeaning \(Y\) and \(X\) before running the regression or by including a constant in the regression, see Exercise 4.2.
Example 4.3 We illustrate the FWL theorem by comparing the estimates obtained from a regression that includes a constant with those obtained from a regression using demeaned data. We consider the estimation of a quadratic trend for temperature data, see Example 3.1.
The code to obtain the estimates is available in the snippet below.
```julia
using DataFrames, CSV, Downloads, Statistics

url = "https://raw.githubusercontent.com/everval/Global-Temperature-Anomalies/refs/heads/main/data/HadCRUT5_global_monthly_average.csv"  # link to HadCRUT5 dataset
temp_data = CSV.read(Downloads.download(url), DataFrame)  # read data into DataFrame
T = nrow(temp_data)

## Regression with constant
X = [ones(T) 1:T (1:T).^2]  # design matrix with intercept, time, and time^2
Y = temp_data.Temp          # dependent variable

# OLS estimator
β̂ = (X'X) \ (X'Y)

## Regression with demeaned data
mean_temp = mean(temp_data.Temp)          # mean temperature
Y_demeaned = temp_data.Temp .- mean_temp  # demean temperature
X = [1:T (1:T).^2]                        # design matrix with time and time^2
X_demeaned = zeros(T, 2)
X_demeaned[:, 1] = X[:, 1] .- mean(X[:, 1])  # demean regressors
X_demeaned[:, 2] = X[:, 2] .- mean(X[:, 2])

# OLS estimator
β̂_demeaned = (X_demeaned'X_demeaned) \ (X_demeaned'Y_demeaned)

display(["Estimate of β1 (with constant): $(β̂[2])";
         "Estimate of β1 (demeaned): $(β̂_demeaned[1])"])
```
2-element Vector{String}:
"Estimate of β1 (with constant): -0.0004936316112256622"
"Estimate of β1 (demeaned): -0.0004936316112256785"
In the output, we can see that the estimates of \(\beta_1\) are the same whether we include a constant in the regression or use demeaned data, illustrating the FWL theorem.
4.4.2 Detrending Data
Many commonly used variables contain time trends, so we may consider a regression like: \[
Y = \alpha_0 \iota + \alpha_1 t + X\beta + U,
\] where \(t' = [1,2,\cdots,N]\) is a vector of time periods to capture the time trend.
The FWL theorem shows that we obtain the same estimates if we instead run the regression using detrended data.
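A minimal sketch of this equivalence, assuming a simulated trending data-generating process: the estimate of \(\beta\) from the regression with a constant and a linear trend coincides with the one from regressing detrended \(Y\) on detrended \(X\).

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N = 200
t = collect(1.0:N)
D = [ones(N) t]                          # deterministic terms: constant and trend
X = 0.05 * t + rand(Normal(0, 1), N)     # trending regressor
Y = 1.0 .+ 0.02 * t + 2.0 * X + rand(Normal(0, 1), N)

# Regression including the deterministic terms
W = [D X]
β_full = (W'W) \ (W'Y)

# Regression on detrended data
MD = I - D * ((D'D) \ D')                # residual maker of [ι t]
Y_dt, X_dt = MD * Y, MD * X
β_detrended = (X_dt'X_dt) \ (X_dt'Y_dt)

display(["Estimate of β (with trend): $(β_full[3])";
         "Estimate of β (detrended): $(β_detrended)"])
```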
4.4.3 Deseasonalizing Data
Furthermore, some variables may show seasonal behavior. We can model seasonality using seasonal dummy variables: \[
Y = \alpha_1 s_1 + \alpha_2 s_2 + \alpha_3 s_3 + \alpha_4 s_4 + X\beta + U,
\] where \(s_i\) are the seasonal dummy variables. In the equation above, we assume quarterly data, but the same idea applies to monthly or weekly data as well, just by changing the number of seasonal dummies.
The FWL theorem tells us that we can estimate \(\beta\) using deseasonalized data, as long as the seasonality is captured by linear methods, such as seasonal dummies.
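A minimal sketch with simulated quarterly data (the data-generating process is an assumption made purely for illustration): the estimate of \(\beta\) from the regression with seasonal dummies coincides with the one from regressing deseasonalized \(Y\) on deseasonalized \(X\).

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N = 200
season = [mod1(i, 4) for i in 1:N]                       # quarter of each observation
S = [season[i] == j ? 1.0 : 0.0 for i in 1:N, j in 1:4]  # seasonal dummies
X = rand(Normal(0, 1), N)
Y = S * [1.0, 2.0, 3.0, 4.0] + 2.0 * X + rand(Normal(0, 1), N)

# Regression including the seasonal dummies
W = [S X]
β_full = (W'W) \ (W'Y)

# Regression on deseasonalized data
MS = I - S * ((S'S) \ S')                # residual maker of the dummies
Y_ds, X_ds = MS * Y, MS * X
β_deseasonalized = (X_ds'X_ds) \ (X_ds'Y_ds)

display(["Estimate of β (with dummies): $(β_full[5])";
         "Estimate of β (deseasonalized): $(β_deseasonalized)"])
```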
4.5 Application of the FWL Theorem: Goodness of Fit
Another common application of the FWL theorem is in the definition of the goodness of fit of a regression. Similar to the examples above, the FWL theorem is sometimes used without explicitly mentioning it.
4.5.1 Uncentered \(R^2\)
We start by defining the uncentered coefficient of determination, or plain \(R^2\).
Recalling that \(P_XY\) is what \(X\) can explain from \(Y\) motivates the following definition.
Definition 4.1 ((Uncentered) \(R^2\)) The coefficient of determination or (uncentered) \(R^2\) is defined as \[
R^2 = \frac{||P_XY||^2}{||Y||^2}.
\]
The (uncentered) \(R^2\) has some useful properties:
By Equation 4.4, \(0 \leq R^2 \leq 1\): a value of 0 means that \(X\) explains nothing of \(Y\), while a value of 1 means that \(X\) explains \(Y\) perfectly.
It is invariant to (nonsingular) linear transformations of \(X\) and to changes in the scale of \(Y\), see Exercise 4.3.
Nonetheless, the (uncentered) \(R^2\) is not invariant to translations.
To see the latter, consider \(\tilde{Y} := Y + \alpha \iota\) in a regression where \(X\) includes a constant, then, \[
P_X(Y + \alpha \iota) = P_XY + \alpha P_X\iota = P_XY + \alpha \iota,
\] where we use the fact that \(P_X\iota = \iota\) if \(X\) includes a constant.
Replacing \(Y\) by \(\tilde{Y}\) in the definition of \(R^2\) we obtain: \[
R^2(\tilde{Y}) = \frac{||P_X\tilde{Y}||^2}{||\tilde{Y}||^2} = \frac{||P_XY + \alpha\iota||^2}{||Y + \alpha\iota||^2},
\] which depends on \(\alpha\). By making \(\alpha\) very large (in absolute value), we can make \(R^2\) as close to 1 as we want without changing the relationship between \(Y\) and \(X\).
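A minimal sketch of this translation problem, assuming simulated data: as \(\alpha\) grows, the uncentered \(R^2\) approaches 1 even though the relationship between \(Y\) and \(X\) is unchanged.

```julia
using Random, Distributions, LinearAlgebra

Random.seed!(123)
N = 100
X = [ones(N) rand(Normal(0, 1), N)]   # regressors including a constant
Y = X * [1.0, 2.0] + rand(Normal(0, 1), N)
P = X * ((X'X) \ X')                  # projection matrix

# Uncentered R² as a function of a translation α of Y
R2_uncentered(α) = sum(abs2, P * (Y .+ α)) / sum(abs2, Y .+ α)

display(["α = 0:    R² = $(R2_uncentered(0.0))";
         "α = 10:   R² = $(R2_uncentered(10.0))";
         "α = 1000: R² = $(R2_uncentered(1000.0))"])
```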
4.5.2 Centered \(R^2\)
To avoid the translation problem of the (uncentered) \(R^2\), we can use the FWL theorem to demean \(Y\) and \(X\) before calculating the \(R^2\) (for regressions that include a constant). The FWL theorem tells us that the estimates and residuals do not change if we demean \(Y\) and \(X\) before running the regression.
This gives rise to the (centered) \(R^2\) defined as follows.
Definition 4.2 ((Centered) \(R^2\)) The (centered) coefficient of determination or (centered) \(R^2\) is defined as \[
R^2 = \frac{||P_XM_\iota Y||^2}{||M_\iota Y||^2}.
\]
The centered \(R^2\) has the same properties as the uncentered \(R^2\), but it is also invariant to translations of \(Y\) given that any translation is removed by demeaning.
Note on R-squared
From now on, when we refer to \(R^2\) without qualification, we mean the centered \(R^2\).
4.5.3 Adjusted \(R^2\)
As shown below, another possible issue with the \(R^2\) is that it never decreases when we add more regressors to the specification, regardless of whether the new regressors are relevant or not.
Proposition 4.1 The \(R^2\) is non-decreasing in the number of regressors.
The intuition behind this result is that adding more regressors can only enlarge the space spanned by the columns of \(X\), and hence the length of the projection of \(Y\) onto that space cannot decrease. This can be a problem when comparing models with different numbers of regressors.
Hence, another common variant of the \(R^2\) is the adjusted \(R^2\), which penalizes the inclusion of additional regressors. It is defined as follows.
Definition 4.3 (Adjusted \(R^2\)) The adjusted coefficient of determination or adjusted \(R^2\) is defined as \[
\bar{R}^2 = 1 - (1 - R^2)\frac{N - 1}{N - K},
\] where \(R^2\) is the (centered) coefficient of determination, \(N\) is the sample size, and \(K\) is the number of regressors.
The adjusted \(R^2\) can decrease when adding more regressors, and it can be negative. A negative adjusted \(R^2\) indicates that the model is worse than a model that only includes a constant.
Given this last property, adjusted \(R^2\) is better suited for comparing models with different numbers of regressors.
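A minimal sketch, assuming simulated data and a small helper function (both introduced only for illustration), that computes the centered and adjusted \(R^2\) with and without an irrelevant regressor \(Z\): the plain \(R^2\) cannot decrease when \(Z\) is added, while the adjusted \(R^2\) penalizes the extra regressor and will often decrease.

```julia
using Random, Distributions, Statistics, LinearAlgebra

Random.seed!(123)
N = 100
X = [ones(N) rand(Normal(0, 1), N)]       # constant and one relevant regressor
Y = X * [1.0, 2.0] + rand(Normal(0, 1), N)
Z = rand(Normal(0, 1), N)                 # irrelevant regressor

# Centered and adjusted R² for a design matrix W that includes a constant
function r2s(W, y)
    β = (W'W) \ (W'y)
    resid = y - W * β
    R2 = 1 - sum(abs2, resid) / sum(abs2, y .- mean(y))
    n, k = size(W)
    return R2, 1 - (1 - R2) * (n - 1) / (n - k)
end

R2_small, R2adj_small = r2s(X, Y)
R2_big, R2adj_big = r2s([X Z], Y)

display(["Without Z: R² = $(R2_small), adjusted R² = $(R2adj_small)";
         "With Z:    R² = $(R2_big), adjusted R² = $(R2adj_big)"])
```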
4.6 Notes on the Precision of the OLS Estimator
Next, we discuss some factors that affect the precision of the OLS estimator.
As noted earlier, the variance of the OLS estimator is given by: \[
Var(\hat{\beta}) = \sigma^2 (X'X)^{-1},
\tag{4.7}\] where \(\sigma^2\) is the variance of the error term.
It can be shown that the variance of the OLS estimator depends on three factors, described next.
Lemma 4.2 Under the standard OLS assumptions, the variance of the OLS estimator is affected by:
The variance of the error term, \(\sigma^2\).
The sample size, \(N\).
The relationship between the regressors, \(X\).
Proof.
The dependence on the variance of the error term, \(\sigma^2\), is straightforward, as it is a multiplicative factor in Equation 4.7. The larger the variance of the error term, the larger the variance of the OLS estimator.
The dependence on the sample size can be seen if we write, \[
Var(\hat{\beta}) = \sigma^2(X'X)^{-1} = \left(\frac{1}{n}\sigma^2\right)\left(\frac{1}{n}X'X\right)^{-1},
\] and assuming, as before, that \(plim_{n\to\infty} \frac{1}{n}X'X = S_{XX}\). Then, as \(n\) increases, \(\frac{1}{n}\sigma^2\) decreases while \(\frac{1}{n}X'X\) converges to \(S_{XX}\). Hence, the variance of the OLS estimator decreases.
The dependence on the relationship between the regressors is more subtle and requires the use of the FWL theorem.
Consider the regression \[
Y = X_1\beta_1+X_2\beta_2+U,
\] where \(X=[X_1,\ X_2]\), and \(X_2\) is a column vector.
From the FWL theorem, \(\hat{\beta}_2\) can be estimated from \[M_{1}Y = M_{1}X_2\beta_2+V,\] so that its variance is \[Var(\hat{\beta}_2) = \frac{\sigma^2}{X_2'M_{1}X_2}.\]
Looking at the denominator, if \(X_1\) and \(X_2\) are orthogonal, then the variance of the OLS estimator of \(\beta_2\) is minimized since \(X_2'M_{1}X_2\) is maximized. On the other hand, if \(X_1\) and \(X_2\) are highly correlated, then the variance of the OLS estimator of \(\beta_2\) is increased since \(X_2'M_{1}X_2\) is decreased. This is the numerical phenomenon known as multicollinearity. \(\square\)
Example 4.4 The following example illustrates the effect of multicollinearity on the precision of the OLS estimator. We generate data from a regression model where two regressors are either uncorrelated or correlated. We then estimate the model using OLS and compare the distribution of the estimates of one of the coefficients in both cases.
The code to generate the data and plot the distributions is available in the snippet below.
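A minimal sketch of such a simulation is given below; the correlation between the regressors (0.95), the number of replications, and the use of Plots.jl for the histogram and the overlaid normal density are assumptions made for illustration, not necessarily the exact settings behind the figure described next.

```julia
using Random, Distributions, Statistics, Plots

Random.seed!(123)
N = 100       # sample size
R = 5_000     # number of Monte Carlo replications
β_uncorr = zeros(R)
β_corr = zeros(R)

for r in 1:R
    # Uncorrelated regressors
    X1 = rand(Normal(0, 1), N)
    X2 = rand(Normal(0, 1), N)
    Y = X1 + X2 + rand(Normal(0, 1), N)
    X = [X1 X2]
    β_uncorr[r] = ((X'X) \ (X'Y))[1]

    # Correlated regressors: X2c shares most of its variation with X1
    X2c = 0.95 * X1 + sqrt(1 - 0.95^2) * rand(Normal(0, 1), N)
    Yc = X1 + X2c + rand(Normal(0, 1), N)
    Xc = [X1 X2c]
    β_corr[r] = ((Xc'Xc) \ (Xc'Yc))[1]
end

# Distributions of the estimates of β1 in both designs
histogram(β_uncorr, normalize = :pdf, alpha = 0.5, label = "Uncorrelated regressors")
histogram!(β_corr, normalize = :pdf, alpha = 0.5, label = "Correlated regressors")

# Asymptotic density of the estimator under uncorrelated regressors
xs = LinRange(minimum(β_corr), maximum(β_corr), 200)
plot!(xs, pdf.(Normal(1, 1 / sqrt(N)), xs),
      color = :red, linestyle = :dash, linewidth = 2, label = "Normal(1, 1/√N)")
```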
The histogram shows the distribution of the estimates of \(\beta_1\) when \(X_1\) and \(X_2\) are uncorrelated (in blue) and when they are correlated (in orange). The red dashed line is the normal density with mean 1 and standard deviation \(1/\sqrt{N}\), which is the asymptotic distribution of the OLS estimator when the regressors are uncorrelated.
Note that the distribution of the estimates is much wider when the regressors are correlated, indicating that the precision of the OLS estimator is lower in this case. This illustrates the effect of multicollinearity on the precision of the OLS estimator.
The multicollinearity problem is not solved by increasing the sample size. Hence, it is important to consider the correlation structure of the regressors when designing a regression model.
To diagnose multicollinearity, it is common to look at the variance inflation factors of the regressors.
Definition 4.4 The variance inflation factor (VIF) for regressor \(X_j\) is defined as: \[
VIF_j = \frac{1}{1 - R_j^2},
\] where \(R_j^2\) is the (centered) \(R^2\) obtained by regressing \(X_j\) on all the other regressors.
High VIFs indicate that the regressors are highly correlated and that multicollinearity may be a problem. A common rule of thumb is that a VIF above 10 indicates a multicollinearity problem, although this threshold is somewhat arbitrary and depends on the context. In particular, VIFs are not a formal statistical test for multicollinearity.
The VIF can be interpreted as the factor by which the variance of the OLS estimator of \(\beta_j\) is increased due to the correlation between \(X_j\) and the other regressors. That is, the variance of the OLS estimator of \(\beta_j\) can be written as: \[
Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)Var(X_j)}VIF_j,
\tag{4.8}\]
where \(Var(X_j)\) is the sample variance of \(X_j\). The reader is asked to show Equation 4.8 in Exercise 4.5.
Note that the VIF for regressor \(X_j\) depends on all the other regressors included in the regression, \(X_{-j}\). Different sets of regressors may lead to different VIFs for the same regressor. Hence, it is important to consider the VIFs of all the regressors when diagnosing multicollinearity.
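A minimal sketch of how the VIFs can be computed directly from Definition 4.4, assuming simulated data and that the first column of the design matrix is the constant (the helper functions are introduced only for illustration):

```julia
using Random, Distributions, Statistics, LinearAlgebra

# Centered R² from regressing y on W (W should include a constant)
function centered_r2(W, y)
    β = (W'W) \ (W'y)
    resid = y - W * β
    return 1 - sum(abs2, resid) / sum(abs2, y .- mean(y))
end

# VIF of each non-constant regressor in X (first column assumed to be the constant)
function vifs(X)
    K = size(X, 2)
    return [1 / (1 - centered_r2(X[:, setdiff(1:K, j)], X[:, j])) for j in 2:K]
end

# Simulated example: X3 is strongly correlated with X2
Random.seed!(123)
N = 100
X2 = rand(Normal(0, 1), N)
X3 = 0.95 * X2 + 0.1 * rand(Normal(0, 1), N)
X = [ones(N) X2 X3]
display(vifs(X))
```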
4.7 Exercises
Exercise 4.1 Show that \(P_X\) and \(M_X\) are complementary projection matrices. That is, show:
\(P_X = P_X'\),
\(M_X = M_X'\),
\(P_X^2 = P_X\),
\(M_X^2 = M_X\),
\(P_XM_X = 0\),
\(P_X + M_X = I\),
\(P_XX = X\), and
\(M_XX = 0\).
Hint: Use the fact that \((A')^{-1} = (A^{-1})'\).
Exercise 4.2 Let \(\iota\) be a vector of ones and \(M_{\iota} = I - P_{\iota}\) be the maker of residuals when the regressors include only a constant.
Show that \(M_{\iota}Y = Y - \bar{Y}\), where \(\bar{Y} = \frac{1}{N}\sum_{t=1}^N Y_t\).
Exercise 4.3 Let \(A\) be a nonsingular matrix. Show that the (uncentered) \(R^2\) is invariant to replacing \(X\) by \(XA\).
Furthermore, show that the (uncentered) \(R^2\) is invariant to replacing \(Y\) by \(\alpha Y\), where \(\alpha \neq 0\).
Exercise 4.4 You are going to show that the \(R^2\) is non-decreasing in the number of regressors. Consider the regressions given by: \[
Y = X\beta + U,
\] and \[
Y = X\beta + Z\gamma + V.
\]
Show that the \(R^2\) from the second regression is greater than or equal to the \(R^2\) from the first regression. That is, show that: \[
\frac{||P_{[X\ Z]}Y||^2}{||Y||^2} \geq \frac{||P_XY||^2}{||Y||^2}.
\]
Exercise 4.5 Show that the variance of the OLS estimator of \(\beta_j\) can be written as in Equation 4.8, that is, \[
Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)Var(X_j)}VIF_j,
\]
where \(Var(X_j)\) is the sample variance of \(X_j\).
4.8 Solution to Selected Exercises
Solution 4.1 (Solution to Exercise 4.3.). Note that if \(X\) is replaced by \(XA\) then, \[
\begin{align*}
P_{XA}Y &= XA((XA)'XA)^{-1}(XA)'Y \\
&= XA(A'X'XA)^{-1}A'X'Y = XA(A)^{-1}(X'X)^{-1}(A')^{-1}A'X'Y \\
& = P_XY,
\end{align*}
\] where we use the fact that \(A\) is invertible and that \((A')^{-1} = (A^{-1})'\). Replacing this in the definition of \(R^2\) shows the invariance to linear transformations of \(X\).
Furthermore, if \(Y\) is replaced by \(\alpha Y\) then, \[
P_X(\alpha Y) = \alpha P_XY, \quad ||\alpha Y||^2 = \alpha^2 ||Y||^2,
\] so the factors of \(\alpha^2\) cancel in the ratio and the (uncentered) \(R^2\) is unchanged.
Solution 4.2 (Solution to Exercise 4.4.). Hint: Show that the difference between the two \(R^2\) can be written as \(Y' (P_{[X\ Z]} - P_X) Y / ||Y||^2\) and use the properties of projection matrices to show that \(P_{[X\ Z]} - P_X\) is itself symmetric and idempotent, so that the quadratic form is non-negative.
Solution 4.3 (Solution to Exercise 4.5.). Note that \[
\begin{align*}
1-R^2_{X_j|X_{-j}} &= 1-\frac{||P_{-j}X_j||^2}{||X_j||^2} = 1-\frac{X_j'P_{-j}X_j}{X_j'X_j} \\
&= \frac{X_j'X_j-X_j'P_{-j}X_j}{X_j'X_j} =\frac{X_j'M_{-j}X_j}{X_j'X_j},
\end{align*}
\] where \(P_{-j}\) is the projection matrix onto the space spanned by all the regressors except \(X_j\), \(M_{-j} = I - P_{-j}\), and we have used the properties of projection matrices.
Hence, \[X_j'M_{-j}X_j = (X_j'X_j)(1-R^2_{X_j|X_{-j}})=(n-1)Var(X_j)(1-R^2_{X_j|X_{-j}}),\] where we have used that \(Var(X_j)=\frac{1}{n-1}X_j'X_j\). Combining this with the FWL expression for the variance, \(Var(\hat{\beta}_j) = \sigma^2(X_j'M_{-j}X_j)^{-1}\), gives \[Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)Var(X_j)(1-R^2_{X_j|X_{-j}})} = \frac{\sigma^2}{(n-1)Var(X_j)}VIF_j.\]