This chapter discusses some key properties of the Ordinary Least Squares (OLS) estimator. We will focus on the properties that are most relevant for understanding the behavior of the OLS estimator in practice. Furthermore, this chapter introduces one of the most important theorems in Econometrics, the Frisch-Waugh-Lovell theorem, which provides a useful way to employ OLS in practice.
4.2 Orthogonal Projections
The OLS estimator can be interpreted in terms of orthogonal projections. OLS decomposes the dependent variable \(Y\) into two components: one that is explained by the regressors \(X\) and another that is not. We can write this decomposition as:
\[
Y = P_XY + (I - P_X)Y = P_X Y + M_X Y = X\hat{\beta} + \hat{U},
\tag{4.1}\]
where \(I\) is the identity matrix, \(P_XY = X\hat{\beta}\) is the part of \(Y\) that is explained by \(X\), and \((I - P_X)Y = \hat{U}\) is the part of \(Y\) that is not explained by \(X\).
Above, the matrix \(P_X\) is given by: \[
P_X = X(X'X)^{-1}X',
\tag{4.2}\]
which is called the projection matrix onto the column space of \(X\). The matrix \(P_X\) projects \(Y\) into the space spanned by the columns of \(X\).
Furthermore, we define the matrix \(M_X\), the so-called maker of residuals, as:
\[
M_X := (I - P_X),
\tag{4.3}\]
which is a projection matrix onto the orthogonal space to the columns of \(X\).
4.2.1 Properties of Projection Matrices
As the descriptions above suggest, \(P_X\) and \(M_X\) are projection matrices. They are symmetric and idempotent, as the reader is asked to verify in Exercise 4.1.
Moreover, they are complementary in the sense that \(P_XM_X = 0\) and \(P_X + M_X = I\). Hence, they are orthogonal projections.
Finally, note that \(P_XX = X\) and \(M_XX = 0\). The last two properties are intuitive, as \(P_X\) projects into the space spanned by the columns of \(X\), while \(M_X\) projects into the orthogonal space to the columns of \(X\). We can explain \(X\) perfectly with itself, and there is no part of \(X\) that is orthogonal to itself.
Exercise 4.1 Show that \(P_X\) and \(M_X\) are complementary projection matrices. That is, show:
\(P_X = P_X'\),
\(M_X = M_X'\),
\(P_X^2 = P_X\),
\(M_X^2 = M_X\),
\(P_XM_X = 0\),
\(P_X + M_X = I\),
\(P_XX = X\), and
\(M_XX = 0\).
Hint: Use the fact that \((A')^{-1} = (A^{-1})'\).
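The properties above can also be checked numerically. The following is a minimal sketch in Python using numpy with a made-up design matrix; it is a sanity check, not a substitute for the proof asked for in Exercise 4.1.
```python
# Numerical sanity check of the projection-matrix properties in Exercise 4.1.
# The design matrix X is made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(N) - P                      # maker of residuals

print(np.allclose(P, P.T), np.allclose(M, M.T))               # symmetry
print(np.allclose(P @ P, P), np.allclose(M @ M, M))           # idempotency
print(np.allclose(P @ M, 0), np.allclose(P + M, np.eye(N)))   # complementarity
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))           # P_X X = X, M_X X = 0
```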
4.2.2 Consequences of the Decomposition
From Equation 4.1 note that the \(P_XY\) and \(M_XY\) projections can be represented by a right-angled triangle, where the hypotenuse is \(Y\), the projection of \(Y\) onto \(X\) is \(P_XY\), and the projection of \(Y\) onto the orthogonal space to \(X\) is \(M_XY\).
By Pythagoras’ Theorem, a direct consequence of this decomposition is that the length of the explained part is smaller than or equal to the length of the dependent variable:
\[
||P_XY||^2 \leq ||Y||^2.
\tag{4.4}\]
The inequality above is the basis for the coefficient of determination, which we will discuss later.
Moreover, by construction, since the OLS coefficients minimise Equation 3.5, the length of the residual vector is smaller than or equal to the length of the error term: \[
||\hat{U}||^2 \leq ||U||^2.
\tag{4.5}\]
This difference is further explored when finding an unbiased estimator for the variance of the error term.
A final property of the projections is that the residuals are orthogonal to the regressors:
\[
X'\hat{U} = X'M_XY = 0,
\tag{4.6}\]
where the last equality follows from the fact that \(X'M_X = 0\).
The orthogonality of the residuals to the regressors is a key property of the OLS estimator and does not depend on the true error term being orthogonal to the regressors. That is, Equation 4.6 holds regardless of whether the regressors are exogenous or not.
A direct consequence of this is that if the regressors include a constant, then the residuals sum to zero, \(\sum_{t=1}^N \hat{U}_t = 0\). Hence, the residuals have zero sample mean, regardless of the expected value of the error term.
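The following short sketch (Python/numpy, with simulated data chosen for illustration) shows that \(X'\hat{U} = 0\) and \(\sum_t \hat{U}_t = 0\) hold even when the error term has a non-zero mean:
```python
# Illustration: X'U_hat = 0 and the residuals sum to zero whenever X includes
# a constant, even though the error term below has mean 5 rather than 0.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
U = 5.0 + rng.normal(size=N)                 # error with non-zero mean
Y = X @ np.array([1.0, 2.0, -1.0]) + U

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
U_hat = Y - X @ beta_hat

print(np.allclose(X.T @ U_hat, 0))           # orthogonality, Equation 4.6
print(np.isclose(U_hat.sum(), 0))            # residuals sum to zero
print(U.mean())                              # but the true errors do not
```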
4.3 The Frisch-Waugh-Lovell Theorem
The Frisch-Waugh-Lovell (FWL) theorem provides a useful way to understand the OLS estimator in the presence of multiple regressors. The theorem states that the OLS estimator of a subset of the regressors can be obtained in two steps: first, regress the dependent variable and that subset of regressors on the remaining regressors; then, regress the residuals of the dependent variable on the residuals of the subset of regressors. It is a powerful result that allows us to understand the OLS estimator in terms of partial correlations.
4.3.1 Orthogonal Regressors
We are interested in analyzing the effect that partitioning the regressors has on the estimators. Assume we partition the regressors into two groups, \(X = [X_1\ X_2]\), so that \[
Y = X_1\beta_1 + X_2\beta_2 + U.
\]
In general, the OLS estimator of \(\beta_1\) depends on \(X_2\). In fact, the OLS estimator of \(\beta_1\) when \(X_2\) is not included in the regression may be biased and inconsistent if \(X_1\) and \(X_2\) are correlated; see XXXX.
Nevertheless, we can show that in the special case in which \(X_1\) is orthogonal to \(X_2\), we obtain the same OLS estimate for \(\beta_1\) from the complete specification as from the reduced specification.
Lemma 4.1 Let \(X = [X_1\ X_2]\) be a partition of the regressors such that \(X_1'X_2 = 0\); that is, \(X_1\) is orthogonal to \(X_2\). Then, the OLS estimator of \(\beta_1\) in the regression: \[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] is identical to the OLS estimator of \(\beta_1\) in the regression: \[
Y = X_1\beta_1 + V.
\]
Proof. Since \(X_1'X_2 = 0\), the matrix \(X'X\) is block diagonal, \[
X'X = \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}.
\] Using the formula for the inverse of a block diagonal matrix we have that, \[
(X'X)^{-1} = \begin{bmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{bmatrix}.
\] Hence, \[
\hat{\beta} = (X'X)^{-1}X'Y = \begin{bmatrix} (X_1'X_1)^{-1}X_1'Y \\ (X_2'X_2)^{-1}X_2'Y \end{bmatrix} = \begin{bmatrix} \hat{\beta}_1^{(r)} \\ \hat{\beta}_2^{(r)} \end{bmatrix},
\] where \(\hat{\beta}_1^{(r)}\) is the OLS estimator of \(\beta_1\) in the regression \(Y = X_1\beta_1 + V\). \(\square\)
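A quick numerical illustration of Lemma 4.1 (a sketch with simulated data; \(X_2\) is constructed to be exactly orthogonal to \(X_1\)):
```python
# Illustration of Lemma 4.1: when X1'X2 = 0, the estimate of beta_1 from the
# full regression equals the one from regressing Y on X1 alone.
import numpy as np

rng = np.random.default_rng(2)
N = 100
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = rng.normal(size=(N, 1))
M1 = np.eye(N) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
X2 = M1 @ Z                                        # X1'X2 = 0 by construction
Y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * 3.0 + rng.normal(size=N)

X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)      # [beta1_hat, beta2_hat]
beta_short = np.linalg.solve(X1.T @ X1, X1.T @ Y)  # Y on X1 only

print(np.allclose(beta_full[:2], beta_short))      # identical beta_1 estimates
```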
Corollary 4.1 In the same conditions as in Lemma 4.1, the OLS estimator of \(\beta_2\) in the regression:
\[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] is identical to the OLS estimator of \(\beta_2\) in the regression:
\[
Y = X_2\beta_2 + V.
\]
4.3.2 Partialling Out Regressors
As a consequence of Lemma 4.1, if we could somehow remove the part of \(X_2\) that is correlated with \(X_1\), then we could estimate \(\beta_2\) without including \(X_1\) in the regression. This is accomplished by partialling out \(X_1\) from \(X_2\) using the projection matrix \(M_{1} = M_{X_1} = I - P_{X_1}\).
Hence, the estimator of \(\beta_2\) from the regression:
\[
Y = M_{1}X_2\beta_2 + U,
\]
is numerically identical to the estimator of \(\beta_2\) from the complete regression.
Nonetheless, the residuals from this regression are not the same as the residuals from the full regression \(Y = X_1\beta_1 + X_2\beta_2 + U\). The residuals from the full regression are orthogonal to both \(X_1\) and \(X_2\), whereas the residuals from the regression of \(Y\) on \(M_1X_2\) still contain the part of \(Y\) that is explained by \(X_1\).
4.3.3 The FWL Theorem
To recover the same residuals as in the full regression, we need to partial out \(X_1\) from \(Y\) as well. This gives rise to the Frisch-Waugh-Lovell theorem.
Theorem 4.1 (Frisch-Waugh-Lovell) The OLS estimates of \(\beta_2\) in the regressions \[
Y = X_1\beta_1 + X_2\beta_2 + U,
\] and \[
M_{1}Y = M_{1}X_2\beta_2 + U,
\] are numerically identical.
Moreover, the residuals in both regressions are numerically identical.
Proof. The estimate of \(\beta_2\) in \(M_{1}Y = M_{1}X_2\beta_2 + U\) is given by \[
\hat{\beta}_2 = (X_2'M_{1}X_2)^{-1}(X_2'M_{1}Y).
\]
On the other hand, OLS in the full regression gives: \[
Y = P_X Y+M_X Y = X_1\hat{\beta}_1+X_2\hat{\beta}_2+M_X Y.
\]
Premultiplying by \(X_2' M_{1}\) we obtain: \[
X_2' M_{1}Y = X_2' M_{1}X_2\hat{\beta}_2,
\] where we use that \(M_{1}X_1 = 0\) and \(X_2'M_{1}M_{X} = X_2' M_X = 0\). Solving for \(\hat{\beta}_2\) we obtain the same estimate as before.
Note that we have used the fact that \(M_1M_X = M_X\). This follows from the fact that \(M_1\) is a projection onto the orthogonal space to \(X_1\), while \(M_X\) is a projection onto the orthogonal space to both \(X_1\) and \(X_2\). Hence, \(M_X\) projects into a smaller space than \(M_1\), and applying \(M_1\) after \(M_X\) does not change anything.
To prove the second part of the theorem, we premultiply the fitted decomposition of the complete regression by \(M_1\) to obtain \(M_{1}Y = M_{1}X_2\hat{\beta}_2 + M_X Y\), where we use the facts that \(M_1X_1 = 0\) and \(M_{1}M_X = M_X\). Hence, the residuals are \(M_X Y = \hat{U}\) in both regressions, since we have already shown that the \(\hat{\beta}_2\) estimates are the same. \(\square\)
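The theorem can also be checked numerically. The sketch below (Python/numpy, simulated data) compares the estimate of \(\beta_2\) and the residuals from the full regression with those from the partialled-out regression:
```python
# Numerical check of the FWL theorem: the estimate of beta_2 and the residuals
# from the full regression coincide with those from regressing M1*Y on M1*X2.
import numpy as np

rng = np.random.default_rng(3)
N = 100
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
X2 = 0.5 * X1[:, 1:] + rng.normal(size=(N, 1))     # correlated with X1
Y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * 3.0 + rng.normal(size=N)

X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)
resid_full = Y - X @ beta_full

M1 = np.eye(N) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
Y_t, X2_t = M1 @ Y, M1 @ X2                        # partial out X1
beta2_fwl = np.linalg.solve(X2_t.T @ X2_t, X2_t.T @ Y_t)
resid_fwl = Y_t - X2_t @ beta2_fwl

print(np.allclose(beta_full[2:], beta2_fwl))       # same beta_2
print(np.allclose(resid_full, resid_fwl))          # same residuals
```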
4.3.4 Applications of the FWL Theorem: Demeaning, Detrending, and Deseasonalizing
The FWL theorem has several applications in practice. We discuss three of the most used ones: demeaning, detrending, and deseasonalizing data. They are all special cases of the FWL theorem and sometimes are used without knowing it.
4.3.4.1 Demeaning Data
Suppose the regression includes a constant, with \(\iota\) denoting a vector of ones: \[
Y = \iota\beta_0 + X\beta_1 + U.
\]
The FWL theorem shows that the estimator for \(\beta_1\) is the same if we instead run the regression: \[
M_{\iota}Y = M_{\iota}X\beta_1 + U.
\]
Hence, we obtain the same estimates by demeaning \(Y\) and \(X\) before running the regression (see Exercise 4.2) or by including a constant in the regression.
Exercise 4.2 Let \(\iota\) be a vector of ones and \(M_{\iota} = I - P_{\iota}\) be the maker of residuals when the regressors include only a constant.
Show that \(M_{\iota}Y = Y - \bar{Y}\), where \(\bar{Y} = \frac{1}{N}\sum_{t=1}^N Y_t\).
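A numerical illustration of the demeaning equivalence (a sketch with simulated data): the slope from the regression that includes a constant equals the slope from the regression on demeaned data.
```python
# Demeaning as an application of FWL: regressing demeaned Y on demeaned X
# gives the same slope as including a constant in the regression.
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Regression with a constant
X = np.column_stack([np.ones(N), x])
slope_const = np.linalg.solve(X.T @ X, X.T @ y)[1]

# Regression on demeaned data, no constant
xd, yd = x - x.mean(), y - y.mean()
slope_demeaned = (xd @ yd) / (xd @ xd)

print(np.isclose(slope_const, slope_demeaned))
```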
4.3.4.2 Detrending Data
Some variables commonly used contain time trends so that we may consider a regression like: \[
Y = \alpha_0 \iota + \alpha_1 t + X\beta + U,
\] where \(t' = [1,2,\cdots,N]\) is a vector of time periods to capture the time trend.
The FWL theorem shows that we obtain the same estimates if we instead run the regression using detrended data.
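For instance, in the sketch below (simulated data), the coefficient on \(X\) from the regression that includes the constant and the trend coincides with the one obtained by first detrending \(Y\) and \(X\) and then regressing the detrended series on each other:
```python
# Detrending as an application of FWL: same estimate of beta whether we include
# the deterministic terms [iota, t] or first detrend Y and X on them.
import numpy as np

rng = np.random.default_rng(5)
N = 120
t = np.arange(1, N + 1)
x = 0.1 * t + rng.normal(size=N)
y = 1.0 + 0.05 * t + 2.0 * x + rng.normal(size=N)

D = np.column_stack([np.ones(N), t])               # deterministic terms
full = np.column_stack([D, x])
beta_full = np.linalg.lstsq(full, y, rcond=None)[0][2]

# Detrend y and x, then regress the residuals on each other
y_dt = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
x_dt = x - D @ np.linalg.lstsq(D, x, rcond=None)[0]
beta_dt = (x_dt @ y_dt) / (x_dt @ x_dt)

print(np.isclose(beta_full, beta_dt))
```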
4.3.4.3 Deseasonalizing Data
Furthermore, some variables may show a seasonal behavior. We can model seasonality using seasonal dummy variables \[
Y = \alpha_1 s_1 + \alpha_2 s_2 + \alpha_3 s_3 + \alpha_4 s_4 + X\beta + U,
\] where \(s_i\) are the seasonal dummy variables. In the equation above, we assume quarterly data, but the same idea applies to monthly or weekly data as well, just by changing the number of seasonal dummies.
The FWL theorem tells us that we can estimate \(\beta\) using deseasonalized data.
4.3.5 Application of the FWL Theorem: Goodness of Fit
Another common application of the FWL theorem is in the definition of the goodness of fit of a regression. Similar to the examples above, the FWL theorem is sometimes used without explicitly mentioning it.
4.3.5.1 Uncentered \(R^2\)
We start by defining the uncentered coefficient of determination, or plain \(R^2\).
Recalling that \(P_XY\) is what \(X\) can explain from \(Y\) motivates the following definition.
Definition 4.1 ((Uncentered) \(R^2\)) The coefficient of determination or (uncentered) \(R^2\) is defined as \[
R^2 = \frac{||P_XY||^2}{||Y||^2}.
\]
The (uncentered) \(R^2\) has some useful properties:
By Equation 4.4, \(0 \leq R^2 \leq 1\). A value of 0 means that \(X\) explains nothing of \(Y\), while a value of 1 means that \(X\) explains \(Y\) perfectly.
It is invariant to (nonsingular) linear transformations of \(X\) and to changes in the scale of \(Y\) (see Exercise 4.3).
Exercise 4.3 Let \(A\) be a nonsingular matrix. Show that the (uncentered) \(R^2\) is invariant to replacing \(X\) by \(XA\).
Furthermore, show that the (uncentered) \(R^2\) is invariant to replacing \(Y\) by \(\alpha Y\), where \(\alpha \neq 0\).
Nonetheless, the (uncentered) \(R^2\) is not invariant to translations. Consider \(\tilde{Y} := Y + \alpha \iota\) in a regression where \(X\) includes a constant. Then, \[
P_X(Y + \alpha \iota) = P_XY + \alpha P_X\iota = P_XY + \alpha \iota,
\] where we use the fact that \(P_X\iota = \iota\) if \(X\) includes a constant.
Hence, if we replace \(Y\) by \(\tilde{Y}\) in the definition of \(R^2\) we obtain: \[
\frac{||P_X\tilde{Y}||^2}{||\tilde{Y}||^2} = \frac{||P_XY + \alpha\iota||^2}{||Y + \alpha\iota||^2},
\] which depends on \(\alpha\). By making \(\alpha\) very large (in absolute value), we can make \(R^2\) as close to 1 as we want without changing the relationship between \(Y\) and \(X\).
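The sketch below (simulated data) illustrates how a large translation of \(Y\) pushes the uncentered \(R^2\) towards 1 without changing the relationship between \(Y\) and \(X\):
```python
# Illustration: adding a large constant to Y pushes the uncentered R^2 towards 1
# even though the relationship between Y and X is unchanged.
import numpy as np

rng = np.random.default_rng(6)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

P = X @ np.linalg.inv(X.T @ X) @ X.T

def r2_uncentered(y):
    return (P @ y) @ (P @ y) / (y @ y)

print(r2_uncentered(Y))            # uncentered R^2 of the original data
print(r2_uncentered(Y + 1000.0))   # close to 1 after a large translation
```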
4.3.5.2 Centered \(R^2\)
To avoid the translation problem of the (uncentered) \(R^2\), we can use the FWL theorem to demean \(Y\) and \(X\) before calculating the \(R^2\) (for regressions that include a constant). The FWL theorem tells us that the estimates and residuals do not change if we demean \(Y\) and \(X\) before running the regression.
This gives rise to the (centered) \(R^2\) defined as follows.
Definition 4.2 ((Centered) \(R^2\)) The (centered) coefficient of determination or (centered) \(R^2\) is defined as \[
R^2 = \frac{||P_XM_\iota Y||^2}{||M_\iota Y||^2}.
\]
The centered \(R^2\) has the same properties as the uncentered \(R^2\), but it is also invariant to translations of \(Y\) given that any translation is removed by demeaning.
Note on R-squared
From now on, when we refer to \(R^2\) without qualification, we mean the centered \(R^2\).
4.3.5.3 Adjusted \(R^2\)
As shown below, another possible issue with the \(R^2\) is that it always increases when we add more regressors to the specification, regardless of whether the new regressors are relevant or not.
Proposition 4.1 The \(R^2\) is non-decreasing in the number of regressors.
Exercise 4.4 You are going to show that the \(R^2\) is non-decreasing in the number of regressors. Consider the regressions given by: \[
Y = X\beta + U,
\] and \[
Y = X\beta + Z\gamma + V.
\]
Show that the \(R^2\) from the second regression is greater than or equal to the \(R^2\) from the first regression. That is, show that: \[
\frac{||P_{[X\ Z]}Y||^2}{||Y||^2} \geq \frac{||P_XY||^2}{||Y||^2}.
\]
The intuition behind this result is that adding more regressors can only increase the space spanned by the columns of \(X\), and hence, it can only increase the length of the projection of \(Y\) onto that space. This can be a problem when comparing models with different numbers of regressors.
Hence, another common variant of the \(R^2\) is the adjusted \(R^2\), which penalizes the inclusion of additional regressors. It is defined as follows.
Definition 4.3 (Adjusted \(R^2\)) The adjusted coefficient of determination or adjusted \(R^2\) is defined as \[
\bar{R}^2 = 1 - (1 - R^2)\frac{N - 1}{N - K},
\] where \(R^2\) is the (centered) coefficient of determination, \(N\) is the sample size, and \(K\) is the number of regressors.
The adjusted \(R^2\) can decrease when adding more regressors, and it can be negative. A negative adjusted \(R^2\) indicates that the model is worse than a model that only includes a constant.
Given that it penalizes additional regressors, the adjusted \(R^2\) is better suited for comparing models with different numbers of regressors.
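The sketch below (simulated data, with an irrelevant noise regressor \(z\)) illustrates the contrast: the centered \(R^2\) cannot decrease when \(z\) is added, while the adjusted \(R^2\) may.
```python
# Adding an irrelevant regressor: the centered R^2 cannot decrease, while the
# adjusted R^2 can. The extra regressor z is pure noise.
import numpy as np

rng = np.random.default_rng(7)
N = 50
x = rng.normal(size=N)
z = rng.normal(size=N)                      # irrelevant regressor
y = 1.0 + 2.0 * x + rng.normal(size=N)

def r2_centered(y, X):
    # assumes X includes a constant column
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    yd = y - y.mean()
    fit = X @ beta - y.mean()
    return (fit @ fit) / (yd @ yd)

def r2_adjusted(r2, n, k):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

X_small = np.column_stack([np.ones(N), x])
X_big = np.column_stack([np.ones(N), x, z])

r2_s, r2_b = r2_centered(y, X_small), r2_centered(y, X_big)
print(r2_b >= r2_s)                                       # always True
print(r2_adjusted(r2_s, N, 2), r2_adjusted(r2_b, N, 3))   # may decrease
```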
4.4 Notes on the Precision of the OLS Estimator
Next, we discuss some factors that affect the precision of the OLS estimator.
As noted earlier, the variance of the OLS estimator is given by: \[
Var(\hat{\beta}) = \sigma^2 (X'X)^{-1},
\] where \(\sigma^2\) is the variance of the error term.
It can be shown that the variance of the OLS estimator depends on three factors, described next.
Lemma 4.2 Under the standard OLS assumptions, the variance of the OLS estimator is affected by:
The variance of the error term, \(\sigma^2\).
The sample size, \(N\).
The relationship between the regressors, \(X\).
Proof.
The dependence on the variance of the error term, \(\sigma^2\), is straightforward, as it is a multiplicative factor in Equation 3.12. The larger the variance of the error term, the larger the variance of the OLS estimator.
The dependence on the sample size can be seen if we write, \[
Var(\hat{\beta}) = \sigma^2(X'X)^{-1} = \left(\frac{1}{n}\sigma^2\right)\left(\frac{1}{n}X'X\right)^{-1},
\] and assuming, as before, that \(\text{plim}_{n\to\infty} \frac{1}{n}X'X = S_{XX}\), a positive definite matrix. Then, as \(n\) increases, \(\frac{1}{n}\sigma^2\) decreases and \(\left(\frac{1}{n}X'X\right)^{-1}\) converges to \(S_{XX}^{-1}\). Hence, the variance of the OLS estimator decreases.
The dependence on the relationship between the regressors is more subtle and requires the use of the FWL theorem.
Consider the regression \[
Y = X_1\beta_1+X_2\beta_2+U,
\] where \(X=[X_1,\ X_2]\), and \(X_2\) is a column vector.
From the FWL theorem, \(\hat{\beta}_2\) can be estimated from \[M_{1}Y = M_{1}X_2\beta_2+V,\] so that its variance is \[Var(\hat{\beta}_2) = \sigma^2(X_2'M_{1}X_2)^{-1} = \frac{\sigma^2}{X_2'M_{1}X_2}.\]
Looking at the denominator, if \(X_1\) and \(X_2\) are orthogonal, then \(X_2'M_{1}X_2 = X_2'X_2\) is as large as possible, and the variance of the OLS estimator of \(\beta_2\) is minimized. On the other hand, if \(X_1\) and \(X_2\) are highly correlated, then \(X_2'M_{1}X_2\) is small, and the variance of the OLS estimator of \(\beta_2\) is inflated. This is the numerical phenomenon known as multicollinearity. \(\square\)
4.4.1 Example: Effect of correlation between regressors on precision
Multicollinearity is a property of the design matrix, so it is not removed by collecting more observations from the same design: a larger sample still reduces the overall variance of the estimator, but the inflation caused by the correlation between the regressors remains. Hence, it is important to consider the correlation structure of the regressors when designing a regression model.
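The following Monte Carlo sketch (Python/numpy; all parameter values are made up for illustration) shows how the sampling variability of \(\hat{\beta}_2\) grows with the correlation between the two regressors, even though the sample size is held fixed:
```python
# Monte Carlo sketch: the sampling standard deviation of beta_2_hat grows as the
# correlation between X1 and X2 increases, holding everything else fixed.
import numpy as np

rng = np.random.default_rng(8)
N, reps = 100, 2000

def sd_beta2(rho):
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=N)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=N)
        y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=N)
        X = np.column_stack([np.ones(N), x1, x2])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][2])
    return np.std(estimates)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, sd_beta2(rho))
```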
To diagnose multicollinearity, it is common to look at the variance inflation factors of the regressors.
Definition 4.4 The variance inflation factor (VIF) for regressor \(X_j\) is defined as: \[
VIF_j = \frac{1}{1 - R_j^2},
\] where \(R_j^2\) is the (centered) \(R^2\) obtained by regressing \(X_j\) on all the other regressors.
High VIFs indicate that the regressors are highly correlated and that multicollinearity may be a problem. A common rule of thumb is that a VIF above 10 indicates a multicollinearity problem, although this threshold is somewhat arbitrary and depends on the context. In particular, VIFs are not a formal statistical test for multicollinearity.
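A minimal VIF computation can be written with numpy alone, regressing each regressor on the others plus a constant and applying Definition 4.4 (a sketch; the data are simulated and the function name `vif` is just for illustration):
```python
# A minimal VIF computation: regress each regressor on the others (plus a
# constant), compute the centered R^2, and apply Definition 4.4.
import numpy as np

def vif(X):
    """X: (N, K) matrix of non-constant regressors; returns one VIF per column."""
    N, K = X.shape
    out = []
    for j in range(K):
        xj = X[:, j]
        others = np.column_stack([np.ones(N), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, xj, rcond=None)[0]
        resid = xj - others @ coef
        r2_j = 1.0 - (resid @ resid) / ((xj - xj.mean()) @ (xj - xj.mean()))
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

rng = np.random.default_rng(9)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=200)   # highly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))      # large VIFs for x1 and x2
```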
The following exercise asks the reader to show that the variance of the OLS estimator of \(\beta_j\) can be written in terms of the VIF.
Exercise 4.5 Show that the variance of the OLS estimator of \(\beta_j\) can be written as: \[
Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)Var(X_j)}VIF_j,
\] where \(Var(X_j)\) is the sample variance of \(X_j\).
Note that the VIF for regressor \(X_j\) depends on all the other regressors included in the regression, \(X_{-j}\). Different sets of regressors may lead to different VIFs for the same regressor. Hence, it is important to consider the VIFs of all the regressors when diagnosing multicollinearity.
Solution 4.1 (Solution to Exercise 4.3.). Note that if \(X\) is replaced by \(XA\) then, \[
\begin{align*}
P_{XA}Y &= XA((XA)'XA)^{-1}(XA)'Y \\
&= XA(A'X'XA)^{-1}A'X'Y = XA(A)^{-1}(X'X)^{-1}(A')^{-1}A'X'Y \\
& = P_XY,
\end{align*}
\] where we use the fact that \(A\) is invertible and that \((A')^{-1} = (A^{-1})'\). Replacing this in the definition of \(R^2\) shows the invariance to linear transformations of \(X\).
Furthermore, if \(Y\) is replaced by \(\alpha Y\) then, \[
P_X(\alpha Y) = \alpha P_XY, \quad ||\alpha Y||^2 = \alpha^2 ||Y||^2.
\]
so that the factor \(\alpha^2\) cancels in the ratio and the \(R^2\) is unchanged.
Solution 4.2 (Solution to Exercise 4.4.). Hint: Show that the difference between the two \(R^2\) values can be written as \(Y' (P_{[X\ Z]} - P_X) Y / ||Y||^2\) and use the properties of projection matrices to show that \(P_{[X\ Z]} - P_X\) is itself symmetric and idempotent, hence positive semidefinite, so that the quadratic form is non-negative.
Solution 4.3 (Solution to Exercise 4.5.). Note that \[
\begin{align*}
1-R^2_{X_j|X_{-j}} &= 1-\frac{||P_{-j}X_j||^2}{||X_j||^2} = 1-\frac{X_j'P_{-j}X_j}{X_j'X_j} \\
&= \frac{X_j'X_j-X_j'P_{-j}X_j}{X_j'X_j} =\frac{X_j'M_{-j}X_j}{X_j'X_j},
\end{align*}
\] where \(P_{-j}\) is the projection matrix onto the space spanned by all the regressors except \(X_j\), \(M_{-j} = I - P_{-j}\), and we have used the properties of projection matrices.
Hence, \[X_j'M_{-j}X_j = (X_j'X_j)(1-R^2_{X_j|X_{-j}})=(n-1)Var(X_j)(1-R^2_{X_j|X_{-j}}),\] where we have used that \(Var(X_j)=\frac{1}{n-1}X_j'X_j\), which holds when \(X_j\) is measured in deviations from its mean. Finally, by the FWL theorem, \[Var(\hat{\beta}_j) = \sigma^2(X_j'M_{-j}X_j)^{-1} = \frac{\sigma^2}{(n-1)Var(X_j)}\cdot\frac{1}{1-R^2_{X_j|X_{-j}}} = \frac{\sigma^2}{(n-1)Var(X_j)}VIF_j.\]