This chapter discusses the assumptions underlying the Ordinary Least Squares (OLS) estimator and relates each of them to the estimator's properties. We discuss each assumption in detail, providing intuition and examples. The assumptions are presented in the order in which they are needed to understand the OLS estimator. The chapter finishes with a summary of all the assumptions and the Gauss-Markov theorem, which proves that OLS is the Best Linear Unbiased Estimator (BLUE).
3.2 Linear model
We are interested in assessing the effect that a set of explanatory variables, typically denoted by \(X\), has on the dependent variable, denoted by \(Y\). For this, we assume that there is a linear relationship between them.
Assumption: Correct specification
The model is correctly specified if the true model is linear in the parameters and it is given by \[
Y = X\beta + U,
\tag{3.1}\]
where \(\beta\) is a vector of parameters that we want to estimate, and \(U\) is the error term. The error term captures the effect of all other factors that affect \(Y\) but cannot be explained by \(X\).
Explanatory variables are also called regressors or independent variables. The dependent variable is also called the regressand or response variable. The term regression comes from the seminal work of Francis Galton on the relationship between parents’ and children’s heights (Galton 1886). He observed that tall parents tend to have children not as tall as themselves, and short parents tend to have children not as short as themselves. He called this phenomenon regression to the mean.
Equation 3.1 is a linear model in the sense that it is linear in the parameters \(\beta\). No parameter is raised to a power other than one, multiplied by another parameter, or subjected to any other nonlinear transformation. However, the model may not be linear in the variables. The matrix \(X\) may contain nonlinear transformations of the explanatory variables.
A classical example of a model that is nonlinear in the explanatory variables is the quadratic model for the relationship between earnings and education. One formulation of this model is: \[
earnings = \beta_0 + \beta_1\, education + \beta_2\, education^2 + U.
\tag{3.2}\]
In Equation 3.2, the dependent variable is \(earnings\), and the independent variable is \(education\). The model assumes that \(earnings\) depend on \(education\) in a nonlinear way, where \(education^2\) is a nonlinear transformation meant to capture the diminishing returns to \(education\).
That is, the marginal effect of \(education\) on \(earnings\) decreases as \(education\) increases. While one extra year of \(education\) has a positive effect on \(earnings\) for everyone, the effect is larger for someone with less \(education\) than for someone with more \(education\). This can be seen by computing the marginal effect of \(education\) on \(earnings\): \[
\frac{\partial\, earnings}{\partial\, education} = \beta_1 + 2\beta_2\, education,
\]
where \(\beta_2\) is typically found to be negative, capturing the diminishing returns to education.
Note however that Equation 3.2 is linear in the parameters \(\beta_0\), \(\beta_1\), and \(\beta_2\). Hence, it can be estimated using OLS.
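To make this concrete, the following sketch (with hypothetical coefficient values and a simulated dataset) estimates a quadratic earnings equation by OLS; the only change relative to a purely linear specification is that \(education^2\) enters as an additional column of \(X\).
using Random, Distributions

Random.seed!(1)                       # for reproducibility
n = 500                               # sample size
education = rand(Uniform(8, 20), n)   # hypothetical years of education
u = rand(Normal(0, 2), n)             # error term

# Hypothetical true parameters; β₂ < 0 captures diminishing returns
β0, β1, β2 = 5.0, 2.0, -0.05
earnings = β0 .+ β1 .* education .+ β2 .* education.^2 .+ u

# Nonlinear in education, but linear in the parameters:
# OLS applies with education² as an extra regressor.
X = [ones(n) education education.^2]
β_hat = (X' * X) \ (X' * earnings)    # OLS estimates of β₀, β₁, β₂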
3.3 Estimation
The OLS estimator is based on minimizing the sum of squared residuals, which are the differences between the observed values of the dependent variable and the values predicted by the model.
From Equation 3.1, we can write the error term as:
\[
U = Y - X\beta,
\tag{3.3}\]
which depends on the true unknown parameter \(\beta\).
We cannot compute the error term, as we do not know the true value of \(\beta\). However, we can compute the residuals, which are the differences between the observed values of \(Y\) and the values predicted by the model using an estimate of \(\beta\), denoted by \(\hat{\beta}\). That is, the residuals are given by:
\[
\hat{U} = Y - X\hat{\beta}.
\tag{3.4}\]
This last point is important, so we highlight it in a callout.
Distinction between errors and residuals
The error term \(U\) is the difference between the observed values of \(Y\) and the values predicted by the model using the true parameter \(\beta\). The residuals \(\hat{U}\) are the difference between the observed values of \(Y\) and the values predicted by the model using an estimate of \(\beta\), denoted by \(\hat{\beta}\).
The residuals are observable, while the error term is not. This has important implications for the properties of the OLS estimator, as we will see later.
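A small simulation, with hypothetical parameter values, makes this distinction concrete: the errors below use the true \(\beta\) and could never be computed from real data, while the residuals use the estimate \(\hat{\beta}\).
using Random

Random.seed!(2)
n = 100
X = randn(n, 2)                  # two regressors
β = [1.0, -0.5]                  # true parameters (unknown in practice)
U = randn(n)                     # error term
Y = X * β + U

β_hat = (X' * X) \ (X' * Y)      # OLS estimate (formula derived below)
errors = Y - X * β               # error term: requires the true β
residuals = Y - X * β_hat        # residuals: computable from the data and β̂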
The residual sum of squares (\(RSS\)) is given by: \[
RSS(\hat{\beta}) = \hat{U}'\hat{U} = (Y - X\hat{\beta})'(Y - X\hat{\beta}).
\]
The OLS estimator is the value of \(\hat{\beta}\) that minimizes the \(RSS\). Setting the derivative with respect to \(\hat{\beta}\) equal to zero gives the normal equations \(X'X\hat{\beta} = X'Y\), whose solution is: \[
\hat{\beta} = (X'X)^{-1}X'Y.
\tag{3.7}\]
Above, we needed the matrix \(X'X\) to be invertible. This is guaranteed when the regressors are linearly independent. This is another of the assumptions that underlie the OLS estimator.
Assumption: Linear independence
Linear independence means that the matrix of regressors has full column rank: no regressor is a linear combination of the others.
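To illustrate the assumption, the sketch below (with hypothetical regressors) constructs one regressor as an exact linear combination of two others; the resulting \(X'X\) is rank deficient, so \((X'X)^{-1}\) does not exist and the OLS formula breaks down.
using LinearAlgebra, Random

Random.seed!(3)
n = 100
x1 = randn(n)
x2 = randn(n)
x3 = 2 .* x1 .- x2                # exact linear combination of x1 and x2

X_good = [ones(n) x1 x2]          # linearly independent columns
X_bad  = [ones(n) x1 x2 x3]       # violates linear independence

rank(X_good' * X_good)            # 3: full rank, invertible
rank(X_bad' * X_bad)              # 3 < 4: rank deficient, not invertible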
Furthermore, we note that it is indeed a minimum by computing the second derivative: \[
\frac{\partial^2 RSS(\hat{\beta})}{\partial \hat{\beta}\,\partial \hat{\beta}'} = 2X'X,
\]
which is positive definite when the regressors are linearly independent.
3.4 Unbiasedness
The OLS estimator is unbiased if its expectation equals the true parameter, \(E[\hat{\beta}] = \beta\). Substituting Equation 3.1 into Equation 3.7 and taking expectations, we obtain: \[
E[\hat{\beta}] = E[(X'X)^{-1}X'(X\beta + U)] = \beta + E[(X'X)^{-1}X'U],
\]
where we have used the linearity of the expectation operator and the correct specification assumption (Equation 3.1).
Hence, to show that OLS is unbiased we require that:
\[
E[(X'X)^{-1}X'U] = 0.
\tag{3.10}\]
There are two possibilities for this to hold:
\(X\) is nonstochastic and \(E[U]=0\).
\(X\) is exogenous, i.e., \(E[U \mid X] = 0\), where \(E[\cdot \mid X]\) denotes the conditional expectation given \(X\).
In the second case, we use the law of iterated expectations to show Equation 3.10.
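Spelling out this step, under exogeneity: \[
E[(X'X)^{-1}X'U] = E\big[\,E[(X'X)^{-1}X'U \mid X]\,\big] = E\big[(X'X)^{-1}X'\,E[U \mid X]\big] = E\big[(X'X)^{-1}X' \cdot 0\big] = 0.
\]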
The conditions above constitute the next assumption for the OLS estimator.
Assumption: Exogeneity
Exogeneity means that the error term has mean zero conditional on the regressors, \(E[U \mid X] = 0\); in particular, the regressors are uncorrelated with the error term. Alternatively, we could assume that the regressors are nonstochastic and the error term has mean zero.
The nonstochastic assumption may be sensible when we can control the inputs, like in an experimental setting. Nonetheless, it is often not reasonable in applied econometric work.
On the other hand, the exogeneity assumption is reasonable for cross-sectional data where each observation corresponds to an individual. Note that the assumption implies that the error term for one individual is uncorrelated with the regressors for all individuals: one person’s unobserved characteristics are unlikely to be correlated with another person’s observed characteristics.
However, as the example below illustrates, exogeneity is a strong assumption for time-series data. It imposes the restriction that the errors are uncorrelated with all past and future values of the regressors.
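The sketch (with hypothetical parameter values) simulates an AR(1) model, in which the regressor is the lagged dependent variable and therefore depends on past errors; the average OLS estimate of the autoregressive coefficient falls below its true value in small samples.
using Random, Statistics

Random.seed!(4)
R = 1000                              # number of replications
T = 50                                # length of each time series
ρ = 0.9                               # hypothetical true autoregressive coefficient
ρ_hat = zeros(R)
for r in 1:R
    y = zeros(T)
    for t in 2:T
        y[t] = ρ * y[t-1] + randn()   # the regressor y[t-1] is correlated with past errors
    end
    x = y[1:T-1]                      # lagged dependent variable as regressor
    ρ_hat[r] = (x' * x) \ (x' * y[2:T])
end
mean(ρ_hat)                           # noticeably below 0.9: OLS is biased here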
3.4.0.1 Example: Unbiasedness of OLS
Figure 3.1 illustrates the unbiasedness of the OLS estimator under exogeneity. The figure shows the distribution of the OLS estimator for the slope coefficient in a simple linear model in which the regressor is generated independently of the error term and hence is exogenous. The true value of the slope coefficient is 0.5, and the mean of the estimated coefficients is very close to this value, illustrating that the OLS estimator is unbiased even for a small sample size of 100.
The code for generating the figure is available in the code snippet below.
Code
using StatsPlots, Distributions, Random

Random.seed!(123)   # for reproducibility
R = 1000;           # number of replications
N = 100;            # sample size
β = 0.5
beta = zeros(R, 1)  # vector to store the estimated coefficients
for ii in 1:R
    V = rand(Normal(0, 1), N)   # error term
    X = rand(Normal(0, 1), N)   # regressor
    Y = X * β + V               # regressand
    beta[ii] = (X' * X) \ (X' * Y)   # OLS estimator
end
theme(:dracula)
boxplot(beta, label="Estimated regressor", orientation=:horizontal, color=4)
vline!([mean(beta)], label="Mean estimate", color=3, lw=3, legend=:topleft)
vline!([0.5], label="True value", color=1, lw=3, linestyle=:dash)
plot!(fontfamily="Computer Modern", titlefontfamily="Computer Modern",
    legendfontfamily="Computer Modern", tickfontfamily="Computer Modern",
    legendfontsize=10, xlabelfontsize=10, ylabelfontsize=10,
    xlabel="", ylabel="")
Figure 3.1: Unbiasedness of the OLS estimator under exogeneity
3.5 Precision
The precision of the OLS estimator is measured by its covariance matrix, which depends on the second moments of the error term.
Computing the variance of the OLS estimator we obtain: \[
Var(\hat{\beta}) = E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] = E[(X'X)^{-1}X'U U'X(X'X)^{-1}],
\] where we have substituted \(\hat{\beta}\) (Equation 3.7) and used the correct specification assumption (Equation 3.1).
As in the unbiasedness case, we consider two cases: \(X\) is nonstochastic or \(X\) is exogenous. In both cases, we can write the variance as: \[
Var(\hat{\beta}) = (X'X)^{-1}X'\,E(UU')\,X(X'X)^{-1}.
\tag{3.11}\]
Equation 3.11 shows that the variance of the OLS estimator depends on the second moment of the error term, \(E(UU')\), either conditional on the regressors or unconditionally. Without further assumptions, this expression, which has a sandwich form, cannot be simplified further.
In general, \(E(UU')\) is an \(n \times n\) matrix, which can be very complex. To simplify this expression, we need to make further assumptions about the error term.
Assumption: No autocorrelation
The error terms are uncorrelated across observations.
Assumption: Homoskedasticity
The variance of the error term is constant across observations.
No autocorrelation means that \(E(UU')\) is a diagonal matrix. Homoskedasticity means that the diagonal elements are constant.
Under these two assumptions, we can write \(Var[U]=E[UU'] = \sigma^2 I\), so that Equation 3.11 can be simplified to obtain: \[
Var(\hat{\beta}) = \sigma^2(X'X)^{-1}.
\tag{3.12}\]
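A quick Monte Carlo check (hypothetical parameter values, fixed design matrix) compares the sampling covariance of \(\hat{\beta}\) across replications with the formula in Equation 3.12.
using LinearAlgebra, Random, Statistics

Random.seed!(5)
R, n = 5000, 100
σ = 2.0
X = [ones(n) randn(n)]           # nonstochastic design, held fixed across replications
β = [1.0, 0.5]
b = zeros(R, 2)
for r in 1:R
    U = σ .* randn(n)            # homoskedastic, uncorrelated errors
    Y = X * β + U
    b[r, :] = (X' * X) \ (X' * Y)
end
cov(b)                           # empirical covariance of the OLS estimates
σ^2 .* inv(X' * X)               # Equation 3.12: should be close to cov(b)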
As shown below, the variance of the OLS estimator under the assumptions above (Equation 3.12) is the smallest possible variance for a linear unbiased estimator. This is the content of the Gauss-Markov theorem.
3.6 The Gauss-Markov Theorem
This section presents the Gauss-Markov theorem, which states that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) under certain assumptions.
Theorem 3.1 (Gauss-Markov) In a regression under correct specification, with exogenous regressors and homoskedastic, non-autocorrelated errors, the OLS estimator is at least as efficient as any other linear unbiased estimator.
In other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Proof. Let \(\tilde{\beta}\) be another linear unbiased estimator. That means there exists a matrix \(A\) such that \(\tilde{\beta} = A Y\).
Given linearity, we can write the estimator as
\[
\tilde{\beta} = A Y = ((X'X)^{-1}X'+C)Y = \hat{\beta}+CY,
\tag{3.13}\]
where \(C = A-(X'X)^{-1}X'\).
Given that both estimators are unbiased, we have that: \[
\beta = E[\tilde{\beta}] = E[\hat{\beta} + CY] = \beta + CX\beta + E[CU] = \beta + CX\beta,
\tag{3.14}\]
where the last equality follows from the exogeneity assumption.
Equation 3.14 implies that \(CX\beta = 0\), and since this must hold for any \(\beta\), it follows that \(CX = 0\). Hence, \(CY = CX\beta + CU = CU\), which has mean zero since \(E[CU] = 0\).
In turn, Equation 3.13 implies that \(\tilde{\beta}\) can be written as the sum of the OLS estimator and a random variable with mean zero.
Computing the variance of \(\tilde{\beta}\) we obtain: \[
Var[\tilde{\beta}] = Var[\hat{\beta} + CY] = Var[\hat{\beta}] + Var[CY] = \sigma^2(X'X)^{-1} + \sigma^2 CC',
\tag{3.15}\]
where we have used that \(CX = 0\), so that the covariance between \(\hat{\beta}\) and \(CY\) vanishes, and that the variance-covariance matrix of the error term is given by \(\sigma^2 I\). That is, the error term is homoskedastic and uncorrelated.
Equation 3.15 then implies that \(Var[\tilde{\beta}] \geq Var[\hat{\beta}]\), since the difference \(Var[\tilde{\beta}] - Var[\hat{\beta}] = \sigma^2 CC' = Var[CY]\) is positive semi-definite, which concludes the proof.
Remark on Normality
Note that the Gauss-Markov theorem does not require the error term to be normally distributed. This is a common misconception, as OLS is often equated with the Maximum Likelihood Estimator (MLE) under normality. The MLE derivation requires normality; the OLS estimator and the Gauss-Markov theorem do not.
3.7 Distribution of the OLS estimator
To derive the distribution of the OLS estimator, we need to make further assumptions about the error term.
The simplest assumption is that the error term is normally distributed.
Theorem 3.2 (Normal Distribution of OLS Estimator) Under correct specification; exogenous regressors; and homoskedastic, non-autocorrelated, normally distributed errors, the OLS estimator follows a normal distribution with mean \(\beta\) and variance \(\sigma^2(X'X)^{-1}\).
Proof. The OLS estimator is a linear function of the error term, \(\hat{\beta} = \beta + (X'X)^{-1}X'U\), so given normality of the error term it also follows a normal distribution. We only need its mean and variance, which we have already computed in Equation 3.9 and Equation 3.12.
Note that the assumptions in Theorem 3.2 are stronger than those in the Gauss-Markov theorem. The additional assumption of normality is needed to derive the distribution of the OLS estimator. Under these assumptions, the OLS estimator is equivalent to the Maximum Likelihood Estimator (MLE).
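To see this equivalence, write the log-likelihood of the model under normal errors: \[
\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta).
\]
For any value of \(\sigma^2\), maximizing \(\log L\) over \(\beta\) amounts to minimizing \((Y - X\beta)'(Y - X\beta)\), the residual sum of squares, so the MLE of \(\beta\) coincides with the OLS estimator.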
3.7.1 Example: Distribution of OLS estimator
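A Monte Carlo simulation can illustrate Theorem 3.2. A minimal sketch, assuming the simple model \(y = \beta_1 x + u\) with a standard normal regressor and normal errors (hypothetical values \(\beta_1 = 1\), \(\sigma = 1\)), is shown below; the exercise at the end of the chapter asks you to replicate and extend it.
using Random, Distributions, StatsPlots

Random.seed!(6)
R, n = 1000, 100
β1, σ = 1.0, 1.0
b = zeros(R)
for r in 1:R
    x = randn(n)
    u = σ .* randn(n)
    y = β1 .* x .+ u
    b[r] = (x' * x) \ (x' * y)       # OLS estimator (no intercept in the model)
end
histogram(b, normalize = :pdf, label = "OLS estimates")
# Overlay a normal density with mean β₁ and variance σ²/n,
# since X'X is close to n for a standard normal regressor
plot!(z -> pdf(Normal(β1, σ / sqrt(n)), z), 0.6, 1.4, label = "Normal density")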
3.8 OLS Assumptions (Wrap-up)
The assumptions needed for the OLS estimator to have the properties discussed above are summarized below. You should be able to explain each assumption and its role in the properties of the OLS estimator.
Definition 3.1 (OLS Assumptions)
Correct specification
Linear independence
Exogeneity
Homoskedasticity
No autocorrelation
Normality (for distribution of the estimator)
The assumptions in Definition 3.1 are the standard ones in econometrics. Together they ensure that the OLS estimator is unbiased, efficient, and consistent.
The first three assumptions are needed for unbiasedness and consistency. The next two are needed for efficiency, and the last one is needed to derive the distribution of the estimator. The last assumption can be relaxed when the sample size is large, as the Central Limit Theorem (CLT) can then be used to derive the asymptotic distribution of the OLS estimator. This is discussed in the next chapter.
3.9 Exercises
(Normality of OLS Estimator) In this exercise, you are going to conduct a Monte Carlo simulation to show graphically that the OLS estimator follows a normal distribution under all the assumptions. That is, you are going to replicate the plot in Section 3.7.1.
Hence, you will need to follow the steps below:
Set the sample size \(n=100\).
Then, for \(R=1000\) repetitions, do the following:
Generate a regressor \(x\) from a normal distribution with mean 0 and variance 1.
Generate an error term \(u\) from a normal distribution with mean 0 and variance \(\sigma^2\) of your choosing.
Generate a dependent variable \(y\) from the following model: \[y = \beta_1 x + u,\] for \(\beta_1 = 1\).
Why do we not need to generate an intercept (constant term)?
Estimate the model above and store the OLS estimator \(\hat{\beta}_1\) in a vector of size \(R\).
Plot the histogram of \(\hat{\beta}_1\).
Compare the histogram with the normal distribution with mean \(\beta_1\) and variance \(\sigma^2(X'X)^{-1}\).
What is \(plim (\frac{1}{n}X'X)^{-1}\) for this model?
Increase the sample size and comment on the results.
Hint: You can use the function randn() to simulate both the regressor \(x\) and the error term \(u\).
Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–63. http://www.jstor.org/stable/2841583.