This chapter discusses the assumptions underlying the Ordinary Least Squares (OLS) estimator and relates each of them to the estimator's properties. We discuss each assumption in detail, providing intuition and examples. The assumptions are presented in the order in which they are needed to understand the OLS estimator. The chapter finishes with a summary of all the assumptions and the Gauss-Markov theorem, which proves that OLS is the Best Linear Unbiased Estimator (BLUE).
3.2 Linear model
We are interested in assessing the effect that a set of explanatory variables, typically denoted by \(X\), has on the dependent variable, denoted by \(Y\). For this, we assume that there is a linear relationship between them.
Assumption: Correct specification
The model is correctly specified if the true model is linear in the parameters and it is given by \[
Y = X\beta + U,
\tag{3.1}\]
where \(\beta\) is a vector of parameters that we want to estimate, and \(U\) is the error term. The error term captures the effect of all other factors that affect \(Y\) but cannot be explained by \(X\).
Explanatory variables are also called regressors or independent variables. The dependent variable is also called the regressand or response variable. The term regression comes from the seminal work of Francis Galton on the relationship between parents’ and children’s heights (Galton 1886). He observed that tall parents tend to have children not as tall as themselves, and short parents tend to have children not as short as themselves. He called this phenomenon regression to the mean.
Equation 3.1 is a linear model in the sense that it is linear in the parameters \(\beta\). No parameter is raised to a power other than one, multiplied by another parameter, or subjected to any other nonlinear transformation. However, the model may not be linear in the variables. The matrix \(X\) may contain nonlinear transformations of the explanatory variables.
A classical example of a model that is nonlinear in the explanatory variables is the quadratic model for the relationship between earnings and education. One formulation of this model is: \[
earnings = \beta_0 + \beta_1\, education + \beta_2\, education^2 + U.
\tag{3.2}\]
In Equation 3.2, the dependent variable is \(earnings\), and the independent variable is \(education\). The model assumes that \(earnings\) depend on \(education\) in a nonlinear way, where \(education^2\) is a nonlinear transformation meant to capture the diminishing returns to \(education\).
That is, the marginal effect of \(education\) on \(earnings\) decreases as \(education\) increases. While one extra year of \(education\) has a positive effect on \(earnings\) for everyone, the effect is larger for someone with less \(education\) than for someone with more \(education\). This can be seen by computing the marginal effect of \(education\) on \(earnings\): \[
\frac{\partial\, earnings}{\partial\, education} = \beta_1 + 2\beta_2\, education,
\]
where \(\beta_2\) is typically found to be negative, capturing the diminishing returns to education.
Note however that Equation 3.2 is linear in the parameters \(\beta_0\), \(\beta_1\), and \(\beta_2\). Hence, it can be estimated using OLS.
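To make this concrete, the following sketch (with hypothetical coefficient values and a simulated dataset) estimates a quadratic earnings equation by OLS; the only change relative to a purely linear specification is that \(education^2\) enters as an additional column of \(X\).
using Random, Distributions

Random.seed!(1)                       # for reproducibility
n = 500                               # sample size
education = rand(Uniform(8, 20), n)   # hypothetical years of education
u = rand(Normal(0, 2), n)             # error term

# Hypothetical true parameters; β₂ < 0 captures diminishing returns
β0, β1, β2 = 5.0, 2.0, -0.05
earnings = β0 .+ β1 .* education .+ β2 .* education.^2 .+ u

# Nonlinear in education, but linear in the parameters:
# OLS applies with education² as an extra regressor.
X = [ones(n) education education.^2]
β_hat = (X' * X) \ (X' * earnings)    # OLS estimates of β₀, β₁, β₂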
3.3 Estimation
The OLS estimator is based on minimizing the sum of squared residuals, which are the differences between the observed values of the dependent variable and the values predicted by the model.
From Equation 3.1, we can write the error term as:
\[
U = Y - X\beta,
\tag{3.3}\]
which depends on the true unknown parameter \(\beta\).
We cannot compute the error term, as we do not know the true value of \(\beta\). However, we can compute the residuals, which are the differences between the observed values of \(Y\) and the values predicted by the model using an estimate of \(\beta\), denoted by \(\hat{\beta}\). That is, the residuals are given by:
\[
\hat{U} = Y - X\hat{\beta}.
\tag{3.4}\]
This last point is important, so we highlight it in a callout.
Distinction between errors and residuals
The error term \(U\) is the difference between the observed values of \(Y\) and the values predicted by the model using the true parameter \(\beta\). The residuals \(\hat{U}\) are the difference between the observed values of \(Y\) and the values predicted by the model using an estimate of \(\beta\), denoted by \(\hat{\beta}\).
The residuals are observable, while the error term is not. This has important implications for the properties of the OLS estimator, as we will see later.
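A small simulation, with hypothetical parameter values, makes this distinction concrete: the errors below use the true \(\beta\) and could never be computed from real data, while the residuals use the estimate \(\hat{\beta}\).
using Random

Random.seed!(2)
n = 100
X = randn(n, 2)                  # two regressors
β = [1.0, -0.5]                  # true parameters (unknown in practice)
U = randn(n)                     # error term
Y = X * β + U

β_hat = (X' * X) \ (X' * Y)      # OLS estimate (formula derived below)
errors = Y - X * β               # error term: requires the true β
residuals = Y - X * β_hat        # residuals: computable from the data and β̂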
The residual sum of squares (\(RSS\)) is given by: \[
RSS(\hat{\beta}) = \hat{U}'\hat{U} = (Y - X\hat{\beta})'(Y - X\hat{\beta}).
\]
The OLS estimator is the value of \(\hat{\beta}\) that minimizes the \(RSS\). Setting the derivative with respect to \(\hat{\beta}\) equal to zero gives the normal equations \(X'X\hat{\beta} = X'Y\), whose solution is: \[
\hat{\beta} = (X'X)^{-1}X'Y.
\tag{3.7}\]
Above, we needed the matrix \(X'X\) to be invertible. This is guaranteed when the regressors are linearly independent. This is another of the assumptions that underlie the OLS estimator.
Assumption: Linear independence
Linear independence means that the matrix of regressors has full column rank: no regressor is a linear combination of the others.
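To illustrate the assumption, the sketch below (with hypothetical regressors) constructs one regressor as an exact linear combination of two others; the resulting \(X'X\) is rank deficient, so \((X'X)^{-1}\) does not exist and the OLS formula breaks down.
using LinearAlgebra, Random

Random.seed!(3)
n = 100
x1 = randn(n)
x2 = randn(n)
x3 = 2 .* x1 .- x2                # exact linear combination of x1 and x2

X_good = [ones(n) x1 x2]          # linearly independent columns
X_bad  = [ones(n) x1 x2 x3]       # violates linear independence

rank(X_good' * X_good)            # 3: full rank, invertible
rank(X_bad' * X_bad)              # 3 < 4: rank deficient, not invertible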
Furthermore, we note that it is indeed a minimum by computing the second derivative: \[
\frac{\partial^2 RSS(\hat{\beta})}{\partial \hat{\beta}\,\partial \hat{\beta}'} = 2X'X,
\]
which is positive definite when the regressors are linearly independent.
3.4 Unbiasedness
The OLS estimator is unbiased if its expectation equals the true parameter, \(E[\hat{\beta}] = \beta\). Substituting Equation 3.1 into Equation 3.7 and taking expectations, we obtain: \[
E[\hat{\beta}] = E[(X'X)^{-1}X'(X\beta + U)] = \beta + E[(X'X)^{-1}X'U],
\]
where we have used the linearity of the expectation operator and the correct specification assumption (Equation 3.1).
Hence, to show that OLS is unbiased we require that:
\[
E[(X'X)^{-1}X'U] = 0.
\tag{3.10}\]
There are two possibilities for this to hold:
\(X\) is nonstochastic and \(E[U]=0\).
\(X\) is exogenous, i.e., \(E[U \mid X] = 0\), where \(E[\cdot \mid X]\) denotes the conditional expectation given \(X\).
In the second case, we use the law of iterated expectations to show Equation 3.10.
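Spelling out this step, under exogeneity: \[
E[(X'X)^{-1}X'U] = E\big[\,E[(X'X)^{-1}X'U \mid X]\,\big] = E\big[(X'X)^{-1}X'\,E[U \mid X]\big] = E\big[(X'X)^{-1}X' \cdot 0\big] = 0.
\]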
The conditions above constitute the next assumption for the OLS estimator.
Assumption: Exogeneity
Exogeneity means that the error term has mean zero conditional on the regressors, \(E[U \mid X] = 0\); in particular, the regressors are uncorrelated with the error term. Alternatively, we could assume that the regressors are nonstochastic and the error term has mean zero.
The nonstochastic assumption may be sensible when we can control the inputs, like in an experimental setting. Nonetheless, it is often not reasonable in applied econometric work.
On the other hand, the exogeneity assumption is reasonable for cross-sectional data where each observation corresponds to an individual. Note that the assumption implies that the error term for one individual is uncorrelated with the regressors for all individuals: one person’s unobserved characteristics are unlikely to be correlated with another person’s observed characteristics.
However, as the example below illustrates, exogeneity is a strong assumption for time-series data. It imposes the restriction that the errors are uncorrelated with all past and future values of the regressors.
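The sketch (with hypothetical parameter values) simulates an AR(1) model, in which the regressor is the lagged dependent variable and therefore depends on past errors; the average OLS estimate of the autoregressive coefficient falls below its true value in small samples.
using Random, Statistics

Random.seed!(4)
R = 1000                              # number of replications
T = 50                                # length of each time series
ρ = 0.9                               # hypothetical true autoregressive coefficient
ρ_hat = zeros(R)
for r in 1:R
    y = zeros(T)
    for t in 2:T
        y[t] = ρ * y[t-1] + randn()   # the regressor y[t-1] is correlated with past errors
    end
    x = y[1:T-1]                      # lagged dependent variable as regressor
    ρ_hat[r] = (x' * x) \ (x' * y[2:T])
end
mean(ρ_hat)                           # noticeably below 0.9: OLS is biased here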
3.4.0.1 Example: Unbiasedness of OLS
Figure 3.1 illustrates the unbiasedness of the OLS estimator under exogeneity. The figure shows the distribution of the OLS estimator for the slope coefficient in a simple linear model in which the regressor is generated independently of the error term and hence is exogenous. The true value of the slope coefficient is 0.5, and the mean of the estimated coefficients is very close to this value, illustrating that the OLS estimator is unbiased even for a small sample size of 100.
The code for generating the figure is available in the code snippet below.
Code
using StatsPlots, Distributions, Random

Random.seed!(123)   # for reproducibility
R = 1000;           # number of replications
N = 100;            # sample size
β = 0.5
beta = zeros(R, 1)  # vector to store the estimated coefficients
for ii in 1:R
    V = rand(Normal(0, 1), N)   # error term
    X = rand(Normal(0, 1), N)   # regressor
    Y = X * β + V               # regressand
    beta[ii] = (X' * X) \ (X' * Y)   # OLS estimator
end
theme(:dracula)
boxplot(beta, label="Estimated regressor", orientation=:horizontal, color=4)
vline!([mean(beta)], label="Mean estimate", color=3, lw=3, legend=:topleft)
vline!([0.5], label="True value", color=1, lw=3, linestyle=:dash)
plot!(fontfamily="Computer Modern", titlefontfamily="Computer Modern",
    legendfontfamily="Computer Modern", tickfontfamily="Computer Modern",
    legendfontsize=10, xlabelfontsize=10, ylabelfontsize=10,
    xlabel="", ylabel="")
Figure 3.1: Unbiasedness of the OLS estimator under exogeneity
3.5 Precision
The precision of the OLS estimator is measured by its covariance matrix, which depends on the second moments of the error term.
Computing the variance of the OLS estimator we obtain: \[
Var(\hat{\beta}) = E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] = E[(X'X)^{-1}X'U U'X(X'X)^{-1}],
\] where we have substituted \(\hat{\beta}\) (Equation 3.7) and used the correct specification assumption (Equation 3.1).
As in the unbiasedness case, we consider two cases: \(X\) is nonstochastic or \(X\) is exogenous. In both cases, we can write the variance as: \[
Var(\hat{\beta}) = (X'X)^{-1}X'\,E(UU')\,X(X'X)^{-1}.
\tag{3.11}\]
Equation 3.11 shows that the variance of the OLS estimator depends on the second moment of the error term, \(E(UU')\), either conditional on the regressors or unconditionally. Without further assumptions, this expression, which has a sandwich form, cannot be simplified further.
In general, \(E(UU')\) is an \(n \times n\) matrix, which can be very complex. To simplify this expression, we need to make further assumptions about the error term.
Assumption: No autocorrelation
The error terms are uncorrelated across observations.
Assumption: Homoskedasticity
The variance of the error term is constant across observations.
No autocorrelation means that \(E(UU')\) is a diagonal matrix. Homoskedasticity means that the diagonal elements are constant.
Under these two assumptions, we can write \(Var[U]=E[UU'] = \sigma^2 I\), so that Equation 3.11 can be simplified to obtain: \[
Var(\hat{\beta}) = \sigma^2(X'X)^{-1}.
\tag{3.12}\]
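A quick Monte Carlo check (hypothetical parameter values, fixed design matrix) compares the sampling covariance of \(\hat{\beta}\) across replications with the formula in Equation 3.12.
using LinearAlgebra, Random, Statistics

Random.seed!(5)
R, n = 5000, 100
σ = 2.0
X = [ones(n) randn(n)]           # nonstochastic design, held fixed across replications
β = [1.0, 0.5]
b = zeros(R, 2)
for r in 1:R
    U = σ .* randn(n)            # homoskedastic, uncorrelated errors
    Y = X * β + U
    b[r, :] = (X' * X) \ (X' * Y)
end
cov(b)                           # empirical covariance of the OLS estimates
σ^2 .* inv(X' * X)               # Equation 3.12: should be close to cov(b)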
As shown below, the variance of the OLS estimator under the assumptions above (Equation 3.12) is the smallest possible variance for a linear unbiased estimator. This is the content of the Gauss-Markov theorem.
3.6 The Gauss-Markov Theorem
This section presents the Gauss-Markov theorem, which states that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) under certain assumptions.
Theorem 3.1 (Gauss-Markov) In a regression under correct specification, with exogenous regressors and homoskedastic, non-autocorrelated errors, the OLS estimator is at least as efficient as any other linear unbiased estimator.
In other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Proof. Let \(\tilde{\beta}\) be another linear unbiased estimator. That means there exists a matrix \(A\) such that \(\tilde{\beta} = A Y\).
Given linearity, we can write the estimator as
\[
\tilde{\beta} = A Y = ((X'X)^{-1}X'+C)Y = \hat{\beta}+CY,
\tag{3.13}\]
where \(C = A-(X'X)^{-1}X'\).
Given that both estimators are unbiased, we have that: \[
\beta = E[\tilde{\beta}] = E[\hat{\beta} + CY] = \beta + CX\beta + E[CU] = \beta + CX\beta,
\tag{3.14}\]
where the last equality follows from the exogeneity assumption.
Equation 3.14 implies that \(CX\beta = 0\), and since this must hold for any \(\beta\), it follows that \(CX = 0\). Hence, \(CY = CX\beta + CU = CU\), which has mean zero since \(E[CU] = 0\).
In turn, Equation 3.13 implies that \(\tilde{\beta}\) can be written as the sum of the OLS estimator and a random variable with mean zero.
Computing the variance of \(\tilde{\beta}\) we obtain: \[
Var[\tilde{\beta}] = Var[\hat{\beta} + CY] = Var[\hat{\beta}] + Var[CY] = \sigma^2(X'X)^{-1} + \sigma^2 CC',
\tag{3.15}\]
where we have used that \(CX = 0\), so that the covariance between \(\hat{\beta}\) and \(CY\) vanishes, and that the variance-covariance matrix of the error term is given by \(\sigma^2 I\). That is, the error term is homoskedastic and uncorrelated.
Equation 3.15 then implies that \(Var[\tilde{\beta}] \geq Var[\hat{\beta}]\), since the difference \(Var[\tilde{\beta}] - Var[\hat{\beta}] = \sigma^2 CC' = Var[CY]\) is positive semi-definite, which concludes the proof.
Remark on Normality
Note that the Gauss-Markov theorem does not require the error term to be normally distributed. This is a common misconception, as OLS is often equated with the Maximum Likelihood Estimator (MLE) under normality. The MLE derivation requires normality; the OLS estimator and the Gauss-Markov theorem do not.
3.7 Distribution of the OLS estimator
To derive the distribution of the OLS estimator, we need to make further assumptions about the error term.
The simplest assumption is that the error term is normally distributed.
Theorem 3.2 (Normal Distribution of OLS Estimator) Under correct specification; exogenous regressors; and homoskedastic, non-autocorrelated, normally distributed errors, the OLS estimator follows a normal distribution with mean \(\beta\) and variance \(\sigma^2(X'X)^{-1}\).
Proof. The OLS estimator is a linear function of the error term, \(\hat{\beta} = \beta + (X'X)^{-1}X'U\), so given normality of the error term it also follows a normal distribution. We only need its mean and variance, which we have already computed in Equation 3.9 and Equation 3.12.
Note that the assumptions in Theorem 3.2 are stronger than those in the Gauss-Markov theorem. The additional assumption of normality is needed to derive the distribution of the OLS estimator. Under these assumptions, the OLS estimator is equivalent to the Maximum Likelihood Estimator (MLE).
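To see this equivalence, write the log-likelihood of the model under normal errors: \[
\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta).
\]
For any value of \(\sigma^2\), maximizing \(\log L\) over \(\beta\) amounts to minimizing \((Y - X\beta)'(Y - X\beta)\), the residual sum of squares, so the MLE of \(\beta\) coincides with the OLS estimator.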
3.7.1 Example: Distribution of OLS estimator
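A Monte Carlo simulation can illustrate Theorem 3.2. A minimal sketch, assuming the simple model \(y = \beta_1 x + u\) with a standard normal regressor and normal errors (hypothetical values \(\beta_1 = 1\), \(\sigma = 1\)), is shown below; the exercise at the end of the chapter asks you to replicate and extend it.
using Random, Distributions, StatsPlots

Random.seed!(6)
R, n = 1000, 100
β1, σ = 1.0, 1.0
b = zeros(R)
for r in 1:R
    x = randn(n)
    u = σ .* randn(n)
    y = β1 .* x .+ u
    b[r] = (x' * x) \ (x' * y)       # OLS estimator (no intercept in the model)
end
histogram(b, normalize = :pdf, label = "OLS estimates")
# Overlay a normal density with mean β₁ and variance σ²/n,
# since X'X is close to n for a standard normal regressor
plot!(z -> pdf(Normal(β1, σ / sqrt(n)), z), 0.6, 1.4, label = "Normal density")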
3.8 OLS Assumptions (Wrap-up)
The assumptions needed for the OLS estimator to have the properties discussed above are summarized below. You should be able to explain each assumption and its role in the properties of the OLS estimator.
Definition 3.1 (OLS Assumptions)
Correct specification
Linear independence
Exogeneity
Homoskedasticity
No autocorrelation
Normality (for distribution of the estimator)
The assumptions in Definition 3.1 are the standard ones in econometrics. Together they ensure that the OLS estimator is unbiased, efficient, and consistent.
The first three assumptions are needed for unbiasedness and consistency. The next two are needed for efficiency, and the last one is needed to derive the distribution of the estimator. The last assumption can be relaxed when the sample size is large, as the Central Limit Theorem (CLT) can then be used to derive the asymptotic distribution of the OLS estimator. This is discussed in the next chapter.
3.9 Exercises
(Normality of OLS Estimator) In this exercise, you are going to conduct a Monte Carlo simulation to show graphically that the OLS estimator follows a normal distribution under all the assumptions. That is, you are going to replicate the plot in Section 3.7.1.
Hence, you will need to follow the steps below:
Set the sample size \(n=100\).
Then, for \(R=1000\) repetitions, do the following:
Generate a regressor \(x\) from a normal distribution with mean 0 and variance 1.
Generate an error term \(u\) from a normal distribution with mean 0 and variance \(\sigma^2\) of your choosing.
Generate a dependent variable \(y\) from the following model: \[y = \beta_1 x + u,\] for \(\beta_1 = 1\).
Why do we not need to generate an intercept (constant term)?
Estimate the model above and store the OLS estimator \(\hat{\beta}_1\) in a vector of size \(R\).
Plot the histogram of \(\hat{\beta}_1\).
Compare the histogram with the normal distribution with mean \(\beta_1\) and variance \(\sigma^2(X'X)^{-1}\).
What is \(plim (\frac{1}{n}X'X)^{-1}\) for this model?
Increase the sample size and comment on the results.
Hint: You can use the function randn() to simulate both the regressor \(x\) and the error term \(u\).
Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–63. http://www.jstor.org/stable/2841583.