## Loading wide data
min_wage <- read.csv("m_wage.csv", header=TRUE, stringsAsFactors=FALSE)
min_wage_feb <- min_wage[,c("nj","wage_st","emptot","kfc", "wendys","co_owned")]
min_wage_nov <- min_wage[,c("nj","wage_st2","emptot2","kfc", "wendys","co_owned")]
## Create a treatment period indicator
min_wage_feb$treatment <- 0
min_wage_nov$treatment <- 1
## Make sure the two data.frames have the same column names
colnames(min_wage_nov) <- colnames(min_wage_feb)
## Stack the data.frames on top of one another
mwl <- rbind(min_wage_feb, min_wage_nov)
## Convert the variables to factors
mwl$treatment <- as.factor(mwl$treatment)
mwl$nj <- as.factor(mwl$nj)
mwl$kfc <- as.factor(mwl$kfc)
mwl$wendys <- as.factor(mwl$wendys)
mwl$co_owned <- as.factor(mwl$co_owned)
## Delete the temporary data.frames
rm(min_wage,min_wage_feb,min_wage_nov)
Minimum Wage Analysis with MICE
Introduction
The aim of this vignette is to replicate and enhance Card and Krueger’s (1994)[1] analysis of the effect of an increase in the minimum wage on employment by using Multivariate Imputation by Chained Equations (MICE)[2].
The broader goal is to deepen our understanding of multiple imputation and its use in econometric studies. This vignette therefore shows how to multiply impute Card and Krueger’s dataset and how to use the imputed data for the econometric analysis. We use R’s mice package for the analysis[3].
The Economic Analysis
Background: Card and Krueger’s well-known analysis examined the effect that an increase in the minimum wage has on employment levels. We are going to replicate some of their results while using multiple imputation to deal with the missing data.
Experiment: On April 1, 1992, the minimum wage in New Jersey was raised from $4.25 to $5.05. In the neighbouring state of Pennsylvania, however, the minimum wage remained constant at $4.25. Card and Krueger analysed the impact of the minimum wage increase on employment in the fast-food industry, a sector which employs many low-wage workers.
Data: The authors collected data on the number of employees in 331 fast-food restaurants in New Jersey and 79 in Pennsylvania. The survey was conducted in February 1992 (before the minimum wage was raised) and in November 1992 (after the minimum wage was raised).
The dataset necessary to replicate Card and Krueger’s analysis is available on the authors’ website. The dataset is stored in a “wide” format, i.e. there is a single row for each unit (restaurant) and separate columns for the outcomes and covariates in each survey wave. The dataset includes the following variables (as well as others which we will not use):
| Variable name | Description |
|---|---|
| nj | dummy variable equal to 1 if the restaurant is located in NJ |
| emptot | total employment in the restaurant (full-time-equivalent employees) |
| wage_st | average starting wage in the restaurant |
| co_owned | dummy variable equal to 1 if the restaurant was company-owned |
| bk | dummy variable equal to 1 if the restaurant was a Burger King |
| kfc | dummy variable equal to 1 if the restaurant was a KFC |
| wendys | dummy variable equal to 1 if the restaurant was a Wendy's |
| treatment | dummy variable equal to 1 in the post-treatment (November) period |
For the analysis, the data are reshaped into a long format, as done in the code above. The resulting dataset, mwl, contains 820 observations on 7 variables: nj, wage_st, emptot, kfc, wendys, co_owned, and treatment.
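As a quick sanity check (a trivial base-R sketch, not part of the original analysis), we can confirm the dimensions of the long data set:
## Should report 820 rows and 7 columns
dim(mwl)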
Missing Data
Loading the package and setting the seed
We load the mice
package for multiple imputation. We also set the seed for reproducibility.
library(mice)
set.seed(123)
Start by exploring the data
Get an overview of the data with the summary() command.
summary(mwl)
nj wage_st emptot kfc wendys co_owned treatment
0:158 Min. :4.250 Min. : 0.00 0:660 0:700 0:538 0:410
1:662 1st Qu.:4.500 1st Qu.:14.50 1:160 1:120 1:282 1:410
Median :5.000 Median :20.00
Mean :4.806 Mean :21.03
3rd Qu.:5.050 3rd Qu.:25.50
Max. :6.250 Max. :85.00
NA's :41 NA's :26
Note that there are 41 missing values in wage_st and 26 missing values in emptot.
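These counts can be verified directly with a one-line base-R check:
## Number of missing values in each variable
colSums(is.na(mwl))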
Using the head()
function, we show the first 6 observations.
head(mwl)
nj wage_st emptot kfc wendys co_owned treatment
1 0 NA 40.50 0 0 0 0
2 0 NA 13.75 1 0 0 0
3 0 NA 8.50 1 0 1 0
4 0 5.0 34.00 0 1 1 0
5 0 5.5 24.00 0 1 1 0
6 0 5.0 20.50 0 1 1 0
The missing values are not necessarily spread evenly across the data set. For example, the first three observations shown above, all from the pre-treatment (February) wave, are missing wage_st but not emptot. Cross-tabulating missingness against the treatment indicator (see the sketch below) is a quick way to check in which wave the nonresponse is concentrated. Wave-specific nonresponse is a common pattern in longitudinal data sets.
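A small base-R sketch of that check:
## Missing wage_st and emptot by survey wave (0 = February, 1 = November)
table(mwl$treatment, is.na(mwl$wage_st))
table(mwl$treatment, is.na(mwl$emptot))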
Inspecting the missing data pattern
Check the missingness pattern for the dataset using md.pattern().
md.pattern(mwl)
nj kfc wendys co_owned treatment emptot wage_st
760 1 1 1 1 1 1 1 0
34 1 1 1 1 1 1 0 1
19 1 1 1 1 1 0 1 1
7 1 1 1 1 1 0 0 2
0 0 0 0 0 26 41 67
The plot and the table confirm that there are 41 observations with missing values in wage_st and 26 observations with missing values in emptot. Note that only 7 observations have both wage_st and emptot missing; the remaining missing values occur in just one of the two variables. This means that for most observations we can use the information in one variable to help impute the missing value in the other. Nonetheless, as described below, MICE can handle two (or more) incomplete variables simultaneously.
Estimating the effect with incomplete data
Effect on employment: In the difference-in-differences literature[4], we estimate the effect of the increase in minimum wage on employment by fitting the model: emptot_{it} = \beta_0 + \beta_1 nj_i + \beta_2 treatment_t + \beta_3 nj_i \times treatment_t + \epsilon_{it}. The effect is then captured by the \beta_3 parameter.
We fit the model with the original dataset. Note that in R we only need to write nj*treatment: this expands to the three regressors nj, treatment, and their interaction nj:treatment.
We fit the model using the with()
function. We will cover the use of with()
in more detail in the next section. For now, note that with()
allows us to fit the linear model using the variables in the desired data frame.
Moreover, we use the lm() function to fit the model; lm() can fit a wide variety of linear models. The summary() function is generic and can be used to obtain a summary of the fitted model.
fit <- with(mwl, lm(emptot ~ nj*treatment))
summary(fit)
Call:
lm(formula = emptot ~ nj * treatment)
Residuals:
Min 1Q Median 3Q Max
-21.166 -6.439 -1.027 4.473 64.561
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.331 1.072 21.767 <2e-16 ***
nj1 -2.892 1.194 -2.423 0.0156 *
treatment1 -2.166 1.516 -1.429 0.1535
nj1:treatment1 2.754 1.688 1.631 0.1033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.406 on 790 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.007401, Adjusted R-squared: 0.003632
F-statistic: 1.964 on 3 and 790 DF, p-value: 0.118
The estimated effect of the minimum wage increase on employment is 2.7536058, with a standard error of 1.6884091, signalling an increase in total employment in New Jersey after the policy change. The effect is not statistically significant at the 5% level. Note that we have used the original dataset with missing values, so 26 observations were dropped.
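Because this is a saturated two-by-two specification, the interaction coefficient can also be recovered directly from the group means of the complete cases. The following base-R sketch (not part of the original analysis) computes the same difference-in-differences by hand:
## Cell means of employment by state and period (rows with missing emptot are dropped)
cell_means <- aggregate(emptot ~ nj + treatment, data = mwl, FUN = mean)
cell_means
## (NJ post - NJ pre) - (PA post - PA pre) should match the nj1:treatment1 coefficient
with(cell_means,
     (emptot[nj == "1" & treatment == "1"] - emptot[nj == "1" & treatment == "0"]) -
     (emptot[nj == "0" & treatment == "1"] - emptot[nj == "0" & treatment == "0"]))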
Effect on wages: We can also estimate the effect on wages by changing the dependent variable in the regression: wage\_st_{it} = \alpha_0 + \alpha_1 nj_i + \alpha_2 treatment_t + \alpha_3 nj_i \times treatment_t + \epsilon_{it}. As before, the effect is captured by the \alpha_3 parameter.
fit <- with(mwl, lm(wage_st ~ nj*treatment))
summary(fit)
Call:
lm(formula = wage_st ~ nj * treatment)
Residuals:
Min 1Q Median 3Q Max
-0.38013 -0.11213 -0.03085 0.13787 1.63254
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.63013 0.03171 146.005 <2e-16 ***
nj1 -0.01800 0.03534 -0.509 0.611
treatment1 -0.01267 0.04563 -0.278 0.781
nj1:treatment1 0.48138 0.05065 9.503 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2765 on 775 degrees of freedom
(41 observations deleted due to missingness)
Multiple R-squared: 0.4073, Adjusted R-squared: 0.405
F-statistic: 177.5 on 3 and 775 DF, p-value: < 2.2e-16
The estimated effect is 0.4813823, and the standard error is 0.0506547. The effect is significant at the 1% level, signalling that the policy had the desired effect of increasing wages in New Jersey. Note that we have used the original dataset with missing values. Hence, we have lost 41 observations in this specification.
Single Imputation Methods
Basic mean imputation
We impute the missing data using mean imputation and re-estimate the models.
As a first attempt, we use the mice
function, specifying the method = "mean"
option. The m
option specifies the number of imputations while the maxit
option specifies the number of iterations. We use 1 imputation to obtain a single imputed data set. Moreover, we use the print = FALSE
option to avoid printing the progress of the algorithm.
<- mice(mwl, method = "mean", m = 1, maxit = 1, print = FALSE) imp
Notice that for simple mean imputation, using mice is overkill: we could have computed the mean with the mean() function and located the missing values with the is.na() function, as in the sketch below. However, we use mice to illustrate the use of the package.
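For reference, a manual version could look like this (a minimal sketch; mwl_mean is an illustrative name not used elsewhere in the vignette):
## Mean imputation by hand with base R
mwl_mean <- mwl
mwl_mean$wage_st[is.na(mwl_mean$wage_st)] <- mean(mwl$wage_st, na.rm = TRUE)
mwl_mean$emptot[is.na(mwl_mean$emptot)] <- mean(mwl$emptot, na.rm = TRUE)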
We can obtain the completed data set using the complete() function, which returns a data frame. Here we look at the first 6 observations of the imputed data set.
c1 <- complete(imp, 1)
head(c1)
nj wage_st emptot kfc wendys co_owned treatment
1 0 4.805713 40.50 0 0 0 0
2 0 4.805713 13.75 1 0 0 0
3 0 4.805713 8.50 1 0 1 0
4 0 5.000000 34.00 0 1 1 0
5 0 5.500000 24.00 0 1 1 0
6 0 5.000000 20.50 0 1 1 0
Notice that the missing values have been replaced by the mean of the variable. Hence, the first three observations share the same imputed value for wage_st, which reduces the variability of the data.
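We can make this loss of variability explicit by comparing the variance of the observed starting wages with the variance after mean imputation (a small base-R check):
## The mean-imputed column has a smaller variance than the observed values alone
var(mwl$wage_st, na.rm = TRUE)
var(c1$wage_st)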
Once we have the imputed data set, we fit the model as before using the with()
function.
fit <- with(imp, lm(emptot ~ nj*treatment))
summary(fit)
# A tibble: 4 × 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 23.3 1.04 22.3 1.09e-86 820
2 nj1 -2.82 1.16 -2.43 1.53e- 2 820
3 treatment1 -2.11 1.47 -1.43 1.52e- 1 820
4 nj1:treatment1 2.68 1.64 1.64 1.02e- 1 820
The estimated effect is 2.6810103, and the standard error is 1.638996.
One of the main drawbacks of mean imputation in a regression context is that assigning the same value to all missing observations artificially reduces the variability of the data, which produces standard errors that are too small and confidence intervals that are too narrow. Moreover, mean imputation does not take into account the uncertainty associated with the imputed values. This is reflected in the standard error of the estimated effect, which is smaller than the standard error obtained with the original data.
Imputation with stochastic regression
One way to add variability to the imputations is to use the method = "norm" option, which performs imputation by Bayesian linear regression. The idea is to regress the incomplete column on the other variables, draw the regression parameters from their posterior distribution, and use these draws to predict the missing value. Furthermore, we add a draw from the estimated distribution of the error term for extra variability. The goal is to obtain a different imputed value for each missing observation, which increases the variability of the data. Nonetheless, note that we still use only 1 imputation.
<- mice(mwl, method = "norm", m = 1, maxit = 1, print = FALSE)
imp <- complete(imp, 1)
c1 head(c1)
nj wage_st emptot kfc wendys co_owned treatment
1 0 4.759822 40.50 0 0 0 0
2 0 4.226937 13.75 1 0 0 0
3 0 4.221437 8.50 1 0 1 0
4 0 5.000000 34.00 0 1 1 0
5 0 5.500000 24.00 0 1 1 0
6 0 5.000000 20.50 0 1 1 0
Note that the imputed values for the first three observations of wage_st are now different from one another. This increases the variability of the data.
After imputing the missing values, we fit the model once again.
fit <- with(imp, lm(emptot ~ nj*treatment))
summary(fit)
# A tibble: 4 × 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 23.3 1.06 22.1 6.71e-85 820
2 nj1 -2.79 1.18 -2.37 1.81e- 2 820
3 treatment1 -2.15 1.50 -1.43 1.52e- 1 820
4 nj1:treatment1 2.73 1.66 1.64 1.02e- 1 820
The estimated effect is 2.7291743, and the standard error is 1.6648918. Note that the standard error has increased because of the variability in the imputed values. However, we have still used only 1 imputation, which ignores the uncertainty associated with the imputed values themselves. Hence, the standard error understates the uncertainty that arises from having imputed the missing values.
Multiple imputation
The motivation behind the use of multiple imputation is to incorporate the uncertainty associated with the imputed values. The idea is to impute the missing values multiple times to obtain multiple imputed data sets. We then fit the model in each imputed data set and combine the results to obtain valid estimates of the parameters of interest.
Suppose that we have m imputed data sets, Y_{miss}^{(1)},\cdots,Y_{miss}^{(m)}, each of which is used to obtain an estimate of the effect of interest, \hat{\beta}_i. The goal is to combine the point estimates and the standard errors from the different imputed data sets to obtain valid point estimates and confidence intervals.
Posterior mean
Assuming that the complete-data estimator is unbiased, \bar{\beta}_m, defined as \bar{\beta}_m = \frac{1}{m}\sum_{i=1}^m \hat{\beta}_i, is an unbiased estimator of the true effect \beta.
Hence, we can pool the different estimates to obtain \bar{\beta}_m as our point estimate for \beta. All that is left to do is to obtain the standard error of \bar{\beta}_m.
Posterior variance: Three sources of variation
By the Law of Total Variance, the posterior variance of \beta given the observed data is Var(\beta|Y_{obs}) = E\left[Var(\beta|Y_{obs},Y_{miss})\mid Y_{obs}\right] + Var\left[E(\beta|Y_{obs},Y_{miss})\mid Y_{obs}\right].
The first term is called the within-imputation variance, and the second term is the between-imputation variance.
The within-imputation variance is the expected value, given the observed data, of the complete-data variance of the estimate. It is estimated by \bar{U}_m = \frac{1}{m}\sum_{i=1}^m \hat{U}_i, where \hat{U}_i is the estimated sampling variance of \hat{\beta}_i in the i-th completed data set.
The between-imputation variance is estimated by: \bar{B}_m = \frac{1}{m-1}\sum_{i=1}^m \left(\hat{\beta}_i - \bar{\beta}_m\right)\left(\hat{\beta}_i - \bar{\beta}_m\right)', which can be interpreted as the variance of the point estimates across the different imputed data sets.
We further need to incorporate the variance due to the fact that we are estimating \bar{\beta}_m for finite m. This is called the simulation variance and it can be estimated by \bar{B}_m/m.
In summary, the total variance T_m of \bar{\beta} is decomposed as T_m = \bar{U}_m + \bar{B}_m + \frac{\bar{B}_m}{m}, where:
- \bar{U}_m, the variance given that we are taking a sample rather than observing the entire population;
- \bar{B}_m, the variance given that there are missing values in the sample;
- \bar{B}_m/m, the variance given that we use a finite amount of imputations.
Note that we must use T_m to obtain the standard error of \bar{\beta}_m and compute confidence intervals with valid coverage. Hence, we use multiple imputation to account for all the sources of variation.
The rules for combining the estimates and obtaining valid tests and confidence intervals are called Rubin’s rules. We will describe how to use Rubin’s rules in mice below.
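To make these formulas concrete, the following sketch applies them by hand to the interaction coefficient, assuming imp is a mids object with several imputations (for example the m = 5 run created in the next subsection); in practice, mice automates this pooling with its pool() function.
## Manual pooling of the nj1:treatment1 coefficient across the m imputed data sets
fit <- with(imp, lm(emptot ~ nj*treatment))
est <- sapply(fit$analyses, function(f) coef(f)["nj1:treatment1"])
u <- sapply(fit$analyses, function(f) vcov(f)["nj1:treatment1", "nj1:treatment1"])
m <- length(est)
beta_bar <- mean(est)                 ## pooled point estimate
U_bar <- mean(u)                      ## within-imputation variance
B_m <- var(est)                       ## between-imputation variance
T_m <- U_bar + B_m + B_m / m          ## total variance
c(estimate = beta_bar, std.error = sqrt(T_m))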
Multiple imputations using mice
Obtaining multiple imputations is straightforward in mice: we simply change the m option. The complete() function again returns the imputed data. Here, we extract the second imputed data set and check it with md.pattern().
<- mice(mwl, method = "norm", m = 5, maxit=5, print = FALSE)
imp <- complete(imp, 2)
c2 md.pattern(c2)
/\ /\
{ `---' }
{ O O }
==> V <== No need for mice. This data set is completely observed.
\ \|/ /
`-----'
nj wage_st emptot kfc wendys co_owned treatment
820 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0
As expected, there are no more missing values in the imputed data set: they have all been replaced by imputed values. Analogously, we can examine the other imputed data sets using complete().
How many imputations?
As shown above, the number of imputations directly influences the total variance of the estimates. As we have done before, we can modify the number of imputations by using the m
option. The question then arises as to how many imputations are needed.
The answer is that we need enough imputations to obtain stable estimates, typically m \geq 5; the default in mice is m = 5. More imputations are needed when the proportion of missing data is high. Nowadays, computational power is rarely a limitation, so m = 20 or even m = 100 can be used.
Additionally, some authors suggest repeating the analysis with different seeds. If there are large differences between runs, this may indicate that the observed data contain little information about the effects of interest.
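A rough way to carry out such a check (a sketch, assuming two runs suffice for a first impression) is to re-impute with different values of mice's seed argument and compare the completed data, or better, the final pooled estimates:
## Two imputation runs that differ only in the random seed
imp_a <- mice(mwl, method = "norm", m = 5, maxit = 10, seed = 1, print = FALSE)
imp_b <- mice(mwl, method = "norm", m = 5, maxit = 10, seed = 2, print = FALSE)
summary(complete(imp_a, 1)$wage_st)
summary(complete(imp_b, 1)$wage_st)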
Chained equations
The mice()
function uses a Markov Chain Monte Carlo (MCMC) type of algorithm to obtain the imputations. The algorithm is called chained equations because it imputes the missing values one variable at a time connecting the imputations in a chain. The algorithm is as follows:
- Start with a simple imputation method to generate a complete data set (Y_{obs},Y_{imp}^{(1)}).
- For each incomplete variable in turn, impute its missing values using the other variables of the current completed data set Y_{imp}^{(i)}. This produces new imputed values Y_{imp}^{(i+1)}.
- Repeat step 2 several times, up to maxit iterations, so that the imputations in this completed data set stabilise.
- Repeat steps 1-3 m times to obtain m imputed data sets.
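To build intuition, here is a deliberately simplified base-R sketch of this loop for the two incomplete variables in mwl. It mimics the chained structure with stochastic regression draws rather than mice's full Bayesian sampling, so it is an illustration only, not the actual mice implementation:
## Toy chained-equations pass over the two incomplete variables
imp_data <- mwl
incomplete <- c("wage_st", "emptot")
for (v in incomplete) {                       ## step 1: start from mean imputation
  imp_data[[v]][is.na(mwl[[v]])] <- mean(mwl[[v]], na.rm = TRUE)
}
for (iter in 1:5) {                           ## step 3: cycle a few times
  for (v in incomplete) {                     ## step 2: impute one variable at a time
    miss <- is.na(mwl[[v]])
    fml <- reformulate(setdiff(names(imp_data), v), response = v)
    fit_v <- lm(fml, data = imp_data[!miss, ])
    noise <- rnorm(sum(miss), 0, summary(fit_v)$sigma)
    imp_data[[v]][miss] <- predict(fit_v, newdata = imp_data[miss, ]) + noise
  }
}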
Given the MCMC nature of the method, we should always inspect the convergence of the algorithm. We can look at the trace lines generated by the algorithm to study convergence.
plot(imp)
There is one line for each imputed dataset through the iterations. The plots here show 5 iterations, the default value for maxit
. The idea is that the algorithm has converged when the trace lines are stable and do not show any trend. Here, the standard deviation for the imputed values of wage_st is perhaps not stable as it seems to be increasing.
One way to deal with the lack of convergence is to increase the number of iterations. We can do this by changing the value passed to the maxit
option. Here, we increase the number of iterations to 20 to see if the trace lines become stable.
<- mice(mwl, method = "norm", m = 5, maxit = 20, print = FALSE)
imp plot(imp)
With more iterations, the trace lines are stable and show no clear trend, suggesting that the algorithm has converged for all variables.
Diagnostics
Another important consideration regarding the imputed values is that they should be valid. That is, the imputed values should be plausible given the observed data and physical/logical restrictions.
We can use the stripplot()
function to compare the imputed values to the observed values. The function plots the imputed values, in red, together with the observed values, in blue, for each imputed data set. The first column shows only the observed data for comparison purposes. The idea is that the imputed values should be valid in the sense that they are similar to the observed ones.
stripplot(imp, pch = 20, cex = 2)
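Another standard check (assuming the same imp object) is to compare the distributions of observed and imputed values with densityplot():
## Densities of observed (blue) and imputed (red) values
densityplot(imp, ~ wage_st + emptot)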