12.1 The IV Estimator with a Single Regressor and a Single Instrument

Consider the simple regression model

\[\begin{align} Y_i = \beta_0 + \beta_1 X_i + u_i \ \ , \ \ i=1,\dots,n \tag{12.1} \end{align}\]

where the error term \(u_i\) is correlated with the regressor \(X_i\) (\(X\) is endogenous) such that OLS is inconsistent for the true \(\beta_1\). In the simplest case, IV regression uses a single instrumental variable \(Z\) to obtain a consistent estimator for \(\beta_1\).

\(Z\) must satisfy two conditions to be a valid instrument:

1. Instrument relevance condition:

\(X\) and its instrument \(Z\) must be correlated: \(\rho_{Z_i,X_i} \neq 0\).

2. Instrument exogeneity condition:

The instrument \(Z\) must not be correlated with the error term \(u\): \(\rho_{Z_i,u_i} = 0\).

The Two-Stage Least Squares Estimator

As can be guessed from its name, TSLS proceeds in two stages. In the first stage, the variation in the endogenous regressor \(X\) is decomposed into a problem-free component that is explained by the instrument \(Z\) and a problematic component that is correlated with the error \(u_i\). The second stage uses the problem-free component of the variation in \(X\) to estimate \(\beta_1\).

The first stage regression model is \[X_i = \pi_0 + \pi_1 Z_i + \nu_i,\] where \(\pi_0 + \pi_1 Z_i\) is the component of \(X_i\) that is explained by \(Z_i\) while \(\nu_i\) is the component that cannot be explained by \(Z_i\) and exhibits correlation with \(u_i\).

Using the OLS estimates \(\widehat{\pi}_0\) and \(\widehat{\pi}_1\) we obtain predicted values \(\widehat{X}_i, \ \ i=1,\dots,n\). If \(Z\) is a valid instrument, the \(\widehat{X}_i\) are problem-free in the sense that \(\widehat{X}\) is exogenous in a regression of \(Y\) on \(\widehat{X}\) which is done in the second stage regression. The second stage produces \(\widehat{\beta}_0^{TSLS}\) and \(\widehat{\beta}_1^{TSLS}\), the TSLS estimates of \(\beta_0\) and \(\beta_1\).
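The two stages can be sketched on simulated data, where the error \(u\) is observable and the decomposition can be checked directly. This is a minimal illustration under an assumed data-generating process (the coefficients, the sample size and all object names are hypothetical and not taken from the book):

```r
# simulated data: all parameter values below are assumptions for illustration
set.seed(1)
n  <- 5000
Z  <- rnorm(n)                 # instrument, drawn independently of u
u  <- rnorm(n)                 # structural error (observable only in a simulation)
nu <- 0.8 * u + rnorm(n)       # first-stage error, correlated with u by construction
X  <- 0.5 + 1.0 * Z + nu       # endogenous regressor
Y  <- 1 + 2 * X + u            # assumed true beta_0 = 1 and beta_1 = 2

# first stage: split X into the part explained by Z and the remainder
stage1 <- lm(X ~ Z)
X_hat  <- fitted(stage1)       # problem-free component
nu_hat <- residuals(stage1)    # problematic component

# the fitted values are nearly uncorrelated with u, the residuals are not
cor(X_hat, u)
cor(nu_hat, u)

# second stage: regress Y on the fitted values
stage2 <- lm(Y ~ X_hat)
coef(stage2)
```

Under these assumptions the second-stage slope should land close to the assumed value of 2, whereas the first-stage residuals carry the correlation with \(u\).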

For the case of a single instrument one can show that the TSLS estimator of \(\beta_1\) is

\[\begin{align} \widehat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{\frac{1}{n-1}\sum_{i=1}^n(Y_i - \overline{Y})(Z_i - \overline{Z})}{\frac{1}{n-1}\sum_{i=1}^n(X_i - \overline{X})(Z_i - \overline{Z})}, \tag{12.2} \end{align}\]

which is nothing but the ratio of the sample covariance between \(Z\) and \(Y\) to the sample covariance between \(Z\) and \(X\).
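A short sketch of why the second-stage slope reduces to this ratio, using only the OLS formulas for a single regressor (the notation \(s_{\widehat{X}Y}\), \(s_{\widehat{X}}^2\) and \(s_Z^2\) for the corresponding sample covariance and variances is introduced here for the derivation): since \(\widehat{X}_i = \widehat{\pi}_0 + \widehat{\pi}_1 Z_i\), we have \(s_{\widehat{X}Y} = \widehat{\pi}_1 s_{ZY}\) and \(s_{\widehat{X}}^2 = \widehat{\pi}_1^2 s_Z^2\), and the first-stage OLS slope is \(\widehat{\pi}_1 = s_{ZX}/s_Z^2\). Hence

\[\begin{align*} \widehat{\beta}_1^{TSLS} = \frac{s_{\widehat{X}Y}}{s_{\widehat{X}}^2} = \frac{\widehat{\pi}_1 s_{ZY}}{\widehat{\pi}_1^2 s_Z^2} = \frac{s_{ZY}}{\widehat{\pi}_1 s_Z^2} = \frac{s_{ZY}}{s_{ZX}}. \end{align*}\]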

As shown in Appendix 12.3 of the book, (12.2) is a consistent estimator for \(\beta_1\) in (12.1) under the assumption that \(Z\) is a valid instrument. Just as for every other OLS estimator we have considered so far, the CLT implies that the distribution of \(\widehat{\beta}_1^{TSLS}\) can be approximated by a normal distribution if the sample size is large. This allows us to use \(t\)-statistics and confidence intervals which are also computed by certain R functions. A more detailed argument on the large-sample distribution of the TSLS estimator is sketched in Appendix 12.3 of the book.
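Consistency can be illustrated with a small simulation that compares the OLS slope with the ratio in (12.2) as the sample size grows. The data-generating process, the assumed true slope of 2 and the helper simulate_once() are hypothetical choices made for this sketch only:

```r
# simulation sketch: all parameter values are assumptions for illustration
set.seed(42)
beta1 <- 2   # assumed true slope

simulate_once <- function(n) {
  Z <- rnorm(n)                  # valid instrument: relevant and independent of u
  u <- rnorm(n)                  # error term
  X <- Z + 0.8 * u + rnorm(n)    # regressor correlated with u (endogenous)
  Y <- 1 + beta1 * X + u
  c(n   = n,
    OLS = cov(X, Y) / var(X),    # OLS slope
    IV  = cov(Z, Y) / cov(Z, X)) # ratio of sample covariances as in (12.2)
}

# one draw per sample size
results <- t(sapply(c(100, 1000, 10000, 100000), simulate_once))
results
```

In this setup the IV column should approach the assumed slope as \(n\) increases, while the OLS column settles at a larger value because \(X\) and \(u\) are positively correlated by construction.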

Application to the Demand For Cigarettes

The relation between the demand for and the price of commodities is a simple yet widespread problem in economics. Health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy. Probably the most prominent example in public policy debates is smoking as it is related to many illnesses and negative externalities.

It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption. Economists use elasticities to answer this kind of question. Since the price elasticity of the demand for cigarettes is unknown, it must be estimated. As discussed in the box Who Invented Instrumental Variables Regression presented in Chapter 12.1 of the book, an OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply. Instead, IV regression can be used.

We use the data set CigarettesSW which comes with the package AER. It is a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental U.S. states for the years 1985 and 1995. Following the book, we consider data for the cross section of states in 1995 only.

We start by loading the package, attaching the data set and getting an overview.

# load the data set and get an overview
library(AER)
data("CigarettesSW")
summary(CigarettesSW)
#>      state      year         cpi          population           packs       
#>  AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27  
#>  AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45  
#>  AZ     : 2             Median :1.300   Median : 3697472   Median :110.16  
#>  CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18  
#>  CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52  
#>  CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99  
#>  (Other):84                                                                
#>      income               tax            price             taxs       
#>  Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27  
#>  1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77  
#>  Median : 61661644   Median :37.00   Median :137.72   Median : 41.05  
#>  Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33  
#>  3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48  
#>  Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63  

Use ?CigarettesSW for a detailed description of the variables.

We are interested in estimating \(\beta_1\) in

\[\begin{align} \log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + u_i, \tag{12.3} \end{align}\]

where \(Q_i^{cigarettes}\) is the number of cigarette packs per capita sold and \(P_i^{cigarettes}\) is the after-tax average real price per pack of cigarettes in state \(i\).
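To see why \(\beta_1\) in (12.3) can be read as a price elasticity, it may help to differentiate both sides with respect to the log price while holding \(u_i\) fixed (a brief sketch, with subscripts and superscripts dropped for readability):

\[\begin{align*} \beta_1 = \frac{d \log(Q)}{d \log(P)} = \frac{dQ/Q}{dP/P}, \end{align*}\]

so \(\beta_1\) approximates the percentage change in quantity associated with a one percent change in price.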

The instrumental variable we are going to use for instrumenting the endogenous regressor \(\log(P_i^{cigarettes})\) is \(SalesTax\), the portion of taxes on cigarettes arising from the general sales tax. \(SalesTax\) is measured in dollars per pack. The idea is that \(SalesTax\) is a relevant instrument as it is included in the after-tax average price per pack. Also, it is plausible that \(SalesTax\) is exogenous since the sales tax does not influence quantity sold directly but indirectly through the price.

We perform some transformations in order to obtain deflated cross section data for the year 1995.

We also compute the sample correlation between the sales tax and price per pack. The sample correlation is a consistent estimator of the population correlation. The estimate of approximately \(0.614\) indicates that \(SalesTax\) and \(P_i^{cigarettes}\) exhibit positive correlation which meets our expectations: higher sales taxes lead to higher prices. However, a correlation analysis like this is not sufficient for checking whether the instrument is relevant. We will later come back to the issue of checking whether an instrument is relevant and exogenous.

# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)

# compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)

# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)
#> [1] 0.6141228

# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")

The first stage regression is \[\log(P_i^{cigarettes}) = \pi_0 + \pi_1 SalesTax_i + \nu_i.\] We estimate this model in R using lm(). In the second stage we run a regression of \(\log(Q_i^{cigarettes})\) on \(\widehat{\log(P_i^{cigarettes})}\) to obtain \(\widehat{\beta}_0^{TSLS}\) and \(\widehat{\beta}_1^{TSLS}\).

# perform the first stage regression
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)

coeftest(cig_s1, vcov = vcovHC, type = "HC1")
#> t test of coefficients:
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
#> salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated first stage regression is \[\widehat{\log(P_i^{cigarettes})} = \underset{(0.03)}{4.62} + \underset{(0.005)}{0.031} SalesTax_i,\] which predicts a positive relation between the sales tax and the price per pack of cigarettes. How much of the observed variation in \(\log(P^{cigarettes})\) is explained by the instrument \(SalesTax\)? This can be answered by looking at the regression’s \(R^2\) which states that about \(47\%\) of the variation in after-tax prices is explained by the variation of the sales tax across states.

# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared
#> [1] 0.4709961

We next store \(\widehat{\log(P_i^{cigarettes})}\), the fitted values obtained by the first stage regression cig_s1, in the variable lcigp_pred.

# store the predicted values
lcigp_pred <- cig_s1$fitted.values

Next, we run the second stage regression which gives us the TSLS estimates we seek.

# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)
#> t test of coefficients:
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  9.71988    1.70304  5.7074 7.932e-07 ***
#> lcigp_pred  -1.08359    0.35563 -3.0469  0.003822 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus estimating the model (12.3) using TSLS yields

\[\begin{align} \widehat{\log(Q_i^{cigarettes})} = \underset{(1.70)}{9.72} - \underset{(0.36)}{1.08} \log(P_i^{cigarettes}), \tag{12.4} \end{align}\]

where we write \(\log(P_i^{cigarettes})\) instead of \(\widehat{\log(P_i^{cigarettes})}\) for consistency with the book.

The function ivreg() from the package AER carries out the TSLS procedure automatically. It is used much like lm(). Instruments are added to the usual specification of the regression formula using a vertical bar that separates the model equation from the instruments. Thus, for the regression at hand the correct formula is log(packs) ~ log(rprice) | salestax.

# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")
#> t test of coefficients:
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
#> log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that the coefficient estimates coincide for both approaches.
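As a further check, the slope can be reproduced from the ratio of sample covariances in (12.2). The sketch below repeats the data preparation so that it runs on its own; the object names beta1_ratio and beta1_ivreg are introduced here for illustration:

```r
library(AER)
data("CigarettesSW")

# same transformations as in the text
CigarettesSW$rprice   <- with(CigarettesSW, price / cpi)
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# ratio of the sample covariances s_ZY / s_ZX as in (12.2)
beta1_ratio <- with(c1995,
                    cov(salestax, log(packs)) / cov(salestax, log(rprice)))

# slope coefficient computed by ivreg()
cig_ivreg   <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)
beta1_ivreg <- unname(coef(cig_ivreg)[2])

# both should agree up to numerical precision
all.equal(beta1_ratio, beta1_ivreg)
```

With a single instrument the two numbers should coincide with the estimate of about \(-1.08\) reported above.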

Two notes on the computation of TSLS standard errors

  1. We have demonstrated that running the individual regressions for each stage of TSLS using lm() leads to the same coefficient estimates as when using ivreg(). However, the standard errors reported for the second-stage regression (e.g., by coeftest() or summary()) are invalid because they do not account for the use of predictions from the first-stage regression as regressors in the second-stage regression. Fortunately, ivreg() performs the necessary adjustment automatically. This is another advantage over manual step-by-step estimation, which we carried out above only to demonstrate the mechanics of the procedure.

  2. Just like in multiple regression it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC().
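The first note can be illustrated with simulated data. The sketch below uses the homoskedasticity-only formula for simplicity (in practice robust standard errors are preferable, as stated in the second note), and the data-generating process and all object names are assumptions made for this example. The point is that the TSLS residuals are computed from the observed \(X\), whereas the second-stage lm() fit computes its residuals from \(\widehat{X}\):

```r
# simulated data: all parameter values are assumptions for illustration
set.seed(7)
n <- 2000
Z <- rnorm(n)
u <- rnorm(n)
X <- Z + 0.8 * u + rnorm(n)    # endogenous regressor
Y <- 1 + 2 * X + u

# manual two-stage estimation
X_hat  <- fitted(lm(X ~ Z))
stage2 <- lm(Y ~ X_hat)
b      <- coef(stage2)

# naive standard error taken from the second-stage lm() output
se_naive <- summary(stage2)$coefficients["X_hat", "Std. Error"]

# TSLS residuals are based on the observed X, not on the fitted values
res_tsls <- Y - b[1] - b[2] * X
sigma2   <- sum(res_tsls^2) / (n - 2)
se_tsls  <- sqrt(sigma2 / sum((X_hat - mean(X_hat))^2))

c(naive = se_naive, adjusted = se_tsls)
```

The two numbers differ only through the residuals used to estimate the error variance. In this particular design the naive value is too large, but the direction of the discrepancy depends on the data.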

The TSLS estimate for \(\beta_1\) in (12.4) suggests that an increase in cigarette prices by one percent reduces cigarette consumption by roughly \(1.08\) percent, which is fairly elastic. However, we should keep in mind that this estimate might not be trustworthy even though we used IV estimation: there still might be a bias due to omitted variables. Thus a multiple IV regression approach is needed.
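To get a sense of the precision of this estimate, the numbers reported by ivreg() above can be turned into an approximate 95% confidence interval. This sketch relies on the large-sample normal approximation and simply copies the estimate and its HC1 standard error from the output shown earlier:

```r
# values copied from the coeftest() output for cig_ivreg
beta1_hat <- -1.08359   # TSLS estimate of the price elasticity
se_hc1    <- 0.31892    # heteroskedasticity-robust (HC1) standard error

# large-sample 95% confidence interval, roughly -1.71 to -0.46
ci <- beta1_hat + c(-1, 1) * qnorm(0.975) * se_hc1
round(ci, 2)

# approximate percentage change in packs sold for a 10% price increase
10 * beta1_hat
```

The interval is fairly wide, so the data are compatible with both inelastic and clearly elastic demand.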