12.2 The General IV Regression Model


The simple IV regression model is easily extended to a multiple regression model which we refer to as the general IV regression model. In this model we distinguish between four types of variables: the dependent variable, included exogenous variables, included endogenous variables and instrumental variables. Key Concept 12.1 summarizes the model and the common terminology. See Chapter 12.2 of the book for a more comprehensive discussion of the individual components of the general model.

Key Concept 12.1

The General Instrumental Variables Regression Model and Terminology

\[\begin{align} Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \dots + \beta_{k+r} W_{ri} + u_i, \tag{12.5} \end{align}\]

with \(i=1,\dots,n\) is the general instrumental variables regression model where

\(Y_i\) is the dependent variable,
\(\beta_0,\dots,\beta_{k+r}\) are \(1+k+r\) unknown regression coefficients,
\(X_{1i},\dots,X_{ki}\) are \(k\) endogenous regressors which are potentially correlated with \(u_i\),
\(W_{1i},\dots,W_{ri}\) are \(r\) exogenous regressors which are uncorrelated with \(u_i\),
\(u_i\) is the error term,
\(Z_{1i},\dots,Z_{mi}\) are \(m\) instrumental variables.

The coefficients are overidentified if \(m>k\). If \(m<k\), the coefficients are underidentified and when \(m=k\) they are exactly identified. For estimation of the IV regression model we require exact identification or overidentification.
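The counting rule above can be written as a small helper. This is only a minimal sketch: the function name identification_status() is hypothetical and not part of any package.

```r
# Hypothetical helper: classify identification from the number of
# instruments (m) and the number of endogenous regressors (k)
identification_status <- function(m, k) {
  stopifnot(m >= 0, k >= 1)
  if (m > k) {
    "overidentified"
  } else if (m == k) {
    "exactly identified"
  } else {
    "underidentified"
  }
}

identification_status(m = 3, k = 2)  # more instruments than endogenous regressors
identification_status(m = 1, k = 1)  # as many instruments as endogenous regressors
identification_status(m = 1, k = 2)  # too few instruments for estimation
```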

While computing both stages of TSLS individually is straightforward in a simple regression model with a single endogenous regressor such as (12.1), Key Concept 12.2 clarifies why dedicated TSLS functions such as ivreg() become more advantageous as the set of potentially endogenous regressors and instruments grows.

Estimating regression models with TSLS using multiple instruments by means of ivreg() is straightforward. There are, however, some subtleties in correctly specifying the regression formula.

Assume that you want to estimate the model \[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 W_{1i} + u_i,\] where \(X_{1i}\) and \(X_{2i}\) are endogenous regressors that shall be instrumented by \(Z_{1i}\), \(Z_{2i}\) and \(Z_{3i}\), and \(W_{1i}\) is an exogenous regressor. The corresponding data is available in a data.frame with column names y, x1, x2, w1, z1, z2 and z3. It might be tempting to specify the argument formula in your call of ivreg() as y ~ x1 + x2 + w1 | z1 + z2 + z3. This is wrong. As explained in the documentation of ivreg() (see ?ivreg), it is necessary to list all exogenous variables as instruments too, that is, to join them by +’s on the right of the vertical bar: y ~ x1 + x2 + w1 | w1 + z1 + z2 + z3 where w1 is “instrumenting itself”.

If there is a large number of exogenous variables it may be convenient to provide an update formula with a . (this includes all variables except for the dependent variable) right after the | and to exclude all endogenous variables using a -. For example, if there is one exogenous regressor w1 and one endogenous regressor x1 with instrument z1, the appropriate formula would be y ~ w1 + x1 | w1 + z1 which is equivalent to y ~ w1 + x1 | . - x1 + z1.
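The equivalence of the two formula styles can be checked on simulated data. The following is a sketch under stated assumptions: the data-generating process, the seed and the object names (dat, fit_explicit, fit_update) are made up for illustration, and the AER package is assumed to be installed.

```r
library(AER)  # provides ivreg()

# simulated data (illustrative only)
set.seed(1)
n  <- 500
w1 <- rnorm(n)                          # exogenous regressor
z1 <- rnorm(n)                          # instrument
v  <- rnorm(n)                          # source of endogeneity
x1 <- 0.5 * w1 + z1 + v                 # endogenous regressor
y  <- 1 + 2 * x1 - w1 + v + rnorm(n)    # v enters the error term
dat <- data.frame(y, x1, w1, z1)

# explicit list after the bar: w1 "instruments itself"
fit_explicit <- ivreg(y ~ w1 + x1 | w1 + z1, data = dat)

# update-style shortcut: all regressors, minus x1, plus z1
fit_update <- ivreg(y ~ w1 + x1 | . - x1 + z1, data = dat)

# compare the coefficient estimates of both specifications
all.equal(coef(fit_explicit), coef(fit_update))
```

Both calls describe the same instrument set, so the comparison should report that the coefficient vectors agree.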

Key Concept 12.2

Two-Stage Least Squares

Similarly to the simple IV regression model, the general IV model (12.5) can be estimated using the two-stage least squares estimator:

First-stage regression(s):

Run an OLS regression for each of the endogenous variables (\(X_{1i},\dots,X_{ki}\)) on all instrumental variables (\(Z_{1i},\dots,Z_{mi}\)), all exogenous variables (\(W_{1i},\dots,W_{ri}\)) and an intercept. Compute the fitted values (\(\widehat{X}_{1i},\dots,\widehat{X}_{ki}\)).

Second-stage regression:

Regress the dependent variable on the predicted values of all endogenous regressors, all exogenous variables and an intercept using OLS. This gives \(\widehat{\beta}_{0}^{TSLS},\dots,\widehat{\beta}_{k+r}^{TSLS}\), the TSLS estimates of the model coefficients.
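The two stages can be reproduced by hand with lm(). The sketch below uses simulated data with one endogenous regressor, two instruments and one exogenous regressor; all variable names and parameter values are assumptions chosen for illustration. The result is cross-checked against the matrix form of the TSLS estimator, \((X'P_ZX)^{-1}X'P_ZY\), where \(P_Z\) projects onto the instruments, the exogenous regressor and the constant.

```r
# simulated data (illustrative only)
set.seed(42)
n  <- 1000
w1 <- rnorm(n)                                  # exogenous regressor
z1 <- rnorm(n)                                  # instrument 1
z2 <- rnorm(n)                                  # instrument 2
v  <- rnorm(n)                                  # source of endogeneity
x1 <- 1 + 0.5 * z1 + 0.5 * z2 + 0.3 * w1 + v    # endogenous regressor
y  <- 2 + 1.5 * x1 - 1 * w1 + v + rnorm(n)      # v enters the error term

# first stage: regress x1 on all instruments, the exogenous
# regressor and an intercept, then store the fitted values
first_stage <- lm(x1 ~ z1 + z2 + w1)
x1_hat <- fitted(first_stage)

# second stage: regress y on the fitted values and the exogenous regressor
second_stage <- lm(y ~ x1_hat + w1)
coef(second_stage)

# cross-check: matrix form of the TSLS estimator
X <- cbind(1, x1, w1)                                  # regressors
Z <- cbind(1, z1, z2, w1)                              # instruments incl. w1 and constant
PZ_X <- Z %*% solve(crossprod(Z), crossprod(Z, X))     # projection of X onto Z
beta_tsls <- solve(crossprod(PZ_X, X), crossprod(PZ_X, y))
drop(beta_tsls)
```

Note that the standard errors reported by summary(second_stage) are not valid for inference because they ignore that x1_hat is itself estimated. This is one reason to prefer a dedicated function such as ivreg().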

In the general IV regression model, the instrument relevance and instrument exogeneity assumptions are the same as in the simple regression model with a single endogenous regressor and only one instrument. See Key Concept 12.3 for a recap using the terminology of general IV regression.

Key Concept 12.3

Two Conditions for Valid Instruments

For \(Z_{1i},\dots,Z_{mi}\) to be a set of valid instruments, the following two conditions must be fulfilled:

Instrument Relevance:

if there are \(k\) endogenous variables, \(r\) exogenous variables and \(m\geq k\) instruments \(Z\) and the \(\widehat{X}_{1i}^*,\dots,\widehat{X}_{ki}^*\) are the predicted values from the \(k\) population first stage regressions, it must hold that \[(\widehat{X}_{1i}^*,\dots,\widehat{X}_{ki}^*, W_{1i}, \dots, W_{ri},1)\] are not perfectly multicollinear. \(1\) denotes the constant regressor which equals \(1\) for all observations.

Note: If there is only one endogenous regressor \(X_i\), at least one of the \(Z\) must have a non-zero coefficient in the population regression of \(X_i\) on the \(Z\) and the \(W\) for this condition to hold: if all coefficients on the \(Z\) are zero, the \(\widehat{X}^*_i\) are a linear combination of the \(W\) and the constant (or just the mean of \(X\) if there are no \(W\)), such that there is perfect multicollinearity.

Instrument Exogeneity:

All \(m\) instruments must be uncorrelated with the error term,

\[\rho_{Z_{1i},u_i} = 0,\dots,\rho_{Z_{mi},u_i} = 0.\]
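The multicollinearity problem described in the note on instrument relevance can be made visible with a small example. This is a sketch with made-up numbers: x_star_irrelevant and x_star_relevant play the role of population first-stage predictions, once with a zero and once with a non-zero coefficient on the instrument z.

```r
# simulated data (illustrative only)
set.seed(7)
n <- 200
w <- rnorm(n)                # exogenous regressor
z <- rnorm(n)                # instrument
y <- 1 + w + rnorm(n)        # any outcome; its form does not matter here

# population first-stage predictions with a zero coefficient on z:
# they depend on w and the constant only
x_star_irrelevant <- 1 + 0.8 * w

# population first-stage predictions with a non-zero coefficient on z
x_star_relevant <- 1 + 0.8 * w + 0.5 * z

# second stage with the irrelevant instrument: perfect multicollinearity,
# lm() cannot estimate all coefficients and reports NA for one of them
fit_irrelevant <- lm(y ~ x_star_irrelevant + w)
coef(fit_irrelevant)

# second stage with the relevant instrument: all coefficients are estimable
fit_relevant <- lm(y ~ x_star_relevant + w)
coef(fit_relevant)
```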

One can show that if the IV regression assumptions presented in Key Concept 12.4 hold, the TSLS estimator of the coefficients in (12.5) is consistent and approximately normally distributed when the sample size is large. Appendix 12.3 of the book provides a proof for the special case with a single regressor, a single instrument and no exogenous variables. The reasoning carries over to the general IV model. Chapter 18 of the book provides a more thorough treatment of the general case.

For our purposes it is sufficient to bear in mind that validity of the assumptions stated in Key Concept 12.4 allows us to obtain valid statistical inference using R functions which compute \(t\)-tests, \(F\)-tests and confidence intervals for model coefficients.

Key Concept 12.4

The IV Regression Assumptions

For the general IV regression model in Key Concept 12.1 we assume the following:

\(E(u_i\vert W_{1i}, \dots, W_{ri}) = 0.\)
\((X_{1i},\dots,X_{ki},W_{1i},\dots,W_{ri},Z_{1i},\dots,Z_{mi})\) are i.i.d. draws from their joint distribution.
All variables have nonzero finite fourth moments, i.e., outliers are unlikely.
The \(Z\)s are valid instruments (see Key Concept 12.3).
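The large-sample behaviour of TSLS under these assumptions can be illustrated with a small simulation. The sketch below covers only the special case of one endogenous regressor and one instrument, where the TSLS estimator of the slope is the ratio of the sample covariances of \(Z\) with \(Y\) and of \(Z\) with \(X\). The data-generating process and the function name simulate_estimates() are assumptions made for illustration. The true slope is 2.

```r
set.seed(123)

# Hypothetical helper: draw a sample of size n and return the
# OLS and TSLS estimates of the slope coefficient
simulate_estimates <- function(n) {
  z <- rnorm(n)              # valid instrument: relevant and exogenous
  v <- rnorm(n)              # source of endogeneity
  x <- z + v                 # endogenous regressor
  u <- v + rnorm(n)          # error term, correlated with x through v
  y <- 1 + 2 * x + u         # true slope is 2
  c(n    = n,
    ols  = cov(x, y) / var(x),       # OLS slope estimate
    tsls = cov(z, y) / cov(z, x))    # TSLS slope estimate
}

# estimates for increasing sample sizes, one row per sample size
results <- t(sapply(c(100, 1000, 100000), simulate_estimates))
results
```

In this design the OLS estimator converges to 2.5 rather than 2, since the covariance of x and u is 1 and the variance of x is 2, while the TSLS estimates should settle close to the true value as n grows.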

Application to the Demand for Cigarettes

The estimated elasticity of the demand for cigarettes in (12.1) is \(-1.08\). Although (12.1) was estimated using IV regression, it is plausible that this IV estimate is biased: in this model, the TSLS estimator is inconsistent for the true \(\beta_1\) if the instrument (the real sales tax per pack) is correlated with the error term. This is likely to be the case since there are economic factors, like state income, which affect the demand for cigarettes and are correlated with the sales tax. States with high personal income tend to raise tax revenue through income taxes rather than sales taxes. Consequently, state income should be included in the regression model.

\[\begin{align} \log(Q_i^{cigarettes}) = \beta_0 + \beta_1 \log(P_i^{cigarettes}) + \beta_2 \log(income_i) + u_i \tag{12.6} \end{align}\]

Before estimating (12.6) using ivreg() we define \(income\) as real per capita income rincome and append it to the data set CigarettesSW.

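# add rincome to the dataset
# (assumes the AER package is loaded and that rprice and salestax
# were added to CigarettesSW as in Chapter 12.1)
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)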

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + 
                    salestax, data = c1995)

coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
#> log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
#> log(rincome)  0.21452    0.31175  0.6881  0.494917    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain

\[\begin{align} \widehat{\log(Q_i^{cigarettes})} = \underset{(1.26)}{9.43} - \underset{(0.37)}{1.14} \log(P_i^{cigarettes}) + \underset{(0.31)}{0.21} \log(income_i). \tag{12.7} \end{align}\]

Following the book we add the cigarette-specific taxes (\(cigtax_i\)) as a further instrumental variable and estimate again using TSLS.

# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)

c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | 
                    log(rincome) + salestax + cigtax, data = c1995)

coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
#> log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
#> log(rincome)  0.28040    0.25389  1.1044    0.2753    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the two instruments \(salestax_i\) and \(cigtax_i\) we have \(m=2\) and \(k=1\) so the coefficient on the endogenous regressor \(\log(P_i^{cigarettes})\) is overidentified. The TSLS estimate of (12.6) is

\[\begin{align} \widehat{\log(Q_i^{cigarettes})} = \underset{(0.96)}{9.89} - \underset{(0.25)}{1.28} \log(P_i^{cigarettes}) + \underset{(0.25)}{0.28} \log(income_i). \tag{12.8} \end{align}\]

Should we trust the estimates presented in (12.7) or rather rely on (12.8)? The estimates obtained using both instruments are more precise since all standard errors reported in (12.8) are smaller than those in (12.7). In fact, the standard error for the estimate of the demand elasticity is only about two thirds of the standard error obtained when the sales tax is the only instrument used. This is due to more information being used in estimating (12.8). If the instruments are valid, (12.8) can be considered more reliable.
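The comparison of precision can be verified with a few lines of arithmetic. The standard errors below are copied from the two coeftest() outputs above. The vector names are chosen here for illustration only.

```r
# robust standard errors from the model with salestax as the only instrument
se_one_instrument <- c(intercept   = 1.25939,
                       log_rprice  = 0.37230,
                       log_rincome = 0.31175)

# robust standard errors from the model with salestax and cigtax as instruments
se_two_instruments <- c(intercept   = 0.95922,
                        log_rprice  = 0.24961,
                        log_rincome = 0.25389)

# are all standard errors smaller when both instruments are used?
all(se_two_instruments < se_one_instrument)

# ratio of the standard errors, coefficient by coefficient
round(se_two_instruments / se_one_instrument, 2)
```

The ratio for the coefficient on log(rprice) is roughly 0.67, in line with the "two thirds" stated in the text.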

However, without insights regarding the validity of the instruments it is not sensible to make such a statement. This stresses why checking instrument validity is essential. Chapter 12.3 briefly discusses guidelines for checking instrument validity and presents approaches that allow us to test for instrument relevance and exogeneity under certain conditions. These are then used in an application to the demand for cigarettes in Chapter 12.4.