12.6 Exercises

1. The College Distance Data

Many studies in labor economics estimate human capital earnings functions, which describe how wage income is determined by education and work experience. A prominent example is Card (1993), who investigates the economic return to schooling and uses college proximity as an instrumental variable.

The exercises in this chapter deal with the data set CollegeDistance, which is similar to the data used by Card (1993). It stems from a survey of high school graduates with variables coded for wages, education, average tuition and a number of socio-economic measures. The data set also includes the distance from a college while the survey participants were in high school. CollegeDistance comes with the AER package.


  • Attach the AER package and load the CollegeDistance data.

  • Get an overview of the data set.

  • The variable distance (the distance to the closest 4-year college, measured in units of 10 miles) will serve as an instrument in later exercises. Use a histogram to visualize the distribution of distance.


  • Use data() to load the data set.

  • The function hist() can be used to generate histograms.
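One possible way to work through these steps is sketched below. The plot title and axis label are illustrative choices, not part of the exercise, and str() and summary() are just two of several ways to get an overview.

```r
library(AER)                 # provides the CollegeDistance data set
data("CollegeDistance")      # load the data into the global environment

str(CollegeDistance)         # variable names and types
summary(CollegeDistance)     # summary statistics for each variable

# histogram of the instrument; title and label are illustrative choices
hist(CollegeDistance$distance,
     main = "Distance to the closest 4-year college",
     xlab = "Distance (in 10 miles)")
```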

2. The Selection Problem

Regressing wage on education and control variables to estimate the human capital earnings function is problematic because education is not randomly assigned across the surveyed: individuals make their own education choices and so measured differences in earnings between individuals with different levels of education depend on how these choices are made. In the literature this is referred to as a selection problem. This selection problem implies that education is endogenous so the OLS estimate will be biased and we cannot make valid inference regarding the true coefficient.

In this exercise you are asked to estimate two regressions, neither of which yields a trustworthy estimate of the coefficient on education due to the issue sketched above. Later you will compare the results to those obtained using the instrumental variables approach applied by Card (1993).

The AER package has been attached. The data set CollegeDistance is available in your global environment.


  • Regress the logarithm of wage on education, that is, estimate the model \[\log(wage_i) = \beta_0 + \beta_1 education_i + u_i\] Save the result to wage_mod_1.

  • Augment the model by including the regressors unemp, ethnicity, gender and urban. Save the result to wage_mod_2.

  • Obtain summaries of the estimated coefficients in both models.
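A minimal sketch of the two OLS regressions, assuming the AER package is installed. The order of the regressors in the second formula is an arbitrary choice, and summary() reports homoskedasticity-only standard errors.

```r
library(AER)
data("CollegeDistance")

# simple regression of log(wage) on education
wage_mod_1 <- lm(log(wage) ~ education, data = CollegeDistance)

# augmented model with the additional control variables
wage_mod_2 <- lm(log(wage) ~ unemp + ethnicity + gender + urban + education,
                 data = CollegeDistance)

# coefficient summaries for both models
summary(wage_mod_1)
summary(wage_mod_2)
```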

3. Instrumental Variables Regression Approaches — I

The selection problem discussed above renders the regression estimates in Exercise 2 implausible. This is why Card (1993) suggests an instrumental variables regression that uses college distance as an instrument for education.

Why use college distance as an instrument? The logic behind this is that distance from a college will be correlated with the decision to pursue a college degree (relevance) but may not predict wages apart from increased education (exogeneity), so college proximity could be considered a valid instrument (recall the definition of a valid instrument stated at the beginning of Chapter 12.1).
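In symbols, writing \(u_i\) for the error term of the wage equation from Exercise 2, the two conditions can be sketched as follows (the \(\text{corr}\) notation is shorthand chosen here for illustration): \[\text{relevance: } \text{corr}(distance_i, education_i) \neq 0, \qquad \text{exogeneity: } \text{corr}(distance_i, u_i) = 0.\] Only the first condition can be checked directly from the data because \(u_i\) is unobserved.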

The AER package has been attached. The data set CollegeDistance is available in your global environment.


  • Compute the correlations of the instrument distance with the endogenous regressor education and the dependent variable wage.

  • How much of the variation in education is explained by the first-stage regression which uses distance as a regressor? Save the result to R2.

  • Repeat Exercise 2 with IV regression, i.e., employ distance as an instrument for education in both regressions using ivreg(). Save the results to wage_mod_iv1 and wage_mod_iv2. Obtain robust coefficient summaries for both models.
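The steps above could be approached as follows. This is a sketch rather than a reference solution: the object name first_stage is a hypothetical helper, the instrument list in the second ivreg() call is written out in full for clarity, and HC1 is assumed as the type of robust standard error.

```r
library(AER)
data("CollegeDistance")

# correlations of the instrument with education and with wage
cor(CollegeDistance$distance, CollegeDistance$education)
cor(CollegeDistance$distance, CollegeDistance$wage)

# first-stage regression and its R^2 (first_stage is a hypothetical name)
first_stage <- lm(education ~ distance, data = CollegeDistance)
R2 <- summary(first_stage)$r.squared

# IV regressions: regressors before "|", instruments after it;
# exogenous controls act as their own instruments
wage_mod_iv1 <- ivreg(log(wage) ~ education | distance,
                      data = CollegeDistance)
wage_mod_iv2 <- ivreg(log(wage) ~ unemp + ethnicity + gender + urban + education |
                        unemp + ethnicity + gender + urban + distance,
                      data = CollegeDistance)

# robust coefficient summaries (HC1 is an assumed choice)
coeftest(wage_mod_iv1, vcov. = vcovHC, type = "HC1")
coeftest(wage_mod_iv2, vcov. = vcovHC, type = "HC1")
```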

4. Instrumental Variables Regression Approaches — II

Convince yourself that ivreg() works as expected by implementing the TSLS algorithm presented in Key Concept 12.2 for a single instrument, see Chapter 12.2.


  • Complete the function TSLS() such that it implements the TSLS estimator.

  • Use TSLS() to reproduce the coefficient estimates obtained using ivreg() for both models of Exercise 3.


  • Completion of the function boils down to replacing the . . . by appropriate arguments.

  • Besides the data set (data), the function expects the dependent variable (Y), the exogenous regressors (W), the endogenous regressor (X) and an instrument (Z) as arguments. Y, W, X and Z should be of class character.

  • Including W = NULL in the head of the function definition ensures that the set of exogenous variables is empty, by default.
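The function skeleton of the exercise is not reproduced here, so the following is only one way such a function could look. The argument order, the helper column X_fitted and the decision to return only the coefficients are assumptions made for illustration.

```r
library(AER)
data("CollegeDistance")

# Hypothetical TSLS() sketch: all variable names are passed as character strings
TSLS <- function(Y, X, W = NULL, Z, data) {
  # first stage: regress the endogenous regressor on the instrument and controls
  fs_formula <- as.formula(paste(X, "~", paste(c(Z, W), collapse = " + ")))
  fs_model <- lm(fs_formula, data = data)

  # store the first-stage predictions (X_fitted is a hypothetical column name)
  data$X_fitted <- predict(fs_model, newdata = data)

  # second stage: regress Y on the predicted values and the controls
  ss_formula <- as.formula(paste(Y, "~", paste(c("X_fitted", W), collapse = " + ")))
  ss_model <- lm(ss_formula, data = data)

  # return the coefficients, labelling the fitted regressor with its original name
  coefs <- coef(ss_model)
  names(coefs)[names(coefs) == "X_fitted"] <- X
  coefs
}

# usage: the two models of Exercise 3
controls <- c("unemp", "ethnicity", "gender", "urban")
tsls_1 <- TSLS(Y = "log(wage)", X = "education", Z = "distance",
               data = CollegeDistance)
tsls_2 <- TSLS(Y = "log(wage)", X = "education", W = controls, Z = "distance",
               data = CollegeDistance)
tsls_1
tsls_2
```

Only the coefficient estimates of such a manual two-step procedure should be compared with ivreg(). The standard errors reported by the second-stage lm() fit are not valid because they ignore that the regressor was itself estimated in the first stage.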

5. Should we trust the Results?

This is not a real code exercise (there are no submission correctness tests for checking your code). Instead we would like you to use the widget below to compare the results obtained using the OLS regressions of Exercise 2 with those of the IV regressions of Exercise 3.

The data set CollegeDistance and all model objects from Exercises 2 and 3 are available in the global environment.


Convince yourself of the following:

  1. It is likely that the bias of the estimated coefficient on education in the simple regression model wage_mod_1 is substantial because the regressor is endogenous: the model omits variables which correlate with education and impact wage income.

  2. Due to the selection problem described in Exercise 2, the estimate of the coefficient of interest is not trustworthy even in the multiple regression model wage_mod_2, which includes several socio-economic control variables. The coefficient on education is not significant and its estimate is close to zero.

  3. Instrumenting education by college distance, as done in wage_mod_iv1, yields the IV estimate of the coefficient of interest. The result should, however, not be considered reliable because this simple model probably suffers from omitted variables bias just as the simple regression model wage_mod_1 from Exercise 2, see 1. Again, the coefficient on education is not significant and its estimate is quite small.

  4. wage_mod_iv2, the multiple regression model where we include demographic control variables and instrument education by distance, delivers the most reliable estimate of the impact of education on wage income among all the models considered. The coefficient is highly significant and the estimate is about \(0.067\). Following Key Concept 8.2, the interpretation is that an additional year of schooling is expected to increase wage income by roughly \(0.067 \cdot 100\% = 6.7\%\).

  5. Is the estimate of the coefficient on education reported by wage_mod_iv2 trustworthy? This question is not easy to answer. In any case, we should bear in mind that an instrumental variables approach is problematic when the instrument is weak or not exogenous. The latter could be the case here: families with a strong preference for education may move into neighborhoods close to colleges. Furthermore, neighborhoods close to colleges may have stronger job markets, reflected by higher incomes. Such features would render the instrument invalid, as they introduce unobserved variables which influence earnings but cannot be captured by years of schooling, our measure of education.
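Whether distance is a weak instrument can at least be probed with the data. A common rule of thumb treats a first-stage F-statistic below 10 as a sign of a weak instrument. The sketch below computes a robust version of this statistic for the specification with controls. The object names first_stage_2 and fs_test are hypothetical, and HC1 is an assumed choice of robust covariance estimator. Note that this check concerns instrument relevance only and says nothing about exogeneity.

```r
library(AER)
data("CollegeDistance")

# first stage of wage_mod_iv2: education on the instrument and the controls
first_stage_2 <- lm(education ~ distance + unemp + ethnicity + gender + urban,
                    data = CollegeDistance)

# robust F-test of the hypothesis that the coefficient on distance is zero
fs_test <- linearHypothesis(first_stage_2, "distance = 0",
                            vcov. = vcovHC(first_stage_2, type = "HC1"))
fs_test
```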


Card, D. 1993. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” National Bureau of Economic Research.