
## 12.6 Exercises

#### 1. The College Distance Data

Many studies in labor economics estimate human capital earnings functions, which describe how wage income is determined by education and work experience. A prominent example is Card (1993), who investigates the economic return to schooling and uses college proximity as an instrumental variable.

The exercises in this chapter deal with the dataset `CollegeDistance` which is similar to the data used by Card (1993). It stems from a survey of high school graduates with variables coded for wages, education, average tuition and a number of socio-economic measures. The data set also includes the distance from a college while the survey participants were in high school. `CollegeDistance` comes with the `AER` package.

**Instructions:**

- Attach the `AER` package and load the `CollegeDistance` data.
- Get an overview of the data set.
- The variable `distance` (the distance to the closest 4-year college, measured in units of 10 miles) will serve as an instrument in later exercises. Use a histogram to visualize the distribution of `distance`.

**Hints:**

- Use `data()` to attach the data set.
- The function `hist()` can be used to generate histograms.
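
The steps of this exercise can be sketched as follows. This is a minimal illustration rather than a reference solution; the plot title and axis label are our own choices, and `summary()` and `head()` are just two of several ways to get an overview.

```r
library(AER)              # provides the CollegeDistance data set
data("CollegeDistance")   # load the data into the global environment

# get an overview of the data set
summary(CollegeDistance)
head(CollegeDistance)

# histogram of the instrument (distance is measured in units of 10 miles)
hist(CollegeDistance$distance,
     main = "Distance to the Closest 4-Year College",
     xlab = "Distance (in 10 miles)")
```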

#### 2. The Selection Problem

Regressing `wage` on `education` and control variables to estimate the human capital earnings function is problematic because education is not randomly assigned across survey participants: individuals make their own education choices, so measured differences in earnings between individuals with different levels of education depend on how these choices are made. In the literature this is referred to as a *selection problem*. The selection problem implies that `education` is *endogenous*, so the OLS estimate will be biased and we cannot make valid inference regarding the true coefficient.

In this exercise you are asked to estimate two regressions, neither of which yields a trustworthy estimate of the coefficient on education due to the issue sketched above. Later you will compare the results to those obtained using the instrumental variables approach applied by Card (1993).

The `AER` package has been attached. The data set `CollegeDistance` is available in your global environment.

**Instructions:**

- Regress the *logarithm* of `wage` on `education`, that is, estimate the model \[\log(wage_i) = \beta_0 + \beta_1 education_i + u_i\] Save the result to `wage_mod_1`.
- Augment the model by including the regressors `unemp`, `ethnicity`, `gender` and `urban`. Save the result to `wage_mod_2`.
- Obtain summaries on the estimated coefficients in both models.
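
One possible way to set up both regressions is sketched below. The use of `coeftest()` with a heteroskedasticity-robust covariance estimator (`type = "HC1"`) is our assumption about what a suitable coefficient summary looks like; a plain `summary()` call would also report the estimates.

```r
library(AER)
data("CollegeDistance")

# simple regression of log(wage) on education
wage_mod_1 <- lm(log(wage) ~ education, data = CollegeDistance)

# augmented model with socio-economic control variables
wage_mod_2 <- lm(log(wage) ~ unemp + ethnicity + gender + urban + education,
                 data = CollegeDistance)

# coefficient summaries with heteroskedasticity-robust standard errors
coeftest(wage_mod_1, vcov. = vcovHC, type = "HC1")
coeftest(wage_mod_2, vcov. = vcovHC, type = "HC1")
```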

#### 3. Instrumental Variables Regression Approaches — I

The selection problem discussed above renders the regression estimates in Exercise 2 implausible, which is why Card (1993) suggests instrumental variables regression that uses college distance as an instrument for education.

Why use college distance as an instrument? The logic is that distance from a college will be correlated with the decision to pursue a college degree (relevance) but may not predict wages apart from increased education (exogeneity), so college proximity could be considered a valid instrument (recall the definition of a valid instrument stated at the beginning of Chapter 12.1).

The `AER` package has been attached. The data set `CollegeDistance` is available in your global environment.

**Instructions:**

- Compute the correlations of the instrument `distance` with the endogenous regressor `education` and the dependent variable `wage`.
- How much of the variation in `education` is explained by the *first-stage regression* which uses `distance` as a regressor? Save the result to `R2`.
- Repeat Exercise 2 with IV regression, i.e., employ `distance` as an instrument for `education` in both regressions using `ivreg()`. Save the results to `wage_mod_iv1` and `wage_mod_iv2`. Obtain robust coefficient summaries for both models.
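
A hedged sketch of these steps is given below. The object name `first_stage` is hypothetical, and we assume that "robust" refers to heteroskedasticity-robust standard errors of type `HC1`. In the `ivreg()` formula, the part after `|` lists all instruments, that is, the exogenous regressors together with `distance`.

```r
library(AER)
data("CollegeDistance")

# correlations of the instrument with education and wage
cor(CollegeDistance$distance, CollegeDistance$education)
cor(CollegeDistance$distance, CollegeDistance$wage)

# first-stage regression and its R^2
first_stage <- lm(education ~ distance, data = CollegeDistance)
R2 <- summary(first_stage)$r.squared

# IV regressions: education is instrumented by distance
wage_mod_iv1 <- ivreg(log(wage) ~ education | distance,
                      data = CollegeDistance)
wage_mod_iv2 <- ivreg(log(wage) ~ unemp + ethnicity + gender + urban + education |
                        unemp + ethnicity + gender + urban + distance,
                      data = CollegeDistance)

# robust coefficient summaries
coeftest(wage_mod_iv1, vcov. = vcovHC, type = "HC1")
coeftest(wage_mod_iv2, vcov. = vcovHC, type = "HC1")
```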

#### 4. Instrumental Variables Regression Approaches — II

Convince yourself that `ivreg()` works as expected by implementing the TSLS algorithm presented in Key Concept 12.2 for a single instrument (see Chapter 12.2).

**Instructions:**

- Complete the function `TSLS()` such that it implements the TSLS estimator.
- Use `TSLS()` to reproduce the coefficient estimates obtained using `ivreg()` for both models of Exercise 3.

**Hints:**

- Completion of the function boils down to replacing the `. . .` by appropriate arguments.
- Besides the data set (`data`), the function expects the dependent variable (`Y`), exogenous regressors (`W`), the endogenous regressor (`X`) and an instrument (`Z`) as arguments. All of these should be of class `character`.
- Including `W = NULL` in the head of the function definition ensures that the set of exogenous variables is empty by default.
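
To illustrate the two stages without reproducing the exercise template, here is a minimal sketch on simulated data. The function name `tsls_sketch()`, the helper column `X_hat` and the simulated variables are all hypothetical; only the argument names follow the hints above. The sketch assumes a single instrument and no missing values.

```r
# two-stage least squares for one endogenous regressor and one instrument
tsls_sketch <- function(data, Y, W = NULL, X, Z) {
  # first stage: regress X on the instrument Z and the exogenous regressors W
  fs_formula <- reformulate(c(W, Z), response = X)
  data$X_hat <- fitted(lm(fs_formula, data = data))

  # second stage: regress Y on the first-stage fitted values and W
  ss_formula <- reformulate(c("X_hat", W), response = Y)
  coef(lm(ss_formula, data = data))
}

# simulated data: x is endogenous because v affects both x and y
set.seed(1)
n <- 1000
z <- rnorm(n)                         # instrument
w <- rnorm(n)                         # exogenous control
v <- rnorm(n)                         # unobserved confounder
x <- 0.8 * z + 0.5 * w + v + rnorm(n)
y <- 1 + 2 * x + 0.5 * w + v + rnorm(n)
sim <- data.frame(y, x, z, w)

est_simple <- tsls_sketch(sim, Y = "y", X = "x", Z = "z")
est_full   <- tsls_sketch(sim, Y = "y", W = "w", X = "x", Z = "z")
est_simple
est_full
```

With a single instrument and no controls, the second-stage slope coincides with the ratio of the sample covariances of `z` with `y` and of `z` with `x`, which gives a quick sanity check. Note that the standard errors reported by the second-stage `lm()` would not be valid, which is one reason to prefer `ivreg()` in applications.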

#### 5. Should we trust the Results?

This is not a real code exercise (there are no submission correctness tests for checking your code). Instead we would like you to use the widget below to compare the results obtained using the OLS regressions of Exercise 2 with those of the IV regressions of Exercise 3.

The data set `CollegeDistance` and all model objects from Exercises 2 and 3 are available in the global environment.

**Instructions:**

Convince yourself of the following:

1. It is likely that the bias of the estimated coefficient on `education` in the simple regression model `wage_mod_1` is substantial because the regressor is endogenous: the model omits variables which correlate with `education` and impact wage income.

2. Due to the selection problem described in Exercise 2, the estimate of the coefficient of interest is not trustworthy even in the multiple regression model `wage_mod_2`, which includes several socio-economic control variables. The coefficient on `education` is not significant and its estimate is close to zero.

3. Instrumenting education by the college distance as done in `wage_mod_iv1` yields the IV estimate of the coefficient of interest. The result should, however, not be considered reliable because this simple model probably suffers from omitted variables bias just as the simple regression model `wage_mod_1` from Exercise 2 (see item 1). Again, the coefficient on `education` is not significant and its estimate is quite small.

4. `wage_mod_iv2`, the multiple regression model where we include demographic control variables and instrument `education` by `distance`, delivers the most reliable estimate of the impact of education on wage income among all the models considered. The coefficient is highly significant and the estimate is about \(0.067\). Following Key Concept 8.2, the interpretation is that an additional year of schooling is expected to increase wage income by roughly \(0.067 \cdot 100\% = 6.7\%\).

5. Is the estimate of the coefficient on education reported by `wage_mod_iv2` trustworthy? This question is not easy to answer. In any case, we should bear in mind that using an instrumental variables approach is problematic when the instrument is *weak* or not exogenous. The latter could be the case here: families with a strong preference for education may move into neighborhoods close to colleges. Furthermore, neighborhoods close to colleges may have stronger job markets reflected by higher incomes. Such features would render the instrument invalid as they introduce unobserved variables which influence earnings but cannot be captured by years of schooling, our measure of education.
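
Instrument relevance, unlike exogeneity, can be checked from the data. As a hedged sketch (the object names `first_stage` and `rel_test` are hypothetical), one could compute a heteroskedasticity-robust first-stage \(F\)-statistic for the hypothesis that the coefficient on `distance` is zero and compare it with the common rule of thumb of 10. With a single instrument the model is exactly identified, so instrument exogeneity cannot be tested in the same way and must be argued on economic grounds.

```r
library(AER)
data("CollegeDistance")

# first-stage regression for the model with control variables
first_stage <- lm(education ~ distance + unemp + ethnicity + gender + urban,
                  data = CollegeDistance)

# robust F-test of H0: the coefficient on distance is zero
rel_test <- linearHypothesis(first_stage, "distance = 0",
                             vcov. = vcovHC(first_stage, type = "HC1"))
rel_test
```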