8.5 Exercises

1. Correlation and (Non)linearity I

Consider the estimated simple linear regression model \[\widehat{medv_i} = 34.554 - 0.95\times lstat_i,\]

where medv (the median house value in the suburb) and lstat (the percent of households with low socioeconomic status in the suburb) are variables from the familiar Boston dataset.

The lm() object for the above model is available as mod in your working environment. The package MASS has been loaded.


  • Compute the correlation coefficient between medv and lstat and save it to corr.

  • Plot medv against lstat and add the regression line using the model object mod. What do you notice?


  • You can use cor() to compute the correlation between variables.

  • You may use plot() and abline() to visualize regression results.
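
A minimal sketch of how these hints fit together is given below. Since mod only exists in the exercise environment, the sketch re-estimates it from the model stated above; the axis labels and the line color are arbitrary choices and not part of the exercise.

```r
library(MASS)  # provides the Boston dataset

# Stand-in for the provided object `mod` (assumption: it was estimated like this)
mod <- lm(medv ~ lstat, data = Boston)

# Correlation coefficient between medv and lstat
corr <- cor(Boston$medv, Boston$lstat)

# Scatterplot of medv against lstat with the estimated regression line
plot(Boston$lstat, Boston$medv,
     xlab = "lstat", ylab = "medv", pch = 20)
abline(mod, col = "red", lwd = 2)
```

Comparing the fitted line with the point cloud, in particular at low and high values of lstat, is a useful starting point for the question posed in the exercise.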

2. Correlation and (Non)linearity II

In the previous exercise we saw an example where the correlation between the dependent variable medv and the regressor lstat is not useful for choosing the functional form of the regression, since correlation captures only the linear relationship.

As an alternative, consider the nonlinear specification

\[medv_i = \beta_0 + \beta_1\times\log(lstat_i) + u_i.\]

The package MASS has been loaded.


  • Conduct the regression from above and assign the result to log_mod.

  • Visualize your results using a scatterplot and add the regression line. In comparison to the previous exercise, what do you notice now?


  • Use lm() to conduct the regression.

  • Use plot() and abline() to visualize regression results.
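
One possible way to carry out these steps is sketched below. Plotting medv against log(lstat) rather than lstat is a choice made here so that abline() can draw the fitted model as a straight line; the labels and colors are again arbitrary.

```r
library(MASS)  # provides the Boston dataset

# Linear-log specification: medv regressed on log(lstat)
log_mod <- lm(medv ~ log(lstat), data = Boston)

# Scatterplot on the log scale of the regressor, so that the
# fitted model is a straight line and abline() applies directly
plot(log(Boston$lstat), Boston$medv,
     xlab = "log(lstat)", ylab = "medv", pch = 20)
abline(log_mod, col = "red", lwd = 2)
```

If the plot should show lstat itself on the horizontal axis, the fitted values are no longer on a straight line, and something like lines() on sorted fitted values would be needed instead of abline().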

3. The Optimal Polynomial Order — Sequential Testing

Recall the following model from the previous exercise \[medv_i = \beta_0 + \beta_1\times\log(lstat_i) + u_i.\]

We saw that this model specification seems to be a reasonable choice. However, a higher order polynomial in \(\log(lstat_i)\) may be better suited for explaining \(medv\).

The packages AER and MASS have been loaded.


  • Determine the optimal order of a polylog model using sequential testing. Use a maximum polynomial order of \(r=4\) and the significance level \(\alpha=0.05\). We would like you to use a for() loop and recommend the following approach:

    1. Estimate a model, say mod, which starts with the highest polynomial order
    2. Save the \(p\)-value (use robust standard errors) of the relevant parameter and compare it to the significance level \(\alpha\)
    3. If you cannot reject the null, repeat steps 1 and 2 for the next lowest polynomial order, otherwise stop the loop and print out the polynomial order
  • Compute the \(R^2\) of the selected model and assign it to R2.


  • The index for the for() loop should start at 4 and end at 1.

  • Using poly() in the argument formula of lm() is a generic way to incorporate higher orders of a certain variable in the model. Besides the variable, you have to specify the degree of the polynomial via the argument degree and set raw = TRUE.

  • Use coeftest() together with the argument vcov. to obtain \(p\)-values (use robust standard errors!). Use the structure of the resulting object to extract the relevant \(p\)-value.

  • An if() statement may be useful to check whether the null in step 3 can be rejected.

  • A for() loop is stopped using break.

  • Use summary() to obtain the \(R^2\). You may extract it by appending $r.squared to the function call.
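
The recommended approach could be sketched as follows. The exercise asks for robust standard errors without naming a type, so the use of type = "HC1" is an assumption made here, and the name selected_order is a hypothetical helper that does not appear in the exercise.

```r
library(AER)   # attaches lmtest (coeftest) and sandwich (vcovHC)
library(MASS)  # provides the Boston dataset

alpha <- 0.05          # significance level
selected_order <- NA   # hypothetical helper storing the chosen order

for (i in 4:1) {
  # Polylog model of order i
  mod <- lm(medv ~ poly(log(lstat), degree = i, raw = TRUE), data = Boston)

  # Robust coefficient summary (assumption: HC1 standard errors)
  ct <- coeftest(mod, vcov. = vcovHC(mod, type = "HC1"))

  # Row i + 1 belongs to the highest-order term, column 4 holds the p-value
  pval <- ct[i + 1, 4]

  # Stop as soon as the highest-order coefficient is significant
  if (pval < alpha) {
    selected_order <- i
    print(selected_order)
    break
  }
}

# R^2 of the model that was estimated last, i.e., the selected one
R2 <- summary(mod)$r.squared
```

Because the loop is left with break, mod still holds the selected model afterwards, which is why summary(mod) can be used directly.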

4. The Estimated Effect of a Unit Change

Reconsider the polylog model from the previous exercise that was selected by the sequential testing approach. As this model is logarithmic and of quadratic form, we cannot simply read off the estimated effect of a unit change (that is, one percentage point) in lstat from the coefficient summary because this effect depends on the level of lstat. We can, however, compute it manually.

The selected polylog model mod_pl is available in your working environment. The package MASS has been loaded.


Assume that we are interested in the effect on medv of an increase in lstat from \(10\%\) to \(11\%\).

  • Set up a data.frame with the relevant observations of lstat.

  • Use the new observations to predict the corresponding values of medv.

  • Compute the expected effect with the help of diff().


  • You may use predict() together with the new data to obtain the predicted values of medv. Note that the column names of the data.frame must match the names of the regressors when using predict().

  • diff() expects a vector. It computes the differences between consecutive entries of this vector.
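
A sketch of the computation is given below. The object mod_pl is only available in the exercise environment, so the sketch re-estimates a quadratic polylog model as a stand-in, following the description in the exercise text; the names new_data, Y_hat and effect are illustrative.

```r
library(MASS)  # provides the Boston dataset

# Stand-in for the provided `mod_pl` (assumption: quadratic polylog model)
mod_pl <- lm(medv ~ poly(log(lstat), degree = 2, raw = TRUE), data = Boston)

# New observations: the column name must match the regressor name lstat
new_data <- data.frame(lstat = c(10, 11))

# Predicted values of medv at lstat = 10 and lstat = 11
Y_hat <- predict(mod_pl, newdata = new_data)

# Expected effect of increasing lstat from 10% to 11%
effect <- diff(Y_hat)
effect
```

Repeating the computation for another pair of values, for example 20 and 21, illustrates that the estimated effect depends on the level of lstat.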

5. Interactions between Independent Variables I

Consider the following regression model

\[medv_i=\beta_0+\beta_1\times chas_i+\beta_2\times old_i+\beta_3\times (chas_i\cdot old_i)+u_i\]

where \(chas_i\) and \(old_i\) are dummy variables. The former takes the value \(1\) if the Charles River (a short river in the proximity of Boston) passes through suburb \(i\) and is \(0\) otherwise. The latter indicates a high proportion of old buildings and is constructed as

\[\begin{align} old_i = & \, \begin{cases} 1 & \text{if $age_i\geq 95$},\\ 0 & \text{else}, \end{cases} \end{align}\]

with \(age_i\) being the proportion of owner-occupied units built prior to 1940 in suburb \(i\).
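
As an aid for interpreting the coefficients, the model implies one conditional mean of \(medv_i\) for each of the four combinations of the two dummies. The following is a short derivation under the additional assumption, not stated in the exercise, that \(E(u_i \vert chas_i, old_i) = 0\):

\[\begin{align} E(medv_i \vert chas_i = 0, old_i = 0) = & \, \beta_0, \\ E(medv_i \vert chas_i = 1, old_i = 0) = & \, \beta_0 + \beta_1, \\ E(medv_i \vert chas_i = 0, old_i = 1) = & \, \beta_0 + \beta_2, \\ E(medv_i \vert chas_i = 1, old_i = 1) = & \, \beta_0 + \beta_1 + \beta_2 + \beta_3. \end{align}\]

Hence \(\beta_3\) is the difference between the effect of \(chas_i\) in suburbs with \(old_i = 1\) and its effect in suburbs with \(old_i = 0\).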

The packages MASS and AER have been loaded.


  • Generate and append the binary variable old to the dataset Boston.

  • Conduct the regression stated above and assign the result to mod_bb.

  • Obtain a robust coefficient summary of the model. How do you interpret the results?


  • The operator >= can be used to generate a logical vector. Transform a logical vector to the numeric type via as.numeric().

  • In lm() there are two ways to include interaction terms using the argument formula:

    1. Var1*Var2 to add Var1, Var2 and the corresponding interaction term at once

    2. Var1:Var2 to manually add the interaction term (which of course requires you to add the remaining terms manually as well)
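
The steps above might look as follows in code. As in the previous exercises, the exercise does not say which robust estimator to use, so type = "HC1" is an assumption made for this sketch.

```r
library(MASS)  # provides the Boston dataset
library(AER)   # attaches lmtest (coeftest) and sandwich (vcovHC)

# Binary variable: 1 if at least 95% of units were built prior to 1940
Boston$old <- as.numeric(Boston$age >= 95)

# chas * old expands to chas + old + chas:old
mod_bb <- lm(medv ~ chas * old, data = Boston)

# Robust coefficient summary (assumption: HC1 standard errors)
coeftest(mod_bb, vcov. = vcovHC(mod_bb, type = "HC1"))
```

The formula medv ~ chas + old + chas:old is an equivalent way of writing the same model.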

6. Interactions between Independent Variables II

Now consider the regression model

\[medv_i=\beta_0+\beta_1\times indus_i+\beta_2\times old_i+\beta_3\times (indus_i\cdot old_i)+u_i\]

with \(old_i\) defined as in the previous exercise and \(indus_i\) being the proportion of non-retail business acres in suburb \(i\).

The vector old from the previous exercise has been appended to the dataset. The package MASS has been loaded.


  • Estimate the above regression model and assign the result to mod_bc.

  • Extract the estimated coefficients of the model and assign them to params.

  • Plot medv against indus and add the regression lines for both states of the binary variable \(old\).


  • Make use of the structure of mod_bc or of the output generated by coef() to extract the estimated coefficients.

  • Apart from passing an lm() object to abline() one may also specify intercept and slope manually using the arguments a and b, respectively.
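
A possible sketch of the three steps is shown below. The variable old is rebuilt first so that the code runs on its own; the colors and labels are arbitrary choices. The two lines follow from the model: for \(old_i = 0\) the intercept is \(\beta_0\) and the slope is \(\beta_1\), while for \(old_i = 1\) the intercept is \(\beta_0 + \beta_2\) and the slope is \(\beta_1 + \beta_3\).

```r
library(MASS)  # provides the Boston dataset

# Rebuild the binary variable from the previous exercise
Boston$old <- as.numeric(Boston$age >= 95)

# Regression with an interaction between indus and old
mod_bc <- lm(medv ~ indus * old, data = Boston)

# Estimated coefficients as a named vector
params <- coef(mod_bc)

# Scatterplot of medv against indus
plot(Boston$indus, Boston$medv,
     xlab = "indus", ylab = "medv", pch = 20)

# Regression line for old = 0
abline(a = params[["(Intercept)"]],
       b = params[["indus"]],
       col = "blue", lwd = 2)

# Regression line for old = 1: intercept and slope are both shifted
abline(a = params[["(Intercept)"]] + params[["old"]],
       b = params[["indus"]] + params[["indus:old"]],
       col = "red", lwd = 2)
```

Indexing params by name rather than by position keeps the code readable and guards against mixing up the order of the coefficients.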