
## 8.5 Exercises

#### 1. Correlation and (Non)linearity I

Consider the estimated simple linear regression model \[\widehat{medv_i} = 34.554 - 0.95\times lstat_i,\]

with `medv` (the median house value in the suburb) and `lstat` (the percent of households with low socioeconomic status in the suburb) being variables from the already known `Boston` dataset.

The `lm()` object for the above model is available as `mod` in your working environment. The package `MASS` has been loaded.

**Instructions:**

- Compute the correlation coefficient between `medv` and `lstat` and save it to `corr`.
- Plot `medv` against `lstat` and add the regression line using the model object `mod`. What do you notice?

**Hints:**

- You can use `cor()` to compute the correlation between variables.
- You may use `plot()` and `abline()` to visualize regression results.
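
As a minimal sketch of how these functions fit together, the following uses the built-in `cars` dataset rather than `Boston`, so the exercise itself remains unsolved. The object names `r_cars` and `mod_cars` are illustrative assumptions, not names used elsewhere in the book.

```r
# Illustration on the built-in `cars` data (not the exercise's `Boston` data)

# correlation between stopping distance and speed
r_cars <- cor(cars$dist, cars$speed)

# simple linear regression of dist on speed
mod_cars <- lm(dist ~ speed, data = cars)

# scatterplot with the estimated regression line
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(mod_cars, col = "red")
```

In a simple linear regression the squared correlation coefficient equals the \(R^2\) of the model, so `r_cars^2` and `summary(mod_cars)$r.squared` agree.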

#### 2. Correlation and (Non)linearity II

In the previous exercise we saw an example where the correlation between the dependent variable `medv` and the regressor `lstat` is not useful for choosing the functional form of the regression, since correlation captures only the linear relationship.

As an alternative, consider the nonlinear specification

\[medv_i = \beta_0 + \beta_1\times\log(lstat_i) + u_i.\]

The package `MASS` has been loaded.

**Instructions:**

- Conduct the regression from above and assign the result to `log_mod`.
- Visualize your results using a scatterplot and add the regression line. In comparison to the previous exercise, what do you notice now?

**Hints:**

- Use `lm()` to conduct the regression.
- Use `plot()` and `abline()` to visualize regression results.
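
The following sketch shows one way to combine a logarithmic regressor with `abline()`. It again uses the built-in `cars` dataset instead of `Boston`, and the name `log_mod_cars` is an illustrative assumption.

```r
# Illustration on the built-in `cars` data (not the exercise's `Boston` data)

# regress dist on the logarithm of speed; log() can be used inside the formula
log_mod_cars <- lm(dist ~ log(speed), data = cars)

# the model is linear in log(speed), so plotting against log(speed)
# allows abline() to draw the fitted line directly
plot(log(cars$speed), cars$dist, xlab = "log(speed)", ylab = "dist")
abline(log_mod_cars, col = "red")
```

Note that `abline()` draws a straight line, so it matches the fitted model only if the horizontal axis shows the transformed regressor.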

#### 3. The Optimal Polynomial Order — Sequential Testing

Recall the following model from the previous exercise \[medv_i = \beta_0 + \beta_1\times\log(lstat_i) + u_i.\]

We saw that this model specification seems to be a reasonable choice. However, a higher order polynomial in \(\log(lstat_i)\) may be more suited for explaining \(medv\).

The packages `AER` and `MASS` have been loaded.

**Instructions:**

- Determine the optimal order of a polylog model using sequential testing. Use a maximum polynomial order of \(r=4\) and the significance level \(\alpha=0.05\). We would like you to use a `for()` loop and recommend the following approach:
  1. Estimate a model, say `mod`, which starts with the highest polynomial order.
  2. Save the \(p\)-value (use robust standard errors) of the relevant parameter and compare it to the significance level \(\alpha\).
  3. If you cannot reject the null, repeat steps 1 and 2 for the next lowest polynomial order; otherwise stop the loop and print out the polynomial order.
- Compute the \(R^2\) of the selected model and assign it to `R2`.

**Hints:**

- The index for the `for()` loop should start at 4 and end at 1.
- Using `poly()` in the argument `formula` of `lm()` is a generic way to incorporate higher orders of a certain variable in the model. Besides the variable, you have to specify the degree of the polynomial via the argument `degree` and set `raw = TRUE`.
- Use `coeftest()` together with the argument `vcov.` to obtain \(p\)-values (use robust standard errors!). Use the structure of the resulting object to extract the relevant \(p\)-value.
- An `if()` statement may be useful to check whether the null cannot be rejected in step 3.
- A `for()` loop is stopped using `break`.
- Use `summary()` to obtain the \(R^2\). You may extract it by appending `$r.squared` to the function call.
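
The building blocks named in the hints can be sketched as follows. This is an illustration on the built-in `cars` dataset with a fixed polynomial order, not a solution to the exercise. The names `mod_poly`, `ct`, `pval` and `R2_cars` are assumptions made here, and the loop only demonstrates counting down and stopping early. The packages `lmtest` and `sandwich` are attached automatically when `AER` is loaded.

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# Illustration on the built-in `cars` data (not the exercise's `Boston` data)
r <- 2  # polynomial order used for this illustration

# raw = TRUE yields plain powers (speed, speed^2) instead of orthogonal polynomials
mod_poly <- lm(dist ~ poly(speed, degree = r, raw = TRUE), data = cars)

# robust coefficient summary; the result can be indexed like a matrix
ct <- coeftest(mod_poly, vcov. = vcovHC, type = "HC1")

# row r + 1 holds the highest-order term (row 1 is the intercept),
# column 4 holds the p-values
pval <- ct[r + 1, 4]

# a for() loop that counts down and stops early via break
for (i in 4:1) {
  if (i == r) {
    print(i)
    break
  }
}

# R^2 of the fitted model
R2_cars <- summary(mod_poly)$r.squared
```

Indexing by position (`r + 1`) avoids having to type the long coefficient names that `poly()` generates.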

#### 4. The Estimated Effect of a Unit Change

Reconsider the polylog model from the previous exercise that was selected by the sequential testing approach. As this model is quadratic in the logarithm of `lstat`, we cannot simply read off the estimated effect of a unit change (that is, one percentage point) in `lstat` from the coefficient summary because this effect depends on the level of `lstat`. We may compute this manually.

The selected polylog model `mod_pl` is available in your working environment. The package `MASS` has been loaded.

**Instructions:**

Assume that we are interested in the effect on `medv` of an increase in `lstat` from \(10\%\) to \(11\%\).

- Set up a `data.frame` with the relevant observations of `lstat`.
- Use the new observations to predict the corresponding values of `medv`.
- Compute the expected effect with the help of `diff()`.

**Hints:**

- You may use `predict()` together with the new data to obtain the predicted values of `medv`. Note that the column names of the `data.frame` must match the names of the regressors when using `predict()`.
- `diff()` expects a vector. It computes the differences between consecutive entries of this vector.
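
A minimal sketch of the `predict()` and `diff()` workflow, again on the built-in `cars` dataset with a simple quadratic model rather than the polylog model of the exercise. The names `mod_q`, `new_obs` and `pred` are illustrative assumptions.

```r
# Illustration on the built-in `cars` data (not the exercise's `Boston` data)
mod_q <- lm(dist ~ speed + I(speed^2), data = cars)

# the column name must match the regressor name used in the formula
new_obs <- data.frame(speed = c(10, 11))

# predicted values of dist at speed = 10 and speed = 11
pred <- predict(mod_q, newdata = new_obs)

# difference between the two predictions: the estimated effect of the change
diff(pred)
```

Because the model is nonlinear in `speed`, repeating the computation for a different pair of values, for example `c(20, 21)`, generally gives a different effect.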

#### 5. Interactions between Independent Variables I

Consider the following regression model

\[medv_i=\beta_0+\beta_1\times chas_i+\beta_2\times old_i+\beta_3\times (chas_i\cdot old_i)+u_i\]

where \(chas_i\) and \(old_i\) are dummy variables. The former takes the value \(1\) if the Charles River (a short river in the proximity of Boston) passes through suburb \(i\) and \(0\) otherwise. The latter indicates a high proportion of old buildings and is constructed as

\[\begin{align} old_i = & \, \begin{cases} 1 & \text{if $age_i\geq 95$},\\ 0 & \text{else}, \end{cases} \end{align}\]

with \(age_i\) being the proportion of owner-occupied units built prior to 1940 in suburb \(i\).

The packages `MASS` and `AER` have been loaded.

**Instructions:**

- Generate and append the binary variable `old` to the dataset `Boston`.
- Conduct the regression stated above and assign the result to `mod_bb`.
- Obtain a robust coefficient summary of the model. How do you interpret the results?

**Hints:**

- The operator `>=` can be used to generate a logical vector. Transform a logical vector to the numeric type via `as.numeric()`.
- In `lm()` there are two ways to include interaction terms using the argument `formula`:
  - `Var1*Var2` to add `Var1`, `Var2` and the corresponding interaction term at once
  - `Var1:Var2` to manually add the interaction term (which of course requires you to add the remaining terms manually as well)
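
The two formula notations can be compared in a short sketch. It uses the built-in `mtcars` dataset instead of `Boston`. The dummy `heavy`, its cutoff of 3.5 and the model names `m_star` and `m_colon` are assumptions chosen purely for illustration.

```r
# Illustration on the built-in `mtcars` data (not the exercise's `Boston` data)
dat <- mtcars

# logical comparison converted to a 0/1 dummy; the cutoff 3.5 is arbitrary
dat$heavy <- as.numeric(dat$wt >= 3.5)

# both formulas specify the same model
m_star  <- lm(mpg ~ am * heavy, data = dat)
m_colon <- lm(mpg ~ am + heavy + am:heavy, data = dat)

# the coefficient estimates coincide
all.equal(coef(m_star), coef(m_colon))
```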

#### 6. Interactions between Independent Variables II

Now consider the regression model

\[medv_i=\beta_0+\beta_1\times indus_i+\beta_2\times old_i+\beta_3\times (indus_i\cdot old_i)+u_i\]

with \(old_i\) defined as in the previous exercise and \(indus_i\) being the proportion of non-retail business acres in suburb \(i\).

The vector `old` from the previous exercise has been appended to the dataset. The package `MASS` has been loaded.

**Instructions:**

- Estimate the above regression model and assign the result to `mod_bc`.
- Extract the estimated coefficients of the model and assign them to `params`.
- Plot `medv` against `indus` and add the regression lines for both states of the binary variable \(old\).

**Hints:**

- Make use of the structure of `mod_bc` or the output generated by `coef()` to extract the estimated coefficients.
- Apart from passing an `lm()` object to `abline()`, one may also specify intercept and slope manually using the arguments `a` and `b`, respectively.
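
How `coef()` and the arguments `a` and `b` of `abline()` work together can be sketched on the built-in `mtcars` dataset, with the dummy `am` taking the role of the binary variable. The names `mod_int` and `cf` are illustrative assumptions.

```r
# Illustration on the built-in `mtcars` data (not the exercise's `Boston` data)
mod_int <- lm(mpg ~ wt * am, data = mtcars)

# named vector with entries (Intercept), wt, am, wt:am
cf <- coef(mod_int)

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# regression line for am = 0
abline(a = cf[1], b = cf[2], col = "red")

# regression line for am = 1: intercept and slope shift by the dummy terms
abline(a = cf[1] + cf[3], b = cf[2] + cf[4], col = "blue")
```

With an interaction between a continuous regressor and a dummy, the dummy coefficient shifts the intercept and the interaction coefficient shifts the slope, which is why the second line adds both terms.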