
## 5.7 Exercises

#### 1. Testing Two Null Hypotheses Separately

Consider the estimated regression model

$\widehat{TestScore} = \underset{(23.96)}{567.43} - \underset{(0.85)}{7.15} \times STR, \, R^2 = 0.8976, \, SER=15.19$

with standard errors in parentheses.

Instructions:

• Compute the $p$-value for a $t$-test of the hypothesis that the intercept is zero against the two-sided alternative that it is non-zero. Save the result to p_int.
• Compute the $p$-value for a $t$-test of the hypothesis that the coefficient of STR is zero against the two-sided alternative that it is non-zero. Save the result to p_STR.

Hint:

Both hypotheses can be tested individually using a two-sided test. Use pnorm() to obtain cumulative probabilities for standard normally distributed outcomes.
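As an illustration of the hint, here is a minimal sketch of a two-sided $p$-value computed with pnorm(). The estimate and standard error are hypothetical values chosen for this example, not the ones from the exercise.

```r
# Hypothetical coefficient estimate and standard error (not from the exercise)
estimate <- 1.5
std_err  <- 0.6

# t-statistic for the null hypothesis that the coefficient is zero
t_stat <- (estimate - 0) / std_err

# Two-sided p-value based on the standard normal approximation:
# twice the probability mass in the tail beyond |t|
p_val <- 2 * pnorm(-abs(t_stat))
p_val
```

Using -abs(t_stat) keeps the formula valid for negative and positive $t$-statistics alike.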

#### 2. Two Null Hypotheses You Cannot Reject, Can You?

Consider again the estimated regression model

$\widehat{TestScore} = \underset{(23.96)}{567.43} - \underset{(0.85)}{7.15} \times STR, \, R^2 = 0.8976, \,SER=15.19$

Can you reject the null hypotheses discussed in the previous code exercise using individual $t$-tests at the $5\%$ significance level?

The variables t_int and t_STR are the $t$-statistics. Both are available in your working environment.

Instructions:

• Gather t_int and t_STR in a vector test and use logical operators to check whether the corresponding rejection rule applies.

Hints:

• Both tests are two-sided $t$-tests. Key Concept 5.2 recaps how a two-sided $t$-test is conducted.
• Use qnorm() to obtain standard normal critical values.
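The rejection rule can be sketched as follows. The $t$-statistics below are hypothetical placeholders, so the outcome says nothing about t_int and t_STR.

```r
# Hypothetical t-statistics (placeholders, not the exercise values)
t_vals <- c(first = 2.5, second = -1.2)

# Critical value of a two-sided test at the 5% level
crit <- qnorm(1 - 0.05 / 2)

# Reject the null whenever the absolute t-statistic exceeds the critical value
reject <- abs(t_vals) > crit
reject
```

The comparison is vectorized, so one logical expression checks both tests at once.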

#### 3. Confidence Intervals

The object mod of class lm, which contains the estimated regression model $\widehat{TestScore} = \underset{(23.96)}{567.43} - \underset{(0.85)}{7.15} \times STR, \, R^2 = 0.8976, \,SER=15.19$, is available in your working environment.

Instructions:

Compute $90\%$ confidence intervals for both coefficients.

Hint:

Use the function confint(), see ?confint. The argument level sets the confidence level to be used.
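A brief sketch of how the level argument works, using R's built-in cars data set as a stand-in because mod only exists in the exercise environment:

```r
# Stand-in model on the built-in 'cars' data (not the test score regression)
fit <- lm(dist ~ speed, data = cars)

# 90% confidence intervals for the intercept and the slope
ci_90 <- confint(fit, level = 0.90)
ci_90
```

The result is a matrix with one row per coefficient and one column for each interval bound.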

#### 4. A Confidence Interval for the Mean I

Consider the regression model $Y_i = \beta_1 + u_i$ where $Y_i \sim \mathcal{N}(\mu, \sigma^2)$. Following the discussion preceding equation (5.1), a $95\%$ confidence interval for the mean of the $Y_i$ can be computed as

$CI^{\mu}_{0.95} = \left[\hat\mu - 1.96 \times \frac{\sigma}{\sqrt{n}}; \, \hat\mu + 1.96 \times \frac{\sigma}{\sqrt{n}} \right].$

Instructions:

• Sample $n=100$ observations from a normal distribution with variance $100$ and mean $10$.
• Use the sample to estimate $\beta_1$. Save the estimate in mu_hat.
• Assume that $\sigma^2 = 100$ is known. Replace the NAs in the code below to obtain a $95\%$ confidence interval for the mean of the $Y_i$.

Hint:

Use the function confint(), see ?confint. The argument level sets the confidence level.
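The interval formula above can also be evaluated directly. The sketch below does so for hypothetical parameter values ($n=50$, $\mu=5$, $\sigma=2$) that differ from those in the exercise; the seed is arbitrary.

```r
set.seed(1)

# Hypothetical settings (the exercise uses different values)
n     <- 50
mu    <- 5
sigma <- 2   # standard deviation, treated as known

# Note: rnorm() expects the standard deviation, not the variance
y <- rnorm(n, mean = mu, sd = sigma)

# Sample mean as the estimate of the population mean
y_bar <- mean(y)

# 95% interval with known sigma: estimate +/- z * sigma / sqrt(n)
half_width <- qnorm(0.975) * sigma / sqrt(n)
ci <- c(lower = y_bar - half_width, upper = y_bar + half_width)
ci
```

Here qnorm(0.975) plays the role of the constant 1.96 in the formula.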

#### 5. A Confidence Interval for the Mean II

For historical reasons, some R functions which we use to obtain inference on model parameters, among them confint() and summary(), rely on the $t$-distribution instead of using the large-sample normal approximation. This is why for small sample sizes (and hence small degrees of freedom), $p$-values and confidence intervals reported by these functions deviate from those computed using critical values or cumulative probabilities of the standard normal distribution.

The $95\%$ confidence interval for the mean in the previous exercise is $[9.13, 13.05]$.

Instructions:

100 observations sampled from a normal distribution with $\mu=10$ and $\sigma^2=100$ have been assigned to the vector s which is available in your environment.

Set up a suitable regression model to estimate the mean of the observations in s. Then use confint() to compute a $95\%$ confidence interval for the mean.

(Check that the result is different from the interval reported above.)
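To see the difference described above on a small scale, the following sketch regresses a hypothetical sample z (not the vector s) on a constant only. It then compares the $t$-based interval from confint() with an interval that uses the normal critical value and the same estimated standard error.

```r
set.seed(42)

# Hypothetical small sample (stand-in for the vector s)
z <- rnorm(30, mean = 3, sd = 4)

# A regression on a constant only: the intercept estimate is the sample mean
mean_mod <- lm(z ~ 1)

# Interval reported by confint(), based on the t-distribution
ci_t <- confint(mean_mod, level = 0.95)

# Interval using the normal critical value and the same standard error
se   <- sqrt(diag(vcov(mean_mod)))
ci_z <- coef(mean_mod) + c(-1, 1) * qnorm(0.975) * se

ci_t
ci_z
```

With 29 degrees of freedom the $t$ critical value exceeds 1.96, so the interval from confint() is the wider one.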

#### 6. Regression on a Dummy Variable I

Chapter 5.3 discusses regression when $X$ is a dummy variable. We have used a for() loop to generate a binary variable indicating whether a school district in the CASchools data set has a student-teacher ratio below $20$. Though it is instructive to use a loop for this, there are alternative ways to achieve the same with fewer lines of code.

A data.frame DF with $100$ observations of a variable X is available in your working environment.

Instructions:

• Use ifelse() to generate a binary vector dummy indicating whether the observations in X are positive.

• Append dummy to the data.frame DF.
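A minimal sketch of the vectorized approach on a hypothetical data frame toy, with a hypothetical column name is_pos (neither is part of the exercise environment):

```r
set.seed(123)

# Hypothetical stand-in for DF
toy <- data.frame(X = rnorm(10))

# ifelse() evaluates the condition element by element: 1 if true, 0 otherwise
is_pos <- ifelse(toy$X > 0, 1, 0)

# Append the new vector as a column of the data frame
toy$is_pos <- is_pos
head(toy)
```

Assigning with $ is one of several ways to add a column; cbind() would work as well.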

#### 7. Regression on a Dummy Variable II

A data.frame DF with 100 observations on Y and the binary variable D from the previous exercise is available in your working environment.

Instructions:

• Compute the group-specific sample means of the observations in Y: save the mean of observations in Y where D == 1 to mu_Y_D1 and assign the mean of those observations with D == 0 to mu_Y_D0.

• Use lm() to regress Y on D, i.e., estimate the coefficients in the model $Y_i = \beta_0 + \beta_1 \times D_i + u_i.$

Also check that the estimates of the coefficients $\beta_0$ and $\beta_1$ reflect specific sample means. Can you tell which (no code submission needed)?
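The comparison can be tried out on simulated data first. The sketch below uses hypothetical variables y and g in a data frame toy and prints the coefficient estimates next to the group means, so that the relationship can be read off.

```r
set.seed(7)

# Hypothetical data: binary regressor g and outcome y
g   <- rbinom(200, size = 1, prob = 0.5)
y   <- 1 + 2 * g + rnorm(200)
toy <- data.frame(y = y, g = g)

# Group-specific sample means via logical subsetting
mean_g0 <- mean(toy$y[toy$g == 0])
mean_g1 <- mean(toy$y[toy$g == 1])

# Regression of y on the dummy
toy_mod <- lm(y ~ g, data = toy)

# Print both sets of numbers for comparison
coef(toy_mod)
c(mean_g0 = mean_g0, mean_g1 = mean_g1, difference = mean_g1 - mean_g0)
```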

#### 8. Regression on a Dummy Variable III

In this exercise, you have to visualize some of the results from the dummy regression model $\widehat{Y}_i = -0.66 + 1.43 \times D_i$ estimated in the previous exercise.

A data.frame DF with 100 observations on X and the binary variable dummy as well as the model object dummy_mod from the previous exercise are available in your working environment.

Instructions:

• Start by drawing a visually appealing plot of the observations on $Y$ and $D$ based on the code chunk provided in Script.R. Replace the ??? by the correct expressions!

• Add the regression line to the plot.
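The provided Script.R is not reproduced here. As a rough sketch of the two steps on hypothetical data, one could draw a scatterplot and then pass the model object to abline(); the plotting symbol, colors, and labels are arbitrary choices.

```r
set.seed(7)

# Hypothetical data: binary regressor g and outcome y
g <- rbinom(100, size = 1, prob = 0.5)
y <- 1 + 2 * g + rnorm(100)
toy_mod <- lm(y ~ g)

# Scatterplot of the outcome against the dummy
plot(g, y,
     pch  = 20,
     col  = "steelblue",
     xlab = "g",
     ylab = "y",
     main = "Dummy regression")

# abline() accepts a simple regression model and draws the fitted line
abline(toy_mod, col = "red", lwd = 2)
```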

#### 9. Gender Wage Gap I

The cross-section data set CPS1985 is a subsample from the May 1985 Current Population Survey conducted by the US Census Bureau. It contains observations on, among other things, the wage and gender of employees.

CPS1985 is part of the package AER.

Instructions:

• Attach the package AER and load the data set CPS1985.

• Estimate the dummy regression model $wage_i = \beta_0 + \beta_1 \cdot female_i + u_i$ where

\begin{align*} female_i = \begin{cases} 1, & \text{if employee} \, i \, \text{is female,} \\ 0, & \text{if employee} \, i \, \text{is male.} \end{cases} \end{align*}

Save the result in wage_mod.

#### 10. Gender Wage Gap II

The wage regression from the previous exercise yields $\widehat{wage}_i = 9.995 - 2.116 \cdot female_i.$

The model object dummy_mod is available in your working environment.

Instructions:

• Test the hypothesis that the coefficient on $female_i$ is zero against the alternative that it is non-zero. The null hypothesis implies that there is no gender wage gap. Use the heteroskedasticity-robust estimator proposed by White (1980).

Hints:

• vcovHC() computes heteroskedasticity-robust estimates of the covariance matrix of the coefficient estimators for the model supplied. The estimator proposed by White (1980) is computed if you set type = "HC0".

• The function coeftest() performs significance tests for the coefficients in model objects. A covariance matrix can be supplied using the argument vcov. (note the trailing dot in the argument name).

#### 11. Computation of Heteroskedasticity-Robust Standard Errors

In the simple regression model, the covariance matrix of the coefficient estimators is denoted

$\text{Var} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} \text{Var}(\hat\beta_0) & \text{Cov}(\hat\beta_0,\hat\beta_1) \\ \text{Cov}(\hat\beta_0,\hat\beta_1) & \text{Var}(\hat\beta_1) \end{pmatrix}.$

The function vcovHC() can be used to obtain estimates of this matrix for a model object of interest.

dummy_mod, a model object containing the wage regression dealt with in Exercises 9 and 10, is available in your working environment.

Instructions:

• Compute robust standard errors of the type HC1 for the coefficient estimators in the model object dummy_mod. Store the standard errors in a vector named rob_SEs.

Hints:

• The standard errors we seek can be obtained by taking the square root of the diagonal elements of the estimated covariance matrix.
• diag(A) returns the diagonal elements of the matrix A.
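To make the hints concrete without relying on add-on packages, the sketch below builds a robust covariance matrix by hand in base R for a hypothetical simulated regression. It assumes the usual definitions: HC0 is White's sandwich form $(X'X)^{-1} X' \text{diag}(\hat u_i^2) X (X'X)^{-1}$ and HC1 rescales it by $n/(n-k)$. In the exercise itself, vcovHC() supplies this matrix.

```r
set.seed(10)

# Hypothetical data with heteroskedastic errors
n <- 200
x <- runif(n, 1, 5)
y <- 1 + 0.5 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

X <- model.matrix(fit)   # regressor matrix including the constant
u <- residuals(fit)      # OLS residuals
k <- ncol(X)             # number of estimated coefficients

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * u)      # X' diag(u^2) X, each row of X scaled by u_i

# Sandwich estimator (HC0) and its degrees-of-freedom rescaling (HC1)
V_hc0 <- bread %*% meat %*% bread
V_hc1 <- V_hc0 * n / (n - k)

# Standard errors: square roots of the diagonal elements
se_hc1 <- sqrt(diag(V_hc1))
se_hc1
```

If the sandwich package is installed, the result should agree with the square roots of the diagonal of vcovHC(fit, type = "HC1").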

#### 12. Robust Confidence Intervals

The function confint() computes confidence intervals for regression models using homoskedasticity-only standard errors, so this function is not an option when there is heteroskedasticity.

The function Rob_CI() in Script.R is meant to compute and report heteroskedasticity-robust confidence intervals for both model coefficients in a simple regression model.

dummy_mod, a model object containing the wage regression dealt with in the previous exercises, is available in your working environment.

Instructions:

• Complete the code of Rob_CI() given in Script.R such that lower and upper bounds of $95\%$ robust confidence intervals are returned. Use standard errors of the type HC1.

• Use the function Rob_CI() to obtain $95\%$ confidence intervals for the model coefficients in dummy_mod.
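The skeleton of Rob_CI() in Script.R is not shown here, so the following is only a sketch of the general idea. It defines a hypothetical helper ci_from_vcov() that builds normal-approximation intervals from any covariance matrix passed to it. For the demonstration it is given the ordinary vcov() matrix of a stand-in model; a robust matrix such as one returned by vcovHC() could be passed in the same way.

```r
# Hypothetical helper (not the Rob_CI() from Script.R):
# confidence intervals from a model and a supplied covariance matrix
ci_from_vcov <- function(model, V, level = 0.95) {
  est <- coef(model)                  # coefficient estimates
  se  <- sqrt(diag(V))                # standard errors from the diagonal of V
  z   <- qnorm(1 - (1 - level) / 2)   # two-sided normal critical value
  cbind(lower = est - z * se, upper = est + z * se)
}

# Demonstration on the built-in 'cars' data with the default covariance matrix
fit <- lm(dist ~ speed, data = cars)
ci  <- ci_from_vcov(fit, vcov(fit))
ci
```

Separating the interval arithmetic from the choice of covariance estimator keeps the helper usable with any estimator type.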

#### 13. A Small Simulation Study — I

Consider the data generating process (DGP) \begin{align} X_i \sim& \, \mathcal{U}[2,10], \notag \\ e_i \sim& \, \mathcal{N}(0, X_i), \notag \\ Y_i =& \, \beta_1 X_i + e_i, \tag{5.4} \end{align} where $\mathcal{U}[2,10]$ denotes the uniform distribution on the interval $[2,10]$ and $\beta_1=2$.

Notice that the errors $e_i$ are heteroskedastic since the variance of the $e_i$ is a function of $X_i$.

Instructions:

• Write a function DGP_OLS that generates a sample $(X_i,Y_i)$, $i=1,\dots,100$ using the DGP above and returns the OLS estimate of $\beta_1$ based on this sample.

Hint:

runif() can be used to obtain random samples from a uniform distribution, see ?runif.
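The individual steps of one draw from the DGP might look as follows. The sketch assumes that the second argument of $\mathcal{N}(0, X_i)$ is a variance, so rnorm() receives sd = sqrt(x); the seed and object names are arbitrary. Wrapping these steps in a function is left to the exercise.

```r
set.seed(5)

n <- 100

# Regressor: uniform on [2, 10]
x <- runif(n, min = 2, max = 10)

# Heteroskedastic errors with variance x (rnorm() takes a standard deviation)
e <- rnorm(n, mean = 0, sd = sqrt(x))

# Outcome with beta_1 = 2 and no intercept
y <- 2 * x + e

# OLS without a constant, matching the model in the DGP
b1 <- coef(lm(y ~ x - 1))
b1
```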

#### 14. A Small Simulation Study — II

The function DGP_OLS() from the previous exercise is available in your working environment.

Instructions:

• Use replicate() to generate a sample of $1000$ OLS estimates $\widehat{\beta}_1$ using the function DGP_OLS. Store the estimates in a vector named estimates.

• Next, estimate the variance of $\widehat{\beta}_1$ in (5.4): compute the sample variance of the $1000$ OLS estimates in estimates. Store the result in est_var_OLS.
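How replicate() repeats a random experiment can be seen in a simpler setting. The sketch below uses a hypothetical function one_mean() that returns the mean of 20 standard normal draws, and then estimates the variance of that mean across 1000 repetitions.

```r
set.seed(99)

# Hypothetical experiment: the mean of 20 standard normal draws
one_mean <- function() mean(rnorm(20))

# replicate() evaluates the expression anew on every repetition
sim_means <- replicate(1000, one_mean())

# Sample variance of the simulated means (the theoretical value is 1/20)
sim_var <- var(sim_means)
sim_var
```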

#### 15. A Small Simulation Study — III

The Gauss-Markov theorem guarantees that the OLS estimator in linear regression models is the most efficient estimator among the conditionally unbiased linear estimators only if the errors are homoskedastic. In other words, the OLS estimator loses the BLUE property when the assumption of homoskedasticity is violated.

It turns out that OLS applied to the weighted observations $(w_i X_i, w_i Y_i)$, where $w_i=\frac{1}{\sigma_i}$ and $\sigma_i$ is the standard deviation of the $i$-th error term, is BLUE under heteroskedasticity. This estimator is called the weighted least squares (WLS) estimator. Thus, when there is heteroskedasticity, the WLS estimator has lower variance than OLS.

The function DGP_OLS() and the estimated variance est_var_OLS from the previous exercises are available in your working environment.

Instructions:

• Write a function DGP_WLS() that generates a sample of $100$ observations using the DGP introduced in Exercise 13 and returns the WLS estimate of $\beta_1$. Treat $\sigma_i$ as known, i.e., set $w_i=\frac{1}{\sqrt{X_i}}$.

• Repeat exercise 14 using DGP_WLS(). Store the variance estimate in est_var_GLS.

• Compare the estimated variances est_var_OLS and est_var_GLS using logical operators (< or >).

Hints:

• DGP_WLS() can be obtained using a modified code of DGP_OLS().

• Remember that functions are objects and you may print the code of a function to the console.
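For a single simulated sample, the weighting step can be sketched in two equivalent ways: OLS on the transformed observations $(w_i X_i, w_i Y_i)$, or lm() with its weights argument set to $w_i^2 = 1/X_i$. The sample below is hypothetical and the seed arbitrary; turning this into DGP_WLS() is left to the exercise.

```r
set.seed(5)

# One hypothetical sample from the DGP in (5.4)
n <- 100
x <- runif(n, 2, 10)
y <- 2 * x + rnorm(n, sd = sqrt(x))

# Weights w_i = 1 / sigma_i with sigma_i = sqrt(x_i)
w <- 1 / sqrt(x)

# Variant 1: OLS on the weighted observations, no constant
b_transformed <- coef(lm(I(w * y) ~ I(w * x) - 1))

# Variant 2: lm() minimizes the weighted sum of squared residuals,
# so its 'weights' argument takes the squared weights w^2 = 1 / x
b_weighted <- coef(lm(y ~ x - 1, weights = 1 / x))

c(transformed = unname(b_transformed), weighted = unname(b_weighted))
```

Both variants return the same estimate, which makes either one a candidate for the body of a simulation function.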

### References

White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48(4), pp. 817–838.