7.2 An Application to Test Scores and the Student-Teacher Ratio

Let us take a look at the regression from Section 6.3 again.

Computing confidence intervals for individual coefficients in the multiple regression model proceeds as in the simple regression model using the function confint().

model <- lm(score ~ STR + english, data = CASchools)
confint(model)
#>                   2.5 %      97.5 %
#> (Intercept) 671.4640580 700.6004311
#> STR          -1.8487969  -0.3537944
#> english      -0.7271113  -0.5724424
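As a sanity check, the intervals reported by confint() can be reproduced by hand as estimate ± critical value × standard error, where the critical value comes from the \(t\) distribution with the residual degrees of freedom of the model. The following is a minimal sketch. It assumes that CASchools is loaded from the AER package and that STR and score are constructed as in the earlier chapters; the object names est, se, crit and manual_ci are ours and purely illustrative.

```r
library(AER)
data(CASchools)

# assumed variable definitions from the earlier chapters
CASchools$STR   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

model <- lm(score ~ STR + english, data = CASchools)

# ingredients of the interval: estimates, homoskedasticity-only SEs, t critical value
est  <- coef(model)
se   <- sqrt(diag(vcov(model)))
crit <- qt(0.975, df = df.residual(model))

# manual 95% confidence intervals
manual_ci <- cbind("2.5 %"  = est - crit * se,
                   "97.5 %" = est + crit * se)

# compare with the built-in result (should agree up to numerical precision)
all.equal(manual_ci, confint(model))
```

Replacing 0.975 by 0.95 in the call of qt() gives the \(90\%\) intervals in the same way.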

To obtain confidence intervals at another level, say \(90\%\), just set the argument level in our call of confint() accordingly.

confint(model, level = 0.9)
#>                     5 %        95 %
#> (Intercept) 673.8145793 698.2499098
#> STR          -1.7281904  -0.4744009
#> english      -0.7146336  -0.5849200

The output now reports the desired \(90\%\) confidence intervals for all coefficients.

One drawback of confint() is that it does not use robust standard errors when computing the confidence interval. Large-sample confidence intervals based on robust standard errors are easy to compute manually, as follows.

# compute robust standard errors (vcovHC() comes from the sandwich package)
rob_se <- sqrt(diag(vcovHC(model, type = "HC1")))

# compute robust 95% confidence intervals
rbind("lower" = coef(model) - qnorm(0.975) * rob_se,
      "upper" = coef(model) + qnorm(0.975) * rob_se)
#>       (Intercept)        STR    english
#> lower    668.9252 -1.9496606 -0.7105980
#> upper    703.1393 -0.2529307 -0.5889557

# compute robust 90% confidence intervals
rbind("lower" = coef(model) - qnorm(0.95) * rob_se,
      "upper" = coef(model) + qnorm(0.95) * rob_se)
#>       (Intercept)        STR    english
#> lower    671.6756 -1.8132659 -0.7008195
#> upper    700.3889 -0.3893254 -0.5987341
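Since the same computation is needed for every confidence level, it may be convenient to wrap it in a small function. The sketch below is one possible way to do so; robust_confint() is a hypothetical helper of our own, not a function from any package, and the setup lines assume that CASchools, STR and score are prepared as in the earlier chapters.

```r
library(AER)  # also attaches sandwich, which provides vcovHC()
data(CASchools)

# assumed variable definitions from the earlier chapters
CASchools$STR   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

model <- lm(score ~ STR + english, data = CASchools)

# hypothetical helper: large-sample confidence intervals based on
# HC1 robust standard errors and normal critical values
robust_confint <- function(fit, level = 0.95) {
  rob_se <- sqrt(diag(vcovHC(fit, type = "HC1")))
  z <- qnorm(1 - (1 - level) / 2)
  cbind(lower = coef(fit) - z * rob_se,
        upper = coef(fit) + z * rob_se)
}

robust_confint(model)                # robust 95% intervals
robust_confint(model, level = 0.90)  # robust 90% intervals
```

Up to the layout (coefficients in rows rather than columns), the two calls should reproduce the bounds computed manually above.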

Knowing how to use R to conduct inference about the coefficients in multiple regression models, you can now answer the following question:

Can the null hypothesis that a change in the student-teacher ratio, STR, has no influence on test scores, score, be rejected at the \(10\%\) and the \(5\%\) level of significance if we control for the percentage of students learning English in the district, english?

The output above shows that zero is not contained in the confidence interval for the coefficient on STR at either confidence level, so we can reject the null hypothesis at the \(5\%\) and \(10\%\) significance levels. The same conclusion can be reached via the homoskedasticity-only \(p\)-value for STR reported by summary(model): \(0.00398 < 0.05 = \alpha\).

Note that rejection at the \(5\%\) level implies rejection at the \(10\%\) level (why?).
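The test decision can also be checked with a heteroskedasticity-robust \(p\)-value. The following sketch computes the robust \(t\)-statistic for STR and the corresponding two-sided \(p\)-value from the standard normal distribution. It assumes the same data preparation as in the earlier chapters; the names t_STR and p_STR are ours.

```r
library(AER)
data(CASchools)

# assumed variable definitions from the earlier chapters
CASchools$STR   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

model <- lm(score ~ STR + english, data = CASchools)

# robust (HC1) standard errors
rob_se <- sqrt(diag(vcovHC(model, type = "HC1")))

# robust t-statistic for H0: coefficient on STR equals zero
t_STR <- unname(coef(model)["STR"] / rob_se["STR"])

# two-sided p-value based on the large-sample normal approximation
p_STR <- 2 * pnorm(-abs(t_STR))

c(t = t_STR, p = p_STR)

# compare the p-value with both significance levels
p_STR < c("5%" = 0.05, "10%" = 0.10)
```

Because the robust \(95\%\) interval computed above excludes zero and uses the same normal critical values, the robust \(p\)-value must lie below \(0.05\), and hence also below \(0.10\).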

Recall from Chapter 5.2 that the \(95\%\) confidence interval computed above does not tell us that the effect of a one-unit increase in the student-teacher ratio on test scores lies, with \(95\%\) probability, in the interval with a lower bound of \(-1.9497\) and an upper bound of \(-0.2529\). Once a confidence interval has been computed, a probabilistic statement like this is wrong: either the interval contains the true parameter or it does not. We do not know which is true.
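The \(95\%\) refers to the procedure, not to a single computed interval: across many repeated samples, about \(95\%\) of the intervals constructed this way contain the true parameter. A small simulation can illustrate this. The sketch below uses artificial data with a known slope; the data generating process and all object names (beta_true, n, reps, covered) are assumptions made purely for illustration and are unrelated to CASchools.

```r
set.seed(1)

beta_true <- -1    # true slope in the simulated population (assumed)
n         <- 100   # sample size per replication
reps      <- 1000  # number of simulated samples

# for each simulated sample, check whether the 95% CI contains beta_true
covered <- replicate(reps, {
  x  <- rnorm(n)
  y  <- 2 + beta_true * x + rnorm(n)
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= beta_true && beta_true <= ci[2]
})

# share of intervals that contain the true slope
mean(covered)
```

The reported share should be close to \(0.95\). Each individual interval, however, either contains the true slope or it does not.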

Another Augmentation of the Model

What is the average effect on test scores of reducing the student-teacher ratio when expenditure per pupil and the percentage of English-learning pupils are held constant?

Let us augment our model by an additional regressor that measures expenditure per pupil. Using ?CASchools we find that CASchools contains the variable expenditure, which provides expenditure per student.

Our model now is \[ TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times english + \beta_3 \times expenditure + u \]

with \(expenditure\) being the total amount of expenditure per pupil in the district (thousands of dollars).

Let us now estimate the model:

# scale expenditure to thousands of dollars
# (run this line only once: the column is overwritten in place)
CASchools$expenditure <- CASchools$expenditure/1000

# estimate the model
model <- lm(score ~ STR + english + expenditure, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>               Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 649.577947  15.458344  42.0212  < 2e-16 ***
#> STR          -0.286399   0.482073  -0.5941  0.55277    
#> english      -0.656023   0.031784 -20.6398  < 2e-16 ***
#> expenditure   3.867901   1.580722   2.4469  0.01482 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated effect of a one-unit change in the student-teacher ratio on test scores, with expenditure and the share of English-learning pupils held constant, is \(-0.29\), which is rather small. What is more, the coefficient on \(STR\) is no longer significantly different from zero, even at the \(10\%\) level, since \(p\text{-value}=0.55\). Can you come up with an interpretation for these findings (see Chapter 7.1 of the book)?

The insignificance of \(\hat\beta_1\) could be due to a larger standard error of \(\hat{\beta}_1\) resulting from adding \(expenditure\) to the model, so that we estimate the coefficient on \(STR\) less precisely. This illustrates the issue of strongly correlated regressors (imperfect multicollinearity). The correlation between \(STR\) and \(expenditure\) can be computed using cor().

# compute the sample correlation between 'STR' and 'expenditure'
cor(CASchools$STR, CASchools$expenditure)
#> [1] -0.6199822
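To see whether the standard error of the coefficient on \(STR\) actually increases once \(expenditure\) is included, the robust standard errors from both specifications can be placed side by side. The following is a sketch under the same data preparation assumptions as in the earlier chapters; the object names small, large and se_STR are ours.

```r
library(AER)
data(CASchools)

# assumed variable definitions from the earlier chapters
CASchools$STR   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

# expenditure is left in dollars here: rescaling another regressor
# does not change the standard error of the coefficient on STR
small <- lm(score ~ STR + english, data = CASchools)
large <- lm(score ~ STR + english + expenditure, data = CASchools)

# robust (HC1) standard errors of the coefficient on STR in both models
se_STR <- c(
  without_expenditure = sqrt(diag(vcovHC(small, type = "HC1")))[["STR"]],
  with_expenditure    = sqrt(diag(vcovHC(large, type = "HC1")))[["STR"]]
)
se_STR
```

The second entry should match the standard error of about \(0.48\) reported in the coefficient table above, and it exceeds the first entry, although the drop in the estimate itself (from roughly \(-1.10\) to \(-0.29\)) also contributes to the loss of significance.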

Altogether, we conclude that the new model provides no evidence that changing the student-teacher ratio, e.g., by hiring new teachers, has any effect on test scores when expenditure per student and the share of English learners are held constant.