**Open Review**. We want your feedback to make the book better for you and other students. You may annotate some text by selecting it with the cursor and then click the on the pop-up menu. You can also see the annotations of others: click the in the upper right hand corner of the page

## 13.3 Experimental Estimates of the Effect of Class Size Reductions

### Experimental Design and the Data Set

The Project *Student-Teacher Achievement Ratio* (STAR) was a large randomized controlled experiment with the aim of asserting whether a class size reduction is effective in improving education outcomes. It has been conducted in 80 Tennessee elementary schools over a period of four years during the 1980s by the State Department of Education.

In the first year, about 6400 students were randomly assigned into one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher’s aide). Teachers were also randomly assigned to the classes they taught. The interventions were initiated as the students entered school in kindergarten and continued through to third grade. Control and treatment groups across grades are summarized in Table 13.1.

K | 1 | 2 | 3 | |
---|---|---|---|---|

Treatment 1 | Small class | Small class | Small class | Small class |

Treatment 2 | Regular class + aide | Regular class + aide | Regular class + aide | Regular class + aide |

Control | Regular class | Regular class | Regular class | Regular class |

Each year, the students’ learning progress was assessed using the sum of the points scored on the math and reading parts of a standardized test (the Stanford Achievement Test).

The STAR data set is part of the package `AER`.

```
# load the package AER and the STAR dataset
library(AER)
data(STAR)
```

`head(STAR)` shows that there is a variety of factor variables that describe student and teacher characteristics as well as various school indicators, all of which are separately recorded for the four different grades. The data is in *wide format*. That is, each variable has its own column and for each student, the rows contain observations on these variables. Using `dim(STAR)` we find that there are a total of 11598 observations on 47 variables.

```
# get an overview
head(STAR, 2)
#> gender ethnicity birth stark star1 star2 star3 readk read1 read2 read3 mathk math1 math2 math3 lunchk lunch1
#> 1122 female afam 1979 Q3 <NA> <NA> <NA> regular NA NA NA 580 NA NA NA 564 <NA> <NA>
#> 1137 female cauc 1980 Q1 small small small small 447 507 568 587 473 538 579 593 non-free free
#> lunch2 lunch3 schoolk school1 school2 school3 degreek degree1 degree2 degree3 ladderk ladder1 ladder2
#> 1122 <NA> free <NA> <NA> <NA> suburban <NA> <NA> <NA> bachelor <NA> <NA> <NA>
#> 1137 non-free free rural rural rural rural bachelor bachelor bachelor bachelor level1 level1 apprentice
#> ladder3 experiencek experience1 experience2 experience3 tethnicityk tethnicity1 tethnicity2 tethnicity3 systemk
#> 1122 level1 NA NA NA 30 <NA> <NA> <NA> cauc <NA>
#> 1137 apprentice 7 7 3 1 cauc cauc cauc cauc 30
#> system1 system2 system3 schoolidk schoolid1 schoolid2 schoolid3
#> 1122 <NA> <NA> 22 <NA> <NA> <NA> 54
#> 1137 30 30 30 63 63 63 63
dim(STAR)
#> [1] 11598 47
```

```
# get variable names
names(STAR)
#> [1] "gender" "ethnicity" "birth" "stark" "star1" "star2" "star3" "readk"
#> [9] "read1" "read2" "read3" "mathk" "math1" "math2" "math3" "lunchk"
#> [17] "lunch1" "lunch2" "lunch3" "schoolk" "school1" "school2" "school3" "degreek"
#> [25] "degree1" "degree2" "degree3" "ladderk" "ladder1" "ladder2" "ladder3" "experiencek"
#> [33] "experience1" "experience2" "experience3" "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"
#> [41] "system1" "system2" "system3" "schoolidk" "schoolid1" "schoolid2" "schoolid3"
```

A majority of the variable names contain a suffix (`k`, `1`, `2` or `3`) stating the grade which the respective variable is referring to. This facilitates regression analysis because it allows to adjust the `formula` argument in `lm()` for each grade by simply changing the variables’ suffixes accordingly.

The outcome produced by `head()` shows that some values recorded are `NA` and , i.e., there is no data on this variable for the student under consideration. This lies in the nature of the data: for example, take the first observation

`STAR[1,]`

.In the output of `head(STAR, 2)`

we find that the student entered the experiment in third grade in a regular class, which is why the class size is recorded in `star3` and the other class type indicator variables are . For the same reason there are no recordings of her math and reading score but for the third grade. It is straightforward to only get her non-

`NA`/

`recordings: simply drop the`

`NA`s using

`!is.na()`.

```
# drop NA recordings for the first observation and print to the console
1, !is.na(STAR[1, ])]
STAR[#> gender ethnicity birth star3 read3 math3 lunch3 school3 degree3 ladder3 experience3 tethnicity3 system3 schoolid3
#> 1122 female afam 1979 Q3 regular 580 564 free suburban bachelor level1 30 cauc 22 54
```

`is.na(STAR[1, ])`

returns a logical vector with `TRUE` at positions that correspond to entries for the first observation. The

`!`operator is used to invert the result such that we obtain only non-

`entries for the first observations.`

In general it is not necessary to remove rows with missing data because `lm()` does so by default. Missing data may imply a small sample size and thus may lead to imprecise estimation and wrong inference This is, however, not an issue for the study at hand since, as we will see below, sample sizes lie beyond 5000 observations for each regression conducted.

### Analysis of the STAR Data

As can be seen from Table 13.1 there are two treatment groups in each grade, small classes with only 13 to 17 students and regular classes with 22 to 25 students and a teaching aide. Thus, two binary variables, each being an indicator for the respective treatment group, are introduced for the differences estimator to capture the treatment effect for each treatment group separately. This yields the population regression model \[\begin{align} Y_i = \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.3} \end{align}\] with test score \(Y_i\), the small class indicator \(SmallClass_i\) and \(RegAide_i\), the indicator for a regular class with aide.

We reproduce the results presented in Table 13.1 of the book by performing the regression (13.3) for each grade separately. For each student, the dependent variable is simply the sum of the points scored in the math and reading parts, constructed using `I()`.

```
# compute differences Estimates for each grades
<- lm(I(readk + mathk) ~ stark, data = STAR)
fmk <- lm(I(read1 + math1) ~ star1, data = STAR)
fm1 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm2 <- lm(I(read3 + math3) ~ star3, data = STAR) fm3
```

```
# obtain coefficient matrix using robust standard errors
coeftest(fmk, vcov = vcovHC, type= "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 918.04289 1.63339 562.0473 < 2.2e-16 ***
#> starksmall 13.89899 2.45409 5.6636 1.554e-08 ***
#> starkregular+aide 0.31394 2.27098 0.1382 0.8901
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm1, vcov = vcovHC, type= "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1039.3926 1.7846 582.4321 < 2.2e-16 ***
#> star1small 29.7808 2.8311 10.5190 < 2.2e-16 ***
#> star1regular+aide 11.9587 2.6520 4.5093 6.62e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm2, vcov = vcovHC, type= "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1157.8066 1.8151 637.8820 < 2.2e-16 ***
#> star2small 19.3944 2.7117 7.1522 9.55e-13 ***
#> star2regular+aide 3.4791 2.5447 1.3672 0.1716
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm3, vcov = vcovHC, type= "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1228.50636 1.68001 731.2483 < 2.2e-16 ***
#> star3small 15.58660 2.39604 6.5051 8.393e-11 ***
#> star3regular+aide -0.29094 2.27271 -0.1280 0.8981
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We gather the results and present them in a table using `stargazer()`.

```
# compute robust standard errors for each model and gather them in a list
<- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
rob_se_1 sqrt(diag(vcovHC(fm1, type = "HC1"))),
sqrt(diag(vcovHC(fm2, type = "HC1"))),
sqrt(diag(vcovHC(fm2, type = "HC1"))))
```

```
library(stargazer)
stargazer(fmk,fm1,fm2,fm3,
title = "Project STAR: Differences Estimates",
header = FALSE,
type = "latex",
model.numbers = F,
omit.table.layout = "n",
digits = 3,
column.labels = c("K", "1", "2", "3"),
dep.var.caption = "Dependent Variable: Grade",
dep.var.labels.include = FALSE,
se = rob_se_1)
```

Dependent Variable: Grade | ||||

K | 1 | 2 | 3 | |

starksmall | 13.90^{***} | |||

(2.45) | ||||

starkregular+aide | 0.31 | |||

(2.27) | ||||

star1small | 29.78^{***} | |||

(2.83) | ||||

star1regular+aide | 11.96^{***} | |||

(2.65) | ||||

star2small | 19.39^{***} | |||

(2.71) | ||||

star2regular+aide | 3.48 | |||

(2.54) | ||||

star3small | 15.59 | |||

star3regular+aide | -0.29 | |||

Constant | 918.04^{***} | 1,039.39^{***} | 1,157.81^{***} | 1,228.51^{***} |

(1.63) | (1.78) | (1.82) | (1.82) | |

Observations | 5,786 | 6,379 | 6,049 | 5,967 |

R^{2} | 0.01 | 0.02 | 0.01 | 0.01 |

Adjusted R^{2} | 0.01 | 0.02 | 0.01 | 0.01 |

Residual Std. Error | 73.49 (df = 5783) | 90.50 (df = 6376) | 83.69 (df = 6046) | 72.91 (df = 5964) |

F Statistic | 21.26^{***} (df = 2; 5783) | 56.34^{***} (df = 2; 6376) | 28.71^{***} (df = 2; 6046) | 30.25^{***} (df = 2; 5964) |

Table 13.2: Project STAR - Differences Estimates

The estimates presented in Table 13.2 suggest that the class size reduction improves student performance. Except for grade 1, the estimates of the coefficient on \(SmallClass\) are roughly of the same magnitude (the estimates lie between 13.90 and 19.39 points) and they are statistically significant at \(1\%\). Furthermore, a teaching aide has little, possibly zero, effect on the performance of the students.

Following the book, we augment the regression model (13.3) by different sets of regressors for two reasons:

- If the additional regressors explain some of the observed variation in the dependent variable, we obtain more efficient estimates of the coefficients of interest.
- If the treatment is not received at random due to failures to follow the treatment protocol (see Chapter 13.3 of the book), the estimates obtained using (13.3) may be biased. Adding additional regressors may solve or mitigate this problem.

In particular, we consider the following student and teacher characteristics

- \(experience\) — Teacher’s years of experience
- \(boy\) — Student is a boy (dummy)
- \(lunch\) — Free lunch eligibility (dummy)
- \(black\) — Student is African-American (dummy)
- \(race\) — Student’s race is other than black or white (dummy)
- \(\text{schoolid}\) — School indicator variables

in the four population regression specifications \[\begin{align} Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.4} \\ Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + u_i, \tag{13.5} \\ Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + schoolid + u_i, \tag{13.6} \end{align}\] and \[\begin{align} Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + \beta_4 boy + \beta_5 lunch \\ & + \beta_6 black + \beta_7 race + schoolid + u_i. \tag{13.7} \end{align}\]

Prior to estimation, we perform some subsetting and data wrangling using functions from the packages `dplyr` and `tidyr`. These are both part of `tidyverse`, a collection of `R` packages designed for data science and handling big datasets (see the official site for more on `tidyverse` packages). The functions `%>%`, `transmute()` and `mutate()` are sufficient for us here:

`%>%`allows to chain function calls.`transmute()`allows to subset the data set by naming the variables to be kept.`mutate()`is convenient for adding new variables based on existing ones while preserving the latter.

The regression models (13.4) to (13.7) require the variables `gender`, `ethnicity`, `stark`, `readk`, `mathk`, `lunchk`, `experiencek` and `schoolidk`. After dropping the remaining variables using `transmute()`, we use `mutate()` to add three additional binary variables which are derivatives of existing ones: `black`, `race` and `boy`. They are generated using logical statements within the function `ifelse()`.

```
# load packages 'dplyr' and 'tidyr' for data wrangling functionalities
library(dplyr)
library(tidyr)
# generate subset with kindergarten data
<- STAR %>%
STARK transmute(gender,
ethnicity,
stark,
readk,
mathk,
lunchk,
experiencek,%>%
schoolidk) mutate(black = ifelse(ethnicity == "afam", 1, 0),
race = ifelse(ethnicity == "afam" | ethnicity == "cauc", 1, 0),
boy = ifelse(gender == "male", 1, 0))
```

```
# estimate the models
<- lm(I(mathk + readk) ~ stark + experiencek,
gradeK1 data = STARK)
<- lm(I(mathk + readk) ~ stark + experiencek + schoolidk,
gradeK2 data = STARK)
<- lm(I(mathk + readk) ~ stark + experiencek + boy + lunchk
gradeK3 + black + race + schoolidk,
data = STARK)
```

For brevity, we exclude the coefficients for the indicator dummies in `coeftest()`’s output by subsetting the matrices.

```
# obtain robust inference on the significance of coefficients
coeftest(gradeK1, vcov. = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 904.72124 2.22235 407.1020 < 2.2e-16 ***
#> starksmall 14.00613 2.44704 5.7237 1.095e-08 ***
#> starkregular+aide -0.60058 2.25430 -0.2664 0.7899
#> experiencek 1.46903 0.16929 8.6778 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(gradeK2, vcov. = vcovHC, type = "HC1")[1:4, ]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 925.6748750 7.6527218 120.9602155 0.000000e+00
#> starksmall 15.9330822 2.2411750 7.1092540 1.310324e-12
#> starkregular+aide 1.2151960 2.0353415 0.5970477 5.504993e-01
#> experiencek 0.7431059 0.1697619 4.3773429 1.222880e-05
coeftest(gradeK3, vcov. = vcovHC, type = "HC1")[1:7, ]
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 937.6831330 14.3726687 65.2407117 0.000000e+00
#> starksmall 15.8900507 2.1551817 7.3729516 1.908960e-13
#> starkregular+aide 1.7869378 1.9614592 0.9110247 3.623211e-01
#> experiencek 0.6627251 0.1659298 3.9940097 6.578846e-05
#> boy -12.0905123 1.6726331 -7.2284306 5.533119e-13
#> lunchkfree -34.7033021 1.9870366 -17.4648529 1.437931e-66
#> black -25.4305130 3.4986918 -7.2685776 4.125252e-13
```

We now use `stargazer()` to gather all relevant information in a structured table.

```
# compute robust standard errors for each model and gather them in a list
<- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
rob_se_2 sqrt(diag(vcovHC(gradeK1, type = "HC1"))),
sqrt(diag(vcovHC(gradeK2, type = "HC1"))),
sqrt(diag(vcovHC(gradeK3, type = "HC1"))))
```

```
stargazer(fmk, fm1, fm2, fm3,
title = "Project STAR - Differences Estimates with
Additional Regressors for Kindergarten",
header = FALSE,
type = "latex",
model.numbers = F,
omit.table.layout = "n",
digits = 3,
column.labels = c("(1)", "(2)", "(3)", "(4)"),
dep.var.caption = "Dependent Variable: Test Score in Kindergarten",
dep.var.labels.include = FALSE,
se = rob_se_2)
```

Dependent Variable: Test Score in Kindergarten | ||||

(1) | (2) | (3) | (4) | |

starksmall | 13.899^{***} | 14.006^{***} | 15.933^{***} | 15.890^{***} |

(2.454) | (2.447) | (2.241) | (2.155) | |

starkregular+aide | 0.314 | -0.601 | 1.215 | 1.787 |

(2.271) | (2.254) | (2.035) | (1.961) | |

experiencek | 1.469^{***} | 0.743^{***} | 0.663^{***} | |

(0.169) | (0.170) | (0.166) | ||

boy | -12.091^{***} | |||

(1.673) | ||||

lunchkfree | -34.703^{***} | |||

(1.987) | ||||

black | -25.431^{***} | |||

(3.499) | ||||

race | 8.501 | |||

(12.520) | ||||

Constant | 918.043^{***} | 904.721^{***} | 925.675^{***} | 937.683^{***} |

(1.633) | (2.222) | (7.653) | (14.373) | |

School indicators? | no | no | yes | yes |

Observations | 5,786 | 5,766 | 5,766 | 5,748 |

R^{2} | 0.007 | 0.020 | 0.234 | 0.291 |

Adjusted R^{2} | 0.007 | 0.020 | 0.223 | 0.281 |

Residual Std. Error | 73.490 (df = 5783) | 73.085 (df = 5762) | 65.075 (df = 5684) | 62.663 (df = 5662) |

F Statistic | 21.263^{***} (df = 2; 5783) | 39.861^{***} (df = 3; 5762) | 21.413^{***} (df = 81; 5684) | 27.364^{***} (df = 85; 5662) |

Table 13.3: Project STAR - Differences Estimates with Additional Regressors for Kindergarten

The results in column (1) of Table 13.3 just a repeat the results obtained for (13.3). Columns (2) to (4) reveal that adding student characteristics and school fixed effects does not lead to substantially different estimates of the treatment effects. This result makes it more plausible that the estimates of the effects obtained using model (13.3) do not suffer from failure of random assignment. There is some decrease in the standard errors and some increase in \(\bar{R}^2\), implying that the estimates are more precise.

Because teachers were randomly assigned to classes, inclusion of school fixed effect allows us to estimate the causal effect of a teacher’s experience on test scores of students in kindergarten. Regression (3) predicts the average effect of 10 years experience on test scores to be \(10\cdot 0.74=7.4\) points. Be aware that the other estimates on student characteristics in regression (4) *do not* have causal interpretation due to nonrandom assignment (see Chapter 13.3 of the book for a detailed discussion).

Are the estimated effects presented in Table 13.3 large or small in a practical sense? Let us translate the predicted changes in test scores to units of standard deviation in order to allow for a comparison (see Section 9.4 for a similar argument).

```
# compute the sample standard deviations of test scores
<- c("K" = sd(na.omit(STAR$readk + STAR$mathk)),
SSD "1" = sd(na.omit(STAR$read1 + STAR$math1)),
"2" = sd(na.omit(STAR$read2 + STAR$math2)),
"3" = sd(na.omit(STAR$read3 + STAR$math3)))
# translate the effects of small classes to standard deviations
<- c("K" = as.numeric(coef(fmk)[2]/SSD[1]),
Small "1" = as.numeric(coef(fm1)[2]/SSD[2]),
"2" = as.numeric(coef(fm2)[2]/SSD[3]),
"3" = as.numeric(coef(fm3)[2]/SSD[4]))
# adjust the standard errors
<- c("K" = as.numeric(rob_se_1[[1]][2]/SSD[1]),
SmallSE "1" = as.numeric(rob_se_1[[2]][2]/SSD[2]),
"2" = as.numeric(rob_se_1[[3]][2]/SSD[3]),
"3" = as.numeric(rob_se_1[[4]][2]/SSD[4]))
# translate the effects of regular classes with aide to standard deviations
<- c("K" = as.numeric(coef(fmk)[3]/SSD[1]),
RegAide"1" = as.numeric(coef(fm1)[3]/SSD[2]),
"2" = as.numeric(coef(fm2)[3]/SSD[3]),
"3" = as.numeric(coef(fm3)[3]/SSD[4]))
# adjust the standard errors
<- c("K" = as.numeric(rob_se_1[[1]][3]/SSD[1]),
RegAideSE "1" = as.numeric(rob_se_1[[2]][3]/SSD[2]),
"2" = as.numeric(rob_se_1[[3]][3]/SSD[3]),
"3" = as.numeric(rob_se_1[[4]][3]/SSD[4]))
# gather the results in a data.frame and round
<- t(round(data.frame(
df
Small, SmallSE, RegAide, RegAideSE, SSD),digits = 2))
```

It is fairly easy to turn the `data.frame` `df` into a table.

```
# generate a simple table using stargazer
stargazer(df,
title = "Estimated Class Size Effects
(in Units of Standard Deviations)",
type = "html",
summary = FALSE,
header = FALSE
)
```

K | 1 | 2 | 3 | |

Small | 0.190 | 0.330 | 0.230 | 0.210 |

SmallSE | 0.030 | 0.030 | 0.030 | 0.040 |

RegAide | 0 | 0.130 | 0.040 | 0 |

RegAideSE | 0.030 | 0.030 | 0.030 | 0.030 |

SSD | 73.750 | 91.280 | 84.080 | 73.270 |

Table 13.4: Estimated Class Size Effects (in Units of Standard Deviations)

The estimated effect of a small classes is largest for grade 1. As pointed out in the book, this is probably because students in the control group for grade 1 did poorly on the test for some unknown reason or simply due to random variation. The difference between the estimated effect of being in a small class and being in a regular classes with an aide is roughly 0.2 standard deviations for all grades. This leads to the conclusion that the effect of being in a regular sized class with an aide is zero and the effect of being in a small class is roughly the same for all grades.

The remainder of Chapter 13.3 in the book discusses to what extent these experimental estimates are comparable with observational estimates obtained using data on school districts in California and Massachusetts in Chapter 9. It turns out that the estimates are indeed very similar. Please refer to the aforementioned section in the book for a more detailed discussion.