13.3 Experimental Estimates of the Effect of Class Size Reductions

This book is in Open Review. We want your feedback to make the book better for you and other students. You may annotate some text by selecting it with the cursor and then click "Annotate" in the pop-up menu. You can also see the annotations of others: click the arrow in the upper right hand corner of the page

13.3 Experimental Estimates of the Effect of Class Size Reductions

Experimental Design and the Data Set

The Project Student-Teacher Achievement Ratio (STAR) was a large randomized controlled experiment with the aim of asserting whether a class size reduction is effective in improving education outcomes. It was conducted in 80 Tennessee elementary schools over a period of four years during the 1980s by the State Department of Education.

In the first year, about 6400 students were randomly assigned into one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher’s aide). Teachers were also randomly assigned to the classes they taught. The interventions were initiated as the students entered school in kindergarten and continued through to third grade. Control and treatment groups across grades are summarized in Table 13.1.

Table 13.1: Control and treatment groups in the STAR experiment
	K	1	2	3
Treatment 1	Small class	Small class	Small class	Small class
Treatment 2	Regular class + aide	Regular class + aide	Regular class + aide	Regular class + aide
Control	Regular class	Regular class	Regular class	Regular class

Each year, the students’ learning progress was assessed using the sum of the points scored on the math and reading parts of a standardized test (the Stanford Achievement Test).

The STAR data set is part of the package AER.

# load the package AER and the STAR dataset
library(AER)
data(STAR)

head(STAR) shows that there is a variety of factor variables that describe student and teacher characteristics as well as various school indicators, all of which are separately recorded for the four different grades. The data is in wide format. That is, each variable has its own column and for each student, the rows contain observations on these variables. Using dim(STAR) we find that there are a total of 11598 observations on 47 variables.

# get an overview
head(STAR, 2)
#>      gender ethnicity   birth stark star1 star2   star3 readk read1 read2 read3
#> 1122 female      afam 1979 Q3  <NA>  <NA>  <NA> regular    NA    NA    NA   580
#> 1137 female      cauc 1980 Q1 small small small   small   447   507   568   587
#>      mathk math1 math2 math3   lunchk lunch1   lunch2 lunch3 schoolk school1
#> 1122    NA    NA    NA   564     <NA>   <NA>     <NA>   free    <NA>    <NA>
#> 1137   473   538   579   593 non-free   free non-free   free   rural   rural
#>      school2  school3  degreek  degree1  degree2  degree3 ladderk ladder1
#> 1122    <NA> suburban     <NA>     <NA>     <NA> bachelor    <NA>    <NA>
#> 1137   rural    rural bachelor bachelor bachelor bachelor  level1  level1
#>         ladder2    ladder3 experiencek experience1 experience2 experience3
#> 1122       <NA>     level1          NA          NA          NA          30
#> 1137 apprentice apprentice           7           7           3           1
#>      tethnicityk tethnicity1 tethnicity2 tethnicity3 systemk system1 system2
#> 1122        <NA>        <NA>        <NA>        cauc    <NA>    <NA>    <NA>
#> 1137        cauc        cauc        cauc        cauc      30      30      30
#>      system3 schoolidk schoolid1 schoolid2 schoolid3
#> 1122      22      <NA>      <NA>      <NA>        54
#> 1137      30        63        63        63        63
dim(STAR)
#> [1] 11598    47

# get variable names
names(STAR)
#>  [1] "gender"      "ethnicity"   "birth"       "stark"       "star1"      
#>  [6] "star2"       "star3"       "readk"       "read1"       "read2"      
#> [11] "read3"       "mathk"       "math1"       "math2"       "math3"      
#> [16] "lunchk"      "lunch1"      "lunch2"      "lunch3"      "schoolk"    
#> [21] "school1"     "school2"     "school3"     "degreek"     "degree1"    
#> [26] "degree2"     "degree3"     "ladderk"     "ladder1"     "ladder2"    
#> [31] "ladder3"     "experiencek" "experience1" "experience2" "experience3"
#> [36] "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"    
#> [41] "system1"     "system2"     "system3"     "schoolidk"   "schoolid1"  
#> [46] "schoolid2"   "schoolid3"

A majority of the variable names contain a suffix (k, 1, 2 or 3) stating the grade which the respective variable is referring to. This facilitates regression analysis because it allows to adjust the formula argument in lm() for each grade by simply changing the variables’ suffixes accordingly.

The outcome produced by head() shows that some recorded values are NA and thus, there is no data on this variable for the student under consideration. This lies in the nature of the data: for example, take the first observation STAR[1,].

In the output of head(STAR, 2) we find that the student entered the experiment in third grade in a regular class, which is why the class size is recorded in star3 and the other class type indicator variables are NA. Her math and reading scores for the third grade are available; however, recordings for other grades are not present for the same reason. Obtaining only her non-missing (non-NA) recordings is straightforward: simply eliminate the NAs using the !is.na() function.

# drop NA recordings for the first observation and print to the console
STAR[1, !is.na(STAR[1, ])]
#>      gender ethnicity   birth   star3 read3 math3 lunch3  school3  degree3
#> 1122 female      afam 1979 Q3 regular   580   564   free suburban bachelor
#>      ladder3 experience3 tethnicity3 system3 schoolid3
#> 1122  level1          30        cauc      22        54

is.na(STAR[1, ]) returns a logical vector with TRUE at positions that correspond to entries for the first observation. The ! operator is used to invert the result such that we obtain only non- entries for the first observations.

In general it is not necessary to remove rows with missing data because lm() does so by default. Missing data may imply a small sample size and thus may lead to imprecise estimation and wrong inference. This is, however, not an issue for the study at hand since, as we will see below, sample sizes lie beyond 5000 observations for each regression conducted.

Analysis of the STAR Data

As can be seen from Table 13.1 that there are two treatment groups in each grade, small classes with only 13 to 17 students and regular classes with 22 to 25 students and a teaching aide. Thus, two binary variables, each being an indicator for the respective treatment group, are introduced for the differences estimator to capture the treatment effect for each treatment group separately. This yields the population regression model \[\begin{align} Y_i = \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.3} \end{align}\] with test score \(Y_i\) which is the small class indicator \(SmallClass_i\) and \(RegAide_i\) which is the indicator for a regular class with aide.

We reproduce the results presented in Table 13.1 of the book by performing the regression (13.3) for each grade separately. For each student, the dependent variable is simply the sum of the points scored in the math and reading parts, constructed using I().

# compute differences Estimates for each grades
fmk <- lm(I(readk + mathk) ~ stark, data = STAR)
fm1 <- lm(I(read1 + math1) ~ star1, data = STAR)
fm2 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm3 <- lm(I(read3 + math3) ~ star3, data = STAR)

# obtain coefficient matrix using robust standard errors
coeftest(fmk, vcov = vcovHC, type= "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       918.04289    1.63339 562.0473 < 2.2e-16 ***
#> starksmall         13.89899    2.45409   5.6636 1.554e-08 ***
#> starkregular+aide   0.31394    2.27098   0.1382    0.8901    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm1, vcov = vcovHC, type= "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1039.3926     1.7846 582.4321 < 2.2e-16 ***
#> star1small          29.7808     2.8311  10.5190 < 2.2e-16 ***
#> star1regular+aide   11.9587     2.6520   4.5093  6.62e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm2, vcov = vcovHC, type= "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1157.8066     1.8151 637.8820 < 2.2e-16 ***
#> star2small          19.3944     2.7117   7.1522  9.55e-13 ***
#> star2regular+aide    3.4791     2.5447   1.3672    0.1716    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm3, vcov = vcovHC, type= "HC1")
#> 
#> t test of coefficients:
#> 
#>                     Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       1228.50636    1.68001 731.2483 < 2.2e-16 ***
#> star3small          15.58660    2.39604   6.5051 8.393e-11 ***
#> star3regular+aide   -0.29094    2.27271  -0.1280    0.8981    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We gather the results and present them in a table using stargazer().

# compute robust standard errors for each model and gather them in a list
rob_se_1 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(fm1, type = "HC1"))),
                 sqrt(diag(vcovHC(fm2, type = "HC1"))),
                 sqrt(diag(vcovHC(fm3, type = "HC1"))))

library(stargazer)

stargazer(fmk,fm1,fm2,fm3,
  title = "Project STAR: Differences Estimates",
  header = FALSE, 
  type = "latex",
  model.numbers = F,
  omit.table.layout = "n",
  digits = 3, 
  column.labels = c("K", "1", "2", "3"),
  dep.var.caption  = "Dependent Variable: Grade",
  dep.var.labels.include = FALSE,
  se = rob_se_1)


	Dependent Variable: Grade

	K	1	2	3

starksmall	13.90^***
	(2.45)

starkregular+aide	0.31
	(2.27)

star1small		29.78^***
		(2.83)

star1regular+aide		11.96^***
		(2.65)

star2small			19.39^***
			(2.71)

star2regular+aide			3.48
			(2.54)

star3small				15.59^***
				(2.40)

star3regular+aide				-0.29
				(2.27)

Constant	918.04^***	1,039.39^***	1,157.81^***	1,228.51^***
	(1.63)	(1.78)	(1.82)	(1.68)


Observations	5,786	6,379	6,049	5,967
R²	0.01	0.02	0.01	0.01
Adjusted R²	0.01	0.02	0.01	0.01
Residual Std. Error	73.49 (df = 5783)	90.50 (df = 6376)	83.69 (df = 6046)	72.91 (df = 5964)
F Statistic	21.26^*** (df = 2; 5783)	56.34^*** (df = 2; 6376)	28.71^*** (df = 2; 6046)	30.25^*** (df = 2; 5964)

Table 13.2: Project STAR - Differences Estimates

The estimates presented in Table 13.2 suggest that the class size reduction improves student performance. Except for grade 1, the estimates of the coefficient on \(SmallClass\) are roughly of the same magnitude (the estimates lie between 13.90 and 19.39 points) and they are statistically significant at \(1\%\). Furthermore, a teaching aide has little, possibly zero, effect on the performance of the students.

Following the book, we augment the regression model (13.3) by different sets of regressors for two reasons:

If the additional regressors explain some of the observed variation in the dependent variable, we obtain more efficient estimates of the coefficients of interest.
If the treatment is not received at random due to failures to follow the treatment protocol (see Chapter 13.3 of the book), the estimates obtained using (13.3) may be biased. Adding additional regressors may solve or mitigate this problem.

In particular, we consider the following student and teacher characteristics

\(experience\) — Teacher’s years of experience
\(boy\) — Student is a boy (dummy)
\(lunch\) — Free lunch eligibility (dummy)
\(black\) — Student is African-American (dummy)
\(race\) — Student’s race is other than black or white (dummy)
\(\text{schoolid}\) — School indicator variables

in the four population regression specifications \[\begin{align} Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.4} \\ Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + u_i, \tag{13.5} \\ Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + schoolid + u_i, \tag{13.6} \end{align}\] and \[\begin{align} Y_i =& \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + \beta_3 experience_i + \beta_4 boy + \beta_5 lunch \\ & + \beta_6 black + \beta_7 race + schoolid + u_i. \tag{13.7} \end{align}\]

Prior to estimation, we perform some subsetting and data wrangling using functions from the packages dplyr and tidyr. These are both part of tidyverse, a collection of R packages designed for data science and handling big datasets (see the official site for more on tidyverse packages). The functions %>%, transmute() and mutate() are sufficient for us here:

%>% allows to chain function calls.
transmute() allows to subset the data set by naming the variables to be kept.
mutate() is convenient for adding new variables based on existing ones while preserving the latter.

The regression models (13.4) to (13.7) require the variables gender, ethnicity, stark, readk, mathk, lunchk, experiencek and schoolidk. After dropping the remaining variables using transmute(), we use mutate() to add three additional binary variables which are derivatives of existing ones: black, race and boy. They are generated using logical statements within the function ifelse().

# load packages 'dplyr' and 'tidyr' for data wrangling functionalities
library(dplyr)
library(tidyr)

# generate subset with kindergarten data
STARK <- STAR %>% 
      transmute(gender,
                ethnicity,
                stark,
                readk,
                mathk,
                lunchk,
                experiencek,
                schoolidk) %>% 
      mutate(black = ifelse(ethnicity == "afam", 1, 0),
             race = ifelse(ethnicity == "afam" | ethnicity == "cauc", 1, 0),
             boy = ifelse(gender == "male", 1, 0))

# estimate the models 
gradeK1 <- lm(I(mathk + readk) ~ stark + experiencek, 
              data = STARK)

gradeK2 <- lm(I(mathk + readk) ~ stark + experiencek + schoolidk, 
              data = STARK)

gradeK3 <- lm(I(mathk + readk) ~ stark + experiencek + boy + lunchk 
              + black + race + schoolidk, 
              data = STARK)

For brevity, we exclude the coefficients for the indicator dummies in the output of the coeftest()by subsetting the matrices.

# obtain robust inference on the significance of coefficients
coeftest(gradeK1, vcov. = vcovHC, type = "HC1")
#> 
#> t test of coefficients:
#> 
#>                    Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept)       904.72124    2.22235 407.1020 < 2.2e-16 ***
#> starksmall         14.00613    2.44704   5.7237 1.095e-08 ***
#> starkregular+aide  -0.60058    2.25430  -0.2664    0.7899    
#> experiencek         1.46903    0.16929   8.6778 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(gradeK2, vcov. = vcovHC, type = "HC1")[1:4, ]
#>                      Estimate Std. Error     t value     Pr(>|t|)
#> (Intercept)       925.6748750  7.6527218 120.9602155 0.000000e+00
#> starksmall         15.9330822  2.2411750   7.1092540 1.310324e-12
#> starkregular+aide   1.2151960  2.0353415   0.5970477 5.504993e-01
#> experiencek         0.7431059  0.1697619   4.3773429 1.222880e-05
coeftest(gradeK3, vcov. = vcovHC, type = "HC1")[1:7, ]
#>                      Estimate Std. Error     t value     Pr(>|t|)
#> (Intercept)       937.6831330 14.3726687  65.2407117 0.000000e+00
#> starksmall         15.8900507  2.1551817   7.3729516 1.908960e-13
#> starkregular+aide   1.7869378  1.9614592   0.9110247 3.623211e-01
#> experiencek         0.6627251  0.1659298   3.9940097 6.578846e-05
#> boy               -12.0905123  1.6726331  -7.2284306 5.533119e-13
#> lunchkfree        -34.7033021  1.9870366 -17.4648529 1.437931e-66
#> black             -25.4305130  3.4986918  -7.2685776 4.125252e-13

We now use stargazer() to gather all relevant information in a structured table.

# compute robust standard errors for each model and gather them in a list
rob_se_2 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK1, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK2, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK3, type = "HC1"))))

stargazer(fmk, gradeK1, gradeK2, gradeK3,
  title = "Project STAR - Differences Estimates with 
  Additional Regressors for Kindergarten",
  header = FALSE, 
  type = "latex",
  model.numbers = F,
  omit.table.layout = "n",
  digits = 3, 
  column.labels = c("(1)", "(2)", "(3)", "(4)"),
  dep.var.caption  = "Dependent Variable: Test Score in Kindergarten",
  dep.var.labels.include = FALSE,
  se = rob_se_2)


	Dependent Variable: Test Score in Kindergarten

	(1)	(2)	(3)	(4)

starksmall	13.899^***	14.006^***	15.933^***	15.890^***
	(2.454)	(2.447)	(2.241)	(2.155)

starkregular+aide	0.314	-0.601	1.215	1.787
	(2.271)	(2.254)	(2.035)	(1.961)

experiencek		1.469^***	0.743^***	0.663^***
		(0.169)	(0.170)	(0.166)

boy				-12.091^***
				(1.673)

lunchkfree				-34.703^***
				(1.987)

black				-25.431^***
				(3.499)

race				8.501
				(12.520)

Constant	918.043^***	904.721^***	925.675^***	937.683^***
	(1.633)	(2.222)	(7.653)	(14.373)


School indicators?	no	no	yes	yes
Observations	5,786	5,766	5,766	5,748
R²	0.007	0.020	0.234	0.291
Adjusted R²	0.007	0.020	0.223	0.281
Residual Std. Error	73.490 (df = 5783)	73.085 (df = 5762)	65.075 (df = 5684)	62.663 (df = 5662)
F Statistic	21.263^*** (df = 2; 5783)	39.861^*** (df = 3; 5762)	21.413^*** (df = 81; 5684)	27.364^*** (df = 85; 5662)

Table 13.3: Project STAR - Differences Estimates with Additional Regressors for Kindergarten

The results in column (1) of Table 13.3 same as the results obtained for (13.3). Columns (2) to (4) reveal that adding student characteristics and school fixed effects does not lead to substantially different estimates of the treatment effects. This result makes it more plausible that the estimates of the effects obtained using model (13.3) do not suffer from failure of random assignment. There is some decrease in the standard errors and some increase in \(\bar{R}^2\), implying that the estimates are more precise.

Because teachers were randomly assigned to classes, inclusion of school fixed effect allows us to estimate the causal effect of a teacher’s experience on test scores of students in kindergarten. Regression (3) predicts the average effect of 10 years experience on test scores to be \(10\cdot 0.74=7.4\) points. Be aware that the other estimates on student characteristics in regression (4) do not have causal interpretation due to nonrandom assignment (see Chapter 13.3 of the book for a detailed discussion).

Are the estimated effects presented in Table 13.3 large or small in a practical sense? Let us translate the predicted changes in test scores to units of standard deviation in order to allow for a comparison (see Section 9.4 for a similar argument).

# compute the sample standard deviations of test scores
SSD <- c("K" = sd(na.omit(STAR$readk + STAR$mathk)),
         "1" = sd(na.omit(STAR$read1 + STAR$math1)),
         "2" = sd(na.omit(STAR$read2 + STAR$math2)),
         "3" = sd(na.omit(STAR$read3 + STAR$math3)))

# translate the effects of small classes to standard deviations
Small <- c("K" = as.numeric(coef(fmk)[2]/SSD[1]),
           "1" = as.numeric(coef(fm1)[2]/SSD[2]),
           "2" = as.numeric(coef(fm2)[2]/SSD[3]),
           "3" = as.numeric(coef(fm3)[2]/SSD[4]))

# adjust the standard errors
SmallSE <- c("K" = as.numeric(rob_se_1[[1]][2]/SSD[1]),
             "1" = as.numeric(rob_se_1[[2]][2]/SSD[2]),
             "2" = as.numeric(rob_se_1[[3]][2]/SSD[3]),
             "3" = as.numeric(rob_se_1[[4]][2]/SSD[4]))

# translate the effects of regular classes with aide to standard deviations
RegAide<- c("K" = as.numeric(coef(fmk)[3]/SSD[1]),
            "1" = as.numeric(coef(fm1)[3]/SSD[2]),
            "2" = as.numeric(coef(fm2)[3]/SSD[3]),
            "3" = as.numeric(coef(fm3)[3]/SSD[4]))

# adjust the standard errors
RegAideSE <- c("K" = as.numeric(rob_se_1[[1]][3]/SSD[1]),
               "1" = as.numeric(rob_se_1[[2]][3]/SSD[2]),
               "2" = as.numeric(rob_se_1[[3]][3]/SSD[3]),
               "3" = as.numeric(rob_se_1[[4]][3]/SSD[4]))

# gather the results in a data.frame and round
df <- t(round(data.frame(
                        Small, SmallSE, RegAide, RegAideSE, SSD),
                        digits =  2))

It is fairly easy to turn the data.frame df into a table.

# generate a simple table using stargazer
stargazer(df,
          title = "Estimated Class Size Effects 
          (in Units of Standard Deviations)",
          type = "html", 
          summary = FALSE,
          header = FALSE
          )


	K	1	2	3

Small	0.190	0.330	0.230	0.210
SmallSE	0.030	0.030	0.030	0.030
RegAide	0	0.130	0.040	0
RegAideSE	0.030	0.030	0.030	0.030
SSD	73.750	91.280	84.080	73.270

Table 13.4: Estimated Class Size Effects (in Units of Standard Deviations)

The estimated effect of a small class is largest for grade 1. As pointed out in the book, this is probably because students in the control group for grade 1 did poorly on the test for some unknown reason or simply due to random variation. The difference between the estimated effect of being in a small class and being in a regular class with an aide is roughly 0.2 standard deviations for all grades. This leads to the conclusion that the effect of being in a regular sized class with an aide is zero and the effect of being in a small class is roughly the same for all grades.

The remainder of Chapter 13.3 in the book discusses to what extent these experimental estimates are comparable with observational estimates obtained using data on school districts in California and Massachusetts in Chapter 9. It turns out that the estimates are indeed very similar. Please refer to the aforementioned section in the book for a more detailed discussion.