
## 3.6 An Application to the Gender Gap of Earnings

This section discusses how to reproduce the results presented in the box "The Gender Gap of Earnings of College Graduates in the United States" of the book.

To reproduce Table 3.1 of the book, you need to download the replication data, which are hosted by Pearson. The file contains data ranging from $1992$ to $2008$, and earnings are reported in prices of $2008$.

There are several ways to import .xlsx files into R. We suggest the function read_excel() from the readxl package (Wickham and Bryan 2019). The package is not part of R's base version and has to be installed manually.
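
As a minimal sketch of the installation step, the following lines install readxl from CRAN only if it is not yet available on your machine. We assume a working internet connection and a configured CRAN mirror; requireNamespace() and install.packages() are base R functions.

```r
# install 'readxl' from CRAN only if it is not available yet
if (!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")
}
```

The same pattern works for any other package used in this section.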

# load the 'readxl' package
library(readxl)

You are now ready to import the dataset. Make sure you use the correct path to the downloaded file! In our example, the file is saved in a subfolder of the working directory named data. If you are not sure what your current working directory is, use getwd() (see also ?getwd). This returns the path of the directory where R currently looks for files.

# import the data into R
cps <- read_excel(path = "data/cps_ch3.xlsx")
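
If the import fails because the file cannot be found, a quick check along the following lines may help. This is only a sketch: the variable name cps_path is our own choice, and the path assumes the file sits in the subfolder data as described above.

```r
# build the relative path to the file (assumed location: subfolder 'data')
cps_path <- file.path("data", "cps_ch3.xlsx")

# report the working directory if the file is not where we expect it
if (!file.exists(cps_path)) {
  message("File not found: ", cps_path,
          " (current working directory: ", getwd(), ")")
}
```

If the message appears, either move the file or adjust the working directory with setwd().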

Next, install and load the package dplyr (Wickham et al. 2020). This package provides some handy functions that simplify data wrangling a lot. It makes use of the %>% operator.

# load the 'dplyr' package
library(dplyr)

First, get an overview of the dataset. Next, use %>% and some functions from the dplyr package to group the observations by gender and year and compute descriptive statistics for each group.

# get an overview of the data structure
head(cps)
#> # A tibble: 6 x 3
#>   a_sex  year ahe08
#>   <dbl> <dbl> <dbl>
#> 1     1  1992  17.2
#> 2     1  1992  15.3
#> 3     1  1992  22.9
#> 4     2  1992  13.3
#> 5     1  1992  22.1
#> 6     2  1992  12.2

# group data by gender and year and compute the mean, standard deviation
# and number of observations for each group
avgs <- cps %>%
  group_by(a_sex, year) %>%
  summarise(mean(ahe08),
            sd(ahe08),
            n())

# print the results to the console
print(avgs)
#> # A tibble: 10 x 5
#> # Groups:   a_sex [2]
#>    a_sex  year `mean(ahe08)` `sd(ahe08)` `n()`
#>    <dbl> <dbl>         <dbl>       <dbl> <int>
#>  1     1  1992          23.3       10.2   1594
#>  2     1  1996          22.5       10.1   1379
#>  3     1  2000          24.9       11.6   1303
#>  4     1  2004          25.1       12.0   1894
#>  5     1  2008          25.0       11.8   1838
#>  6     2  1992          20.0        7.87  1368
#>  7     2  1996          19.0        7.95  1230
#>  8     2  2000          20.7        9.36  1181
#>  9     2  2004          21.0        9.36  1735
#> 10     2  2008          20.9        9.66  1871

With the pipe operator %>% we simply chain R functions whose inputs and outputs are compatible. In the code above, we take the dataset cps and use it as the input for the function group_by(). The output of group_by() is subsequently used as the input for summarise(), and so forth.
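
To illustrate the chaining, the following sketch compares a piped call with its nested equivalent. The data frame toy and its numbers are made up for illustration and are not taken from the CPS data.

```r
library(dplyr)

# toy data (made-up numbers, not the CPS data)
toy <- data.frame(
  a_sex = c(1, 1, 2, 2),
  year  = c(1992, 1992, 1992, 1992),
  ahe08 = c(20, 24, 18, 20)
)

# piped version: read from top to bottom
piped <- toy %>%
  group_by(a_sex, year) %>%
  summarise(mean(ahe08))

# nested version: the same computation, read from the inside out
nested <- summarise(group_by(toy, a_sex, year), mean(ahe08))

# check whether both versions yield the same result
all.equal(piped, nested)
```

Both versions perform the same computation. The piped version is usually easier to read once more than two steps are involved.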

Now that we have computed the statistics of interest for both genders, we can investigate how the gap in earnings between the two groups evolves over time.

# split the dataset by gender
male <- avgs %>% dplyr::filter(a_sex == 1)

female <- avgs %>% dplyr::filter(a_sex == 2)

# rename columns of both splits
colnames(male)   <- c("Sex", "Year", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Sex", "Year", "Y_bar_f", "s_f", "n_f")

# estimate gender gaps, compute standard errors and confidence intervals for all dates
gap <- male$Y_bar_m - female$Y_bar_f

gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)

gap_ci_l <- gap - 1.96 * gap_se

gap_ci_u <- gap + 1.96 * gap_se

result <- cbind(male[,-1], female[,-(1:2)], gap, gap_se, gap_ci_l, gap_ci_u)

# print the results to the console
print(result, digits = 3)
#>   Year Y_bar_m  s_m  n_m Y_bar_f  s_f  n_f  gap gap_se gap_ci_l gap_ci_u
#> 1 1992    23.3 10.2 1594    20.0 7.87 1368 3.23  0.332     2.58     3.88
#> 2 1996    22.5 10.1 1379    19.0 7.95 1230 3.49  0.354     2.80     4.19
#> 3 2000    24.9 11.6 1303    20.7 9.36 1181 4.14  0.421     3.32     4.97
#> 4 2004    25.1 12.0 1894    21.0 9.36 1735 4.10  0.356     3.40     4.80
#> 5 2008    25.0 11.8 1838    20.9 9.66 1871 4.10  0.354     3.41     4.80
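
For reference, the code above implements the following formulas for the standard error of the difference in means and the corresponding $95\%$ confidence interval. The notation follows the column names used in the code, with $\overline{Y}_m$, $s_m$, $n_m$ for men and $\overline{Y}_f$, $s_f$, $n_f$ for women:

$$SE(\overline{Y}_m - \overline{Y}_f) = \sqrt{\frac{s_m^2}{n_m} + \frac{s_f^2}{n_f}}, \qquad (\overline{Y}_m - \overline{Y}_f) \pm 1.96 \times SE(\overline{Y}_m - \overline{Y}_f).$$

As a check with the rounded values for $1992$ from the table:

$$\sqrt{\frac{10.2^2}{1594} + \frac{7.87^2}{1368}} \approx 0.33, \qquad 3.23 \pm 1.96 \times 0.332 \approx [2.58, \ 3.88].$$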

We obtain virtually the same results as those presented in the book. The computed statistics suggest that there is a gender gap in earnings. Note that we can reject the null hypothesis that the gap is zero at the $5\%$ level in every period, since none of the $95\%$ confidence intervals contains zero. Further, the estimates of the gap and the bounds of the $95\%$ confidence intervals indicate that the gap has been quite stable since $2000$.
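
The test decision can also be checked directly with $t$-statistics. The sketch below re-enters the rounded estimates from the table printed above, so the results are approximate. The object names gap_tab and se_tab are our own and are chosen so that the objects computed earlier are not overwritten.

```r
# estimates and standard errors as printed in the table above (rounded)
year    <- c(1992, 1996, 2000, 2004, 2008)
gap_tab <- c(3.23, 3.49, 4.14, 4.10, 4.10)
se_tab  <- c(0.332, 0.354, 0.421, 0.356, 0.354)

# t-statistics for the null hypothesis that the gap is zero
t_stat <- gap_tab / se_tab

# two-sided p-values based on the standard normal approximation
p_val <- 2 * pnorm(-abs(t_stat))

# TRUE where the null hypothesis is rejected at the 5% level
reject <- abs(t_stat) > 1.96

data.frame(year, t_stat, p_val, reject)
```

With the unrounded objects from above, the same statistics are obtained from gap / gap_se.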