
## 3.6 An Application to the Gender Gap of Earnings

This section discusses how to reproduce the results presented in the box *The Gender Gap of Earnings of College Graduates in the United States* of the book.

To reproduce Table 3.1 of the book, you need the replication data, which are hosted by Pearson. Download the data for Chapter 3 as an Excel spreadsheet (`cps_ch3.xlsx`). The dataset contains data that range from \(1992\) to \(2008\), and earnings are reported in \(2008\) prices.

There are several ways to import `.xlsx` files into `R`. We suggest the function `read_excel()` from the `readxl` package (Wickham & Bryan, 2018). The package is not part of `R`’s base version and has to be installed manually.

```
# load the 'readxl' package
library(readxl)
```
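If the package is not installed yet, `library(readxl)` fails with an error. A minimal sketch of the manual installation step could look as follows; the variable names `pkgs` and `missing` and the CRAN mirror URL are our own choices for illustration, not part of the book's code.

```r
# packages needed at this point (extend the vector for further packages)
pkgs <- c("readxl")

# keep only the packages that are not installed yet
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# install the missing packages from CRAN (the mirror is just one possible choice)
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}
```

Installation is needed only once per `R` installation, whereas `library()` has to be called in every new session.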

You are now ready to import the dataset. Make sure you use the correct path to the downloaded file! In our example, the file is saved in a subfolder of the working directory named `data`. If you are not sure what your current working directory is, use `getwd()` (see also `?getwd`). This returns the path to the directory where `R` currently looks for files.

```
# import the data into R
cps <- read_excel(path = 'data/cps_ch3.xlsx')
```
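If the import fails, the path is the most likely culprit. The following sketch, which uses only base `R` functions, shows one way to check the path; the variable name `path` is our own choice.

```r
# show the directory R currently uses to resolve relative paths
getwd()

# build the relative path to the spreadsheet in the subfolder 'data'
path <- file.path("data", "cps_ch3.xlsx")

# TRUE if the file can be found at this location, FALSE otherwise
file.exists(path)
```

If `file.exists()` returns `FALSE`, either move the file to the expected folder or adjust the working directory with `setwd()`.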

Next, install and load the package `dplyr` (Wickham et al., 2018). This package provides some handy functions that simplify data wrangling a lot. It makes use of the `%>%` operator.

```
# load the 'dplyr' package
library(dplyr)
```

First, get an overview of the dataset. Next, use `%>%` and some functions from the `dplyr` package to group the observations by gender and year and compute descriptive statistics for each group.

```
# get an overview of the data structure
head(cps)
```

```
## # A tibble: 6 x 3
##   a_sex  year ahe08
##   <dbl> <dbl> <dbl>
## 1     1  1992  17.2
## 2     1  1992  15.3
## 3     1  1992  22.9
## 4     2  1992  13.3
## 5     1  1992  22.1
## 6     2  1992  12.2
```

```
# group data by gender and year and compute the mean, standard deviation
# and number of observations for each group
avgs <- cps %>%
  group_by(a_sex, year) %>%
  summarise(mean(ahe08),
            sd(ahe08),
            n())

# print the results to the console
print(avgs)
```

```
## # A tibble: 10 x 5
## # Groups:   a_sex [?]
##    a_sex  year `mean(ahe08)` `sd(ahe08)` `n()`
##    <dbl> <dbl>         <dbl>       <dbl> <int>
##  1     1  1992          23.3       10.2   1594
##  2     1  1996          22.5       10.1   1379
##  3     1  2000          24.9       11.6   1303
##  4     1  2004          25.1       12.0   1894
##  5     1  2008          25.0       11.8   1838
##  6     2  1992          20.0        7.87  1368
##  7     2  1996          19.0        7.95  1230
##  8     2  2000          20.7        9.36  1181
##  9     2  2004          21.0        9.36  1735
## 10     2  2008          20.9        9.66  1871
```

With the pipe operator `%>%` we simply chain `R` functions whose outputs and inputs are compatible. In the code above, we take the dataset `cps` and use it as the input for the function `group_by()`. The output of `group_by()` is subsequently used as the input for `summarise()`, and so forth.
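To see what the pipe does, it may help to compare a piped call with the equivalent nested call. The sketch below uses a small made-up data frame `toy` (hypothetical values, not taken from the CPS data); the names `piped` and `nested` are likewise our own.

```r
library(dplyr)

# a small hypothetical data frame for illustration only
toy <- data.frame(g = c("a", "a", "b"),
                  x = c(1, 3, 5))

# piped version: read from left to right
piped <- toy %>%
  group_by(g) %>%
  summarise(m = mean(x))

# nested version: the same computation, read from the inside out
nested <- summarise(group_by(toy, g), m = mean(x))

# compare both results
all.equal(as.data.frame(piped), as.data.frame(nested))
```

Both versions compute the group means of `x`. The piped version lists the steps in the order in which they are carried out, which is usually easier to read once more than two functions are involved.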

Now that we have computed the statistics of interest for both genders, we can investigate how the gap in earnings between both groups evolves over time.

```
# split the dataset by gender
male <- avgs %>% filter(a_sex == 1)
female <- avgs %>% filter(a_sex == 2)
# rename columns of both splits
colnames(male) <- c("Sex", "Year", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Sex", "Year", "Y_bar_f", "s_f", "n_f")
# estimate gender gaps, compute standard errors and confidence intervals for all dates
gap <- male$Y_bar_m - female$Y_bar_f
gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)
gap_ci_l <- gap - 1.96 * gap_se
gap_ci_u <- gap + 1.96 * gap_se
result <- cbind(male[,-1], female[,-(1:2)], gap, gap_se, gap_ci_l, gap_ci_u)
# print the results to the console
print(result, digits = 3)
```

```
##   Year Y_bar_m  s_m  n_m Y_bar_f  s_f  n_f  gap gap_se gap_ci_l gap_ci_u
## 1 1992    23.3 10.2 1594    20.0 7.87 1368 3.23  0.332     2.58     3.88
## 2 1996    22.5 10.1 1379    19.0 7.95 1230 3.49  0.354     2.80     4.19
## 3 2000    24.9 11.6 1303    20.7 9.36 1181 4.14  0.421     3.32     4.97
## 4 2004    25.1 12.0 1894    21.0 9.36 1735 4.10  0.356     3.40     4.80
## 5 2008    25.0 11.8 1838    20.9 9.66 1871 4.10  0.354     3.41     4.80
```
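For reference, the quantities computed in the code above can be written as follows. The notation mirrors the column names of the table (\(m\) for men, \(f\) for women); the worked numbers for \(1992\) use the rounded table entries, so they reproduce the reported values only approximately.

\[
\begin{aligned}
\widehat{\text{gap}} &= \overline{Y}_m - \overline{Y}_f, \\
SE(\overline{Y}_m - \overline{Y}_f) &= \sqrt{\frac{s_m^2}{n_m} + \frac{s_f^2}{n_f}}, \\
\text{95\% CI} &= \left(\overline{Y}_m - \overline{Y}_f\right) \pm 1.96 \times SE(\overline{Y}_m - \overline{Y}_f).
\end{aligned}
\]

For \(1992\), this gives

\[
SE \approx \sqrt{\frac{10.2^2}{1594} + \frac{7.87^2}{1368}} \approx 0.332,
\qquad
3.23 \pm 1.96 \times 0.332 \approx (2.58,\ 3.88).
\]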

We observe virtually the same results as the ones presented in the book. The computed statistics suggest that there *is* a gender gap in earnings. Note that none of the \(95\%\) confidence intervals contains zero, so we can reject the null hypothesis that the gap is zero at the \(5\%\) level in every year. Further, the estimates of the gap and the bounds of the \(95\%\) confidence intervals indicate that the gap has been quite stable since \(2000\).
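The rejection of the null hypothesis can also be made explicit with \(t\)-statistics. The following sketch re-enters the rounded estimates and standard errors from the table above so that it runs on its own; the names `t_stat` and `p_value` are our own, and the \(p\)-values rely on the large-sample normal approximation.

```r
# rounded gap estimates and standard errors, copied from the table above
year   <- c(1992, 1996, 2000, 2004, 2008)
gap    <- c(3.23, 3.49, 4.14, 4.10, 4.10)
gap_se <- c(0.332, 0.354, 0.421, 0.356, 0.354)

# t-statistic for the null hypothesis that the gap is zero
t_stat <- gap / gap_se

# two-sided p-value based on the standard normal distribution
p_value <- 2 * pnorm(-abs(t_stat))

# collect the results in a data frame and print them
print(data.frame(year, t_stat, p_value), digits = 3)
```

In every year the absolute value of the \(t\)-statistic is far above the critical value \(1.96\), which is consistent with none of the \(95\%\) confidence intervals covering zero.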

### References

Wickham, H., & Bryan, J. (2018). readxl: Read Excel Files (Version 1.2.0). Retrieved from https://CRAN.R-project.org/package=readxl

Wickham, H., François, R., Henry, L., & Müller, K. (2018). dplyr: A Grammar of Data Manipulation (Version 0.7.8). Retrieved from https://CRAN.R-project.org/package=dplyr