
## 11.5 Exercises

#### 1. Titanic Survival Data

Chapter 11.2 presented three approaches to model the conditional expectation function of a binary dependent variable: the linear probability model as well as Probit and Logit regression.

The exercises in this chapter use data on the fate of the passengers of the ocean liner *Titanic*. We aim to explain survival, a binary variable, by socioeconomic variables using the above approaches.

In this exercise we start with the aggregated data set `Titanic`. It is part of the package `datasets` which is part of base `R`. The following quote from the description of the dataset motivates the attempt to predict the *probability* of survival:

*The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts — from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class — are reflected in the survival rates for various classes of passenger.*

**Instructions:**

- Assign the `Titanic` data to `Titanic_1` and get an overview.
- Visualize the conditional survival rates for travel class (`Class`), gender (`Sex`) and age (`Age`) using `mosaicplot()`.
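The general workflow for a multi-way contingency table can be sketched as follows. To leave the exercise itself open, this minimal example uses `UCBAdmissions` as a stand-in; like `Titanic`, it is an aggregated table shipped with the `datasets` package. The object names `adm` and `adm_by_gender` are chosen for illustration only.

```r
# UCBAdmissions is a stand-in here: a 3-way table (Admit x Gender x Dept)
adm <- UCBAdmissions

# overview: class, dimensions and category labels
class(adm)
dim(adm)
dimnames(adm)

# collapse over departments to obtain a Gender x Admit table
adm_by_gender <- margin.table(adm, margin = c(2, 1))

# conditional admission rates: proportions within each gender (rows sum to 1)
prop.table(adm_by_gender, margin = 1)

# mosaic plot: the area of each tile is proportional to the cell count
mosaicplot(adm_by_gender, main = "Admission by Gender", color = TRUE)
```

Passing the full table instead of a two-way margin to `mosaicplot()` produces a plot that splits the tiles by every dimension at once.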

#### 2. Titanic Survival Data — Ctd.

The `Titanic` data set from Exercise 1 is not useful for regression analysis because it is highly aggregated. In this exercise you will work with `titanic.csv` which is available under the URL https://stanford.io/2O9RUCF.

The columns of `titanic.csv` contain the following variables:

- `Survived`: the survival indicator
- `Pclass`: passenger class
- `Name`: passenger’s name
- `Sex`: passenger’s gender
- `Age`: passenger’s age
- `Siblings`: number of siblings aboard
- `Parents.Children.Aboard`: number of parents and children aboard
- `fare`: the fare paid in British pounds

**Instructions:**

- Import the data from `titanic.csv` using the function `read.csv2()`. Save it to `Titanic_2`.
- Assign the following column names to `Titanic_2`: `Survived, Class, Name, Sex, Age, Siblings, Parents` and `Fare`.
- Get an overview of the data set. Drop the column `Name`.
- Attach the packages `corrplot` and `dplyr`. Check whether there is multicollinearity in the data using `corrplot()`.

**Hints:**

- `read_csv()` guesses the column specification as well as the separators used in the `.csv` file. You should always check if the result is correct.
- You may use `select_if()` from the `dplyr` package to select all numeric columns from the data set.
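The import-and-check steps can be sketched with base `R` alone. The snippet below is a minimal illustration and not the exercise solution: the CSV content, the column names and the object names (`csv_text`, `dat`, `num_dat`, `cor_mat`) are made up so that the example runs without a download.

```r
# hypothetical comma-separated content standing in for a downloaded file
csv_text <- "y,group,name,x1,x2
1,1,A,22.0,7.25
0,3,B,38.0,71.28
1,3,C,26.0,7.93
0,1,D,35.0,53.10
1,2,E,54.0,51.86"

# read.csv() expects ',' as separator and '.' as decimal mark;
# read.csv2() expects ';' and ',' instead, so always inspect the result
dat <- read.csv(text = csv_text)
str(dat)

# assign new column names and drop the character column
colnames(dat) <- c("Y", "Group", "Name", "X1", "X2")
dat$Name <- NULL

# keep numeric columns only (base R counterpart of dplyr::select_if(dat, is.numeric))
num_dat <- dat[, sapply(dat, is.numeric)]

# matrix of pairwise correlations; entries close to +1 or -1 hint at multicollinearity
cor_mat <- cor(num_dat)
round(cor_mat, 2)
```

A correlation matrix such as `cor_mat` is the kind of input that `corrplot()` visualizes.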

#### 3. Titanic Survival Data — Survival Rates

Contingency tables similar to those provided by the data set `Titanic` from Exercise 1 may shed some light on the distribution of survival conditional on possible determinants thereof, e.g., the passenger class. Contingency tables are easily created using the base `R` function `table()`.

**Instructions:**

- Generate a contingency table for `Survived` and `Class` using `table()`. Save the table to `t_abs`.
- `t_abs` reports absolute frequencies. Transform `t_abs` into a table which reports relative frequencies (relative to the total number of observations). Save the result to `t_rel`.
- Visualize the relative frequencies in `t_rel` using `barplot()`. Use different colors to make the survival and non-survival rates easier to distinguish (it does not matter which colors you use).
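The steps from absolute to relative frequencies can be sketched on simulated data. Everything in this example (the variables `grp` and `out`, the probabilities, the object names) is an assumption made for illustration and is unrelated to the actual Titanic records.

```r
set.seed(1)

# simulated stand-in data: a binary outcome and a three-level group
n   <- 200
grp <- sample(1:3, n, replace = TRUE)
out <- rbinom(n, size = 1, prob = c(0.6, 0.45, 0.25)[grp])

# absolute frequencies: rows are outcomes, columns are groups
tab_abs <- table(out, grp)

# relative frequencies with respect to the total number of observations;
# prop.table(tab_abs) yields the same result
tab_rel <- tab_abs / sum(tab_abs)
tab_rel

# stacked bars, one per group, with one color per outcome
barplot(tab_rel,
        col = c("darkred", "steelblue"),
        legend.text = c("out = 0", "out = 1"),
        xlab = "grp",
        ylab = "relative frequency")
```

Proportions within each group, rather than relative to the total, would be obtained with `prop.table(tab_abs, margin = 2)`.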

#### 4. Titanic Survival Data — Conditional Distributions of `Age`

Contingency tables are useful for summarizing the distribution of categorical variables like `Survived` and `Class` in Exercise 3. They are, however, not useful when the variable of interest takes on many different integers (and they are even impossible to generate when the variable is continuous).

In this exercise you are asked to generate and visualize density estimates of the distribution of `Age` conditional on `Survived` to see whether there are indications of how age relates to the chance of survival (although the data set reports integers, we treat `Age` as a continuous variable here). For example, it is interesting to see if the ‘women and children first’ policy was effective.

The data set `Titanic_2` from the previous exercises is available in your working environment.

**Instructions:**

- Obtain kernel density estimates of the distributions of `Age` for both the survivors and the deceased.
- Save the results to `dens_age_surv` (survived) and `dens_age_died` (died).
- Plot both kernel density estimates (overlay them in a single plot!). Use different colors of your choice to make the estimates distinguishable.

**Hints:**

- Kernel density estimates can be obtained using the function `density()`.
- Use `plot()` and `lines()` to plot the density estimates.
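Overlaying two kernel density estimates can be sketched as follows. The two samples are simulated, and the names `x_a`, `x_b`, `dens_a` and `dens_b` are hypothetical; with real data that contain missing values, `density()` additionally needs `na.rm = TRUE`.

```r
set.seed(2)

# simulated stand-in: one continuous variable observed in two groups
x_a <- rnorm(150, mean = 28, sd = 12)
x_b <- rnorm(250, mean = 31, sd = 14)

# kernel density estimates (Gaussian kernel and default bandwidth)
dens_a <- density(x_a)
dens_b <- density(x_b)

# common upper limit for the y-axis so that neither curve is cut off
y_max <- max(dens_a$y, dens_b$y)

# plot() opens the figure with the first estimate, lines() adds the second
plot(dens_a,
     col = "steelblue", lwd = 2,
     ylim = c(0, y_max),
     main = "Kernel density estimates by group",
     xlab = "x")
lines(dens_b, col = "darkred", lwd = 2)
legend("topright",
       legend = c("group A", "group B"),
       col = c("steelblue", "darkred"),
       lwd = 2)
```

Setting `ylim` from both estimates matters because `plot()` otherwise scales the axes to the first curve only.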

#### 5. Titanic Survival Data — A Linear Probability Model for `Survival` I

How do socio-economic characteristics of the passengers impact the probability of survival? In particular, are there systematic differences between the three passenger classes? Do the data reflect the ‘children and women first’ policy?

It is natural to start the analysis by estimating a simple linear probability model (LPM) like \[Survived_i = \beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + u_i\] with dummy variables \(Class2_i\) and \(Class3_i\).

The data set `Titanic_2` from the previous exercises is available in your working environment.

**Instructions:**

- Attach the `AER` package.
- `Class` is of type `int` (integer). Convert `Class` to a factor variable.
- Estimate the linear probability model and save the result to `surv_mod`.
- Obtain a robust summary of the model coefficients.
- Use `surv_mod` to predict the probability of survival for the three passenger classes.

**Hints:**

- Linear probability models can be estimated using `lm()`.
- Use `predict()` to obtain the predictions. Remember that a `data.frame` must be provided to the argument `newdata`.
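A linear probability model with a factor regressor, a heteroskedasticity-robust coefficient summary and predictions for new observations can be sketched on simulated data. The data frame `sim`, its columns and the chosen probabilities are assumptions for illustration. `coeftest()` and `vcovHC()` come from the packages `lmtest` and `sandwich`, which are attached together with `AER`.

```r
library(AER)  # also attaches lmtest (coeftest) and sandwich (vcovHC)

set.seed(3)

# simulated stand-in data: binary outcome y and a three-level factor grp
n   <- 300
grp <- factor(sample(1:3, n, replace = TRUE))
y   <- rbinom(n, size = 1, prob = c(0.6, 0.45, 0.25)[as.integer(grp)])
sim <- data.frame(y = y, grp = grp)

# lm() creates the dummies grp2 and grp3 automatically; level "1" is the baseline
lpm <- lm(y ~ grp, data = sim)

# coefficient summary with heteroskedasticity-robust (HC1) standard errors
coeftest(lpm, vcov. = vcovHC, type = "HC1")

# newdata must be a data.frame whose columns match the regressors of the model
new_obs <- data.frame(grp = factor(1:3, levels = levels(sim$grp)))
pred    <- predict(lpm, newdata = new_obs)
pred
```

Robust standard errors are the natural choice here because the error term of a linear probability model is heteroskedastic by construction.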

#### 6. Titanic Survival Data — A Linear Probability Model for `Survival` II

Consider again the outcome from Exercise 5:

\[\widehat{Survived}_i = \underset{(0.03)}{0.63} - \underset{(0.05)}{0.16} Class2_i - \underset{(0.04)}{0.39} Class3_i \]

(The estimated coefficients in this model are related to the class specific sample means of `Survived`. You are asked to compute them below.)

The highly significant coefficients indicate that the probability of survival decreases with the passenger class, that is, passengers from a less luxurious class are less likely to survive.

This result could be affected by omitted variable bias arising from correlation of the passenger class with determinants of the probability of survival not included in the model. We therefore augment the model such that it includes all remaining variables as regressors.

The data set `Titanic_2` as well as the model `surv_mod` from the previous exercises are available in your working environment. The `AER` package is attached.

**Instructions:**

- Use the model object `surv_mod` to obtain the class specific estimates for the probability of survival. Store them in `surv_prob_c1`, `surv_prob_c2` and `surv_prob_c3`.
- Fit the augmented LPM and assign the result to the object `LPM_mod`.
- Obtain a robust summary of the model coefficients.

**Hint:**

- Remember that the formula `a ~ .` specifies a regression of `a` on all other variables in the data set provided as the argument `data` in `lm()`.
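The link between dummy coefficients and group-specific sample means, as well as the `a ~ .` shorthand, can be illustrated on simulated data. All names below (`sim`, `grp`, `x`, `mod_small`, `mod_full`) are hypothetical and the numbers are unrelated to the Titanic data.

```r
set.seed(4)

# simulated stand-in data: binary outcome, three-level factor, one further regressor
n   <- 300
grp <- factor(sample(1:3, n, replace = TRUE))
x   <- rnorm(n)
y   <- rbinom(n, size = 1, prob = c(0.6, 0.45, 0.25)[as.integer(grp)])
sim <- data.frame(y = y, grp = grp, x = x)

# model with group dummies only
mod_small <- lm(y ~ grp, data = sim)
b <- coef(mod_small)

# fitted probability per group: intercept plus the coefficient on the group's dummy
p1 <- unname(b["(Intercept)"])
p2 <- unname(b["(Intercept)"] + b["grp2"])
p3 <- unname(b["(Intercept)"] + b["grp3"])

# the two rows agree: the estimates reproduce the group-specific sample means of y
rbind(from_coef   = c(p1, p2, p3),
      sample_mean = tapply(sim$y, sim$grp, mean))

# 'y ~ .' regresses y on all remaining columns of sim (here: grp and x)
mod_full <- lm(y ~ ., data = sim)
names(coef(mod_full))
```

The equality holds because a regression on a full set of group dummies is saturated; it no longer holds exactly once further regressors such as `x` enter the model.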

#### 7. Titanic Survival Data — Logistic Regression

Chapter 11.2 introduces Logistic regression, also called Logit regression, which is more suitable than the LPM for modelling the conditional probability function of a dichotomous outcome variable. Logit regression uses a nonlinear link function that restricts the fitted values to lie between \(0\) and \(1\): in Logit regression, the *log-odds* of the outcome are modeled as a linear combination of the predictors, while the LPM assumes that the conditional probability function of the outcome is linear.

The data set `Titanic_2` from Exercise 2 is available in your working environment. The package `AER` is attached.

**Instructions:**

- Use `glm()` to estimate the model \[\begin{align*} \log\left(\frac{P(Survived_i = 1)}{1-P(Survived_i = 1)}\right) =& \, \beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + \beta_3 Sex_i \\ +& \, \beta_4 Age_i + \beta_5 Siblings_i + \beta_6 Parents_i + \beta_7 Fare_i. \end{align*}\] Save the result to `Logit_mod`.
- Obtain a robust summary of the model coefficients.
- The data frame `passengers` contains data on three hypothetical male passengers that differ only in their passenger class (the other variables are set to the respective sample average). Use `Logit_mod` to predict the probability of survival for these passengers.

**Hints:**

- Remember that the formula `a ~ .` specifies a regression of `a` on all other variables in the data set provided as the argument `data` in `glm()`.
- You need to specify the correct type of prediction in `predict()`.
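The role of the prediction type can be sketched with a Logit model fitted to simulated data. The data-generating process, the data frame `sim` and all object names are assumptions for illustration; `coeftest()` and `vcovHC()` are available once `AER` is attached.

```r
library(AER)  # also attaches lmtest (coeftest) and sandwich (vcovHC)

set.seed(5)

# simulated stand-in data generated from a Logit model
n   <- 500
grp <- factor(sample(1:3, n, replace = TRUE))
x   <- rnorm(n)
eta <- 0.5 - 0.6 * (grp == "2") - 1.2 * (grp == "3") + 0.8 * x
y   <- rbinom(n, size = 1, prob = plogis(eta))
sim <- data.frame(y = y, grp = grp, x = x)

# Logit regression of y on all other columns of sim
logit_fit <- glm(y ~ ., family = binomial(link = "logit"), data = sim)

# coefficient summary with robust (HC1) standard errors
coeftest(logit_fit, vcov. = vcovHC, type = "HC1")

# three hypothetical observations that differ only in grp; x is set to its sample mean
new_obs <- data.frame(grp = factor(1:3, levels = levels(sim$grp)),
                      x   = mean(sim$x))

# default type = "link" returns log-odds, type = "response" returns probabilities
pred_link <- predict(logit_fit, newdata = new_obs)
pred_prob <- predict(logit_fit, newdata = new_obs, type = "response")

cbind(log_odds = pred_link, probability = pred_prob)
```

The two prediction types are connected through the logistic function: applying `plogis()` to the log-odds gives the probabilities.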

#### 8. Titanic Survival Data — Probit Regression

Repeat Exercise 7 but this time estimate the Probit model \[\begin{align*} P(Survived_i = 1\vert Class2_i, Class3_i, \dots, Fare_i) =& \, \Phi (\beta_0 + \beta_1 Class2_i + \beta_2 Class3_i + \beta_3 Sex_i \\ +& \, \beta_4 Age_i + \beta_5 Siblings_i + \beta_6 Parents_i + \beta_7 Fare_i). \end{align*}\]

The data set `Titanic_2` from the previous exercises as well as the Logit model `Logit_mod` are available in your working environment. The package `AER` is attached.

**Instructions:**

- Use `glm()` to estimate the above Probit model. Save the result to `Probit_mod`.
- Obtain a robust summary of the model coefficients.
- The data frame `passengers` contains data on three hypothetical male passengers that differ only in their passenger class (the other variables are set to the respective sample average). Use `Probit_mod` to predict the probability of survival for these passengers.

**Hints:**

- Remember that the formula `a ~ .` specifies a regression of `a` on all other variables in the data set provided as the argument `data` in `glm()`.
- You need to specify the correct type of prediction in `predict()`.
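How a Probit prediction arises from the estimated coefficients can be sketched on simulated data. The regressors `x` and `d`, the true coefficients and the object names are made up for illustration only.

```r
set.seed(6)

# simulated stand-in data generated from a Probit model
n   <- 500
x   <- rnorm(n)
d   <- rbinom(n, size = 1, prob = 0.5)
y   <- rbinom(n, size = 1, prob = pnorm(-0.3 + 0.9 * x - 0.7 * d))
sim <- data.frame(y = y, x = x, d = d)

# Probit regression: binomial family with the probit link
probit_fit <- glm(y ~ ., family = binomial(link = "probit"), data = sim)

# two hypothetical observations that differ only in d; x is set to its sample mean
new_obs <- data.frame(x = mean(sim$x), d = c(0, 1))

# predicted probabilities via predict()
pred_prob <- predict(probit_fit, newdata = new_obs, type = "response")

# the same by hand: standard normal CDF evaluated at the linear index x'beta
X_new       <- cbind(1, new_obs$x, new_obs$d)
pred_manual <- pnorm(drop(X_new %*% coef(probit_fit)))

cbind(predict = pred_prob, by_hand = pred_manual)
```

Both columns coincide, which shows that `type = "response"` applies \(\Phi(\cdot)\) to the estimated linear index.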