
3.8 Exercises

1. Biased …

Consider the following alternative estimator for $\mu_Y$, the mean of the $Y_i$:

$\widetilde{Y}=\frac{1}{n-1}\sum\limits_{i=1}^n Y_i$

In this exercise we will illustrate that this estimator is a biased estimator for $\mu_Y$.
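
A short calculation shows where the bias comes from. Assuming the $Y_i$ are identically distributed with mean $\mu_Y$, linearity of the expectation gives

$E(\widetilde{Y})=\frac{1}{n-1}\sum\limits_{i=1}^n E(Y_i)=\frac{n}{n-1}\mu_Y$

so that the bias is $E(\widetilde{Y})-\mu_Y=\frac{\mu_Y}{n-1}$. For $n=5$ and $\mu_Y=10$ the estimates should therefore be centered around $12.5$ rather than $10$.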

Instructions:

• Define a function Y_tilde that implements the estimator above.

• Randomly draw 5 observations from the $\mathcal{N}(10, 25)$ distribution and compute an estimate using Y_tilde(). Repeat this procedure 10000 times and store the results in est_biased.

• Plot a histogram of est_biased.

• Add a red vertical line at $\mu=10$ using the function abline().

Hints:

• To compute the sum of a vector you can use sum(), to get the length of a vector you can use length().

• Use the function replicate() to compute estimates for repeatedly drawn random samples. With the arguments expr and n you can specify the operation and how often it is replicated.

• A histogram can be plotted with the function hist().

• The point on the x-axis as well as the color for the vertical line can be specified via the arguments v and col.
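
The pattern described in the hints can be sketched as follows. To leave the exercise open, this sketch uses the ordinary sample mean mean() instead of Y_tilde(); the object name est_mean and the seed are illustrative assumptions. Note that $\mathcal{N}(10, 25)$ states the variance, whereas rnorm() expects the standard deviation.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# N(10, 25) has variance 25, so rnorm() needs sd = sqrt(25) = 5
est_mean <- replicate(n = 10000, expr = mean(rnorm(5, mean = 10, sd = 5)))

# histogram of the 10000 estimates
hist(est_mean)

# red vertical line at the true mean
abline(v = 10, col = "red")
```

Because the sample mean is unbiased, this histogram should be centered at the red line. The exercise asks how the picture changes for Y_tilde().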

2. … but consistent estimator

Consider again the estimator from the previous exercise. It is available in your environment as the function Y_tilde(). Repeat the procedure from the previous exercise, but this time increase the number of observations drawn from 5 to 1000.
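
The effect of the larger sample can be anticipated with a short calculation. Assuming the $Y_i$ are independent and identically distributed with mean $\mu_Y$ and variance $\sigma_Y^2$ (the symbol $\sigma_Y^2$ is notation introduced here), we have

$E(\widetilde{Y})-\mu_Y=\frac{n}{n-1}\mu_Y-\mu_Y=\frac{\mu_Y}{n-1}$

$\text{Var}(\widetilde{Y})=\frac{n}{(n-1)^2}\sigma_Y^2$

Both terms vanish as $n\to\infty$. With $\mu_Y=10$ the bias is $\frac{10}{4}=2.5$ for $n=5$ but only $\frac{10}{999}\approx 0.01$ for $n=1000$.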

Instructions:

• Randomly draw 1000 observations from the $\mathcal{N}(10, 25)$ distribution and compute an estimate of the mean using Y_tilde(). Repeat this procedure 10000 times and store the results in est_consistent.

• Plot a histogram of est_consistent.

• Add a red vertical line at $\mu=10$ using the function abline().

Hints:

• Use the function replicate() to compute estimates of repeatedly drawn random samples. Using the arguments expr and n you may specify the operation and how often it will be replicated.

• A histogram can be plotted with the function hist().

• The position on the x-axis as well as the color for the vertical line can be specified via the arguments v and col.

3. Efficiency of an Estimator

In this exercise we want to illustrate the result that the sample mean

$\hat{\mu}_Y=\sum\limits_{i=1}^{n}a_iY_i$ with the equal weighting scheme $a_i=\frac{1}{n}$ for $i=1,...,n$ is the best linear unbiased estimator (BLUE) of $\mu_Y$.

As an alternative, consider the estimator

$\tilde{\mu}_Y=\sum\limits_{i=1}^{n}b_iY_i$

where $b_i$ gives the first $\frac{n}{2}$ observations a higher weighting than the second $\frac{n}{2}$ observations (we assume that $n$ is even for simplicity).

The vector of weights w has been defined already and is available in your working environment.

Instructions:

• Verify that $\tilde{\mu}_Y$ is an unbiased estimator of $\mu_Y$, the mean of the $Y_i$.

• Implement the alternative estimator of $\mu_Y$ as a function mu_tilde().

• Randomly draw 100 observations from the $\mathcal{N}(5, 10)$ distribution and compute estimates with both estimators. Repeat this procedure 10000 times and store the results in est_bar and est_tilde.

• Compute the sample variances of est_bar and est_tilde. What can you say about both estimators?

Hints:

• In order for $\tilde{\mu}_Y$ to be an unbiased estimator, all weights have to sum up to 1.

• Use the function replicate() to compute estimates of repeatedly drawn samples. With the arguments expr and n you can specify the operation and how often it is replicated.

• You may use var() to compute the sample variance.
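
The hint on the weights follows from linearity of the expectation. As a sketch, assume the $Y_i$ are independent and identically distributed with mean $\mu_Y$ and variance $\sigma_Y^2$ (the symbol $\sigma_Y^2$ is notation introduced here). Then

$E\left(\sum\limits_{i=1}^{n}b_iY_i\right)=\sum\limits_{i=1}^{n}b_iE(Y_i)=\mu_Y\sum\limits_{i=1}^{n}b_i$

$\text{Var}\left(\sum\limits_{i=1}^{n}b_iY_i\right)=\sigma_Y^2\sum\limits_{i=1}^{n}b_i^2$

The first line equals $\mu_Y$ exactly when $\sum_{i=1}^{n}b_i=1$. For the second line, the Cauchy-Schwarz inequality gives $\left(\sum_{i=1}^{n}b_i\right)^2\leq n\sum_{i=1}^{n}b_i^2$, so weights that sum to 1 satisfy $\sum_{i=1}^{n}b_i^2\geq\frac{1}{n}$, with equality only for $b_i=\frac{1}{n}$. The simulated variances can be compared against this bound.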

4. Hypothesis Test — $t$-statistic

Consider the CPS dataset from Chapter 3.6 again. The dataset cps is available in your working environment.

We suppose that the average hourly earnings (in prices of 2012) ahe12 exceed 23.50 $\$/h$ and wish to test this hypothesis at a significance level of $\alpha=0.05$. Please do the following:

Instructions:

• Compute the test statistic by hand and assign it to tstat.

• Use tstat to accept or reject the null hypothesis. Please do so using the normal approximation.

Hints:

• We test $H_0:\mu_{Y_{ahe}}\leq 23.5$ vs. $H_1:\mu_{Y_{ahe}}>23.5$. That is, we conduct a right-sided test.

• The $t$-statistic is defined as $\frac{\bar{Y}-\mu_{Y,0}}{s_{Y}/\sqrt{n}}$ where $s_Y$ denotes the sample standard deviation.

• To decide whether the null hypothesis is accepted or rejected you can compare the $t$-statistic with the respective quantile of the standard normal distribution. Use logical operators.
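
The formula and the decision rule from the hints can be sketched on simulated data. The vector y_demo, the seed, and all names ending in _demo are hypothetical stand-ins and have nothing to do with the cps dataset.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# hypothetical sample (NOT the cps data)
y_demo <- rnorm(200, mean = 24, sd = 10)
mu_0   <- 23.5  # hypothesised mean under H0

# t-statistic: (sample mean - hypothesised mean) / standard error,
# where sd() returns the sample standard deviation
tstat_demo <- (mean(y_demo) - mu_0) / (sd(y_demo) / sqrt(length(y_demo)))

# right-sided test at alpha = 0.05: reject H0 if the statistic exceeds
# the 0.95 quantile of the standard normal distribution
reject_demo <- tstat_demo > qnorm(0.95)
reject_demo
```

The logical value reject_demo is TRUE if the null hypothesis is rejected and FALSE otherwise.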

5. Hypothesis Test — $p$-value

Reconsider the test situation from the previous exercise. The dataset cps as well as the vector tstat are available in your working environment.

Instead of using the $t$-statistic as decision criterion you may also use the $p$-value. Now please do the following:

Instructions:

• Compute the $p$-value by hand and assign it to pval.

• Use pval to accept or reject the null hypothesis.

Hints:

• The $p$-value for a right-sided test can be computed as $p=P(t>t^{act}|H_0)$.

• We reject the null if $p<\alpha$. Use logical operators to check for this.
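
Under the normal approximation, the probability in the first hint is an upper-tail probability of the standard normal distribution. A minimal sketch, using a made-up value tstat_demo rather than the statistic from the cps data:

```r
# hypothetical value of a t-statistic (NOT computed from the cps data)
tstat_demo <- 1.96
alpha      <- 0.05

# right-sided p-value under the normal approximation: P(Z > t_act)
pval_demo <- pnorm(tstat_demo, lower.tail = FALSE)

# TRUE means that the null hypothesis is rejected at level alpha
pval_demo < alpha
```

Using lower.tail = FALSE is equivalent to computing 1 - pnorm(tstat_demo).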

6. Hypothesis Test — One Sample $t$-test

In the last two exercises we discussed two ways of conducting a hypothesis test. These approaches are somewhat cumbersome to apply by hand, which is why R provides the function t.test(). It does most of the work automatically. t.test() provides $t$-statistics, $p$-values and even confidence intervals (more on the latter in later exercises). Note that t.test() uses the $t$-distribution instead of the normal distribution, which becomes important when the sample size is small.

The dataset cps and the variable pval from Exercise 5 are available in your working environment.

Instructions:

• Conduct the hypothesis test from previous exercises using the function t.test().

• Extract the $t$-statistic and the $p$-value from the list created by t.test(). Assign them to the variables tstat and pvalue.

• Verify that using the normal approximation here is valid as well by computing the difference between both $p$-values.

Hints:

• The type of the test as well as the null hypothesis can be specified via the arguments alternative and mu.

• The $t$-statistic and the $p$-value can be obtained via \$statistic and \$p.value, respectively.
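
How these arguments and list elements fit together can be sketched on simulated data. The vector y_demo and all other names ending in _demo are hypothetical and do not refer to the cps dataset.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# hypothetical sample (NOT the cps data)
y_demo <- rnorm(200, mean = 24, sd = 10)

# right-sided one sample t-test of H0: mu <= 23.5
test_demo <- t.test(y_demo, alternative = "greater", mu = 23.5)

# extract the t-statistic and the p-value from the returned list
tstat_demo  <- test_demo$statistic
pvalue_demo <- test_demo$p.value

# p-value based on the normal approximation, for comparison
pval_norm_demo <- pnorm(tstat_demo, lower.tail = FALSE)
pvalue_demo - pval_norm_demo
```

With 200 observations the $t$-distribution is already close to the standard normal distribution, so the difference between the two $p$-values should be small.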

7. Hypothesis Test — Two Sample $t$-test

Consider the annual maximum sea levels at Port Pirie (South Australia) and Fremantle (Western Australia) for the last 30 years.

The observations are made available as vectors portpirie and fremantle in your working environment.

Instructions:

• Test whether there is a significant difference in the annual maximum sea levels at a significance level of $\alpha=0.05$.

Hints:

• We test $H_0:\mu_{P}-\mu_{F}=0$ vs. $H_1:\mu_{P}-\mu_{F}\ne 0$. That is, we conduct a two sample $t$-test.

• For a two sample $t$-test the function t.test() expects two vectors containing the data.
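
The call can be sketched with two simulated vectors. The vectors a_demo and b_demo are hypothetical and are not the sea level data.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# two hypothetical samples (NOT the sea level data)
a_demo <- rnorm(30, mean = 0.5, sd = 1)
b_demo <- rnorm(30, mean = 0, sd = 1)

# two-sided two sample t-test of H0: difference in means = 0;
# by default t.test() does not assume equal variances (Welch test)
test2_demo <- t.test(a_demo, b_demo)

# TRUE means that H0 is rejected at the 5% level
test2_demo$p.value < 0.05
```

The default alternative = "two.sided" matches the hypotheses stated in the hints, so no further arguments are needed in this sketch.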

8. Confidence Interval

Reconsider the test situation concerning the annual maximum sea levels at Port Pirie and Fremantle.

The variables portpirie and fremantle are again available in your working environment.

Instructions:

• Construct a $95\%$-confidence interval for the difference in the sea levels using t.test().

Hint:

• The function t.test() computes a $95\%$ confidence interval by default. This is accessible via \$conf.int.
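
A minimal sketch of how the interval is extracted, again on simulated vectors a_demo and b_demo, which are hypothetical and not the sea level data:

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# two hypothetical samples (NOT the sea level data)
a_demo <- rnorm(30, mean = 0.5, sd = 1)
b_demo <- rnorm(30, mean = 0, sd = 1)

# 95% confidence interval for the difference in means
ci_demo <- t.test(a_demo, b_demo)$conf.int
ci_demo

# the confidence level is stored as an attribute of the interval
attr(ci_demo, "conf.level")

# a different level can be requested via the argument conf.level
ci99_demo <- t.test(a_demo, b_demo, conf.level = 0.99)$conf.int
```

The interval is linked to the two-sided test: the $95\%$ interval contains 0 exactly when the test does not reject $H_0$ at the $5\%$ level.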

9. (Co)variance and Correlation I

Consider a random sample $(X_i, Y_i)$ for $i=1,...,100$.

The respective vectors X and Y are available in your working environment.

Instructions:

• Compute the variance of $X$ using the function cov().

• Compute the covariance of $X$ and $Y$.

• Compute the correlation between $X$ and $Y$.

Hints:

• The variance is a special case of the covariance.

• cov() as well as cor() expect a vector for each variable.
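
The relations between the three measures can be checked on simulated data. The vectors x_demo and y_demo are hypothetical and are not the X and Y from the exercise.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

# hypothetical sample of 100 pairs (NOT the X and Y from the exercise)
x_demo <- rnorm(100)
y_demo <- 0.5 * x_demo + rnorm(100)

# the covariance of a variable with itself is its variance
all.equal(cov(x_demo, x_demo), var(x_demo))

# sample covariance and correlation of the two variables
cov_demo <- cov(x_demo, y_demo)
cor_demo <- cor(x_demo, y_demo)

# the correlation is the covariance scaled by both standard deviations
all.equal(cor_demo, cov_demo / (sd(x_demo) * sd(y_demo)))
```

Both all.equal() calls should return TRUE.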

10. (Co)variance and Correlation II

In this exercise we want to examine the limitations of the correlation as a dependency measure.

Once the session has initialized you will see the plot of 100 realizations from two random variables $X$ and $Y$.

The respective observations are available in the vectors X and Y in your working environment.

Instructions:

• Compute the correlation between $X$ and $Y$. Interpret your result critically.

Hint:

• cor() expects a vector for each variable.
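
One limitation of the correlation can be sketched with a constructed example in which one variable is an exact function of the other. The vectors x_demo and y_demo are hypothetical, and the data in the exercise may look different.

```r
# hypothetical data: a symmetric grid around zero and its square
x_demo <- seq(-1, 1, length.out = 101)
y_demo <- x_demo^2  # y_demo is fully determined by x_demo

# the correlation only measures linear dependence
cor_demo <- cor(x_demo, y_demo)
cor_demo
```

Here the correlation is zero up to floating point error although y_demo depends perfectly on x_demo, because the dependence is not linear.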