4.3 Measures of Fit


After fitting a linear regression model, a natural question is how well the model describes the data. Visually, this amounts to assessing whether the observations are tightly clustered around the regression line. Both the coefficient of determination and the standard error of the regression measure how well the OLS regression line fits the data.

The Coefficient of Determination

\(R^2\), the coefficient of determination, is the fraction of the sample variance of \(Y_i\) that is explained by \(X_i\). Mathematically, the \(R^2\) can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares (\(ESS\)) is the sum of squared deviations of the predicted values \(\hat{Y_i}\) from the average of the \(Y_i\). The total sum of squares (\(TSS\)) is the sum of squared deviations of the \(Y_i\) from their average. Thus we have

\[\begin{align} ESS & = \sum_{i = 1}^n \left( \hat{Y_i} - \overline{Y} \right)^2, \\ TSS & = \sum_{i = 1}^n \left( Y_i - \overline{Y} \right)^2, \\ R^2 & = \frac{ESS}{TSS}. \end{align}\]

Since \(TSS = ESS + SSR\) we can also write

\[ R^2 = 1- \frac{SSR}{TSS}, \]

where \(SSR\) is the sum of squared residuals, a measure for the errors made when predicting \(Y\) by \(X\). The \(SSR\) is defined as

\[ SSR = \sum_{i=1}^n \hat{u}_i^2. \]
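
The decomposition \(TSS = ESS + SSR\) can be checked in a few lines. The following is a sketch, assuming the regression includes an intercept so that the OLS first-order conditions \(\sum_{i=1}^n \hat{u}_i = 0\) and \(\sum_{i=1}^n \hat{u}_i X_i = 0\) hold, and writing the predicted values as \(\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i\) (this notation for the estimated coefficients is an assumption of the sketch). Since \(Y_i - \overline{Y} = (\hat{Y_i} - \overline{Y}) + \hat{u}_i\), squaring and summing gives

\[\begin{align} TSS & = \sum_{i = 1}^n \left( \hat{Y_i} - \overline{Y} \right)^2 + \sum_{i = 1}^n \hat{u}_i^2 + 2 \sum_{i = 1}^n \left( \hat{Y_i} - \overline{Y} \right) \hat{u}_i \\ & = ESS + SSR + 2 \left( \hat{\beta}_0 \sum_{i = 1}^n \hat{u}_i + \hat{\beta}_1 \sum_{i = 1}^n \hat{u}_i X_i - \overline{Y} \sum_{i = 1}^n \hat{u}_i \right) \\ & = ESS + SSR. \end{align}\]

The cross term vanishes because each of the three sums in parentheses is zero under the stated first-order conditions.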

\(R^2\) lies between \(0\) and \(1\). A perfect fit, i.e., no errors made when fitting the regression line, implies \(SSR=0\) and hence \(R^2 = 1\). Conversely, if the estimated regression line does not explain any variation in the \(Y_i\), we have \(ESS=0\) and consequently \(R^2=0\).
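
To make these definitions concrete, the following minimal sketch computes \(ESS\), \(TSS\) and \(SSR\) for a small simulated data set. The data and the names `x`, `y` and `fit` are hypothetical and chosen only for illustration; they are not the test score data used later in this section.

```r
# simulated data (hypothetical, for illustration only)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

# fit a simple linear regression with an intercept
fit <- lm(y ~ x)

# the three sums of squares
ESS <- sum((fitted(fit) - mean(y))^2)   # explained sum of squares
TSS <- sum((y - mean(y))^2)             # total sum of squares
SSR <- sum(residuals(fit)^2)            # sum of squared residuals

# the decomposition TSS = ESS + SSR holds up to floating-point error
all.equal(TSS, ESS + SSR)

# both formulas for R^2 agree with each other and with summary()
R2_ratio <- ESS / TSS
R2_resid <- 1 - SSR / TSS
all.equal(R2_ratio, R2_resid)
all.equal(R2_ratio, summary(fit)$r.squared)
```

The comparisons use all.equal() rather than `==` because the quantities are computed along different routes and may differ in the last few floating-point digits.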

The Standard Error of the Regression

The Standard Error of the Regression (\(SER\)) is an estimator of the standard deviation of the regression error \(u_i\). As such it measures the magnitude of a typical deviation from the regression line, i.e., the magnitude of a typical residual.

\[ SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \ \ \ \text{where} \ \ \ s_{\hat{u} }^2 = \frac{1}{n-2} \sum_{i = 1}^n \hat{u}^2_i = \frac{SSR}{n - 2} \]

Remember that the \(u_i\) are unobserved. This is why we use their estimated counterparts, the residuals \(\hat{u}_i\), instead. See Chapter 4.3 of the book for a more detailed comment on the \(SER\).
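
The role of the divisor \(n-2\) can be illustrated with a short sketch on simulated data, where the errors are known by construction. The data and all object names below are hypothetical. The sketch assumes that sigma(), available in base R since version 3.3.0, returns the residual standard error of an lm object.

```r
# simulated data (hypothetical): here the true errors u are known,
# which is never the case with real data
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
u <- rnorm(n, sd = 3)
y <- 1 + 2 * x + u

fit <- lm(y ~ x)
u_hat <- residuals(fit)

# SER with the degrees-of-freedom correction n - 2
SER <- sqrt(sum(u_hat^2) / (n - 2))

# agrees with the residual standard error stored for the fitted model
all.equal(SER, sigma(fit))

# sd() divides by n - 1 instead of n - 2, so it is slightly smaller
sd(u_hat) < SER
```

Two degrees of freedom are used up by estimating the intercept and the slope, which is why the sum of squared residuals is divided by \(n-2\) rather than by \(n-1\).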

Application to the Test Score Data

Both measures of fit can be obtained by using the function summary() with an lm object provided as the only argument. While the function lm() only prints out the estimated coefficients to the console, summary() provides additional predefined information such as the regression’s \(R^2\) and the \(SER\).

mod_summary <- summary(linear_model)
mod_summary
#> 
#> Call:
#> lm(formula = score ~ STR, data = CASchools)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -47.727 -14.251   0.483  12.822  48.540 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
#> STR          -2.2798     0.4798  -4.751 2.78e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18.58 on 418 degrees of freedom
#> Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
#> F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

The \(R^2\) in the output is called Multiple R-squared and has a value of \(0.051\). Hence, \(5.1 \%\) of the variance of the dependent variable \(score\) is explained by the explanatory variable \(STR\). That is, the regression explains little of the variance in \(score\), and much of the variation in test scores remains unexplained (cf. Figure 4.3 of the book).

The \(SER\) is called Residual standard error and equals \(18.58\). The unit of the \(SER\) is the same as the unit of the dependent variable. That is, a typical deviation of the actual test score from the regression line is about \(18.58\) points.
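
Rather than reading these numbers off the printed output, they can also be extracted from the summary object. The sketch below uses simulated data with hypothetical names so that it runs on its own; the same components (`r.squared`, `adj.r.squared`, `sigma`, `df`) should be available for any object returned by summary() on an lm fit.

```r
# simulated data (hypothetical, for illustration only)
set.seed(42)
x <- rnorm(200)
y <- 1 - 0.3 * x + rnorm(200)

s <- summary(lm(y ~ x))

# measures of fit stored in the summary object
s$r.squared       # "Multiple R-squared" in the printed output
s$adj.r.squared   # "Adjusted R-squared"
s$sigma           # "Residual standard error", i.e., the SER

# residual degrees of freedom, n - 2 in a simple regression
s$df[2]
```

Extracting the components gives the values at full precision, which is convenient when they are reused in further computations.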

Now, let us check whether summary() uses the same definitions for \(R^2\) and \(SER\) as we do when computing them manually.

# compute R^2 manually
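SSR <- sum(mod_summary$residuals^2)
# refer to the column explicitly so the code does not rely on attach(CASchools)
TSS <- sum((CASchools$score - mean(CASchools$score))^2)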
R2 <- 1 - SSR/TSS

# print the value to the console
R2
#> [1] 0.05124009

# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))

# print the value to the console
SER
#> [1] 18.58097

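We find that the results coincide. Note that summary() prints rounded values: by default, both \(R^2\) and the \(SER\) are displayed with four significant digits, whereas the manual computations show more digits.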