Dustin Lennon

Applied Scientist

Annotations: Elements of Statistical Learning

My margin notes from reading Hastie, Tibshirani, and Friedman

Dustin Lennon
March 2021



Mathematicians value brevity; details are left to the reader. As a lifetime consumer of mathematical ideas, I’ve often wondered if the authors have given too much credit to their audience. I find myself wanting to understand every step of a logical argument, and missing pieces give me pause. As such, I find it helpful to document, and hopefully elucidate, these points of confusion. Perhaps others will find this exercise useful.

Here, I cover one of my favorite machine learning books, “The Elements of Statistical Learning,” written by Hastie, Tibshirani, and Friedman (HTF). Specific references are to the 2009 print edition.

Chapter 2

Expected Prediction Error: Linear Regression

Expected prediction error, or test error, is one of the key ideas in Chapter 2. In this context, HTF describe a bias-variance tradeoff and work out the decomposition for both linear regression and nearest neighbors. For linear regression, this happens in 2.27 and 2.28, and the details are left to the reader as Exercise 2.5.

My confusion here started with the double expectations:

\[ \Evy{ \EvTb{ \left( y_0 - \yhat \right)^2 } } \]

Recall that the goal is to estimate the prediction error at a new location, \(x_0\). From 2.26, we could write,

\[ y_0 = \xbeta + \varepsilon_0 \]

with \(x_0\) known and \(\beta\) an unknown constant. The only randomness enters through \(\varepsilon_0\), so we associate the outer expectation, \(\Evy{}\), with the distribution of \(y_0\) at the fixed location \(x_0\).

To make progress, let \(\yhat\) be an estimator of \(y_0\). As an estimator, it is a function of the data, so we require a sample drawn from \(\sspace\). For the linear regression problem, the estimator is

\[ \yhat = \xbhat \]

The variability associated with the sample may be ascribed to the inner expectation, \(\EvT{}\). A different sample from the population distribution would produce a different realization of \(\bhat\).

Now we proceed by adding and subtracting \(\xbeta\) and \(\emean\) inside the original expression:

\[ \begin{align*} \left( y_0 - \yhat \right)^2 & = \left( \color{red}{(y_0 - \xbeta)} + \color{green}{(\xbeta - \emean)} + \color{blue}{(\emean - \yhat)} \right)^2 \\ & = \color{red}{(y_0 - \xbeta)^2} \\ & \eqi + \color{green}{(\xbeta - \emean)^2} \\ & \eqi + \color{blue}{(\emean - \yhat)^2} \\ & \eqi + 2 \color{red}{(y_0 - \xbeta)} \color{green}{(\xbeta - \emean)} \\ & \eqi + 2 \color{red}{(y_0 - \xbeta)} \color{blue}{(\emean - \yhat)} \\ & \eqi + 2 \color{green}{(\xbeta - \emean)} \color{blue}{(\emean - \yhat)} \end{align*} \]

Consider the cross product terms above:

  • The green terms are constant with respect to both \(\Evy{}\) and \(\EvT{}\).
  • The red terms are constant with respect to \(\EvT{}\) and have zero expectation under \(\Evy{}\).
  • The blue terms have zero expectation under \(\EvT{}\).
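
As a worked instance of the bullets above, take the red-blue cross term. The red factor does not depend on the training sample, so it can be pulled out of the inner expectation, and what remains has zero mean:

\[ \Evy{ \EvTb{ 2 \color{red}{(y_0 - \xbeta)} \color{blue}{(\emean - \yhat)} } } = 2 \, \Evys{ \color{red}{(y_0 - \xbeta)} } \, \EvTs{ \color{blue}{(\emean - \yhat)} } = 0 \]

The other two cross terms follow the same pattern, with the green factor playing the role of the constant.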

Thus, all the cross product terms will be zero under the nested expectations, and, generally:

\[ \begin{align} \Evy{ \EvTb{ \left( y_0 - \yhat \right)^2 } } & = \Evys{ \color{red}{(y_0 - \xbeta)^2} } + \color{green}{(\xbeta - \emean)^2} + \EvTs{ \color{blue}{(\emean - \yhat)^2} } \nonumber \\ & = \sigma^2 + \mbox{Bias($\yhat$)}^2 + \VarOp_\sspace \yhat \label{decomposition} \end{align} \]

For the linear regression case, we can say more. In particular, the estimator \(\yhat\) is unbiased; that is, \(\EvT{\yhat} = \xbeta\). This follows from writing \(\yhat = \xbhat = \xbeta + \underbrace{x_0^{\top} (X^{\top} X)^{-1} X^{\top}}_{l(X)^{\top}} \varepsilon\) and noting that

\[ \EvTs{l(X)^{\top} \varepsilon} = \Evb{ \Evs{ l(X)^{\top} \varepsilon | X } } = 0 \]

We can arrive at the final equation in 2.27 by simplifying the variance term above:

\[ \begin{align*} \EvT{ (\yhat - \emean)^2 } & = \EvT{(\yhat - \xbeta)^2} \\ & = \EvT{(l(X)^{\top} \varepsilon)^2} \\ & = \EvTb{ \left( \sum_{i=1}^N l(X)_i \varepsilon_i \right)^2 } \\ & = \EvTb{ \sum_{i=1}^N l(X)_i^2 \varepsilon_i^2 } \\ & = \sigma^2 \EvTs{ x_0^{\top} (X^{\top} X)^{-1} x_0 } \end{align*} \]

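noting from the model assumptions that \(\varepsilon\) and \(X\) are independent, and that \(\varepsilon_i\) and \(\varepsilon_j\) are independent with mean zero for \(i \neq j\), so the cross terms in the squared sum vanish. The last step uses \(l(X)^{\top} l(X) = x_0^{\top} (X^{\top} X)^{-1} X^{\top} X (X^{\top} X)^{-1} x_0 = x_0^{\top} (X^{\top} X)^{-1} x_0\).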
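
One way to build confidence in this result is a small Monte Carlo check. The sketch below is not from HTF; it assumes a Gaussian design, and the coefficients `beta`, the location `x0`, the noise level `sigma`, and the sample sizes are all hypothetical choices made for illustration. It draws many training samples, predicts at \(x_0\), and compares the average squared error with \(\sigma^2 + \sigma^2 \, \mathrm{E}[x_0^{\top} (X^{\top} X)^{-1} x_0]\).

```python
import numpy as np


def simulate_epe(n=50, p=3, sigma=1.0, n_reps=20000, seed=0):
    """Compare simulated prediction error at x0 with the formula in 2.27.

    All constants here (beta, x0, the Gaussian design) are illustrative
    assumptions, not values taken from the book.
    """
    rng = np.random.default_rng(seed)
    beta = np.arange(1, p + 1, dtype=float)  # hypothetical true coefficients
    x0 = np.ones(p)                          # hypothetical new location

    sq_errors = np.empty(n_reps)
    quad_forms = np.empty(n_reps)
    for r in range(n_reps):
        # Inner expectation: a fresh training sample each repetition.
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

        # Outer expectation: a fresh response at the new location x0.
        y0 = x0 @ beta + sigma * rng.normal()

        sq_errors[r] = (y0 - x0 @ beta_hat) ** 2
        quad_forms[r] = x0 @ np.linalg.solve(X.T @ X, x0)

    empirical = sq_errors.mean()
    theoretical = sigma**2 + sigma**2 * quad_forms.mean()
    return empirical, theoretical


if __name__ == "__main__":
    empirical, theoretical = simulate_epe()
    print(f"simulated EPE: {empirical:.4f}")
    print(f"sigma^2 + variance term: {theoretical:.4f}")
```

The two printed numbers should agree up to Monte Carlo noise. Because the model is correctly specified here, there is no squared bias term to account for.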

Unbiased Estimators, Misspecified Models

Unfortunately, the unbiasedness of \(\yhat\) leads to confusion later when discussing model complexity. In the latter context, low complexity models such as linear regression are said to have high bias and low variance.

The discrepancy seems to be explained by the following.

When we say that \(\yhat\) is an unbiased estimator of \(\xbeta\), we are tacitly assuming a correctly specified model, and our understanding is that the distribution of estimates based on new population samples would have a mean of \(\xbeta\).

On the other hand, when we speak of bias in the context of model complexity, it is with respect to a misspecified model. Low complexity models tend to underfit; high complexity models tend to overfit. So, intuitively, when we claim that a low complexity model has high bias and low variance, it is a statement about its predictions: they are stable from sample to sample but systematically off target.

To be precise, linear regression is not unbiased for a misspecified model. Suppose the true model contains a term \(g\) that the linear fit omits,

\[ y_0 = \xbeta + g(x_0) + \varepsilon_0 \]

and write \(g_0 = g(x_0)\) and \(g\) for the vector with entries \(g(x_i)\). Then the estimator is

\[ \begin{align*} \yhat & = \xbhat \\ & = x_0^{\top} (X^{\top} X)^{-1} X^{\top} y \\ & = x_0^{\top} (X^{\top} X)^{-1} X^{\top}(X \beta + g + \varepsilon) \\ & = \xbeta + x_0^{\top} \left(X^{\top} X\right)^{-1} X^{\top} \left( g + \varepsilon \right) \end{align*} \]

Since \(\varepsilon\) has mean zero and is independent of \(X\), its mean is

\[ \begin{align*} \emean & = \xbeta + x_0^{\top} \EvTs{ \left(X^{\top} X\right)^{-1} X^{\top} g } \end{align*} \]

and its bias is

\[ \begin{align*} \mbox{Bias}(\yhat) & = \emean - (\xbeta + g_0) \\ & = x_0^{\top} \EvTs{ \left(X^{\top} X\right)^{-1} X^{\top} g } - g_0 \end{align*} \]

The claim in the context of model complexity, in terms of the bias-variance tradeoff, is with respect to a misspecified model. When the data are underfit by a low complexity model, we expect a stable prediction (low variance) that is consistently wrong (high bias).
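
The bias expression above can be checked numerically. The following is a minimal sketch under assumptions of my own choosing, not anything from HTF: a straight-line fit (intercept and slope) to data whose truth also contains a hypothetical omitted term \(g(x) = x^2\), with inputs uniform on \([-1, 1]\) and a prediction at \(x_0 = 0\). It compares the simulated bias of \(\yhat\) with the sample average of \(x_0^{\top} (X^{\top} X)^{-1} X^{\top} g - g_0\).

```python
import numpy as np


def g(x):
    """Hypothetical term that the linear model omits."""
    return x**2


def misspecified_bias(n=100, sigma=0.5, n_reps=5000, seed=0):
    """Simulate the bias of a linear fit when the truth is x'beta + g(x) + eps.

    The coefficients, input distribution, and x0 are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 2.0])        # hypothetical intercept and slope
    x0_scalar = 0.0
    x0 = np.array([1.0, x0_scalar])    # new location, with intercept entry
    target = x0 @ beta + g(x0_scalar)  # true mean response at x0

    preds = np.empty(n_reps)
    formula_terms = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(-1.0, 1.0, size=n)
        X = np.column_stack([np.ones(n), x])
        g_vec = g(x)
        y = X @ beta + g_vec + sigma * rng.normal(size=n)

        # l(X) = X (X'X)^{-1} x0, so that y_hat = l(X)' y.
        l = X @ np.linalg.solve(X.T @ X, x0)
        preds[r] = l @ y
        formula_terms[r] = l @ g_vec

    empirical_bias = preds.mean() - target
    formula_bias = formula_terms.mean() - g(x0_scalar)
    return empirical_bias, formula_bias, preds.var()


if __name__ == "__main__":
    empirical_bias, formula_bias, variance = misspecified_bias()
    print(f"simulated bias: {empirical_bias:.4f}")
    print(f"formula bias:   {formula_bias:.4f}")
    print(f"variance:       {variance:.4f}")
```

In this setup the straight line cannot represent the curvature, so the prediction at \(x_0 = 0\) sits near the average height of the parabola rather than at its minimum. The squared bias dominates the variance, which is the "stable but consistently wrong" behavior described above.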

Expected Prediction Error: Nearest Neighbor

Here, we have

\[ y_0 = f(x_0) + \varepsilon_0 \]

As we revisit 2.46 and 2.47, any confusion is likely notational: the authors have collapsed the \(\Evy{}\) and \(\EvT{}\) into a single expectation operator.

Working as before (see equation \(\ref{decomposition}\)), we obtain 2.46.

\[ \begin{align*} \Evy{ \EvT{ (y_0 - \yhat)^2 } } & = \sigma^2 + \mbox{Bias}^2\left( \nnfk \right) + \VarOp_\sspace \left( \nnfk \right) \end{align*} \]

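To obtain 2.47, treat the training inputs \(x_i\) as fixed, as HTF do in this section, so that the only randomness in the sample comes from the \(\varepsilon_i\). Then note that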

\[ \begin{gather*} \nnfk = \nnkavg{y} \\ \EvT{ \nnfk } = \nnkfavg{x} \\ \end{gather*} \]

As always, the bias is the difference between the expected value of the estimator and the true value, so \(\mbox{Bias}\left(\nnfk\right) = \nnkfavg{x} - f(x_0)\).

The variance is the expectation of the squared difference between the estimator and its mean,

\[ \begin{align*} \EvT{ \left( \nnfk - \EvT{ \nnfk } \right)^2 } & = \EvTb{ \left( \nnkavg{\varepsilon} \right)^2 } \\ & = \EvTb{ \frac{1}{k^2} \sum_{l=1}^k \varepsilon_{(l)}^2 } \\ & = \frac{\sigma^2}{k} \end{align*} \]

where we again use the fact that \(\varepsilon_i\) and \(\varepsilon_j\) are independent with mean zero for \(i \neq j\), so the cross terms in the squared average vanish.

Combining the pieces yields 2.47.
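
The nearest neighbor bias and variance terms can also be checked by simulation. This is a sketch under assumed settings rather than anything from HTF: the regression function `f`, the fixed grid of inputs, the location `x0`, and the noise level are all hypothetical. Because the inputs are held fixed, the \(k\) nearest neighbors of \(x_0\) never change, and only the noise at those neighbors varies between training samples.

```python
import numpy as np


def f(x):
    """Hypothetical true regression function."""
    return np.sin(2.0 * np.pi * x)


def knn_bias_variance(k=5, n=101, sigma=1.0, n_reps=20000, seed=0):
    """Simulate bias and variance of the k-NN average at x0 with fixed inputs.

    The grid, x0, and f are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)  # fixed design points
    x0 = 0.3

    # Indices of the k nearest design points; fixed because x is fixed.
    neighbors = np.argsort(np.abs(x - x0))[:k]
    f_neighbors = f(x[neighbors])

    # Each row holds the noise at the k neighbors for one training sample.
    eps = sigma * rng.normal(size=(n_reps, k))
    f_hat = (f_neighbors + eps).mean(axis=1)

    return {
        "empirical_bias": f_hat.mean() - f(x0),
        "formula_bias": f_neighbors.mean() - f(x0),
        "empirical_variance": f_hat.var(),
        "formula_variance": sigma**2 / k,
    }


if __name__ == "__main__":
    for name, value in knn_bias_variance().items():
        print(f"{name}: {value:.4f}")
```

Rerunning with a larger `k` should shrink the variance in proportion to \(1/k\) while the neighborhood widens, which typically moves the average of \(f\) over the neighbors further from \(f(x_0)\).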