
§5 Linear Model Selection and Regularization

  1. Why consider alternatives to least squares?
  2. Three classes of methods
  3. Subset Selection
    1. Best Subset Selection
      1. Extensions to other models
    2. Stepwise Selection
      1. Forward Stepwise Selection
      2. Backward Stepwise Selection
    3. Choosing the Optimal Model
    4. Estimating test error: two approaches
      1. $C_p$, AIC, BIC, and Adjusted $R^2$
      2. Validation and Cross-Validation
  4. Shrinkage Methods
    1. Ridge regression
      1. Scaling of predictors
    2. The Lasso
    3. Conclusions
    4. Selecting the Tuning Parameter for Ridge Regression and Lasso
  5. Dimension Reduction Methods
    1. Principal Components Regression
    2. Partial Least Squares
  6. Summary

Why consider alternatives to least squares?

  • Prediction Accuracy: especially when $p > n$, to control the variance.
  • Model Interpretability: By removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.

Three classes of methods

  • Subset Selection. We identify a subset of the $p$ predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
  • Shrinkage. We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
  • Dimension Reduction. We project the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.

Subset Selection

Best Subset Selection

  1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  2. For $k = 1, 2, \ldots, p$:
    1. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
    2. Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$.
  3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
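A minimal sketch of this procedure, assuming numpy and scikit-learn are available; `X` is an $n \times p$ predictor matrix and `y` the response, both placeholders. It enumerates all $\binom{p}{k}$ fits for each $k$ and keeps the smallest-RSS model $\mathcal{M}_k$; the final choice among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ is then made with cross-validation, $C_p$, BIC, or adjusted $R^2$.

```python
# Best subset selection sketch (placeholder data: X is an (n, p) array, y an (n,) vector).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return, for each size k = 0, ..., p, the smallest-RSS model M_k as (columns, RSS)."""
    n, p = X.shape
    # M_0: the null model predicts the sample mean for every observation.
    models = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        best = None
        for subset in combinations(range(p), k):          # all (p choose k) size-k models
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if best is None or rss < best[1]:
                best = (subset, rss)
        models[k] = best                                   # M_k: best model with exactly k predictors
    return models      # compare M_0, ..., M_p by CV error, Cp, BIC, or adjusted R^2
```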

Extensions to other models

  • Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
  • The deviance - negative two times the maximized log-likelihood - plays the role of RSS for a broader class of models.

Stepwise Selection

  • For computational reasons, best subset selection cannot be applied with very large $p$.
  • Best subset selection may also suffer from statistical problems when $p$ is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
  • Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

Forward Stepwise Selection

  • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
  1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
  2. For $k = 0, \ldots, p-1$:
    1. Consider all $p-k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor.
    2. Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having the smallest RSS or highest $R^2$.
  3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
  • The computational advantage over best subset selection is clear: only $1 + p(p+1)/2$ models are fit rather than $2^p$.
  • It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the $p$ predictors.
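A sketch of the greedy forward pass, under the same placeholder-data assumptions as above (`X`, `y`, numpy, scikit-learn):

```python
# Forward stepwise selection sketch (same placeholder X, y as above).
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Return the nested sequence of predictor sets for M_0, M_1, ..., M_p."""
    n, p = X.shape
    selected, path = [], [[]]                              # path[k] lists the predictors in M_k
    for _ in range(p):
        rss = {}
        for j in (j for j in range(p) if j not in selected):
            cols = selected + [j]                          # augment M_k with one extra predictor
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
        selected = selected + [min(rss, key=rss.get)]      # keep the single best addition
        path.append(list(selected))
    return path                                            # only 1 + p(p+1)/2 fits in total
```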

Backward Stepwise Selection

  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
  • However, unlike forward stepwise selection, it begins with the full least squares model containing all pp predictors, and then iteratively removes the least useful predictor, one-at-a-time.
  1. Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
  2. For $k = p, p-1, \ldots, 1$:
    1. Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
    2. Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having the smallest RSS or highest $R^2$.
  3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
  • Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models, and so can be applied in settings where $p$ is too large to apply best subset selection.
  • Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the $p$ predictors.
  • Backward selection requires that the number of samples $n$ is larger than the number of variables $p$ (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$, and so is the only viable subset method when $p$ is very large.
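The backward analogue, again only a sketch with placeholder data; it needs $n > p$ so the initial full fit exists. (scikit-learn's `SequentialFeatureSelector` with `direction='backward'` is a related built-in, though it scores candidates by cross-validation rather than RSS.)

```python
# Backward stepwise selection sketch (placeholder X, y; requires n > p for the full fit).
import numpy as np
from sklearn.linear_model import LinearRegression

def backward_stepwise(X, y):
    """Return the predictor sets for M_0, ..., M_p, built by deleting one variable at a time."""
    n, p = X.shape
    current = list(range(p))
    path = [list(current)]                                 # start from the full model M_p
    while current:
        rss = {}
        for j in current:                                  # try dropping each remaining predictor
            cols = [c for c in current if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                pred = fit.predict(X[:, cols])
            else:
                pred = np.full(len(y), y.mean())           # null model: predict the sample mean
            rss[j] = float(np.sum((y - pred) ** 2))
        current = [c for c in current if c != min(rss, key=rss.get)]
        path.append(list(current))
    return path[::-1]                                      # reorder as M_0, ..., M_p
```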

Choosing the Optimal Model

  • The model containing all of the predictors will always have the smallest RSS and the largest $R^2$, since these quantities are related to the training error.
  • We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
  • Therefore, RSS and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors.

Estimating test error: two approaches

  • We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
  • We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
  • We illustrate both approaches next.

$C_p$, AIC, BIC, and Adjusted $R^2$

  • These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.

  • Mallows' $C_p$:

    $$C_{p} = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^{2}\right)$$

    where $d$ is the total number of parameters used and $\hat{\sigma}^{2}$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement.

  • The AIC criterion is defined for a large class of models fit by maximum likelihood:

    $$\mathrm{AIC} = -2 \log L + 2 d$$

    where $L$ is the maximized value of the likelihood function for the estimated model.

  • In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and $C_p$ and AIC are equivalent.

  • Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.

    $$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d \hat{\sigma}^{2}\right)$$

  • Notice that BIC replaces the $2 d \hat{\sigma}^{2}$ used by $C_p$ with a $\log(n)\, d \hat{\sigma}^{2}$ term, where $n$ is the number of observations.

  • Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables.

  • For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

    $$\text{Adjusted } R^{2} = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$$

    where TSS is the total sum of squares.

  • Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error.

  • Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{\mathrm{RSS}}{n-d-1}$. While RSS always decreases as the number of variables in the model increases, $\frac{\mathrm{RSS}}{n-d-1}$ may increase or decrease, due to the presence of $d$ in the denominator.

  • Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model.
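A small helper, sketched under the same placeholder-data assumptions, that evaluates the three criteria above for a candidate model with $d$ predictors; `sigma2_hat` is an estimate of $\sigma^2$, typically the residual variance from the full least squares fit.

```python
# Sketch: evaluate Cp, BIC, and adjusted R^2 for a candidate model with d predictors.
# sigma2_hat is an estimate of Var(eps), e.g. the residual variance of the full model.
import numpy as np
from sklearn.linear_model import LinearRegression

def criteria(X_sub, y, sigma2_hat):
    n, d = X_sub.shape                                     # d = number of predictors in this model
    fit = LinearRegression().fit(X_sub, y)
    rss = float(np.sum((y - fit.predict(X_sub)) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2        # small Cp/BIC is good, large adjusted R^2 is good
```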

Validation and Cross-Validation

  • Each of the procedures returns a sequence of models $\mathcal{M}_k$ indexed by model size $k = 0, 1, 2, \ldots$ Our job here is to select $\hat{k}$. Once selected, we will return model $\mathcal{M}_{\hat{k}}$.

  • We compute the validation set error or the cross-validation error for each model $\mathcal{M}_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.

  • This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance $\sigma^2$.

  • It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.

  • One-standard-error rule:

    We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
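A sketch of this rule, assuming `cv_errors[k]` already holds the fold-wise MSE estimates for the size-$k$ model; how those are produced is left to the surrounding CV loop.

```python
# One-standard-error rule sketch. cv_errors[k] holds the fold-wise MSE estimates for
# the size-k model (k = 0, ..., p), produced by whatever CV loop is in use.
import numpy as np

def one_se_rule(cv_errors):
    means = np.array([np.mean(e) for e in cv_errors])
    ses = np.array([np.std(e, ddof=1) / np.sqrt(len(e)) for e in cv_errors])
    k_min = int(np.argmin(means))                          # lowest point on the CV curve
    threshold = means[k_min] + ses[k_min]
    return int(np.argmax(means <= threshold))              # smallest k within one SE of the minimum
```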

Shrinkage Methods

Ridge regression and Lasso

  • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
  • As an alternative, we can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
  • It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

Ridge regression

  • Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

    $$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}$$

  • In contrast, the ridge regression coefficient estimates $\hat{\beta}^{R}$ are the values that minimize

    $$\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2} + \lambda \sum_{j=1}^{p} \beta_{j}^{2} = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_{j}^{2}$$

    where $\lambda \geq 0$ is a tuning parameter, to be determined separately.

  • As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.

  • However, the second term, $\lambda \sum_{j} \beta_{j}^{2}$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.

  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates.

  • Selecting a good value for $\lambda$ is critical; cross-validation is used for this.
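An illustrative sketch on toy data (the data and the grid of $\lambda$ values are placeholders). scikit-learn's `Ridge` minimizes exactly the criterion above, with the tuning parameter called `alpha` and the intercept left unpenalized.

```python
# Ridge shrinkage on toy data. Ridge minimizes RSS + alpha * sum(beta_j^2),
# with an unpenalized intercept, so alpha plays the role of lambda above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # purely illustrative data
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)                  # standardize first (next subsection)
for lam in [0.01, 1.0, 100.0, 10000.0]:
    fit = Ridge(alpha=lam).fit(X_std, y)
    print(lam, np.round(fit.coef_, 3))     # estimates shrink toward (but never exactly to) zero
```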

Scaling of predictors

  • The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.

  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.

  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

    $$\tilde{x}_{i j} = \frac{x_{i j}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i j} - \bar{x}_{j}\right)^{2}}}$$
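A one-liner that performs exactly this scaling (the $\frac{1}{n}$ inside the square root is the biased sample variance); in practice scikit-learn's `StandardScaler`, which also centers each column, is the usual choice.

```python
# Standardize the predictors as in the formula: divide each column by its
# standard deviation computed with 1/n. (StandardScaler additionally centers each column.)
import numpy as np

def standardize(X):
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd
```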

The Lasso

  • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all $p$ predictors in the final model.

  • The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}_{\lambda}^{L}$, minimize the quantity

    $$\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2} + \lambda \sum_{j=1}^{p}\left|\beta_{j}\right| = \mathrm{RSS} + \lambda \sum_{j=1}^{p}\left|\beta_{j}\right|$$

  • In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_{1} = \sum\left|\beta_{j}\right|$.

  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.

  • However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.

  • Hence, much like best subset selection, the lasso performs variable selection.

  • We say that the lasso yields sparse models - that is, models that involve only a subset of the variables.

  • As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice.
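A toy sketch of the resulting sparsity. Note that scikit-learn's `Lasso` scales the RSS term by $1/(2n)$, so its `alpha` corresponds to $\lambda$ only up to that constant.

```python
# Lasso sparsity on the same toy data. Lasso minimizes RSS / (2n) + alpha * sum(|beta_j|).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)
for lam in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=lam).fit(X_std, y)
    print(lam, np.round(fit.coef_, 3))                     # exact zeros appear as lambda grows
```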

Conclusions

  • Neither ridge regression nor the lasso will universally dominate the other.
  • In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
  • However, the number of predictors that is related to the response is never known a priori for real data sets.
  • A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.

Selecting the Tuning Parameter for Ridge Regression and Lasso

  • As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
  • That is, we require a method for selecting a value of the tuning parameter $\lambda$, or equivalently, the value of the constraint $s$.
  • Cross-validation provides a simple way to tackle this problem. We choose a grid of $\lambda$ values, and compute the cross-validation error rate for each value of $\lambda$.
  • We then select the tuning parameter value for which the cross-validation error is smallest.
  • Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
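A sketch of this recipe using scikit-learn's `LassoCV` (`RidgeCV` is the analogue for ridge regression); the toy data and the grid of values are placeholders.

```python
# Choosing lambda by cross-validation with LassoCV.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)
grid = np.logspace(-3, 1, 50)                              # grid of candidate lambda values
cv_fit = LassoCV(alphas=grid, cv=10).fit(X_std, y)         # CV error for each value, then refit on all data
print(cv_fit.alpha_)                                       # lambda with the smallest CV error
print(np.round(cv_fit.coef_, 3))                           # final coefficients at that lambda
```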

Dimension Reduction Methods

  • The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, $X_1, X_2, \ldots, X_p$.

  • We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

  • Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors. That is,

    $$Z_{m} = \sum_{j=1}^{p} \phi_{m j} X_{j} \quad (1)$$

    for some constants $\phi_{m 1}, \ldots, \phi_{m p}$.

  • We can then fit the linear regression model,

    $$y_{i} = \theta_{0} + \sum_{m=1}^{M} \theta_{m} z_{i m} + \epsilon_{i}, \quad i = 1, \ldots, n \quad (2)$$

    using ordinary least squares.

  • Note that in model (2), the regression coefficients are given by $\theta_0, \theta_1, \ldots, \theta_M$. If the constants $\phi_{m 1}, \ldots, \phi_{m p}$ are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.

  • Notice that from definition (1),

    $$\sum_{m=1}^{M} \theta_{m} z_{i m} = \sum_{m=1}^{M} \theta_{m} \sum_{j=1}^{p} \phi_{m j} x_{i j} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_{m} \phi_{m j} x_{i j} = \sum_{j=1}^{p} \beta_{j} x_{i j}$$

    where

    $$\beta_{j} = \sum_{m=1}^{M} \theta_{m} \phi_{m j} \quad (3)$$

  • Hence model (2) can be thought of as a special case of the original linear regression model.

  • Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, since now they must take the form (3).

  • Constraining the coefficients in this way can win in the bias-variance tradeoff: a small increase in bias is traded for a potentially large reduction in variance.
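A quick numerical check of identity (3) on random toy data: regressing $y$ on $Z = X \Phi^{\top}$ and mapping $\theta$ back through $\beta_j = \sum_m \theta_m \phi_{mj}$ reproduces the same fitted values.

```python
# Numerical check of (3): the fit on Z equals the fit with the constrained betas.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
Phi = rng.normal(size=(2, 4))                              # arbitrary phi_mj with M = 2 < p = 4

Z = X @ Phi.T                                              # z_im = sum_j phi_mj x_ij
fit = LinearRegression().fit(Z, y)                         # theta_1, ..., theta_M
beta = Phi.T @ fit.coef_                                   # beta_j = sum_m theta_m phi_mj
print(np.allclose(fit.predict(Z), fit.intercept_ + X @ beta))   # True
```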

Principal Components Regression

  • Here we apply principal components analysis (PCA) to define the linear combinations of the predictors, in our regression.
  • The first principal component is that (normalized) linear combination of the variables with the largest variance.
  • The second principal component has largest variance, subject to being uncorrelated with the first.
  • And so on.
  • Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.
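A sketch of PCR as a scikit-learn pipeline; the number of components $M$ and the toy data are placeholders, and $M$ would normally be chosen by cross-validation.

```python
# PCR sketch: standardize, keep the first M principal components, regress y on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

M = 2                                                      # number of components, normally tuned by CV
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression()).fit(X, y)
print(pcr.predict(X[:3]))
```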

Partial Least Squares

  • PCR identifies linear combinations, or directions, that best represent the predictors $X_1, \ldots, X_p$.
  • These directions are identified in an unsupervised way, since the response $Y$ is not used to help determine the principal component directions.
  • That is, the response does not supervise the identification of the principal components.
  • Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
  • Like PCR, PLS is a dimension reduction method, which first identifies a new set of features $Z_1, \ldots, Z_M$ that are linear combinations of the original features, and then fits a linear model via OLS using these $M$ new features.
  • But unlike PCR, PLS identifies these new features in a supervised way - that is, it makes use of the response $Y$ in order to identify new features that not only approximate the old features well, but are also related to the response.
  • Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
  • After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{1 j}$ in (1) equal to the coefficient from the simple linear regression of $Y$ onto $X_j$.
  • One can show that this coefficient is proportional to the correlation between $Y$ and $X_j$.
  • Hence, in computing $Z_{1} = \sum_{j=1}^{p} \phi_{1 j} X_{j}$, PLS places the highest weight on the variables that are most strongly related to the response.
  • Subsequent directions are found by taking residuals and then repeating the above prescription.
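A sketch using scikit-learn's `PLSRegression` on toy data; `n_components` plays the role of $M$ and, as with PCR, would be tuned by cross-validation.

```python
# PLS sketch: PLSRegression standardizes internally (scale=True) and chooses its
# directions using both the predictors and the response.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

pls = PLSRegression(n_components=2).fit(X, y)              # M = 2 PLS directions
print(pls.x_weights_.shape)                                # the phi_mj weights, one column per direction
print(np.ravel(pls.predict(X[:3])))
```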

Summary

  • Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors.
  • Research into methods that give sparsity, such as the lasso, is an especially hot area.

— Jul 15, 2022

§5 Linear Model Selection and Regularization by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.