These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
Mallow’s $C_p$:
$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right),$$
where $d$ is the total number of parameters used and $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement.
The AIC criterion is defined for a large class of models fit by maximum likelihood:
$$\mathrm{AIC} = -2\log L + 2d,$$
where $L$ is the maximized value of the likelihood function for the estimated model.
In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and $C_p$ and AIC are equivalent.
Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
Notice that BIC replaces the $2d\hat{\sigma}^2$ used by $C_p$ with a $\log(n)\,d\,\hat{\sigma}^2$ term, where $n$ is the number of observations.
Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables.
For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as
$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)},$$
where TSS is the total sum of squares.
Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error.
Maximizing the adjusted $R^2$ is equivalent to minimizing $\mathrm{RSS}/(n-d-1)$. While RSS always decreases as the number of variables in the model increases, $\mathrm{RSS}/(n-d-1)$ may increase or decrease, due to the presence of $d$ in the denominator.
Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model.
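As a concrete sketch (not from the notes above), all four criteria can be computed directly from the residual sum of squares of a fitted least squares model. The helper below and its example inputs are hypothetical; in practice `sigma2_hat` would typically come from the full model containing all predictors.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """C_p, AIC (Gaussian linear model, up to irrelevant constants), BIC, and
    adjusted R^2 for a least squares fit with d predictors on n observations."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)   # proportional to C_p here
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "AdjR2": adj_r2}

# Hypothetical example: RSS = 1400, TSS = 5000, n = 100 observations, d = 5 predictors.
print(selection_criteria(rss=1400.0, tss=5000.0, n=100, d=5, sigma2_hat=15.0))
```

Lower values are better for the first three criteria; higher is better for adjusted $R^2$.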
Each of the procedures returns a sequence of models $\mathcal{M}_k$ indexed by model size $k = 0, 1, 2, \dots$. Our job here is to select $\hat{k}$. Once selected, we will return model $\mathcal{M}_{\hat{k}}$.
We compute the validation set error or the cross-validation error for each model $\mathcal{M}_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.
This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and doesn’t require an estimate of the error variance $\sigma^2$.
It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.
One-standard-error rule:
We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
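A minimal sketch of both steps with scikit-learn on synthetic data; the nested `candidate_feature_sets` below are an illustrative stand-in for whatever sequence of models $\mathcal{M}_k$ the selection procedure produced, and are not from the notes above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=15.0, random_state=0)

# Hypothetical nested models: model k uses the first k columns of X.
candidate_feature_sets = [list(range(k)) for k in range(1, X.shape[1] + 1)]

means, ses = [], []
for cols in candidate_feature_sets:
    mse = -cross_val_score(LinearRegression(), X[:, cols], y,
                           scoring="neg_mean_squared_error", cv=10)
    means.append(mse.mean())
    ses.append(mse.std(ddof=1) / np.sqrt(len(mse)))
means, ses = np.array(means), np.array(ses)

best = means.argmin()                   # model with the lowest estimated test MSE
threshold = means[best] + ses[best]     # one standard error above the minimum
k_hat = np.argmax(means <= threshold)   # smallest model within one SE of the minimum
print(k_hat + 1, means[k_hat])
```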
Ridge regression and Lasso
Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \dots, \beta_p$ using the values that minimize
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2.$$
In contrast, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2,$$
where $\lambda \geq 0$ is a tuning parameter, to be determined separately.
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
However, the second term, $\lambda\sum_{j}\beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \dots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.
The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates.
Selecting a good value for $\lambda$ is critical; cross-validation is used for this.
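For instance, scikit-learn can fit ridge regression over a grid of $\lambda$ values and pick the best one by cross-validation (the library calls the tuning parameter `alpha`); the toy data below is purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Candidate values for the tuning parameter lambda ("alpha" in scikit-learn).
lambdas = np.logspace(-4, 4, 100)

ridge_cv = RidgeCV(alphas=lambdas, cv=10).fit(X, y)  # 10-fold CV over the grid
print(ridge_cv.alpha_)   # lambda with the lowest cross-validated error
print(ridge_cv.coef_)    # shrunken coefficient estimates at that lambda
```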
The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j\hat{\beta}_j$ will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}}.$$
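In code, this standardization amounts to dividing each column by its standard deviation computed with a divisor of $n$; a toy numpy sketch (the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * [1, 10, 100, 0.1, 5]   # predictors on very different scales

# Divide each column by its standard deviation (divisor n, matching the formula
# above), so that every predictor has standard deviation one.
X_std = X / X.std(axis=0, ddof=0)
```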
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all predictors in the final model.
The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^{L}_{\lambda}$, minimize the quantity
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|.$$
In statistical parlance, the lasso uses an $\ell_1$ (pronounced “ell 1”) penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum_{j}|\beta_j|$.
As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection.
We say that the lasso yields sparse models - that is, models that involve only a subset of the variables.
As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice.
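A minimal scikit-learn sketch on synthetic data: `LassoCV` chooses $\lambda$ over its own grid by cross-validation, and the nonzero coefficients show which variables the lasso keeps.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Standardize first so all predictors are on the same scale, then pick
# lambda ("alpha" in scikit-learn) by 10-fold cross-validation.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=10)).fit(X, y)

print(lasso_cv[-1].alpha_)               # selected lambda
print((lasso_cv[-1].coef_ != 0).sum())   # number of variables the lasso keeps
```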
The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, $X_1, X_2, \dots, X_p$.
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
Let $Z_1, Z_2, \dots, Z_M$ represent $M < p$ linear combinations of our original $p$ predictors. That is,
$$Z_m = \sum_{j=1}^{p}\phi_{jm}X_j \tag{1}$$
for some constants $\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}$, $m = 1, \dots, M$.
We can then fit the linear regression model
$$y_i = \theta_0 + \sum_{m=1}^{M}\theta_m z_{im} + \epsilon_i, \quad i = 1, \dots, n, \tag{2}$$
using ordinary least squares.
Note that in model (2), the regression coefficients are given by $\theta_0, \theta_1, \dots, \theta_M$. If the constants $\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}$ are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
Notice that from definition (1),
$$\sum_{m=1}^{M}\theta_m z_{im} = \sum_{m=1}^{M}\theta_m\sum_{j=1}^{p}\phi_{jm}x_{ij} = \sum_{j=1}^{p}\sum_{m=1}^{M}\theta_m\phi_{jm}x_{ij} = \sum_{j=1}^{p}\beta_j x_{ij},$$
where
$$\beta_j = \sum_{m=1}^{M}\theta_m\phi_{jm}. \tag{3}$$
Hence model (2) can be thought of as a special case of the original linear regression model.
Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, since now they must take the form (3).
This constraint introduces some bias into the coefficient estimates, but when $M \ll p$ it can substantially reduce their variance, so dimension reduction can win in the bias-variance tradeoff.
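One common way to choose the constants $\phi_{jm}$ (not developed in the notes above) is principal components analysis, giving principal components regression. A minimal scikit-learn sketch on synthetic data, with $M$ picked arbitrarily here; in practice $M$ would be chosen by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

M = 3  # number of linear combinations Z_1, ..., Z_M (M < p)

# The PCA loadings play the role of the phi_jm in (1); the least squares fit on
# the components Z_1, ..., Z_M estimates the theta_m in (2).
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression()).fit(X, y)
print(pcr.score(X, y))   # in-sample R^2 of the reduced-dimension fit
```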