(or a) “best” model? As was stated on page 4, there is no “true” model, since any model is only a representation of reality (or equivalently, the true model is too complex to be modeled usefully). Since the goal is not to find the “true” model, but rather to find a model or set of models that best balances fit and simplicity, any strategy used to guide model selection should be consistent with this principle. The goal is to provide a good predictive model that also provides useful descriptions of the process being studied from estimated parameters.

      Once a potential set of predictors is chosen, most statistical packages include the capability to produce summary statistics for all possible regression models using those predictors. Such algorithms (often called best subsets algorithms) do not actually look at all possible models, but rather list statistics for only the models with strongest fits for each number of predictors in the model. Such a listing can then be used to determine a set of potential "best" models to consider more closely. The most common algorithm, described in Furnival and Wilson (1974), is based on branch and bound optimization, and while it is much less computationally intensive than examining all possible models, it still has a practical feasible limit of roughly 30 to 35 predictors. In Chapter 14, we discuss model selection and fitting for (potentially much) larger numbers of predictors.
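      In R, one widely used implementation of such a search is the regsubsets() function in the leaps package, which performs an exhaustive branch-and-bound search. The following is a minimal sketch only; the data frame homes and the response name Sale.price are hypothetical stand-ins for the actual objects.

      library(leaps)

      # Branch-and-bound search, keeping the best 3 models of each size
      best <- regsubsets(Sale.price ~ ., data = homes, nbest = 3, nvmax = 6)
      bs <- summary(best)

      # bs$which flags the predictors in each retained model; bs$rsq, bs$adjr2,
      # and bs$cp hold R-squared, adjusted R-squared, and Mallows' Cp
      data.frame(Vars     = rowSums(bs$which) - 1,  # minus 1 for the intercept
                 R.Sq     = round(100 * bs$rsq, 1),
                 R.Sq.adj = round(100 * bs$adjr2, 1),
                 Cp       = round(bs$cp, 1))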

      Note that model comparisons are only sensible when based on the same data set. Most statistical packages drop any observations that have missing data in any of the variables in the model. If a data set has missing values scattered over different predictors, the set of observations with complete data will change depending on which variables are in the model being examined, and model comparison measures will not be comparable. One way around this is to only use observations with complete data for all variables under consideration, but this can result in discarding a good deal of available information for any particular model.
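      In R, one way to impose a common estimation sample is to restrict to complete cases over the full candidate predictor set before any models are fit; a minimal sketch, again using the hypothetical homes data frame:

      # Keep only rows with no missing values in any candidate variable
      vars <- c("Sale.price", "Bedrooms", "Bathrooms", "Living.area",
                "Lot.size", "Year.built", "Property.tax")
      homes.cc <- homes[complete.cases(homes[, vars]), ]

      # Every model fit to homes.cc now uses exactly the same observations,
      # so their fit measures are directly comparable
      fit.small <- lm(Sale.price ~ Bathrooms + Living.area, data = homes.cc)
      fit.large <- lm(Sale.price ~ Bathrooms + Living.area + Year.built,
                      data = homes.cc)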

      Tools like best subsets by their very nature are likely to be more effective when there are a relatively small number of useful predictors that have relatively strong effects, as opposed to a relatively large number of predictors that have relatively weak effects. The strict present/absent choice for a predictor is consistent with true relationships with either zero or distinctly nonzero slopes, as opposed to many slopes that are each nonzero but also not far from zero.

      

      2.3.2 EXAMPLE — ESTIMATING HOME PRICES (CONTINUED)
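      Output of the form shown below can be reproduced along the following lines. This is a minimal sketch, assuming a data frame homes holding the home prices data and a response named Sale.price (both stand-ins for whatever the actual objects are called); vif() is from the add-on car package.

      library(car)   # provides vif() for variance inflation factors

      fit <- lm(Sale.price ~ Bedrooms + Bathrooms + Living.area +
                  Lot.size + Year.built + Property.tax, data = homes)
      summary(fit)   # coefficient table, R-squared, and F-statistic
      vif(fit)       # VIF values for the six predictors

      # A VIF can also be computed from first principles: regress one predictor
      # on the other five, and the VIF is 1/(1 - R-squared) of that regression
      aux <- lm(Bedrooms ~ Bathrooms + Living.area + Lot.size +
                  Year.built + Property.tax, data = homes)
      1 / (1 - summary(aux)$r.squared)   # should match the VIF for Bedrooms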

Coefficients:
               Estimate  Std.Error  t value  Pr(>|t|)    VIF
(Intercept)  -7.149e+06  3.820e+06   -1.871  0.065043         .
Bedrooms     -1.229e+04  9.347e+03   -1.315  0.192361  1.262
Bathrooms     5.170e+04  1.309e+04    3.948  0.000171  1.420  ***
Living.area   6.590e+01  1.598e+01    4.124  9.22e-05  1.661  ***
Lot.size     -8.971e-01  4.194e+00   -0.214  0.831197  1.074
Year.built    3.761e+03  1.963e+03    1.916  0.058981  1.242  .
Property.tax  1.476e+00  2.832e+00    0.521  0.603734  1.300
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47380 on 78 degrees of freedom
Multiple R-squared: 0.5065, Adjusted R-squared: 0.4685
F-statistic: 13.34 on 6 and 78 DF, p-value: 2.416e-10

      This is identical to the output given earlier, except that variance inflation factor (VIF) values are given for each predictor. It is apparent that there is virtually no collinearity among these predictors (recall that 1 is the minimum possible value of the VIF), which should make model selection more straightforward. The following output summarizes a best subsets fitting:

                                                        P
                                                  L     r
                                                  i   Y o
                                                B v   e p
                                              B a i L a e
                                              e t n o r r
                                              d h g t . t
                                              r r . . b y
                                              o o a s u .
                                              o o r i i t
                      Mallows                 m m e z l a
Vars  R-Sq  R-Sq(adj)      Cp    AICc      S  s s a e t x
   1  35.3       34.6    21.2  1849.9  52576      X
   1  29.4       28.6    30.6  1857.3  54932    X
   1  10.6        9.5    60.3  1877.4  61828  X
   2  46.6       45.2     5.5  1835.7  48091    X X
   2  38.9       37.5    17.5  1847.0  51397      X   X
   2  37.8       36.3    19.3  1848.6  51870  X   X
   3  49.4       47.5     3.0  1833.1  47092    X X   X
   3  48.2       46.3     4.9  1835.0  47635  X X X
   3  46.6       44.7     7.3  1837.5  48346    X X     X
   4  50.4       48.0     3.3  1833.3  46885  X X X   X
   4  49.5       47.0     4.7  1834.8  47304    X X   X X
   4  49.4       46.9     5.0  1835.1  47380    X X X X
   5  50.6       47.5     5.0  1835.0  47094  X X X   X X
   5  50.5       47.3     5.3  1835.2  47162  X X X X X
   5  49.6       46.4     6.7  1836.8  47599    X X X X X
   6  50.6       46.9     7.0  1836.9  47381  X X X X X X
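      Any row of this listing can be checked by fitting the corresponding model directly; for example, for the best three-predictor model (a sketch, with homes again a hypothetical stand-in for the data frame):

      fit3 <- lm(Sale.price ~ Bathrooms + Living.area + Year.built, data = homes)
      s3 <- summary(fit3)
      100 * s3$r.squared       # the R-Sq column (about 49.4)
      100 * s3$adj.r.squared   # the R-Sq(adj) column (about 47.5)
      s3$sigma                 # the S column, the residual standard error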

      1. Increase the number of predictors until the $R^2$ value levels off. Clearly, the highest $R^2$ for a given number of predictors cannot be smaller than that for a smaller number of predictors. If the $R^2$ levels off, that implies that additional variables are not providing much additional fit. In this case, the largest $R^2$ values go from roughly 35% to 47% in moving from one predictor to two, which is clearly a large gain in fit, but beyond that more complex models do not provide much additional fit (particularly past three predictors). Thus, this guideline suggests choosing either $p = 2$ or $p = 3$.

      2. Choose the model that maximizes the adjusted $R^2$. Recall from equation (1.7) that the adjusted $R^2$ equals

$$R_a^2 = R^2 - \frac{p}{n - p - 1}\,(1 - R^2).$$

It is apparent that $R_a^2$ explicitly trades off strength of fit ($R^2$) versus simplicity [the multiplier $p/(n - p - 1)$], and can decrease if predictors that do not add any predictive power are added to a model. Thus, it is reasonable to not complicate a model beyond the point where its adjusted $R^2$ increases. For these data, $R_a^2$ is maximized at $p = 4$: from the listing, $R_a^2 = 0.504 - (4/80)(1 - 0.504) \approx 0.48$, the largest value in the R-Sq(adj) column.

      The fourth column in the output refers to a criterion called Mallows' $C_p$ (Mallows, 1973). This criterion equals

$$C_p = \frac{\text{Residual } SS_p}{\hat{\sigma}^2} + 2(p + 1) - n.$$
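      As a quick arithmetic check of this formula against the output above (taking $\hat{\sigma}^2$ to be the residual mean square from the model containing all six predictors, the usual convention for $C_p$): for that full model, $\text{Residual } SS_p/\hat{\sigma}^2 = n - p - 1 = 78$, so

$$C_p = 78 + 2(6 + 1) - 85 = 7 = p + 1,$$

which agrees with the $C_p$ of $7.0$ reported in the last row of the best subsets output.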
