Handbook of Regression Analysis With Applications in R. Samprit Chatterjee


not unreasonable if collinearity is not a problem, but control variables that do not provide additional significant predictive power, yet are collinear with the variables of direct interest, might be worth removing so that they do not obscure the relationships involving the more important variables. In these data the two voter turnout variables are (not surprisingly) highly collinear, but a potential simplification to consider (particularly given that the target variable is the change in Bush voting percentage from 2000 to 2004) is to use the change in voter turnout as a predictor (the fact that the estimated slope coefficients for 2000 and 2004 voter turnout are of opposite signs and similar magnitudes also supports this idea). The model using the change in voter turnout is a subset of the model using 2000 and 2004 voter turnout separately (corresponding to the restriction that the two turnout slopes are equal in magnitude and opposite in sign), so the two models can be compared using a partial F‐test. As can be seen below, the fit of the simpler model is similar to that of the more complicated one, collinearity is no longer a problem, and it turns out that the partial F‐test supports that the simpler model fits well enough compared to the more complicated model to be preferred (although voter turnout is still apparently not important).

      Coefficients:
                            Estimate  Std.Error  t val  P(>|t|)    VIF
      (Intercept)          1.157e+00   2.54e+00   0.46    0.651
      Bush.pct.2000        1.633e-01   3.67e-02   4.46  3.7e-05   1.55 ***
      e.Voting             1.272e+01   4.20e+00   3.03    0.004  41.25 **
      Bush.2000 X e.Voting -2.297e-01  7.53e-02  -3.05    0.003  38.25 **
      Change.turnout       -1.223e-05  1.36e-05  -0.90    0.370   2.44
      Median.income        -1.718e-04  5.57e-05  -3.08    0.003   1.65 **
      Hispan.pop.pct       -4.892e-02  2.94e-02  -1.66    0.102   1.14
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 2.233 on 60 degrees of freedom
      Multiple R-squared: 0.4585, Adjusted R-squared: 0.4044
      F-statistic: 8.468 on 6 and 60 DF, p-value: 1.145e-06
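The partial F‐test used here compares a full model to a restricted (nested) one by contrasting their residual sums of squares. A minimal sketch of the computation on simulated data (all variable names and the data‐generating values are illustrative, not the actual election data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical stand-ins for the two turnout variables and one control.
turn2000 = rng.normal(50, 5, n)
turn2004 = rng.normal(55, 5, n)
income = rng.normal(40, 8, n)
# Simulate a response that truly depends only on the *change* in turnout.
y = 2.0 + 0.3 * (turn2004 - turn2000) - 0.05 * income + rng.normal(0, 1, n)

def rss(X, y):
    """Residual sum of squares from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

# Full model: both turnout variables entered separately, plus the control.
rss_full = rss(np.column_stack([turn2000, turn2004, income]), y)
p_full = 4  # parameters in the full model, including the intercept
# Restricted model: only the difference, imposing beta_2004 = -beta_2000.
rss_sub = rss(np.column_stack([turn2004 - turn2000, income]), y)
q = 1  # number of restrictions imposed

# Partial F-statistic: scaled increase in RSS from imposing the restriction.
F = ((rss_sub - rss_full) / q) / (rss_full / (n - p_full))
print(round(F, 3))  # a small F means the simpler model fits adequately
```

In R the same comparison would be done with `anova(fit.sub, fit.full)`; the sketch above just makes the underlying arithmetic explicit.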

Figure: Residual plots for the 2004 election data. They do not indicate any obvious problems, although there is potential nonconstant variance related to whether or not a county used electronic voting.

      These considerations open up a broader spectrum of tools for model building than just hypothesis tests. Best subsets regression algorithms allow for the quick summarization of hundreds or even thousands of potential regression models. The underlying principle of these summaries is the principle of parsimony, which implies a tradeoff of strength of fit versus simplicity: a model should only be as complex as it needs to be. Measures such as the adjusted R-squared, Mallows' Cp, and AICc explicitly provide this tradeoff, and are useful tools in helping to decide when a simpler model is preferred over a more complicated one. An effective model selection strategy uses these measures, as well as hypothesis tests and estimated prediction intervals, to suggest a set of potential “best” models, which can then be considered further. In doing so, it is important to remember that the variability that comes from model selection itself (model selection uncertainty) means that it is likely that several models actually provide descriptions of the underlying population process that are equally valid. One way of assessing the effects of this type of uncertainty is to keep some of the observed data aside as a holdout sample, and then validate the chosen fitted model(s) on that held out data.
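A best subsets search of this kind can be sketched by enumerating all candidate predictor subsets and scoring each with a parsimony‐penalized criterion. The sketch below uses simulated data (the design, coefficients, and sample size are all illustrative assumptions) and a standard Gaussian‐likelihood form of AICc:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 80
# Hypothetical design: three predictors with real effects, two pure noise.
X = rng.normal(size=(n, 5))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, n)

def fit_stats(cols):
    """Adjusted R^2 and AICc for an OLS fit on the given columns."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    p = Xs.shape[1]  # regression parameters, including the intercept
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    # Gaussian-likelihood AIC with small-sample correction; k counts
    # the error variance as one additional estimated parameter.
    k = p + 1
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    return adj_r2, aicc

# Enumerate every non-empty subset of the 5 predictors; pick the AICc minimizer.
subsets = [c for r in range(1, 6) for c in itertools.combinations(range(5), r)]
best = min(subsets, key=lambda c: fit_stats(c)[1])
print(best)
```

Because AICc penalizes each added parameter, the noise predictors tend to be excluded even though they always reduce the raw residual sum of squares. In R, the `leaps` package performs this kind of search far more efficiently for larger predictor sets.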

      A related point increasingly raised in recent years has been focused on issues of replicability, or the lack thereof — the alarming tendency for supposedly established relationships to not reappear as strongly (or at all) when new data are examined. Much of this phenomenon comes from quite valid attempts to find appropriate representations of relationships in a complicated world (including those discussed here and in the next three chapters), but that doesn't alter the simple fact that interacting with data to make models more appropriate tends to make things look stronger than they actually are. Replication and validation of models (and the entire model building process) should be a fundamental part of any exploration of a random process. Examining a problem further and discovering that a previously‐believed relationship does not replicate is not a failure of the scientific process; in fact, it is part of the essence of it.
