Handbook of Regression Analysis With Applications in R. Samprit Chatterjee


not unreasonable if collinearity is not a problem, but control variables that do not provide additional significant predictive power, yet are collinear with the variables of direct interest, might be worth removing so that they do not obscure the relationships involving the more important variables. In these data the two voter turnout variables are (not surprisingly) highly collinear, but a potential simplification to consider (particularly given that the target variable is the change in Bush voting percentage from 2000 to 2004) is to use the change in voter turnout as a predictor (the fact that the estimated slope coefficients for 2000 and 2004 voter turnout are of opposite signs and similar magnitudes also supports this idea). The model using the change in voter turnout is a subset of the model using 2000 and 2004 voter turnout separately (corresponding to the restriction that the two turnout slopes are equal in magnitude and opposite in sign), so the two models can be compared using a partial F‐test. As can be seen below, the fit of the simpler model is similar to that of the more complicated one, collinearity is no longer a problem, and it turns out that the partial F‐test supports that the simpler model fits well enough compared to the more complicated model to be preferred (although voter turnout is still apparently not important).

      Coefficients:
                            Estimate  Std.Error  t val  P(>|t|)    VIF
      (Intercept)          1.157e+00   2.54e+00   0.46    0.651
      Bush.pct.2000        1.633e-01   3.67e-02   4.46  3.7e-05   1.55 ***
      e.Voting             1.272e+01   4.20e+00   3.03    0.004  41.25 **
      Bush.2000 X e.Voting -2.297e-01  7.53e-02  -3.05    0.003  38.25 **
      Change.turnout       -1.223e-05  1.36e-05  -0.90    0.370   2.44
      Median.income        -1.718e-04  5.57e-05  -3.08    0.003   1.65 **
      Hispan.pop.pct       -4.892e-02  2.94e-02  -1.66    0.102   1.14
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 2.233 on 60 degrees of freedom
      Multiple R-squared: 0.4585, Adjusted R-squared: 0.4044
      F-statistic: 8.468 on 6 and 60 DF, p-value: 1.145e-06
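The partial F‐test used here compares a full model to a restricted (nested) one by contrasting their residual sums of squares. A minimal sketch of the computation on simulated data (all variable names and the data‐generating values are illustrative, not the actual election data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical stand-ins for the two turnout variables and one control.
turn2000 = rng.normal(50, 5, n)
turn2004 = rng.normal(55, 5, n)
income = rng.normal(40, 8, n)
# Simulate a response that truly depends only on the *change* in turnout.
y = 2.0 + 0.3 * (turn2004 - turn2000) - 0.05 * income + rng.normal(0, 1, n)

def rss(X, y):
    """Residual sum of squares from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

# Full model: both turnout variables entered separately, plus the control.
rss_full = rss(np.column_stack([turn2000, turn2004, income]), y)
p_full = 4  # parameters in the full model, including the intercept
# Restricted model: only the difference, imposing beta_2004 = -beta_2000.
rss_sub = rss(np.column_stack([turn2004 - turn2000, income]), y)
q = 1  # number of restrictions imposed

# Partial F-statistic: scaled increase in RSS from imposing the restriction.
F = ((rss_sub - rss_full) / q) / (rss_full / (n - p_full))
print(round(F, 3))  # a small F means the simpler model fits adequately
```

In R the same comparison would be done with `anova(fit.sub, fit.full)`; the sketch above just makes the underlying arithmetic explicit.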

Figure: Residual plots for the 2004 election data. They do not indicate any obvious problems, although there is potential nonconstant variance related to whether or not a county used electronic voting.

      These considerations open up a broader spectrum of tools for model building than just hypothesis tests. Best subsets regression algorithms allow for the quick summarization of hundreds or even thousands of potential regression models. The underlying principle of these summaries is the principle of parsimony, which implies a tradeoff of strength of fit versus simplicity: a model should only be as complex as it needs to be. Measures such as the adjusted R-squared, Mallows' Cp, and AICc explicitly provide this tradeoff, and are useful tools in helping to decide when a simpler model is preferred over a more complicated one. An effective model selection strategy uses these measures, as well as hypothesis tests and estimated prediction intervals, to suggest a set of potential “best” models, which can then be considered further. In doing so, it is important to remember that the variability that comes from model selection itself (model selection uncertainty) means that it is likely that several models actually provide descriptions of the underlying population process that are equally valid. One way of assessing the effects of this type of uncertainty is to keep some of the observed data aside as a holdout sample, and then validate the chosen fitted model(s) on that held out data.
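A best subsets search of this kind can be sketched by enumerating all candidate predictor subsets and scoring each with a parsimony‐penalized criterion. The sketch below uses simulated data (the design, coefficients, and sample size are all illustrative assumptions) and a standard Gaussian‐likelihood form of AICc:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 80
# Hypothetical design: three predictors with real effects, two pure noise.
X = rng.normal(size=(n, 5))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, n)

def fit_stats(cols):
    """Adjusted R^2 and AICc for an OLS fit on the given columns."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    p = Xs.shape[1]  # regression parameters, including the intercept
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    # Gaussian-likelihood AIC with small-sample correction; k counts
    # the error variance as one additional estimated parameter.
    k = p + 1
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    return adj_r2, aicc

# Enumerate every non-empty subset of the 5 predictors; pick the AICc minimizer.
subsets = [c for r in range(1, 6) for c in itertools.combinations(range(5), r)]
best = min(subsets, key=lambda c: fit_stats(c)[1])
print(best)
```

Because AICc penalizes each added parameter, the noise predictors tend to be excluded even though they always reduce the raw residual sum of squares. In R, the `leaps` package performs this kind of search far more efficiently for larger predictor sets.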

      A related point increasingly raised in recent years has been focused on issues of replicability, or the lack thereof — the alarming tendency for supposedly established relationships to not reappear as strongly (or at all) when new data are examined. Much of this phenomenon comes from quite valid attempts to find appropriate representations of relationships in a complicated world (including those discussed here and in the next three chapters), but that doesn't alter the simple fact that interacting with data to make models more appropriate tends to make things look stronger than they actually are. Replication and validation of models (and the entire model building process) should be a fundamental part of any exploration of a random process. Examining a problem further and discovering that a previously‐believed relationship does not replicate is not a failure of the scientific process; in fact, it is part of the essence of it.
