Handbook of Regression Analysis With Applications in R. Samprit Chatterjee
This analysis is based on data from Hout et al. (2004) (see also Theus and Urbanek, 2009). The observations are the
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -2.9968     2.0253  -1.480  0.14379
Bush.pct.2000   0.1190     0.0355   3.352  0.00134 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.693 on 65 degrees of freedom
Multiple R-squared: 0.1474, Adjusted R-squared: 0.1343
F-statistic: 11.24 on 1 and 65 DF, p-value: 0.00134
There is a weak, but statistically significant, relationship between 2000 Bush vote and the change in vote to 2004, with counties that went more strongly for Bush in 2000 gaining more in 2004. The constant shift model now adds an indicator variable for whether a county used electronic voting in 2004. The side‐by‐side boxplots in the right plot in Figure 2.4 show that overall the
Coefficients:
              Estimate Std. Error t value Pr(>|t|)   VIF
(Intercept)   -2.12713    2.10315  -1.011  0.31563
Bush.pct.2000  0.10804    0.03609   2.994  0.00391 1.049 **
e.Voting      -1.12840    0.80218  -1.407  0.16437 1.049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.672 on 64 degrees of freedom
Multiple R-squared: 0.173, Adjusted R-squared: 0.1471
F-statistic: 6.692 on 2 and 64 DF, p-value: 0.002295
It can be seen that there is only weak (if any) evidence that the constant shift model provides improved performance over the pooled model. This does not mean that electronic voting is irrelevant, however, as it could be that two separate (unrestricted) lines are preferred.
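The comparison between the pooled and constant shift models can be checked from the two summaries above. The following is a minimal sketch that recomputes the partial F-statistic from the printed R-squared values. The helper name `partial_f` is introduced here for illustration and is not from the book, and because the inputs are rounded printed values the result only approximately matches the square of the e.Voting t-statistic.

```r
# Partial F-test for nested models, computed from their R-squared values.
# partial_f is a hypothetical helper name used only for this illustration.
partial_f <- function(r2_full, r2_reduced, df_added, df_resid_full) {
  f <- ((r2_full - r2_reduced) / df_added) / ((1 - r2_full) / df_resid_full)
  p <- pf(f, df_added, df_resid_full, lower.tail = FALSE)
  c(F = f, p.value = p)
}

# Pooled model (R^2 = 0.1474) versus constant shift model
# (R^2 = 0.173 on 64 residual degrees of freedom), one added predictor
shift_vs_pooled <- partial_f(0.173, 0.1474, df_added = 1, df_resid_full = 64)
print(round(shift_vs_pooled, 3))

# With a single added predictor, sqrt(F) should be close to |t| = 1.407
print(round(sqrt(shift_vs_pooled[["F"]]), 3))
```

Because only one predictor is added, this test is equivalent to the t-test for e.Voting in the constant shift summary, which is why the two p-values agree up to rounding.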
Coefficients:
                     Estimate Std.Error t value Pr(>|t|)   VIF
(Intercept)          -5.23862   2.35084  -2.228 0.029431       *
Bush.pct.2000         0.16228   0.04051   4.006 0.000166  1.44 ***
e.Voting              9.67236   4.26530   2.268 0.026787 32.26 *
Bush.2000 X e.Voting -0.20051   0.07789  -2.574 0.012403 31.10 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.562 on 63 degrees of freedom
Multiple R-squared: 0.2517, Adjusted R-squared: 0.2161
F-statistic: 7.063 on 3 and 63 DF, p-value: 0.0003626
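The fitted coefficients imply two separate regression lines:

$$\text{Change in Bush percentage} = -5.239 + 0.162 \times \text{Bush.pct.2000}$$

for counties that did not use electronic voting in 2004, and

$$\text{Change in Bush percentage} = 4.434 - 0.038 \times \text{Bush.pct.2000}$$

for counties that did use electronic voting. This is represented in Figure 2.5. This relationship implies that in counties that did not use electronic voting, the more Republican a county was in 2000, the larger the gain for Bush in 2004, while in counties with electronic voting the opposite pattern held.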
FIGURE 2.5: Regression lines for election data separated by whether the county used electronic voting in 2004.
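The two lines in Figure 2.5 follow directly from the coefficients of the model with the product variable. As a rough sketch, the code below recovers each group's intercept and slope from the printed estimates and finds where the lines cross. The variable names (`b0` through `b3`, `fitted_lines`, `crossing`) are chosen here for illustration.

```r
# Coefficients as printed in the summary of the model with the product variable
b0 <- -5.23862   # intercept
b1 <-  0.16228   # Bush.pct.2000
b2 <-  9.67236   # e.Voting indicator
b3 <- -0.20051   # product of Bush.pct.2000 and e.Voting

# Implied intercept and slope for each group of counties
fitted_lines <- rbind(
  no.e.voting = c(intercept = b0,      slope = b1),
  e.voting    = c(intercept = b0 + b2, slope = b1 + b3)
)
print(round(fitted_lines, 3))

# 2000 Bush percentage at which the two fitted lines cross:
# the group difference b2 + b3 * x equals zero at x = -b2 / b3
crossing <- -b2 / b3
print(round(crossing, 1))
```

Under these estimates the lines cross at a 2000 Bush percentage of roughly 48, so the fitted change for electronic voting counties is higher than for other counties below that point and lower above it.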
As can be seen from the VIFs, the predictor and the product variable are collinear. This isn't very surprising, since one is a function of the other, and such collinearity is more likely to occur if one of the subgroups is much larger than the other, or if group membership is related to the level or variability of the predictor variable. Given that using the product variable is just a computational construction that allows the fitting of two separate regression lines, this is not a problem in this context.
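The VIF for a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the others, so a VIF of 32.26 corresponds to an R_j^2 of about 0.969. The sketch below illustrates why an indicator and its product variable tend to be collinear. It uses simulated data, not the election data, and the names `x`, `g`, `xg`, and `vif_one` are hypothetical.

```r
set.seed(1)
n  <- 67
x  <- rnorm(n, mean = 55, sd = 10)   # simulated numerical predictor (assumption)
g  <- rbinom(n, 1, 0.25)             # simulated 0/1 group indicator (assumption)
xg <- x * g                          # product variable

# VIF for one predictor: regress it on the remaining predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression
vif_one <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x  = vif_one(x,  data.frame(g, xg)),
  g  = vif_one(g,  data.frame(x, xg)),
  xg = vif_one(xg, data.frame(x, g))
)
print(round(vifs, 2))
```

In this setup the indicator and the product variable should show much larger VIFs than the numerical predictor, since the product is close to a constant multiple of the indicator whenever the predictor varies little relative to its mean.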
This model is probably underspecified, as it does not include control variables that would be expected to be related to voting percentage. Figure 2.6 gives scatter plots of the percentage change in Bush votes versus (a) total county voter turnout in 2000, (b) total county voter turnout in 2004, (c) median income, and (d) the percentage of voters who are Hispanic. None of the marginal relationships are very strong, but in the multiple regression summarized below, median income does seem to add important predictive power without changing the previous relationships between change in Bush voting percentage and 2000 Bush percentage very much.
Coefficients:
                       Estimate Std.Error t val P(>|t|)    VIF
(Intercept)           1.166e+00  2.55e+00  0.46   0.650
Bush.pct.2000         1.639e-01  3.69e-02  4.45  3.9e-5   1.55 ***
e.Voting              1.426e+01  4.84e+00  2.95   0.005  54.08 **
Bush.2000 X e.Voting -2.545e-01  8.47e-02 -3.01   0.004  47.91 **
Vote.turn.2000       -5.957e-06  3.10e-05 -0.19   0.848 210.66
Vote.turn.2004        1.413e-06  2.49e-05  0.06   0.955 205.81
Median.income        -1.745e-04  5.61e-05 -3.11   0.003   1.66 **
Hispan.pop.pct       -4.127e-02  3.18e-02 -1.30   0.200   1.32
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.244 on 59 degrees of freedom
Multiple R-squared: 0.4624, Adjusted R-squared: 0.3986
F-statistic: 7.25 on 7 and 59 DF, p-value: 2.936e-06
FIGURE 2.6: Plots for the 2004 election data. (a) Plot of percentage change in Bush vote versus 2000 voter turnout. (b) Plot of percentage change in Bush vote versus 2004 voter turnout. (c) Plot of percentage change in Bush vote versus median income. (d) Plot of percentage change in Bush vote versus percentage Hispanic voters.
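One way to compare the four fits summarized so far is through adjusted R-squared, which penalizes added predictors. The sketch below recomputes it from the printed R-squared values and predictor counts. The sample size of 67 is inferred from the residual degrees of freedom in the summaries, and the data frame `fits` and its model labels are introduced here for illustration.

```r
# R-squared and number of predictors, as printed in the four summaries
fits <- data.frame(
  model = c("pooled", "constant shift", "separate lines", "with controls"),
  r2    = c(0.1474, 0.173, 0.2517, 0.4624),
  p     = c(1, 2, 3, 7)
)

# Inferred sample size: residual df + predictors + 1 (e.g. 65 + 1 + 1)
n <- 67

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
fits$adj_r2 <- 1 - (1 - fits$r2) * (n - 1) / (n - fits$p - 1)
print(transform(fits, adj_r2 = round(adj_r2, 4)))
```

The recomputed values should agree with the adjusted R-squared figures in the summaries up to rounding of the inputs, and they increase with each successive model.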
We could consider simplifying the model here, but often researchers prefer to not remove control variables, even if they do not add to the fit, so that they can be sure that