Handbook of Regression Analysis With Applications in R - Samprit Chatterjee

FIGURE 2.4: … vote versus 2000 Bush vote. (b) Side‐by‐side boxplots of percentage change in Bush vote by whether or not the county employed electronic voting in 2004.

This analysis is based on data from Hout et al. (2004) (see also Theus and Urbanek, 2009). The observations are the counties of Florida. Although this is not a sample of Florida counties (it is actually a census of all of them), these counties can be considered a sample of all of the counties in the country, so that inferences about the larger population of counties based on this set of counties are meaningful. The target variable is the change in the percentage of votes cast for Bush from 2000 to 2004 (a positive number meaning a higher percentage in 2004). We start with the simple regression model relating the change in Bush percentage to the percentage of votes Bush took in 2000, with the corresponding scatter plot given in the left plot of Figure 2.4. It can be seen that most of the changes are positive, reflecting that Bush carried the state by a much wider margin in 2004 than in the very close result of 2000.

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -2.9968     2.0253  -1.480  0.14379
Bush.pct.2000   0.1190     0.0355   3.352  0.00134 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.693 on 65 degrees of freedom
Multiple R-squared: 0.1474, Adjusted R-squared: 0.1343
F-statistic: 11.24 on 1 and 65 DF, p-value: 0.00134
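The entries in this output are tied together by simple identities, which gives a quick way to check a transcribed table. The R sketch below is an illustration only: it works from the rounded values printed above rather than from the original data, and the variable names are hypothetical. It recomputes the t-statistic and two-sided p-value for the slope, and uses the fact that with a single predictor the overall F-statistic is the square of the slope's t-statistic.

```r
# Rounded values copied from the printed output; not recomputed from raw data.
est      <- 0.1190   # slope estimate for Bush.pct.2000
se       <- 0.0355   # its standard error
df_resid <- 65       # residual degrees of freedom

t_stat <- est / se                       # t = estimate / standard error
p_val  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)  # two-sided
f_stat <- t_stat^2                       # one predictor: overall F = t^2
r2     <- f_stat / (f_stat + df_resid)   # one predictor: R^2 = F / (F + df)

print(round(c(t = t_stat, p = p_val, F = f_stat, R2 = r2), 4))
```

Because the inputs are rounded, the results should agree with the printed output only up to rounding error.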

There is a weak, but statistically significant, relationship between 2000 Bush vote and the change in vote to 2004, with counties that went more strongly for Bush in 2000 gaining more in 2004. The constant shift model now adds an indicator variable for whether a county used electronic voting in 2004. The side‐by‐side boxplots in the right plot in Figure 2.4 show that overall the counties that used electronic voting had smaller gains for Bush than those that did not, but that of course does not take the 2000 Bush vote into account. There are also signs of nonconstant variance, as the variability is smaller among the counties that used electronic voting.

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   VIF
(Intercept)   -2.12713    2.10315  -1.011  0.31563
Bush.pct.2000  0.10804    0.03609   2.994  0.00391 1.049 **
e.Voting      -1.12840    0.80218  -1.407  0.16437 1.049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.672 on 64 degrees of freedom
Multiple R-squared: 0.173, Adjusted R-squared: 0.1471
F-statistic: 6.692 on 2 and 64 DF, p-value: 0.002295

It can be seen that there is only weak (if any) evidence that the constant shift model provides improved performance over the pooled model. This does not mean that electronic voting is irrelevant, however, as it could be that two separate (unrestricted) lines are preferred.

Coefficients:
                     Estimate Std.Error t value Pr(>|t|)   VIF
(Intercept)          -5.23862   2.35084  -2.228 0.029431       *
Bush.pct.2000         0.16228   0.04051   4.006 0.000166  1.44 ***
e.Voting              9.67236   4.26530   2.268 0.026787 32.26 *
Bush.2000 X e.Voting -0.20051   0.07789  -2.574 0.012403 31.10 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.562 on 63 degrees of freedom
Multiple R-squared: 0.2517, Adjusted R-squared: 0.2161
F-statistic: 7.063 on 3 and 63 DF, p-value: 0.0003626
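Whether two separate lines improve on the single pooled line can be gauged with a partial F-test, which can be approximated from the printed R-squared values of the two fits. The R sketch below is a rough check under that assumption: it uses the rounded R-squared values shown in the outputs, so the statistic is only approximate, and the variable names are hypothetical.

```r
# Rounded R-squared values copied from the printed outputs.
r2_pooled <- 0.1474   # pooled model: Bush.pct.2000 only
r2_full   <- 0.2517   # two unrestricted lines (indicator plus product variable)
q         <- 2        # number of coefficients set to zero under the pooled model
df_full   <- 63       # residual degrees of freedom of the larger model

# Partial F: gain in R^2 per restriction, relative to the unexplained
# proportion per residual degree of freedom in the larger model.
f_partial <- ((r2_full - r2_pooled) / q) / ((1 - r2_full) / df_full)
p_partial <- pf(f_partial, df1 = q, df2 = df_full, lower.tail = FALSE)

print(round(c(F = f_partial, p = p_partial), 4))
```

The same formula with `q = 1` applied to the constant shift and pooled fits gives, up to rounding, the square of the t-statistic for `e.Voting` in the constant shift output.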

The t‐test for the product variable indicates that the model with two unrestricted lines is preferred over the model with two parallel lines. A partial F‐test comparing this model to the pooled model also supports two distinct lines,

\[ \widehat{\text{Change in Bush pct}} = -5.239 + 0.162 \times \text{Bush.pct.2000} \]

for counties that did not use electronic voting in 2004, and

\[ \widehat{\text{Change in Bush pct}} = 4.434 - 0.038 \times \text{Bush.pct.2000} \]

for counties that did use electronic voting. This is represented in Figure 2.5. This relationship implies that in counties that did not use electronic voting the more Republican a county was in 2000, the larger the gain for Bush in 2004, while in counties with electronic voting, the opposite pattern held true.
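The two group-specific lines follow directly from the coefficients of the model with the product variable: the indicator coefficient shifts the intercept and the product coefficient shifts the slope. The R sketch below illustrates that arithmetic using the printed estimates; the object names are hypothetical, and the crossing point is added here only as a derived quantity.

```r
# Estimates copied from the printed output of the model with the product variable.
b <- c(intercept = -5.23862, bush2000 = 0.16228,
       evoting = 9.67236, product = -0.20051)

# e.Voting = 0: the indicator and product terms drop out.
line_no_ev <- c(intercept = b[["intercept"]],
                slope     = b[["bush2000"]])

# e.Voting = 1: indicator adds to the intercept, product term adds to the slope.
line_ev <- c(intercept = b[["intercept"]] + b[["evoting"]],
             slope     = b[["bush2000"]] + b[["product"]])

# 2000 Bush percentage at which the two fitted lines cross.
crossing <- -b[["evoting"]] / b[["product"]]

print(round(rbind(no_e_voting = line_no_ev, e_voting = line_ev), 3))
print(round(crossing, 2))
```

The slope for the electronic voting counties is slightly negative, which is the reversal of pattern described in the text.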

FIGURE 2.5: Regression lines for election data separated by whether the county used electronic voting in 2004.

As can be seen from the VIFs (32.26 and 31.10), the electronic voting indicator and the product variable are collinear. This isn't very surprising, since the product variable is a function of the indicator, and such collinearity is more likely to occur if one of the subgroups is much larger than the other, or if group membership is related to the level or variability of the predictor variable. Given that using the product variable is just a computational construction that allows the fitting of two separate regression lines, this is not a problem in this context.
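A VIF can be read as a statement about how well one predictor is explained by the others, since VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The R sketch below inverts that relationship for the VIFs printed in the output for the model with the product variable; the vector names are hypothetical labels.

```r
# VIFs copied from the printed output of the model with the product variable.
vif <- c(Bush.pct.2000 = 1.44, e.Voting = 32.26, product = 31.10)

# Invert VIF = 1 / (1 - R^2) to get the R^2 of each predictor on the others.
r2_j <- 1 - 1 / vif

print(round(r2_j, 3))
```

By this calculation roughly 97% of the variation in the indicator and in the product variable is accounted for by the other predictors, compared with about 31% for the 2000 Bush percentage.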

This model is probably underspecified, as it does not include control variables that would be expected to be related to voting percentage. Figure 2.6 gives scatter plots of the percentage change in Bush votes versus (a) total county voter turnout in 2000, (b) total county voter turnout in 2004, (c) median income, and (d) the percentage of voters who are Hispanic. None of the marginal relationships are very strong, but in the multiple regression summarized below, median income does seem to add important predictive power without changing the previous relationships between change in Bush voting percentage and 2000 Bush percentage very much.

Coefficients:
                       Estimate Std.Error t val P(>|t|)    VIF
(Intercept)           1.166e+00  2.55e+00  0.46   0.650
Bush.pct.2000         1.639e-01  3.69e-02  4.45  3.9e-5   1.55 ***
e.Voting              1.426e+01  4.84e+00  2.95   0.005  54.08 **
Bush.2000 X e.Voting -2.545e-01  8.47e-02 -3.01   0.004  47.91 **
Vote.turn.2000       -5.957e-06  3.10e-05 -0.19   0.848 210.66
Vote.turn.2004        1.413e-06  2.49e-05  0.06   0.955 205.81
Median.income        -1.745e-04  5.61e-05 -3.11   0.003   1.66 **
Hispan.pop.pct       -4.127e-02  3.18e-02 -1.30   0.200   1.32
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.244 on 59 degrees of freedom
Multiple R-squared: 0.4624, Adjusted R-squared: 0.3986
F-statistic: 7.25 on 7 and 59 DF, p-value: 2.936e-06

FIGURE 2.6: Plots for the 2004 election data. (a) Plot of percentage change in Bush vote versus 2000 voter turnout. (b) Plot of percentage change in Bush vote versus 2004 voter turnout. (c) Plot of percentage change in Bush vote versus median income. (d) Plot of percentage change in Bush vote versus percentage Hispanic voters.
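When weighing models of different sizes, the adjusted R-squared values in the four outputs can be compared directly, and each can be reproduced from the corresponding R-squared. The R sketch below is a minimal check, assuming n = 67 counties (implied by 65 residual degrees of freedom in the two-parameter pooled fit); the model labels are hypothetical shorthand.

```r
n <- 67  # number of counties, implied by the residual degrees of freedom

# R-squared values and numbers of predictors copied from the printed outputs.
fits <- data.frame(
  model = c("pooled", "constant shift", "two lines", "with controls"),
  r2    = c(0.1474, 0.173, 0.2517, 0.4624),
  p     = c(1, 2, 3, 7)
)

# Adjusted R^2 penalizes R^2 for the number of predictors p.
fits$adj_r2 <- 1 - (1 - fits$r2) * (n - 1) / (n - fits$p - 1)

print(fits)
```

Since the inputs are rounded, the recomputed values should match the printed adjusted R-squared figures only to about three decimal places.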

We could consider simplifying the model here, but often researchers prefer to not remove control variables, even if they do not add to the fit, so that they can be sure that
