Applied Univariate, Bivariate, and Multivariate Statistics. Daniel J. Denis
Чтение книги онлайн.
Читать онлайн книгу Applied Univariate, Bivariate, and Multivariate Statistics - Daniel J. Denis страница 40
where Rx and Ry are the ranks on xi and yi for the ith individual in the data,
> cor.test(parent, child, method = "spearman") Spearman's rank correlation rho data: parent and child S = 76569964, p-value < 2.2e-16 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.4251345
We see that rs of 0.425 is slightly less than was Pearson r of 0.459.
To understand why Spearman's rank correlation and Pearson coefficient differ, consider data (Table 2.5) on the rankings of favorite movies for two individuals. In parentheses are subjective scores of “favorability” of these movies, scaled 1–10, where 1 = least favorable and 10 = most favorable.
From the table, we can see that Bill very much favors Star Wars (rating of 10) while least likes Batman (rating of 2.1). Mary's favorite movie is Scarface (rating of 9.7) while her least favorite movie is Batman (rating of 7.6). We will refer to these subjective scores in a moment. For now, we focus only on the ranks. For instance, Bill's ranking of Scarface is third, while Mary's ranking of Star Wars is third.
Table 2.5 Favorability of Movies for Two Individuals in Terms of Ranks
Movie | Bill | Mary |
---|---|---|
Batman | 5 (2.1) | 5 (7.6) |
Star Wars | 1 (10.0) | 3 (9.0) |
Scarface | 3 (8.4) | 1 (9.7) |
Back to the Future | 4 (7.6) | 4 (8.5) |
Halloween | 2 (9.5) | 2 (9.6) |
Actual scores on the favorability measure are in parentheses.
To compute Spearman's rs in R the “long way,” we generate two vectors that contain the respective rankings:
> bill <- c(5, 1, 3, 4, 2) > mary <- c(5, 3, 1, 4, 2)
Because the data are already in the form of ranks, both Pearson r and Spearman rho will agree:
> cor(bill, mary) [1] 0.6 > cor(bill, mary, method = “spearman”) > 0.6
Note that by default, R returns the Pearson correlation coefficient. One has to specify method = “spearman”
to get rs. Consider now what happens when we correlate, instead of rankings, the actual subjective favorability scores corresponding to the respective ranks. When we plot the favorability data, we obtain:
> bill.sub <- c(2.1, 7.6, 8.4, 9.5, 10.0) > mary.sub <- c(7.6, 8.5, 9.0, 9.6, 9.7) > plot(mary.sub, bill.sub)
Note that though the relationship is not perfectly linear, each increase in Bill's subjective score is nonetheless associated with an increase in Mary's subjective score. When we compute Pearson's r on this data, we obtain:
> cor(bill.sub, mary.sub) [1] 0.9551578
However, when we compute rs, we get:
> cor(bill.sub, mary.sub, method = "spearman") [1] 1
Spearman's rs is equal to 1.0 because the rankings of movie preferences are perfectly monotonically increasing (i.e., for each increase in movie preference along the abscissa corresponds an increase in movie preference along the ordinate). In the case of Pearson's, the correlation is less than 1.0 because r captures the linear relationship among variables and not simply a monotonically increasing one. Hence, a high magnitude coefficient for Spearman's essentially tells us that two variables are “moving together,” but it does not necessarily imply the relationship is a linear one. A similar test that measures rank correlation is that of Kendall's rank‐order correlation. See Siegel and Castellan (1988, p. 245) for details.
2.20 STUDENT'S t DISTRIBUTION
The density for Student's t is given by (Shao, 2003):
where Γ is the gamma function and v are degrees of freedom. For small degrees of freedom v, the t distribution is quite distinct from the standard normal. However, as degrees of freedom increase, the t distribution converges to that of a normal density (Figure 2.11). That is, in the limit, f(t) → f(z), or a bit more formally,
The fact that t converges to z for large degrees of freedom but is quite distinct from z for small degrees of freedom is one reason why t distributions are often used for small sample problems. When sample size is large, and so consequently are degrees of freedom, whether one treats a random variable as t or z will make little difference in terms of computed p‐values and decisions on respective null hypotheses. This is a direct consequence of the convergence of the two distributions for large degrees of freedom. For a historical overview of how t‐distributions came to be, consult Zabell (2008).
2.20.1 t‐Tests for One Sample
When we perform hypothesis testing using the z distribution, we assume we have knowledge of the population variance σ2. Having direct knowledge of σ2 is the most ideal and preferable of circumstances. When we know σ2, we can compute the standard error of the mean directly as