Applied Univariate, Bivariate, and Multivariate Statistics. Daniel J. Denis
var.test yields a p-value of 0.11, which under most circumstances would be considered insufficient reason to doubt the null hypothesis of equal variances. Hence, the Welch adjustment on the variances was probably not needed in this case, as there was no evidence of unequal variances to begin with.
The same test is easily carried out in SPSS by requesting (output not shown):
t-test groups = grade(0 1) /variables = studytime.
A classic nonparametric equivalent to the independent-samples t-test is the Wilcoxon rank-sum test. It is a useful test to run either when distributional assumptions are known to be violated, or when they are unknown and the sample size is too small for the central limit theorem to come to the "rescue." The test compares rankings across the two samples instead of the actual scores. For a brief overview of how the test works, see Kirk (2008, Chapter 18) and Howell (2002, pp. 707–717); for a more thorough introduction to nonparametric tests in general, see the following chapter on ANOVA in this book, or consult Denis (2020) for a succinct chapter with demonstrations using R. We can request the test quite easily in R:
> wilcox.test(grade.0, grade.1)

        Wilcoxon rank sum test

data:  grade.0 and grade.1
W = 0, p-value = 0.007937
alternative hypothesis: true location shift is not equal to 0
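The logic behind the reported W statistic is worth seeing directly. In the form R uses, W counts the pairs (one score from each group) in which the first group's score exceeds the second's, with ties counted as one half. The grade.0 and grade.1 data are not reproduced in this excerpt, so the vectors below are hypothetical; the sketch is in Python for illustration rather than R:

```python
# Minimal sketch of the rank-sum (Mann-Whitney) statistic in the form
# reported by R's wilcox.test. The x and y data below are hypothetical,
# chosen only to reproduce the W = 0 pattern seen in the output above.

def rank_sum_W(x, y):
    """Count pairs (xi, yj) with xi > yj, counting ties as 1/2.

    Equivalent to summing the pooled ranks of x and subtracting
    len(x) * (len(x) + 1) / 2.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w

# If every score in the first group falls below every score in the
# second, W = 0 -- complete separation, as in the output above.
x = [40, 45, 50]   # hypothetical group scores
y = [60, 65, 70]   # hypothetical group scores
print(rank_sum_W(x, y))  # -> 0.0
print(rank_sum_W(y, x))  # -> 9.0, i.e., len(x) * len(y)
```

A W of 0 (or its maximum, the product of the two sample sizes) indicates complete separation of the two groups' rankings, which is why the p-value above is so small despite the small samples.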
We see that the obtained p‐value still suggests we reject the null hypothesis, though the p‐value is slightly larger than for the Welch‐corrected parametric test.
2.21 STATISTICAL POWER
Power, first and foremost, is a probability. Power is the probability of rejecting a null hypothesis given that the null hypothesis is false. It is equal to 1 − β (i.e., 1 minus the type II error rate). If the null hypothesis were true, then regardless of how much power one has, one would still not be able to reject the null. We may think of it somewhat in terms of the sensitivity of a statistical test for detecting the falsity of the null hypothesis. If the test is not very sensitive to departures from the null (i.e., in terms of a particular alternative hypothesis), we will not detect such departures. If the test is very sensitive to such departures, then we will correctly detect these departures and be able to infer the statistical alternative hypothesis in question.
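To make the definition concrete, power can be computed directly for simple tests. The sketch below uses Python's standard library and the known-σ z test (simpler than the t tests used above, since no degrees of freedom are involved); the numerical values are hypothetical:

```python
from math import sqrt
from statistics import NormalDist  # Python standard library

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z test with sigma known.

    delta is the true departure from the null value. Power is the
    probability that the test statistic lands in the rejection
    region, i.e., 1 - beta.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # critical value (about 1.96)
    ncp = delta * sqrt(n) / sigma        # shift of the test statistic
    return nd.cdf(-z_crit + ncp) + nd.cdf(-z_crit - ncp)

# When the null is true (delta = 0), the rejection probability is not
# "power" at all -- it collapses to the type I error rate alpha:
print(round(power_two_sided_z(0.0, 1.0, 25), 4))   # -> 0.05
# When the null is false, power is the probability of detecting that:
print(power_two_sided_z(0.5, 1.0, 25))
```

Note how the first call illustrates the point in the text: when the null is true, no amount of "power" produces correct rejections, and the rejection probability is simply α.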
A useful analogy for understanding power is to think of a sign on a billboard that reads "H0 is false." Are you able to detect the sign with the glasses or contact lenses you are currently wearing? If not, you lack sufficient power. That is, you lack the sensitivity in your instrument (your reading glasses) to correctly detect the falsity of the null hypothesis and, in so doing, be in a position to reject it. Alternatively, if you have 20/20 vision, you will be able to detect the false null with ease and reject it with confidence. A key point to note here is that if H0 is false, it is false regardless of your ability to detect it, analogous to a virus strain being present but biomedical engineers lacking a powerful enough microscope to see it. If the null is false, the only question that remains is whether or not you will have a powerful enough test to detect its falsity. If the null were not false, on the other hand, then regardless of your degree of power, you will not be able to detect its falsity (because it is not false to begin with).
Power is a function of four elements, all of which will be featured in our discussion of the p‐value toward the conclusion of this chapter:
1 The value hypothesized under the statistical alternative hypothesis, H1. All else equal, a greater distance between H0 and H1 means greater power. Though “distance” in this regard is not a one‐to‐one concept with effect size, the spirit of the two concepts is the same. The greater the scientific effect, the more power you will have to detect that effect. This is true whether we are dealing with mean differences in ANOVA‐type models or testing a null hypothesis of the sort H0 : R2 = 0 in regression. In all such cases, we are seeking to detect a deviation from the null hypothesis.
2 The significance level, or type I error rate (α) at which you set your test. All else equal, a more liberal setting such as 0.05 or 0.10 affords more statistical power than a more conservative setting such as 0.01 or 0.001, for instance. It is easier to detect a false null if you allow yourself more of a risk of committing a type I error. Since we usually want to minimize type I error, we typically want to regard α as fixed at a nominal level (e.g., 0.05 or 0.01) and consider it not amenable to adjustment for the purpose of increasing power. Hence, when it comes to boosting power, researchers usually do not want to “mess with” the type I error rate.
3 Population variability, σ2, often unknown but estimated by s2. All else equal, the greater the variance of objects studied in the population, the less sensitive the statistical test, and the less power you will have. Why is this so? As an analogy, consider a rock thrown into the water. The rock will make a definitive particular “splash” in that it will displace a certain amount of water when it hits the surface. This can be considered to be the “effect size” of the splash. If the water is noisy with wind and waves (i.e., high population variability), it will be difficult to detect the splash. If, on the other hand, the water is calm and serene (i.e., low population variability), you will more easily detect the splash. Either way, the rock made a particular splash of a given size. The magnitude of the splash is the same regardless of whether the waters are calm or turbulent. Whether we can detect the splash or not is in part a function of the variance in the population.
Applying this concept to research settings, if you are sampling from "noisy" populations, it is harder to see the effect of your independent variable than if you are sampling from less noisy, and thus less variable, populations. This is why research using lab rats or other equally controllable objects can usually detect effects with relatively few animals in a sample, whereas research studying humans on variables such as intelligence, anxiety, attitudes, etc., usually requires many more subjects in order to detect effects. A good way to boost power is to study populations that have relatively low variability before your treatment is administered. If your treatment works, you will be able to detect its efficacy with fewer subjects than if dealing with a highly variable population. Another approach is to covary out one or two factors that are thought to be related to the dependent variable through a technique such as the analysis of covariance (Keppel and Wickens, 2004), discussed and demonstrated later in the book.
4 Sample size, n. All else equal, the greater the sample size, the greater the statistical power. Boosting sample size is a common strategy for increasing power. Indeed, as will be discussed at the conclusion of this chapter, for any significance test in which there is at least some effect (i.e., some distance between the null and alternative), statistical significance is assured for a large-enough sample size. Obtaining large samples is a good thing (since, after all, the most ideal goal would be to have the actual population), but as sample size increases, the p-value becomes an increasingly poor indicator of the size of the experimental effect. Effect sizes should always be reported alongside any significance test.
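All four elements can be seen at work by varying them one at a time in a power calculation. The sketch below again uses the two-sided known-σ z test as a stand-in for the t tests discussed above, with hypothetical baseline values:

```python
from math import sqrt
from statistics import NormalDist  # Python standard library

def power(delta, sigma, n, alpha=0.05):
    """Two-sided z-test power (sigma known); hypothetical illustration."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = delta * sqrt(n) / sigma
    return nd.cdf(-z_crit + ncp) + nd.cdf(-z_crit - ncp)

base = power(delta=0.5, sigma=1.0, n=30)   # hypothetical baseline scenario
print(base)

# 1. Greater distance between H0 and H1 -> more power
print(power(0.8, 1.0, 30) > base)               # True
# 2. More liberal alpha -> more power
print(power(0.5, 1.0, 30, alpha=0.10) > base)   # True
# 3. Greater population variability -> less power
print(power(0.5, 2.0, 30) < base)               # True
# 4. Larger sample size -> more power
print(power(0.5, 1.0, 60) > base)               # True
```

Each comparison changes exactly one of the four elements while holding the others fixed, mirroring the "all else equal" phrasing used throughout the list.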
2.21.1 Visualizing Power
Figure 2.12, adapted from Bollen (1989), depicts statistical power under competing values for detecting the population parameter θ. Note carefully in the figure that the critical value for the test remains constant as a result of our desire to keep the type I error rate constant. It is the distance from θ = 0 to θ = C1 or θ = C2 that determines power (the shaded region in distributions (b) and (c)).
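The figure's logic can be mimicked numerically: hold the critical value fixed (so that the type I error rate stays constant) and move the alternative θ farther from zero. A minimal sketch, using a one-sided z test with the standard error set to 1 and hypothetical values standing in for C1 and C2:

```python
from statistics import NormalDist  # Python standard library

nd = NormalDist()
crit = nd.inv_cdf(0.95)   # critical value held fixed (one-sided alpha = .05)

def power_at(theta):
    """P(test statistic > crit) when its sampling distribution is
    centered at theta. The standard error is set to 1 here, so theta
    is the alternative expressed in standard-error units."""
    return 1 - nd.cdf(crit - theta)

c1, c2 = 1.0, 3.0   # hypothetical alternatives, with C2 farther from 0 than C1
print(power_at(0.0))                  # under H0: rejection probability = alpha
print(power_at(c1) < power_at(c2))    # True: greater distance, greater power
```

With the critical value pinned down by α, the shaded rejection region captures more and more of the alternative distribution as θ moves away from zero, which is exactly the pattern depicted in distributions (b) and (c).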
Statistical power matters so long as we have the inferential goal of rejecting null hypotheses. A study that is underpowered risks not being able to reject null hypotheses even if such