Applied Univariate, Bivariate, and Multivariate Statistics. Daniel J. Denis


      The area we are interested in is that at or above 2.5 (the area where the arrow is pointing). Since we know the total area under the normal density is equal to 1, we can subtract pnorm(2.5, 0, 1) from 1:

      > 1 - pnorm(2.5, 0, 1)
      [1] 0.006209665

Graph depicts the shaded area under the standard normal distribution at or above a z‐score of 2.5 standard deviations.

      We see, then, that the percentage of students scoring higher than Mary is approximately 0.6% (i.e., the proportion multiplied by 100). What proportion of students scored better than John in his class? Recall that his z‐score was equal to −2.5. Because the normal distribution is symmetric, we already know the area lying below −2.5 is the same as that lying above 2.5. This means that approximately 99.38% of students scored higher than John. Hence, Mary drastically outperformed her colleague when we consider their scores relative to their respective classes. Be careful to note that in drawing these conclusions, we had to assume each score (John's and Mary's) came from a normal distribution. The mere fact that we transformed their raw scores to z‐scores in no way normalizes their raw distributions. Standardization standardizes, but it does not normalize.
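      We can verify this symmetry numerically in R; the area below −2.5 matches the area above 2.5 computed earlier, and its complement gives the proportion scoring higher than John:

      > pnorm(-2.5, 0, 1)
      [1] 0.006209665
      > 1 - pnorm(-2.5, 0, 1)
      [1] 0.9937903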

      One can also easily verify that approximately 68% of cases in a normal distribution lie within −1 and +1 standard deviations of the mean, while approximately 95% of cases lie within −2 and +2 standard deviations.
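      These figures are easily confirmed in R by differencing the relevant cumulative probabilities:

      > pnorm(1) - pnorm(-1)
      [1] 0.6826895
      > pnorm(2) - pnorm(-2)
      [1] 0.9544997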

      2.1.1 Plotting Normal Distributions

      We can plot normal densities in R by generating a sequence of values along the abscissa and evaluating the normal density at each value:

      > x <- seq(from = -3, to = +3, length.out = 100)
      > plot(x, dnorm(x))

      Graph depicts the plot of the normal density.
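      To compare several normal densities on the same axes, one option is base R's curve function; the following is a minimal sketch, with means and standard deviations chosen arbitrarily for illustration:

      > curve(dnorm(x, mean = 0, sd = 1), from = -6, to = 6, ylab = "density")  # standard normal
      > curve(dnorm(x, mean = 0, sd = 2), add = TRUE, lty = 2)  # larger standard deviation
      > curve(dnorm(x, mean = 2, sd = 1), add = TRUE, lty = 3)  # shifted mean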

      Distributions (and densities) of a single variable typically go by the name of univariate distributions to distinguish them from distributions of two (bivariate) or more variables (multivariate).

      > install.packages("HistData")
      > library(HistData)
      > attach(Galton)
      > Galton
         parent child
      1    70.5  61.7
      2    68.5  61.7
      3    65.5  61.7
      4    64.5  61.7
      5    64.0  61.7
      6    67.5  62.2
      7    67.5  62.2
      8    67.5  62.2
      9    66.5  62.2
      10   66.5  62.2

      We first install the package using the install.packages function. The library statement loads the package HistData into R's search path. From there, we attach the Galton data to insert the object (dataframe) into the search list. We generate a histogram of parent height:

      > hist(parent, main = "Histogram of Parent Height")

      Histogram depicting parent height versus frequency.
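      In the spirit of Fisher's classic overlay of a normal density on empirical observations (a figure reproduced later in this section), we might rescale the histogram to the density scale and superimpose a normal curve. A minimal sketch, assuming the Galton data are still attached and using the sample mean and standard deviation of parent:

      > hist(parent, freq = FALSE, main = "Histogram of Parent Height")
      > curve(dnorm(x, mean = mean(parent), sd = sd(parent)), add = TRUE)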

      2.1.2 Binomial Distributions

      The binomial distribution is given by:

$$p(r) = \binom{n}{r} p^r (1 - p)^{n - r} = \frac{n!}{r!\,(n - r)!}\, p^r (1 - p)^{n - r}$$

      where,

       p(r) is the probability of observing r occurrences out of n possible occurrences,2

       p is the probability of a “success” on any given trial, and

       1 − p is the probability of a failure on any given trial, often simply referred to by “q” (i.e., q = 1 − p).
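      For example, with r = 2 successes in n = 5 trials and p = 0.5 (the case computed with dbinom below), the formula gives:

$$p(2) = \binom{5}{2} (0.5)^2 (1 - 0.5)^{5 - 2} = 10 \times 0.25 \times 0.125 = 0.3125$$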

      The binomial setting provides an ideal context to demonstrate the essentials of hypothesis‐testing logic, as we will soon see. In a binomial setting, the following conditions must hold:

       The variable under study must be binary in nature. That is, the outcome of the experiment can result in only one category or another. That is, the outcome categories are mutually exclusive. For instance, the flipping of a coin has this characteristic, because the coin can either come up “head” or “tail” and nothing else (yes, we are ruling out the possibility that it lands on its side, and I think it is safe to do so).

       The probability of a “success” on each trial remains constant (or stationary) from trial to trial. For example, if the probability of head is equal to 0.5 on our first flip, we assume it is also equal to 0.5 on the second, third, fourth flips, and so on.

       Each trial is independent of each other trial. That is, the fact that we get a head on our first flip of the coin in no way changes the probability of getting a head or tail on the next flip, and so on for the other flips (i.e., no outcome is ever “due” to occur, as the gambler sometimes believes).

      We can easily demonstrate hypothesis testing in a binomial setting using R. For instance, let us return to the coin‐flipping experiment. Suppose you would like to know the probability of obtaining two heads on five flips of a fair coin, where each flip is assumed to have a probability of heads equal to 0.5. In R, we can compute this as follows:

      > dbinom(2, size = 5, prob = 0.5)
      [1] 0.3125

      where dbinom calls the “density for the binomial,”

      Histogram depicting Fisher's overlay of normal density on empirical observations. Source: Fisher (1925, 1934).
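      As a quick consistency check, we can reproduce this value from the binomial formula directly, using choose for the binomial coefficient, and we can also request the entire distribution of 0 through 5 heads at once:

      > choose(5, 2) * 0.5^2 * (1 - 0.5)^(5 - 2)
      [1] 0.3125
      > dbinom(0:5, size = 5, prob = 0.5)
      [1] 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125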
