Applied Univariate, Bivariate, and Multivariate Statistics. Daniel J. Denis

β is the slope parameter, which we also assume to be fixed, and ε is a vector of errors ε1 to εn (we use ε here instead of E).

      Suppose now we want to add a second response variable. Because of the generality of (2.7), this can be easily accommodated:

equation

      Performing inferential tests to help draw conclusions about population parameters is useful, but ultimately the findings of a statistical analysis should make their way into a graph or other visualization. Data visualization is a field in itself, and with the advent of modern computing power, possibilities exist today that could only be dreamt of in the past. Simple visualizations such as histograms, boxplots, and scatterplots can be useful in depicting findings but also in helping to verify assumptions that underlie the statistical model one is using. For example, since many tests of normality and equality of variances (and covariances) are relatively sensitive to the types of data to which they are applied, researchers will often generate simple plots in order to detect potential gross violations of such assumptions. We feature such techniques throughout the book.
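      As a minimal sketch of such an informal check, the following draws a histogram and a normal Q-Q plot for one variable. It assumes R's built-in iris data (used later in this section) and base graphics; the choice of variable is ours for illustration only.

```r
# Informal visual check of approximate normality (illustrative choice of variable)
setosa <- iris$Sepal.Length[iris$Species == "setosa"]

# Histogram: look for gross skew or multiple modes
hist(setosa, main = "Sepal length (setosa)", xlab = "Sepal length")

# Normal Q-Q plot: points far from the reference line suggest non-normality
qq <- qqnorm(setosa)
qqline(setosa)
```

      Plots of this kind do not replace a formal test; they simply make gross departures from the assumed shape easy to spot.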

      For graphical displays meant to communicate findings (rather than test assumptions), Friendly (2000) puts the field into context:

      Designing good graphics is surely an art, but as surely, it is one that ought to be informed by science. ... In this view, an effective graphical display, like good writing, requires an understanding of its purpose – what aspects of the data are to be communicated to the viewer. In writing, we communicate most effectively when we know our audience and tailor the message appropriately. (p. 8)

      In high‐dimensional space, the challenge of graphical approaches is to summarize data into lower dimensions, while still retaining most of the information in the original data. We feature some such plots in later chapters. For a thorough account of data visualization, see datavis.ca (Friendly, 2020). For sophisticated graphics using R, consult Wickham (2009).

      For now, it is useful to briefly review some basic plots with which the reader is likely already familiar.

      

      2.27.1 Box‐and‐Whisker Plots

      The boxplot was a contribution of John Tukey (1977) in the spirit of what is called exploratory data analysis, or “EDA,” which encouraged scientists to spend more of their energy on descriptive techniques instead of focusing exclusively on confirmatory statistical tests. Boxplots of parent heights from Galton's data appear below:

      The boxplot provides what is generally known as a five‐number summary of a distribution. Most of these numbers can be obtained with the summary function in R:

      > summary(parent)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        64.00   67.50   68.50   68.31   69.50   73.00

      Recall that the median is the point in the ordered data that divides the data set into two equal parts. The location of the median is computed as (n + 1)/2. Galton's data contain 928 observations, so the median lies at position 464.5 (i.e., (928 + 1)/2) in the ordered data set, that is, midway between the 464th and 465th ordered values. For parent, this value is equal to 68.50. The first and third quartiles represent the 25th and 75th percentiles and are 67.50 and 69.50, respectively. We can also compute the range as

      > range(parent)
      [1] 64 73
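      The (n + 1)/2 rule for locating the median can be checked by hand. The sketch below is an illustration only: the vector x holds hypothetical values (not Galton's data), chosen so that n is even and the median falls between two ordered observations.

```r
# Location of the median for a sample of size n = 928, as in the text
n   <- 928
loc <- (n + 1) / 2   # 464.5: midway between the 464th and 465th ordered values

# Hypothetical toy vector with an even number of observations
x      <- c(64, 67.5, 68.5, 69.5, 70, 73)
sorted <- sort(x)
pos    <- (length(x) + 1) / 2   # 3.5

# Average the two ordered values on either side of the fractional position
by_hand <- mean(sorted[c(floor(pos), ceiling(pos))])

# Agrees with R's built-in median()
by_hand == median(x)
```

      When n is odd, floor(pos) and ceiling(pos) coincide, so the same expression simply returns the middle ordered value.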

      We can also generate boxplots by category. Throughout the book, we use Fisher's iris data (Fisher, 1936) in which flower characteristics such as sepal and petal length are categorized by species of flower. We plot sepal length by species:

      > library(lattice)
      > attach(iris)
      > bwplot(Sepal.Length ~ Species)

      [Figure: boxplots of Sepal.Length for the species setosa, versicolor, and virginica.]
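      The numbers behind each of these boxplots can also be tabulated directly. The following is a minimal sketch using base R's fivenum function on the built-in iris data; the object names by_species and five are ours. Note that fivenum reports Tukey's hinges, which can differ slightly from the quartiles returned by summary.

```r
# Split sepal length into one vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Five-number summary per species: one column per species, one row per statistic
five <- sapply(by_species, fivenum)
rownames(five) <- c("min", "lower hinge", "median", "upper hinge", "max")

five
```

      Reading down a column gives the five values that the corresponding box and whiskers are built from (apart from any points flagged as outliers).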

      Stem‐and‐leaf plots are also easily produced. These visual displays are a kind of “naked histogram,” because they reveal the actual observations in the data while also providing information about their frequency of occurrence. In 1710, John Arbuthnot analyzed data on the ratios of male to female births in London from 1629 to 1710 and in so doing made an argument for these births being a function of a “divine being” (Arbuthnot, 1710; Shoesmith, 1987). One of his variables was the number of male christenings (i.e., baptisms) over the period 1629–1710. We generate a stem‐and‐leaf plot in R of these male christenings using package aplpack (Wolf and Bielefeld, 2014), in which the “leaves” are the corresponding hundreds. For example, in the following plot, the first value of 2|8 would appear to represent a value of 2800 but is rounded down from the actual value in the data (which is also the minimum) of 2890. The maximum in the data is actually equal to 8426, but is represented by 8400 (i.e., the final 4 in 8|0012334):
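      The rounding down described above can be reproduced with integer arithmetic. This sketch is not the aplpack call used for the plot; it only splits the two values quoted in the text (the minimum 2890 and the maximum 8426) into a thousands "stem" and a hundreds "leaf". The variable names are ours.

```r
# The two values quoted in the text: the minimum and maximum of male christenings
values <- c(2890, 8426)

# Stem: thousands digit; leaf: hundreds digit, with tens and units discarded
stems  <- values %/% 1000
leaves <- (values %% 1000) %/% 100

# Display in stem|leaf form: "2|8" and "8|4"
paste(stems, leaves, sep = "|")

# Value each stem-and-leaf pair appears to represent: 2800 and 8400
stems * 1000 + leaves * 100
```

      Because the tens and units are dropped rather than rounded, 2890 shows up as 2|8 and not 2|9.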

      The workhorse for establishing statistical evidence in the social and natural sciences is the method of null hypothesis significance testing (or “NHST” for short). However, since its inception with R.A. Fisher in the early 1900s, the significance test has been the topic of much debate, both statistical and philosophical. Throughout much of this book, NHST is regularly used to evaluate null hypotheses in methods such as the analysis of variance, regression, and various multivariate procedures. Indeed, the procedure is used in nearly all statistical methods.

      It behooves us then, before embarking on all of these methodologies, to discuss the nature of the null hypothesis significance test, and clearly demonstrate what it actually means, not only in a statistical context but also in how it should be interpreted in a research or substantive context.

      The purpose of this final section of the present chapter is to provide a clear and concise demonstration and summary of the factors that influence the size of a computed p‐value in virtually every statistical significance test. Understanding why statements such as
