Statistics. David W. Scott
However, in a re‐analysis of these data, we have included the shuttle flights that experienced no O‐ring failures. Now the final frame suggests that two or more O‐ring failures are quite likely at 28–36
1.2.3 Pearson's Father–Son Height Data Revisited
We have explored the two variables in this dataset individually, but there is an obvious question of how accurately a son's height can be predicted knowing his father's height. In the first frame of Figure 1.6, we display a scatter diagram of the
In the top right frame, we have placed a red dot at the location of the average heights of the fathers and sons. We have also drawn a straight line fit using the intuitive equation
Galton (1886) was one of the first to observe that many scatter diagrams arising in nature have an appearance similar to that in Figure 1.6. He noted that the shape appeared elliptical, so he superimposed elliptical contours over the scatter diagram. The bottom left frame in Figure 1.6 shows three (nested) ellipses for these data. Recall that a general ellipse has five parameters: two for the center of the ellipse; two for the horizontal and vertical scales; and a fifth called the eccentricity. Galton focused on this fifth parameter, and the correlation coefficient was the result. Ironically, this parameter is often referred to today as Pearson's correlation coefficient.
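As a rough illustration of the coefficient that grew out of Galton's work, the sketch below computes the sample correlation from the usual product-moment formula. The function name pearson_r and the simulated heights are assumptions made for this example; the simulated values are a stand-in and are not Pearson's data.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample correlation coefficient of paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered cross-product and sums of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sample is constant.
    return sxy / math.sqrt(sxx * syy)


# Synthetic stand-in for father-son heights in inches (NOT Pearson's data).
# The means, slope, and spreads below are arbitrary illustrative choices.
rng = random.Random(0)
fathers = [rng.gauss(68.0, 2.7) for _ in range(1000)]
sons = [69.0 + 0.5 * (f - 68.0) + rng.gauss(0.0, 2.4) for f in fathers]

r = pearson_r(fathers, sons)
print(f"correlation of the synthetic sample: {r:.3f}")
```

A value near +1 or -1 corresponds to a long, thin ellipse, while a value near 0 corresponds to an ellipse whose axes line up with the coordinate axes.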
Figure 1.6 Father–son height data collected by Karl Pearson.
In the final frame, we take advantage of the large sample size to try to understand whether the prediction (as weak as it may be) might be linear or nonlinear. For integer values of the rounded fathers' heights, we compute a three‐point summary of the corresponding sons' heights. The red dots are the arithmetic averages of the sons' heights. The vertical lines display the (conditional) interquartile range (IQR). The final two red dots on each end are based on only a few points, so the IQR cannot be computed. These four red dots are shown in a smaller size to indicate that even the averages are not very reliable.
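The conditional summary described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name conditional_summary, the min_count cutoff below which the quartiles are skipped, and the default quantile method of Python's statistics module are all choices made for this example, and the heights are simulated rather than taken from Pearson's data.

```python
import random
import statistics
from collections import defaultdict


def conditional_summary(fathers, sons, min_count=5):
    """Summarize sons' heights within each rounded father's height.

    Returns {rounded_height: (mean, q1, q3)}. The quartiles are None when a
    group has fewer than min_count points (a hypothetical cutoff; it should
    be at least 2 so that quartiles can be computed at all).
    """
    groups = defaultdict(list)
    for f, s in zip(fathers, sons):
        groups[round(f)].append(s)  # round to the nearest integer height

    summary = {}
    for height in sorted(groups):
        values = groups[height]
        mean = statistics.fmean(values)
        if len(values) >= min_count:
            q1, _, q3 = statistics.quantiles(values, n=4)
        else:
            q1 = q3 = None  # too few points for a meaningful IQR
        summary[height] = (mean, q1, q3)
    return summary


# Synthetic stand-in for the father-son heights (NOT Pearson's data).
rng = random.Random(0)
fathers = [rng.gauss(68.0, 2.7) for _ in range(1000)]
sons = [69.0 + 0.5 * (f - 68.0) + rng.gauss(0.0, 2.4) for f in fathers]

for height, (mean, q1, q3) in conditional_summary(fathers, sons).items():
    iqr = "n/a" if q1 is None else f"{q1:.1f} to {q3:.1f}"
    print(f"father {height} in: mean son {mean:.1f}, IQR {iqr}")
```

If the conditional means fall close to a straight line as the father's height increases, that supports a linear fit; systematic curvature would point to a nonlinear one.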
We see that these summary points clearly suggest a linear rather than a nonlinear fit. We also see that the two blue reference lines from the second frame, namely
1.2.4 Discussion
These rather substantial examples illustrate the search for structure in distribution and prediction problems, as well as practical problems and cures that may be encountered. A more formal statistical approach to these questions will be introduced in the third part of this course. Probability theory will be the theoretical basis for many of these models, so we make it the focus of the next few chapters.
Problems
1.1 A frequency histogram of continuous data is constructed by counting the number of data points that fall into equally spaced bins of width $h$; $h$ is called the bin width. Typically the bin edges are 0, $h$, $2h$, $3h$, and so on. If the bin count in the $k$th bin is denoted by $\nu_k$, then the frequency histogram is defined as
(1.1) $\mathrm{FH}(x) = \nu_k$ for $x$ in the $k$th bin.
Show that the total area of the frequency histogram is $nh$, where $n = \sum_k \nu_k$. Hint: the histogram is made up of rectangular blocks of width $h$ and height $\nu_k$.
A probability histogram is defined to have total area of one. Show that the following definition of a histogram has area equal to one:
(1.2) $\hat{f}(x) = \dfrac{\nu_k}{nh}$ for $x$ in the $k$th bin.
1.2 One of the most famous epidemiological cases occurred in 1854 when Dr. John Snow successfully tracked down the source of an outbreak of cholera in the London suburb of Soho. He mapped the households of some 500 victims, over a 10‐day period, who lived within a quarter of a mile of each other. However, many tens of thousands had died of cholera in England during the prior two decades. Dr. Snow believed contaminated water was a primary cause. Just as in the Space Shuttle example, there are choices of an appropriate time interval and geographical extent that can influence our conclusions. Using the descriptions and maps conveniently assembled at http://www.ph.ucla.edu/epi/snow/snowcricketarticle.html, discuss the evidence and the choices that were made and that could have been made. Hint: these data have been conveniently collected in the CRAN library HistData by Friendly (2018). Look at the help file for the dataset snow and its example code.
1.3 The Tukey power transformation of a variable $x$ is $x^\lambda$ for any non‐zero $\lambda$. To better understand why $\log x$ is used in place of $x^\lambda$ when $\lambda = 0$, we consider the linear re‐expression of the Tukey transformation given by the formula (Box and Cox (1964))
(1.3) $\dfrac{x^\lambda - 1}{\lambda}$
Since (1.3) is $0/0$ when $\lambda = 0$, use l'Hôpital's rule to find the limit transformation as $\lambda \to 0$. The scatter diagram using either formula for fixed non‐zero $\lambda$ will be visually identical. Formula (1.3) is referred to as the Box–Cox transformation; see Figure 1.7.
Sometimes the transformation $\log(1 + x)$ is used in place of $\log x$ when $x \ge 0$ and can take on the value 0. In this case, the original and transformed values of 0 are both 0. Try this form on the body–brain data and compare to Figure 1.4.
Figure 1.7 Box–Cox transformation on natural and log scales.
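Problems 1.1 and 1.3 both lend themselves to a quick numerical sanity check before attempting the proofs. The sketch below is one way to do that, not a solution: it writes h for the bin width and n for the total count, takes bin k to cover the interval from k*h up to (k+1)*h, and uses the standard Box–Cox form (x^lambda - 1)/lambda. The function names, the simulated sample, and the particular values of h, x, and lambda are assumptions made for this example.

```python
import math
import random

# --- Problem 1.1: areas of the frequency and probability histograms ---


def bin_counts(data, h):
    """Count the points in each bin; bin k covers [k*h, (k+1)*h)."""
    counts = {}
    for x in data:
        k = math.floor(x / h)
        counts[k] = counts.get(k, 0) + 1
    return counts


def frequency_area(counts, h):
    """Total area of rectangles of width h and height equal to the bin count."""
    return sum(h * c for c in counts.values())


def probability_area(counts, h):
    """Total area after each height is divided by n*h."""
    n = sum(counts.values())
    return sum(h * (c / (n * h)) for c in counts.values())


# Synthetic sample (NOT data from the book); h = 0.5 is an arbitrary choice.
rng = random.Random(1)
data = [rng.gauss(68.0, 2.7) for _ in range(500)]
h = 0.5
counts = bin_counts(data, h)
print("frequency area:", frequency_area(counts, h), "vs n*h:", len(data) * h)
print("probability area:", probability_area(counts, h))

# --- Problem 1.3: the Box-Cox form as lambda approaches 0 ---


def box_cox(x, lam):
    """(x**lam - 1)/lam for lam != 0; log(x) is used at lam == 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


x = 5.0
for lam in (1.0, 0.1, 0.01, 0.001):
    print(f"lambda={lam}: {box_cox(x, lam):.6f}  (log x = {math.log(x):.6f})")

# log(1 + x) maps 0 to 0, which is why it is convenient when x can be 0.
print("log1p(0) =", math.log1p(0.0))
```

Agreement in these checks does not replace the algebra asked for in the problems, but a mismatch would signal an error in a definition or in the bookkeeping.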