Statistics. David W. Scott. Mathematics.


Would you have supported the decision to launch? A least‐squares line (discussed in Chapter 8.5) is superimposed. This line suggests that, if anything, lower temperatures might result in fewer O‐ring failures. Thus the launch was attempted.

However, in a re‐analysis of these data, we have included the shuttle flights that experienced no O‐ring failures. Now the final frame suggests that two or more O‐ring failures are quite likely at 28–36 °F. The question of including or excluding data is a difficult problem in practice. In other settings, including non‐event data can bias the analysis in the wrong direction. As we saw in the brain–body weight data, excluding the two or three outliers was not necessary. However, in Rayleigh's nitrogen data, excluding an entire cluster of outliers as bad data would have postponed the discovery of argon.

1.2.3 Pearson's Father–Son Height Data Revisited

We have explored the two variables in this dataset individually, but there is an obvious question of how accurately a son's height can be predicted knowing his father's height. In the first frame of Figure 1.6, we display a scatter diagram of the pairs. This diagram clearly shows a positive tilt, consistent with the expectation that the sons of tall fathers are tall, and vice versa; however, the strength of the relationship does not seem as strong as in the brain–body weight dataset.

In the top right frame, we have placed a red dot at the location of the average heights of the fathers and sons. We have also drawn a straight line fit using the intuitive equation $y = x$, where $x$ is the father's height and $y$ is the son's height, both in inches. However, the equation $y = x + 1$ is an improvement, since we observed earlier that sons were 1 inch taller than their fathers on average. As a reference, we have also included a horizontal line at the average height of the sons. This line would be appropriate if there were no information about a son's height to be gleaned from his father's height; but a positive relationship (correlation) is clear.

Galton (1886) was one of the first to observe that many scatter diagrams observed in nature have an appearance similar to that in Figure 1.6. He noted that the shape appeared elliptical, so he superimposed elliptical contours over the scatter diagram. The bottom left frame in Figure 1.6 shows three (nested) ellipses for these data. Recall that a general ellipse has five parameters: two for the center of the ellipse; two for the horizontal and vertical scales; and a fifth called the eccentricity. Galton focused on this fifth parameter, and the correlation coefficient was the result. Ironically, this parameter is often referred to today as Pearson's correlation coefficient.
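For reference, the sample correlation coefficient that grew out of Galton's work is usually written as shown here. The notation is an assumption rather than something taken from the text: $(x_i, y_i)$ stands for the $i$th father–son pair of heights, $n$ for the number of pairs, and $\bar{x}$, $\bar{y}$ for the two averages.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of $r$ always lies between $-1$ and $1$, with $|r|$ near 1 corresponding to points that fall close to a straight line.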

Figure 1.6 Father–son height data collected by Karl Pearson.

In the final frame, we take advantage of the large sample size to try to understand if the prediction (as weak as it may be) might be linear or nonlinear. For integer values of the rounded fathers' heights, we compute a three‐point summary of the corresponding sons' heights. The red dots are the arithmetic averages of the sons' heights. The vertical lines display the (conditional) interquartile range (IQR). The final two red dots on each end are based on only a few points, so the IQR cannot be computed. These four red dots are drawn at a smaller size to indicate that even the averages are not so reliable.
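The three‐point summary described above can be sketched in Python. This is a minimal illustration, not the author's code: the function name `conditional_summary`, the cutoff `min_count`, and the quartile convention used by `statistics.quantiles` are all assumptions, and the six data pairs in the usage lines are made up purely to exercise the function.

```python
from collections import defaultdict
from math import floor
from statistics import mean, quantiles


def conditional_summary(fathers, sons, min_count=5):
    """Summarize sons' heights for each rounded father's height.

    Returns {rounded_height: (q1, average, q3)}. The quartiles are None
    when a group has fewer than `min_count` points (an assumed cutoff),
    mirroring the end groups for which no IQR is drawn.
    """
    groups = defaultdict(list)
    for f, s in zip(fathers, sons):
        # Round to the nearest inch, with halves rounded up.
        groups[floor(f + 0.5)].append(s)

    summary = {}
    for height in sorted(groups):
        values = groups[height]
        if len(values) >= min_count:
            q1, _median, q3 = quantiles(values, n=4)
        else:
            q1 = q3 = None
        summary[height] = (q1, mean(values), q3)
    return summary


if __name__ == "__main__":
    # Made-up heights in inches, for illustration only.
    fathers = [65.2, 64.9, 65.4, 65.0, 64.6, 70.1]
    sons = [66, 67, 65, 68, 66.5, 71]
    for height, (q1, avg, q3) in conditional_summary(fathers, sons).items():
        print(height, q1, avg, q3)
```

Plotting the averages against the rounded heights, with a vertical segment from `q1` to `q3` wherever both are available, would reproduce the layout of the final frame.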

We see that these summary points clearly suggest a linear rather than a nonlinear fit. We also see that the two blue reference lines from the second frame, namely the identity line (son's height equal to father's height) and the same line shifted up by 1 inch, both miss badly. A new (dashed) line with a slope of 1/2 appears to capture the linear trend quite well. The relationship between this slope and the correlation coefficient, as well as a genetic explanation, will be discussed in Chapter 4.1.5.
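As a hedged preview of that relationship, the standard least‐squares identity states that the fitted slope equals the correlation coefficient times the ratio of the two standard deviations, so the slope is close to the correlation whenever fathers' and sons' heights have similar spread. The sketch below checks the identity numerically; the function name `least_squares_line` and the four data pairs are assumptions chosen for illustration, not values from the Pearson dataset.

```python
from math import sqrt


def least_squares_line(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the averages
    r = sxy / sqrt(sxx * syy)            # sample correlation coefficient
    return slope, intercept, r


if __name__ == "__main__":
    # Made-up pairs with equal spread in x and y, for illustration only.
    x = [1, 2, 3, 4]
    y = [2, 1, 4, 3]
    slope, intercept, r = least_squares_line(x, y)
    print(slope, intercept, r)
```

Because the two made-up variables have the same spread, the fitted slope and the correlation coefficient coincide here; with unequal spreads they differ by the factor $s_y / s_x$.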

1.2.4 Discussion

These rather substantial examples illustrate the search for structure in distribution and prediction problems, as well as practical problems and cures that may be encountered. A more formal statistical approach to these questions will be introduced in the third part of this course. Probability theory will be the theoretical basis for many of these models, so we make it the focus of the next few chapters.

Problems

1.1 A frequency histogram of continuous data is constructed by counting the number of data points that fall into equally spaced bins of width $h$; $h$ is called the bin width. Typically the bin edges are 0, $\pm h$, $\pm 2h$, $\pm 3h$, and so on. If the bin count in the $k$th bin is denoted by $\nu_k$, then the frequency histogram is defined as

$$\hat{f}(x) = \nu_k \quad \text{for } x \text{ in the } k\text{th bin}. \tag{1.1}$$

Show that the total area of the frequency histogram is $nh$, where $n = \sum_k \nu_k$. Hint: the histogram is made up of rectangular blocks of width $h$ and height $\nu_k$.

A probability histogram is defined to have total area of one. Show that the following definition of a histogram has area equal to one:

$$\hat{f}(x) = \frac{\nu_k}{nh} \quad \text{for } x \text{ in the } k\text{th bin}. \tag{1.2}$$

1.2 One of the most famous epidemiological cases occurred in 1854, when Dr. John Snow successfully tracked down the source of an outbreak of cholera in the London district of Soho. He mapped the households of some 500 victims over a 10‐day period, all of whom lived within a quarter of a mile of each other. However, many tens of thousands had died of cholera in England during the prior two decades. Dr. Snow believed contaminated water was a primary cause. Just as in the Space Shuttle example, there are choices of an appropriate time interval and geographical extent that can influence our conclusions. Using the descriptions and maps conveniently assembled at http://www.ph.ucla.edu/epi/snow/snowcricketarticle.html, discuss the evidence and the choices that were and could have been made. Hint: these data have been conveniently collected in the CRAN package HistData by Friendly (2018). Look at the help file for dataset Snow and its example code.

1.3 The Tukey power transformation of a variable $x$ is $x^\lambda$ for any non‐zero $\lambda$. To better understand why $\log(x)$ is used in place of $x^\lambda$ when $\lambda = 0$, we consider the linear re‐expression of the Tukey transformation given by the formula (Box and Cox (1964))

$$\frac{x^\lambda - 1}{\lambda}. \tag{1.3}$$

Since (1.3) is $0/0$ when $\lambda = 0$, use l'Hôpital's rule to find the limit transformation as $\lambda \to 0$. The scatter diagram using either formula for fixed non‐zero $\lambda$ will be visually identical. Formula (1.3) is referred to as the Box–Cox transformation; see Figure 1.7.

Sometimes the transformation $\log(1 + x)$ is used in place of $\log(x)$ when $\lambda = 0$ and $x$ can take on the value 0. In this case, the original and transformed values of 0 are both 0. Try this form on the brain–body weight data and compare to Figure 1.4.

Figure 1.7 Box–Cox transformation on natural and log scales.
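The claims in Problems 1.1 and 1.3 can be sanity‐checked numerically before attempting the proofs. The sketch below is only a check under stated assumptions, not a solution: the function names are hypothetical, bin $k$ is assumed to cover the interval from $kh$ up to but not including $(k+1)h$, and the seven data values are made up.

```python
from collections import Counter
from math import floor, log


def bin_counts(data, h):
    """Count the points in each bin, where bin k covers [k*h, (k+1)*h)."""
    return Counter(floor(x / h) for x in data)


def frequency_area(counts, h):
    """Total area of the blocks of width h and height equal to the count."""
    return sum(h * count for count in counts.values())


def probability_area(counts, h):
    """Total area when each block height is count / (n * h)."""
    n = sum(counts.values())
    return sum(h * (count / (n * h)) for count in counts.values())


def box_cox(x, lam):
    """Box-Cox re-expression (x**lam - 1) / lam, with log(x) at lam == 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return log(x)
    return (x ** lam - 1.0) / lam


if __name__ == "__main__":
    data = [0.3, 1.2, 1.7, 2.4, 2.6, 2.9, -0.5]  # made-up values
    h = 0.5
    counts = bin_counts(data, h)
    print(frequency_area(counts, h), len(data) * h)  # both should equal n*h
    print(probability_area(counts, h))               # should be close to 1

    # As lam shrinks toward 0, the re-expression approaches log(x).
    for lam in (1.0, 0.1, 0.001, 1e-6):
        print(lam, box_cox(2.0, lam), log(2.0))
```

Changing `h` or the data leaves the first two printed values equal and the probability area near one, which is what the two proofs in Problem 1.1 should establish in general.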

End of the preview excerpt.

Text provided by LitRes LLC.
