Statistics. David W. Scott

Would you have supported the decision to launch? A least‐squares line (discussed in Chapter 8.5) is superimposed. This line suggests that, if anything, lower temperatures might result in fewer O‐ring failures. Thus the launch was attempted.

      1.2.3 Pearson's Father–Son Height Data Revisited

      In the top right frame, we have placed a red dot at the location of the average heights of the fathers and sons. We have also drawn a straight line fit using the intuitive equation $y = x$, where $x$ is the father's height and $y$ is the son's height. However, the equation $y = x + 1$ is an improvement, since we observed earlier that sons were 1 inch taller than their fathers on average. As a reference, we have also included a horizontal line at the average height of the sons. This line would be appropriate if there were no information about a son's height to be gleaned from his father's height; but a positive relationship (correlation) is clear.

[Figure: scatter plots of the father and son height data collected by Karl Pearson.]

      In the final frame, we take advantage of the large sample size to try to understand whether the prediction (as weak as it may be) might be linear or nonlinear. For integer values of the rounded fathers' heights, we compute a three‐point summary of the corresponding sons' heights. The red dots are the arithmetic averages of the sons' heights. The vertical lines display the (conditional) interquartile range (IQR). The two outermost red dots at each end are based on only a few points, so the IQR cannot be computed. These four red dots are drawn at a smaller size to indicate that even the averages are not very reliable.
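The three‐point summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `conditional_summary`, the cutoff `min_count`, and the choice of the "inclusive" quartile rule are all assumptions made here, and the inputs are assumed to be two equal‐length sequences of heights in inches.

```python
import math
from collections import defaultdict
from statistics import mean, quantiles


def conditional_summary(fathers, sons, min_count=5):
    """Summarize sons' heights for each rounded father's height.

    Returns {rounded_height: (q1, average, q3)}. The quartiles are None
    when a group has fewer than min_count points (min_count is a
    hypothetical cutoff, not a value taken from the text).
    """
    groups = defaultdict(list)
    for f, s in zip(fathers, sons):
        # Round half up to the nearest inch (Python's round() rounds
        # halves to the nearest even integer, which we avoid here).
        groups[math.floor(f + 0.5)].append(s)

    summary = {}
    for height in sorted(groups):
        values = groups[height]
        if len(values) >= min_count:
            # quantiles(..., n=4) returns the three quartile cut points.
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = q3 = None  # too few points for a meaningful IQR
        summary[height] = (q1, mean(values), q3)
    return summary
```

Plotting the averages against the rounded heights, with a vertical segment from `q1` to `q3` wherever the quartiles are available, would reproduce the kind of display described for the final frame.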

      We see that these summary points clearly suggest a linear rather than a nonlinear fit. We also see that the two blue reference lines from the second frame, namely $y = x$ and $y = x + 1$ (with $x$ the father's height and $y$ the son's), both miss badly. A new (dashed) line with a slope of 1/2 appears to capture the linear trend quite well. The relationship between this slope and the correlation coefficient, as well as a genetic explanation, will be discussed in Chapter 4.1.5.

      1.2.4 Discussion

      These rather substantial examples illustrate the search for structure in distribution and prediction problems, as well as practical problems and cures that may be encountered. A more formal statistical approach to these questions will be introduced in the third part of this course. Probability theory will be the theoretical basis for many of these models, so we make it the focus of the next few chapters.

      1.1 A frequency histogram of continuous data is constructed by counting the number of data points that fall into equally spaced bins of width $h$; $h$ is called the bin width. Typically the bin edges are $0$, $\pm h$, $\pm 2h$, $\pm 3h$, and so on. If the bin count in the $k$th bin is denoted by $\nu_k$, then the frequency histogram is defined as

            $\text{histogram}(x) = \nu_k$ for $x$ in the $k$th bin.   (1.1)

      Show that the total area of the frequency histogram is $nh$, where $n = \sum_k \nu_k$. Hint: the histogram is made up of rectangular blocks of width $h$ and height $\nu_k$.

      A probability histogram is defined to have total area of one. Show that the following definition of a histogram has area equal to one:

            $\text{histogram}(x) = \dfrac{\nu_k}{nh}$ for $x$ in the $k$th bin.   (1.2)
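A quick numerical check of the two area claims in Problem 1.1 may help before attempting the proof. The sketch below is only an illustration under stated assumptions: bin $k$ is taken to cover $[kh, (k+1)h)$, and the helper names `bin_counts`, `frequency_area`, and `probability_area` are made up for this example.

```python
import math
from collections import Counter


def bin_counts(data, h):
    """Count the points in each bin; bin k is assumed to be [k*h, (k+1)*h)."""
    return Counter(math.floor(x / h) for x in data)


def frequency_area(counts, h):
    """Total area of the frequency histogram: blocks of width h, height nu_k."""
    return sum(h * nu for nu in counts.values())


def probability_area(counts, h):
    """Total area when each block has height nu_k / (n * h)."""
    n = sum(counts.values())
    return sum(h * nu / (n * h) for nu in counts.values())
```

For any sample and any positive bin width, `frequency_area` should return the sample size times `h`, and `probability_area` should return one up to floating‐point rounding. The proof asked for in the problem explains why.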

      1.2 One of the most famous epidemiological cases occurred in 1854, when Dr. John Snow successfully tracked down the source of an outbreak of cholera in the London district of Soho. He mapped the households of some 500 victims over a 10‐day period, all of whom lived within a quarter of a mile of each other. However, many tens of thousands had died of cholera in England during the prior two decades. Dr. Snow believed contaminated water was a primary cause. Just as in the Space Shuttle example, the choices of an appropriate time interval and geographical extent can influence our conclusions. Using the descriptions and maps assembled at http://www.ph.ucla.edu/epi/snow/snowcricketarticle.html, discuss the evidence and the choices that were made and that could have been made. Hint: these data have been conveniently collected in the CRAN package HistData by Friendly (2018). Look at the help file for dataset snow and its example code.

      1.3 The Tukey power transformation of a variable $x$ is $x^{\lambda}$ for any non‐zero $\lambda$. To better understand why $\log(x)$ is used in place of $x^{\lambda}$ when $\lambda = 0$, we consider the linear re‐expression of the Tukey transformation given by the formula (Box and Cox (1964))

            $\dfrac{x^{\lambda} - 1}{\lambda}$.   (1.3)

      Since (1.3) is $0/0$ when $\lambda = 0$, use l'Hôpital's rule to find the limit transformation as $\lambda \to 0$. The scatter diagram using either formula for fixed non‐zero $\lambda$ will be visually identical. Formula (1.3) is referred to as the Box–Cox transformation; see Figure 1.7.

      Sometimes the transformation $\log(1 + x)$ is used in place of $\log(x)$ when $x \ge 0$ and $x$ can take on the value 0. In this case, the original and transformed values of 0 are both 0. Try this form on the body–brain data and compare to Figure 1.4.

      Figure 1.7 Box–Cox transformation on natural and log scales.
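The behavior of formula (1.3) near $\lambda = 0$ can also be explored numerically. The following Python sketch is an illustration only; the function names `box_cox`, `gap_to_log`, and `shifted_log` are hypothetical, and the case $\lambda = 0$ is defined as $\log(x)$ following the statement in Problem 1.3 that the logarithm is used there.

```python
import math


def box_cox(x, lam):
    """Formula (1.3): (x**lam - 1) / lam, with log(x) used at lam == 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def gap_to_log(x, lam):
    """Absolute difference between formula (1.3) and log(x)."""
    return abs(box_cox(x, lam) - math.log(x))


def shifted_log(x):
    """log(1 + x) for x >= 0, which maps 0 to 0."""
    if x < 0:
        raise ValueError("x must be non-negative")
    return math.log1p(x)
```

Evaluating `gap_to_log(x, lam)` for a fixed positive `x` and a decreasing sequence such as `lam = 1e-2, 1e-4, 1e-6` shows the gap shrinking toward zero, which is the numerical counterpart of the limit the problem asks you to derive.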

      End of the preview excerpt.

      Text provided by LitRes LLC.
