Practical Data Analysis with JMP, Third Edition. Robert Carver
We can also change the scale of the horizontal axis interactively. Initially, JMP set the left and right endpoints, and the limits changed when we chose uniform scaling. Suppose we want the axis to begin at 30 and end at 85.
6. Move the cursor to the left end of the horizontal axis, and notice that the hand now points to the left (this is true whether you have previously chosen the hand tool or not). Click and drag the cursor slowly left and right, and see that you are scrunching or stretching the axis values. Stop when the minimum value is 30.
7. Move the cursor to the right end of the axis, and similarly set the maximum at 85 years just by dragging the cursor.
Finally, we can “pan” along the axis. Think of the border around the graph as a camera’s viewfinder through which we see just a portion of the entire infinite axis.
8. Without holding the mouse button, move the cursor toward the middle of the axis until the hand points upward. Now click and drag to the left or right, and you will pan along the axis.
9. Alternatively, rather than clicking and dragging to change axis attributes, you can directly edit all “Axis Settings” by double-clicking on the axis itself. This opens a dialog box where you can specify a variety of settings.
Exploring Further with the Graph Builder
Our original data table contains values for 12 years, and we have now compared the variation in life expectancy for two years. The Graph Builder can allow us to make a quick visual comparison over 12 years.
1. First, we want to clear our earlier filtering so that we can now access all years. Choose Rows ► Clear Row States to deselect, show, and include all rows.
2. Select Graph ► Graph Builder.
3. Drag LifeExp to the X drop zone.
4. Find the menu bar at the top of the Graph Builder window, locate the Histogram button, and click it.
5. Drag Year to the Wrap drop zone and click the Done button. Your graph should look like Figure 3.9.
Figure 3.9: Longer Lives in Most of the World, 1960 to 2015
What do you see as you inspect these small multiple histograms? Can you see life expectancies gradually getting longer in most countries? There were two peaks in 1960: many countries with short lives, and many with longer lives. The lower peak slowly flattened out as the entire distribution crept rightward.
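To make the idea of small multiples concrete, here is a minimal Python sketch of what such a display computes: one histogram per year, all using the same bins. This is not JMP code. The function name `histograms_by_group`, the bin width of 10 years, and the data values are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def histograms_by_group(rows, bin_width=10):
    """Count values per bin, separately for each group (one panel per group)."""
    panels = defaultdict(Counter)
    for group, value in rows:
        # Every panel uses the same bin boundaries so the panels are comparable.
        bin_start = int(value // bin_width) * bin_width
        panels[group][bin_start] += 1
    return panels

# Made-up (year, life expectancy) pairs, for illustration only.
rows = [
    (1960, 38.0), (1960, 42.5), (1960, 45.0), (1960, 68.0), (1960, 70.5),
    (2015, 58.0), (2015, 71.0), (2015, 74.5), (2015, 79.0), (2015, 82.0),
]

bin_width = 10
panels = histograms_by_group(rows, bin_width)
for year in sorted(panels):
    print(year)
    for bin_start in sorted(panels[year]):
        count = panels[year][bin_start]
        print(f"  {bin_start}-{bin_start + bin_width}: " + "#" * count)
```

Because every panel shares the same bins, a rightward shift from one year to the next is easy to spot, which is the comparison the wrapped histograms are designed to support.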
Summary Statistics for a Single Variable
Graphs are an ideal way to summarize a large data set and to communicate a great deal of information about a distribution. We can also describe variation in a quantitative variable with summary statistics (also called summary measures or descriptive statistics). Just as a distribution has shape, center, and dispersion, we have summary statistics that capture information about the shape, center, or dispersion of a variable.
Let’s look back at the distribution report for our sample of 2015 life expectancies in 198 countries of the world. Just to the right of the histogram, we find a table of Quantiles followed by a list of Summary Statistics.
Figure 3.10: Quantiles and Summary Statistics
Quantile is a generic term; you might be more familiar with percentiles. When we sort observations in a data set, divide them into groups of observations, and locate the boundaries between the groups, we establish quantiles. When there are 100 such groups, the boundaries are called percentiles. If there are four such groups, we refer to quartiles.
For example, we find that the 90th percentile is 81.54 years. This means that 90% of the observations have life expectancies shorter than 81.54 years. JMP also labels five quantiles known as the five-number summary. They identify the minimum, maximum, 25th percentile (1st quartile or Q1), 50th percentile (median), and 75th percentile (3rd quartile or Q3). Of the 198 countries listed in the data table, one-fourth have life expectancies shorter than 66.43 years, and one-fourth have life expectancies longer than 77.49 years.
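As a rough illustration of how quantiles are located, the following Python sketch computes a five-number summary and a 90th percentile with the standard library. The data values are made up, and the function name `five_number_summary` is hypothetical. The `"inclusive"` interpolation method is only one of several conventions, so the results will not necessarily match JMP's to the last decimal.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 divides the data into four groups; the three boundaries are the quartiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Made-up life expectancies in years, for illustration only.
sample = [52.0, 58.5, 61.0, 66.0, 70.5, 72.0, 74.5, 77.0, 80.0, 83.5]

summary = five_number_summary(sample)
print("min, Q1, median, Q3, max:", summary)

# n=100 divides the data into 100 groups; boundary number 90 (index 89)
# is the 90th percentile.
percentiles = statistics.quantiles(sample, n=100, method="inclusive")
p90 = percentiles[89]
print("90th percentile:", p90)
```

About one-fourth of the values fall below Q1 and about one-fourth fall above Q3, which is the same reading applied to the 198 countries in the text.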
Summary Statistics refer to the common descriptive statistics shown in Figure 3.10. At this stage in your study of statistics, three of these statistics are useful, and the other three should wait until Chapter 8.
● The mean is the simple arithmetic average of the observations, usually denoted by the symbol $\bar{x}$.
Along with the median, it is commonly used as a measure of central tendency; in a symmetric distribution, the mean and median are quite close in value. When a distribution is strongly left-skewed like this one, the mean will tend to be smaller than the median. In a right-skewed distribution, the opposite will be true.
● The standard deviation (Std Dev) is a measure of dispersion, and you might think of it as a typical distance of a value from the mean of the distribution. It is usually represented by the symbol s, and is computed as follows:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
We will have more to say about the standard deviation in later chapters, but for now, please note that it must be greater than or equal to zero, and that highly dispersed variables have larger standard deviations than consistent variables.
● n refers to the number of observations in the sample.
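The three statistics above can be sketched in a few lines of Python. This is an illustrative calculation on made-up, left-skewed values, not output from JMP. The function name `summary_statistics` is hypothetical, and the standard deviation is the usual sample version with divisor n - 1.

```python
import math
import statistics

def summary_statistics(values):
    """Return the mean, sample standard deviation, and n for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance: the sum of squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return {"mean": mean, "std_dev": math.sqrt(variance), "n": n}

# Made-up values in years: a few low values form a long left tail.
sample = [48.0, 55.0, 68.0, 72.0, 74.0, 75.0, 76.0, 78.0, 79.0, 81.0]

stats = summary_statistics(sample)
median = statistics.median(sample)
print(stats)
print("median:", median)
print("mean below median (left-skewed):", stats["mean"] < median)
```

In this toy sample the two low values pull the mean below the median, which is the pattern described for a left-skewed distribution.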
Outlier Box Plots
Now that we have discussed the five-number summary, we can interpret a box plot. The key to interpreting an outlier box plot is to recognize that it is a diagram of the five-number summary. Here is a typical example:
In a box plot, there is a rectangle with an intersecting line. Two edges of the rectangle are located at the first (Q1) and third (Q3) quartile values, and the line is located at the median. In other words, the rectangular box spans the interquartile range (IQR). Extending from the ends of the box are two lines called whiskers. In a distribution that is free of outliers, the whiskers reach to the minimum and maximum values. Otherwise, the plot limits the reach of the whiskers by the upper and lower fences, which are located 1.5 IQRs from each quartile. In this illustration, we have a cluster of seven low-value outliers.
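The arithmetic behind the box, fences, and whiskers can be sketched as follows. This is a minimal Python illustration on made-up data, with a hypothetical function name `box_plot_parts`. It assumes that each whisker stops at the most extreme observation that still lies inside its fence, and it uses the standard library's `"inclusive"` quartile method, which might differ slightly from JMP's.

```python
import statistics

def box_plot_parts(values):
    """Compute the quartiles, IQR, fences, whisker ends, and outliers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences sit 1.5 IQRs beyond each quartile.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Whiskers reach the most extreme observations inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Made-up values in years with one unusually low observation.
sample = [40.0, 66.0, 68.0, 70.0, 71.0, 72.0, 73.0, 74.0, 76.0, 78.0, 80.0]

parts = box_plot_parts(sample)
for name, value in parts.items():
    print(f"{name}: {value}")
```

With no observations beyond the fences, the whisker ends would simply be the minimum and maximum, matching the outlier-free case described above.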
JMP also adds two other features to the box plot. One is a diamond that represents the location of the mean. If you imagine a vertical line through the top and bottom vertices of the diamond, you have located the mean. The left and right vertices are positioned at the lower and upper confidence limits of the mean. We will discuss those in Chapter 11.
The second additional feature is a red bracket above the box. This is the shortest half bracket, representing the smallest part of the number line comprising 50% of the cases. We can divide the observations in half in