Practical Data Analysis with JMP, Third Edition. Robert Carver
5. For a numerical comparison, let’s look at some quantiles. In the Oneway Analysis report, click the red triangle and select Quantiles.
6. The quantiles table below the graph should look like Figure 4.10.
Figure 4.10: Quantiles for Speed by Phase of Flight
The table reports seven quantiles for each flight phase. We readily see strong similarities between speeds at the time of strikes for take-off run and landing roll, the phases when the aircraft is on the ground. Quantiles during climb and approach are also comparable, but speeds during descent are uniformly higher than in the other phases.
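For readers who want to see the arithmetic behind such a table, here is a minimal sketch in Python. It assumes the seven quantiles are the minimum, 10%, 25%, median, 75%, 90%, and maximum. The speeds and the `seven_quantiles` helper are hypothetical, invented for illustration, and the interpolation rule used by Python's `statistics.quantiles` may differ slightly from the one JMP applies.

```python
from statistics import quantiles

# Hypothetical speeds (knots) at the time of a strike, grouped by flight phase.
# These values are invented for illustration; they are not from the book's data.
speeds_by_phase = {
    "Take-off run": [80, 95, 100, 110, 120, 125, 140],
    "Climb": [130, 140, 150, 160, 170, 180, 200],
    "Descent": [180, 200, 210, 230, 250, 260, 280],
}

def seven_quantiles(values):
    """Return min, 10%, 25%, median, 75%, 90%, and max for a list of numbers."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%; index k-1 is the (5*k)% point.
    cuts = quantiles(values, n=20, method="inclusive")
    return {
        "min": min(values),
        "10%": cuts[1],
        "25%": cuts[4],
        "median": cuts[9],
        "75%": cuts[14],
        "90%": cuts[17],
        "max": max(values),
    }

# One row of quantiles per flight phase, similar in layout to the report.
quantile_table = {phase: seven_quantiles(v) for phase, v in speeds_by_phase.items()}
for phase, row in quantile_table.items():
    print(phase, row)
```

Reading across the rows of `quantile_table` one phase at a time mirrors the comparison made in the text: if every quantile for one group exceeds the matching quantile for another, the first group is uniformly higher.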
Describing Covariation: Two Continuous Variables
To illustrate the standard bivariate methods for continuous data, we will now shift to a different set of data. In earlier chapters, we looked at variation in life expectancy around the world. We’ll now look at data related to variation in birth rates and the risks of childbirth in different nations as of 2017. We will rely on a data table with five continuous columns. Two of the columns measure the relative frequency of births in each country, and the other three measure risks to mothers and babies around birth. Initially, we will look at two of the variables: the columns labeled BirthRate and MortMaternal2016. A country’s annual birth rate is defined as the number of live births per 1,000 people in the country. The maternal mortality figure is the average number of mothers who die as a result of childbirth, per 100,000 births. At the time of this writing, the most current birth rate data was for 2017, but for maternal mortality, it was 2016.
As we did in the previous sections, let’s start by simply looking at the univariate summaries of these two variables.
1. Open the Birthrate 2017 data table.
2. Select Analyze ► Distribution. Cast BirthRate and MortMaternal2016 in the Y role and click OK.
3. Within your BirthRate histogram, select different bars or groups of bars, and notice which bars are selected in the maternal mortality histogram. The results should look much like Figure 4.11.
Figure 4.11: Linked Histograms of Two Continuous Distributions
The general tendency is that countries with low birth rates also have low maternal mortality rates, but as one rate increases so does the other. We can see this tendency more directly in a scatterplot, an X-Y graph showing ordered pairs of values.
4. In the data table, press the Esc key to deselect the rows that you selected by clicking on histogram bars. Then choose Analyze ► Fit Y by X. Cast MortMaternal2016 as Y and BirthRate as X and click OK.
Your results will look like those shown in Figure 4.12.
Figure 4.12: A Scatterplot
By now you also have had enough experience with Graph Builder to know that you can easily create a similar graph with that tool. You should feel free to explore data with Graph Builder.
This graph provides more information about the ways in which these two variables tend to vary together. First, it is very clear that these two variables do indeed vary together in a general pattern that curves upward from left to right; maternal mortality increases at an accelerating rate as the birth rate increases, though there are some countries that depart from the pattern. Many countries are concentrated in the lower left, with low birth rates and relatively low maternal mortality.
As we will learn in Chapters 15 through 19, there are powerful techniques available for building models of patterns like the one visible in Figure 4.12. At this early point in your studies, the curvature of the pattern presents some unwelcome complications. Figure 4.13 shows another pair of variables whose relationship is more linear. We will investigate this relationship and meet three common numerical summaries of such bivariate covariation.
5. Click the red triangle next to the word Bivariate in the current window and select Redo ► Relaunch Analysis.
6. Remove MortMaternal2016 from the Y, Response role and replace it with Fertil. This will produce a scatterplot (seen in modified fashion in Figure 4.13).
Earlier, we noted that birth rate counts the number of live births per 1,000 people in a country. Another measure of the frequency of births is the fertility rate, which is the mean number of children that would be born to a woman during her lifetime in each country.
When we look at this relationship in a scatterplot, we see that the points fall in a distinctive upward sloping pattern that is generally straight. We can also calculate three different numerical summaries to characterize the relationship. Each of the statistical measures compares the pattern of points to a perfectly straight line. The first summary is the equation of a straight line that follows the general trend of the points. (See Chapter 15 for a full explanation.) The second summary is a measure of the extent to which variation in X is associated with variation in Y, and the third summary measures the strength of the linear association between X and Y.
7. In the scatterplot of Fertil versus BirthRate, click the red triangle next to Bivariate Fit and select Fit Line.
8. Then click the red triangle again and select Histogram Borders.
Figure 4.13: Scatterplot with Histogram Borders and Line of Best Fit
Now your results look like Figure 4.13. The consequence of these two customizations is that along the outside borders of the bivariate scatterplot, we see the univariate distributions of each of our two columns. Additionally, we see a red fitted line that approximates the upward pattern of the points.
Below the graph, we find the equation of that line:
Fertil = 0.0845109 + 0.1282129*BirthRate
The slope of this line describes how these two variables co-vary. If we imagine two groups of countries whose birth rates differ by one birth per 1,000 people, the group with the higher birth rate would average 0.128 more births per woman.
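As a quick check on this interpretation, the following Python sketch evaluates the fitted equation at two birth rates that differ by one. The coefficients are copied from the equation above; the function name `predicted_fertil` and the birth rates of 20 and 21 are illustrative assumptions, not values from the data table.

```python
# Coefficients copied from the fitted line reported in the text.
INTERCEPT = 0.0845109
SLOPE = 0.1282129

def predicted_fertil(birth_rate):
    """Fitted fertility rate (children per woman) for a birth rate per 1,000 people."""
    return INTERCEPT + SLOPE * birth_rate

# Two hypothetical birth rates that differ by one birth per 1,000 people.
low, high = 20.0, 21.0
difference = predicted_fertil(high) - predicted_fertil(low)

print(round(predicted_fertil(low), 3))   # fitted value at the lower birth rate
print(round(predicted_fertil(high), 3))  # fitted value at the higher birth rate
print(round(difference, 3))              # equals the slope, whatever pair is chosen
```

Because the line is straight, the difference is the same for any two birth rates one unit apart, which is why the slope alone summarizes how the two variables co-vary.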
9. Below the linear fit equation, you will find the Summary of Fit table (not shown here). Locate the first statistic called Rsquare.
R square (r²) is a goodness-of-fit measure; for now, think of it as the proportion of variation in Y that is associated with X. If fertility were perfectly and completely determined as a linear function of birth rate, then R square would equal 1.00. In this instance, R square equals 0.958; approximately 96% of the cross-country variability in fertility rates is accounted for by differences in birth rates.
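To make the idea of "proportion of variation" concrete, here is a minimal Python sketch that fits a least-squares line and computes R square as one minus the ratio of residual variation to total variation in Y. The five data points and the helper names `fit_line` and `r_square` are hypothetical, chosen only for illustration; they will not reproduce the 0.958 reported by JMP.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_square(xs, ys):
    """Proportion of the variation in ys accounted for by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total

# Hypothetical birth rates (per 1,000 people) and fertility rates (children per woman).
birth_rates = [10, 15, 20, 30, 40]
fertility = [1.4, 1.9, 2.7, 4.0, 5.1]
print(r_square(birth_rates, fertility))

# Points lying exactly on a straight line give an R square of 1 (up to rounding).
print(r_square(birth_rates, [0.1 + 0.13 * x for x in birth_rates]))
```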
A third commonly used summary for bivariate continuous data is called correlation, which measures the strength of the linear association between the two variables. The coefficient of correlation, symbolized by the letter r, is the square root of r² (if the slope of the best-fit line is negative, then r is negative). As such, r always lies within the interval [–1, +1]. Values near the ends of the interval indicate strong correlations, and values near zero indicate weak correlations.
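The sign convention can be illustrated with a short Python sketch that computes the correlation coefficient directly from its usual definition. The data values and the function name `pearson_r` are hypothetical and serve only to show that reversing the direction of the relationship flips the sign of r while leaving its square unchanged.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with an upward-sloping pattern.
xs = [10, 15, 20, 30, 40]
ys_up = [1.4, 1.9, 2.7, 4.0, 5.1]
# The same values negated, so the pattern slopes downward.
ys_down = [-y for y in ys_up]

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

print(r_up, r_down)            # same magnitude, opposite signs
print(r_up ** 2, r_down ** 2)  # squaring recovers the same R square in both cases
```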
10. Select Analyze ► Multivariate