Practical Data Analysis with JMP, Third Edition. Robert Carver
Statisticians generally distinguish among four types of data:
Categorical Types | Quantitative Types
Nominal           | Interval
Ordinal           | Ratio
One reason that it is important to understand the differences among data types is that we analyze them in different ways. In JMP, we differentiate between nominal, ordinal, and continuous data. Nominal and ordinal variables are categorical, distinguishing one observation from another in some qualitative, non-measurable way. Interval and ratio data are both numeric. Interval variables are artificially constructed, like a temperature scale or stock index, with arbitrarily chosen zero points. Most measurement data are considered ratio data because ratios of values are meaningful. For example, a film that lasts 120 minutes is twice as long as one lasting 60 minutes. In contrast, 120 degrees Celsius is not twice as hot as 60 degrees Celsius.
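The Celsius example can be checked with a little arithmetic. The sketch below is plain Python rather than anything in JMP, and the function name is illustrative only. It converts both temperatures to kelvins, a scale whose zero point is not arbitrary, and compares the resulting ratio with the naive one.

```python
# Ratio comparisons are meaningful only when zero means "none of the quantity".

def celsius_to_kelvin(deg_c):
    """Convert Celsius (arbitrary zero) to kelvins (absolute zero)."""
    return deg_c + 273.15

# Film running times are ratio data: the ratio is directly interpretable.
film_ratio = 120 / 60

# Celsius is interval data: the raw ratio of the two readings is misleading.
naive_temp_ratio = 120 / 60
kelvin_ratio = celsius_to_kelvin(120) / celsius_to_kelvin(60)

print(f"Film length ratio:   {film_ratio:.2f}")
print(f"Naive Celsius ratio: {naive_temp_ratio:.2f}")
print(f"Ratio in kelvins:    {kelvin_ratio:.2f}")
```

On the absolute scale, the hotter reading is only about 18% higher than the cooler one, not double it.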
Distribution of a Categorical Variable
In its reporting, the World Bank identifies each country of the world with a continental region. There are seven regions, each with a different number of countries. The variable Region is nominal—it literally names a country’s general location on earth. Let’s get familiar with the different regions and see how many countries are in each. In other words, let’s look at the distribution of Region.
1. Select Analyze ► Distribution. In the Distribution dialog box (Figure 3.1), select the variable region as the Y, Columns variable. Click OK.
Figure 3.1: Distribution Dialog Box
Anytime you want to assign a column to a role in a JMP dialog box, you have three options: you can highlight the column name in the Select Columns list and click the corresponding role button, you can double-click the column name, or you can click-drag the column name into the role box.
The result appears in Figure 3.2. JMP constructs a simple bar chart listing the seven continental regions and showing a rectangular bar corresponding to the number of times the name of the region occurs in the data table. Though we cannot immediately tell from the graph alone exactly how many countries are in each, North America clearly has the fewest countries and Europe and Central Asia has the most.
Figure 3.2: Distribution of Region
Below the graph is a frequency distribution (titled Frequencies), which provides a more specific summary. Here we find the name of each region, and the number of times each regional name appears in our table. For example, “East Asia & Pacific” occurs 432 times. As a proportion of the whole table, 16.7% of the rows (Prob. = 0.16744) represent countries in that region.
At this point, you might wisely pause and say, “Wait a second. Can there possibly be 432 countries in the East Asia and the Pacific region?” And you would be right. Remember that we have stacked data, with 12 rows representing 12 years of data devoted to each country. Therefore, there are 432/12 = 36 countries in the region.
Even though JMP handles the heavy computational or graphical tasks, always think about the data and its context and ask yourself if the results make sense to you.
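The sanity check above comes down to two small calculations. The sketch below is plain Python, not a JMP feature, and uses only figures quoted in this chapter: 432 rows for East Asia & Pacific, 12 rows per country, and the 215 rows that the Data Filter reports for a single year. It assumes every country has a row for every year, so the whole table holds 215 × 12 rows. The variable names are illustrative.

```python
rows_in_region = 432     # frequency reported for "East Asia & Pacific"
years_per_country = 12   # stacked data: one row per country per year
rows_per_year = 215      # rows selected when a single year is filtered

# Assumption: every country appears once in each of the 12 years.
countries_in_region = rows_in_region // years_per_country
total_rows = rows_per_year * years_per_country
share_of_rows = rows_in_region / total_rows

print(countries_in_region)      # countries in the region
print(total_rows)               # rows in the full stacked table
print(round(share_of_rows, 5))  # proportion shown in the Prob. column
```

Under that assumption, the computed share matches the proportion shown in the Frequencies table, which is a useful cross-check that the rows-per-country reasoning is sound.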
Using the Data Filter to Temporarily Narrow the Focus
Because we know each country appears repeatedly in this data table, let’s choose just one year’s data to obtain a clearer picture of regional variation. We can specify rows to display in a graph by using the Data Filter. This is a tool that allows us to select rows that satisfy specific conditions such as only displaying data rows from the year 2010.
This chapter illustrates the use of the Data Filter to temporarily select rows in a data table for all active analyses. This is known as the global Data Filter. Alternatively, when you click the red triangles in most analysis reports, you will find a Script option with a local Data Filter that applies only to the current report. The local Data Filter is illustrated in later chapters, but curious readers should explore it at any time.
1. To see the effects of the Data Filter, we will instruct JMP to automatically update the graph and recalculate the frequencies. Click the red triangle next to Distributions and choose Redo ► Automatic Recalc.
2. Select Rows ► Data Filter. In the list of Columns, select year and click the Add button.
3. The dialog box takes on a new appearance (Figure 3.3). It now displays a list of years contained in the table. Near the top of the dialog box, check Show and Include so that only the rows that we select will appear in all graphs and be included in any computations. Other rows will be hidden and excluded.
Figure 3.3: Choosing 2010 in the Data Filter
4. Scroll down the list of Year levels and highlight 2015. As noted in the dialog box, this selects 215 rows and temporarily suppresses the others.
5. Minimize the Data Filter. If you look in the data table of Life Expectancy, you will see that most rows now have two icons, marking them as hidden and excluded.
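Conceptually, filtering with Show and Include checked keeps only the rows that match a condition and recomputes the report from those rows alone. The Python sketch below mimics that idea on a tiny made-up stacked table. The rows and the `frequencies` helper are hypothetical; they are neither the Life Expectancy data nor any JMP API.

```python
from collections import Counter

# Hypothetical stacked table: one row per country per year.
rows = [
    {"country": "A", "region": "North America", "year": 2010},
    {"country": "A", "region": "North America", "year": 2015},
    {"country": "B", "region": "East Asia & Pacific", "year": 2010},
    {"country": "B", "region": "East Asia & Pacific", "year": 2015},
    {"country": "C", "region": "East Asia & Pacific", "year": 2010},
    {"country": "C", "region": "East Asia & Pacific", "year": 2015},
]

def frequencies(table):
    """Return {region: (count, proportion)} for the given rows."""
    counts = Counter(row["region"] for row in table)
    total = sum(counts.values())
    return {region: (n, n / total) for region, n in counts.items()}

# Frequencies over every row, then over the rows for a single year.
all_years = frequencies(rows)
one_year = frequencies([row for row in rows if row["year"] == 2015])

print(all_years)
print(one_year)
```

With one year selected, each count equals the number of countries in the region. The proportions do not change in this toy table because every country appears in every year.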
Using Graph Builder to Explore Categorical Data Visually
In Chapter 1, we met the Graph Builder, and we will use it throughout this book. It is most useful when working with multiple variables, but even with a single nominal variable, it provides a quick way to generate multiple views of the same data. Because interactivity is such an important feature of the tool, this section of the chapter provides few step-by-step directions. You should interact with the tool and think about the extent to which different graphing formats and options communicate the information content of the variable called region.
1. Select Graph ► Graph Builder. The region column identifies groups of countries. Drag it to the X drop zone.
Within Graph Builder, you can freely reposition a column from one drop zone to another. Hover the cursor over the column name until the cursor changes to the hand shape, then drag the column to the new drop zone.
With Region on the X axis, you will see seven clumps of black points above the seven region names. This is not very informative.
At the top of Graph Builder is a selector bar of icons (see Figure 3.4) representing different graph types. The graphing options available depend on the type of data we have placed on the graph. Hence, some icons are dimmed, but with Region on the X axis, we can opt for any of the highlighted options.
Figure 3.4: Graphing Options for a Nominal Column