Practical Data Analysis with JMP, Third Edition. Robert Carver
Simple random sampling requires that we have a sampling frame, or a list of all members of a population. The sampling frame could be a list of students in a university, firms in an industry, or members of an organization. To illustrate, we will start with a list of the countries in the world and see one way to select an SRS. For the sake of this example, suppose we want to draw a simple random sample of 20 countries for in-depth research.
There are several ways to select and isolate a simple random sample drawn from a JMP data table. In this illustration, we will first randomly select 20 rows, and then proceed to move them into a new data table. This is not the most efficient method, but it emphasizes the idea of random selection, and it introduces two useful commands.
In this chapter, we will work with several data tables. As such, this is an opportunity to introduce JMP Projects. A project keeps track of data and report files together.
1. First, we will create a Project to store the work that we are about to do. File ► New ► Project. A blank project window opens. (See Figure 2.1.)
Figure 2.1: A Project Window
2. File ► Open. Select the data table called World Nations. This table lists the countries in the world as of 2017, as identified by the United Nations and the World Bank. Notice that the data table opens within the project window, and the Window List in the upper left contains World Nations.
3. Select Rows ► Row Selection ► Select Randomly…. A small dialog box opens (Figure 2.2) asking either for a sampling rate or a sample size. If you enter a value between 0 and 1, JMP understands it as a rate. A number larger than 1 is interpreted as a sample size, n. Enter 20 into the dialog box and click OK.
Figure 2.2: Specifying a Simple Random Sample Size of 20
JMP randomly selects 20 of the 215 rows in this table. When you look at the Rows panel in the Data Table window, you will see that 20 rows have been selected. As you scroll down the list of countries, you will see that the 20 selected rows are highlighted. If you repeat Step 3, a different list of 20 rows will be highlighted because the selection process is random. If you compare notes with others in your class, you should discover that their samples comprise different countries. With the list of 20 countries now chosen, let’s create a new data table containing just the SRS of 20 countries.
4. Select Tables ► Subset. This versatile dialog box (see Figure 2.3) enables us to build a table using the just-selected rows, or to randomly sample directly. Note that the dialog box opens in a fresh tab within the Project, and as an item in the Window List.
5. As shown in the figure, choose Selected Rows and then change the Output table name to World Nations SRS. Then click OK.
Figure 2.3: Creating a Subset of Selected Rows
In the upper left corner of the data table window, there is a green arrow with the label Source. JMP inserts the JSL script that created the subset. Readers wishing to learn more about writing JMP scripts should right-click the green arrow, choose Edit, and see what a JSL script looks like.
Before moving to the next example, let’s save the Project. Before saving a project, all documents within the project must be saved individually. We just created a new data table; save it now.
6. File ► Save. The default name is World Nations SRS, which is fine. Place it in a folder of your choice.
7. File ► Save Project. Choose a location and name this project Chap_02. Then, click OK. Among your Recent Files list, you will now find Chap_02.jmpprj.
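For readers who want to see the logic of Steps 3 through 5 outside of JMP, the following Python code is a minimal sketch, not JMP's own implementation. The function name select_randomly, the placeholder nation names, and the rounding rule for a sampling rate are all assumptions made for illustration. The text also does not say how JMP treats a value of exactly 1, so the sketch simply treats it as a sample size of one row.

```python
import random


def select_randomly(n_rows, value, rng=None):
    """Return sorted row indices drawn without replacement.

    Hypothetical helper that mimics the rule described for the dialog box:
    a value strictly between 0 and 1 is a sampling rate, and a larger
    value is a sample size n.
    """
    rng = rng or random.Random()
    if value <= 0:
        raise ValueError("value must be positive")
    if value < 1:
        # Assumption: a rate is converted to the nearest whole number of rows.
        n = round(value * n_rows)
    else:
        # Assumption: exactly 1 is read as a sample size of one row.
        n = int(value)
    if n > n_rows:
        raise ValueError("sample size cannot exceed the number of rows")
    # random.sample draws without replacement, so no row is chosen twice.
    return sorted(rng.sample(range(n_rows), n))


# The sampling frame: one entry per row (placeholder names, not real data).
frame = [f"Nation {i + 1}" for i in range(215)]

# Step 3: randomly select 20 of the 215 rows.
srs_rows = select_randomly(len(frame), 20)

# Steps 4 and 5: copy only the selected rows into a new table.
world_nations_srs = [frame[i] for i in srs_rows]

print(len(world_nations_srs), "rows selected")
```

Running the sketch twice should usually produce two different sets of rows, just as repeating the random selection in JMP does. Passing a seeded generator, such as random.Random(1), makes a particular sample reproducible.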
Other Types of Random Sampling
As noted previously, simple random sampling requires that we can identify and access all N elements within a population. Sometimes this is not practical, and there are several alternative strategies available. It is well beyond the scope of this chapter to discuss these strategies at length, but Chapter 8 provides basic coverage of some of these approaches.
Non-Random Sampling
This book is about practical data analysis, and in practice, many data tables contain data that were not generated by anything like a random sampling process. Most data collected within businesses and nonprofit organizations come from the normal operations of the organization rather than from a carefully constructed process of sampling. The data generated by “Internet of Things” devices, for example, are decidedly non-random. We can summarize and describe the data within a non-random sample but should be very cautious about the temptation to generalize from such samples. Whether we are conducting the analysis or reading about it, we always want to ask whether a given sample is likely to misrepresent the population or process from which it came. Voluntary response surveys, for example, are very likely to mislead us if only highly motivated individuals respond. On the other hand, if we watch the variation in stock prices during an uneventful period in the stock markets, we might reasonably expect that the sample could represent the process of stock market transactions.
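A small simulation can make the danger of voluntary response concrete. The Python sketch below is illustrative only: the population of satisfaction scores and the response probabilities are invented assumptions, chosen so that people with strong opinions (especially unhappy ones) are far more likely to respond. It compares the mean from a simple random sample with the mean from the volunteers.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is reproducible

# Invented population: satisfaction scores from 1 (very unhappy) to 5 (very happy).
population = [rng.randint(1, 5) for _ in range(10_000)]

# Assumption: the chance of volunteering a response depends on the score,
# with the most dissatisfied individuals the most motivated to respond.
response_prob = {1: 0.60, 2: 0.10, 3: 0.02, 4: 0.05, 5: 0.20}

# A simple random sample of 500 individuals.
srs = rng.sample(population, 500)

# A voluntary response sample: each person decides whether to respond.
volunteers = [score for score in population if rng.random() < response_prob[score]]

print("Population mean:        ", round(mean(population), 2))
print("SRS mean (n = 500):     ", round(mean(srs), 2))
print("Voluntary response mean:", round(mean(volunteers), 2))
print("Number of volunteers:   ", len(volunteers))
```

Under these assumptions the voluntary response sample is larger than the SRS, yet its mean falls well below the population mean. A larger sample does not cure a biased selection process.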
Big Data
You might have heard or read about “Big Data,” the high-volume raw data generated by numerous electronic technologies like cell phones, supermarket scanners, radio-frequency identification (RFID) chips, or other automated devices. The world-spanning, continuous stream of data carries huge potential for the future of data analytics and presents many ethical, technical, and economic challenges. In general, data generated in this way are not random in the conventional sense and don’t neatly fit into the traditional classifications of an introductory statistics course. Big data can include photographic images, video, or sound recordings that don’t easily occupy columns and rows in a data table. Furthermore, streaming data are neither cross-sectional nor time series in the usual sense.
Cross-Sectional and Time Series Sampling
When the research concerns a population, the sampling approach is often cross-sectional, which is to say the researchers select individuals from the population at one period of time. Again, the individuals can be people, animals, firms, cells, plants, manufactured goods, or anything of interest to the researchers.
When the research concerns a process, the sampling approach is more likely to be time series or longitudinal, whereby a single individual is repeatedly measured at regular time intervals. A great deal of business and economic data is longitudinal. For example, companies and governmental agencies track and report monthly sales, quarterly earnings, or annual employment. The major distinction between time series data and streaming data is whether observations occur according to a pre-determined schedule or whether they are event-driven (for example, when a customer places a cell phone call).
Panel studies combine cross-sectional and time series approaches. In a panel study, researchers repeatedly gather data about the same group of individuals. Some long-term public health studies follow panels of individuals for many years; some marketing researchers use consumer panels to monitor changes in taste and consumer preferences.
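The three designs differ in how many individuals and how many time periods appear in the data. The Python sketch below illustrates that idea under simple assumptions: each observation is recorded as an (individual, period) pair, and the function name classify_design and the firm labels are hypothetical. It is a rough classification aid, not a formal definition.

```python
def classify_design(records):
    """Classify a collection of (individual, period) observations."""
    records = list(records)
    individuals = {ind for ind, _ in records}
    periods = {per for _, per in records}

    # Many individuals observed at one period of time.
    if len(individuals) > 1 and len(periods) == 1:
        return "cross-sectional"

    # One individual measured repeatedly over time.
    if len(individuals) == 1 and len(periods) > 1:
        return "time series"

    # The same group of individuals, each observed in more than one period.
    if len(individuals) > 1 and all(
        len({per for ind, per in records if ind == person}) > 1
        for person in individuals
    ):
        return "panel"

    return "other"


# Hypothetical labels, used only to show the three layouts.
cross_section = [("Firm A", 2020), ("Firm B", 2020), ("Firm C", 2020)]
time_series = [("Firm A", 2018), ("Firm A", 2019), ("Firm A", 2020)]
panel = [("Firm A", 2019), ("Firm A", 2020), ("Firm B", 2019), ("Firm B", 2020)]

for name, data in [("cross_section", cross_section),
                   ("time_series", time_series),
                   ("panel", panel)]:
    print(name, "->", classify_design(data))
```

The sketch checks only the pattern of individuals and periods. It does not check whether the periods are evenly spaced, which the text notes is typical of time series data and is what separates them from event-driven streaming data.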
Study Design: Experimentation, Observation, and Surveying