Practical Data Analysis with JMP, Third Edition. Robert Carver

gives the upper and lower halves; the IQR box gives the middle half. This bracket gives the shortest half.

      A box plot very efficiently conveys information about the center, symmetry, dispersion, and outliers for a single distribution. When we compare box plots across several groups or samples, the results can be quite revealing. In the next chapter, we will look at such box plots and other ways of summarizing two variables at a time.
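      The landmarks that a box plot displays can also be computed by hand, which may help in interpreting the graph. The Python sketch below is only an illustration outside JMP: the sample values are made up (they are not from any of the book's data tables), the outlier fences assume the common 1.5 × IQR rule, and the shortest half is taken to be the narrowest interval holding half of the sorted observations. JMP's own quantile interpolation and shortest-half calculation may differ in detail.

```python
import statistics


def box_plot_landmarks(values):
    """Return quartiles, IQR, 1.5*IQR fences, and flagged outliers."""
    data = sorted(values)
    # "inclusive" interpolates between observed values, treating the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "outliers": outliers,
    }


def shortest_half(values):
    """Return (low, high) of the narrowest window covering half the data."""
    data = sorted(values)
    n = len(data)
    k = (n + 1) // 2  # assumed count of points in "half" the sample
    # Slide a window of k sorted points and keep the one with least width.
    start = min(range(n - k + 1), key=lambda i: data[i + k - 1] - data[i])
    return data[start], data[start + k - 1]


# Hypothetical sample, for illustration only.
sample = [40, 58, 61, 66, 70, 72, 74, 75, 76, 77, 80, 82]
print(box_plot_landmarks(sample))
print(shortest_half(sample))
```

      In this made-up sample the shortest half sits to the right of the median, which is what one would expect when the low values are more spread out than the high ones.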

      Now that you have completed all of the activities in this chapter, use the techniques that you have learned to respond to these questions.

      1. Scenario: We will continue our analysis of the variation in life expectancy at birth in 2015. Reset the Data Filter to show and include 2015.

      a. When we first constructed the LifeExp histogram, we described it as multi-peaked and left-skewed. Use the hand tool to increase and reduce the number of bars. Adjust the number of bars so that there are two prominent peaks. Describe what you did, and where the peaks are located.

      b. Rescale the axes of the same histogram and see if you can emphasize the two peaks even more (in other words, have them separated distinctly). Describe what you did to make these peaks more distinct and noticeable.

      c. Based on what you have seen in these exercises, why is it a good idea to think critically about an analyst’s choice of scale in a reported graph?

      d. Highlight a few of the left-most bars in the histogram for LifeExp and look at the Distribution report for Region. Which continent or continents are home to the countries with the shortest life expectancies in the world? What might account for this?

      2. Scenario: Now let’s look at the distribution of life expectancy 25 years before 2015. Use the Data Filter to choose the observations from 1990.

      a. Use the Distribution platform to summarize Region and LifeExp for this subset. In a few sentences, describe the distribution of LifeExp in 1990.

      b. Compare the five-number summaries for life expectancy in 1990 and in 2015. Comment on what you find.

      c. Compare the standard deviations for life expectancy in 1990 and 2015. Comment on what you find.

      d. You will recall that in 2015, the mean life expectancy was shorter than the median, consistent with the left-skewed shape. How do the mean and median compare in the 1990 data?
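      The comparisons in this scenario rest on three ideas: the five-number summary, the standard deviation, and the relationship between mean and median. The Python sketch below shows one way to line these up for two groups. The two lists are invented values, not the 1990 or 2015 figures from the data table, and the mean-versus-median comparison is only a rough indicator of skewness, not a formal test.

```python
import statistics


def describe(values):
    """Five-number summary plus mean and sample standard deviation."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "five_number": (data[0], q1, median, q3, data[-1]),
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # sample (n - 1) standard deviation
    }


def skew_hint(summary):
    """Rough skew direction from comparing the mean to the median."""
    mean = summary["mean"]
    median = summary["five_number"][2]
    if mean < median:
        return "left-skewed"
    if mean > median:
        return "right-skewed"
    return "roughly symmetric"


# Hypothetical values for two time periods, for illustration only.
earlier = [45, 50, 58, 63, 66, 68, 70, 72, 75]
later = [55, 62, 68, 72, 74, 76, 78, 80, 83]

for label, group in (("earlier", earlier), ("later", later)):
    summary = describe(group)
    print(label, summary["five_number"],
          round(summary["stdev"], 2), skew_hint(summary))
```

      Reading the two printed rows side by side mirrors what the exercise asks for: whether each landmark shifted, whether the spread narrowed or widened, and whether the mean still falls below the median.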

      3. Scenario: The data file called Sleeping Animals contains data about the size, sleep habits, lifespan, and other attributes of different mammalian species.

      a. Construct box plots for Lifespan and TotalSleep. For each plot, explain what the landmarks on each plot tell you about the distribution of each variable. Comment on noteworthy features of the plot.

      b. Which distribution is more symmetric? Explain specifically how the graphs and descriptive statistics helped you come to a conclusion.

      c. According to the data table, “Man” has a maximum life span of 100 years. Approximately what percent of mammals in the data set live less than 100 years?

      d. Sleep hours are divided into “dreaming” and “non-dreaming” sleep. How do the distributions of these types of sleep compare?

      e. Select the species that tend to get the most total sleep. Comment on how those species compare to the other species in terms of their predation, exposure, and overall danger indexes.

      f. Now use the Distribution platform to analyze the body weights of these mammals. What’s different about this distribution in comparison to the other continuous variables that you have analyzed thus far?

      g. Select those mammals that sleep in the most exposed locations. How do their body weights tend to compare to the other mammals? What might explain this comparison?

      4. Scenario: When financial analysts want a benchmark for the performance of individual equities (stocks), they often rely on a “broad market index” such as the S&P 500 in the U.S. There are many such indexes in stock markets around the world. One major index on the Tokyo Stock Exchange is the Nikkei 225, and this set of questions refers to data about the monthly values of the Nikkei 225 from December 31, 2013 through December 31, 2018. In other words, our data table called NIKKEI225 reflects monthly market activity for a five-year period.

      a. The variable called Volume is the total number of shares traded per month (in millions of shares). Describe the distribution of this variable.

      b. The variable called Change% is the monthly change, expressed as a percentage, in the closing value of the index. When Change% is positive, the index increased that month. When the variable is negative, the index decreased that month. Describe the distribution of this variable.

      c. Use the Quantiles to determine approximately how often the Nikkei declines. (Hint: What percentile is 0?)

      d. Use Graph Builder to make a Line Graph (6th icon in the icon bar) that shows adjusted closing prices over time. Then, use the Distribution platform to create a histogram of adjusted closing prices. Each graph summarizes the Adj Close variable, but each graph presents a different view of the data. Comment on the comparison of the two graphs.

      e. Now make a line graph of the monthly percentage changes over time. How would you describe the pattern in this graph?
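      The hint about the percentile of 0 amounts to asking what share of the observations fall below zero. As a hedged illustration, the Python sketch below computes that share directly. The monthly changes are invented numbers, not values from the NIKKEI225 table, and the function name share_below is hypothetical.

```python
def share_below(values, threshold=0.0):
    """Fraction of observations strictly below the threshold."""
    data = list(values)
    if not data:
        raise ValueError("need at least one observation")
    return sum(1 for v in data if v < threshold) / len(data)


# Hypothetical monthly percentage changes, for illustration only.
changes = [2.1, -1.4, 0.8, 3.0, -2.6, 1.2, -0.5, 4.1, 0.3, -3.2, 1.9, 0.6]

declined = share_below(changes)
print(f"Share of months with a decline: {declined:.1%}")
```

      If zero falls at roughly the 33rd percentile, as it does in this made-up sample, then the index declined in about a third of the months. The complement, 1 minus this share, gives the proportion of months at or above zero.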

      5. Scenario: Anyone traveling by air understands that there is always some chance of a flight delay. In the United States, the Department of Transportation monitors the arrival and departure time of every flight. The data table Airline Delays contains a sample of 51,603 flights for four airlines destined for three busy airports.

      b. The variable called DEST is the airport for the flight destination. Describe the distribution of this variable.

      c. The variable called Arr Delay is the actual arrival delay, measured in minutes. A positive value indicates that the flight was late, and a negative value indicates that the flight arrived early. Describe the distribution of this variable.

      d. Notice that the distribution of Arr Delay is skewed. Based on your experience as a traveler, why should we have anticipated that this variable would have a skewed distribution?

      e. Use the Quantiles to determine approximately how often flights in this sample were delayed. (Hint: Approximately what percentile is 0?)

      6. Scenario: For many years, it has been understood that tobacco use leads to health problems related to the heart and lungs. The Tobacco Use data table contains data about the prevalence of tobacco use and of certain diseases around the world.

      a. Use an appropriate technique from this chapter to summarize and describe the variation in tobacco usage (TobaccoUse) around the world.

      b. Use an appropriate technique from this chapter to summarize and describe the variation in cancer mortality (CancerMort) around the world.

      c. Use an appropriate technique from this chapter to summarize and describe the variation in cardiovascular mortality (CVMort) around the world.

      d. You have now examined three distributions. Comment on the similarities and differences in the shapes of these three distributions.

      e. Summarize the distribution of the region variable and comment on what you find.

      f. We have two columns containing the percentage of males and females around the world who use tobacco.
