Industrial Data Analytics for Diagnosis and Prognosis. Yong Chen

       xlab = "Horsepower", ylab = "Highway MPG")

      Figure 2.4 Scatter plot of highway MPG versus horsepower.

      Relationship Between A Numerical Variable and A Categorical Variable – Side-by-Side Box Plot

      Relationship Between Two Categorical Variables – Mosaic Plot

      Figure 2.6 Mosaic plot for fuel type and aspiration.

      mosaicplot(fuel.type ~ aspiration, data = auto.spec.df,
                 xlab = "Fuel Type", ylab = "Aspiration",
                 color = c("green", "blue"),
                 main = "Mosaic Plot")

      In a mosaic plot, the height of a bar represents the percentage of each value of the variable on the vertical axis, given a fixed value of the variable on the horizontal axis. For example, in Figure 2.6 the bar corresponding to turbo aspiration is much taller when the fuel type is diesel than when it is gas, which means that a higher percentage of diesel cars than of gasoline cars use turbo aspiration. The width of a bar corresponds to the frequency, or number of observations, of each value of the variable on the horizontal axis. For example, in Figure 2.6 the bars for the gas fuel type are much wider than those for the diesel fuel type, indicating that the data set contains many more gasoline cars than diesel cars.
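
      The two quantities that a mosaic plot encodes can also be computed directly from a contingency table. The following is a minimal sketch under stated assumptions: toy.df is a hypothetical data frame with made-up counts, not the auto.spec.df data used in the text, and the names bar.widths and bar.heights are introduced here only for illustration.

```r
# Toy data with made-up counts (NOT the auto.spec.df data from the text).
toy.df <- data.frame(
  fuel.type  = rep(c("gas", "diesel"), times = c(16, 4)),
  aspiration = c(rep(c("std", "turbo"), times = c(14, 2)),  # gas cars
                 rep(c("std", "turbo"), times = c(1, 3)))   # diesel cars
)

# Two-way contingency table: rows hold the horizontal-axis variable,
# columns hold the vertical-axis variable.
tab <- table(toy.df$fuel.type, toy.df$aspiration,
             dnn = c("fuel.type", "aspiration"))

# Bar widths: marginal counts of the horizontal-axis variable.
bar.widths <- margin.table(tab, margin = 1)

# Bar heights: proportions of the vertical-axis variable within each
# value of the horizontal-axis variable (each row sums to 1).
bar.heights <- prop.table(tab, margin = 1)

print(bar.widths)
print(bar.heights)

# The same table can be passed directly to mosaicplot().
mosaicplot(tab, xlab = "Fuel Type", ylab = "Aspiration",
           color = c("green", "blue"), main = "Mosaic Plot (toy data)")
```

      Comparing the rows of bar.heights gives the same information as comparing bar heights across fuel types in the plot, while bar.widths shows how many observations each fuel type contributes.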

      2.1.3 Plots for More than Two Variables

      It is very difficult to plot more than two variables in a two-dimensional plot. This section introduces commonly used plots that show some aspects of how multiple variables are related to each other. In Chapter 4, we will study another technique called principal component analysis, which can also serve as a useful tool to visualize high-dimensional data in a low-dimensional space.

      Color Coded Scatter Plot

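      # Allow the legend to be drawn outside the plot region
      oldpar <- par(xpd = TRUE)
      # Color each point by fuel type: black for gas, gray for diesel
      plot(auto.spec.df$peak.rpm ~ auto.spec.df$horsepower,
           xlab = "Horsepower", ylab = "Peak RPM",
           col = ifelse(auto.spec.df$fuel.type == "gas",
                        "black", "gray"))
      legend("topleft", inset = c(0, -0.2),
             legend = c("gas", "diesel"),
             col = c("black", "gray"), pch = 1, cex = 0.8)
      # Restore the previous graphical settings
      par(oldpar)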

      Although the scatter plot in Figure 2.7 shows no clear overall relationship between the peak RPM and horsepower of a car, the color coding makes it clear that diesel cars tend to have low peak RPM and low horsepower.
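
      When the categorical variable has more than two values, nested ifelse() calls become hard to read. One possible alternative is to look colors up in a named vector. The sketch below assumes hypothetical toy values made up for illustration; the names fuel.cols and point.cols are not from the text.

```r
# Hypothetical toy values, made up for illustration only.
fuel.type  <- c("gas", "diesel", "gas", "gas", "diesel")
horsepower <- c(110, 60, 150, 95, 70)
peak.rpm   <- c(5500, 4300, 6000, 5200, 4500)

# Named lookup vector: one color per category.
fuel.cols <- c(gas = "black", diesel = "gray")

# Indexing by name yields one color per observation. as.character()
# matters if fuel.type is a factor, because a factor would otherwise
# index by its integer codes rather than by its labels.
point.cols <- unname(fuel.cols[as.character(fuel.type)])

plot(peak.rpm ~ horsepower,
     xlab = "Horsepower", ylab = "Peak RPM", col = point.cols)

# Building the legend from the same vector keeps it in sync with the plot.
legend("topleft", legend = names(fuel.cols),
       col = fuel.cols, pch = 1, cex = 0.8)
```

      Adding a further category then only requires one more entry in fuel.cols.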

      Scatter Plot Matrix and Heatmap

      The pairwise relationships among multiple numerical variables can be visualized simultaneously by using a matrix of scatter plots. The following R code plots the scatter plot matrix for five of the numerical variables in the auto_spec data set: wheel.base, height, curb.weight, city.mpg, and highway.mpg. The column indices of the five variables are 8, 11, 12, 22, and 23, respectively.

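      # Column positions of the five selected numerical variables
      var.idx <- c(8, 11, 12, 22, 23)
      # Plotting a data frame of several columns draws a scatter plot matrix
      plot(auto.spec.df[, var.idx])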
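
      Hard-coded column positions break silently if the column order changes, so selecting the variables by name can be more robust. The sketch below illustrates the idea on mtcars, a data set that ships with R and is used here only as a stand-in for the auto_spec data; the choice of the four variables is an assumption made for illustration.

```r
# mtcars ships with base R and is used here only as a stand-in;
# it is NOT the auto_spec data from the text.
var.names <- c("mpg", "disp", "hp", "wt")

# Selecting by name avoids relying on hard-coded column positions;
# match() recovers the positions if they are needed.
var.idx <- match(var.names, names(mtcars))
print(var.idx)

# pairs() draws the scatter plot matrix; plot() on a data frame with
# several numeric columns produces the same kind of display.
pairs(mtcars[, var.names])

# Pairwise correlations give a numerical companion to the plot.
print(round(cor(mtcars[, var.names]), 2))
```

      For the data in the text, the same pattern would use the five variable names listed above in place of var.names.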
