Industrial Data Analytics for Diagnosis and Prognosis. Yong Chen
The obtained scatter plot is shown in Figure 2.4. It can be seen from the scatter plot that a general trend exists in the relationship between the highway MPG and the horsepower, where a car with higher horsepower is more likely to have a lower highway MPG.
Figure 2.4 Scatter plot of highway MPG versus horsepower.
Relationship Between A Numerical Variable and A Categorical Variable – Side-by-Side Box Plot
Side-by-side box plots can be used to show how the distribution of a numerical variable changes over different values of a categorical variable. The idea is to use a box plot to represent the distribution of the numerical variable at each value of the categorical variable. In Figure 2.5, we draw two side-by-side box plots for the auto_spec data set using the following R code:
Figure 2.5 Side-by-side box plots.
oldpar <- par(mfrow = c(1, 2))
boxplot(auto.spec.df$compression.ratio ~ auto.spec.df$fuel.type,
        xlab = "Fuel Type", ylab = "Compression Ratio")
boxplot(auto.spec.df$highway.mpg ~ auto.spec.df$body.style,
        las = 2, xlab = "", ylab = "Highway MPG")
mtext("Body Style", side = 3, line = 1)
par(oldpar)
The left panel of Figure 2.5 shows how the numerical variable compression.ratio is related to the two values (diesel and gas) of fuel.type. It is clear from the side-by-side box plot that a car with diesel fuel has a much higher compression ratio than a car with gas fuel. This also explains the separate cluster of outliers in the histogram and box plot of compression.ratio that is observed in Figure 2.3. The right panel of Figure 2.5 shows how highway.mpg is related to the five values of body.style. It can be seen that a hatchback car is more likely to have higher highway MPG while a convertible tends to have lower highway MPG.
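The numbers behind a side-by-side box plot can also be inspected directly. The following is a minimal sketch that assumes R's built-in mtcars data as a stand-in, since the auto_spec data is not reproduced here; the variable names cyl.factor, group.stats, and group.medians are illustrative only.

```r
# Stand-in data: R's built-in mtcars, NOT the book's auto_spec data.
# mpg is numerical; cyl (number of cylinders) is treated as categorical.
cyl.factor <- factor(mtcars$cyl)

# Compute the box plot statistics per group without drawing the plot.
bp <- boxplot(mtcars$mpg ~ cyl.factor, plot = FALSE)
group.stats <- bp$stats
colnames(group.stats) <- bp$names
rownames(group.stats) <- c("lower whisker", "lower hinge", "median",
                           "upper hinge", "upper whisker")
print(group.stats)

# Group medians computed directly, for comparison with the "median" row.
group.medians <- tapply(mtcars$mpg, cyl.factor, median)
print(group.medians)
```

Each column of group.stats corresponds to one box, so comparing columns gives the same reading as comparing boxes in the plot.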
Relationship Between Two Categorical Variables – Mosaic Plot
We can use a mosaic plot to see how values of two categorical variables are related to each other. Figure 2.6 shows a mosaic plot for fuel.type and aspiration of the auto_spec data set, which is drawn by the following R code:
Figure 2.6 Mosaic plot for fuel type and aspiration.
mosaicplot(fuel.type ~ aspiration, data = auto.spec.df,
xlab = "Fuel Type", ylab = "Aspiration",
color = c("green", "blue"),
main = "Mosaic Plot")
In a mosaic plot, the height of a bar represents the percentage for each value of the variable in the vertical axis given a fixed value of the variable in the horizontal axis. For example, in Figure 2.6 the height of the bar corresponding to turbo aspiration is much higher when the fuel type is diesel than when it is gas, which means a higher percentage of diesel cars use turbo aspiration, while a lower percentage of gasoline cars use turbo aspiration. The width of a bar in a mosaic plot corresponds to the frequency, or the number of observations, for each value of the variable in the horizontal axis. For example, from Figure 2.6, the bars for gas fuel type are much wider than those for diesel fuel type, indicating that a much larger number of cars are gasoline cars in the data set.
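The heights and widths described above correspond to conditional and marginal proportions of a contingency table. The sketch below illustrates this with a small made-up data frame (toy.df is hypothetical and its counts are not taken from the auto_spec data).

```r
# Hypothetical toy data, made up for illustration (NOT the auto_spec data).
toy.df <- data.frame(
  fuel.type  = c(rep("diesel", 4), rep("gas", 16)),
  aspiration = c("turbo", "turbo", "turbo", "std",
                 rep("turbo", 4), rep("std", 12))
)

# Contingency table: rows are fuel types, columns are aspiration values.
counts <- table(toy.df$fuel.type, toy.df$aspiration)

# Bar widths: share of observations for each fuel type (horizontal axis).
widths <- prop.table(margin.table(counts, 1))

# Bar heights: proportion of each aspiration value within a fuel type.
heights <- prop.table(counts, margin = 1)

print(widths)
print(heights)
```

In this toy example, the gas bars would be four times as wide as the diesel bars, and the turbo bar would be taller for diesel than for gas.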
2.1.3 Plots for More than Two Variables
It is very difficult to plot more than two variables in a two-dimensional plot. This section introduces commonly used plots that show some aspects of how multiple variables are related to each other. In Chapter 4, we will study another technique called principal component analysis, which can also serve as a useful tool to visualize high-dimensional data in a low-dimensional space.
Color Coded Scatter Plot
We have seen that a scatter plot can effectively show the relationship between two numerical variables. By adding color coding to the points on a scatter plot of two numerical variables, we are able to study their relationship with a third variable. Typically, the third variable is a categorical variable, with each category represented by a different color. The color coded scatter plot is very useful in visualizing how some numerical variables can be used to predict a categorical variable. For the auto_spec data, we can use a color coded scatter plot to show how fuel.type is related to two of the numerical variables, horsepower and peak.rpm. The color coded scatter plot is shown in Figure 2.7, which is created by the following R code:
Figure 2.7 Scatter plot color coded by fuel type.
# xpd = TRUE allows the legend to be drawn outside the plot region
oldpar <- par(xpd = TRUE)
plot(auto.spec.df$peak.rpm ~ auto.spec.df$horsepower,
     xlab = "Horsepower", ylab = "Peak RPM",
     col = ifelse(auto.spec.df$fuel.type == "gas", "black", "gray"))
legend("topleft", inset = c(0, -0.2),
       legend = c("gas", "diesel"),
       col = c("black", "gray"), pch = 1, cex = 0.8)
par(oldpar)
Although there is no clear relationship between the peak RPM and horsepower of a car from the scatter plot in Figure 2.7, it is obvious from the color coded plot that diesel cars tend to have low peak RPM and low horsepower.
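When the categorical variable has more than two values, one possible approach is to index a vector of colors by the integer codes of a factor. The sketch below assumes R's built-in iris data as a stand-in for auto_spec; the names species, palette.cols, and point.cols and the choice of colors are illustrative.

```r
# Stand-in data: R's built-in iris data set, NOT the book's auto_spec data.
species <- factor(iris$Species)

# One color per category; this particular palette is an arbitrary choice.
palette.cols <- c("black", "gray50", "gray80")

# Map each observation to the color of its category.
point.cols <- palette.cols[as.integer(species)]

plot(iris$Petal.Width ~ iris$Petal.Length,
     xlab = "Petal Length", ylab = "Petal Width",
     col = point.cols, pch = 1)
legend("topleft", legend = levels(species),
       col = palette.cols, pch = 1, cex = 0.8)
```

Because the legend labels come from levels(species) and the colors from the same palette.cols vector, the legend stays consistent with the plotted points.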
Scatter Plot Matrix and Heatmap
The pairwise relationships among multiple numerical variables can be visualized simultaneously by using a matrix of scatter plots. The following R code plots the scatter plot matrix for five of the numerical variables in the auto_spec data set: wheel.base, height, curb.weight, city.mpg, and highway.mpg. The column indices of the five variables are 8, 11, 12, 22, and 23, respectively.
var.idx <- c(8, 11, 12, 22, 23)
plot(auto.spec.df[, var.idx])
The scatter plot matrix shown in Figure 2.8 reveals different types of relationships among the variables. For example, there is a strong linear relationship between city.mpg and highway.mpg. Besides these two variables, wheel.base, height, and curb.weight are positively related to each other. In addition, curb.weight is negatively related to both city.mpg and highway.mpg.
Figure 2.8 Scatter plot matrix for five numerical variables.
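The visual impressions from a scatter plot matrix can be complemented by a matrix of pairwise correlation coefficients. The following is a minimal sketch that assumes R's built-in mtcars data as a stand-in; the column names are those of mtcars, not of auto_spec, and the columns are selected by name rather than by index.

```r
# Stand-in data: R's built-in mtcars, NOT the book's auto_spec data.
var.names <- c("mpg", "disp", "hp", "wt")
num.df <- mtcars[, var.names]

# Scatter plot matrix, selecting columns by name rather than by position.
pairs(num.df)

# Pairwise Pearson correlation coefficients, rounded for readability.
cor.mat <- round(cor(num.df), 2)
print(cor.mat)
```

Entries close to 1 or -1 indicate a strong positive or negative linear relationship between the corresponding pair of variables, while a nonlinear relationship may still need to be checked in the plots.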