a large number of numerical variables, it is difficult to visualize all pairwise scatter plots as in the scatter plot matrix. In this case, we can use a heatmap of the pairwise correlations of the variables to quickly show the strength of the relationships. The heatmap uses different shades of color to represent the values of the correlations so that spots or regions of strong positive or negative relationship can be detected quickly. A detailed discussion of correlation is provided in Section 2.2. We draw the heatmap of correlations for all numerical variables in the auto_spec data set using the following R code.

      library(gplots)

      # column indices of the numerical variables in auto.spec.df
      var.idx <- c(8:12, 15, 17:23)

      # drop observations with missing values before computing correlations
      data.nomiss <- na.omit(auto.spec.df[, var.idx])

      # heatmap of pairwise correlations, with the rounded values printed in each cell
      heatmap.2(cor(data.nomiss), Rowv = FALSE, Colv = FALSE,
                dendrogram = "none", cellnote = round(cor(data.nomiss), 2),
                notecol = "black", key = FALSE, trace = "none",
                margins = c(10, 10))

      Figure 2.9 Heatmap of correlation for all numerical variables.

      2.2 Summary Statistics

      Data visualization provides an effective and intuitive representation of the qualitative features of the data. Key characteristics of data can also be summarized quantitatively by numerical statistics. This section introduces common summary statistics for univariate and multivariate data.

      2.2.1 Sample Mean, Variance, and Covariance

      Sample Mean – Measure of Location

      A sample mean or sample average provides a measure of location, or central tendency, of a variable in a data set. Consider a univariate data set, that is, a data set with a single variable, consisting of a random sample of n observations $x_1, x_2, \ldots, x_n$. The sample mean is simply the ordinary arithmetic average

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

      For a data set $y_i$, $i = 1, 2, \ldots, n$, obtained by multiplying each $x_i$ by a constant $a$, i.e., $y_i = a x_i$, $i = 1, 2, \ldots, n$, it is easy to see that

$$\bar{y} = a\bar{x}.$$
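      As a quick numerical illustration (a minimal sketch in base R with a made-up vector of observations, not taken from the auto_spec data), the arithmetic-average formula and the scaling property above can be checked directly:

      # a small made-up sample, for illustration only
      x <- c(2.1, 3.5, 4.0, 5.2, 6.3)
      n <- length(x)

      # sample mean from the definition, and via the built-in mean()
      sum(x) / n      # 4.22
      mean(x)         # same value

      # multiplying every observation by a constant a scales the sample mean by a
      a <- 10
      all.equal(mean(a * x), a * mean(x))   # TRUE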

      Sample Variance – Measure of Spread

      The sample variance measures the spread of the data and is defined as

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}. \qquad (2.1)$$

      The square root of the sample variance, $s = \sqrt{s^2}$, is called the sample standard deviation. The sample standard deviation has the same measurement unit as the observations. For $y_i = a x_i$, $i = 1, 2, \ldots, n$, the sample variance of the $y_i$ is

$$s_y^2 = a^2 s^2.$$
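      Continuing with the same illustrative vector (again only a sketch in base R), both forms of (2.1), the standard deviation, and the scaling property can be verified:

      x <- c(2.1, 3.5, 4.0, 5.2, 6.3)
      n <- length(x)

      # sample variance: definitional form and the computational form in (2.1)
      sum((x - mean(x))^2) / (n - 1)
      (sum(x^2) - n * mean(x)^2) / (n - 1)
      var(x)        # built-in, same value

      # sample standard deviation, in the same units as the observations
      sd(x)         # equals sqrt(var(x))

      # scaling the observations by a multiplies the sample variance by a^2
      a <- 10
      all.equal(var(a * x), a^2 * var(x))   # TRUE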

      If each of the n observations of a data set is measured on two variables $x_1$ and $x_2$, let $(x_{11}, x_{21}, \ldots, x_{n1})$ and $(x_{12}, x_{22}, \ldots, x_{n2})$ denote the n observations on $x_1$ and $x_2$, respectively. The sample covariance of $x_1$ and $x_2$ is defined as

$$s_{12} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2), \qquad (2.2)$$

      where $\bar{x}_1$ and $\bar{x}_2$ are the sample means of $x_1$ and $x_2$, respectively. The value of the sample covariance of two variables is affected by the linear association between them. From (2.2), if $x_1$ and $x_2$ have a strong positive linear association, they are usually both above their means or both below their means. Consequently, the product $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$ will typically be positive and their sample covariance will have a large positive value. On the other hand, if $x_1$ and $x_2$ have a strong negative linear association, the product $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$ will typically be negative and their sample covariance will have a negative value. If $y_1$ and $y_2$ are obtained by multiplying each measurement of $x_1$ and $x_2$ by $a_1$ and $a_2$, respectively, it is easy to see from (2.2) that the sample covariance of $y_1$ and $y_2$ is

$$s_{y_1 y_2} = a_1 a_2 s_{12}. \qquad (2.3)$$
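      As a small numerical check (a sketch with made-up paired data and base R's cov(); the vectors x1 and x2 below are purely illustrative), the definition in (2.2) and the scaling relation in (2.3) can be verified directly:

      # made-up paired observations on two variables
      x1 <- c(1.2, 2.5, 3.1, 4.8, 5.0)
      x2 <- c(2.0, 2.9, 3.5, 5.6, 6.1)
      n  <- length(x1)

      # sample covariance from the definition in (2.2), and via the built-in cov()
      sum((x1 - mean(x1)) * (x2 - mean(x2))) / (n - 1)
      cov(x1, x2)     # same value

      # rescaling the variables scales the covariance by a1 * a2, as in (2.3)
      a1 <- 10; a2 <- 0.5
      all.equal(cov(a1 * x1, a2 * x2), a1 * a2 * cov(x1, x2))   # TRUE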

      Equation (2.3) says that if the measurements are scaled, for example by changing measurement units, the sample covariance is scaled correspondingly. The sample covariance's dependence on the measurement units makes it difficult to judge how large a sample covariance must be to indicate a strong (linear) association between two variables. The sample correlation, defined as follows, is a measure of linear association that does not depend on the measurement units, or scaling, of the variables
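      As a preview of this scale invariance (a sketch with the same made-up vectors as above and base R's cor()), rescaling the variables changes the covariance but leaves the correlation unchanged:

      x1 <- c(1.2, 2.5, 3.1, 4.8, 5.0)
      x2 <- c(2.0, 2.9, 3.5, 5.6, 6.1)
      a1 <- 10; a2 <- 0.5

      # the covariance changes with the measurement units ...
      cov(x1, x2)
      cov(a1 * x1, a2 * x2)

      # ... but the sample correlation is unchanged by rescaling with positive constants
      cor(x1, x2)
      all.equal(cor(a1 * x1, a2 * x2), cor(x1, x2))   # TRUE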