of the corresponding parameters, so that the intersection of these two lines shows the pair of true values used to simulate the data. In an ideal world, all of the estimate pairs would be very close to the point defined by the true values.

      When X1 and X2 have correlation close to zero, the parameter estimates cluster rather uniformly around the point defined by the true values. However, the impact of high correlation between X1 and X2 is quite dramatic. As this correlation increases, the estimates of β1 and β2 not only become much more variable, but also become more strongly (and negatively) correlated with each other. As the correlation between the two predictors X1 and X2 approaches +1.0 or –1.0, we say that the X’X matrix involved in the MLR solution becomes ill-conditioned (Belsley 1991).

      In fact, when there is perfect correlation between X1 and X2, the MLR coefficient estimates cannot be computed, because the inverse matrix (X’X)⁻¹ does not exist. The situation is similar to trying to build a regression model for MPG with two redundant variables, say, “Weight of Car in Kilograms” and “Weight of Car in Pounds.” Because the two predictors are redundant, there really is only a single predictor, and the MLR algorithm doesn’t know how to apportion its coefficients. There are infinitely many ways to allocate coefficients to the two terms that produce exactly the same model.
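      To make the ill-conditioning concrete, here is a small illustrative sketch in Python (not taken from the book, and using simulated data rather than the script’s): as the correlation between two predictors approaches 1, the condition number of X’X explodes, and with perfectly redundant predictors X’X is singular, so its inverse does not exist.

# Illustrative sketch (not from the book): how correlation between two
# predictors inflates the condition number of X'X, and how perfectly
# redundant predictors make X'X singular.
import numpy as np

rng = np.random.default_rng(123)
n = 100

for rho in [0.0, 0.9, 0.99, 0.999]:
    # Draw X1, X2 with the requested correlation
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    XtX = X.T @ X
    print(f"rho = {rho:6.3f}   condition number of X'X = {np.linalg.cond(XtX):12.1f}")

# Perfectly redundant predictors: weight in kilograms and the same weight in pounds
kg = rng.uniform(800, 2000, size=n)
X_redundant = np.column_stack([kg, kg * 2.20462])
XtX = X_redundant.T @ X_redundant
print("rank of X'X with redundant columns:", np.linalg.matrix_rank(XtX))  # typically 1, not 2
# np.linalg.inv(XtX) is numerically meaningless here, and raises LinAlgError
# when the matrix is exactly singular: the MLR estimates cannot be computed.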

      In cases of multicollinearity, the coefficient estimates are highly variable, as you see in Figure 2.5. This means that estimates have high standard errors, so that confidence intervals for the parameters are wide. Also, hypothesis tests can be ineffective because of the uncertainty inherent in the parameter estimates. Much research has been devoted to detecting multicollinearity and dealing with its consequences. Ridge regression and the lasso method (Hastie et al. 2001) are examples of regularization techniques that can be useful when high multicollinearity is present. (In JMP Pro 11, select Help > Books > Fitting Linear Models and search for “Generalized Regression Models”.)
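      As a rough illustration of how regularization helps (a minimal sketch of ordinary ridge regression in Python, not of JMP’s Generalized Regression platform), the ridge estimate (X’X + λI)⁻¹X’y adds a penalty λ to the diagonal of X’X, which keeps the matrix well conditioned and pulls the highly variable coefficient estimates toward more stable values:

# Illustrative sketch (not from the book): ridge regression's closed form,
# beta_ridge = (X'X + lambda*I)^(-1) X'y, stabilizes estimates when the
# predictors are highly correlated.
import numpy as np

rng = np.random.default_rng(7)
n = 50
rho = 0.99
X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)  # true coefficients are (1, 1)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS estimates:   ", ridge(X, y, 0.0))   # highly variable when rho is near 1
print("Ridge, lambda=5: ", ridge(X, y, 5.0))   # shrunk toward more stable values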

      Whether multicollinearity is of concern depends on your modeling objective. Are you interested in explaining or in predicting? Multicollinearity is more troublesome for explanatory models, where the goal is to determine which predictors have an important effect on the response. This is because the parameter estimates have high variability, which undermines any inference about the predictors. For prediction, the model remains useful, subject to the general caveat that an empirical statistical model is good only for interpolation, not for extrapolation. For example, in the correlated case shown in Figure 2.4, one would not be confident making predictions when X1 = +1 and X2 = –1, because the model is not supported by any data in that region.

      You can close the reports produced by Multicollinearity.jsl at this point.

      3

      Principal Components Analysis: A Brief Visit

       Principal Components Analysis

       Centering and Scaling: An Example

       The Importance of Exploratory Data Analysis in Multivariate Studies

       Dimensionality Reduction via PCA

      Like PLS, principal components analysis (PCA) attempts to use a relatively small number of components to model the information in a set of data that consists of many variables. Its goal is to describe the internal structure of the data by modeling its variance. It differs from PLS in that it does not interpret variables as inputs or outputs, but rather deals only with a single matrix. The single matrix is usually denoted by X. Although the components that are extracted can be used in predictive models, in PCA there is no direct connection to a Y matrix.

      Let’s look very briefly at an example. Open the data table Solubility.jmp by clicking on the correct link in the master journal. This JMP sample data table contains data on 72 chemical compounds that were measured for solubility in six different solvents, and is shown in part in Figure 3.1. The first column gives the name of the compound. The next six columns give the solubility measurements. We would like to develop a better understanding of the essential features of this data set, which consists of a 72 × 6 matrix.

      Figure 3.1: Partial View of Solubility.jmp


      PCA works by extracting linear combinations of the variables. First, it finds the linear combination of the variables that maximizes the variance. This is done subject to a constraint on the size of the coefficient vector; without such a constraint, the variance could be made arbitrarily large and no solution would exist. Subject to this constraint, the first linear combination explains as much of the variability in the data as possible. The observations are then weighted by this linear combination to produce scores. The vector of scores is called the first principal component. The vector of coefficients for the linear combination is sometimes called the first loading vector.

      Next, PCA finds the linear combination which, among all linear combinations whose loading vectors are orthogonal to the first, has the highest variance. (Again, a constraint is placed on the size of the coefficients.) This second loading vector is used to compute scores for the observations, resulting in the second principal component. This second principal component explains as much variance as possible in a direction orthogonal to that of the first loading vector. Subsequent linear combinations are extracted similarly, each explaining the maximum variance in the space orthogonal to the loading vectors that have been previously extracted.
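      For readers who want to see the mechanics, the following is a minimal Python sketch (not from the book) of the ideas just described: the loading vectors are obtained as eigenvectors of the correlation matrix of the centered and scaled data, the scores are the data multiplied by the loading vectors, and the resulting principal components are uncorrelated with one another.

# Illustrative sketch (not from the book): principal components from the
# eigendecomposition of the correlation matrix of a data matrix X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 6))                         # stand-in for a 72 x 6 data matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # center and scale each column

R = np.corrcoef(Z, rowvar=False)                     # 6 x 6 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)                 # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                    # reorder so the largest comes first
eigvals, loadings = eigvals[order], eigvecs[:, order]

scores = Z @ loadings                                # one score per row per component
# Each loading vector has unit length, and the score vectors are uncorrelated:
print(np.allclose(np.linalg.norm(loadings, axis=0), 1.0))     # True
print(np.round(np.corrcoef(scores, rowvar=False), 8)[:2, :2]) # essentially the identity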

      To perform PCA for this data set in JMP:

      1. Select Analyze > Multivariate Methods > Principal Components.

      2. Select the columns 1-Octanol through Hexane and add them as Y, Columns.

      3. Click OK.

      4. In the red triangle menu for the resulting report, select Eigenvalues.

      Your report should appear as in Figure 3.2. (Alternatively, you can simply run the last script in the data table panel, Principal Components.)
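      If you would like to reproduce a comparable analysis outside JMP, the following Python sketch uses scikit-learn on standardized columns, which corresponds to PCA on correlations. It assumes, purely hypothetically, that the data table has been exported to a CSV file named solubility.csv whose first column is the compound name and whose next six columns are the solvent measurements; adjust the file name and column positions to match your export.

# Illustrative sketch (not from the book): a comparable PCA outside JMP,
# assuming a hypothetical CSV export of Solubility.jmp.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("solubility.csv")                    # hypothetical exported file
X = StandardScaler().fit_transform(df.iloc[:, 1:7])   # the six solvent columns, centered and scaled

pca = PCA(n_components=6).fit(X)
print(pca.explained_variance_ratio_)                  # proportion of variance per component
print(pca.explained_variance_ratio_.cumsum())         # cumulative proportion
scores = pca.transform(X)                             # principal component scores per compound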

      Figure 3.2: PCA Analysis for Solubility.jmp


      Each row of data is transformed to a score on each principal component. Plots of these scores for the first two principal components are shown. We won’t get into the technical details, but each component has an associated eigenvalue and eigenvector. The Eigenvalues report indicates that the first component accounts for 79.75% of the variation in the data, and that the second component brings the cumulative total variation accounted for to 95.50%.
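      The Percent and Cum Percent columns of the Eigenvalues report follow directly from the eigenvalues: each component’s percent is its eigenvalue divided by the sum of all the eigenvalues (which, for PCA on correlations with six variables, is 6), and the cumulative percent is the running total. A small illustrative sketch, using hypothetical placeholder eigenvalues rather than the actual values from Solubility.jmp:

# Illustrative sketch (not from the book): deriving Percent and Cum Percent
# from eigenvalues. The eigenvalues below are hypothetical placeholders.
import numpy as np

eigvals = np.array([4.5, 1.0, 0.3, 0.1, 0.07, 0.03])   # hypothetical; sums to 6
percent = 100 * eigvals / eigvals.sum()
cum_percent = np.cumsum(percent)
for k, (p, c) in enumerate(zip(percent, cum_percent), start=1):
    print(f"Component {k}: {p:5.2f}%  cumulative {c:6.2f}%")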

      The plot on the far right in Figure 3.2, called a loading plot, gives insight into the data structure. All six of the variables have positive loadings on the first component. This means that the largest component of the variability is explained by a linear combination of all six variables with positive coefficients for each variable. But the second component has positive loadings only for 1-Octanol and Ether, while all other variables have negative loadings. This indicates that the next largest
