Discovering Partial Least Squares with JMP. Marie Gaudard A.


Figure 1.5 shows 3 Factors, but your report might show a different number. This is because the Validation Method of KFold, set as a default in the JMP Pro Model Launch control panel, involves an element of randomness.

      Figure 1.5: Initial PLS Reports


      Once you have built a model in JMP, you can save the prediction formula to the table containing the data that were analyzed. We do this for our PLS model. From the options in the red-triangle menu for the NIPALS Fit with 3 Factors, select Save Columns > Save Prediction Formula (Figure 1.6).

      Figure 1.6: Saving the Prediction Formula


      The saved formula column, Pred Formula Tribe, appears as the last column in the data table. Because we are actually saving a formula, we obtain predicted values for all 19 rows.

      To see how well our PLS model has performed, let’s simulate the arrival of new data using our test set. We would like to remove the Hide and Exclude row states from rows 10-19, and apply them to rows 1-9. You can do this by hand, or by running the script Toggle Hidden/Excluded Rows. To do this by hand, select Rows > Clear Row States, select rows 1-9, right-click in the highlighted area near the row numbers, and select Hide and Exclude. (In versions of JMP prior to JMP 11, select Exclude/Unexclude, and then right-click again and select Hide/Unhide.)

      Now run the script Predicted vs Actual Tribe. For each row, this plots the predicted score for tribal origin on the vertical axis against the actual tribe of origin on the horizontal axis (Figure 1.7).

      Figure 1.7: Predicted versus Actual Tribe for Test Data


      To produce this plot yourself, select Graph > Graph Builder. In the Variables panel, right-click on the modeling type icon to the left of Tribe and select Nominal. (This causes the value labels for Tribe to display.) Drag Tribe to the X area and Pred Formula Tribe to the Y area.

      Note that the predicted values are not exactly +1 or -1, so it makes sense to use a decision boundary (the dotted blue line at the value 0) to separate or classify the scores produced by our model into two groups. You can insert a decision boundary by double-clicking on the vertical axis. This opens the Y Axis Specification window. In the Reference Lines section near the bottom of the window, click Add to add a reference line at 0, and then enter the text Decision Boundary in the Label text box.
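The logic of the decision boundary can be sketched in a few lines of Python. The scores below are hypothetical illustrative values, not the actual spearhead predictions; what matters is that the sign of each score, relative to the boundary at 0, determines the classification.

```python
# Hypothetical continuous predictions from a model where
# "Tribe A" was coded -1 and "Tribe B" was coded +1.
predicted = [-0.83, -1.10, -0.95, 0.78, 1.12]

# The decision boundary at 0 classifies each score by its sign.
classified = ["Tribe A" if s < 0 else "Tribe B" for s in predicted]
print(classified)
```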

      The important finding conveyed by the graph is that our PLS model has performed admirably. The model has correctly classified all ten observations in the test set. All of the observations for “Tribe A” have predicted values below 0 and all those for “Tribe B” have predicted values above 0.

      Our model for the spearhead data was built using only nine spearheads, one less than the number of chemical measurements made. PLS provides an excellent classification model in this case.

      Before exploring PLS in more detail, let’s engage in a quick review of multiple linear regression. This is a common approach to modeling a single variable in Y using a collection of variables, X.

      2  A Review of Multiple Linear Regression

       The Cars Example

       Estimating the Coefficients

       Underfitting and Overfitting: A Simulation

       The Effect of Correlation among Predictors: A Simulation

      Consider Figure 2.1, which displays the data table CarsSmall.jmp. You can open this table by clicking on the correct link in the master journal. This data table consists of six rows, corresponding to specific cars of different types, and six variables from the JMP sample data table Cars.jmp.

      Figure 2.1: Data Table CarsSmall.jmp


      The first column, Automobile, is an identifier column. Our goal is to predict miles per gallon (MPG) from the other descriptive variables. So, in this context, the variable MPG is the single variable in Y, and X consists of the four variables Number of Cylinders, HP (horsepower), Weight, and Transmission (with values “Man” and “Auto”, for manual and automatic transmissions, respectively).

      This data structure is typical of the type of data to which multiple linear regression (MLR), or more generally, any modeling approach, is applied. This familiar tabular structure leads naturally to the representation and manipulation of data values as matrices.

      To be more specific, a multiple linear regression model for our data can be represented as shown here:

      (2.1)
\[
\begin{pmatrix} 21.0 \\ 22.8 \\ 18.7 \\ 18.1 \\ 14.3 \\ 24.4 \end{pmatrix}
=
\begin{bmatrix}
1 & 6 & 110 & 2.62 & \text{Man}\;(0) \\
1 & 4 & 93  & 2.32 & \text{Man}\;(0) \\
1 & 8 & 175 & 3.44 & \text{Auto}\;(1) \\
1 & 6 & 105 & 3.46 & \text{Auto}\;(1) \\
1 & 8 & 245 & 3.57 & \text{Auto}\;(1) \\
1 & 4 & 62  & 3.19 & \text{Auto}\;(1)
\end{bmatrix}
*
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
\]

      Here are various items to note:

      1. The values of the response, MPG, are presented on the left side of the equality sign, in the form of a column vector, which is a special type of matrix that contains only a single column. In our example, this is the only column in the response matrix Y.

      2. The rectangular array to the immediate right of the equality sign, delineated by square brackets, consists of five columns. There is a column of ones followed by four columns consisting of the values of our four predictors, Number of Cylinders, HP (horsepower), Weight, and Transmission. These five columns are the columns in the matrix X.

      3. In parentheses, next to the entries in the last column of X, the Transmission value labels, “Man” and “Auto” have been assigned the numerical values 0 and 1, respectively. Because matrices can contain only numeric data, the values of the variable Transmission have to be coded in a numerical form. When a nominal variable is included in a regression model, JMP automatically codes that column, and you can interpret reports without ever knowing what has happened behind the scenes. But if you are curious, select Help > Books > Fitting Linear Models, and search for “Nominal Effects” and “Nominal Factors”.

      4. The column vector consisting of βs, denoted β, contains the unknown coefficients that relate the entries in X to the entries in Y. These are usually called regression parameters.

      5. The column vector consisting of epsilons (εi), denoted ε, contains the unknown errors. This vector represents the variation that is unexplained when we model Y using X.
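      The matrix representation in Equation (2.1) can be set up directly in NumPy. This is a sketch under the assumptions listed in the items above (intercept column of ones, Transmission coded 0 for "Man" and 1 for "Auto"); the least-squares computation at the end previews the "Estimating the Coefficients" section.

```python
import numpy as np

# Response vector Y (MPG) and model matrix X from Equation (2.1):
# a column of ones, Number of Cylinders, HP, Weight, and Transmission
# coded 0 (Man) / 1 (Auto).
Y = np.array([21.0, 22.8, 18.7, 18.1, 14.3, 24.4])
X = np.array([
    [1, 6, 110, 2.62, 0],   # Man
    [1, 4,  93, 2.32, 0],   # Man
    [1, 8, 175, 3.44, 1],   # Auto
    [1, 6, 105, 3.46, 1],   # Auto
    [1, 8, 245, 3.57, 1],   # Auto
    [1, 4,  62, 3.19, 1],   # Auto
], dtype=float)

# Least-squares estimate of the coefficient vector beta.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

fitted = X @ beta            # predicted MPG values
residuals = Y - fitted       # estimates of the error vector epsilon
```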

      The symbol “*” in Equation (2.1) denotes matrix multiplication.
