values plotted horizontally and equally spaced over the interval 0 to 1 (Figure 2.2). The points exhibit some curvature. The script uses MLR to predict Y from X, using various polynomial models. (Note that your points will differ from ours because of the randomness.)

      When the slider in the bottom left corner is set at the far left, the order of the polynomial model is one. In other words, we are fitting the data with a line. In this case, the design matrix X has two columns, the first containing all 1s and the second containing the horizontal coordinates of the plotted points. The linear fit ignores the seemingly obvious pattern in the data; it is underfitting the data. This is evidenced by the residuals, whose magnitudes are illustrated using vertical blue lines. The RMSE (root mean square error) is calculated by squaring each residual, summing these squares, dividing the sum by the number of observations minus one minus the number of predictors, and then taking the square root.
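      To make the RMSE calculation concrete, here is a minimal Python sketch (not the PolyRegr.jsl script itself) that fits a straight line to some made-up noisy data and computes the RMSE with the degrees-of-freedom adjustment just described. The curved mean function, noise level, and sample size are our own assumptions.

```python
import numpy as np

# Made-up data: 11 equally spaced x values on [0, 1], a curved mean, plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 11)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Design matrix for a first-order fit: a column of 1s and the horizontal coordinates
X = np.column_stack([np.ones_like(x), x])

# Least squares fit and residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# RMSE: sum of squared residuals over (n - 1 - number of predictors), then square root
n, p = x.size, 1
rmse = np.sqrt(np.sum(residuals**2) / (n - 1 - p))
print(rmse)
```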

      As we shift the slider to the right, we are adding higher-order polynomial terms to the model. This is equivalent to adding additional columns to the design matrix. The additional polynomial terms provide a more flexible model that is better able to capture the important characteristics, or the structure, of the data.

      Figure 2.2: Illustration of Underfitting and Overfitting, with Order = 1, 2, 3, and 10


      However, we get to a point where we go beyond modeling the structure of the data, and begin to model the noise in the data. Note that, as we increase the order of the polynomial, thereby adding more terms to the model, the RMSE progressively reduces. An order 10 polynomial, obtained by setting the slider all the way to the right, provides a perfect fit to the data and gives RMSE = 0 (bottom right plot in Figure 2.2). However, this model is not generalizable to new data, because it has modeled both the structure and the noise, and by definition the noise is random and unpredictable. Our model has overfit the data.
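      The same behavior can be sketched by sweeping the polynomial order, again in Python rather than JSL. With the 11 assumed data points, an order 10 polynomial passes through every observation, so the residuals, and hence the RMSE, collapse to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 11)   # 11 points is an assumption about the demo data
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for order in (1, 2, 3, 10):
    # Raising the order adds columns to the design matrix; np.polyfit builds them for us
    coeffs = np.polyfit(x, y, deg=order)
    resid = y - np.polyval(coeffs, x)
    df = x.size - 1 - order   # observations minus one minus predictors
    # At order 10 the fit interpolates all 11 points, so the residuals vanish
    rmse = np.sqrt(np.sum(resid**2) / df) if df > 0 else 0.0
    print(order, round(rmse, 4))
```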

      In fitting models, we must strike a balance between modeling the intrinsic structure of the data and modeling the noise in the data. One strategy for reaching this goal is the use of cross-validation, which we shall discuss in the section “Choosing the Number of Factors” in Chapter 4. You can close the report produced by PolyRegr.jsl at this point.

      In MLR, correlation among the predictors is called multicollinearity. We explore the effect of multicollinearity on estimates of the regression coefficients by running the script Multicollinearity.jsl. Do this by clicking on the correct link in the master journal. The script produces the launch window shown in Figure 2.3.

      Figure 2.3: Multicollinearity Simulation Launch Window


      The launch window enables you to set conditions to simulate data from a known model:

      • You can set the values of the three regression coefficients: Beta0 (constant); Beta1 (X1 coefficient); and Beta2 (X2 coefficient). Because there are three regression parameters, you are defining a plane that models the mean of the response, Y. In symbols,

      E[Y] = β0 + β1X1 + β2X2

      where the notation E[Y] represents the expected value of Y.

      • The noise that is applied to Y is generated from a normal distribution with mean 0 and with the standard deviation that you set as Sigma of Random Noise under Other Parameters. In symbols, this means that ε in the expression

      Y = β0 + β1X1 + β2X2 + ε

      has a normal distribution with mean 0 and standard deviation equal to the value you set.

      • You can specify the correlation between the values of X1 and X2 using the slider for Correlation of X1 and X2 under Other Parameters. X1 and X2 values will be generated for each simulation from a multivariate normal distribution with the specified correlation.

      • In the Size of Simulation panel, you can specify the Number of Points to be generated for each simulation, as well as the Number of Simulations to run.

      Once you have set values for the simulation using the slider bars, generate results by clicking Simulate. Depending on your screen size, you can view multiple results simultaneously without closing the launch window.
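      For readers who want to see the data-generating mechanism outside of JMP, here is a rough Python sketch of a single simulation run. The coefficient values, noise standard deviation, correlation, and number of points below are illustrative assumptions, not the launch window's defaults.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative settings (assumptions, not the launch window defaults)
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # true regression coefficients
sigma = 1.0                           # Sigma of Random Noise
rho = 0.92                            # Correlation of X1 and X2
n = 30                                # Number of Points

# Draw (X1, X2) from a bivariate normal distribution with the chosen correlation
cov = [[1.0, rho], [rho, 1.0]]
x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Generate Y from the plane plus normal noise, then fit the MLR model by least squares
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(scale=sigma, size=n)
X = np.column_stack([np.ones(n), x1, x2])
estimates = np.linalg.lstsq(X, y, rcond=None)[0]
print("estimated coefficients:", np.round(estimates, 3))
```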

      Let’s first run a simulation with the initial settings. Then run a second simulation after moving the Correlation of X1 and X2 slider to a large, positive value. (We have selected 0.92.) Your reports will be similar to those shown in Figure 2.4.

      Figure 2.4: Comparison of Design Settings, Low and High Predictor Correlation


      The two graphs reflect the differences in the settings of the X variables for the two correlation scenarios. In the first, the points are evenly distributed in a circular pattern. In the second, the points are condensed into a narrow elliptical pattern. These patterns show the geometry of the design matrix for each scenario.

      In the high correlation scenario, note that high values of X1 tend to be associated with high values of X2, and that low values of X1 tend to be associated with low values of X2. This is exactly what is expected for positive correlation. (For a definition of the correlation coefficient between observations, select Help > Books > Multivariate Methods and search for “Pearson Product-Moment Correlation”).

      The true and estimated coefficient values are shown at the bottom of each plot. Because our model was not deterministic—the Y values were generated so that their means are linear functions of X1 and X2, but the actual values are affected by noise—the estimated coefficients are just that, estimates, and as such, they reflect uncertainty. This uncertainty is quantified in the columns Std Error (standard error), Lower95% (the lower 95% limit for the coefficient’s confidence interval), and Upper95% (the upper 95% limit for the coefficient’s confidence interval).

      Notice that the estimates of beta1 and beta2 can be quite different from the true values in the high correlation scenario (bottom plot in Figure 2.4). Consistent with this, the standard errors are larger and the confidence intervals are wider.
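      The widening of the standard errors follows from a standard least squares result (not derived in the text). For a two-predictor model, the variance of b1, the estimate of β1, is

      Var(b1) = σ² / [(1 − r²) Σ(X1i − X̄1)²]

      where r is the correlation between X1 and X2. With r = 0.92, the inflation factor 1/(1 − r²) is roughly 6.5, so the standard error of b1 is about two and a half times what it would be with uncorrelated predictors.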

      Let’s get more insight into the impact that changing the correlation value has on the estimates of the coefficients. Increase the Number of Simulations to about 500 using the slider, and again simulate with two different values of the correlation, one near zero and one near one. You should obtain results similar to those in Figure 2.5.
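      As a companion to the JMP simulation, the following Python sketch repeats the fit many times at two correlation settings and reports the spread of the estimates of β1 and β2; the coefficient values, noise level, and sample size are again assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # assumed true coefficients
sigma, n, n_sims = 1.0, 30, 500       # assumed noise level, points per run, number of runs

def estimate_spread(rho):
    """Repeat the simulation n_sims times and return the (b1, b2) estimates."""
    cov = [[1.0, rho], [rho, 1.0]]
    estimates = np.empty((n_sims, 2))
    for i in range(n_sims):
        x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(scale=sigma, size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0][1:]
    return estimates

for rho in (0.0, 0.92):
    est = estimate_spread(rho)
    # Standard deviation of Estimate beta1 and Estimate beta2 across the simulations
    print(rho, np.round(est.std(axis=0), 3))
```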

      Figure 2.5: Plots of Estimates for Coefficients, Low and High Predictor Correlation


      These plots show Estimate beta1 and Estimate beta2, the estimated values of β1 and β2, from the 500 or so regression fits. The reference
