Discovering Partial Least Squares with JMP. Marie Gaudard A.

      • Square data (when n ~ v, and n is large or very large)

      • Collinear variables, namely, variables that convey the same, or nearly the same, information

      • Noisy data

      Just to whet your appetite, we point out that PLS routinely finds application in the following disciplines as a way of taming multivariate data:

      • Psychology

      • Education

      • Economics

      • Political science

      • Environmental science

      • Marketing

      • Engineering

      • Chemistry (organic, analytical, medical, and computational)

      • Bioinformatics

      • Ecology

      • Biology

      • Manufacturing

      Data should always be screened for outliers and anomalies prior to any formal analysis, and PLS is no exception. In fact, PLS works best when the variables involved have somewhat symmetric distributions. For that reason, highly skewed variables are often logarithmically transformed prior to any analysis.
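As a rough illustration of why a logarithmic transformation helps, the following Python sketch compares a moment-based skewness measure before and after taking logs. The data values and the helper name `sample_skewness` are hypothetical, chosen only for illustration; they are not taken from the book or from JMP. The transformation assumes all values are strictly positive.

```python
import math
import statistics


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical right-skewed measurements (illustrative values only).
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 16.0, 64.0]

# The log transform requires strictly positive values.
logged = [math.log(v) for v in raw]

print("skewness before:", round(sample_skewness(raw), 2))
print("skewness after: ", round(sample_skewness(logged), 2))
```

The transformed values are still somewhat right-skewed, but much less so than the raw values, which is the sense of "somewhat symmetric" used above.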

      Also, the data are usually centered and scaled prior to conducting the PLS analysis. By centering, we mean that, for each variable, the mean of all its observations is subtracted from each observation. By scaling, we mean that each observation is divided by the variable’s standard deviation. Centering and scaling each variable results in a working data table where each variable has mean 0 and standard deviation 1.
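The centering and scaling just described can be sketched in a few lines of Python. This is a minimal illustration, not JMP's implementation: the column names `x1` and `x2` and their values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def center_and_scale(column):
    """Subtract the column mean, then divide by the column standard deviation."""
    mean = statistics.fmean(column)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


# Hypothetical measurements on very different scales (illustrative values only).
table = {
    "x1": [0.12, 0.15, 0.11, 0.19, 0.14],
    "x2": [830.0, 910.0, 790.0, 1020.0, 880.0],
}

scaled = {name: center_and_scale(col) for name, col in table.items()}

# Each working column now has mean 0 and standard deviation 1 (up to rounding).
for name, col in scaled.items():
    print(name, round(statistics.fmean(col), 6), round(statistics.stdev(col), 6))
```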

      Centering and scaling are important because the weights that form the basis for the PLS model are very sensitive to the measurement units of the variables. Without scaling, variables with higher variance have more influence on the model. The process of centering and scaling puts all variables on an equal footing. If certain variables in X are indeed more important than others, and you want them to have higher influence, you can accomplish this by assigning them a higher scaling weight (Eriksson et al. 2006). As you will see, JMP makes centering and scaling easy.
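To see the sensitivity to measurement units concretely, the sketch below relies on a standard result for PLS with a single response (not stated in this section): the first weight vector is proportional to X'y, computed on centered data. The data, the variable names, and the way a "scaling weight" is applied (multiplying an already scaled column by a constant) are assumptions made for illustration only.

```python
import math
import statistics


def center(column):
    """Subtract the column mean from every value."""
    mean = statistics.fmean(column)
    return [v - mean for v in column]


def autoscale(column):
    """Center, then divide by the sample standard deviation."""
    sd = statistics.stdev(column)
    return [v / sd for v in center(column)]


def first_weights(columns, y):
    """Unit-length vector proportional to X'y (one entry per X column)."""
    dots = [sum(x * t for x, t in zip(col, y)) for col in columns]
    norm = math.sqrt(sum(d * d for d in dots))
    return [d / norm for d in dots]


# Hypothetical data: y is coded -1/+1 and already has mean 0.
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
x1 = [1.0, 2.0, 1.5, 3.0, 3.5, 2.5]                      # small units
x2 = [1500.0, 1000.0, 2000.0, 2500.0, 3500.0, 3000.0]    # units 1000 times larger

# Centered only: the high-variance column dominates the weights.
w_centered = first_weights([center(x1), center(x2)], y)

# Centered and scaled: both columns are on an equal footing.
w_scaled = first_weights([autoscale(x1), autoscale(x2)], y)

# Assumed form of a scaling weight: multiply the scaled x1 column by 2.
w_boosted = first_weights([[2.0 * v for v in autoscale(x1)], autoscale(x2)], y)

print("centered only:", [round(w, 3) for w in w_centered])
print("autoscaled:   ", [round(w, 3) for w in w_scaled])
print("x1 boosted:   ", [round(w, 3) for w in w_boosted])
```

In this toy example the two predictors separate the two groups equally well, yet without scaling nearly all of the weight goes to the column recorded in larger units.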

      Later we discuss how PLS relates to other modeling and multivariate methods. But for now, let’s dive into an example so that we can compare and contrast PLS with the more familiar multiple linear regression (MLR).

      The data table Spearheads.jmp contains data relating to the chemical composition of spearheads known to originate from one of two African tribes (Figure 1.1). You can open this table by clicking on the correct link in the master journal. A total of 19 spearheads of known origin were studied. The Tribe of origin is recorded in the first column (“Tribe A” or “Tribe B”). Chemical measurements of 10 properties were made. These are given in the subsequent columns and are represented in the Columns panel in a column group called Xs. There is a final column called Set, indicating whether an observation will be used in building our model (“Training”) or in assessing that model (“Test”).

      Figure 1.1: The Spearheads.jmp Data Table

      Our goal is to build a model that uses the chemical measurements to help us decide whether other spearheads collected in the vicinity were made by “Tribe A” or “Tribe B”. Note that there are 10 columns in X (the chemical compositions) and only one column in Y (the attribution of the tribe).

      The model will be built using the training set, rows 1–9. The test set, rows 10–19, enables us to assess the ability of the model to predict the tribe of origin for newly discovered spearheads. The column Tribe actually contains the numerical values +1 and –1, with –1 representing “Tribe A” and +1 representing “Tribe B”. The Tribe column displays Value Labels for these numerical values. It is the numerical values that the model actually predicts from the chemical measurements.
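The layout of the table can be mimicked in Python as follows. This is only a sketch: the rows are placeholders (the tribe assignments are arbitrary, not the real Spearheads.jmp values), and the rule that turns a numerical prediction back into a label by its sign is an assumption, not something stated in the text.

```python
# Numerical coding behind the Value Labels, as described in the text.
TRIBE_CODES = {"Tribe A": -1, "Tribe B": +1}


def split_rows(rows):
    """Separate rows by the Set column into training and test sets."""
    train = [r for r in rows if r["Set"] == "Training"]
    test = [r for r in rows if r["Set"] == "Test"]
    return train, test


def label_from_prediction(value):
    """Assumed decision rule: negative predictions map to Tribe A, others to Tribe B."""
    return "Tribe A" if value < 0 else "Tribe B"


# Placeholder table: 19 rows, rows 1-9 for training and rows 10-19 for testing.
# The alternating tribe labels are arbitrary and purely illustrative.
rows = [
    {
        "Row": i,
        "Tribe": "Tribe A" if i % 2 else "Tribe B",
        "Set": "Training" if i <= 9 else "Test",
    }
    for i in range(1, 20)
]

train, test = split_rows(rows)
y_train = [TRIBE_CODES[r["Tribe"]] for r in train]

print(len(train), "training rows;", len(test), "test rows")
print("coded response for training rows:", y_train)
print("a prediction of -0.3 would be read as", label_from_prediction(-0.3))
```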

      The table Spearheads.jmp also contains four scripts that help us perform the PLS analysis quickly. In the later chapters containing examples, we walk through the menu options that enable you to conduct such an analysis. But, for now, the scripts expedite the analysis, permitting us to focus on the concepts underlying a PLS analysis.

      The first script, Fit Model Launch Window, located in the upper left of the data table as shown in Figure 1.2, enables us to set up the analysis we want. From the red-triangle menu, shown in Figure 1.2, select Run Script. This script only runs if you are using JMP Pro since it uses the Fit Model partial least squares personality. If you are using JMP, you can select Analyze > Multivariate Methods > Partial Least Squares from the JMP menu bar. You will be able to follow the text, but with minor modifications.

      Figure 1.2: Running the Script “Fit Model Launch Window”

      This script produces a populated Fit Model launch window (Figure 1.3). The column Tribe is entered as a response, Y, while the 10 columns representing metal composition measurements are entered as Model Effects. Note that the Personality is set to Partial Least Squares. In JMP Pro, you can access this launch window directly by selecting Analyze > Fit Model from the JMP menu bar.

      Below the Personality drop-down menu, shown in Figure 1.3, there are check boxes for Centering and Scaling. As mentioned in the previous section, centering and scaling all variables in a PLS analysis treats them equitably in the analysis. There is also a check box for Standardize X. This option, described in “The Standardize X Option” in Appendix 1, centers and scales columns that are involved in higher-order terms. JMP selects these three options by default.

      Figure 1.3: Populated Fit Model Launch Window

      Clicking Run brings us to the Partial Least Squares Model Launch control panel (Figure 1.4). Here, we can make choices about how we would like to fit the model. Note that we can choose between two fitting algorithms, to be discussed later: NIPALS and SIMPLS. We accept the default settings. (To reproduce the exact analysis shown below, select Set Random Seed from the red-triangle menu at the top of the report and enter 111.) Click Go. (You can, instead, run the script PLS Fit to see the report.)

      Figure 1.4: PLS Model Launch Control Panel

      This appends three new report sections, as shown in Figure 1.5: Model Comparison Summary, KFold Cross Validation with K=7 and Method=NIPALS, and NIPALS Fit with 3 Factors. Later, we fully explain the various options and report contents, but for now we take the analysis on trust in order to quickly see this example in its entirety. As we discuss later, the Number of Factors
