process abbreviation SEMMA (Sample-Explore-Modify-Model-Assess) includes the following key steps:

      Sample the data by creating information-rich data sets. This step includes data preparation blocks for importing, merging, appending, partitioning, and filtering, as well as statistical sampling and converting transactional data to time series data.

      Explore the data by searching for clusters, relationships, trends, and outliers. This step includes functional blocks for association discovery, cluster analysis, variable selection, statistical reporting, and graphical exploration.

      Modify the data by creating, imputing, selecting, and transforming the variables. This step includes functional blocks for removing variables, imputation, principal component analysis, and defining transformations.

      Model the data by using various statistical or machine learning techniques. This step includes functional blocks for linear and logistic regression, decision trees, neural networks, and partial least squares, among others, as well as for importing models defined by other developers, even outside SAS Enterprise Miner.

      Assess the generated solutions by evaluating their performance and reliability. This step includes functional blocks for comparing models, cutoff analysis, decision support, and score code management.

      The data preparation functionality is implemented in the Sample and Modify sets of functional blocks.

      Recently, SAS released a special set of SAS Enterprise Miner functional blocks for Time Series Data Mining (TSDM). Its functionality covers most of the procedures needed for exploring forecasting data. The data preparation step is delivered by the Time Series Data Preparation (TSDP) node, which provides data aggregation, summarization, differencing, merging, and the replacement of missing values.
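
      Outside of SAS Enterprise Miner, much of this preparation can be approximated with the TIMESERIES procedure in SAS/ETS. The following minimal sketch, which uses hypothetical data set and variable names, accumulates transactional records into a monthly series and fills periods with no activity; it only illustrates the kind of aggregation the TSDP node performs and is not the node itself.

      proc timeseries data=work.transactions out=work.monthly_sales;
         /* One observation per month; SETMISSING=0 replaces months
            with no transactions. Names are hypothetical. */
         id sales_date interval=month accumulate=total setmissing=0;
         var qty;
      run;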

      Variable reduction and selection steps using specialized SAS subroutines

      The key procedures for variable reduction and selection based on SAS/ETS and SAS/STAT are discussed briefly below.

      AUTOREG (SAS/ETS) estimates and forecasts linear regression models with autoregressive errors and supports stepwise regression. It can also combine autoregressive models with autoregressive conditionally heteroscedastic (ARCH) and generalized autoregressive conditionally heteroscedastic (GARCH) models, and it generates a variety of model diagnostic tests, tables, and plots.
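
      As an illustration only, the sketch below (hypothetical data set and variable names) regresses y on three candidate drivers with up to 12 autoregressive error lags and lets BACKSTEP remove the insignificant lags; a GARCH error-variance model could be requested with the GARCH= option.

      proc autoreg data=work.series;
         /* Linear regression with autoregressive errors; BACKSTEP performs
            stepwise elimination of insignificant AR lags. */
         model y = x1 x2 x3 / nlag=12 backstep;
         output out=work.autoreg_pred p=predicted;
      run;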

      MODEL (SAS/ETS) analyzes and simulates systems of nonlinear regression equations. It supports dynamic nonlinear models with multiple equations and includes a full range of nonlinear parameter estimation methods, such as nonlinear ordinary least squares, the generalized method of moments, nonlinear full information maximum likelihood, and so on.
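
      A minimal sketch of a nonlinear fit with PROC MODEL, again with hypothetical names and an arbitrary functional form, follows.

      proc model data=work.series;
         parms b0 b1 b2;                    /* parameters to estimate          */
         y = b0 + b1*x1 + b2*exp(-x2);      /* hypothetical nonlinear equation */
         fit y / ols;                       /* other methods: gmm, fiml, ...   */
      run;
      quit;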

      PLS (SAS/STAT) fits models by extracting successive linear combinations of the predictors, called factors (also called components or latent variables), which optimally address one or both of these two goals: explaining response or output variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking factors that explain both response and predictor variation. The contribution of the original variables to the factors is important to variable selection.
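
      A sketch of a typical PLS fit, assuming a training data set work.train with fifty candidate predictors, is shown below; with CV=ONE, leave-one-out cross validation chooses how many factors (up to NFAC=) are actually kept.

      proc pls data=work.train method=pls nfac=10 cv=one;
         /* Extract factors that explain both predictor and response variation. */
         model y = x1-x50;
         output out=work.pls_scores xscore=xsc;   /* factor scores xsc1, xsc2, ... */
      run;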

      PRINCOMP (SAS/STAT) performs principal component analysis (PCA) on the input data. The results contain the eigenvalues, the eigenvectors, and standardized or unstandardized principal component scores.
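
      For example (hypothetical names), the first five principal components and their standardized scores can be obtained as follows.

      proc princomp data=work.train out=work.pc_scores n=5 std;
         /* OUT= adds the component scores Prin1-Prin5 to the data;
            STD standardizes the scores to unit variance. */
         var x1-x50;
      run;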

      REG (SAS/STAT) is used for linear regression with options for forward and backward stepwise regression. It provides all necessary diagnostic statistics.
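
      A typical stepwise selection call, sketched with hypothetical names, looks like this.

      proc reg data=work.train;
         /* Stepwise selection with 0.05 entry/stay significance levels;
            VIF requests variance inflation factors as a collinearity check. */
         model y = x1-x20 / selection=stepwise slentry=0.05 slstay=0.05 vif;
      run;
      quit;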

      SIMILARITY (SAS/ETS) computes similarity measures associated with time-stamped data, time series, and other sequentially ordered numeric data. A similarity measure is a metric that measures the distance between the input and target sequences while taking into account the ordering of the data.
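
      A minimal sketch (hypothetical names) that computes a squared-deviation similarity measure between each candidate input series and the target series is given below; the summary goes to the OUTSUM= data set.

      proc similarity data=work.series outsum=work.sim_summary;
         id date interval=month;
         input x1-x10;                  /* candidate driver series            */
         target y / measure=sqrdev;     /* squared-deviation distance measure */
      run;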

      VARCLUS (SAS/STAT) divides a set of variables into clusters. Associated with each cluster is a linear combination of the variables in the cluster. This linear combination can be generated by two options: as a first principal component or as a centroid component. The VARCLUS procedure creates an output data set with component scores for each cluster. A second output data set can be used to draw a decision tree diagram of hierarchical clusters. The VARCLUS procedure is very useful as a variable-reduction method since a large set of variables can be replaced by the set of cluster components with little loss of information.
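
      A sketch of the typical usage, assuming a wide data set work.train, is given below; the OUTTREE= data set feeds PROC TREE to draw the cluster dendrogram, and one representative variable per cluster (for example, the one with the smallest 1-R**2 ratio in the printed output) can then be kept.

      proc varclus data=work.train outtree=work.tree maxeigen=0.7 short;
         var x1-x100;                           /* candidate drivers; hypothetical names */
      run;

      proc tree data=work.tree horizontal;      /* dendrogram of the hierarchical clusters */
      run;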

      Variable reduction and selection steps using SAS Enterprise Miner

      The data mining capabilities in SAS Enterprise Miner for variable reduction and selection are spread across the Explore, Modify, and Model tabs. Not surprisingly, the functional blocks are based on the SAS procedures discussed in the previous section. The functional blocks, or nodes, of interest are the following:

      In the Explore tab:

      Variable Clustering node implements the VARCLUS procedure in SAS Enterprise Miner—that is, it assigns input variables to clusters and allows variable reduction with a small set of cluster-representative variables.

      Variable Selection node evaluates the importance of potential input variables in predicting the output variable based on R-squared and Chi-squared selection criteria. Variables that are not related to the output variable are assigned rejected status and are not used in model building.

      In the Modify tab:

      Principal Components node implements the PRINCOMP procedure and, in the case of linear relationships, reduces the dimensionality of the original input data to the most important principal components, which capture a significant part of the data variability.

      In the Model tab:

      Decision Tree node splits the data in the form of a decision tree. Decision tree modeling applies a series of if-then decision rules that sequentially divide the data into a small number of groups that are homogeneous with respect to the target variable and together form a tree-like structure. One advantage of this block for variable selection is that it automatically ranks the input variables based on the strength of their contribution to the tree (a stand-alone coded sketch follows this list).

      Partial Least Squares node implements the PLS procedure.

      Gradient Boosting node uses a specific partitioning algorithm, developed by Jerome Friedman, called a gradient boosting machine.8

      Regression node generates either linear regression models or logistic regression models. It supports stepwise, forward, and backward variable selection methods.
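
      The Model tab nodes are interactive blocks inside SAS Enterprise Miner, but the decision tree idea can also be illustrated in code. The sketch below uses PROC HPSPLIT, which is available in later SAS/STAT releases and is not the procedure behind the Enterprise Miner node; names are hypothetical. Its default output includes a variable importance table that can guide variable selection.

      proc hpsplit data=work.train maxdepth=6 seed=12345;
         /* Regression tree for the interval target y; the printed variable
            importance table ranks the candidate inputs. */
         model y = x1-x20;
      run;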

      Two SAS Enterprise Miner nodes that are part of the new Time Series Data Mining tab, TS Similarity (TSSIM) and TS Dimension Reduction (TSDR), can be used for variable reduction as well. The TS Similarity node implements the SIMILARITY procedure based on four distance metrics: squared deviation, absolute deviation, mean square deviation, and mean absolute deviation, and it delivers a similarity map. The TS Dimension Reduction node applies four reduction techniques to the original data: singular value decomposition (SVD), discrete Fourier transformation (DFT), discrete wavelet transformation (DWT), and line segment approximations.
