Applied Data Mining for Forecasting Using SAS. Tim Rey

7 and a short description of the corresponding substeps and deliverables is given below.

      Variable reduction via data mining methods

      Since there is already a rich literature in the statistical and machine learning disciplines on approaches to variable reduction or selection, this book often refers to and contrasts methods used on “non-time series,” or transactional, data. New methods developed specifically for time series data are discussed in more detail in Chapter 7. In the transactional data approach, the association among the independent variables is explored directly. Typical techniques used in this case are variable cluster analysis and principal component analysis (PCA). In both methods, the analysis can be based on either correlation or covariance matrices. Once the clusters are found, the variable with the highest correlation to its cluster's centroid is chosen as the representative of the whole cluster. Another frequently used approach is variable reduction via PCA, where a transformed set of new variables (based on the correlation structure of the original variables) is constructed to describe some minimum amount of the variation in the data. This reduces the dimensionality of the problem in the independent variables.
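
      The PCA route can be sketched as follows. This is a minimal NumPy illustration on hypothetical synthetic data (in SAS the same task is handled by procedures such as PROC PRINCOMP); the 90% variance cutoff and the three-factor structure are assumptions for the example, not recommendations from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical example: 10 correlated candidate drivers, 120 monthly observations,
# driven by 3 underlying factors plus a little noise
base = rng.normal(size=(120, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(120, 10))

# Standardize, then run PCA via eigendecomposition of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]            # sort components, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the fewest components explaining at least 90% of total variance
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.90)) + 1
scores = Z @ eigvecs[:, :k]                  # the reduced-dimension inputs
print(k, scores.shape)
```

Here the ten correlated inputs collapse to a handful of component scores, which is exactly the dimensionality reduction the text describes.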

      In time series-based variable reduction, the time factor is taken into account. One of the most widely used methods is similarity analysis, where the data is first phase shifted and time warped. A distance metric is then calculated to obtain the similarity measure between each pair of time series xi and xj. Variables below some critical distance are considered similar, and one of them can be selected as the representative. In the case of correlated inputs, the dimensionality of the original data set can be reduced significantly after removing the similar variables. PCA can also be used on time series data; an example is the Chicago Fed National Activity Index (CFNAI), which was developed from 85 variables representing different sectors of the US economy.4
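
      The keep-one-representative idea can be sketched as below. This is a simplified stand-in: it uses a plain Euclidean distance on standardized series rather than the phase-shifted, time-warped distances of full similarity analysis (in SAS, PROC SIMILARITY), and the series, names, and 0.5 threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
t = np.arange(n)
# Hypothetical candidate drivers: x2 is nearly a copy of x1, x3 is unrelated
x1 = np.sin(2 * np.pi * t / 12) + 0.05 * rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
series = {"x1": x1, "x2": x2, "x3": x3}

def distance(a, b):
    # Euclidean distance between standardized series (a crude similarity measure;
    # full similarity analysis would also allow phase shifts and time warping)
    za = (a - a.mean()) / a.std()
    zb = (b - b.mean()) / b.std()
    return float(np.sqrt(np.mean((za - zb) ** 2)))

threshold = 0.5   # critical distance below which two series count as similar
keep = []
for name, s in series.items():
    if all(distance(s, series[k]) >= threshold for k in keep):
        keep.append(name)      # keep one representative per group of similar series
print(keep)                    # → ['x1', 'x3']
```

x2 falls within the critical distance of x1 and is dropped, so only one representative of that pair survives into the reduced set.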

      Variable selection via data mining methods

      Again, there is quite a rich literature on variable or feature selection for transactional data mining problems. In variable selection, the significant inputs are chosen based on their association with the dependent variable. As with variable reduction, different methods apply to data with a time series nature than to transactional data. The first approach uses traditional transactional data mining variable selection methods. Some of the best-known methods, discussed in Chapter 7, are correlation analysis, stepwise regression, decision trees, partial least squares (PLS), and genetic programming (GP). In order to use these approaches on time series data, the data has to be preprocessed properly. First, both the Ys and the Xs are made stationary by taking first differences. Second, some of the system's dynamics are added by introducing lags for each X. As a result, the number of extended X variables to consider as inputs increases significantly; however, this enables you to capture dynamic dependencies between the independent and the dependent variables. This approach is often referred to as the poor man's approach to time series variable selection, since most of the extra work goes into preparing the data, after which non-time series methods are applied.
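
      The data preparation half of that “poor man's approach” (differencing, then building the extended lagged inputs) can be sketched as follows. The nonstationary random-walk data, the three drivers, and the choice of two lags are hypothetical; the resulting matrix is what would then be fed to any transactional selection method such as stepwise regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, max_lag = 100, 3, 2
X = rng.normal(size=(n, p)).cumsum(axis=0)   # nonstationary candidate drivers
y = rng.normal(size=n).cumsum()              # nonstationary target

# Step 1: first-difference y and every X to make them (roughly) stationary
dX, dy = np.diff(X, axis=0), np.diff(y)

# Step 2: extend the inputs with lags 0..max_lag of each differenced X,
# aligning every lagged column with the same target rows
rows = dX.shape[0] - max_lag
lagged = np.column_stack([dX[max_lag - L : max_lag - L + rows, j]
                          for j in range(p) for L in range(max_lag + 1)])
target = dy[max_lag:]
print(lagged.shape, target.shape)   # p * (max_lag + 1) extended inputs
```

The three original drivers become nine candidate columns, which is the significant increase in extended X variables the text mentions.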

      The second approach is geared specifically toward time series. There are four methods in this category. The first is the correlation coefficient method. The second is a special version of stepwise regression for time series models. The third is similarity, as discussed earlier in the variable reduction substep, but in this case the distance metric is between the Y and the Xs; thus, the smaller the similarity metric, the stronger the relationship of the corresponding input to the output variable. The fourth is co-integration, a specialized test of whether two time series move together in the long run. Much more detail on these analyses is presented in Chapter 7.
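
      The correlation coefficient method, the first of the four, can be sketched as below: correlate Y with each X across a range of lags and keep the inputs whose strongest cross-correlation is large. The data, the three-period lag, and the six-lag search window are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = np.empty(n)
y[3:] = 0.8 * x[:-3]                     # y responds to x with a 3-period lag
y[:3] = rng.normal(size=3)
y += 0.2 * rng.normal(size=n)
noise = rng.normal(size=n)               # an irrelevant candidate driver

def best_lag_corr(xs, ys, max_lag=6):
    # Correlation of y[t] with x[t - L] for L = 0..max_lag; return the strongest lag
    corrs = {L: np.corrcoef(xs[: n - L], ys[L:])[0, 1]
             for L in range(max_lag + 1)}
    L = max(corrs, key=lambda k: abs(corrs[k]))
    return L, corrs[L]

lag_x, r_x = best_lag_corr(x, y)         # strong correlation found at lag 3
lag_n, r_n = best_lag_corr(noise, y)     # weak at every lag: drop this driver
print(lag_x, round(r_x, 2))
```

The relevant driver is selected, and the lag at which its correlation peaks also suggests how it should enter the forecasting model.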

      One important addition to variable selection is to be sure to include the subject matter experts' (SMEs') favorite drivers, as well as those identified as such in market studies (such as CMAI in the chemical industry) or by market analysts.

      Event selection

      Events are specific class variables in forecasting. These variables help describe large discrete shifts and deviations in the time series. Examples of such variables are advertising campaigns before Christmas and Mother's Day, mergers and acquisitions, natural disasters, and so on. It is very important to clarify and define the events and their types in this phase of project development.
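
      Concretely, events usually enter the data as dummy (indicator) columns. The sketch below builds three common types on a hypothetical monthly calendar: a one-period pulse, a recurring seasonal campaign, and a permanent level shift; the dates and the acquisition event are invented for the example.

```python
import numpy as np
import datetime as dt

# Hypothetical monthly dates for three years
dates = [dt.date(2010 + y, m, 1) for y in range(3) for m in range(1, 13)]

# Point event: a (hypothetical) acquisition in June 2011 — coded as a one-period pulse
pulse = np.array([d == dt.date(2011, 6, 1) for d in dates], dtype=float)

# Recurring event: pre-Christmas advertising every November and December
advertising = np.array([d.month in (11, 12) for d in dates], dtype=float)

# Level-shift event: a step dummy that stays at 1 from the acquisition onward
step = np.array([d >= dt.date(2011, 6, 1) for d in dates], dtype=float)

print(int(pulse.sum()), int(advertising.sum()), int(step.sum()))
```

Choosing between a pulse and a step encodes the analyst's definition of the event's type: a temporary deviation versus a permanent shift in level.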

      Variable reduction and selection deliverables

      The key deliverable from the variable reduction and selection step is a reduced set of Xs that are less correlated with one another. This set is assumed to include only the most relevant drivers, or independent variables, selected by consensus based on their statistical significance and expert judgment. However, additional variable reduction is possible during the forecasting phase. Selected events are another important deliverable before beginning the forecasting activities.

      As always, document the variable reduction/selection actions. The documentation includes a detailed description of all variable reduction and selection steps, as well as the arguments for the final selection based on statistical significance and subject matter experts' approval.

      Forecasting model development steps

      This block of the work process includes all activities necessary to deliver forecasting models with the best performance, based on the available preprocessed data and the reduced set of potential independent variables. Among the numerous options for designing forecasting models, the focus in this book is on the most widely used practical approaches for univariate and multivariate models. The related techniques and development methodologies are described in Chapters 8 through 11 with minimal theory and sufficient detail for practitioners. The basic substeps and deliverables are described below.

      Basic forecasting steps: identification, estimation, forecasting

      Even the most complex forecasting models are based on three fundamental steps: (1) identification, (2) estimation, and (3) forecasting. The first step is identifying a specific model structure based on the nature of the time series and the modeler's hypothesis. Examples of the most widely used forecasting model structures are exponential smoothing, autoregressive models, moving average models, their combination in autoregressive moving average (ARMA) models, and unobserved components models (UCM). The second step is estimating the parameters of the selected model structure. The third step is applying the developed model, with its estimated parameters, to produce the forecast.
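
      The three steps can be made concrete with the simplest autoregressive case. This NumPy sketch on simulated data stands in for what PROC ARIMA does in SAS; the AR(1) structure is assumed to have come out of the identification step, and the least-squares estimator is a simplification of the estimation methods the book covers.

```python
import numpy as np

rng = np.random.default_rng(3)
n, phi = 300, 0.7
y = np.zeros(n)
for t in range(1, n):                  # simulate a hypothetical AR(1) series
    y[t] = phi * y[t - 1] + rng.normal()

# Step 1 (identification): suppose diagnostics point to an AR(1) structure.
# Step 2 (estimation): least-squares fit of y[t] on y[t-1]
phi_hat = float(np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2))

# Step 3 (forecasting): iterate the fitted recursion h steps ahead
h, last = 12, y[-1]
forecast = []
for _ in range(h):
    last = phi_hat * last
    forecast.append(last)
print(round(phi_hat, 2), len(forecast))
```

Note how cleanly the three steps separate: the structure is fixed first, its single parameter is then estimated, and only then is the model run forward to generate the forecast.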

      Univariate forecasting model development

      This substep represents the classical process of forecasting a single variable. The future forecast is based on discovering trend, cyclicality, or seasonality in the past data. The developed composite forecasting model includes an individual component for each identified pattern. The key hypothesis is that the patterns discovered in the past will persist into the future. In addition to the basic forecasting steps, univariate forecasting model development includes the following sequence:

       Dividing the data into in-sample set (for model development) and out-of-sample set (for model validation)

       Applying the basic forecasting steps for the selected method on an in-sample set

       Validating the model through appropriate residuals tests

       Comparing the performance by applying the model to an out-of-sample set where possible

       Selecting the best model
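
      The sequence above (split, fit, compare out of sample, select) can be sketched end to end as follows. The data, the 12-point holdout, the two candidate models, and the use of mean absolute error are all hypothetical simplifications; residual diagnostics, the validation substep, are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 120
t = np.arange(n)
# Hypothetical monthly series with a trend and 12-period seasonality
y = 0.05 * t + np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=n)

# 1. Split: first 108 points in-sample, last 12 held out of sample
split = n - 12
y_in, y_out = y[:split], y[split:]
t_in, t_out = t[:split], t[split:]

# 2. Fit two candidate models on the in-sample set only
# Model A: linear trend
bA = np.polyfit(t_in, y_in, 1)
fA = np.polyval(bA, t_out)
# Model B: linear trend plus a 12-period seasonal sine/cosine pair
XB = np.column_stack([t_in, np.sin(2 * np.pi * t_in / 12),
                      np.cos(2 * np.pi * t_in / 12), np.ones(split)])
bB, *_ = np.linalg.lstsq(XB, y_in, rcond=None)
XB_out = np.column_stack([t_out, np.sin(2 * np.pi * t_out / 12),
                          np.cos(2 * np.pi * t_out / 12), np.ones(12)])
fB = XB_out @ bB

# 3. Compare out-of-sample error and select the best model
maeA = float(np.mean(np.abs(fA - y_out)))
maeB = float(np.mean(np.abs(fB - y_out)))
best = "trend+seasonal" if maeB < maeA else "trend only"
print(round(maeA, 2), round(maeB, 2), best)
```

Because the holdout data was never seen during fitting, the out-of-sample comparison rewards the model that captured the genuine seasonal pattern rather than the one that merely fit the in-sample noise.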
