Applied Data Mining for Forecasting Using SAS. Tim Rey

advantages, however, outweigh the disadvantages and the client/server infrastructure is the standard solution for large-scale industrial applications of data mining and forecasting.

      Another potential solution, called cloud computing, uses powerful external and internal computing resources, and includes grid computing for parallel processing, multi-tiered computer architecture, and the capacity to handle super-large data sets. Such services are currently offered by a number of vendors including well-established industry leaders. Some of the advantages of using this option are as follows:

       low implementation and maintenance cost

       super-computer power, which is continuously upgraded by the cloud owner

       data consolidation in very large data sets

       increased reliability

      The disadvantages of using a cloud computing infrastructure are summarized as follows:

       proprietary data security

       initial transfer of very large corporate data to the cloud

       limited software

       trust issues

       information technology (IT) management resistance

      This option is still in an exploratory phase and has generated a lot of hype. However, if the technical and economic advantages are proved with more industrial applications, it could become a popular hardware infrastructure in the near future.

      The lion's share of the costs for implementing data mining for forecasting systems, especially for the PC network infrastructure, is not the cost of hardware but the cost of software infrastructure. One of the key decisions to make in advance is the scale of the efforts. In the case of large-scale forecasting on a corporate level that is to be implemented across the globe, an integrated software environment made up of all necessary components with global support is strongly recommended. An example of such infrastructure (based on SAS software) is discussed in this book.

      This part of the infrastructure depends strongly on the existing corporate information system architecture, which can be very diverse and span different database platforms. In most cases, however, the data are organized in relational databases and stored in a separate table for each entity. The relationship between two tables is defined by a pair of key columns: a primary key in one table and a foreign key in the other (Svolba 2006). Data accessed from a relational database are usually extracted table by table and then merged on the primary and foreign keys.

      The software basis for handling data in relational database systems is the Structured Query Language (SQL). It includes the operators needed to search for data as well as various aggregations and joins of tables. The leading relational database systems include Oracle, SAP MaxDB and Sybase, Microsoft SQL Server, and IBM DB2. The good news is that the key existing software programs for data mining, such as SAS Enterprise Miner, IBM SPSS, and StatSoft STATISTICA Data Miner, include all the necessary software interfaces to collect data from diverse sources. For example, SAS offers a specialized tool, SAS/ACCESS, that has almost universal capabilities for access, retrieval, and integration with any available data source.
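      As a minimal sketch of the join-and-aggregate step described above, the following Python example uses the standard-library sqlite3 module as a stand-in for a corporate relational database. The customers and orders tables, their columns, and the sample rows are hypothetical and chosen only for illustration; a production system would reach Oracle, DB2, and similar platforms through its own interface.

```python
import sqlite3

# Hypothetical schema: customers holds the primary key,
# orders refers to it through a foreign key column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        region      TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        month       TEXT NOT NULL,
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES
        (10, 1, '2011-01', 100.0),
        (11, 1, '2011-02', 150.0),
        (12, 2, '2011-01', 200.0),
        (13, 2, '2011-01',  50.0);
""")

# Join the tables on the key columns, then aggregate to one row
# per region and month, the usual shape of a time series input.
monthly_sales = conn.execute("""
    SELECT c.region, o.month, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region, o.month
    ORDER BY c.region, o.month
""").fetchall()
conn.close()

for row in monthly_sales:
    print(row)
```

      Aggregating inside the database, rather than after extraction, keeps the amount of data transferred to the modeling environment small.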

      It is recommended that the selected software has the following functionality for data preparation:

       Data manipulation capabilities that include functions for summary tables generation, data split, concatenation, transposition, stacking, sorting, flexible filtering, joining tables, and so on.

       Missing data handling that includes different options to impute missing data.

       Data description capabilities that are usually based on basic descriptive statistics, frequency tables, histograms, and so on.

       Data visualization capabilities that include a broad spectrum of graphics, such as 3-D scatter plots, contour plots, parallel plots, and so on.

       Data pre-processing capabilities that include filtering, outlier detection and removal, data sampling, data partitioning, data transformation, and so on.

      Examples of software tools with these capabilities are SAS Enterprise Guide, JMP, IBM SPSS, and StatSoft STATISTICA Data Miner.
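      Two of the data preparation capabilities listed above, missing data imputation and outlier detection, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by any of the tools named here: gaps are filled by linear interpolation, outliers are flagged with a median-based modified z-score, and the function names, the 3.5 threshold, and the sample series are all hypothetical choices.

```python
from statistics import median

def impute_linear(values):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest value."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (median and MAD based) exceeds threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread, nothing to flag
    return [abs(0.6745 * (v - center) / mad) > threshold for v in values]

# Hypothetical series with two gaps and one suspicious spike.
raw = [10.0, None, 12.0, 13.0, 95.0, 14.0, None]
clean = impute_linear(raw)
outliers = flag_outliers(clean)
print(clean)
print(outliers)
```

      The median-based score is used here because a single extreme value inflates the ordinary standard deviation and can mask itself.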

      From the broad range of available data mining methods and functions, the following capabilities for variable reduction and selection are needed for forecasting applications:

       Basic statistical capabilities that include building and analyzing linear regression models with options for variable selection by forward and backward stepwise regression.

       Multivariate analysis capabilities that include cross-correlation analysis, PCA, and PLS.

       Clustering capabilities that include dividing variables into clusters by linear or nonlinear methods, similarity analysis, and building decision trees.

       Variable selection capabilities that include different algorithms for variable selection, such as stepwise regression, decision trees, gradient boosting, singular value decomposition (SVD), and so on.

      The three most popular software options for industrial applications that offer most of these capabilities are SAS Enterprise Miner, IBM SPSS, and StatSoft STATISTICA Data Miner.
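      To make the idea of correlation-based variable screening concrete, the sketch below ranks candidate input series by their strongest lagged cross-correlation with the target and keeps only those above a cutoff. It is a deliberately simple stand-in for the richer selection algorithms listed above; the function names, the 0.7 cutoff, the maximum lag, and the sample data are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lag(candidate, target, max_lag=3):
    """Return (lag, r) for the lag at which candidate[t - lag] best tracks target[t]."""
    if len(candidate) != len(target):
        raise ValueError("series must have the same length")
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        x = candidate[:len(candidate) - lag]  # candidate shifted back by lag
        y = target[lag:]
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

def screen_variables(candidates, target, max_lag=3, min_abs_corr=0.7):
    """Keep candidates whose strongest lagged correlation reaches the cutoff."""
    kept = {}
    for name, series in candidates.items():
        lag, r = best_lag(series, target, max_lag)
        if abs(r) >= min_abs_corr:
            kept[name] = (lag, round(r, 3))
    return kept

# Hypothetical data: "driver" leads the target by one period, "noise" does not.
target = [3, 1, 4, 1, 5, 9, 2, 6]
candidates = {
    "driver": [1, 4, 1, 5, 9, 2, 6, 5],
    "noise": [2, 7, 1, 8, 2, 8, 1, 8],
}
kept = screen_variables(candidates, target)
print(kept)
```

      Only linear, pairwise relationships are detected this way, which is why the list above also calls for multivariate and tree-based methods.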

      The recommended capabilities for effective development of forecasting models in industrial applications are as follows:

       Time series analysis capabilities that include generating time series, different time plots, correlations, seasonality adjustments, decompositions, and so on.

       Forecasting model generation capabilities that include the most popular methods, such as exponential smoothing, ARIMA, unobserved components, and so on with a variety of diagnostic statistics and model performance metrics.

       Event modeling capabilities that allow large discrete shifts (events) to be introduced during model development.

       Hierarchical forecasting capabilities that include developing a model hierarchy at the desired level based on the existing business structure and reconciling this with the final forecast.

       Scenario generation capabilities for multivariate-based forecasting models; these “what if” scenarios can show the impact of the key inputs on the final forecast.

      The most powerful software tools that offer these capabilities are SAS Forecast Studio, Autobox (Automatic Forecasting Systems), and Forecast Pro (Business Forecast Systems).
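      Two of the capabilities listed above, exponential smoothing and hierarchical forecasting, can be illustrated with a short Python sketch. It assumes simple exponential smoothing with a fixed smoothing weight and a bottom-up reconciliation in which the top-level forecast is defined as the sum of the lower-level forecasts. The function name, the weight of 0.5, and the regional sales figures are hypothetical; commercial tools estimate the weight from the data and offer other reconciliation directions.

```python
def ses_forecast(series, alpha=0.5):
    """One-step-ahead forecast from simple exponential smoothing."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]  # initialize the level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical two-level hierarchy: total sales split by region.
regional_sales = {
    "EU": [100.0, 110.0, 120.0],
    "US": [200.0, 220.0, 240.0],
}

# Forecast each lower-level series separately.
regional_forecast = {
    region: ses_forecast(history) for region, history in regional_sales.items()
}

# Bottom-up reconciliation: the total is the sum of the regional forecasts,
# so the hierarchy is consistent by construction.
total_forecast = sum(regional_forecast.values())

print(regional_forecast)
print(total_forecast)
```

      Bottom-up is the simplest reconciliation to reason about, but it inherits the noise of the lowest level; top-down and middle-out schemes trade that off differently.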

      In addition to the specific technical capabilities of the key software components for a data mining for forecasting system, the following generic selection criteria are recommended:

       Cost depends
