Applied Data Mining for Forecasting Using SAS. Tim Rey

users (use the forecasting models on a regular basis)

      System structure and data identification

      The purpose of this substep is to capture and document the available knowledge about the system under consideration. This step provides a meaningful context for the necessary data and the data mining and forecasting steps. Knowledge acquisition usually takes several brainstorming sessions facilitated by model developers and attended by selected subject matter experts. The documentation may include process descriptions, market structure studies, system diagrams and process maps, relationship maps, etc. The authors' favorite technique for system structure and data identification is mind-mapping, which is a very convenient way of capturing knowledge and representing the system structure during the brainstorming sessions.

      Mind-mapping (or concept mapping) involves writing down a central idea and thinking up new and related ideas that radiate out from the center.¹ By focusing on key topics written down in the SMEs' own words, and then defining branches and connections between the topics, the knowledge of the SMEs can be mapped in a manner that helps in understanding and documenting the details necessary for future data and modeling activities. An example of a mind-map² for system structure and data identification in the case of a forecasting project for Product A is shown in Figure 2.2.

      The system structure, shown in the mind-map in Figure 2.2, includes three levels. The first level represents the key topics related to the project by radial branches from the central block named “Product A Price Forecasting.” In this case, according to the subject matter experts, the central topics are: Data, Competitors, Potential drivers, Business structure, Current price decision-making process, and Potential users. Each key topic can be structured in as many levels of detail as necessary. However, beyond three levels down, the overall system structure visualization becomes cumbersome and difficult to understand. An example of an expanded structure of the key topic Data down to the third level of detail is shown in Figure 2.2. The second level includes the two key types of data – internal and external. The third level of detail in the mind-map captures the necessary topics related to the internal and external data. All other key topics are represented in a similar way (not shown in Figure 2.2). The different levels of detail are selected by collapsing or expanding the corresponding blocks or the whole mind-map.
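
      The three-level structure described above can be sketched as a small tree whose branches are shown or hidden by depth. This is only an illustration, not the mind-mapping tool used by the authors: the class name Topic and the function outline are hypothetical, and the third-level entries under Data are placeholders, since the actual contents of Figure 2.2 are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One block of the mind-map; children are the branches radiating from it."""
    name: str
    children: list = field(default_factory=list)


def outline(topic, max_depth, depth=0):
    """Return indented text lines, collapsing every block below max_depth."""
    lines = ["  " * depth + topic.name]
    if depth < max_depth:
        for child in topic.children:
            lines.extend(outline(child, max_depth, depth + 1))
    return lines


# Level 1 topics are taken from the text; the third-level entries under
# "Data" are illustrative placeholders, not the contents of Figure 2.2.
mind_map = Topic("Product A Price Forecasting", [
    Topic("Data", [
        Topic("Internal data", [Topic("(placeholder) internal topic")]),
        Topic("External data", [Topic("(placeholder) external topic")]),
    ]),
    Topic("Competitors"),
    Topic("Potential drivers"),
    Topic("Business structure"),
    Topic("Current price decision-making process"),
    Topic("Potential users"),
])

first_level = outline(mind_map, max_depth=1)  # key topics only
full_view = outline(mind_map, max_depth=3)    # expanded to the third level

print("\n".join(full_view))
```

      Calling outline with a smaller max_depth plays the role of collapsing blocks, which keeps the view readable when a topic has many levels of detail.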


      Project definition deliverables

      The deliverables in this step are: (1) project charter, (2) team composition, and (3) approved funding. The most important deliverable in project definition is the charter. It is a critical document which in many cases defines the fate of the project. Writing a good charter is an iterative process which includes gradually reducing uncertainty related to objectives, deliverables, and available data. The common rule of thumb is this: the less fuzzy the objectives and the more specific the language, the higher the probability of success. An example of the structure of this document in the case of the Product A forecasting project is given in the Appendix at the end of this chapter.

      The ideal team composition is shown in the corresponding charter section in the Appendix. In the case of some specific work processes, such as Six Sigma, the roles and responsibilities are well defined in generic categories like green belts, black belts, master black belts, and so on.

      The most important practical deliverable in the project definition step is committed financial support for the project, since this is when the real project work begins. No funding, no forecasting. It is as simple as that.

      Data preparation steps

      Data preparation includes all necessary procedures to explore, clean, and preprocess the previously extracted data in order to begin model development with the maximum possible information content in the data.³ In reality, data preparation is time-consuming, nontrivial, and difficult to automate. Very often it is also the most expensive phase of applied forecasting in terms of time, effort, and cost. External data might need to be purchased, which can be a significant part of the project cost. The key data preparation substeps and deliverables are discussed briefly below. The detailed description of this step is given in Chapters 5 and 6.

      Data collection

      The initial data collection is commonly driven by the data structure recommended by the subject matter experts in the system structure and data identification step. Data collection includes identifying the internal and external data sources, downloading the data, and then harmonizing the data in a consistent time series database format.

      In the case of the example for Product A price forecasting, data collection includes the following specific actions:

      • identifying the data mart that stores the internal data

      • identifying the specific services and tags of the external time series available in Global Insights (GI), Chemical Market Associates, Inc. (CMAI), Bloomberg, and so on

      • collecting the internal data, which is generally done by the business data SMEs

      • collecting the external data, which is done using local GI or CMAI service experts

      • harmonizing the collected internal and external data into a consistent time series database at the prescribed time interval
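
      The harmonizing action in the list above can be sketched as follows, assuming a monthly target interval. The function names to_monthly and harmonize, the series names, and all values are hypothetical and serve only as an illustration; a real project would read the internal data from the data mart and the external series from the vendor services.

```python
from collections import defaultdict
from datetime import date


def to_monthly(observations, how="mean"):
    """Accumulate (date, value) pairs to one value per (year, month)."""
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[(day.year, day.month)].append(value)
    if how == "mean":
        return {period: sum(v) / len(v) for period, v in buckets.items()}
    if how == "sum":
        return {period: sum(v) for period, v in buckets.items()}
    raise ValueError(f"unsupported accumulation method: {how}")


def harmonize(series_by_name):
    """Align monthly series on the union of their periods; gaps become None."""
    periods = sorted(set().union(*(s.keys() for s in series_by_name.values())))
    return [
        {"period": p, **{name: s.get(p) for name, s in series_by_name.items()}}
        for p in periods
    ]


# Made-up data: an irregular internal price series and a sparse external index.
internal_price = [(date(2020, 1, 10), 100.0), (date(2020, 1, 20), 110.0),
                  (date(2020, 2, 5), 120.0)]
external_index = [(date(2020, 1, 31), 50.0), (date(2020, 3, 31), 52.0)]

table = harmonize({
    "internal_price": to_monthly(internal_price),
    "external_index": to_monthly(external_index),
})
for row in table:
    print(row)
```

      Periods in which a series has no observation are kept as None rather than dropped, so the gaps remain visible for the preprocessing step.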

      Data preprocessing

      The common methods for improving the information content of the raw data (which very often are messy) include: imputation of missing data, accumulation, aggregation, outlier detection, transformations, expanding or contracting, and so on. All of these techniques are discussed in separate sections in Chapter 6.
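
      Two of the methods named above, imputation of missing data and outlier detection, can be sketched in a few lines. The specific choices here are assumptions made for illustration and are not taken from the book: linear interpolation for imputation, and a modified z-score based on the median absolute deviation (with the conventional 3.5 cutoff) for flagging outliers. The function names and the sample series are hypothetical.

```python
from statistics import median


def impute_linear(values):
    """Fill None gaps by linear interpolation between the nearest known points.

    Leading or trailing gaps copy the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def flag_outliers(values, threshold=3.5):
    """Flag points whose MAD-based modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing can be flagged
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Made-up raw series with two missing values and one suspicious spike.
raw = [10.0, None, 12.0, 11.0, 95.0, 12.0, None, 13.0]
filled = impute_linear(raw)
flags = flag_outliers(filled)
print(filled)
print(flags)
```

      A flagged point is only a candidate for review; whether it is an error or a real event is a question for the subject matter experts.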

      Data preparation deliverables

      The key deliverable in this step is a clean data set with combined and aligned targeted variables (Ys) and potential drivers (Xs) based on preprocessed internal and external data.

      Of equal importance to the preprocessed data set is a document that describes the details of the data preparation along with the scripts to collect, clean and harmonize the data.

      Variable reduction/selection steps

      The objective of this block of the work process is to reduce the number of potential economic drivers for the dependent variable by various data mining methods. The data reduction process is done in two key substeps: (1) variable reduction and (2) variable selection in static transactional data. The main difference between the two substeps is the relation of the potential drivers or independent variables (Xs) to the targeted or dependent variables (Ys). In the case of variable reduction, the focus is on the similarity between the independent variables, not on their association with the dependent variable. The idea is that some of the Xs are highly related to one another, so removing the redundant variables reduces data dimensionality. In the case of variable selection, the independent variables are chosen based on their statistical significance or similarity with the dependent variables. The details of the methods for variable reduction and selection are presented in Chapter

