
are called source streams, while the output streams are called derived streams [17]. In a streaming analytics system, applications are expressed as continuous queries; data are continuously ingested, analyzed, and correlated to produce results in a streaming fashion. Streaming analytics frameworks must be able to recognize new data, build models incrementally, and detect deviations from model predictions [18].

      One of the challenges of data stream mining is concept drift, the phenomenon by which the statistical properties of a data stream change over time [19]. The presence of concept drift alters the fundamental characteristics that the learning system seeks to uncover, so the classifier's results degrade as the change progresses [20].


      Three standard solutions for addressing concept drift are (i) detecting changes and retraining the classifier only when the degree of change is significantly high, (ii) retraining the classification model on the arrival of every new chunk or instance, and (iii) using adaptive learning methods. Option (ii), however, is practically infeasible because of its computational cost. The four main approaches for addressing concept drift are (i) concept drift detectors [22], (ii) sliding windows [23], (iii) online learners [24], and (iv) ensemble learners [25]; a minimal drift‐detector sketch follows below. Other challenges of data streams are briefly highlighted in the following subsections.
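      As an illustration of approach (i), the following Python sketch implements a simple error‐rate drift detector in the spirit of the DDM family of detectors [22]. The class name, warning and drift thresholds, and minimum sample count are illustrative assumptions, not details from this chapter.

import math

class DriftDetector:
    # Minimal DDM-style detector: track the classifier's running error
    # rate and flag drift when it rises well above its historical minimum.
    def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
        self.warn_level = warn_level      # std. devs. above minimum for a warning
        self.drift_level = drift_level    # std. devs. above minimum for a drift
        self.min_samples = min_samples
        self._reset()

    def _reset(self):
        self.n = 0                        # instances seen since the last drift
        self.p = 1.0                      # running error rate
        self.s = 0.0                      # std. dev. of the error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        # 'error' is 1 if the classifier misclassified the instance, else 0.
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift_level * self.s_min:
            self._reset()                 # drift confirmed: retrain from scratch
            return "drift"
        if self.p + self.s > self.p_min + self.warn_level * self.s_min:
            return "warning"              # start buffering recent instances
        return "stable"

      On a "drift" signal, the classifier is retrained on recent instances only, which realizes solution (i) above while avoiding the cost of retraining on every arrival.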

      3.1 Scalability

      3.2 Integration

      Building a distributed framework in which every node holds a view of the data stream flow implies that each node is responsible for performing analysis over only a few sources. Aggregating these partial views into a complete view is nontrivial. This calls for integration techniques that can perform efficient operations across disparate datasets [27]; one common building block is sketched below.
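      One widely used building block for such integration is the mergeable summary: each node maintains partial aggregates that can be combined associatively into a global view. The Python sketch below, with a hypothetical record layout, illustrates the idea for count, sum, and max.

def merge_views(views):
    # Combine per-node partial aggregates into one global view.
    total = {"count": 0, "sum": 0.0, "max": float("-inf")}
    for v in views:
        total["count"] += v["count"]
        total["sum"] += v["sum"]
        total["max"] = max(total["max"], v["max"])
    total["mean"] = total["sum"] / total["count"] if total["count"] else 0.0
    return total

# Each node ships only its local summary, never its raw stream.
node_a = {"count": 120, "sum": 540.0, "max": 9.7}
node_b = {"count": 80, "sum": 410.0, "max": 12.1}
global_view = merge_views([node_a, node_b])   # count=200, mean=4.75, max=12.1

      Counts, sums, and extrema merge exactly; statistics such as quantiles or distinct counts require sketch data structures that are specifically designed to be mergeable.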

      3.3 Fault‐Tolerance

      For life‐critical systems, high fault tolerance is required. In streaming computing environments, where unbounded data are generated in real time, a highly fault‐tolerant and scalable system is required so that an application can keep working without interruption despite component failures. The most widely used fault‐tolerance mechanism is checkpointing, in which the framework's state is periodically persisted so that the computational state can be recovered after a system failure; a basic version of this pattern is sketched below. However, the overhead incurred by checkpointing can negatively affect system performance, and improved checkpointing schemes that minimize this overhead were proposed in [28, 29].
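      The following Python sketch shows the basic checkpointing pattern: operator state is persisted every fixed number of records and restored after a crash. The file path, interval, and state layout are illustrative assumptions.

import os
import pickle

CHECKPOINT_PATH = "operator_state.ckpt"   # hypothetical location
CHECKPOINT_EVERY = 1000                   # records between checkpoints

def restore_state():
    # After a failure, resume from the last persisted state, if any.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"processed": 0, "counts": {}}

def process_stream(records):
    state = restore_state()
    for rec in records:
        state["counts"][rec] = state["counts"].get(rec, 0) + 1
        state["processed"] += 1
        if state["processed"] % CHECKPOINT_EVERY == 0:
            # Write to a temporary file, then rename atomically, so a
            # crash mid-write never leaves a corrupt checkpoint behind.
            with open(CHECKPOINT_PATH + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(CHECKPOINT_PATH + ".tmp", CHECKPOINT_PATH)
    return state

      The trade‐off discussed above is visible here: a smaller CHECKPOINT_EVERY loses less work on failure but pays the persistence cost more often.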

      3.4 Timeliness

      Time is essential for time‐sensitive processes, which include foiling fraud, mitigating security threats, and responding to natural disasters. Such architectures or platforms must be scalable to enable consistent handling of data streams [30]. The fundamental challenge lies in implementing a distributed architecture for data aggregation with negligible latency between the communicating nodes.

      3.5 Consistency

      Achieving high consistency or stability in data stream computing environments is nontrivial, as it is hard to determine which data are required and which nodes ought to be consistent [31, 32]. Thus, a well‐designed framework is required.

      3.6 Heterogeneity and Incompleteness

      Data streams are heterogeneous in structure, semantics, organization, granularity, and accessibility. Diverse data from disparate sources, in different formats, combined with the sheer volume of data, make integration, retrieval, and reasoning over data streams a challenging task [33]. The challenge is how to cope with ever‐growing data and how to extract, aggregate, and correlate data streams from numerous sources in real time. A competent data representation is needed to mirror the structure, hierarchy, and diversity of data streams; a small normalization sketch follows below.
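      As a small illustration of such a representation, the sketch below maps records from two hypothetical sources, a JSON API and a CSV feed, onto a single common schema while tolerating an incomplete record.

import json

def normalize(record, source):
    # Map a raw record from a named source into a common
    # (ts, id, value) layout; a missing reading becomes None
    # instead of breaking the pipeline.
    if source == "json_api":
        obj = json.loads(record)
        value = obj.get("reading")
        return {"ts": obj["timestamp"], "id": obj["sensor"],
                "value": float(value) if value is not None else None}
    if source == "csv_feed":
        ts, sensor_id, value = record.split(",")
        return {"ts": ts, "id": sensor_id,
                "value": float(value) if value else None}
    raise ValueError(f"unknown source: {source}")

print(normalize('{"timestamp": "t1", "sensor": "s7", "reading": 3.2}', "json_api"))
print(normalize("t2,s9,", "csv_feed"))   # incomplete record -> value is None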

      3.7 Load Balancing

      3.8 High Throughput

      Deciding which portion of the data stream needs replication, how many replicas are required, and which part of the data stream to assign to each replica is an issue in data stream computing environments. Proper replication across multiple instances is required if high throughput is to be achieved [35]; a simple partition‐to‐replica assignment is sketched below.
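      A minimal sketch of this decision is given below, assuming hash partitioning over a fixed worker pool; the worker names and replication factor are hypothetical.

import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3"]
REPLICATION_FACTOR = 2   # how many replicas each key's partition gets

def replicas_for(key, workers=WORKERS, r=REPLICATION_FACTOR):
    # Hash the key to a primary worker, then take the r-1 successors
    # on the ring so that replicas land on distinct workers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(workers)
    return [workers[(start + i) % len(workers)] for i in range(r)]

print(replicas_for("sensor-42"))   # e.g., ['worker-1', 'worker-2']

      Any replica can then serve reads for its partition, which raises throughput at the price of extra state that must be kept consistent across replicas.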

      3.9 Privacy

      Data stream analytics opens the door to real‐time analysis of massive amounts of data, but it also poses a colossal threat to individual privacy [36]. According to the International Data Corporation (IDC), half of the aggregate data that needs protection is not adequately protected. Relevant and efficient privacy‐preserving solutions for interpretation, observation, evaluation, and decision making in data stream mining need to be designed [37]. The sensitive nature of the data necessitates privacy‐preserving techniques; one of the leading techniques is perturbation [38], sketched below.
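      The sketch below perturbs each numeric value with Laplace noise, the mechanism underlying differential privacy; the sensitivity and epsilon parameters are illustrative assumptions.

import random

def perturb(value, sensitivity=1.0, epsilon=0.5):
    # Add Laplace(0, sensitivity/epsilon) noise: a smaller epsilon means
    # stronger privacy but a noisier, less useful output.
    scale = sensitivity / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return value + noise

readings = [10.2, 11.5, 9.8]
private_readings = [perturb(x) for x in readings]   # released instead of raw values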

      3.10 Accuracy

      Developing efficient methods that can accurately predict future observations is one of the leading goals of data stream analysis.
