
are called source streams, while the output streams are called derived streams [17]. In a streaming analytics system, applications are expressed as continuous queries; data are continuously ingested, analyzed, and correlated to produce results in a streaming fashion. Streaming analytics frameworks must be able to recognize new data, build models incrementally, and detect deviations from model predictions [18].

      One of the challenges of data stream mining is concept drift, the phenomenon by which the statistical properties of a data stream change over time [19]. The presence of concept drift alters the fundamental characteristics that the learning system seeks to uncover, so the classifier's results degrade as the change progresses [20].


      Three standard solutions for addressing concept drift are (i) detecting changes and retraining the classifier only when the degree of change is significantly high, (ii) retraining the classification model on the arrival of every new chunk or instance, and (iii) using adaptive learning methods. Option (ii), however, is practically infeasible because of its computational cost. The four main approaches for addressing concept drift are (i) concept drift detectors [22], (ii) sliding windows [23], (iii) online learners [24], and (iv) ensemble learners [25]; a minimal drift‐detector sketch follows below. Other challenges of data streams are briefly highlighted in the following subsections.
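      As an illustration of approach (i), the following Python sketch implements a simple error‐rate drift detector in the spirit of the DDM family of detectors [22]. The class name, warning and drift thresholds, and minimum sample count are illustrative assumptions, not details from this chapter.

import math

class DriftDetector:
    # Minimal DDM-style detector: track the classifier's running error
    # rate and flag drift when it rises well above its historical minimum.
    def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
        self.warn_level = warn_level      # std. devs. above minimum for a warning
        self.drift_level = drift_level    # std. devs. above minimum for a drift
        self.min_samples = min_samples
        self._reset()

    def _reset(self):
        self.n = 0                        # instances seen since the last drift
        self.p = 1.0                      # running error rate
        self.s = 0.0                      # std. dev. of the error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        # 'error' is 1 if the classifier misclassified the instance, else 0.
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift_level * self.s_min:
            self._reset()                 # drift confirmed: retrain from scratch
            return "drift"
        if self.p + self.s > self.p_min + self.warn_level * self.s_min:
            return "warning"              # start buffering recent instances
        return "stable"

      On a "drift" signal, the classifier is retrained on recent instances only, which realizes solution (i) above while avoiding the cost of retraining on every arrival.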

      3.1 Scalability

      3.2 Integration

      Building a distributed framework in which every node holds a view of the data stream flow implies that each node is responsible for performing analysis over only a few sources. Aggregating these partial views into a complete view is nontrivial. This calls for integration techniques that can perform efficient operations across disparate datasets [27]; one common building block is sketched below.
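      One widely used building block for such integration is the mergeable summary: each node maintains partial aggregates that can be combined associatively into a global view. The Python sketch below, with a hypothetical record layout, illustrates the idea for count, sum, and max.

def merge_views(views):
    # Combine per-node partial aggregates into one global view.
    total = {"count": 0, "sum": 0.0, "max": float("-inf")}
    for v in views:
        total["count"] += v["count"]
        total["sum"] += v["sum"]
        total["max"] = max(total["max"], v["max"])
    total["mean"] = total["sum"] / total["count"] if total["count"] else 0.0
    return total

# Each node ships only its local summary, never its raw stream.
node_a = {"count": 120, "sum": 540.0, "max": 9.7}
node_b = {"count": 80, "sum": 410.0, "max": 12.1}
global_view = merge_views([node_a, node_b])   # count=200, mean=4.75, max=12.1

      Counts, sums, and extrema merge exactly; statistics such as quantiles or distinct counts require sketch data structures that are specifically designed to be mergeable.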

      3.3 Fault‐Tolerance

      For life‐critical systems, high fault tolerance is required. In streaming computing environments, where unbounded data are generated in real time, a highly fault‐tolerant and scalable system is required so that an application can keep working without interruption despite component failures. The most widely used fault‐tolerance mechanism is checkpointing, in which the framework's state is periodically persisted so that the computational state can be recovered after a system failure; a basic version of this pattern is sketched below. However, the overhead incurred by checkpointing can negatively affect system performance, and improved checkpointing schemes that minimize this overhead were proposed in [28, 29].
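      The following Python sketch shows the basic checkpointing pattern: operator state is persisted every fixed number of records and restored after a crash. The file path, interval, and state layout are illustrative assumptions.

import os
import pickle

CHECKPOINT_PATH = "operator_state.ckpt"   # hypothetical location
CHECKPOINT_EVERY = 1000                   # records between checkpoints

def restore_state():
    # After a failure, resume from the last persisted state, if any.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"processed": 0, "counts": {}}

def process_stream(records):
    state = restore_state()
    for rec in records:
        state["counts"][rec] = state["counts"].get(rec, 0) + 1
        state["processed"] += 1
        if state["processed"] % CHECKPOINT_EVERY == 0:
            # Write to a temporary file, then rename atomically, so a
            # crash mid-write never leaves a corrupt checkpoint behind.
            with open(CHECKPOINT_PATH + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(CHECKPOINT_PATH + ".tmp", CHECKPOINT_PATH)
    return state

      The trade‐off discussed above is visible here: a smaller CHECKPOINT_EVERY loses less work on failure but pays the persistence cost more often.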

      3.4 Timeliness

      Time is essential for time‐sensitive processes, which include foiling fraud, mitigating security threats, and responding to natural disasters. Such architectures or platforms must be scalable to enable consistent handling of data streams [30]. The fundamental challenge lies in implementing a distributed architecture for data aggregation with negligible latency between the communicating nodes.

      3.5 Consistency

      Achieving high consistency or stability in data stream computing environments is nontrivial, as it is hard to determine which data are required and which nodes ought to be consistent [31, 32]. Thus, a well‐designed framework is required.

      3.6 Heterogeneity and Incompleteness

      Data streams are heterogeneous in structure, semantics, organization, granularity, and accessibility. Diverse data from disparate sources, in different formats, combined with the sheer volume of data, make integration, retrieval, and reasoning over data streams a challenging task [33]. The challenge is how to cope with ever‐growing data and how to extract, aggregate, and correlate data streams from numerous sources in real time. A competent data representation is needed to mirror the structure, hierarchy, and diversity of data streams; a small normalization sketch follows below.
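      As a small illustration of such a representation, the sketch below maps records from two hypothetical sources, a JSON API and a CSV feed, onto a single common schema while tolerating an incomplete record.

import json

def normalize(record, source):
    # Map a raw record from a named source into a common
    # (ts, id, value) layout; a missing reading becomes None
    # instead of breaking the pipeline.
    if source == "json_api":
        obj = json.loads(record)
        value = obj.get("reading")
        return {"ts": obj["timestamp"], "id": obj["sensor"],
                "value": float(value) if value is not None else None}
    if source == "csv_feed":
        ts, sensor_id, value = record.split(",")
        return {"ts": ts, "id": sensor_id,
                "value": float(value) if value else None}
    raise ValueError(f"unknown source: {source}")

print(normalize('{"timestamp": "t1", "sensor": "s7", "reading": 3.2}', "json_api"))
print(normalize("t2,s9,", "csv_feed"))   # incomplete record -> value is None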

      3.7 Load Balancing

      3.8 High Throughput

      Deciding which portion of the data stream needs replication, how many replicas are required, and which part of the data stream to assign to each replica is an issue in data stream computing environments. Proper replication across multiple instances is required if high throughput is to be achieved [35]; a simple partition‐to‐replica assignment is sketched below.
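      A minimal sketch of this decision is given below, assuming hash partitioning over a fixed worker pool; the worker names and replication factor are hypothetical.

import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3"]
REPLICATION_FACTOR = 2   # how many replicas each key's partition gets

def replicas_for(key, workers=WORKERS, r=REPLICATION_FACTOR):
    # Hash the key to a primary worker, then take the r-1 successors
    # on the ring so that replicas land on distinct workers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(workers)
    return [workers[(start + i) % len(workers)] for i in range(r)]

print(replicas_for("sensor-42"))   # e.g., ['worker-1', 'worker-2']

      Any replica can then serve reads for its partition, which raises throughput at the price of extra state that must be kept consistent across replicas.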

      3.9 Privacy

      Data stream analytics opens the door to real‐time analysis of massive amounts of data, but it also poses a colossal threat to individual privacy [36]. According to the International Data Corporation (IDC), half of the aggregate data that needs protection is not adequately protected. Relevant and efficient privacy‐preserving solutions for interpretation, observation, evaluation, and decision making in data stream mining need to be designed [37]. The sensitive nature of the data necessitates privacy‐preserving techniques; one of the leading techniques is perturbation [38], sketched below.
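      The sketch below perturbs each numeric value with Laplace noise, the mechanism underlying differential privacy; the sensitivity and epsilon parameters are illustrative assumptions.

import random

def perturb(value, sensitivity=1.0, epsilon=0.5):
    # Add Laplace(0, sensitivity/epsilon) noise: a smaller epsilon means
    # stronger privacy but a noisier, less useful output.
    scale = sensitivity / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return value + noise

readings = [10.2, 11.5, 9.8]
private_readings = [perturb(x) for x in readings]   # released instead of raw values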

      3.10 Accuracy

      Developing efficient methods that can accurately predict future observations is one of the leading goals of data stream analysis.
