Computational Statistics in Data Science. Group of authors.
Semi‐supervised learning on streaming data can return results similar to those of the supervised approach. However, two observations apply to semi‐supervised learning on streaming data: (i) considerably more objects ought to be labeled to balance out the classifiers, and (ii) a larger threshold adversely impacts the strength of the classifiers, increasing the standard deviation [19]. Semi‐supervised learning techniques for data streams include ensemble techniques, graph‐based methods, deep learning, active learning, and linear neighborhood propagation.
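The threshold effect in observation (ii) can be illustrated with a minimal self‐training sketch: a toy nearest‐class‐mean classifier absorbs an unlabeled instance into its model only when the prediction confidence exceeds a threshold, so a larger threshold self‐labels fewer objects. The classifier and its confidence measure are illustrative choices for this sketch, not a technique from the cited work.

```python
from collections import defaultdict

class SelfTrainingStream:
    """Toy threshold-based self-training over a stream of 1-D features.

    A nearest-class-mean classifier; an unlabeled instance is absorbed
    only when the confidence (gap between the distances to the best and
    second-best class means) reaches `threshold`. Raising the threshold
    means fewer objects get self-labeled.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.sums = defaultdict(float)   # running feature sum per class
        self.counts = defaultdict(int)   # instance count per class
        self.self_labeled = 0            # how many objects were self-labeled

    def _mean(self, label):
        return self.sums[label] / self.counts[label]

    def fit_labeled(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1

    def predict(self, x):
        # Return (label, confidence); needs at least two seen classes.
        dists = sorted((abs(x - self._mean(c)), c) for c in self.counts)
        (best_d, best_c), (second_d, _) = dists[0], dists[1]
        return best_c, second_d - best_d

    def observe_unlabeled(self, x):
        label, conf = self.predict(x)
        if conf >= self.threshold:       # confident enough: self-label
            self.fit_labeled(x, label)
            self.self_labeled += 1
        return label
```

Seeding the model with a few labeled points and streaming unlabeled ones shows the trade‐off: the same stream self‐labels fewer objects as the threshold grows.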
6.3 Supervised Learning
Supervised learning is the type of machine learning that infers a function from labeled training data. Each training example is a pair of an input (vector) and an output (the supervisory signal). Let the data stream be S = {…, dt − 1, dt, dt + 1, …}, where dt = (xi, yi), xi is the value set of the ith datum over the attributes, and yi is the class of the instance. Data stream classification aims to train a classifier f : x → y that establishes a mapping between feature vectors and class labels [92].
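The mapping f : x → y in a streaming setting can be sketched with a toy incremental naive Bayes over nominal attributes, evaluated with the standard test‐then‐train (prequential) protocol: each element dt = (xi, yi) is seen once, used first for testing and then for updating the counts. This is an illustration of the general idea only, not one of the systems cited below.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Toy incremental classifier f : x -> y for a data stream.

    Each element d_t = (x_i, y_i) arrives once; the model updates its
    counts and never revisits past data. Attributes are nominal.
    """

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(int)  # keyed by (class, attr_index, value)
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for j, v in enumerate(x):
            self.attr_counts[(y, j, v)] += 1

    def predict_one(self, x):
        best, best_score = None, -1.0
        for y, cc in self.class_counts.items():
            score = cc / self.n  # prior P(y)
            for j, v in enumerate(x):
                # Laplace-smoothed P(x_j = v | y) for binary-valued attributes
                score *= (self.attr_counts[(y, j, v)] + 1) / (cc + 2)
            if score > best_score:
                best, best_score = y, score
        return best

# Prequential (test-then-train) loop over a tiny synthetic stream.
stream = [((1, 0), "pos"), ((1, 1), "pos"), ((0, 0), "neg"), ((1, 0), "pos")]
model = IncrementalNaiveBayes()
correct = 0
for x, y in stream:
    if model.n and model.predict_one(x) == y:  # test first ...
        correct += 1
    model.learn_one(x, y)                      # ... then train
```

The prequential loop is the usual way to evaluate stream classifiers, since there is no separate held‐out set: every instance serves as a test case before it becomes training data.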
Supervised learning approaches can be subdivided into two major categories, regression and classification. When the class attribute is continuous, the task is called regression; when the class attribute is discrete, it is referred to as classification. Manual labeling is difficult, time‐consuming, and can be very costly [93]. In a streaming scenario with high velocity and volume, labeled data are very scarce, leading to poorly trained classifiers as a result of the constrained amount of labeled data accessible for building the models [94].
Some of the supervised learning algorithms for the streaming scenario, grouped as presented in [95], are: (i) tree‐based algorithms: OLIN, Ultra‐Fast Forest Tree system (UFFT), Very Fast Decision Tree learner (VFDT), VFDTc, Random Forest, Vertical Hoeffding Tree, and Concept‐adapting Evolutionary Algorithm for Decision Tree (CEVOT); (ii) rule‐based algorithms: On‐demand classifier, Fuzzy Passive‐aggressive classification, Similarity‐based data stream classification (SimC), Prequential area under curve (AUC)‐based classifier, One‐class classifier with incremental learning and forgetting, and Classifying recurring concepts using a fuzzy similarity function; (iii) ensemble‐based algorithms: Streaming ensemble algorithm, Weighted classifier ensemble, and Distance‐based ensemble online classifier with kernel clustering; (iv) nearest‐neighbor algorithms: Adaptive nearest neighbor classification algorithm and Anytime nearest neighbor algorithm; (v) statistical algorithms: Evolving Naïve Bayes; (vi) deep learning: Activity recognition [96].
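Several of the tree‐based learners above (VFDT, VFDTc, Vertical Hoeffding Tree) rest on the Hoeffding bound, which tells the learner how many examples suffice before committing to a split decision. A minimal sketch of the bound:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    With probability 1 - delta, the true mean of a random variable with
    range `value_range` lies within epsilon of the mean observed over n
    examples. A VFDT-style learner splits a node once the observed gain
    gap between the two best attributes exceeds epsilon, so the bound
    shrinks (and splits become safer) as n grows.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

For example, with R = 1 and δ = 0.05, the bound after 1000 examples is roughly 0.039, and it keeps tightening as more of the stream is seen.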
6.4 Ontology‐Based Methods
Performing streaming data analysis over ontologies and linked open data is a challenging and emerging research area. Semantic web technology, an extension of the World Wide Web, is used to improve the interoperability of heterogeneous sources with a data model called the Resource Description Framework (RDF) and ontological languages such as the Web Ontology Language (OWL). Work applying ontologies or linked open data to data streams includes [97–99]. Due to the dynamic nature of data streams, current solutions for reasoning over this data model and these ontological languages are not suited to the streaming context. This gap brought about what is referred to as stream reasoning: the set of inference approaches and deduction mechanisms concerned with providing continuous inference over a data stream, leading to better decision support systems [100]. Stream reasoning has been applied in remote health monitoring [101], smart cities [102], semantic analysis of social media [103], and maritime safety and security [104], amongst others. Another attempt to improve semantic web ontology support is to lift existing streams to RDF streams using intuitive configuration mechanisms. Techniques for RDF stream modeling include the Semantic Sensor Network (SSN) ontology [105], the Stream Annotation Ontology (SAO) [106], the Smart Appliances REFerence (SAREF) ontology [107], and the Linked Stream Annotation Engine (LSane) [108].
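As a rough illustration of what "lifting" a raw stream to an RDF stream means, the sketch below maps one (sensor, timestamp, value) tuple to a set of subject–predicate–object triples. The URIs and the SOSA/SSN‐style property names are illustrative placeholders only, not the exact terms defined by the cited ontologies.

```python
def lift_to_rdf(reading, base="http://example.org/stream/"):
    """Lift one raw stream tuple to RDF-style triples (sketch).

    `reading` is a (sensor_id, timestamp, value) tuple. Each triple is
    represented as a plain (subject, predicate, object) string tuple;
    a real pipeline would serialize these with an RDF library.
    """
    sensor_id, timestamp, value = reading
    obs = f"{base}obs/{sensor_id}/{timestamp}"   # URI for this observation
    return [
        (obs, "rdf:type", "sosa:Observation"),
        (obs, "sosa:madeBySensor", f"{base}sensor/{sensor_id}"),
        (obs, "sosa:resultTime", str(timestamp)),
        (obs, "sosa:hasSimpleResult", str(value)),
    ]
```

Once lifted, such triples can be fed to a stream reasoner, which applies continuous inference over the arriving RDF stream rather than over a static graph.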
7 Strategies for Processing Data Streams
Data stream processing includes techniques, models, and systems for processing data as soon as they arrive, detecting trends and patterns with low latency [109]. It requires two resources, storage capacity and computational power, in the face of the unbounded generation of high‐velocity data with a brief life span. To cope with these requirements, approximate computing, which targets low latency at the expense of an acceptable loss in quality, has been a practical solution [110]. The idea behind approximate computing is to return an approximate answer instead of the exact answer to user queries, by operating on a representative sample of the data instead of the whole data set [111]. The two main techniques for approximate computing are (i) sampling [4], which constructs data stream summaries by probabilistic selection, and (ii) sketches [112], which compress data using data structures (such as histograms or hash tables), prediction‐based methods (such as Bayesian inference), and transformation‐based methods (such as wavelets).
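Sampling‐based approximation can be sketched with classic reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of fixed size k over a stream of unknown length in O(k) memory; queries are then answered against the reservoir instead of the full stream.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Reservoir sampling: keep a uniform random sample of k items
    from a stream of unknown length, using O(k) memory.

    After n items have passed, every item has probability k/n of being
    in the reservoir, so the reservoir is a representative sample that
    can answer queries approximately without storing the whole stream.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # replace with probability k/(i+1)
    return reservoir
```

A sketch such as a histogram or hash‐table summary plays the complementary role: instead of keeping raw sampled tuples, it keeps a compressed digest from which approximate answers are computed.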
The fixed window and the sliding window are two computation models for partitioning the data stream. A fixed window partitions the data stream into nonoverlapping time segments; the current data are removed after processing, resetting the window size back to zero. A sliding window contains a historical snapshot of the data stream at any point in time: when newly arriving data are at variance with the current window elements, the tuples are updated by discarding the oldest data [5]. The sliding window can be further subdivided into the count‐based window and the time‐based window. In the count‐based window, the progressive step is expressed in tuple counts, while in the time‐based window, items with the oldest timestamp are replaced with items with the latest timestamp [113].
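The two sliding‐window variants can be sketched in a few lines; the window sizes and eviction rules below are illustrative choices, not prescribed by the cited work.

```python
from collections import deque

class CountBasedWindow:
    """Count-based sliding window: keeps the k most recent tuples,
    discarding the oldest tuple each time a new one arrives."""

    def __init__(self, k):
        self.buf = deque(maxlen=k)  # deque evicts the oldest automatically

    def insert(self, item):
        self.buf.append(item)

    def contents(self):
        return list(self.buf)

class TimeBasedWindow:
    """Time-based sliding window of width `span` time units: tuples whose
    timestamps fall behind (latest - span) are evicted on each insert.
    Timestamps are assumed to arrive in nondecreasing order."""

    def __init__(self, span):
        self.span = span
        self.buf = deque()  # (timestamp, item) pairs

    def insert(self, timestamp, item):
        self.buf.append((timestamp, item))
        while self.buf and self.buf[0][0] <= timestamp - self.span:
            self.buf.popleft()  # drop items older than the window span

    def contents(self):
        return [item for _, item in self.buf]
```

The count‐based window advances one tuple at a time, while the time‐based window may evict zero or several tuples per arrival, depending on how far the latest timestamp has advanced.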