Computational Statistics in Data Science. Group of authors.
Semi‐supervised learning on streaming data can return results similar to those of the supervised approach. However, two observations apply to semi‐supervised learning on streaming data: (i) considerably more objects ought to be labeled to balance out the classifiers, and (ii) a larger threshold adversely impacts the strength of the classifiers, increasing the standard deviation [19]. Semi‐supervised learning techniques for data streams include ensemble techniques, graph‐based methods, deep learning, active learning, and linear neighborhood propagation.
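The threshold effect in observation (ii) can be illustrated with a minimal self‐training sketch: a toy nearest‐class‐mean classifier absorbs an unlabeled instance into its model only when the prediction confidence exceeds a threshold, so a larger threshold self‐labels fewer objects. The classifier and its confidence measure are illustrative choices for this sketch, not a technique from the cited work.

```python
from collections import defaultdict

class SelfTrainingStream:
    """Toy threshold-based self-training over a stream of 1-D features.

    A nearest-class-mean classifier; an unlabeled instance is absorbed
    only when the confidence (gap between the distances to the best and
    second-best class means) reaches `threshold`. Raising the threshold
    means fewer objects get self-labeled.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.sums = defaultdict(float)   # running feature sum per class
        self.counts = defaultdict(int)   # instance count per class
        self.self_labeled = 0            # how many objects were self-labeled

    def _mean(self, label):
        return self.sums[label] / self.counts[label]

    def fit_labeled(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1

    def predict(self, x):
        # Return (label, confidence); needs at least two seen classes.
        dists = sorted((abs(x - self._mean(c)), c) for c in self.counts)
        (best_d, best_c), (second_d, _) = dists[0], dists[1]
        return best_c, second_d - best_d

    def observe_unlabeled(self, x):
        label, conf = self.predict(x)
        if conf >= self.threshold:       # confident enough: self-label
            self.fit_labeled(x, label)
            self.self_labeled += 1
        return label
```

Seeding the model with a few labeled points and streaming unlabeled ones shows the trade‐off: the same stream self‐labels fewer objects as the threshold grows.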
6.3 Supervised Learning
Supervised learning is the type of machine learning that infers a function from labeled training data. Each training example is a pair of an input (vector) and an output (the supervisory signal). Let the data stream be S = {…, dt − 1, dt, dt + 1, …}, where dt = (xi, yi), xi is the value set of the ith datum over the attributes, and yi is the class of the instance. Data stream classification aims to train a classifier f : x → y that establishes a mapping between feature vectors and class labels [92].
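The mapping f : x → y in a streaming setting can be sketched with a toy incremental naive Bayes over nominal attributes, evaluated with the standard test‐then‐train (prequential) protocol: each element dt = (xi, yi) is seen once, used first for testing and then for updating the counts. This is an illustration of the general idea only, not one of the systems cited below.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Toy incremental classifier f : x -> y for a data stream.

    Each element d_t = (x_i, y_i) arrives once; the model updates its
    counts and never revisits past data. Attributes are nominal.
    """

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(int)  # keyed by (class, attr_index, value)
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for j, v in enumerate(x):
            self.attr_counts[(y, j, v)] += 1

    def predict_one(self, x):
        best, best_score = None, -1.0
        for y, cc in self.class_counts.items():
            score = cc / self.n  # prior P(y)
            for j, v in enumerate(x):
                # Laplace-smoothed P(x_j = v | y) for binary-valued attributes
                score *= (self.attr_counts[(y, j, v)] + 1) / (cc + 2)
            if score > best_score:
                best, best_score = y, score
        return best

# Prequential (test-then-train) loop over a tiny synthetic stream.
stream = [((1, 0), "pos"), ((1, 1), "pos"), ((0, 0), "neg"), ((1, 0), "pos")]
model = IncrementalNaiveBayes()
correct = 0
for x, y in stream:
    if model.n and model.predict_one(x) == y:  # test first ...
        correct += 1
    model.learn_one(x, y)                      # ... then train
```

The prequential loop is the usual way to evaluate stream classifiers, since there is no separate held‐out set: every instance serves as a test case before it becomes training data.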
Supervised learning approaches can be subdivided into two major categories, regression and classification. When the class attribute is continuous, the task is called regression; when the class attribute is discrete, it is referred to as classification. Manual labeling is difficult, time‐consuming, and can be very costly [93]. In a streaming scenario with high velocity and volume, labeled data are very scarce, leading to poorly trained classifiers as a result of the constrained amount of labeled data accessible for building the models [94].
Some of the supervised learning algorithms for the streaming scenario, grouped as presented in [95], are: (i) tree‐based algorithms: OLIN, Ultra‐Fast Forest Tree system (UFFT), Very Fast Decision Tree learner (VFDT), VFDTc, Random Forest, Vertical Hoeffding Tree, and Concept‐adapting Evolutionary Algorithm for Decision Tree (CEVOT); (ii) rule‐based algorithms: On‐demand classifier, Fuzzy Passive‐aggressive classification, Similarity‐based data stream classification (SimC), Prequential area under curve (AUC)‐based classifier, One‐class classifier with incremental learning and forgetting, and Classifying recurring concepts using a fuzzy similarity function; (iii) ensemble‐based algorithms: Streaming ensemble algorithm, Weighted classifier ensemble, and Distance‐based ensemble online classifier with kernel clustering; (iv) nearest‐neighbor algorithms: Adaptive nearest neighbor classification algorithm and Anytime nearest neighbor algorithm; (v) statistical algorithms: Evolving Naïve Bayes; (vi) deep learning: Activity recognition [96].
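Several of the tree‐based learners above (VFDT, VFDTc, Vertical Hoeffding Tree) rest on the Hoeffding bound, which tells the learner how many examples suffice before committing to a split decision. A minimal sketch of the bound:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    With probability 1 - delta, the true mean of a random variable with
    range `value_range` lies within epsilon of the mean observed over n
    examples. A VFDT-style learner splits a node once the observed gain
    gap between the two best attributes exceeds epsilon, so the bound
    shrinks (and splits become safer) as n grows.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

For example, with R = 1 and δ = 0.05, the bound after 1000 examples is roughly 0.039, and it keeps tightening as more of the stream is seen.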
6.4 Ontology‐Based Methods
Performing streaming data analysis over ontologies and linked open data is a challenging and emerging research area. Semantic web technology, an extension of the World Wide Web, is used to improve the interoperability of heterogeneous sources with a data model called the Resource Description Framework (RDF) and ontological languages such as the Web Ontology Language (OWL). Work applying ontologies or linked open data to data streams includes [97–99]. Due to the dynamic nature of data streams, current solutions for reasoning over this data model and these ontological languages are not suited to the streaming context. This gap brought about what is referred to as stream reasoning: the set of inference approaches and deduction mechanisms concerned with providing continuous inference over a data stream, leading to better decision support systems [100]. Stream reasoning has been applied in remote health monitoring [101], smart cities [102], semantic analysis of social media [103], and maritime safety and security [104], amongst others. Another attempt to improve semantic web ontology support is to lift existing streams to RDF streams using intuitive configuration mechanisms. Techniques for RDF stream modeling include the Semantic Sensor Network (SSN) ontology [105], the Stream Annotation Ontology (SAO) [106], the Smart Appliances REFerence (SAREF) ontology [107], and the Linked Stream Annotation Engine (LSane) [108].
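As a rough illustration of what "lifting" a raw stream to an RDF stream means, the sketch below maps one (sensor, timestamp, value) tuple to a set of subject–predicate–object triples. The URIs and the SOSA/SSN‐style property names are illustrative placeholders only, not the exact terms defined by the cited ontologies.

```python
def lift_to_rdf(reading, base="http://example.org/stream/"):
    """Lift one raw stream tuple to RDF-style triples (sketch).

    `reading` is a (sensor_id, timestamp, value) tuple. Each triple is
    represented as a plain (subject, predicate, object) string tuple;
    a real pipeline would serialize these with an RDF library.
    """
    sensor_id, timestamp, value = reading
    obs = f"{base}obs/{sensor_id}/{timestamp}"   # URI for this observation
    return [
        (obs, "rdf:type", "sosa:Observation"),
        (obs, "sosa:madeBySensor", f"{base}sensor/{sensor_id}"),
        (obs, "sosa:resultTime", str(timestamp)),
        (obs, "sosa:hasSimpleResult", str(value)),
    ]
```

Once lifted, such triples can be fed to a stream reasoner, which applies continuous inference over the arriving RDF stream rather than over a static graph.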
7 Strategies for Processing Data Streams
Data stream processing includes techniques, models, and systems for processing data as soon as they arrive, detecting trends and patterns with low latency [109]. It requires two resources, storage capacity and computational power, in the face of the unbounded generation of high‐velocity data with a brief life span. To cope with these requirements, approximate computing, which targets low latency at the expense of an acceptable loss in quality, has been a practical solution [110]. The idea behind approximate computing is to return an approximate answer instead of the exact answer to user queries, by operating on a representative sample of the data instead of the whole data set [111]. The two main techniques for approximate computing are (i) sampling [4], which constructs data stream summaries by probabilistic selection, and (ii) sketches [112], which compress data using data structures (such as histograms or hash tables), prediction‐based methods (such as Bayesian inference), and transformation‐based methods (such as wavelets).
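Sampling‐based approximation can be sketched with classic reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of fixed size k over a stream of unknown length in O(k) memory; queries are then answered against the reservoir instead of the full stream.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Reservoir sampling: keep a uniform random sample of k items
    from a stream of unknown length, using O(k) memory.

    After n items have passed, every item has probability k/n of being
    in the reservoir, so the reservoir is a representative sample that
    can answer queries approximately without storing the whole stream.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # replace with probability k/(i+1)
    return reservoir
```

A sketch such as a histogram or hash‐table summary plays the complementary role: instead of keeping raw sampled tuples, it keeps a compressed digest from which approximate answers are computed.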
The fixed window and the sliding window are two computation models for partitioning the data stream. A fixed window partitions the data stream into nonoverlapping time segments; the current data are removed after processing, resetting the window size back to zero. A sliding window contains a historical snapshot of the data stream at any point in time: when newly arriving data are at variance with the current window elements, the tuples are updated by discarding the oldest data [5]. The sliding window can be further subdivided into the count‐based window and the time‐based window. In the count‐based window, the progressive step is expressed in tuple counts, while in the time‐based window, items with the oldest timestamp are replaced with items with the latest timestamp [113].
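The two sliding‐window variants can be sketched in a few lines; the window sizes and eviction rules below are illustrative choices, not prescribed by the cited work.

```python
from collections import deque

class CountBasedWindow:
    """Count-based sliding window: keeps the k most recent tuples,
    discarding the oldest tuple each time a new one arrives."""

    def __init__(self, k):
        self.buf = deque(maxlen=k)  # deque evicts the oldest automatically

    def insert(self, item):
        self.buf.append(item)

    def contents(self):
        return list(self.buf)

class TimeBasedWindow:
    """Time-based sliding window of width `span` time units: tuples whose
    timestamps fall behind (latest - span) are evicted on each insert.
    Timestamps are assumed to arrive in nondecreasing order."""

    def __init__(self, span):
        self.span = span
        self.buf = deque()  # (timestamp, item) pairs

    def insert(self, timestamp, item):
        self.buf.append((timestamp, item))
        while self.buf and self.buf[0][0] <= timestamp - self.span:
            self.buf.popleft()  # drop items older than the window span

    def contents(self):
        return [item for _, item in self.buf]
```

The count‐based window advances one tuple at a time, while the time‐based window may evict zero or several tuples per arrival, depending on how far the latest timestamp has advanced.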