Computational Statistics in Data Science. Group of authors


Semi‐supervised learning belongs to a class of AI frameworks that train on a combination of unlabeled and labeled data [89]. Semi‐supervised learning in a data stream context is challenging because the data are generated in real time and labels may be missing for various reasons, including communication errors, network delays, and expensive labeling processes [90]. According to Zhu and Li [91], a semi‐supervised learning problem in a data stream context is defined as follows. Let $S = \{(x_t, y_t)\}_{t=1}^{T_0}$ denote the streaming data in the first $T_0$ time period, and let $Y = \{1, 2, \ldots, K\}$ be the known label set. Each arriving instance $x_t$ carries a label $y_t \in Y \cup \{-1\} = \{-1, 1, 2, \ldots, K\}$. If $y_t = -1$, then $x_t$ is an unlabeled instance, but its true label is in the set $Y$. As time goes on, evolution happens and produces a data stream $S' = \{(x_t, y_t)\}_{t=T_0+1}^{\infty}$, which contains novel classes. That is, there exists $(x_{t'}, y_{t'}) \in S'$ with $y_{t'} = -1$ such that the true label of $x_{t'}$ is not in the set $Y$. Note that if $y_{t'} \neq -1$, then $y_{t'} \in Y$ always holds.
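
The label convention in this definition can be made concrete with a short sketch. It is only an illustration under stated assumptions: the names `UNLABELED`, `split_at_t0`, and `partition_by_label` are hypothetical and do not come from Zhu and Li [91], and the stream is assumed to be a finite list of `(x, y)` pairs rather than an unbounded source.

```python
from typing import List, Sequence, Tuple

UNLABELED = -1  # the marker y_t = -1 used in the definition above

Instance = Tuple[Sequence[float], int]  # (x_t, y_t); assumed record shape


def split_at_t0(stream: List[Instance], t0: int) -> Tuple[List[Instance], List[Instance]]:
    """Split a finite stream into S (t = 1..T0) and the later part S' (t > T0)."""
    return stream[:t0], stream[t0:]


def partition_by_label(
    stream: List[Instance], num_classes: int
) -> Tuple[List[Instance], List[Sequence[float]]]:
    """Separate labeled instances (y in {1..K}) from unlabeled ones (y == -1)."""
    labeled: List[Instance] = []
    unlabeled: List[Sequence[float]] = []
    for x, y in stream:
        if y == UNLABELED:
            # The true label is hidden; in S' it may even lie outside Y.
            unlabeled.append(x)
        elif 1 <= y <= num_classes:
            labeled.append((x, y))
        else:
            # A visible label must always belong to the known label set Y.
            raise ValueError(f"label {y} is outside the known label set")
    return labeled, unlabeled
```

Note that the observed label alone cannot tell whether an unlabeled instance in $S'$ belongs to a novel class; that decision is left to whichever detection method is applied on top of this partition.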

      Semi‐supervised learning on streaming data may return results similar to those of the supervised approach. However, two observations have been made about semi‐supervised learning on streaming data: (i) to stabilize the classifiers, considerably more objects need to be labeled, and (ii) a larger threshold adversely affects the robustness of the classifiers, with the standard deviation increasing as the threshold grows [19]. Semi‐supervised learning techniques for data streams include ensemble techniques, graph‐based methods, deep learning, active learning, and linear neighborhood propagation.

      6.3 Supervised Learning

      Supervised learning is the type of machine learning that infers a function from labeled training data. Each training example is a pair consisting of an input (a vector) and an output (a supervisory signal). Let the data stream be $S = \{\ldots, d_{t-1}, d_t, d_{t+1}, \ldots\}$, where $d_t = \{x_i, y_i\}$, $x_i$ is the value set of the $i$th datum in each attribute, and $y_i$ is the class of the instance. Data stream classification aims to train a classifier $f : x \to y$ that establishes a mapping between feature vectors and class labels [92].
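
As an illustration of learning a mapping from feature vectors to class labels one instance at a time, the sketch below uses an incremental nearest‐centroid classifier evaluated in test‐then‐train order. This is an assumption made for the example, not the method of [92]; the class `OnlineCentroidClassifier` and the function `prequential_accuracy` are hypothetical names.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Sequence, Tuple


class OnlineCentroidClassifier:
    """Hypothetical incremental classifier: one running mean vector per class."""

    def __init__(self) -> None:
        self.counts: Dict[Hashable, int] = {}
        self.means: Dict[Hashable, List[float]] = {}

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        """Return the class whose centroid is closest to x (None before any training)."""
        if not self.means:
            return None

        def sq_dist(mean: List[float]) -> float:
            return sum((a - m) ** 2 for a, m in zip(x, mean))

        return min(self.means, key=lambda label: sq_dist(self.means[label]))

    def learn(self, x: Sequence[float], y: Hashable) -> None:
        """Update the running mean of class y with one new instance."""
        if y not in self.means:
            self.counts[y] = 0
            self.means[y] = [0.0] * len(x)
        self.counts[y] += 1
        n = self.counts[y]
        self.means[y] = [m + (a - m) / n for m, a in zip(self.means[y], x)]


def prequential_accuracy(
    stream: Iterable[Tuple[Sequence[float], Hashable]],
    model: OnlineCentroidClassifier,
) -> float:
    """Test-then-train: predict each instance first, then learn from its label."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total if total else 0.0
```

Each instance is seen once and then discarded, so memory use depends on the number of classes and attributes rather than on the length of the stream.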

      Supervised learning approaches can be subdivided into two major categories: regression and classification. When the class attribute is continuous, the task is called regression; when the class attribute is discrete, it is referred to as classification. Manual labeling is difficult, time‐consuming, and can be very costly [93]. In a streaming scenario with high velocity and volume, labeled data are very scarce, which leads to poorly trained classifiers because only a limited amount of labeled data is available for building the models [94].

      6.4 Ontology‐Based Methods

      Performing streaming data analysis over ontologies and linked open data is a challenging and emerging research area. Semantic web technology, an extension of the World Wide Web, is used to improve the interoperability of heterogeneous sources through a data model called the Resource Description Framework (RDF) and ontological languages such as the Web Ontology Language (OWL). Works that use ontologies or linked open data on data streams include [97–99]. Because of the dynamic nature of data streams, current solutions for reasoning over the data model and ontological languages are not suited to the streaming data context. This gap gave rise to what is referred to as stream reasoning. Stream reasoning is the set of inference approaches and deduction mechanisms concerned with providing continuous inference over a data stream, leading to a better decision support system [100]. Stream reasoning has been applied in remote health monitoring [101], smart cities [102], semantic analysis of social media [103], and maritime safety and security [104], among others. Another attempt to improve semantic web ontology is to lift existing streams to RDF streams using intuitive configuration mechanisms. Techniques for RDF stream modeling include the Semantic Sensor Network (SSN) ontology [105], the Stream Annotation Ontology (SAO) [106], the Smart Appliances Reference (SAREF) ontology [107], and the Linked Stream Annotation Engine (LSane) [108].

      Fixed window and sliding window are two computation models for partitioning the data stream. The fixed window partitions the data stream into nonoverlapping time segments; once a segment has been processed, its data are removed and the window size is reset to zero. The sliding window contains a historical snapshot of the data stream at any point in time. When the arriving data are at variance with the current window elements, tuples are updated by discarding the oldest data [5]. The sliding window can be further subdivided into count‐based and time‐based windows. In the count‐based window, the progressive step is expressed in tuple counts, whereas in the time‐based window, items with the oldest timestamp are replaced with items with the latest timestamp [113].
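
The three window models can be sketched in a few lines. The code is a minimal illustration under stated assumptions: events are `(timestamp, value)` pairs arriving in timestamp order, the function names `fixed_windows`, `count_sliding`, and `time_sliding` are hypothetical, the time‐based sliding window is taken to cover the half‐open interval `(ts - span, ts]`, and a trailing partial fixed window is still emitted.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp, value); assumed record shape


def fixed_windows(stream: Iterable[Event], width: float) -> Iterator[List[Event]]:
    """Non-overlapping time segments; the buffer is emptied after each one."""
    buffer: List[Event] = []
    current: Optional[int] = None
    for ts, value in stream:
        index = int(ts // width)  # which segment this timestamp falls into
        if current is not None and index != current:
            yield buffer
            buffer = []  # window size goes back to zero
        current = index
        buffer.append((ts, value))
    if buffer:
        yield buffer  # trailing, possibly partial, segment


def count_sliding(stream: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Count-based sliding window: keep only the `size` most recent tuples."""
    window: Deque[Event] = deque(maxlen=size)  # the oldest tuple drops out when full
    for event in stream:
        window.append(event)
        yield list(window)


def time_sliding(stream: Iterable[Event], span: float) -> Iterator[List[Event]]:
    """Time-based sliding window: keep tuples with timestamp in (ts - span, ts]."""
    window: Deque[Event] = deque()
    for ts, value in stream:
        window.append((ts, value))
        while window and window[0][0] <= ts - span:
            window.popleft()  # discard tuples with the oldest timestamps
        yield list(window)
```

For example, `list(fixed_windows(events, 2.0))` groups events by two‐unit time segments, while `count_sliding(events, 2)` and `time_sliding(events, 2.0)` each yield one snapshot of the window per arriving event.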
