Computational Statistics in Data Science. Collective of authors
4 Streaming Data Tools and Technologies
The demand for stream processing is increasing, and data must be processed quickly to support real‐time decisions. Because of the growing interest in streaming data analysis, a large number of streaming data solutions have been created, both by the open‐source community and by enterprise technology vendors [10]. According to Millman [40], several factors should be considered when choosing data stream tools and technologies in order to make effective data management decisions. These factors include the shape of the data; data accessibility; availability and consistency requirements; and workload. Some prominent open‐source tools and technologies for data stream analytics include NoSQL [41], Apache Spark [42–44], Apache Storm [45], Apache Samza [46, 47], Yahoo! S4 [48], Photon [49], Apache Aurora [50], EsperTech [51], SAMOA [52], C‐SPARQL [53], CQELS [54], ETALIS [55], and SpagoWorld [56]. Some proprietary tools and technologies for streaming data are Cloudet [57], Sentiment Brand Monitoring [58], Elastic Streaming Processing Engine [59], IBM InfoSphere Streams [16, 60, 61], Google MillWheel [46], Infochimps Cloud [56], Azure Stream [62], Microsoft Stream Insight [63], TIBCO StreamBase [64], Lambda Architecture [6], IoTSim‐Stream [65], and Apama Stream [62].
5 Streaming Data Pre‐Processing: Concept and Implementation
Data stream pre‐processing, which aims to reduce the inherent complexity of streaming data so that the learning process is faster, more interpretable, and more precise, is an essential technique in knowledge discovery. However, despite the recorded growth in online learning, data stream pre‐processing methods still have a long way to go due to the high level of noise [66]. Sources of this noise include short message length, slang, abbreviations, acronyms, mixed languages, grammatical and spelling mistakes, and irregular, informal, or shortened words with improper sentence structure, all of which make it hard for learning algorithms to perform efficiently and effectively [67]. Additionally, sensor reading errors caused by low battery, damage, or incorrect calibration, among others, can render the data delivered by such sensors unsuitable for analysis [68].
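A minimal sketch of such text normalization is shown below. The slang dictionary, the elongation rule, and the function name are illustrative assumptions; a real system would draw on a resource such as the slang lexicons cited above.

```python
import re

# Hypothetical slang/abbreviation dictionary for illustration only;
# production systems would use a learned or curated lexicon.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks"}

def normalize_post(text):
    """Lightweight normalization of a noisy social-media post."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)          # strip URLs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # squeeze elongations: "soooo" -> "soo"
    tokens = re.findall(r"[a-z0-9']+", text)
    tokens = [SLANG.get(t, t) for t in tokens]   # expand known slang terms
    return " ".join(tokens)

print(normalize_post("Thx u r soooo gr8! http://t.co/x"))
# -> "thanks you r soo great"
```

Each rule is cheap and single-pass over the message, which matters when posts arrive at high velocity.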
Data quality is a fundamental determinant in the knowledge discovery pipeline, as low‐quality data yields low‐quality models and decisions [69]. The data stream pre‐processing stage needs to be strengthened in the face of the multi‐label [70], imbalance [71], and multi‐instance [72] problems associated with data streams [66]. Pre‐processing techniques with low computational requirements [73] also need to be developed, and this remains an open research area. Moreover, social media posts must be represented in a way that preserves the semantics of the content [74, 75]. To improve analysis results on data streams, frameworks are needed that cope with noise, redundancy, heterogeneity, data imbalance, transformation, and feature representation or selection issues [26]. Some of the new frameworks developed for pre‐processing and enriching data streams are SlangSD [76], N‐gram and Hidden Markov Model [77], SLANGZY [78], and SMFP [67].
6 Streaming Data Algorithms
Data streams pose a significant number of challenges to mining algorithms and the research community due to the high traffic, high velocity, and brief life span of streaming data [79]. Many algorithms that are suitable for mining data at rest are not suited to streaming data because of its inherent characteristics [80]. Some of the constraints that streaming data naturally imposes on mining algorithms include (i) the concept of a single pass; (ii) the probability distribution of a data chunk is not known in advance; (iii) there is no limit on the amount of generated data; (iv) the size of incoming data may vary; (v) the incoming data may belong to various sub‐clusters; and (vi) access to correct class labels is limited due to the overhead incurred by querying a label for each arriving instance [81]. These constraints generate further problems, including (i) capturing sub‐cluster data within the bounded learning time complexity; (ii) the minimum number of epochs required to achieve the learning time complexity; and (iii) making algorithms robust in the face of dynamically evolving and irregular streaming data.
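Constraint (i), the single pass, can be illustrated with Welford's online algorithm for mean and variance: each instance is seen exactly once, updates constant-size state, and is then discarded. This is a generic sketch, not an algorithm from the chapter's reference list.

```python
class RunningStats:
    """Single-pass mean/variance via Welford's algorithm: constant memory,
    one update per arriving instance, no access to past data."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n            # incremental mean
        self.m2 += delta * (x - self.mean)     # sum of squared deviations

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.variance)  # -> 5.0 4.0
```

The same pattern, bounded state updated per instance, underlies most of the streaming algorithms surveyed below.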
Streaming data mining tasks include clustering, similarity search, prediction, classification, and object detection, among others [82, 83]. Algorithms used for streaming data analysis can be grouped into four categories: unsupervised learning, semi‐supervised learning, supervised learning, and ontology‐based techniques. These are described in turn below.
6.1 Unsupervised Learning
Unsupervised learning is a type of learning that draws inferences from unlabeled datasets [84]. Data stream sources are nonstationary, and clustering algorithms have no advance information about the data distribution [85]. Because several iterations are required to compute similarity or dissimilarity over the observed dataset, the entire dataset must in most cases be available in memory before the algorithm runs. With data stream clustering, by contrast, the challenge is searching for new structure in the data as it evolves, which involves characterizing the streaming data as clusters that can be leveraged to report useful and interesting patterns [86]. Unsupervised learning algorithms are suitable for analyzing data streams because they do not require predefined labels [87]. Clusters are ranked based on a scoring function, for example, keywords, hashtags, the semantic relationship of terms, and segment extraction [88].
Data stream clustering can be grouped into five categories, which are partitioning methods, hierarchical methods, model‐based methods, density‐based methods, and grid‐based methods.
Partition‐based techniques attempt to find k partitions based on some distance measure. Classical partitioning clustering methods are not well suited to streaming scenarios, since they require prior knowledge of the number of clusters. Examples of partition‐based methods include Incremental K‐Mean, STREAMKM++, Stream LSearch, HPStream, SWClustering, and CluStream.
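The core idea behind incremental (sequential) k‐means can be sketched as follows: each arriving point updates its nearest centroid with a per‐cluster step size 1/n_k, so the centroid tracks the running mean of its cluster without revisiting historical data. This is a generic 1‐D sketch with assumed initial centroids, not the published Incremental K‐Mean algorithm.

```python
# Sequential k-means sketch: one update per arriving point, no second pass.
def assign(point, centroids):
    """Index of the nearest centroid (squared distance, 1-D points)."""
    return min(range(len(centroids)), key=lambda j: (point - centroids[j]) ** 2)

def incremental_kmeans(stream, centroids):
    counts = [0] * len(centroids)
    for x in stream:
        j = assign(x, centroids)
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # running mean of cluster j
    return centroids

print(incremental_kmeans([1.0, 1.2, 9.8, 0.8, 10.2], [0.0, 10.0]))
```

With the toy stream above, the two centroids settle near 1.0 and 10.0, the means of the two sub‐populations each centroid absorbed.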
Hierarchical methods can be further subdivided into divisive and agglomerative. Divisive hierarchical clustering splits a cluster into smaller clusters until it cannot be split further, whereas agglomerative hierarchical clustering merges separate clusters until the distance between two clusters reaches a required threshold. Balanced iterative reducing and clustering using hierarchies (BIRCH), online divisive‐agglomerative clustering (ODAC), E‐Stream, clustering using representatives (CURE), and HUE‐ are some hierarchical algorithms for data stream analysis.
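BIRCH makes hierarchical clustering feasible on streams by summarizing each cluster with a clustering feature CF = (N, LS, SS): the point count, the linear sum, and the squared sum. CFs are additive, so inserting a point or merging two sub‐clusters is cheap and never requires the raw points. The class below is a minimal sketch of that summary structure, not the full CF‐tree.

```python
# Minimal clustering-feature (CF) summary in the style of BIRCH.
class CF:
    def __init__(self, d):
        # N = point count, LS = per-dimension linear sum, SS = squared sum
        self.n, self.ls, self.ss = 0, [0.0] * d, 0.0

    def add(self, x):
        """Absorb one d-dimensional point into the summary."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CFs are additive: merging sub-clusters is O(d)."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

cf = CF(2)
for p in [(1.0, 2.0), (3.0, 4.0)]:
    cf.add(p)
print(cf.centroid())  # -> [2.0, 3.0]
```

Because only (N, LS, SS) is stored, memory stays bounded regardless of how many points the stream delivers.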
In model‐based methods, a hypothesized model is maintained for each cluster, and arriving data are checked against these models to determine which cluster they fit best. Some of the algorithms in this category are CluDistream, Similarity Histogram‐based Incremental Clustering, sliding window with expectation maximization (SWEM), COBWEB, and Evolving Fractal‐Based Clustering of Data Streams.
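A toy version of the model‐based idea: each cluster is a hypothesized 1‐D Gaussian (mean, std), and an arriving point is assigned to the model under which it is most likely. The EM‐style parameter updates used by algorithms such as SWEM are omitted; the models here are fixed and illustrative.

```python
import math

def best_model(x, models):
    """Return the index of the Gaussian (mu, sigma) that gives x the
    highest log-likelihood (constant terms dropped)."""
    def loglik(m):
        mu, sigma = m
        return -math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
    return max(range(len(models)), key=lambda i: loglik(models[i]))

models = [(0.0, 1.0), (5.0, 1.0)]  # two hypothesized cluster models
print(best_model(0.7, models), best_model(4.1, models))  # -> 0 1
```

In a full streaming algorithm, the chosen model's parameters would then be updated incrementally from the assigned point.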
Density‐based methods separate data into density regions (i.e., nonoverlapping cells) of different shapes and sizes. Density‐based algorithms require only a single pass and can handle noise, and the number of clusters does not have to be stated in advance. Some density‐based algorithms include DGStream, MicroTEDAclus, clustering of evolving data‐streams into arbitrary shapes (CEDAS), Incremental DBSCAN (Density‐Based Spatial Clustering with Noise), DenStream, r‐DenStream, DStream, DBstream, data stream clustering (DSCLU), MR‐Stream, Ordering Points to Identify Clustering Structure (OPTICS), OPClueStream, and MBG‐Stream.
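The density‐grid idea behind algorithms such as DStream can be sketched as follows: each arriving point is mapped to a grid cell, cell densities decay over time so stale regions fade, and cells above a threshold become dense cluster seeds. The decay factor, threshold, and cell width below are illustrative values, not the published settings.

```python
from collections import defaultdict

DECAY, THRESHOLD, WIDTH = 0.998, 3.0, 1.0  # assumed parameters for illustration

class DensityGrid:
    def __init__(self):
        self.density = defaultdict(float)

    def insert(self, point):
        """Map a point to its grid cell, age all cells, bump the hit cell."""
        cell = tuple(int(v // WIDTH) for v in point)
        for c in self.density:
            self.density[c] *= DECAY      # exponential decay of stale cells
        self.density[cell] += 1.0

    def dense_cells(self):
        """Cells whose decayed density exceeds the threshold."""
        return {c for c, d in self.density.items() if d >= THRESHOLD}

grid = DensityGrid()
for _ in range(5):
    grid.insert((0.5, 0.5))   # repeated hits make cell (0, 0) dense
grid.insert((10.5, 10.5))     # a single outlier stays below threshold
print(grid.dense_cells())
```

Connecting adjacent dense cells into arbitrary‐shaped clusters, the step that distinguishes the full algorithms, is omitted here for brevity.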
6.2 Semi‐Supervised