
volume, variety, value, veracity, variability, and volatility, data streams strongly constrain processing algorithms in both space (memory) and time. Hence, to guarantee high accuracy, these challenges must be mitigated, as they can negatively influence the accuracy of data stream analysis [39].

      Data stream pre-processing, which aims to reduce the inherent complexity of streaming data for a faster, more understandable, more interpretable, and more precise learning process, is an essential technique in knowledge discovery. However, despite the recorded growth in online learning, data stream pre-processing methods still have a long way to go, largely because of the high level of noise [66]. This noise includes short message lengths, slang, abbreviations, acronyms, mixed dialects, grammatical and spelling mistakes, and irregular, informal, shortened words and improper sentence structure, all of which make it hard for learning algorithms to perform efficiently and effectively [67]. Additionally, errors in sensor readings caused by low batteries, damage, or incorrect calibration, among other factors, can render the data delivered by such sensors unsuitable for analysis [68].
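To make the noise problem concrete, a minimal normalization sketch for short, noisy messages is shown below. The slang dictionary and the example message are hypothetical placeholders, not resources used by any of the cited frameworks, which rely on far larger, curated lexicons.

```python
import re

# Hypothetical slang/abbreviation dictionary for illustration only.
SLANG_MAP = {"u": "you", "gr8": "great", "btw": "by the way", "pls": "please"}

def normalize_message(text: str) -> str:
    """Lowercase, strip URLs and non-alphanumeric noise, expand known slang."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation/emoji
    tokens = [SLANG_MAP.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_message("BTW u r gr8!! pls RT https://t.co/xyz"))
# -> "by the way you r great please rt"
```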

      Data quality is a fundamental determinant in the knowledge discovery pipeline, as low-quality data yields low-quality models and decisions [69]. There is a need to strengthen the data stream pre-processing stage in the face of the multi-label [70], imbalance [71], and multi-instance [72] problems associated with data streams [66]. Data stream pre-processing techniques with low computational requirements [73] also need to be developed, as this remains an open research question. Moreover, social media posts must be represented in a way that preserves the semantics of their content [74, 75]. To improve the results of data stream analysis, frameworks are needed that cope with the noise, redundancy, heterogeneity, data imbalance, transformation, and feature representation or selection issues in data streams [26]. Some of the newer frameworks developed for pre-processing and enriching data streams for better results are SlangSD [76], N-gram and Hidden Markov Model [77], SLANGZY [78], and SMFP [67].
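As one concrete angle on the imbalance problem mentioned above, the sketch below keeps a bounded, per-class reservoir sample of a labeled stream so that minority classes are not crowded out of memory. The class names and simulated stream are assumptions for illustration; this is a generic sampling technique, not one of the cited frameworks.

```python
import random
from collections import defaultdict

class PerClassReservoir:
    """Maintain a fixed-size uniform sample per class label (reservoir sampling)."""
    def __init__(self, per_class_size: int, seed: int = 42):
        self.k = per_class_size
        self.seen = defaultdict(int)      # items observed per class
        self.samples = defaultdict(list)  # retained items per class
        self.rng = random.Random(seed)

    def add(self, item, label):
        self.seen[label] += 1
        buf = self.samples[label]
        if len(buf) < self.k:
            buf.append(item)
        else:
            # Replace with probability k / n to keep the sample uniform.
            j = self.rng.randrange(self.seen[label])
            if j < self.k:
                buf[j] = item

# Usage: a stream of (item, label) pairs with a 99:1 class imbalance.
res = PerClassReservoir(per_class_size=100)
for i in range(10_000):
    label = "minority" if i % 100 == 0 else "majority"
    res.add(i, label)
print({lbl: len(buf) for lbl, buf in res.samples.items()})
# -> {'minority': 100, 'majority': 100}
```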

      Streaming data mining tasks include clustering, similarity search, prediction, classification, and object detection, among others [82, 83]. Algorithms used for streaming data analysis can be grouped into four categories: unsupervised learning, semi-supervised learning, supervised learning, and ontology-based techniques. These are described in turn below.

      6.1 Unsupervised Learning

      Unsupervised learning draws inferences from unlabeled datasets [84]. Data stream sources are nonstationary, and clustering algorithms have no advance information about the data distribution [85]. Because computing similarity or dissimilarity over the observed data requires several iterations, the entire dataset must, in most cases, be available in memory before the algorithm runs. With data stream clustering, by contrast, the challenge is to search for new structure in the data as it evolves, characterizing the streaming data in the form of clusters and leveraging those clusters to report useful and interesting patterns in the stream [86]. Unsupervised learning algorithms are well suited to data stream analysis because they do not require predefined labels [87]. Clusters are ranked by a scoring function based, for example, on keywords, hashtags, the semantic relationship of terms, or segment extraction [88].
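To make the scoring idea concrete, the sketch below ranks clusters of short posts by the frequency of their dominant hashtag. The clusters, posts, and scoring rule are illustrative assumptions, not taken from [88].

```python
from collections import Counter

# Hypothetical clusters of short posts (e.g., output of a stream clusterer).
clusters = {
    0: ["#ai is everywhere", "new #ai chip", "#ai #ml papers"],
    1: ["nice weather today", "rainy again"],
}

def hashtag_score(posts):
    """Score a cluster by the count of its most frequent hashtag."""
    tags = Counter(tok for p in posts for tok in p.split() if tok.startswith("#"))
    return max(tags.values(), default=0)

ranked = sorted(clusters, key=lambda c: hashtag_score(clusters[c]), reverse=True)
print(ranked)  # -> [0, 1]: cluster 0, dominated by '#ai', ranks first
```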

      Data stream clustering can be grouped into five categories, which are partitioning methods, hierarchical methods, model‐based methods, density‐based methods, and grid‐based methods.

      Partition-based techniques try to find k partitions based on some dissimilarity measure. Partitioning clustering methods are not well suited to streaming scenarios because they require prior knowledge of the number of clusters. Examples of partition-based methods include Incremental K-Means, STREAMKM++, Stream LSearch, HPStream, SWClustering, and CluStream.
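As a runnable illustration of the incremental, one-pass constraint, the sketch below uses scikit-learn's MiniBatchKMeans, whose partial_fit method consumes one chunk at a time without retaining past data. It is a generic stand-in for the incremental k-means idea, not an implementation of STREAMKM++, CluStream, or the other algorithms listed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
model = MiniBatchKMeans(n_clusters=3, random_state=0)

# Simulate a stream arriving in chunks of 100 points around 3 true centers.
centers = np.array([[0, 0], [5, 5], [0, 5]])
for _ in range(50):
    chunk = centers[rng.integers(0, 3, size=100)] + rng.normal(scale=0.5, size=(100, 2))
    model.partial_fit(chunk)  # update centroids without storing past chunks

print(np.round(model.cluster_centers_, 1))  # approximates the 3 true centers
```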

      Hierarchical methods can be further subdivided into divisive and agglomerative. In divisive hierarchical clustering, a cluster is split into smaller clusters until it cannot be split further, whereas agglomerative hierarchical clustering merges separate clusters until the distance between two clusters reaches a required threshold. Balanced iterative reducing and clustering using hierarchies (BIRCH), online divisive-agglomerative clustering (ODAC), E-Stream, clustering using representatives (CURE), and HUE-Stream are some hierarchical algorithms for data stream analysis.
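Of the algorithms listed, BIRCH has a streaming-friendly implementation in scikit-learn; the minimal sketch below feeds it synthetic chunks via partial_fit. It illustrates only the incremental CF-tree idea, under assumed synthetic data, and is not a sketch of ODAC, E-Stream, or HUE-Stream.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(1)
brc = Birch(threshold=0.8, n_clusters=2)  # CF-tree absorbs points incrementally

# Two Gaussian sources, one slowly drifting, arriving as successive chunks.
for step in range(20):
    a = rng.normal(loc=[0 + 0.05 * step, 0], scale=0.3, size=(50, 2))
    b = rng.normal(loc=[4, 4], scale=0.3, size=(50, 2))
    brc.partial_fit(np.vstack([a, b]))    # update the CF-tree only

labels = brc.predict(np.array([[0.5, 0.0], [4.0, 4.1]]))
print(labels)  # the two probe points fall into different clusters
```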

      In model-based methods, a hypothesized model is fitted to each cluster to check which data best fit that cluster. Algorithms in this category include CluDistream, Similarity Histogram-based Incremental Clustering, sliding window with expectation maximization (SWEM), COBWEB, and Evolving Fractal-Based Clustering of Data Streams.
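The sketch below illustrates the sliding-window EM idea behind methods such as SWEM by periodically refitting scikit-learn's GaussianMixture on a fixed-length window of recent points. It is a simplified illustration on assumed synthetic data, not the published SWEM algorithm.

```python
import numpy as np
from collections import deque
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
window = deque(maxlen=500)  # keep only the most recent 500 points

for t in range(2000):
    # Two-component mixture whose second component drifts over time.
    mean = [0.0, 0.0] if rng.random() < 0.5 else [3.0 + 0.002 * t, 3.0]
    window.append(rng.normal(loc=mean, scale=0.4, size=2))

    if t % 500 == 499:      # periodically refit EM on the current window
        gmm = GaussianMixture(n_components=2, random_state=0)
        gmm.fit(np.array(window))
        print(t, np.round(gmm.means_, 1))  # drifting component tracks the shift
```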

      6.2 Semi-Supervised Learning
