Computational Statistics in Data Science (various authors), page 47

A stream S is a possibly infinite bag of elements (x, t), where x is a tuple belonging to the schema of S and t ∈ T is the timestamp of the element [3]. A data stream refers to an unbounded, ordered sequence of data instances arriving over time [4]. Formally, a data stream can be defined as an infinite sequence of tuples S = (x1, t1), (x2, t2), …, (xn, tn), …, where xi is a tuple and ti is a timestamp [5]. Streaming data can be defined as a frequently changing, potentially infinite data flow generated from disparate sources [6]. Formally, streaming data X = (x_t^1, …, x_t^m)^T is a set of count values of a variable x of an event that happened at timestamp t (0 < t ≤ T), where T is the lifetime of the streaming data [7]. Judging from these definitions, the two concepts are deceptively similar in the context of data science. The different schools of thought largely agree on these closely related concepts, except for the engineering school of thought, which refers to a data stream as an architecture. Although the distinction is still open for further exploration, we will use the two terms interchangeably in this chapter.
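The definitions above can be made concrete with a minimal sketch: a stream modeled as a potentially unbounded sequence of (x, t) pairs, where x is a tuple conforming to some schema and t is its timestamp. The generator name and schema below are illustrative assumptions, not part of the formal definitions.

```python
import itertools
import random
import time

def sensor_stream():
    """A potentially unbounded stream S = (x1, t1), (x2, t2), ...:
    each element pairs a tuple x with the timestamp t of its arrival."""
    for i in itertools.count():
        x = (i, random.random())  # a tuple belonging to the (assumed) schema
        t = time.time()           # timestamp of the element
        yield (x, t)

# Since the sequence is unbounded, a consumer can only ever
# materialize a finite prefix of it.
first_five = list(itertools.islice(sensor_stream(), 5))
```

The key property the sketch captures is that the stream itself is never held in memory; only a finite prefix (or summary) of it ever is.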

| Dimension    | Streaming data                                      | Static data                     |
|--------------|-----------------------------------------------------|---------------------------------|
| Hardware     | Typically a single constrained measure of memory    | Multiple CPUs                   |
| Input        | Data streams or updates                             | Data chunks                     |
| Time         | A few moments or even milliseconds                  | Much longer                     |
| Data size    | Infinite or unknown in advance                      | Known and finite                |
| Processing   | A single pass or few passes over the data           | Processed in multiple rounds    |
| Storage      | Not stored, or a significant portion kept in memory | Stored                          |
| Applications | Web mining, traffic monitoring, sensor networks     | Widely adopted in many domains  |

      Source: Tozi, C. (2017). Dummy's guide to batch vs. streaming. Trillium Software. Retrieved from http://blog.syncsort.com/2017/07/bigdata/; Kolajo, T., Daramola, O., & Adebiyi, A. (2019). Big data stream analysis: A systematic literature review. Journal of Big Data 6(47).

      In the big data era, data stream mining is one of the vital fields. Since streaming data is continuous, unlimited, and nonuniformly distributed, efficient data structures and algorithms are needed to mine patterns from this high-volume, high-traffic, often imbalanced data stream, which is also plagued by concept drift [11].
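As one concrete illustration of such a bounded-memory, single-pass data structure (this particular algorithm, Misra-Gries heavy-hitters summarization, is our choice of example and is not prescribed by the text), consider finding frequent items in a stream while keeping at most k - 1 counters:

```python
def misra_gries(stream, k):
    """One-pass heavy-hitters summary (Misra-Gries): keeps at most k - 1
    counters, so memory stays bounded even if the stream is unbounded.
    Any item occurring more than n/k times in a stream of n elements
    is guaranteed to survive as a candidate."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Counters are full: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Illustrative stream of 100 categorical events.
data = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
candidates = misra_gries(iter(data), k=4)
```

The returned counts are lower bounds on the true frequencies; a second pass (when the stream can be replayed) or auxiliary statistics are needed to get exact counts, which is precisely the trade-off single-pass constraints impose.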

      This chapter intends to broaden the existing knowledge in the domain of data science, streaming data, and data streams. To do this, relevant themes including data stream mining issues, streaming data tools and technologies, streaming data pre‐processing, streaming data algorithms, strategies for processing data streams, best practices for managing data streams, and suggestions for the way forward are discussed in this chapter. The structure of the rest of this chapter is as follows. Section 2 presents a brief background on data stream computing; Section 3 discusses issues in data stream mining, tools, and technologies for data streaming are presented in Sections 4 while streaming data pre‐processing is discussed in Section 5. Sections 6 and 7 present streaming data algorithms and data stream processing strategies, respectively. This is followed by a discussion on best practices for managing data streams in Section 8, while the conclusion and some ideas on the way forward are presented in Section 9.

      The principal presumption of stream computing is that the value of data lies in its newness. Thus, data are analyzed the moment they arrive in a stream, rather than first being stored as in batch processing. This places a serious requirement on platforms for scalable computing with parallel architectures [14]. With stream computing, it is feasible for organizations to analyze and respond to rapidly changing data in real time [15]. Integrating streaming data into the decision-making process gives rise to a programming concept called stream computing. Stream processing solutions must be able to handle a high volume of data from different sources in real time, giving due consideration to accessibility, versatility, and fault tolerance. Data stream analysis includes the ingestion of data as a boundless sequence of tuples, analysis, and the creation of significant outcomes as a stream [16].
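The contrast between the two models can be sketched with a toy statistic: a batch computation stores all values before analyzing them, while a streaming computation updates its result the moment each element arrives, in one pass and constant memory. The incremental-mean update below is a standard illustration of this idea, chosen by us rather than taken from the chapter.

```python
def streaming_mean(stream):
    """Process each element the moment it arrives: one pass, O(1) memory,
    and a usable result available at every point in the stream."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update on arrival
    return mean

# Batch processing, by contrast, stores the data first and analyzes later.
values = [3.0, 5.0, 7.0, 9.0]
batch_mean = sum(values) / len(values)
stream_mean = streaming_mean(iter(values))
```

Both paths reach the same answer here; the difference is that the streaming version never needs the whole dataset in memory, which matters when the stream is unbounded.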

      In a stream processor, an application is represented by a dataflow graph comprising operators and interconnected streams. A stream processing workflow consists of programs. Formally, a composition C = (𝒫, <p), where 𝒫 = {P1, P2, …, Pn} is a set of transaction programs and <p is the program order, also called the partial order. The partial order captures the dataflow and control order of the data stream. The composition graph 𝒢(C) is the acyclic graph representing the partial order.
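A minimal sketch of such a composition, using Python's standard-library `graphlib`: the programs P1–P4 and the edges of the partial order below are hypothetical, chosen only to show that the composition graph must be acyclic and that any valid execution schedule must respect <p.

```python
from graphlib import TopologicalSorter

# Hypothetical composition C = (P, <p): each key maps a program to the
# set of programs that must precede it under the partial order <p
# (edges run from producer to consumer in the dataflow graph).
partial_order = {
    "P2": {"P1"},          # P1 <p P2
    "P3": {"P1"},          # P1 <p P3
    "P4": {"P2", "P3"},    # P2 <p P4 and P3 <p P4
}

# The composition graph G(C) must be acyclic; TopologicalSorter raises
# CycleError otherwise, and yields a schedule consistent with <p.
schedule = list(TopologicalSorter(partial_order).static_order())
```

Any topological order of 𝒢(C) is a legal schedule; here P1 must come first and P4 last, while P2 and P3 may run in either order (or in parallel).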
