Big Data. Seifedine Kadry
3 What are the factors that explain the tremendous increase in the data volume? Multiple disparate data sources are responsible for the tremendous increase in the volume of big data. Much of the growth in data can be attributed to the digitization of almost anything and everything in the globe. Paying e‐bills, online shopping, communication through social media, e‐mail transactions in various organizations, a digital representation of the organizational data, and so forth, are some of the examples of this digitization around the globe.
4 What are the different data types of big data? Machine‐generated and human‐generated data can be represented by the following primitive types of big data: structured data, unstructured data, and semi‐structured data.
5 What is semi‐structured data? Semi‐structured data has a structure but does not fit into a relational database. Because it is organized, it is easier to analyze than unstructured data. JSON and XML are examples of semi‐structured data.
6 What do the three Vs of big data mean? Volume: the size of the data. Velocity: the rate at which the data is generated and processed. Variety: the heterogeneity of the data, which may be structured, unstructured, or semi‐structured.
7 What is commodity hardware? Commodity hardware is a low‐cost, low‐performance, and low‐specification functional hardware with no distinctive features. Hadoop can run on commodity hardware and does not require any high‐end hardware or supercomputers to execute its jobs.
8 What is data aggregation? The data aggregation phase of the big data life cycle involves collecting the raw data, transmitting it to a storage platform, and preprocessing it. Data acquisition in the big data world means acquiring high‐volume data arriving at an ever‐increasing pace.
9 What is data preprocessing? Data preprocessing is an important process performed on raw data to transform it into an understandable format and provide access to consistent and accurate data. The data generated from multiple sources is often erroneous, incomplete, and inconsistent because of its massive volume and heterogeneous sources, and it is pointless to store useless and dirty data. Additionally, some analytical applications have a crucial requirement for quality data. Hence, for effective, efficient, and accurate data analysis, systematic data preprocessing is essential.
10 What is data integration? Data integration involves combining data from different sources to give the end users a unified data view.
11 What is data cleaning? The data‐cleaning process fills in the missing values, corrects the errors and inconsistencies, and removes redundancy in the data to improve the data quality. The larger the heterogeneity of the data sources, the higher the degree of dirtiness. Consequently, more cleaning steps may be involved.
12 What is data reduction? Processing massive volumes of data may take a long time, making data analysis either infeasible or impractical. Data reduction means reducing the volume of the data or reducing its dimension, that is, the number of attributes. Data reduction techniques are adopted to analyze the data in a reduced format without losing the integrity of the actual data while still yielding quality outputs.
13 What is data transformation? Data transformation refers to transforming or consolidating the data into a format that is acceptable to the big data database and converting it into logical and meaningful information for data management and analysis.
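The preprocessing steps described in questions 9 to 13 (integration, cleaning, reduction, and transformation) can be illustrated with a minimal Python sketch. It is not taken from the book: the two JSON sources, the field names, and the function names are hypothetical, and the specific choices (filling a missing age with the mean, dropping unused attributes, min‐max scaling) are assumptions picked only to show one simple instance of each step.

```python
import json

# Hypothetical semi-structured records from two sources (illustration only).
CRM_JSON = ('[{"id": 1, "name": "Ann", "age": 34, "city": "Paris"},'
            ' {"id": 2, "name": "Bob", "age": null, "city": "Rome"}]')
WEB_JSON = ('[{"id": 2, "name": "Bob", "age": null, "city": "Rome"},'
            ' {"id": 3, "name": "Cy", "age": 50, "city": "Oslo"}]')


def integrate(*sources):
    """Data integration: combine records from several sources into one view."""
    records = []
    for raw in sources:
        records.extend(json.loads(raw))
    return records


def clean(records):
    """Data cleaning: remove redundant records and fill in missing values.

    Assumption: duplicates share an "id"; a missing age is replaced by the
    mean of the known ages.
    """
    unique = {}
    for rec in records:
        unique.setdefault(rec["id"], dict(rec))  # keep the first occurrence
    rows = list(unique.values())
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known) if known else None
    for r in rows:
        if r["age"] is None:
            r["age"] = mean_age
    return rows


def reduce_attributes(records, keep):
    """Data reduction: lower the dimension by keeping only some attributes."""
    return [{k: r[k] for k in keep} for r in records]


def transform(records):
    """Data transformation: add a min-max scaled age in the range [0, 1]."""
    ages = [r["age"] for r in records]
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1  # avoid division by zero when all ages are equal
    return [{**r, "age_scaled": (r["age"] - lo) / span} for r in records]


def preprocess():
    """Run the four steps in order on the hypothetical sources."""
    rows = integrate(CRM_JSON, WEB_JSON)
    rows = clean(rows)
    rows = reduce_attributes(rows, keep=("id", "age"))
    return transform(rows)


if __name__ == "__main__":
    for row in preprocess():
        print(row)
```

Real big data pipelines apply the same ideas to far larger volumes and usually on a distributed platform; the sketch only shows the order and purpose of the steps.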
Frequently Asked Interview Questions
1 Give some examples of big data. Facebook generates approximately 500 terabytes of data per day, airlines generate about 10 terabytes of sensor data every 30 minutes, and the New York Stock Exchange generates approximately 1 terabyte of data per day. These are examples of big data.
2 How is big data analysis useful for organizations? Big data analytics helps organizations make better decisions, find new business opportunities, compete against business rivals, improve performance and efficiency, and reduce cost by using advanced data analytics techniques.
2 Big Data Storage Concepts
CHAPTER OBJECTIVE
The various storage concepts of big data, namely, clusters and file systems, are given a brief overview. Data replication, which has made big data storage fault tolerant, is explained with master‐slave and peer‐peer types of replication. Various types of on‐disk storage are briefed. Scalability techniques, namely, scaling up and scaling out, adopted by various database systems are overviewed.
In big data storage architecture, data reaches users through multiple organization data structures. The big data revolution provides significant improvements to the data storage architecture. New tools such as Hadoop, an open‐source framework for storing data on clusters of commodity hardware, have been developed, which allow organizations to effectively store and analyze large volumes of data.
In Figure 2.1 the data from the source flows through Hadoop, which acts as an online archive. Hadoop is highly suitable for unstructured and semi‐structured data. However, it is also suitable for some structured data that is expensive to store and process in traditional storage engines (e.g., call center records). The data stored in Hadoop is then fed into a data warehouse, which distributes the data to data marts and other downstream systems, where the end users can query and analyze the data using query tools.
In modern BI architecture the raw data stored in Hadoop can be analyzed using MapReduce programs. MapReduce is the programming paradigm of Hadoop. It can be used to write applications to process the massive data stored in Hadoop.
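The MapReduce paradigm can be illustrated with a small word count, sketched here in plain Python. This is only a single‐process imitation of the map, shuffle, and reduce steps under simplifying assumptions; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over the documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster, the map and reduce functions run in parallel on many nodes and the framework performs the grouping between them; here all three steps run in sequence in one process.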
Figure 2.1 Big data storage architecture.
2.1 Cluster Computing
Cluster computing is a distributed or parallel computing system comprising multiple stand‐alone PCs connected together and working as a single, integrated, highly available resource. Multiple computing resources are connected in a cluster to constitute a single larger and more powerful virtual computer, with each computing resource running an instance of the OS. The cluster components are connected through local area networks (LANs). Cluster computing technology is used for high availability as well as load balancing with better system performance and reliability. The benefits of massively parallel processors and cluster computers are high availability, scalable performance, fault tolerance, and the use of cost‐effective commodity hardware. Scalability is achieved by removing nodes or adding nodes as demand requires, without hindering the system operation. A cluster connects a group of systems so that they can share critical computational tasks. The servers in a cluster are called nodes. Cluster computing can follow a client‐server architecture or a peer‐peer model. It provides high‐speed computational power for processing data‐intensive applications related to big data technologies. Cluster computing with distributed computation infrastructure