data technologies are capable of capturing and analyzing them effectively. Big data infrastructure involves new computing models capable of distributed and parallel computation with highly scalable storage and performance. Some of the big data components include Hadoop (framework), HDFS (storage), and MapReduce (processing).
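
      To illustrate the MapReduce processing model named above, the following is a minimal single-machine sketch in Python; the map, shuffle, and reduce functions and the sample input lines are hypothetical stand-ins for work that Hadoop would distribute across a cluster over data stored in HDFS.

```python
# A minimal, single-machine sketch of the MapReduce model (toy word count).
# In a real Hadoop job the map and reduce tasks run distributed over HDFS blocks;
# here they are plain Python functions for illustration only.
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for each word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Group intermediate pairs by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(values) for word, values in grouped.items()}

if __name__ == "__main__":
    # Hypothetical input lines standing in for files stored on HDFS.
    lines = ["big data needs new tools", "big data is high volume data"]
    pairs = (pair for line in lines for pair in map_phase(line))
    print(reduce_phase(shuffle_phase(pairs)))   # e.g. {'big': 2, 'data': 3, ...}
```

      The point of the model is that map and reduce are independent, side-effect-free functions, which is what lets the framework run them in parallel on different machines.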

      1.8.1 Big Data Generation


      1.8.2 Data Aggregation

      The data aggregation phase of the big data life cycle involves collecting the raw data, transmitting it to the storage platform, and preprocessing it. Data acquisition in the big data world means capturing high‐volume data arriving at an ever‐increasing pace. The raw data thus collected are transmitted to a suitable storage infrastructure to support processing and various analytical applications. Preprocessing involves data cleansing, data integration, data transformation, and data reduction to make the data reliable, error free, consistent, and accurate. The data gathered may contain redundancies, which occupy storage space and increase storage cost; preprocessing removes them. Also, much of the data gathered may not be related to the analysis objective, so it is compressed during preprocessing. Efficient data preprocessing is therefore indispensable for cost‐effective and efficient data storage. The preprocessed data are then passed on for purposes such as data modeling and data analytics.
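
      As a rough illustration of this phase, the short Python sketch below acquires records from a hypothetical feed and transmits them in batches to a local JSON-lines file standing in for the storage platform; a real pipeline would write to a distributed store such as HDFS and add the preprocessing steps described next.

```python
# Minimal sketch of the aggregation phase: acquire raw records in batches and
# transmit them to a storage layer. The generator source and the local JSON-lines
# file are hypothetical stand-ins for a real ingestion feed and a platform like HDFS.
import json
from itertools import islice

def acquire():
    """Hypothetical high-volume feed; in practice sensors, logs, clickstreams, etc."""
    for i in range(10):
        yield {"id": i, "value": i * 1.5}

def transmit(records, path="raw_store.jsonl", batch_size=4):
    """Append records to the storage layer in batches to amortise transfer cost."""
    records = iter(records)
    with open(path, "a", encoding="utf-8") as storage:
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                break
            storage.writelines(json.dumps(r) + "\n" for r in batch)

transmit(acquire())   # later stages read raw_store.jsonl for preprocessing and analytics
```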

      1.8.3 Data Preprocessing

      Data preprocessing is an important process performed on raw data to transform it into an understandable format and provide consistent and accurate data. Data generated from multiple heterogeneous sources at massive volume are often erroneous, incomplete, and inconsistent, and it is pointless to store such dirty data. Additionally, some analytical applications have a crucial requirement for quality data. Hence, for effective, efficient, and accurate data analysis, systematic data preprocessing is essential. The quality of the source data is affected by various factors. For instance, the data may contain errors such as a salary field with a negative value (e.g., salary = −2000), caused by transmission errors, typos, or deliberately wrong entries by users who do not wish to disclose personal information. Incompleteness means that a field lacks an attribute of interest (e.g., Education = “”), which may arise from a not-applicable field or software errors. Inconsistency refers to discrepancies in the data; for example, the date of birth and the age may not agree. Inconsistencies arise when data are collected from different sources, because of differing naming conventions between countries and differing input formats (e.g., a DD/MM date field interpreted as MM/DD). Data sources often contain redundant data in different forms, so duplicates also have to be removed during preprocessing to make the data meaningful and error free. There are several steps involved in data preprocessing (see the sketch after this list):

      1 Data integration

      2 Data cleaning

      3 Data reduction

      4 Data transformation
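
      Before these steps run, the kinds of problems described above have to be spotted. The sketch below is a minimal illustration using pandas on a small hypothetical employee table; the column names and values are invented for the example, and real preprocessing would operate at far larger scale.

```python
# A minimal sketch of spotting the quality problems described above, on a small
# hypothetical table: erroneous values, incomplete fields, ambiguous date formats,
# and redundant (duplicate) rows.
import pandas as pd

df = pd.DataFrame({
    "name":      ["Ana", "Ben", "Ana", "Dee"],
    "salary":    [52000, -2000, 52000, 61000],        # -2000 is an obvious error
    "education": ["MSc", "", "MSc", "BSc"],           # "" marks an incomplete field
    "dob":       ["03/04/1990", "15/02/1988", "03/04/1990", "30/11/1992"],
})

# Erroneous values: a salary can never be negative.
errors = df[df["salary"] < 0]

# Incomplete values: attributes of interest left empty.
incomplete = df[df["education"] == ""]

# Inconsistent input format: day-first vs. month-first parsing disagree (or fail).
day_first   = pd.to_datetime(df["dob"], dayfirst=True,  errors="coerce")
month_first = pd.to_datetime(df["dob"], dayfirst=False, errors="coerce")
ambiguous = df[day_first != month_first]

# Redundant rows: exact duplicates occupying extra storage.
duplicates = df[df.duplicated()]

print(errors, incomplete, ambiguous, duplicates, sep="\n\n")
```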

      1.8.3.1 Data Integration

      1.8.3.2 Data Cleaning

      The data‐cleaning process fills in missing values, corrects errors and inconsistencies, and removes redundancy in the data to improve data quality. The greater the heterogeneity of the data sources, the higher the degree of dirtiness, and consequently the more cleaning steps that may be involved. Data cleaning involves several steps: spotting or identifying the error, correcting the error or deleting the erroneous data, and documenting the error type. Detecting the types of error and inconsistency present in the data requires a detailed analysis of the data. Data redundancy is the repetition of data, which increases storage cost and transmission expenses and decreases data accuracy and reliability; it is handled by redundancy detection and data compression. Missing values can be filled in manually, but this is tedious, time‐consuming, and not appropriate for massive volumes of data. A global constant can be used to fill in all missing values, but this method creates issues when the data are later integrated, so it is not foolproof. Noisy data can be handled by four methods, namely regression, clustering, binning, and manual inspection.
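
      The following is a minimal cleaning sketch on a hypothetical table, again using pandas: it fills missing values with a global constant, drops exact duplicate rows, and smooths a numeric column by binning; regression- or clustering-based smoothing would replace the binning step in practice, and the column names and bin boundaries here are invented for the example.

```python
# A minimal cleaning sketch for the steps above, on a hypothetical table: fill
# missing values, remove redundant rows, and smooth a noisy column by binning.
import pandas as pd

df = pd.DataFrame({
    "education": ["MSc", None, "BSc", "MSc"],
    "age":       [34, 29, 29, 34],
    "salary":    [52000, 48000, 48000, 52000],
})

# Fill missing values with a global constant; simple, but as noted it can distort
# later integration and analysis, so it is not a foolproof method.
df["education"] = df["education"].fillna("Unknown")

# Redundancy handling: drop exact duplicate rows.
df = df.drop_duplicates()

# Binning: replace noisy numeric values with the range (bin) they fall into.
df["salary_band"] = pd.cut(df["salary"], bins=[40000, 50000, 60000],
                           labels=["40-50k", "50-60k"])

print(df)
```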

      1.8.3.3 Data Reduction

      Processing a massive volume of data may take a long time, making data analysis infeasible or impractical. Data reduction is the concept of reducing the volume of the data or reducing its dimension, that is, the number of attributes. Data reduction techniques are adopted to analyze the data in reduced form without losing the integrity of the actual data and yet
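
      As a brief illustration of data reduction, the sketch below applies principal component analysis (dimensionality reduction) and simple random sampling (numerosity reduction) to hypothetical data using NumPy and scikit-learn; the variance threshold and sample size are arbitrary choices for the example.

```python
# A minimal sketch of data reduction on hypothetical data: principal component
# analysis (fewer attributes) plus random sampling (fewer records).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 10))          # 1,000 records with 10 attributes

# Dimensionality reduction: keep enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(data)
print(data.shape, "->", reduced.shape)

# Numerosity reduction: analyze a 10% random sample instead of every record.
sample = data[rng.choice(len(data), size=100, replace=False)]
print(sample.shape)
```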
