Data Cleaning. Ihab F. Ilyas
Regardless of the type of data errors to be fixed, data cleaning activities usually consist of two phases: (1) error detection, where various errors and violations are identified and possibly validated by experts; and (2) error repair, where updates to the database are applied (or suggested to human experts) to bring the data to a cleaner state suitable for downstream applications and analytics. Error detection techniques can be either quantitative or qualitative. Quantitative error detection techniques typically use statistical methods to identify abnormal behaviors and errors [Hellerstein 2008] (e.g., “a salary that is three standard deviations away from the mean salary is an error”), and hence have been studied mostly in the context of outlier detection [Aggarwal 2013]. Qualitative error detection techniques, on the other hand, rely on descriptive approaches to specify patterns or constraints that a consistent data instance must satisfy, and identify as errors the data that violate those patterns or constraints. For example, consider the descriptive statement about a company HR database: “for two employees working at the same branch of the company, a senior employee cannot earn a lower salary than a junior employee.” If two employees violate this rule, it is likely that at least one of their records contains an error.
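To make the contrast concrete, the short sketch below flags salaries far from the mean (quantitative) and pairs of records that contradict the seniority rule (qualitative). It is only an illustration: the record layout, names, and thresholds are assumptions made for this sketch and are not taken from the book.

```python
# Illustrative sketch (assumed record layout and values, not from the book)
# contrasting quantitative and qualitative error detection on toy records.
from statistics import mean, stdev


def quantitative_errors(values, k=3.0):
    """Flag values more than k standard deviations from the sample mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]


def qualitative_errors(records):
    """Flag pairs violating: within one branch, a senior employee
    should not earn less than a junior employee."""
    return [
        (a, b)
        for a in records
        for b in records
        if a["branch"] == b["branch"]
        and a["level"] == "Senior"
        and b["level"] == "Junior"
        and a["salary"] < b["salary"]
    ]


employees = [
    {"name": "Ann", "branch": "NY", "level": "Senior", "salary": 90_000},
    {"name": "Bob", "branch": "NY", "level": "Junior", "salary": 95_000},
    {"name": "Eve", "branch": "LA", "level": "Senior", "salary": 120_000},
]

print(quantitative_errors([e["salary"] for e in employees]))  # [] -- nothing 3 sigma away on this tiny sample
print(qualitative_errors(employees))                          # the (Ann, Bob) pair violates the rule
```

Note the difference in output: the quantitative check singles out individual suspicious values, while the qualitative check returns a set of cells that cannot all be correct at once, so a violation still leaves open which cell to repair.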
Various surveys and books detail specific aspects of data quality and data cleaning. For example, Rahm and Do [2000] classify the types of errors that occur in an Extract-Transform-Load (ETL) process and survey the tools available for cleaning data in an ETL process. Other work focuses on the effect of incomplete data on query answering [Grahne 1991] and on the use of a chase procedure [Maier et al. 1979] for dealing with incomplete data [Greco et al. 2012]. Hellerstein [2008] focuses on cleaning quantitative numerical data using mainly statistical techniques. Bertossi [2011] provides complexity results for repairing inconsistent data and for consistent query answering over inconsistent data. Fan and Geerts [2012] discuss the use of data quality rules for data consistency, data currency, and data completeness, and their interactions. Dasu and Johnson [2003] summarize how techniques from exploratory data mining can be integrated with data quality management. Ganti and Sarma [2013] focus on an operator-centric approach to developing data cleaning solutions, building customizable operators that can be composed into common solutions. Ilyas and Chu [2015] provide taxonomies and example algorithms for qualitative error detection and repair techniques. Multiple surveys and tutorials summarize the different definitions of outliers and the algorithms for detecting them [Hodge and Austin 2004, Chandola et al. 2009, Aggarwal 2013]. Data deduplication, a long-standing problem that has been studied for decades [Fellegi and Sunter 1969], has also been extensively surveyed [Koudas et al. 2006, Elmagarmid et al. 2007, Herzog et al. 2007, Dong and Naumann 2009, Naumann and Herschel 2010, Getoor and Machanavajjhala 2012].
This book, however, focuses on the end-to-end data cleaning process, describing various error detection and repair methods, and attempts to anchor these proposals with multiple taxonomies and views. Our goals are (1) to allow researchers and general readers to understand the scope of current techniques and highlight gaps and possible new directions of research; and (2) to give practitioners and system implementers a variety of choices and solutions for their data cleaning activities. In what follows, we give a brief overview of the book’s scope as well as a chapter outline.
Figure 1.1 A typical data cleaning workflow with an optional discovery step, error detection step, and error repair step.
1.1 Data Cleaning Workflow
Figure 1.1 shows a typical data cleaning workflow, consisting of an optional discovery and profiling step, an error detection step, and an error repair step. To clean a dirty dataset, we often need to model various aspects of the data, e.g., schema, patterns, probability distributions, and other metadata. One way to obtain such metadata is by consulting domain experts, typically a costly and time-consuming process. The discovery and profiling step is used to discover this metadata automatically. Given a dirty dataset and the associated metadata, the error detection step finds the part of the data that does not conform to the metadata and declares that subset to contain errors. The errors surfaced by the error detection step can take various forms, such as outliers, violations, and duplicates. Finally, given the detected errors and the metadata used to detect them, the error repair step produces updates that are applied to the dirty dataset. Since the data cleaning process involves many uncertainties, external sources such as knowledge bases and human experts are consulted whenever possible to ensure the accuracy of the cleaning workflow.
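The skeleton below sketches the shape of this workflow as plain Python functions. The function names, the metadata format, and the rule interface are assumptions made for illustration; they are not an existing API or the book's notation.

```python
# Hypothetical skeleton of the workflow in Figure 1.1 (names and data
# structures are assumptions for illustration, not an actual library).

def discover_metadata(table):
    """Discovery/profiling step: infer simple metadata (here, per-column
    statistics) directly from the possibly dirty data."""
    metadata = {}
    for column in table[0]:
        values = [row[column] for row in table]
        metadata[column] = {
            "distinct": len(set(values)),
            "nulls": sum(v is None for v in values),
        }
    return metadata


def detect_errors(table, metadata, rules):
    """Error detection step: collect everything flagged by each rule
    (outliers, rule violations, duplicates, ...)."""
    return [error for rule in rules for error in rule(table, metadata)]


def repair_errors(table, errors):
    """Error repair step: propose updates (row, column, new value) that
    remove the detected errors; left as a placeholder, since concrete
    repair algorithms are the subject of later chapters."""
    return []


def clean(table, rules):
    metadata = discover_metadata(table)            # optional discovery/profiling
    errors = detect_errors(table, metadata, rules)
    for row, column, value in repair_errors(table, errors):
        table[row][column] = value                 # apply (or merely suggest) updates
    return table, errors
```

Here a rule is simply a function that inspects the table (and, optionally, its metadata) and returns the errors it finds; Example 1.1 below gives two concrete rules of this kind.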
Example 1.1 Consider Table 1.1, which contains employee records for a U.S. company. Each tuple describes one employee by id (GID), name (FN, LN), role (ROLE), zip code (ZIP), state (ST), and salary (SAL). Suppose a domain expert supplies two data quality rules for this table. The first rule states that if two employees have the same zip code, they must be in the same state. The second rule states that among employees working in the same state, a senior employee cannot earn a lower salary than a junior employee.
Table 1.1 An example employee table
Given these two data quality rules, the error detection step detects two violations. The first violation consists of four cells, {t1[ZIP], t1[ST], t3[ZIP], t3[ST]}, which together violate the first rule. The second violation consists of six cells, {t1[ROLE], t1[ST], t1[SAL], t2[ROLE], t2[ST], t2[SAL]}, which together violate the second rule. The error repair step takes these violations and produces an update that changes t1[ST] from “NM” to “NY”; the updated data then has no violations with respect to the two rules.
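The sketch below replays this example in code. Because Table 1.1 is not reproduced here, the row values are hypothetical, chosen only so that they exhibit exactly the two violations described above; the rule encodings are likewise illustrative rather than the book's formal notation.

```python
# Hypothetical contents standing in for Table 1.1 (values invented so that
# they produce the two violations described in Example 1.1).
t = {
    1: {"GID": 1, "FN": "Anne", "LN": "Nash",  "ROLE": "Senior", "ZIP": "10001", "ST": "NM", "SAL": 80_000},
    2: {"GID": 2, "FN": "Mark", "LN": "White", "ROLE": "Junior", "ZIP": "87101", "ST": "NM", "SAL": 90_000},
    3: {"GID": 3, "FN": "Lena", "LN": "Bauer", "ROLE": "Junior", "ZIP": "10001", "ST": "NY", "SAL": 70_000},
}


def rule1_violations(table):
    """Rule 1: employees with the same zip code must be in the same state."""
    return [(i, j) for i in table for j in table
            if i < j and table[i]["ZIP"] == table[j]["ZIP"]
            and table[i]["ST"] != table[j]["ST"]]


def rule2_violations(table):
    """Rule 2: in the same state, a senior employee cannot earn a lower
    salary than a junior employee."""
    return [(i, j) for i in table for j in table
            if table[i]["ST"] == table[j]["ST"]
            and table[i]["ROLE"] == "Senior" and table[j]["ROLE"] == "Junior"
            and table[i]["SAL"] < table[j]["SAL"]]


print(rule1_violations(t), rule2_violations(t))  # [(1, 3)] [(1, 2)] -- the two violations
t[1]["ST"] = "NY"                                # the repair: t1[ST] changes from "NM" to "NY"
print(rule1_violations(t), rule2_violations(t))  # [] [] -- no violations remain
```

On this hypothetical instance, other repairs (for example, changing t1[ZIP] together with one of the salaries) would also remove both violations, which hints at why repair involves uncertainty and may consult external sources, as noted in the workflow above.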
1.2 Book Scope
The aforementioned workflow describes a general-purpose data cleaning process, but different data cleaning topics address one or more of its steps. We cover some of the most common and practical cleaning topics in this book: outlier detection, data deduplication, data transformation, rule-based data cleaning, ML-guided cleaning, and human-involved data cleaning. We briefly explain these topics in the following subsections; we also highlight the book structure in Section 1.2.7.
1.2.1 Outlier Detection
Outlier detection refers to detecting “outlying” values. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] and “an outlier observation is one that appears to deviate markedly from other members of