Data Cleaning. Ihab F. Ilyas
Чтение книги онлайн.
Читать онлайн книгу Data Cleaning - Ihab F. Ilyas страница 7
Applications of outlier detection include network intrusion detection, financial fraud detection, and abnormal medical condition detection. As a concrete example, imagine a company that is interested in improving road safety by making drivers more aware of their driving habits. To achieve this, data is collected from many hundreds of thousands of vehicles equipped with sensors connected with smart-phones. The collected data includes phone model, trip length, battery drain, and so on. Outlier detection and explanation engines such as Macrobase [Bailis et al. 2017] can be used to analyze the collected data. For example, Macrobase can find that some trips showed abnormally short lengths, common in smartphones with Apple iOS 9.0 beta 1. This reason was reported to the engineers, who discovered that a buggy Bluetooth stack was introduced in OS 9.0 beta 1, preventing iOS devices from connecting to in-car sensors.
Outlier detection faces two main challenges. First, defining what is a normal data pattern and what is an outlier can be difficult as different data and applications differ in what is considered normal. Many different detection techniques have been proposed to define normal behavior. Second, many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the curse of dimensionality.
1.2.2 Data Deduplication
Duplicate records occur for many reasons. Data deduplication, also known as duplicate detection, record linkage, record matching, or entity resolution, refers to the process of identifying tuples in one or more relations that refer to the same real-world entity. For example, a customer might be recorded multiple times in a customer database if the customer used different names when purchasing; a single item might be represented multiple times in an online shopping site; and duplicate records might appear after a data integration project because that record had different representations in original data sources. A data deduplication process usually involves many steps and choices, including designing similarity metrics to evaluate the similarity for a pair of records, training classifiers to determine whether a pair of records are duplicates, clustering all records to obtain clusters of records that represent the same real-world entity, consolidating clusters of records to unique representations, designing blocking or distributed strategies to scale up the deduplication process, and involving humans to decide whether a record pair are duplicates when machines are uncertain.
To address this, many open-source and commercial data deduplication tools have been built, such as Magellan [Konda et al. 2016] and Data Tamer [Stonebraker et al. 2013] (later commercialized as Tamr4). Fortune 500 companies use these tools to make sense of their large procurement data. Large companies often have multiple business units, and each unit buys many parts and products from many suppliers. These units might be buying the same part and product from the same supplier without each other’s knowledge, and therefore cannot get the best pricing. Furthermore, the same part and product might be named differently across different units, which complicates the deduplication process. A great deal of money can saved by identifying multiple records from the same supplier.
Achieving good precision and recall at the same time is difficult in data deduplication—declaring all pairs are duplicates achieves perfect recall but poor precision while declaring no pairs as duplicates achieves perfect precision, but poor recall. The problem is especially challenging given the myriad of design choices in designing a data deduplication workflow. Furthermore, data deduplication is inherently a combinatorial task that has quadratic complexity. For example, when done naively, comparing all pairs of only 1000 records requires 499,500 comparisons. In addition, grouping records that refer to the same real-world entity can be even harder. For example, correlation clustering used for grouping tuples that represent the same real-world entity is an NP-hard problem [Elsner and Schudy 2009].
1.2.3 Data Transformation
Data transformation refers to the task of transforming data from one format to another, for example, transforming phone numbers to a standard format by adding “-” in between digits. Data transformations can be seen as error repair activities and are used at various stages of data analytics. For example, before running a data integration project, transformations are often used to standardize data formats, enforce standard patterns, or trim long strings. Transformations are also used at the end of the ETL process, for example, to merge clusters of duplicate records.
Transform-Data-by-Example is an Excel add-in that finds the desired transformation easily for a given transformation task.5 Only a few examples of the desired output are needed, and Transform-Data-by-Example automatically finds relevant data transformation functions from a large collection that it has already indexed. This collection is acquired from a variety of sources, including Github source codes, Stackoverflow code snippets, and .Net libraries. Users can also extend the collection to cover domain-specific transformation capabilities by providing their own transformation functions.
Often, no ground truth is available to train or create accurate transformation solutions, so there might be an infinite number of applicable transformation programs. Although the desired transformation might be clear to a human expert, running all of these possibilities is expensive. Thus, practical data transformation tools must effectively prune the space interactively to minimize cost.
1.2.4 Rule-Based Data Cleaning
Another common way to detect and rectify errors in databases is through the enforcement of data quality rules, often expressed as integrity constraints (ICs) [Chu et al. 2013b, Kolahi and Lakshmanan 2009, Fan and Geerts 2012]. An example data quality is “For two employees working at the same branch of a company, the senior employee cannot have less vacation time than the junior employee.” Given a dataset, data quality rules can either be manually designed by domain experts who understand the semantics of the data, or be automatically mined from the data [Chu et al. 2013a, Chiang and Miller 2008]. In this context, (qualitative) error detection is the process of identifying violations of ICs, namely, subsets of records that cannot co-exist, and error repair is the exercise of modifying the database, such that the violations are resolved and the new data conforms to the given ICs.
Rule-based data cleaning techniques using denial constraints [Chu et al. 2013a, Chu et al. 2013b] have been proven to be extremely valuable in detecting and repairing errors in real data in consultation with animal scientists at UC Berkeley studying the effects of firewood cutting on small terrestrial vertebrates (birds, amphibians, reptiles, and small mammals) [Tietje et al. 2018]. Each record in the dataset contains information about capturing an animal, including the time and the location of the capture, the tag ID of the animal captured, and the properties of the animal at the time of the capture, such as weight, species type, gender, and age group. The dataset was collected over the past 20 years. Since every capture was first recorded in a log book and then later transcribed into Excel sheets, there are missing values, typos, and transcribing errors in the dataset. A dozen discovered data quality rules, expressed in the form of denial constraints [Baudinet et al. 1999], are used to detect and repair data errors. This helped the animal data scientists to correct hundreds of errors that would otherwise be very difficult to spot by humans.
Deriving a comprehensive set of integrity constraints that accurately reflects an organization’s policies and domain semantics is a difficult task. Data is often scattered across silos; for example, different departments may keep their personnel records separate and may have different policies to determine an employee’s salary. Furthermore, data quality rules of varying expressiveness can be found in an organization, ranging from ad-hoc program scripts to simple sanity checks. Rather than consulting domain experts, techniques to automatically discover ICs can be used, which need to balance the trade-off between the expressiveness of the ICs and the efficiency of discovery. Given a set of defined ICs and their