Example 2.1 Consider Table 2.1, which contains the name, age, income, and tax of employees. Domain knowledge suggests that the first value t1[age] and the last value t9[age] are outlying age values. We use Grubbs’ test with a significance level α = 0.05 to identify outliers in the age column.
The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. Grubbs’ test statistic is the largest absolute deviation from the mean divided by the standard deviation, G = maxi |xi − mean| / s; here the largest deviation is that of t9[age], and the resulting G exceeds the critical value of the test for n = 9 at α = 0.05, so t9[age] is reported as an outlier.
Removing t9[age], we are left with 8 age values. The mean of the 8 age values is 28.88, and the standard deviation of the 8 age values is 12.62. Grubbs’ test statistic is now largest for t1[age] and again exceeds the critical value (for n = 8 at α = 0.05), so t1[age] is also reported as an outlier.
Removing t1[age], we are left with 7 age values. The mean of the 7 age values is 32.86, and the standard deviation of the 7 age values is 6.15. Grubbs’ test statistic for the remaining values falls below the critical value, so no further outliers are reported and the procedure terminates.
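The iterative procedure of Example 2.1 is easy to script. Below is a minimal sketch in Python using NumPy and SciPy; since Table 2.1 is not reproduced here, the age values at the bottom are hypothetical stand-ins chosen to behave like the example, with two extreme values among otherwise ordinary ages.

```python
import numpy as np
from scipy import stats

def grubbs_outliers(values, alpha=0.05):
    """Iterated two-sided Grubbs' test: while the test statistic G exceeds the
    critical value, remove the value farthest from the mean and repeat."""
    vals = list(values)
    outliers = []
    while len(vals) > 2:
        arr = np.asarray(vals, dtype=float)
        n = len(arr)
        mean, std = arr.mean(), arr.std(ddof=1)   # sample standard deviation
        deviations = np.abs(arr - mean)
        idx = int(deviations.argmax())
        G = deviations[idx] / std                 # Grubbs' test statistic
        # Critical value of the two-sided Grubbs' test at significance level alpha,
        # based on the Student t distribution with n - 2 degrees of freedom.
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if G <= G_crit:
            break                                 # no (further) outlier detected
        outliers.append(vals.pop(idx))
    return outliers

# Hypothetical stand-ins for the age column of Table 2.1.
ages = [1, 25, 27, 30, 32, 35, 38, 43, 1000]
print(grubbs_outliers(ages))   # flags the two extreme values, largest first
```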
The previous discussion assumes that the data follows an approximately normal distribution. To assess whether this assumption holds, several graphical techniques can be used, including the Normal Probability Plot, the Run Sequence Plot, the Histogram, the Box Plot, and the Lag Plot.
Iglewicz and Hoaglin provide an extensive discussion of the outlier tests given above [Iglewicz and Hoaglin 1993]. Barnett and Lewis [1994] provide a book-length treatment of the subject, including additional tests for data that is not normally distributed.
2.2.3 Fitting Distribution: Parametric Approaches
The other type of statistics-based approach first fits a statistical distribution to describe the normal behavior of the given data points, and then applies a statistical inference procedure to decide whether a given data point is consistent with the learned model. Data points that have a low probability according to the learned model are declared to be outliers. In this section, we discuss parametric approaches for fitting a distribution to the data.
Univariate
We first consider univariate outlier detection, for example, for a set of values x1, x2, …, xn that appear in one column of a relational table. Assuming the data follows a normal distribution, fitting the values under a normal distribution essentially means computing the mean μ and the standard deviation σ from the current data points x1, x2, …, xn. Given μ and σ, a simple way to identify outliers is to compute a z-score for every xi, defined as the number of standard deviations xi is away from the mean, namely zi = |xi − μ| / σ. Data values whose z-score is greater than a chosen threshold, for example three, are declared to be outliers.
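A minimal sketch of this rule, assuming the column values are held in a NumPy array:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score (number of standard deviations away from the
    estimated mean) exceeds the given threshold."""
    arr = np.asarray(values, dtype=float)
    mu, sigma = arr.mean(), arr.std(ddof=1)   # both estimated from the data itself
    z = np.abs(arr - mu) / sigma
    return arr[z > threshold]
```

Note that μ and σ are estimated from the very data that may contain outliers, which is exactly what makes this simple rule fragile, as discussed next.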
Since there might be outliers among x1, x2, …, xn, the estimated μ and σ might be far from their actual values, which can cause outliers in the data to be missed, as we show in Example 2.2.
Example 2.2 Consider again the age column in Table 2.1. The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. A procedure that flags values more than 2 standard deviations away from the mean as outliers would mark values outside the range [136.78 − 2 * 323.92, 136.78 + 2 * 323.92] = [−511.06, 784.62]. The last value t9[age] is not in the range, and thus is correctly marked as an outlier. The first value t1[age], however, is in the range and is thus missed.
This effect is called masking [Hellerstein 2008]; that is, a single data point has severely shifted the mean and standard deviation so much as to mask other outliers. To mitigate the effect of masking, robust statistics are often employed, which can correctly capture important properties of the underlying distribution even in the face of many outliers in the data values. Intuitively, the breakdown point of an estimator is the proportion of incorrect data values (e.g., arbitrarily large or small values) an estimator can tolerate before giving an incorrect estimate. The mean and standard deviation have the lowest breakdown point: a single bad value can distort the mean completely.
Robust Univariate Statistics. We now introduce two robust statistics, the median and the median absolute deviation (MAD), which can replace the mean and the standard deviation, respectively. The median of a set of n data points is the data point for which half of the data points are smaller and half are larger; in the case of an even number of data points, the median is the average of the middle two data points. The median, also known as the 50th percentile, is of critical importance in robust statistics, with a breakdown point of 50%: as long as no more than half the data points are outliers, the median will not give an arbitrarily bad result. The median absolute deviation (MAD) is defined as the median of the absolute deviations from the data’s median, namely, MAD = mediani(|xi − medianj(xj)|). Like the median, MAD is a more robust statistic than the standard deviation. In the calculation of the standard deviation, the distances from the xi to the mean are squared, so large deviations, which are often caused by outliers, are weighted heavily. In the calculation of MAD, by contrast, the deviations of a small number of outliers are irrelevant, because MAD uses the median of the absolute deviations.
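To make the contrast concrete, the following sketch compares the two pairs of estimators on the same hypothetical age values used earlier (Table 2.1 is not reproduced here, so the exact numbers differ slightly from the running example):

```python
import numpy as np

# Hypothetical stand-ins for the age column: two extreme values among ordinary ages.
ages = np.array([1, 25, 27, 30, 32, 35, 38, 43, 1000], dtype=float)

mean, std = ages.mean(), ages.std(ddof=1)   # both dominated by the value 1000
median = np.median(ages)
mad = np.median(np.abs(ages - median))      # MAD = mediani(|xi - medianj(xj)|)

print(mean, std)     # roughly 136.8 and 323.9
print(median, mad)   # 32.0 and 6.0 -- barely affected by the two extreme values
```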
The median and MAD lead to a robust outlier detection technique known as Hampel X84 [Hampel et al. 2011] that is quite reliable in the face of outliers since it has a breakdown point of 50%. Hampel X84 marks outliers as those data points that are more than 1.4826θ MADs away from the median, where θ is the number of standard deviations away from the mean one would have used if there were no outliers in the dataset. The constant 1.4826 is derived under a normal distribution, where one standard deviation away from the mean is about 1.4826 MADs.
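A minimal sketch of the Hampel X84 rule follows, reusing the hypothetical age values from above; the resulting normal range therefore differs slightly from the one derived from Table 2.1 in Example 2.3 below.

```python
import numpy as np

def hampel_x84(values, theta=2.0):
    """Flag values more than 1.4826 * theta MADs away from the median; theta plays
    the role of the number of standard deviations used by the mean/std-based rule."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    k = 1.4826 * theta * mad
    lower, upper = median - k, median + k    # the "normal" value range
    return arr[(arr < lower) | (arr > upper)], (lower, upper)

# Hypothetical stand-ins for the age column of Table 2.1.
outliers, normal_range = hampel_x84([1, 25, 27, 30, 32, 35, 38, 43, 1000], theta=2.0)
print(outliers, normal_range)   # flags 1.0 and 1000.0; range is about (14.2, 49.8)
```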
Example 2.3 For detecting outliers in the age column in Table 2.1, we used 2 standard deviations away from the mean in Example 2.2. Therefore, we would now like to flag points that are more than 1.4826 * 2 = 2.9652 MADs away from the median as outliers.

The median of the set of values in Example 2.2 is 32 and the MAD is 7. The normal value range is therefore [32 − 7 * 2.9652, 32 + 7 * 2.9652] = [11.2436, 52.7564], which is a much more reasonable range than the [−511.06, 784.62] derived using the mean and standard deviation. Under the new normal range, the first value t1[age] and the last value t9[age] are both correctly flagged as outliers.
Multivariate
So far, we have considered detecting univariate outliers, that is, outliers in a single column of a table.