Data Cleaning. Ihab F. Ilyas


[Wessa 2012]. We report data points residing in bins that contain fewer than three data points as outliers.

      In Figure 2.2(a), the age values are split into 10 equal-width bins, namely (0, 100], (100, 200], …, (900, 1000]. Only the first bin and the last bin contain data points. Specifically, bin (0, 100] contains 8 points and bin (900, 1000] contains 1 point. Therefore, we report t9[age] ∈ (900, 1000] as an outlier.

      In Figure 2.2(b), the age values are split into 50 equal-width bins, namely (0, 20], (20, 40], …, (980, 1000]. Only the first three bins and the last bin contain data points. Specifically, bin (0, 20] contains 1 point, bin (20, 40] contains 6 points, bin (40, 60] contains 1 point, and bin (980, 1000] contains 1 point. Therefore, we report t1[age] ∈ (0, 20], t8[age] ∈ (40, 60], and t9[age] ∈ (980, 1000] as outliers.
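The equi-width binning procedure above can be sketched as follows; the function name, the sample ages, and the default threshold of three points per bin are illustrative, chosen to mirror the examples in the text.

```python
import numpy as np

def histogram_outliers(values, lo, hi, n_bins, min_count=3):
    """Flag values falling in equi-width bins (lo, lo+w], (lo+w, lo+2w], ...
    that hold fewer than min_count points, as in the examples above."""
    values = np.asarray(values, dtype=float)
    width = (hi - lo) / n_bins
    # Right-closed bin index: a value v falls in bin ceil((v - lo) / width) - 1.
    idx = np.clip(np.ceil((values - lo) / width).astype(int) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    return values[counts[idx] < min_count]

# Hypothetical ages consistent with Figure 2.2: with 50 bins, three sparse
# bins are flagged; with 10 coarser bins, only the extreme value stands out.
ages = [5, 25, 26, 27, 28, 29, 30, 50, 990]
print(histogram_outliers(ages, 0, 1000, 50))  # flags 5, 50, and 990
print(histogram_outliers(ages, 0, 1000, 10))  # flags only 990
```

As in the figure, the choice of bin count determines which points land in sparse bins, so the same data yields different outlier sets at different resolutions.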

      There are usually two ways to generalize a histogram to deal with multivariate data: (1) compute an outlier score for each dimension using a one-dimensional histogram, and then combine the per-dimension scores into an overall outlier score for every data point; or (2) bin every dimension simultaneously to partition the space into a multidimensional grid, count the number of data points belonging to each grid cell, and report the data points that belong to cells with very low frequency counts as outliers. The performance of histogram methods usually deteriorates with increasing number of dimensions due to the sparsity of the grid structure with increasing dimensionality [Aggarwal 2013]. For example, given a 10-dimensional space with each dimension being split into 10 bins, the grid will contain 10^10 grid cells. It would require 10^10 available data points to allow every grid cell to have one data point in expectation.
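Approach (2), joint binning over a multidimensional grid, can be sketched as below; the cell width and the threshold of three points per cell are illustrative assumptions, and counting with a dictionary avoids materializing the full (and typically very sparse) grid.

```python
import numpy as np
from collections import Counter

def grid_outliers(points, lo, hi, bins_per_dim, min_count=3):
    """Bin every dimension jointly into equi-width cells and flag points
    whose grid cell holds fewer than min_count points."""
    pts = np.asarray(points, dtype=float)
    width = (hi - lo) / bins_per_dim
    idx = np.clip(((pts - lo) // width).astype(int), 0, bins_per_dim - 1)
    # Count only the occupied cells, keyed by the tuple of per-dimension bins.
    cells = Counter(map(tuple, idx))
    mask = np.array([cells[tuple(i)] < min_count for i in idx])
    return pts[mask]

# A 2-D cluster of five points plus one isolated point: only the isolated
# point falls in a cell with a very low count.
pts = [[11, 11], [12, 12], [13, 13], [14, 14], [15, 15], [90, 90]]
print(grid_outliers(pts, 0, 100, 10))  # flags [90, 90]
```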

      Kernel Density Estimation

      KDE [Rosenblatt 1956, Parzen 1962, Gan and Bailis 2017] is a nonparametric way to estimate the pdf f(x) of a random variable X based on a sample of n data points. Like a histogram, it is built directly from the sample, but it can be endowed with properties such as smoothness or continuity by using a suitable kernel.

      Let x1, x2, …, xn be an independent and identically distributed sample drawn from some distribution with unknown pdf f(x). KDE estimates f(x) using the estimator f̂h(x) defined as follows:

      f̂h(x) = (1/n) Σi=1,…,n Kh(x − xi) = (1/(nh)) Σi=1,…,n K((x − xi)/h)

      where K(•) is the kernel, and h > 0 is a smoothing parameter called the bandwidth. A kernel has to be a non-negative function that integrates to one and has mean zero, namely, ∫ K(x) dx = 1 and K(x) = K(−x). A kernel with a subscript h is a scaled kernel that satisfies Kh(x) = (1/h)K(x/h). Some commonly used kernel functions include the normal (Gaussian), uniform, triangular, biweight, triweight, and Epanechnikov kernels. Just as the choice of the width of the bin influences the result of histograms, the bandwidth of the kernel is a free parameter that strongly influences the KDE as well.
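The kernel properties above can be checked numerically for the Gaussian kernel; this is a small sanity-check sketch, not part of the book's material, and the grid bounds are illustrative.

```python
import numpy as np

def K(x):
    """Gaussian kernel: non-negative, symmetric, and integrates to one."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def K_h(x, h):
    """Scaled kernel Kh(x) = (1/h) K(x/h)."""
    return K(x / h) / h

xs = np.linspace(-10, 10, 100001)
dx = xs[1] - xs[0]
print(K(xs).sum() * dx)          # Riemann sum of the integral, close to 1
print(K(1.5) == K(-1.5))         # symmetric, hence mean zero
print(K_h(xs, 2.0).sum() * dx)   # scaling by h preserves the unit integral
```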

      Example 2.6 To compare KDEs with histograms, consider 9 data points x1 = 1, x2 = 1, x3 = 2, x4 = 2, x5 = 2, x6 = 4, x7 = 6, x8 = 9, and x9 = 10. Figure 2.3 compares the pdf derived using both the equi-width histogram with a bin width equal to 2 and three KDEs using different kernel functions and bandwidths. As we can see, KDEs in general produce much smoother estimates than the histogram. Comparing Figures 2.3(b), 2.3(c), and 2.3(d), we can see that the bandwidth clearly controls the smoothness of the pdf produced. Given the estimated pdf, any points with low estimated density are flagged as outliers.
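The estimator f̂h(x) from the formula above can be evaluated on the nine data points of Example 2.6; a Gaussian kernel is used here as one of the common choices, and the evaluation points are illustrative.

```python
import numpy as np

def kde(x, sample, h):
    """Evaluate f̂h(x) = (1/(nh)) Σ K((x − xi)/h) with a Gaussian kernel."""
    sample = np.asarray(sample, dtype=float)
    u = (x - sample) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).sum() / (len(sample) * h)

data = [1, 1, 2, 2, 2, 4, 6, 9, 10]   # the nine points of Example 2.6
print(kde(2.0, data, h=1.0))   # high density: x = 2 sits in the cluster
print(kde(7.5, data, h=1.0))   # low density: the gap between 6 and 9
```

A low-density threshold on f̂h would then flag points lying in sparse regions, mirroring the outlier rule stated at the end of the example.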

      The generalization of multivariate KDE from univariate KDE is similar to the generalization of multivariate histograms from univariate histograms, and is discussed in detail in Simonoff [2012].

[Figure 2.3: Comparison of the equi-width histogram and KDEs with different kernel functions and bandwidths.]

      In contrast to statistics-based outlier detection techniques, distance-based outlier detection techniques do not assume an underlying model that generates the data and are often nonparametric. These approaches define a distance between data points and use it to characterize normal behavior, for example, that a normal data point should be close to many other data points; any data point that deviates from such normal behavior is declared an outlier [Knorr and Ng 1998, 1999, Breunig et al. 2000].

      Distance-based outlier detection methods can be further divided into global or local methods depending on the reference population used when determining whether a point is an outlier. A global distance-based outlier detection method determines whether a point is an outlier based on the distance between that data point and all other data points in the dataset. On the other hand, a local method considers the distance between a point and its neighborhood points when determining outliers.

      2.3.1 Global Distance-Based Outlier Detection

      Distance-based methods were first introduced by Knorr and Ng [Knorr and Ng 1998, 1999]. An object O in a dataset I is a distance-based outlier, i.e., a DB(p, D) outlier, if at least a fraction p of the objects in I lie at distance greater than D from O. DB(p, D) can be seen as a generalization of some existing statistics-based outlier detection definitions. For example, let I consist of observations from a univariate normal distribution N(μ, δ) and O be a point from I; then the z-score of O is greater than 3 if and only if O is a DB(0.9988, 0.13δ) outlier [Knorr and Ng 1998]. Knorr and Ng proved that statistics-based outlier detection definitions based on other underlying distributions, such as the Student t-distribution and the Poisson distribution, are also equivalent to DB(p, D) outliers with suitable choices of p and D.
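The DB(p, D) definition translates directly into a brute-force check; this is a quadratic-time sketch using Euclidean distance, with the test data and parameter values chosen for illustration.

```python
import numpy as np

def db_outliers(points, p, D):
    """Return indices of DB(p, D) outliers: points for which at least a
    fraction p of all objects lie at distance greater than D."""
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via broadcasting: O(n^2) space and time.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    far_fraction = (dist > D).mean(axis=1)
    return np.flatnonzero(far_fraction >= p)

# Nine tightly clustered 2-D points and one isolated point: with p = 0.9 and
# D = 50, only the isolated point (index 9) qualifies as a DB(p, D) outlier.
pts = [[i * 0.1, 0.0] for i in range(9)] + [[100.0, 0.0]]
print(db_outliers(pts, p=0.9, D=50))  # flags index 9
```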

      The definition of DB(p, D) does not provide a ranking or a score of outliers, and it requires specifying a distance parameter D, which could be difficult to determine and may involve a trial-and-error process. There are other distance-based outlier definitions that look at the k nearest neighbors of a data point. Kollios et al. [2003] define an object O as an outlier in a dataset I if at most k objects in I lie at distance at most D from O. Ramaswamy et al. [2000] define outliers as the top few data elements whose distance to the kth nearest neighbor is greatest.
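The ranking-based definition in the spirit of Ramaswamy et al. [2000] can be sketched as below; the brute-force distance matrix and the parameter values are illustrative.

```python
import numpy as np

def top_kth_nn_outliers(points, k, m):
    """Rank points by the distance to their k-th nearest neighbor and
    return the indices of the m points with the largest such distance."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # After sorting each row, column 0 is the self-distance (0), so
    # column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dist, axis=1)[:, k]
    return np.argsort(-kth)[:m]

# The isolated point's 2nd-nearest-neighbor distance dwarfs the cluster's.
pts = [[i * 0.1, 0.0] for i in range(9)] + [[100.0, 0.0]]
print(top_kth_nn_outliers(pts, k=2, m=1))  # flags index 9
```

Unlike DB(p, D), this formulation needs no distance threshold D: it yields a score per point, so the analyst only picks k and how many outliers to report.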

      The simplest category of algorithms [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000] for finding distance-based outliers uses nested loops to compute the distance between every point and every other point, which has quadratic complexity. Another category of algorithms uses spatial index structures such as KD-trees, R-trees, or X-trees to find the nearest neighbors of each point [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000]. This category of algorithms works well for low-dimensional datasets, and has a complexity of O(n log n) if the index tree allows an O(log n) lookup for finding a point's nearest neighbor. However, the index structures become increasingly complex as the dimensionality increases. For example, Breunig et al. [2000] used an X-tree variant to do a nearest neighbor search; they found
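The index-based category can be illustrated with a KD-tree; this sketch assumes SciPy is available and computes the same kth-nearest-neighbor distances as the nested-loop approach, but through indexed lookups rather than a full distance matrix.

```python
import numpy as np
from scipy.spatial import cKDTree  # KD-tree index; assumes SciPy is installed

def kth_nn_dist_indexed(points, k):
    """Distance from each point to its k-th nearest neighbor, answered via a
    KD-tree: roughly O(n log n) on low-dimensional data."""
    pts = np.asarray(points, dtype=float)
    tree = cKDTree(pts)
    # Query k + 1 neighbors because each point's nearest neighbor in its
    # own tree is itself, at distance 0.
    dist, _ = tree.query(pts, k=k + 1)
    return dist[:, k]

# The isolated point has by far the largest nearest-neighbor distance.
pts = [[i * 0.1, 0.0] for i in range(9)] + [[100.0, 0.0]]
d = kth_nn_dist_indexed(pts, k=1)
print(int(np.argmax(d)))  # index 9
```

As the text notes, this advantage erodes in high dimensions, where KD-tree queries degrade toward the brute-force cost.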
