Data Cleaning. Ihab F. Ilyas


data to a normal distribution. However, some outliers can only be revealed when considering multiple dimensions; we call these multivariate outliers.

      Suppose we have a table that has n rows and d columns; we model each column as a random variable Xi, ∀i ∈ [1, d]. Let the mean of the d variables be a d-dimensional vector μ = (μ1, μ2, …, μd)T and the covariance matrix be Σ, where Σij is the covariance between the two random variables Xi and Xj. Given the mean and the covariance matrix, the questions are how to measure the distance between a particular data point and the mean using the covariance matrix, and how large that distance must be to qualify the point as an outlier. The Mahalanobis distance [Mahalanobis 1936] is the multidimensional generalization of measuring how many standard deviations away a point is from the mean of the population. Specifically, the Mahalanobis distance of a point x to the mean vector μ is defined as Mahalanobis(x) = √((x − μ)T Σ−1 (x − μ)). Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ. If every dimension is a normal distribution, then the square of the Mahalanobis distance follows a χ2 distribution with d degrees of freedom. Using the cumulative distribution function of the χ2 distribution, we set the threshold of the Mahalanobis distance to the value below which most (e.g., 95%) of the distribution's mass falls.
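This procedure can be sketched in a few lines of NumPy and SciPy (the function name and the 95% cutoff below are illustrative choices, not from the text):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.95):
    """Flag rows of X whose squared Mahalanobis distance exceeds the
    chi-square quantile with d degrees of freedom (illustrative sketch)."""
    mu = X.mean(axis=0)                              # d-dimensional mean vector
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False)) # inverse covariance matrix
    diff = X - mu
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu), per row
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    threshold = chi2.ppf(alpha, df=X.shape[1])       # chi-square cutoff
    return d2 > threshold
```

A point far from the bulk of the data in every dimension will exceed the χ2 cutoff even when each of its coordinates looks unremarkable in isolation.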

      Robust Multivariate Statistics. Similar to the univariate mean and variance, the multivariate mean and covariance matrix are not robust, i.e., a single bad data point might severely distort the mean and the covariance matrix, and thus robust estimators must be used. The most famous method for the robustification of the multivariate mean and covariance matrix is the Minimum Covariance Determinant (MCD) by Rousseeuw and Driessen [1999]. MCD chooses the subset of h points that minimizes the determinant of the covariance matrix over all possible subsets of size h (h is usually chosen to reflect the belief about how many points in the dataset are outliers). For example, if we assume 10% of the n points are outliers, then h is chosen to be 0.9n. The multivariate mean and covariance matrix of the h points are used to compute the Mahalanobis distance for all n points. A brute-force way to find the covariance matrix with the smallest determinant is to enumerate all possible subsets of size h from the n data points, which is prohibitively expensive. A greedy algorithm called FASTMCD [Rousseeuw and Driessen 1999] addresses this efficiency problem. Algorithm 2.1 describes FASTMCD. It starts by randomly selecting a subset of h data points, and the mean μ and the covariance matrix Σ are computed from this random sample. Then, the Mahalanobis distances for all n points are computed using μ and Σ. The sample is replaced by the h points with the smallest Mahalanobis distances, and this update is iterated until the sample remains unchanged between two consecutive iterations. The whole process is repeated with different random samples as initial seeds, and the best result is kept.

[Algorithm 2.1: The FASTMCD algorithm]

      The correctness of the algorithm can be proven by showing that det(H1) ≤ det(H0) in the iterative portion. The mean and the covariance matrix computed using FASTMCD can be shown to have a 50% breakdown point [Rousseeuw and Driessen 1999].
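The iterative portion (often called the C-step) can be sketched as follows; the function name, the number of restarts, and the iteration cap are illustrative choices, not part of the algorithm as stated:

```python
import numpy as np

def fast_mcd(X, h, n_starts=10, max_iter=100, seed=0):
    """Illustrative sketch of FASTMCD: repeatedly re-fit the mean and
    covariance on the h points with the smallest Mahalanobis distances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best = None
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)       # random initial subset
        for _ in range(max_iter):
            mu = X[idx].mean(axis=0)
            cov = np.cov(X[idx], rowvar=False)
            diff = X - mu
            # Squared Mahalanobis distance of every point under (mu, cov)
            d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            new_idx = np.argsort(d2)[:h]                 # keep the h closest points
            if np.array_equal(np.sort(new_idx), np.sort(idx)):
                break                                    # subset stable: converged
            idx = new_idx
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if best is None or det < best[0]:                # keep smallest determinant
            best = (det, X[idx].mean(axis=0), np.cov(X[idx], rowvar=False))
    return best[1], best[2]
```

Because each C-step can only decrease the determinant, every restart terminates, and taking the best over several restarts guards against a poor initial sample.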

      Most of our discussions of parametric approaches have been based on normal distributions. In practice, not all datasets are normally distributed. Two other distributions are frequently observed: multimodal distributions and Zipfian distributions. In some cases, data appears to have many “peaks”; such distributions are typically referred to as multimodal. Oftentimes, these multimodal distributions can be seen as superimposed normal distributions, known as a mixture of Gaussians. In other cases, a large fraction of the data is condensed into a small fraction of values, while the remainder of the data values are spread across a long tail of rare values. The Zipfian distribution exhibits such a phenomenon. It is important to choose the right distribution to detect outliers using parametric approaches.
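To illustrate, a multimodal dataset can be simulated as a two-component mixture of Gaussians; the weights and component parameters below are invented for the example:

```python
import numpy as np

# Hypothetical mixture: 60% of points from N(0, 1), 40% from N(10, 1.5).
rng = np.random.default_rng(0)
component = rng.choice(2, size=5000, p=[0.6, 0.4])       # pick a component per draw
means = np.array([0.0, 10.0])
stds = np.array([1.0, 1.5])
samples = rng.normal(means[component], stds[component])  # sample from that component
```

A histogram of these samples shows two clear peaks; fitting a single normal distribution to such data would place the mean between the peaks, in a region containing almost no points.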

      An obvious drawback of using parametric approaches for fitting a distribution is that they assume the data to follow an underlying distribution. In contrast, non-parametric approaches make no assumption about the distribution that generates the data; instead, they infer the distribution from the data itself. There are two main types of techniques in this category: histogram-based techniques and kernel density-based techniques.

      Histograms

      A histogram [Pearson 1894] is a graphical representation of the distribution of numerical data values, and is often used to estimate the probability distribution of a continuous random variable. The first step toward creating a histogram is to discretize or bin the range of values by dividing the range of the random variable into a series of intervals, which are usually consecutive and non-overlapping. The second step is to count how many values fall into each bin. If the bins are of equal size, each bin is drawn with height proportional to its frequency, i.e., the number of values it contains; this type of histogram is called an equi-width histogram. In general, however, bins can be of varying width. If the widths of the bins are chosen such that every bin has the same frequency, the histogram is called an equi-depth histogram.
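The two binning schemes can be contrasted in a few lines of NumPy (the sample values are invented for illustration):

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 8, 9, 20])

# Equi-width: 4 bins of equal size spanning the value range
counts_w, edges_w = np.histogram(values, bins=4)

# Equi-depth: bin edges placed at quantiles, so each bin holds
# (roughly) the same number of values
edges_d = np.quantile(values, np.linspace(0, 1, 5))
```

With skewed data such as this, the equi-width bins have very uneven counts, while the equi-depth edges crowd together where the data is dense and spread apart over the long tail.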

      Equi-width histograms can be used to detect outliers; data points that belong to bins with very low frequency are reported as outliers. A major challenge with histogram-based outlier detection approaches is that it is often very difficult to come up with the “right” width of the interval. If the width of the bins is too narrow, the frequency of some bins might be low, and normal data points belonging to those bins are reported as outliers; if the width of the bins is too wide, outlier data points might get absorbed into bins that also hold normal data points, and the real outliers go unreported. The problem is further exacerbated when adapting histograms to multivariate data, since the frequency of every division diminishes as the number of dimensions increases, a phenomenon mentioned earlier as the curse of dimensionality.
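A minimal sketch of such a detector follows; the function name, the number of bins, and the frequency cutoff are illustrative assumptions, and varying n_bins demonstrates exactly the sensitivity described above:

```python
import numpy as np

def histogram_outliers(values, n_bins=10, min_frequency=2):
    """Flag values that fall into equi-width bins with very low frequency
    (bin count and frequency cutoff are illustrative choices)."""
    counts, edges = np.histogram(values, bins=n_bins)
    # Map each value to its bin; clip so the maximum value lands in the last bin
    bin_idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    return counts[bin_idx] < min_frequency
```

With too many bins, even dense regions split into sparse bins and normal values get flagged; with too few, an isolated value can share a wide bin with normal data and escape detection.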

[Figure 2.2: Two equi-width histograms for the age column in Table 2.1]

      Example 2.5 Figure 2.2 shows two equi-width histograms we built for the age column in Table 2.1 using
