Data Cleaning. Ihab F. Ilyas

Чтение книги онлайн.

Читать онлайн книгу Data Cleaning - Ihab F. Ilyas страница 10

Автор:
Серия:
Издательство:
Data Cleaning - Ihab F. Ilyas

Скачать книгу

The testing phase of model-based techniques is fast, since each test instance needs to be compared against the precomputed model.

      The major disadvantages of model-based techniques is that they must rely on the availability of accurate labels for various normal classes, which is often not possible.

      In this section, we discuss statistics-based outlier detection methods. First, we review some basics about data distributions in Section 2.2.1. Section 2.2.2 then shows how hypothesis testings, such as Grubbs’ Test, are used for detecting outliers. In Section 2.2.3, we present parametric approaches for fitting a distribution to the data, and also introduce the notion of robust statistics, which can capture important properties of the underlying distribution in the presence of outliers in the data. In Section 2.2.4, we introduce nonparametric approaches for fitting a distribution to the data by using histograms and kernel density estimations (KDEs).

      Database practitioners usually think of data as a collection of records. They are interested in descriptive statistics about the data (e.g., “what is the min, max, and average salary for all the employees?”), and they usually do not care much about how the data is generated. On the other hand, statisticians usually treat data as a sample of some data-generating process. In many cases, they try to find a model or a distribution of that process that best describes the available sample data.

      The most commonly used distribution is the Gaussian or normal distribution, which is characterized by a mean μ and a standard variance σ [Barnett and Lewis 1994]. While the normal distribution might not be the exact distribution of a given dataset, it is the cornerstone of modern statistics, and it accurately models the sum of many independently identically distributed variables according to the central limit theorem. The following equation shows the pdf of the normal distribution N (μ, σ2):

image

      Multivariate Gaussian distribution or multivariate normal distribution is a generalization of the univariate normal distribution to higher dimensions. A random vector consisting of d random variables Xi, where i ∊ [1, d], is said to be d-variate normally distributed if every linear combination of its d components has a univariate normal distribution. The mean of the d variables is a d-dimensional vector μ = (μ1, μ2, …, μd)T, where μi is the mean of the random variable Xi. The covariance matrix Σ of these d random variables (also known as dispersion matrix or variance-covariance matrix) is a matrix, where Σί,j (Row i and Column j of Σ) is the covariance between the two random variables Xi and Xj, namely, Σi, j = cov(Xi, Xj) = E[(Xiμiμj]. Intuitively, the covariance matrix generalizes the notion of variance of a single random variable or the covariance of two random variables to d dimensions.

      A statistical hypothesis is an assumption about a population that may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses, which usually consists of the following steps [Lehmann and Romano 2006]: (1) formulate the null and the alternative hypothesis; (2) decide which test is appropriate and state the relevant test statistic T; (3) choose a significance level α, namely, a probability threshold below which the null hypothesis will be rejected; (4) determine the critical region for the test statistic T, namely, the range of values of the test statistic for which the null hypothesis is rejected; (5) compute from the current data the observed value tobs of the test statistic T; and (6) reject the null hypothesis if tobs falls under the critical region. Different known test statistics exist for testing different properties of a population. For example, the chi-square test for independence is used to determine whether there is a significant relationship between two categorical variables, the one-sample t-test is commonly used to determine whether the hypothesized mean differs significantly from the observed sample mean, and the two-sample t-test is used to determine whether the difference between two means found in two samples is significantly different from the hypothesized difference between two means.

      A number of formal hypothesis tests for detecting outliers have been proposed in the literature. These tests usually differ according to the following characteristics.

      • What is the underlying distribution model for the data?

      • Is the test designed for testing a single outlier or multiple outliers?

      • If the test is for detecting multiple outliers, does the number of outliers need to be specified exactly in the test?

      For example, the Grubbs Test [Grubbs 1969] is used to detect a single outlier in a univariate dataset that follows an approximately normal distribution. The Tietjen-Moore Test [Tietjen and Moore 1972] is a generalization of the Grubbs test to the case of more than one outlier; however, it requires the number of outliers to be specified exactly. The Generalized Extreme Studentized Deviate (ESD) Test [Rosner 1983] requires only an upper bound on the suspected number of outliers and is often used when the exact number of outliers is unknown.

      We use Grubbs’ test as an example to show how hypothesis testing is used for detecting outliers, as it is a standard test when testing for a single outlier. This outlier is expunged from the dataset and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test is not to be used for a small sample size. Grubbs’ test is defined for the following hypothesis, where H0 is the null hypothesis and Ha is the alternative hypothesis:

      H0 : there are no outliers in the dataset

      Ha : there is exactly one outlier in the dataset.

      Grubbs’ test statistic is defined as image with image and s denoting the sample mean and standard deviation, respectively. Grubbs’ test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation. The above test statistic is the two-sided version of the test. Grubbs’ test can also be used as a one-sided test for determining whether the minimum value is an outlier or the maximum value is an outlier. To test whether the minimum value Ymin is an outlier, the test statistic is defined as image; similarly, to test whether the maximum value Ymax is an outlier, the test statistic is defined as image. For the two-sided test, the null hypothesis of no outliers is rejected if image with image denoting the upper critical value of the t-distribution with N − 2 degrees of freedom and a significance level of α/2N. For a one-sided test, replace α/2N with α/N.

Скачать книгу