      Consider a dataset with two dimensions: max_daily_temp, which denotes the maximum outside temperature on a given day, and num_fever, which denotes how many people are admitted to an emergency room due to high fever on that day. Clearly, max_daily_temp is not a direct indicator of a disease outbreak; however, it should be taken into account as an environmental attribute when detecting outliers, since max_daily_temp directly affects num_fever.

      A dataset is shown in Figure 2.7. Point A is considered an outlier by most outlier detection methods if we consider num_fever alone, since it has an abnormally high value for num_fever. However, Point A is not interesting for monitoring disease outbreaks, since num_fever is expected to be high on a cold day, and Point A falls on such a day.

      Point B will usually not be considered an outlier if we consider num_fever alone. However, it is an interesting outlier for monitoring disease outbreaks, since it has an abnormally high num_fever for its max_daily_temp.

      The HOT algorithm proposed by Wei et al. [2003] and the COD algorithm proposed by Tang et al. [2013] explore contextual outliers on categorical datasets, where the environmental attributes and the behavioral attributes are not given. Both approaches consist of two steps: context enumeration and detection of contextual outliers within each enumerated context. Contexts in both approaches are essentially conjunctions of predicates, namely, attribute-value pairs, and they are enumerated using a lattice data structure. An example candidate context is A = a1 ∧ B = b1, which contains all the tuples that have value a1 for attribute A and value b1 for attribute B. The differences between the two approaches lie in the space of contexts they consider and in how outliers are defined within a context. Algorithm 2.4 gives the details of the HOT algorithm as a concrete example. First, all contexts are enumerated and frequent contexts are mined using the Apriori algorithm [Agrawal and Srikant 1994]. For each frequent context C, the histogram of frequencies associated with each attribute A not in C is stored, which is then used to compute the deviation of each value taken by A in the database. Finally, the objects taking a value on attribute A whose deviation is smaller than a user-provided threshold are returned as outliers. Example 2.12 shows an example contextual outlier detected by Algorithm 2.4.
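      To make the two-step structure concrete, the following Python sketch mirrors the flow of Algorithm 2.4 under simplifying assumptions: contexts are limited to small conjunctions, and the deviation of a value is approximated by its frequency relative to the mean frequency within the context. The function and parameter names (min_support, dev_threshold, max_context_size) are illustrative, not taken from the original algorithm.

```python
from collections import Counter
from itertools import combinations

def hot_sketch(rows, min_support=3, dev_threshold=0.3, max_context_size=2):
    """Contextual outlier detection in the spirit of Algorithm 2.4 (HOT).

    rows: list of dicts mapping attribute name -> categorical value.
    A context is a conjunction of attribute-value pairs; the deviation
    measure below (frequency relative to the mean frequency within the
    context) is a simplification of the one used by HOT.
    """
    attrs = list(rows[0].keys())
    outliers = []
    # Step 1: enumerate candidate contexts and keep the frequent ones.
    for size in range(1, max_context_size + 1):
        for ctx_attrs in combinations(attrs, size):
            contexts = Counter(tuple(r[a] for a in ctx_attrs) for r in rows)
            for ctx_vals, support in contexts.items():
                if support < min_support:
                    continue  # infrequent context: prune (Apriori-style)
                members = [r for r in rows
                           if all(r[a] == v
                                  for a, v in zip(ctx_attrs, ctx_vals))]
                # Step 2: within the context, build a frequency histogram
                # for every attribute A not in the context, and flag
                # objects whose value has a deviation below the threshold.
                for a in attrs:
                    if a in ctx_attrs:
                        continue
                    hist = Counter(r[a] for r in members)
                    mean_freq = sum(hist.values()) / len(hist)
                    for r in members:
                        if hist[r[a]] / mean_freq < dev_threshold:
                            outliers.append(
                                (dict(zip(ctx_attrs, ctx_vals)), a, r))
    return outliers
```

      Because every frequent context is fully materialized before any outlier is scored, the sketch also illustrates the cost problem discussed below: the number of candidate contexts grows exponentially with the number of attributes.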


      Despite recent advances in contextual outlier detection, many challenges remain. (1) Most current contextual outlier detection techniques assume the attributes in the data to be either entirely categorical [Wei et al. 2003, Angiulli et al. 2009, Tang et al. 2013] or entirely numerical [Song et al. 2007, Liang and Parthasarathy 2016]. How to handle mixed attribute types in contextual outlier detection remains an open question. (2) Context enumeration is usually expensive, since the number of contexts is exponential in the number of environmental attributes. Current techniques treat context enumeration and outlier detection within each context as two separate processes with little interaction [Wei et al. 2003, Tang et al. 2013]. One future direction is to interleave the two processes so that contexts containing no outliers can be pruned without actually running an outlier detection method on them. (3) A data point might be considered a contextual outlier under different choices of environmental and behavioral attributes. In most scenarios, users must investigate the outliers reported by an automatic technique. Therefore, we need techniques to filter, rank, and report the different contextual outliers discovered, so that the most interesting ones are examined first.

      Outlier detection is a rich topic that has been studied within several communities and application domains, and there exist several surveys and books on the topic [Barnett and Lewis 1994, Hellerstein 2008, Chandola et al. 2009, Aggarwal 2013]. This chapter is complementary to those surveys. Our goals are: (1) to provide a classification of the major types of outlier detection techniques, namely, statistics-based methods, distance-based methods, and model-based methods; and (2) to discuss outlier detection in high-dimensional data.

      Statistics-based outlier detection techniques assume that normal data points appear in the high-probability regions of a stochastic model, while outliers occur in its low-probability regions. We distinguish between two types of statistics-based approaches: hypothesis testing and distribution fitting. The hypothesis testing approach usually calculates a test statistic from the observed data points to determine whether the null hypothesis (there is no outlier in the dataset) should be rejected. The distribution fitting approach, as the name suggests, fits a pdf to the observed data and marks as outliers those points with low probability according to the pdf. Distance-based outlier detection techniques rely on a definition of distance between data points. In general, normal points should be close to each other, and outlying points should be distant from normal points. Distance-based methods can be further divided into global methods and local methods, depending on the reference population used when determining whether a point is an outlier. Model-based approaches follow the classic ML paradigm of learning one or more classifiers to distinguish between normal points and outliers; they depend on the availability of labeled training data. We further discuss techniques for handling high-dimensional data, including dimensionality reduction, subspace outlier detection, and contextual outlier detection.
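      As a concrete illustration of the global distance-based idea, the following sketch scores each point by the distance to its k-th nearest neighbor and reports the highest-scoring points. The parameter names and defaults are illustrative assumptions, not values prescribed in this chapter.

```python
import numpy as np

def knn_distance_outliers(X, k=5, top_n=3):
    """Global distance-based scoring: a point's outlier score is the
    distance to its k-th nearest neighbor. Assumes k < number of points;
    k and top_n are illustrative defaults."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small n; O(n^2) memory).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)          # ignore self-distance
    kth = np.sort(dists, axis=1)[:, k - 1]   # distance to k-th neighbor
    return np.argsort(-kth)[:top_n]          # indices of strongest outliers
```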

      As discussed, outlier detection techniques define “normal” differently and make different assumptions about the underlying dataset. It is important for practitioners to choose the outlier detection methods that best match their use case. If a probabilistic interpretation of the results is needed, then statistics-based techniques provide a formal framework for statistical reasoning. If the underlying data follows a known distribution, such as a normal distribution, then techniques such as Grubbs’ test or the z-score are applicable, as shown in Section 2.2.2; otherwise, KDE, as shown in Section 2.2.4, provides a nonparametric way of estimating the probability distribution, which can then be used for outlier detection. If a probabilistic interpretation is not needed, or the underlying distribution is hard to characterize, then distance-based (cf. Section 2.3) and model-based techniques (cf. Section 2.4) provide alternative approaches. Distance-based approaches usually require the user to specify a distance parameter or threshold, which may involve a trial-and-error process. Model-based approaches usually require labeled training data to train supervised classifiers.
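      The two statistical routes above can be sketched in a few lines of Python. The threshold and quantile defaults below are illustrative assumptions, not values prescribed by this chapter.

```python
import numpy as np
from scipy.stats import gaussian_kde

def zscore_outliers(x, threshold=3.0):
    """Parametric route: assumes x is roughly normal; flags points more
    than `threshold` standard deviations from the mean (cf. Section 2.2.2)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]

def kde_outliers(x, quantile=0.01):
    """Nonparametric route: fits a KDE (cf. Section 2.2.4) and flags the
    points whose estimated density falls in the lowest `quantile`."""
    x = np.asarray(x, dtype=float)
    density = gaussian_kde(x)(x)
    return np.where(density <= np.quantile(density, quantile))[0]
```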
