1. Distance-based techniques do not make any assumptions regarding the generative distribution for the data. Instead, they are purely data driven.
2. Adapting distance-based techniques to different data types is straightforward, and primarily requires defining an appropriate distance measure for the given data.
Distance-based techniques also have several disadvantages.
1. If the data has normal instances that do not have enough close neighbors, or if the data has anomalies that have enough close data points, distance-based techniques will fail to label them correctly.
2. The computational complexity of the testing phase is also a significant challenge, since it involves computing the distance between every pair of data points.
3. The performance of a nearest neighbor-based technique relies heavily on a distance measure, defined between a pair of data instances, that can effectively distinguish between normal and anomalous instances. Defining distance measures between instances can be challenging when the data is complex, for example, graphs, sequences, and so on. A minimal sketch of a simple distance-based detector follows this list.
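To make these trade-offs concrete, the following is a minimal sketch of a simple distance-based detector, assuming numpy is available; the radius r, neighbor threshold k, and the function name are illustrative choices, not a method prescribed in this book. A point is flagged when it has fewer than k neighbors within distance r, and the O(n²) pairwise distance computation makes the testing-phase cost in disadvantage 2 explicit.

```python
# A minimal sketch of a distance-based outlier detector (illustrative
# parameters): flag points with fewer than k neighbors within radius r.
import numpy as np

def distance_based_outliers(X, r=1.0, k=3):
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all n points (n x n matrix);
    # this quadratic step is the testing-phase bottleneck noted above.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within radius r, excluding the point itself
    neighbor_counts = (dists <= r).sum(axis=1) - 1
    return neighbor_counts < k  # True marks a suspected outlier
```

Swapping the Euclidean distance for a domain-appropriate measure (e.g., edit distance for sequences) is exactly the adaptation step mentioned in advantage 2, and also where the difficulty in disadvantage 3 arises.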
Model-Based Outlier Detection Methods. Model-based outlier detection techniques first learn a classifier model from a set of labeled data points, and then apply the trained classifier to a test data point to determine whether it is an outlier. Model-based approaches assume that a classifier can be trained to distinguish between the normal data points and the anomalous data points using the given feature space. They label data points as outliers if none of the learned models classify them as normal points. Based on the labels available to train the classifier, model-based approaches can be further divided into two subcategories: multi-class model-based techniques and one-class model-based techniques. Multi-class model-based techniques assume that the training data points contain labeled instances belonging to multiple normal classes. On the other hand, one-class model-based techniques assume that all the training data points belong to one normal class. The advantages of model-based techniques include:
1. Model-based techniques, especially the multi-class techniques, can make use of powerful algorithms that can distinguish between instances belonging to different classes.
2. The testing phase of model-based techniques is fast, since each test instance needs to be compared against the precomputed model.
The major disadvantage of model-based techniques is that they rely on the availability of accurate labels for the various normal classes, which is often not possible.
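As an illustration of the one-class subcategory, the following sketch trains a one-class classifier on data assumed to be all normal and then labels rejected test points as outliers. It uses scikit-learn's OneClassSVM; the synthetic data and parameter settings are illustrative assumptions, not taken from this book.

```python
# A minimal one-class model-based sketch using scikit-learn's OneClassSVM:
# train on (assumed clean) normal data, then flag test points it rejects.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # training data: one normal class
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),
                    [[6.0, 6.0]]])              # last point is far from the class

model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
labels = model.predict(X_test)                  # +1 = normal, -1 = outlier
print(labels)                                   # e.g., [ 1  1  1  1  1 -1]
```

Because the model is fit once and each test point is only scored against it, the fast testing phase noted in advantage 2 is apparent; the hard part in practice is obtaining training data that is reliably all normal.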
2.2 Statistics-Based Outlier Detection
In this section, we discuss statistics-based outlier detection methods. First, we review some basics about data distributions in Section 2.2.1. Section 2.2.2 then shows how hypothesis tests, such as Grubbs' test, are used for detecting outliers. In Section 2.2.3, we present parametric approaches for fitting a distribution to the data, and also introduce the notion of robust statistics, which can capture important properties of the underlying distribution in the presence of outliers in the data. In Section 2.2.4, we introduce nonparametric approaches for fitting a distribution to the data by using histograms and kernel density estimation (KDE).
2.2.1 A Note on Data Distribution
Database practitioners usually think of data as a collection of records. They are interested in descriptive statistics about the data (e.g., “what is the min, max, and average salary for all the employees?”), and they usually do not care much about how the data is generated. On the other hand, statisticians usually treat data as a sample of some data-generating process. In many cases, they try to find a model or a distribution of that process that best describes the available sample data.
The most commonly used distribution is the Gaussian or normal distribution, which is characterized by a mean μ and a standard deviation σ [Barnett and Lewis 1994]. While the normal distribution might not be the exact distribution of a given dataset, it is the cornerstone of modern statistics, and by the central limit theorem it accurately models the sum of many independent and identically distributed variables. The following equation shows the pdf of the normal distribution N(μ, σ²):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
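As a quick sanity check of this formula, the sketch below evaluates the pdf directly and compares it against scipy's implementation; the chosen values of μ, σ, and x are illustrative. At x = μ for N(0, 1), the density is 1/√(2π) ≈ 0.3989.

```python
# Evaluating the normal pdf by the formula above and via scipy.
import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0, 1.0, 0.0
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(pdf, norm.pdf(x, loc=mu, scale=sigma))    # both print 0.39894...
```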
Multivariate Gaussian distribution or multivariate normal distribution is a generalization of the univariate normal distribution to higher dimensions. A random vector consisting of d random variables Xi, where i ∊ [1, d], is said to be d-variate normally distributed if every linear combination of its d components has a univariate normal distribution. The mean of the d variables is a d-dimensional vector μ = (μ1, μ2, …, μd)T, where μi is the mean of the random variable Xi. The covariance matrix Σ of these d random variables (also known as dispersion matrix or variance-covariance matrix) is a matrix, where Σi,j (Row i and Column j of Σ) is the covariance between the two random variables Xi and Xj, namely, Σi,j = cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)]. Intuitively, the covariance matrix generalizes the notion of variance of a single random variable or the covariance of two random variables to d dimensions.
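To ground these definitions, the following sketch estimates μ and Σ from a sample with numpy and then scores each point by its squared Mahalanobis distance to the fitted Gaussian, one standard way the covariance matrix is used for outlier scoring; the synthetic data and the 99% threshold are illustrative assumptions.

```python
# Estimating the mean vector and covariance matrix of d-variate data,
# then scoring points by squared Mahalanobis distance to the fit.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu = X.mean(axis=0)                  # d-dimensional mean vector
Sigma = np.cov(X, rowvar=False)      # d x d covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

diffs = X - mu
d2 = np.einsum("ij,jk,ik->i", diffs, Sigma_inv, diffs)
# Under the Gaussian assumption, d2 follows a chi-square distribution
# with d degrees of freedom, so its upper tail marks suspected outliers.
outliers = d2 > chi2.ppf(0.99, df=X.shape[1])
```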
2.2.2 Hypothesis Testing for Outlier Detection
A statistical hypothesis is an assumption about a population that may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses, and usually consists of the following steps [Lehmann and Romano 2006]: (1) formulate the null and the alternative hypothesis; (2) decide which test is appropriate and state the relevant test statistic T; (3) choose a significance level α, namely, a probability threshold below which the null hypothesis will be rejected; (4) determine the critical region for the test statistic T, namely, the range of values of the test statistic for which the null hypothesis is rejected; (5) compute from the current data the observed value tobs of the test statistic T; and (6) reject the null hypothesis if tobs falls within the critical region. Different known test statistics exist for testing different properties of a population. For example, the chi-square test for independence is used to determine whether there is a significant relationship between two categorical variables; the one-sample t-test is commonly used to determine whether the hypothesized mean differs significantly from the observed sample mean; and the two-sample t-test is used to determine whether the difference between the means of two samples differs significantly from the hypothesized difference between the two population means.
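The sketch below walks through steps (1)-(6) for the one-sample t-test mentioned above, using scipy; the sample values, the hypothesized mean of 5.0, and α = 0.05 are illustrative assumptions.

```python
# Steps (1)-(6) of hypothesis testing for a one-sample t-test.
# (1) H0: the population mean equals 5.0; H1: it does not.
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.8, 5.2, 5.6, 5.4])
alpha = 0.05                                  # (3) significance level
# (2)+(5) the test statistic T is the t statistic; compute its observed value
t_obs, p_value = stats.ttest_1samp(sample, popmean=5.0)
# (4) critical region for a two-sided test: |t| > t_crit with n-1 dof
t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)
# (6) reject H0 if t_obs falls within the critical region
print(abs(t_obs) > t_crit, p_value < alpha)   # the two checks agree
```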
A number of formal hypothesis tests for detecting outliers have been proposed in the literature. These tests usually differ according to the following characteristics.
• What is the underlying distribution model for the data?
• Is the test designed for testing a single outlier or multiple outliers?
• If the test is for detecting multiple outliers, does the number of outliers need to be specified exactly in the test?
For example, the Grubbs Test [Grubbs 1969] is used to detect a single outlier in a univariate dataset that follows an approximately normal distribution. The Tietjen-Moore Test [Tietjen and Moore 1972] is a generalization of the Grubbs test to the case of more than one outlier; however, it requires the number of outliers to be specified exactly. The Generalized Extreme Studentized Deviate (ESD) Test [Rosner 1983] requires only an upper bound on the suspected number of outliers and is often used when the exact number of outliers is unknown.
We use Grubbs' test as an example to show how hypothesis testing is used for detecting outliers, as it is a standard test when testing for a single outlier. This outlier is expunged from the dataset and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or fewer, since it then frequently tags most of the points as outliers.
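A minimal sketch of one round of the two-sided Grubbs' test is shown below, assuming numpy and scipy; the function name and default significance level are illustrative, and the critical value uses the standard t-distribution form of the test.

```python
# One round of the two-sided Grubbs' test for a single outlier.
import numpy as np
from scipy import stats

def grubbs_test(y, alpha=0.05):
    """Return (is_outlier, index) for the most extreme observation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mean, sd = y.mean(), y.std(ddof=1)   # sample mean and standard deviation
    idx = int(np.argmax(np.abs(y - mean)))
    g_obs = abs(y[idx] - mean) / sd      # observed Grubbs statistic G
    # Critical value from the t-distribution with n-2 degrees of freedom
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return g_obs > g_crit, idx
```

Re-running the function after removing a flagged point reproduces the iterative procedure described above, subject to the caveat that repeated iterations change the detection probabilities.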