deviations.
The median and MAD lead to a robust outlier detection technique known as Hampel X84 [Hampel et al. 2011] that is quite reliable in the face of outliers since it has a breakdown point of 50%. Hampel X84 marks outliers as those data points that are more than 1.4826θ MADs away from the median, where θ is the number of standard deviations away from the mean one would have used if there were no outliers in the dataset. The constant 1.4826 is derived under a normal distribution, where one standard deviation away from the mean is about 1.4826 MADs.
Example 2.3 For detecting outliers in the ages column in Table 2.1, we used 2 standard deviations away from the mean in Example 2.2. Therefore, we would now flag points that are more than 1.4826 * 2 = 2.9652 MADs away from the median as outliers.
The median of the set of values in Example 2.2 is 32 and the MAD is 7. The normal value range would be [32 − 7 * 2.9652, 32 + 7 * 2.9652] = [11.2436, 52.7564], which is a much more reasonable range than [−511.06, 784.62] derived using mean and standard deviation. Under the new normal range, the first value and the last value are now correctly flagged as outliers.
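To make the rule concrete, the following is a minimal sketch of Hampel X84 in Python; the function name and the sample comment are ours rather than from the book, and θ = 2 mirrors Example 2.2.

```python
import numpy as np

def hampel_x84(values, theta=2.0):
    # Flag values more than 1.4826 * theta MADs away from the median.
    # theta plays the role of the number of standard deviations one would
    # have used with the mean/standard-deviation rule (2 in Example 2.2).
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))        # median absolute deviation
    half_width = 1.4826 * theta * mad
    lower, upper = median - half_width, median + half_width
    return (x < lower) | (x > upper), (lower, upper)

# With median 32 and MAD 7, as in the example, the normal range is roughly
# [11.24, 52.76], so extreme values at either end of the column are flagged.
```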
Multivariate
So far, we have considered detecting univariate outliers in one dimension by fitting the data to a normal distribution. However, some outliers can only be revealed when considering multiple dimensions; we call these multivariate outliers.
Example 2.4 Consider the income and tax columns in Table 2.1. In general, income varies for different persons, and so do tax rates. However, one would expect a positive correlation between income and tax rate, namely, a higher income is usually associated with a higher tax rate. Therefore, a tax record that has a high income but a low tax rate should be flagged as an outlier, even though the income or the tax rate in that record might not be an outlier when considered separately in its respective column. Consider t5 in Table 2.1. t5[income] is not an outlier compared with the values in the income column, and t5[tax] is also not an outlier compared with the values in the tax column. However, taking the income and tax columns into account simultaneously, t5 is an outlier, since it has a relatively low tax for a relatively high income value.
Suppose we have a table that has n rows and d columns; we model each column as a random variable Xi, ∀i ∈ [1, d]. Let the mean of the d variables be a d-dimensional vector μ = (μ1, μ2, …, μd)^T and the covariance matrix be Σ, where Σi,j is the covariance between the two random variables Xi and Xj. Given the mean and the covariance matrix, the questions are how to measure the distance from a particular data point to the mean using the covariance matrix, and how large that distance must be for a point to qualify as an outlier. The Mahalanobis distance [Mahalanobis 1936] is the multidimensional generalization of measuring how many standard deviations away a point is from the mean of the population. Specifically, the Mahalanobis distance of a point x to the mean vector μ is defined as Mahalanobis(x) = √((x − μ)^T Σ^−1 (x − μ)).
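As a sketch of the definition, the following computes the Mahalanobis distance directly; the mean vector and covariance matrix below are made-up illustrative numbers, not the actual statistics of Table 2.1.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    # Mahalanobis distance of point x to the mean vector mu under covariance cov.
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical income/tax statistics with a strong positive correlation.
mu = np.array([50_000.0, 10_000.0])
cov = np.array([[1.0e8, 1.9e7],
                [1.9e7, 4.0e6]])

# A record with high income but unusually low tax breaks the correlation and
# gets a large distance, even though neither value is extreme on its own.
print(mahalanobis([60_000.0, 2_000.0], mu, cov))
```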
Robust Multivariate Statistics. Similar to the univariate mean and variance, the multivariate mean and covariance matrix are not robust, i.e., a single bad data point can severely distort them, and thus robust estimators must be used. The best-known method for robustifying the multivariate mean and covariance matrix is the Minimum Covariance Determinant (MCD) by Rousseeuw and Driessen [1999]. MCD chooses the subset of h points that minimizes the determinant of the covariance matrix over all possible subsets of size h (h is usually chosen to reflect the belief about how many points in the dataset are outliers; for example, if 10% of the n points are assumed to be outliers, then h is chosen to be 0.9n). The multivariate mean and covariance matrix of these h points are then used to compute the Mahalanobis distance for all n points. A brute-force way to find the covariance matrix with the smallest determinant is to enumerate all possible subsets of size h from the n data points, which is obviously very expensive as one has to evaluate every subset. A greedy algorithm called FASTMCD [Rousseeuw and Driessen 1999] addresses this efficiency problem. Algorithm 2.1 describes FASTMCD. It starts by randomly selecting a subset of h data points and computing the mean μ and the covariance matrix Σ of this random sample. Then, the Mahalanobis distances for all n points are computed using μ and Σ, and the sample is replaced by the h points with the smallest Mahalanobis distances. This update is iterated until the sample remains unchanged between two consecutive iterations. The whole process is repeated with different random samples as initial seeds, and the result with the smallest determinant is kept.
Algorithm 2.1 FASTMCD
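Since the listing of Algorithm 2.1 is not reproduced here, the following is a minimal sketch of the iteration just described, not the book's exact pseudocode; the function names and parameters such as n_starts and max_iter are our own choices.

```python
import numpy as np

def mahalanobis_sq(X, mu, cov):
    # Squared Mahalanobis distance of every row of X to mu under cov.
    diff = X - mu
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

def fastmcd(X, h, n_starts=10, max_iter=100, seed=0):
    # Sketch of the FASTMCD iteration: start from a random subset of size h,
    # repeatedly keep the h points closest in Mahalanobis distance to the
    # subset's mean, and return the mean/covariance of the best subset found.
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    best = None                                   # (determinant, mean, covariance)
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(max_iter):
            mu = X[idx].mean(axis=0)
            cov = np.cov(X[idx], rowvar=False)
            new_idx = np.argsort(mahalanobis_sq(X, mu, cov))[:h]
            if set(new_idx) == set(idx):          # subset unchanged: stop iterating
                break
            idx = new_idx
        cov = np.cov(X[idx], rowvar=False)
        det = np.linalg.det(cov)
        if best is None or det < best[0]:         # keep the smallest determinant
            best = (det, X[idx].mean(axis=0), cov)
    return best[1], best[2]
```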
The correctness of the algorithm can be proven by showing that det(H1) ≤ det(H0) in the iterative portion, i.e., the determinant of the covariance matrix never increases from one iteration to the next. The mean and the covariance matrix computed using FASTMCD can be shown to have a 50% breakdown point [Rousseeuw and Driessen 1999].
Most of our discussion of parametric approaches has been based on normal distributions. In practice, not all datasets are normally distributed. Two other distributions are frequently observed: multimodal distributions and Zipfian distributions. In some cases, the data appears to have many “peaks”; such distributions are typically referred to as multimodal. Oftentimes, these multimodal distributions can be seen as a superposition of multiple normal distributions, known as a mixture of Gaussians. In other cases, a large fraction of the data is condensed into a small fraction of values, while the remainder of the data values is spread across a long tail of rare values; the Zipfian distribution exhibits such a phenomenon. It is important to choose the right distribution when detecting outliers using parametric approaches.
2.2.4 Fitting Distribution: Nonparametric Approaches
An obvious drawback of using parametric approaches to fit a distribution is that they assume the data follows an underlying distribution. In contrast, nonparametric approaches make no assumption about the distribution that generates the data; instead, they infer the distribution from the data itself. There are mainly two types of techniques in this category: histogram-based techniques and kernel density-based techniques.
Histograms
A histogram [Pearson 1894] is a graphical representation of the distribution of numerical data values, and is often used to estimate the probability distribution of a continuous random variable. The first step toward creating a histogram is to discretize or bin the range of values by dividing the range of the random variable into a series of intervals, which are usually consecutive and non-overlapping. The second step is to count how many values fall into each bin. If the bins are of equal size, a bar is drawn over each bin with height proportional to its frequency, i.e., the number of values the bin contains; this type of histogram is called an equi-width histogram. In general, however, bins can be of varying width. If the widths of the bins are chosen
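As a small sketch of the two steps above, the following builds an equi-width histogram; the values are hypothetical and are not the contents of Table 2.1.

```python
import numpy as np

ages = [3, 25, 31, 32, 38, 45, 50, 58, 190]    # hypothetical values

# Step 1: divide the value range into consecutive, non-overlapping bins
# of equal width. Step 2: count how many values fall into each bin.
counts, edges = np.histogram(ages, bins=5)     # 5 equal-width bins
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c}")        # last bin also includes its right edge
```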