2 Outlier Detection
Quantitative error detection often targets data anomalies with respect to some definition of “normal” data values. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] and “an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994]. Different outlier detection methods will generate different “candidate” outliers, possibly along with some confidence scores.
Many applications of outlier detection exist. In the context of computer networks, different kinds of data, such as operating system calls and network traffic, are collected in large volumes; outlier detection can help detect possible intrusions and malicious activities in the collected data. In the context of credit card fraud, unauthorized users often exhibit unusual spending patterns, such as a buying spree from a distant location, and fraud detection aims to flag such criminal activity in financial transactions. Finally, in the case of medical data records, such as MRI scans, PET scans, and ECG time series, automatic identification of abnormal patterns or records often signals the presence of a disease and can help with early diagnosis.
There are many challenges in detecting outliers. First, defining what constitutes normal data is itself challenging. For example, when data is generated from wearable devices, spikes in recorded readings might be considered normal if they fall within a specific range dictated by the sensors used; on the other hand, spikes in salary values are probably interesting outliers when analyzing employee data. Therefore, understanding the assumptions and limitations of each outlier detection method is essential for choosing the right tool for a given application domain. Second, many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the “curse of dimensionality.” As the number of dimensions increases, it becomes increasingly difficult to accurately estimate the multidimensional distribution of the data points [Scott 2008], and the distances between any pair of points become almost identical [Beyer et al. 1999], rendering distance-based notions of outlierness meaningless. We give some concrete examples and more details on these challenges in the next section.
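To make the second challenge concrete, the following Python sketch (our own illustration, not code from this book) measures the relative contrast between the farthest and the nearest neighbor of a random query point in a uniform dataset; as Beyer et al. [1999] show, this contrast shrinks as dimensionality grows, so “near” and “far” become nearly indistinguishable.

import numpy as np

rng = np.random.default_rng(42)

for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))   # 1,000 points drawn uniformly from [0, 1]^d
    query = rng.random(d)            # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest neighbor is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")

Running this prints a contrast that drops sharply with d, which is why distance-based detectors degrade on high-dimensional data.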
In Section 2.1, we present a taxonomy of outlier detection techniques. We then discuss each of these categories in detail in Sections 2.2, 2.3, and 2.4, respectively. In Section 2.5, we discuss outlier detection techniques for high-dimensional data that address the “curse of dimensionality.”
2.1 A Taxonomy of Outlier Detection Methods
Outlier detection techniques mainly differ in how they define normal behavior. Figure 2.1 depicts the taxonomy we adopt to classify outlier detection techniques, which fall into three main categories: statistics-based, distance-based, and model-based outlier detection techniques [Aggarwal 2013, Chandola et al. 2009, Hodge and Austin 2004]. In this section, we give an overview of each category along with its pros and cons, which we then discuss in detail.
Statistics-Based Outlier Detection Methods. Statistics-based outlier detection techniques assume that normal data points appear in the high-probability regions of a stochastic model, while outliers occur in its low-probability regions [Chandola et al. 2009]. There are two commonly used categories of approaches for statistics-based outlier detection. The first category is based on hypothesis testing methods, such as the Grubbs test [Grubbs 1969] and the Tietjen-Moore test [Tietjen and Moore 1972]; these methods calculate a test statistic from the observed data points and use it to determine whether the null hypothesis (there is no outlier in the dataset) should be rejected. The second category aims at fitting a distribution or inferring a probability density function (pdf) from the observed data; data points that have low probability according to the pdf are declared outliers. Techniques for fitting a distribution can be further divided into parametric and nonparametric approaches. Parametric approaches assume that the data follows an underlying distribution and aim at finding the parameters of that distribution from the observed data; for example, assuming the data follows a normal distribution, a parametric approach would learn the distribution's mean and variance. In contrast, nonparametric approaches make no assumption about the distribution that generates the data; instead, they infer the distribution from the data itself.
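As an illustration of the parametric approach just described, the following Python sketch (our own minimal example, not code from this book) fits a normal distribution by estimating its mean and standard deviation from the observed values and flags points lying in the low-probability tails; the 3-standard-deviation cutoff is an illustrative assumption, not a prescribed threshold.

import numpy as np

def normal_outliers(values, k=3.0):
    # Fit the normal distribution's parameters from the observed data.
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std(ddof=1)
    # The low-probability regions of a normal pdf are exactly its far tails,
    # so flag values lying more than k standard deviations from the mean.
    return np.abs(values - mean) / std > k

rng = np.random.default_rng(0)
salaries = np.concatenate([rng.normal(50_000, 1_500, size=200), [250_000]])
print(salaries[normal_outliers(salaries)])  # only the 250,000 salary is flagged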
Figure 2.1 A taxonomy of outlier detection techniques.
There are advantages of statistics-based techniques.
1. If the underlying data follows a specific distribution, then the statistical outlier detection techniques can provide a statistical interpretation for discovered outliers.
2. Statistical techniques usually provide a score or a confidence interval for every data point, rather than making a binary decision. The score can be used as additional information while making a decision for a test data point.
3. Statistical techniques usually operate in an unsupervised fashion without any need for labeled training data.
There are also some disadvantages of statistics-based techniques.
1. Statistical techniques usually rely on the assumption that the data is generated from a particular distribution. This assumption often does not hold true, especially for high-dimensional real datasets.
2. Even when the statistical assumption can be reasonably justified, there are several hypothesis test statistics that can be applied to detect anomalies; choosing the best statistic is often not a straightforward task. In particular, constructing hypothesis tests for complex distributions that are required to fit high-dimensional datasets is nontrivial.
Distance-Based Outlier Detection Methods. Distance-based outlier detection techniques define a distance between data points, which is then used to characterize normal behavior. For example, a normal data point should be close to many other data points, and data points that deviate from such behavior are declared outliers [Knorr and Ng 1998, 1999, Breunig et al. 2000]. Distance-based outlier detection methods can be further divided into global and local methods depending on the reference population used when determining whether a point is an outlier. A global distance-based method decides whether a point is an outlier based on the distances between that point and all other data points in the dataset. A local method, on the other hand, considers only the distances between a point and the points in its neighborhood. There are advantages of distance-based techniques.
1. A major advantage of distance-based techniques is that they do not make any assumptions about the distribution that generates the data; instead, they are purely data driven.
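As an illustration of the global approach described above, the following Python sketch (our own, loosely in the spirit of Knorr and Ng [1998]; the function name and parameter values are illustrative assumptions) flags a point as an outlier if fewer than a fraction pi of all other points lie within distance eps of it.

import numpy as np

def global_distance_outliers(points, eps, pi=0.05):
    # Flag points whose eps-neighborhood contains less than a fraction pi
    # of the remaining dataset (a global reference population).
    points = np.asarray(points, dtype=float)
    # All pairwise Euclidean distances; O(n^2) memory, fine for a sketch.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_counts = (dists <= eps).sum(axis=1) - 1  # exclude the point itself
    return neighbor_counts < pi * (len(points) - 1)

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, size=(100, 2))   # one dense cluster
data = np.vstack([cluster, [[8.0, 8.0]]])       # plus one far-away point
print(np.where(global_distance_outliers(data, eps=2.0))[0])  # the isolated point, index 100

Note that this sketch is a global method: every point is compared against the entire dataset, so a point that is unusual only relative to its local neighborhood would be missed; local methods such as LOF [Breunig et al. 2000] address exactly that case.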