Statistical Significance Testing for Natural Language Processing. Rotem Dror
in performance between the two algorithms. For example, concluding that the LSTM is superior to the phrase-based system in the explored setting when, in fact, that is not the case in general.
• Type II error—non-rejection of the null hypothesis when the alternative hypothesis is true. For example, missing the fact that the LSTM is in fact superior to the phrase-based system.
Knowing which one of the hypotheses is correct with full certainty is practically impossible, as that would require us to create a sample of all possible scenarios, i.e., observe the complete data-generating distribution. Therefore, in practice, we can never know which one of the two algorithms is superior, and so the statistical significance testing framework instead strives to minimize the probability of type I and type II errors. We will touch on this in the following section.
Note, however, that reducing the probability of one of the errors may cause an increase in the probability of the other. The classical approach to hypothesis testing is to find a test that guarantees that the probability of making a type I error is upper bounded by a predefined constant α (the significance level of the test) while keeping the probability of a type II error as low as possible. The latter is also referred to as designing a test that is as statistically powerful as possible.
A statistical test is called valid if it controls a certain type I error criterion, i.e., it guarantees to bound the error criterion by a predefined constant. By this definition, however, validity can be trivially obtained by never rejecting any null hypothesis. Hence, the quality of a statistical test is measured not only by its validity, but also by its statistical power: the probability that it rejects a null hypothesis that is in fact false. In general, we wish to design tests that are both valid and powerful.
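To make validity and power concrete, the following is a minimal simulation sketch, not a procedure taken from this book. It assumes a hypothetical setting in which per-example performance differences are Gaussian with a known standard deviation of 1, and it uses a one-sided z-test as a stand-in for a generic test. The names `one_sided_z_test_rejects` and `rejection_rate` are illustrative. The rejection rate when the true mean difference is zero estimates the type I error, and the rejection rate when the true mean difference is positive estimates the power.

```python
import random
from statistics import NormalDist, mean


def one_sided_z_test_rejects(sample, alpha, sigma=1.0):
    """Reject H0 (mean <= 0) when the one-sided z-test p-value is below alpha.

    Assumes the standard deviation sigma is known (an illustrative assumption).
    """
    n = len(sample)
    z = mean(sample) / (sigma / n ** 0.5)
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


def rejection_rate(true_mean, alpha, n=25, trials=4000, seed=0):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        rejections += one_sided_z_test_rejects(sample, alpha)
    return rejections / trials


# Null hypothesis true (no difference): the rejection rate estimates the type I error.
type_i_rate = rejection_rate(true_mean=0.0, alpha=0.05)
# Alternative true (mean difference 0.5): the rejection rate estimates the power.
power = rejection_rate(true_mean=0.5, alpha=0.05)

print(f"estimated type I error: {type_i_rate:.3f}")
print(f"estimated power:        {power:.3f}")
```

In this sketch the estimated type I error should stay close to α = 0.05, which is what validity demands, while the estimated power depends on the size of the true difference and on the number of examples.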
In the following section we will introduce the concept of the p-value, a statistical instrument that allows us to test whether or not the null hypothesis holds, based on the data sample that is available.
2.2 P-VALUE IN THE WORLD OF NLP
We will now discuss a practical approach for deciding whether or not to reject the null hypothesis. We focus on the setup where the performance of two algorithms, A and B, on a dataset X, is compared using an evaluation measure M. Let us denote with M(ALG, X) the value of the evaluation measure M when algorithm ALG is applied to the dataset X. Without loss of generality, we assume that higher values of the measure are better. We define the difference in performance between the two algorithms according to the measure M on the dataset X as:
δ(X) = M(A, X) − M(B, X)    (2.3)
In our example, A could be the LSTM and B the phrase-based MT system, and M could be the BLEU metric. According to Equation (2.3), δ(X) would be the difference in performance between our two MT algorithms with respect to the BLEU metric. We would like to test whether δ(X) > 0, which would indicate a higher BLEU score (i.e., better performance) for the LSTM. However, we would also like to assess whether this result is likely to happen again in a new experiment, or whether the current experiment does not reflect the actual relationship between the algorithms.
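As a small illustration of this definition, the sketch below computes δ(X) for two systems on a toy dataset. It is a sketch under stated assumptions: accuracy is swapped in for BLEU so that the example needs no external library, and the labels and predictions are made up for illustration. The function names `accuracy` and `delta` are hypothetical.

```python
def accuracy(predictions, gold):
    """Evaluation measure M: fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def delta(measure, outputs_a, outputs_b, gold):
    """Difference in performance: M(A, X) - M(B, X)."""
    return measure(outputs_a, gold) - measure(outputs_b, gold)


# Toy data (illustrative only, not taken from any real experiment).
gold    = [1, 0, 1, 1, 0, 1, 0, 0]
preds_a = [1, 0, 1, 1, 0, 1, 1, 0]  # algorithm A: 7 of 8 correct
preds_b = [1, 0, 0, 1, 0, 0, 1, 0]  # algorithm B: 5 of 8 correct

delta_x = delta(accuracy, preds_a, preds_b, gold)
print(delta_x)  # 0.875 - 0.625 = 0.25
```

A positive value of `delta_x` means that A scored higher than B on this particular dataset; by itself it says nothing about whether the difference would persist on new data.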
We will refer to δ(X) as our test statistic: a quantity derived from the experiment and used for the statistical hypothesis testing. Using this notation we formulate the following statistical hypothesis testing problem³:
H0: δ(X) ≤ 0
H1: δ(X) > 0
The null hypothesis, H0, is that δ(X) is smaller than or equal to zero, meaning that algorithm B is better than A, or that B is as good as A. In contrast, the alternative hypothesis, H1, is that there is in fact a difference in performance and that algorithm A is superior. In order to decide whether or not to reject the null hypothesis, we can ask the following question.
Considering the test statistic that we chose and its distribution under the null hypothesis, how likely would it be to encounter the δ(X) value that we have observed in our test, given that the null hypothesis is indeed correct?
After all, if δ(X) is a very large number, then algorithm A strongly outperformed algorithm B, and that would be unlikely under the hypothesis that algorithm B is better. To answer this question we will need to compute a probability term where δ(X) is a random variable, which requires some prior knowledge regarding its distribution under the null hypothesis; we will discuss this further later in this book. We therefore phrase our decision in terms of the probability of observing the δobserved value if the null hypothesis were in fact true. This probability is exactly the p-value of the test.
The p-value is defined as the probability, under the null hypothesis H0, of obtaining a result equal to or even more extreme than what was actually observed. For the hypothesis testing framework defined here, the p-value is defined as:
p-value = P(δ(X) ≥ δobserved | H0)
where δobserved is the performance difference between the algorithms (according to M) when they are applied to X. Going back to our example, we could describe the p-value as the probability that the LSTM shows such a performance advantage in this setting (i.e., that we observe a difference at least as large as δobserved) when the phrase-based MT system is actually a better model. If δobserved is small, meaning the LSTM’s BLEU score is only slightly better than that of the phrase-based system, it may very well be a statistical “fluke”, such that if we were to repeat the experiment with a slightly different dataset from the same distribution we could well encounter the opposite result, with the phrase-based MT system performing better. However, as δobserved increases, the probability of encountering such values under the assumption that the phrase-based MT system is better becomes smaller and smaller.
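Computing this probability requires the distribution of δ(X) under the null hypothesis. As one possible illustration, the sketch below estimates the p-value with a paired sign-flip permutation test. This is a technique we bring in here only as an example, and it rests on assumptions that the text above does not make: the measure M is an average of per-example scores, and under the null hypothesis the two systems' scores on each example are exchangeable. The function name `paired_permutation_p_value` and the toy scores are hypothetical.

```python
import random


def paired_permutation_p_value(scores_a, scores_b, num_resamples=10000, seed=0):
    """Estimate P(delta >= delta_observed | H0) by randomly flipping difference signs.

    Assumes per-example scores whose mean is the evaluation measure, and that
    the two systems are exchangeable on each example under the null hypothesis.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    at_least_as_extreme = 0
    for _ in range(num_resamples):
        # Swapping the two systems' scores on an example flips the sign of its difference.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (num_resamples + 1)


# Toy per-example scores (illustrative only).
scores_a = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.66, 0.73, 0.69]
scores_b = [0.60, 0.65, 0.57, 0.72, 0.66, 0.70, 0.55, 0.61, 0.70, 0.64]

p_value = paired_permutation_p_value(scores_a, scores_b)
print(f"estimated p-value: {p_value:.4f}")
```

In this toy data, A scores higher than B on nine of the ten examples, so very few sign assignments produce a mean difference as large as the observed one and the estimated p-value is small. If the two systems had identical scores, every resampled difference would equal the observed one and the estimate would be 1.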
The smaller the p-value, the stronger the indication that the observed outcome is unlikely under the null hypothesis, H0. In order to decide whether H0 should be rejected, the researcher should pre-define an arbitrary, fixed threshold value α, also known as the significance level. The null hypothesis is rejected only if the p-value is smaller than α.
For example, let us say that the probability of encountering a difference of 10 points between BLEU(LSTM) and BLEU(phrase-based), under the assumption that the phrase-based MT system is better, is 0.05. For a significance level of 0.1 we would reject the null hypothesis, since p-value < α. For a significance level of 0.03 we would not reject the null hypothesis. A lower α is a stronger demand, equivalent to saying “We need to see a stronger, more extreme improvement in the LSTM in order to determine that it is a superior model. We want to see such a strong improvement (such a large δobserved), that would only have a probability of 0.03 or less under the null hypothesis.”
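The decision rule in this example can be written in a few lines. The sketch below simply applies the rule "reject H0 only if p-value < α" to a p-value of 0.05 at significance levels of 0.1 and 0.03; the helper name `reject_null` is hypothetical.

```python
def reject_null(p_value, alpha):
    """Reject the null hypothesis only if the p-value is strictly below alpha."""
    return p_value < alpha


p_value = 0.05
decisions = {alpha: reject_null(p_value, alpha) for alpha in (0.1, 0.03)}

for alpha, rejected in decisions.items():
    verdict = "reject H0" if rejected else "do not reject H0"
    print(f"alpha = {alpha}: {verdict}")
```

The same observed result therefore leads to different conclusions depending on the threshold, which is why α should be fixed before the experiment is run.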
How should we choose an α? As noted above, it is impossible to actually know which hypothesis is correct, H0 or H1, and hence we can only strive to minimize the probability of choosing the wrong hypothesis. A small α ensures that we do not reject the null hypothesis easily, but it may also cause us to not reject the null hypothesis when we should. More technically, a small α yields a lower probability of a type I error and a higher probability of a type II error. A common practice is to choose an α that guarantees that the probability of making a type I error is upper bounded by a pre-defined desired value, while achieving the highest possible power, i.e., the lowest possible probability of making a type II error. Popular α values in the literature are 0.05 and 0.01.
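The trade-off between the two error types can be made explicit in a simple case. The sketch below assumes a one-sided z-test on n per-example differences with known standard deviation, together with a hypothetical true mean difference of 0.5; neither assumption comes from the text above. Under these assumptions the probability of a type II error has a closed form, and we can see how it grows as α shrinks. The function name `type_ii_error` is illustrative.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect_size, n, sigma=1.0):
    """Probability of not rejecting H0 when the true mean difference is effect_size.

    Assumes a one-sided z-test with known standard deviation sigma.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold for the z statistic
    shift = effect_size * n ** 0.5 / sigma        # mean of the z statistic under H1
    power = 1.0 - std_normal.cdf(critical_z - shift)
    return 1.0 - power


alphas = (0.1, 0.05, 0.01)
betas = [type_ii_error(alpha, effect_size=0.5, n=25) for alpha in alphas]

for alpha, beta in zip(alphas, betas):
    print(f"alpha = {alpha:<4}  type II error = {beta:.3f}  power = {1.0 - beta:.3f}")
```

With these illustrative numbers, the type II error rises from roughly 0.11 at α = 0.1 to roughly 0.20 at α = 0.05 and roughly 0.43 at α = 0.01: a stricter bound on type I errors is paid for with lower power unless the sample size or the true difference is larger.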
1 In this book we use the terms evaluation metric and evaluation measure interchangeably.
2 To keep the discussion concise, throughout this book we assume that only one evaluation measure is used. Our framework