Statistical Significance Testing for Natural Language Processing. Rotem Dror

Чтение книги онлайн.

Читать онлайн книгу Statistical Significance Testing for Natural Language Processing - Rotem Dror страница 6

Statistical Significance Testing for Natural Language Processing - Rotem Dror Synthesis Lectures on Human Language Technologies

Скачать книгу

empirical machine learning research in general, and in the NLP community in particular, we would often like to prove the superiority of one algorithm over the other, and present this superiority in terms of a statistically significant improvement according to an evaluation metric, such as accuracy or F-score.1 Therefore, we begin by formulating a general hypothesis testing framework for the comparison between two algorithms. This is the common type of hypothesis testing framework applied in NLP, and its detailed formulation will help us develop our ideas.

      We wish to compare two algorithms, A and B. As an example, let us consider a comparison between two machine translation (MT) algorithms: phrase-based MT (such as the Moses MT system [Koehn et al., 2007]) vs. an LSTM Neural Encoder-decoder Network (e.g., the model described in Cho et al. [2014]). In order to compare between the two algorithms, we would experiment with several different parallel corpora. Let X be the set of such corpora, i.e., a collection of datasets X = {X1, X2,…, XN}, where each data set Xi is comprised of sentence pairs, one from the source language and one from the target language. That is, for all i є {1,…, N}, Xi = {xi,1,…, xi,ni}. where xi,j is a source language sentence and its translation.

      The difference in performance between the two algorithms is measured with one or more evaluation metrics. In our example, when evaluating the performance of machine translation systems, we may use several evaluation measures to assess the quality of translation from various angles. For example, we would probably like our MT system to provide an accurate translation but we may also want to encourage creativity and linguistic richness, and prefer systems that do not excessively repeat the same words and phrases. Accordingly, we would evaluate it using two vastly used different metrics: BLEU [Papineni et al., 2002] and PINC [Chen et al., 2011]. We denote our set of metrics as M = {M1,…., Mm}.2

      So far, we have our two MT algorithms A and B, trained and evaluated on a set of metrics M = {M1,…, Mm}. We denote with Mj(ALG, Xi) the value of the measure Mj when algorithm ALG is applied to the dataset Xi. Without loss of generality, we assume that higher values of the measure are better.

      We define the difference in performance between two algorithms, A and B, according to the measure Mj on the dataset Xi as:

      Finally, using this notation we formulate the following statistical hypothesis testing problem:

      The goal of testing the above hypotheses is to determine if algorithm A is significantly better than algorithm B on the dataset Xi using the evaluation measure Mj. In our example, this translates to the following question: “Is the LSTM-based MT system better than the Phrasebased one on the Wikipedia parallel corpus when considering the BLEU metric?”

      If we strive to show that the LSTM is superior to the phrase-based system (in the specific setup of the Wikipedia Corpus and the BLEU metric), we would need to provide statistically valid evidence. Our hypotheses can be described as follows: The (somewhat pessimistic) null hypothesis would state that there is no significant performance difference between the LSTM and the phrase-based system, or that the latter performs even better, while the alternative hypothesis would state that the LSTM performs significantly better.

      More generally, in our formulation the null hypothesis, H0, states that there is no difference between the performance of algorithm A and algorithm B, or that B performs better. This hypothesis is tested vs. the alternative statement, H1—that A is superior. If the statistical test results in rejecting the null hypothesis, one concludes that A outperforms B in this setup—i.e., on dataset Xi with respect to the evaluation metric Mj. Otherwise, there is not enough evidence in the data to make the conclusion of rejecting the null hypothesis. In this case, it is uncustomary to claim that we accept the null hypothesis, since the null hypothesis is the starting point, and by posing an alternative hypothesis we try to challenge the idealized state.

      Naturally, we could be wrong in our conclusion. Our specific experiments may show that the LSTM outperforms the phrase-based system in a certain setup, but this does not necessarily reflect the true nature of things. Let us now properly define the two types of errors that we may encounter in our hypothesis test.

      • Type I error—rejection of the null hypothesis when it is true, i.e., there is no difference in performance between the two algorithms. For example, concluding that the LSTM is superior to the phrase-based system in the explored setting when, in fact, that is not the case in general.

      • Type II error—non-rejection of the null hypothesis when the alternative hypothesis is true. For example, missing the fact that the LSTM is in fact superior to the phrase-based system.

      Knowing which one of the hypotheses is correct with full certainty is practically impossible, as that would require us to create a sample of all possible scenarios, i.e., observe the complete data generating distribution. Therefore, in practice, we can never know which one of the two algorithms is superior, and so the statistical significance testing framework actually strives to minimize the probability of type I and type II errors. We will touch on this in the following section.

      Note, however, that reducing the probability of one of the errors may cause an increase of the probability of the other. The classical approach to hypothesis testing is to find a test that guarantees that the probability of making a type I error is upper bounded by a predefined constant α—the significance level of the test—while keeping the probability of a type II error as low as possible. The last is also referred to as designing a test that is as statistically powerful as possible.

      A statistical test is called valid if it controls a certain type I error criterion, i.e., it guarantees to bound the error criterion by a predefined constant. By this definition, however, high validity can be obtained by never rejecting any null hypothesis. Hence, the quality of a statistical test is measured not only by its validity, but also by its power: the probability that it would in fact reject a false null hypothesis. This probability is called the statistical power of the test. In general, we wish to design tests that are both valid and powerful.

      In the following section we will introduce the concept of p-value, a statistical instrument that allows us to test whether or not the null hypothesis holds, based on a data sample that is available.

      We will now discuss a practical approach for deciding whether or not to reject the null hypothesis. We focus on the setup where the performance of two algorithms, A and B, on a dataset X, is compared using an evaluation measure M. Let us denote with M(ALG, X) the value of the evaluation measure M when algorithm ALG is applied to the dataset X. Without loss of generality, we assume that higher values of the measure are better. We define the difference in performance between the two algorithms according to the measure M on the dataset X as:

Image

      In our example, A could be the LSTM and B the phrase-based MT system, and M could be the BLEU metric. According to Equation (2.3), δ(X) would be the difference in performance between our two MT algorithms with

Скачать книгу