Statistical Significance Testing for Natural Language Processing. Rotem Dror
6: Calculate the p-value—the probability, under the null hypothesis H0, of observing a test statistic at least as extreme as that which was observed.
7: Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if and only if the p-value is less than (or equal to) α.
The notion of paired vs. independent samples is crucial in NLP. Often we compare several algorithms on the same dataset, and hence paired tests are more common. In what follows, we survey prominent parametric and nonparametric tests, emphasizing the paired setup. In addition, Algorithms 3.1 and 3.2 display pseudocode of the general testing process that is applied when testing for statistical significance; the two processes are equivalent.
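To make steps 6 and 7 concrete, the following is a minimal Python sketch of the generic decision rule; the p_value helper is hypothetical and stands in for whichever test-specific computation (parametric or nonparametric) is used.

def significance_test(observed_statistic, p_value, alpha=0.05):
    # Generic decision rule (steps 6-7): reject H0 in favor of H1 iff p <= alpha.
    # p_value is a hypothetical, test-specific function returning the probability,
    # under H0, of a statistic at least as extreme as observed_statistic.
    p = p_value(observed_statistic)
    reject_h0 = p <= alpha
    return p, reject_h0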
3.2 PARAMETRIC TESTS
As previously defined, parametric tests are statistical significance tests that assume prior knowledge regarding the test statistic’s distribution under the null hypothesis. When using such tests, we utilize the test statistic’s assumed distribution in order to ensure a bound on the type I error and a low probability of making a type II error. We will now elaborate on several prominent parametric tests that are suitable for the setup of paired samples.
Algorithm 3.3 The Paired Z-test
Input: Paired samples {x_i}, {y_i}.
Output: p—the p-value.
Notations: n the sample size, σ the (known) population standard deviation of the paired differences.
1: Calculate the mean of the paired differences d̄ = (1/n) Σ_i (x_i − y_i).
2: Calculate the test statistic z = d̄ / (σ / √n).
3: Calculate p = P(Z ≥ z) where Z ∼ N(0, 1).
We begin with tests that are highly relevant to NLP setups, accounting for cases where the metric values come from a normal distribution. Relevant NLP metrics include sentence-level accuracy, recall, unlabeled attachment score (UAS), and labeled attachment score (LAS) [Yeh, 2000].
Paired Z-test In this test, the sample is assumed to be normally distributed and the standard deviation of the population is known. The test is used to validate the hypothesis that the drawn sample belongs to the population, by checking whether the sample mean equals the population mean. It is not very applicable in NLP, since the population standard deviation is rarely known, but we define it here for completeness. In addition, the statistical test that validates the same hypothesis without assuming a known standard deviation is one of the most commonly used tests in NLP: the t-test, which is described next. The Z-test is defined in Algorithm 3.3.
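For illustration, the following is a minimal Python sketch of Algorithm 3.3, assuming NumPy and SciPy are available; the function name paired_z_test and the argument sigma (the known population standard deviation of the differences) are our own labels.

import numpy as np
from scipy.stats import norm

def paired_z_test(x, y, sigma):
    # x, y: paired metric values of two systems on the same n test examples.
    # sigma: the (assumed known) population standard deviation of the differences.
    d = np.asarray(x) - np.asarray(y)   # paired differences
    n = len(d)
    d_bar = d.mean()                    # mean of the paired differences
    z = d_bar / (sigma / np.sqrt(n))    # test statistic
    p = norm.sf(z)                      # p = P(Z >= z), Z ~ N(0, 1)
    return z, p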
Paired Student’s t-test This test aims to assess whether the population means of two sets of measurements differ from each other, and is based on the assumption that both samples come from a normal distribution [Fisher, 1937]. The calculations of the test statistic and the p-value for this test are shown in Algorithm 3.4.
Since this test assumes a normal distribution and is computed over population means, one may argue that, based on the Central Limit Theorem (CLT), it can be applied to compare any two sufficiently large sets of measurements; however, in NLP setups the test examples (e.g., sentences from the same document) are often dependent, violating the independence assumption of the CLT.
In practice, the t-test is often applied with evaluation measures such as accuracy, UAS, and LAS, which compute the mean number of correct predictions per input example. When comparing two dependency parsers, for example, we can apply the test to check whether the average difference of their UAS scores is significantly larger than zero, which can serve as an indication that one parser is better than the other. Using the t-test with such metrics can be justified based on the CLT.
Algorithm 3.4 The Paired Sample t-test
Input: Paired samples {x_i}, {y_i}.
Output: p—the p-value.
Notations: D the differences between the two paired samples (d_i = x_i − y_i), d_i the ith observation in D, n the sample size, d̄ the sample mean of D, s_d the sample standard deviation of D.
1: Calculate the sample mean d̄ = (1/n) Σ_i d_i.
2: Calculate the sample standard deviation s_d = √( Σ_i (d_i − d̄)² / (n − 1) ).
3: Calculate the test statistic t = d̄ / (s_d / √n).
4: Find the p-value in the t-distribution table, using the predefined significance level and n − 1 degrees of freedom.
That is, accuracy measures in structured tasks tend to be normally distributed when the number of individual predictions (e.g., number of words in a sentence when considering sentence-level UAS) is large enough.
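As an illustration of Algorithm 3.4, the following is a minimal Python sketch for comparing, e.g., the per-sentence UAS scores of two parsers; the function name is ours, and recent versions of SciPy provide the same one-sided paired test as scipy.stats.ttest_rel(x, y, alternative='greater').

import numpy as np
from scipy.stats import t as t_dist

def paired_t_test(x, y):
    # x, y: paired metric values (e.g., per-sentence UAS) of the two systems.
    d = np.asarray(x) - np.asarray(y)     # paired differences
    n = len(d)
    d_bar = d.mean()                      # sample mean of the differences
    s_d = d.std(ddof=1)                   # sample standard deviation (n - 1 denominator)
    t_stat = d_bar / (s_d / np.sqrt(n))   # test statistic
    p = t_dist.sf(t_stat, df=n - 1)       # p = P(T >= t), n - 1 degrees of freedom
    return t_stat, p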