Statistical Significance Testing for Natural Language Processing. Rotem Dror


Series: Synthesis Lectures on Human Language Technologies


Output : Decision to either reject the null hypothesis in favor of the alternative or not reject it.

      Note: steps 1–5 are the same as in Algorithm 3.1.

      6:Calculate the p-value—the probability, under the null hypothesis H0, of observing a test statistic at least as extreme as that which was observed.

      7:Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if and only if the p-value is less than (or equal to) α.
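      Steps 6 and 7 can be sketched in Python as follows. This is only a minimal illustration, not code from the book: the function name `p_value_decision` and the `null_upper_tail` argument are hypothetical, and the usage example assumes a one-sided test whose statistic follows a standard normal distribution under H0.

```python
from statistics import NormalDist
from typing import Callable, Tuple


def p_value_decision(
    observed_statistic: float,
    null_upper_tail: Callable[[float], float],
    alpha: float = 0.05,
) -> Tuple[float, bool]:
    """Return (p-value, reject_H0) for a one-sided test.

    null_upper_tail(s) must return P(S >= s) under the null hypothesis,
    where S is the test statistic (an assumption of this sketch).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Step 6: probability, under H0, of a statistic at least as extreme.
    p_value = null_upper_tail(observed_statistic)
    # Step 7: reject H0 if and only if the p-value does not exceed alpha.
    return p_value, p_value <= alpha


def standard_normal_upper_tail(z: float) -> float:
    """P(Z >= z) for Z ~ N(0, 1); assumed null distribution for the example."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    p, reject = p_value_decision(1.96, standard_normal_upper_tail, alpha=0.05)
    print(f"p-value = {p:.4f}, reject H0: {reject}")
```

      Passing the null distribution in as a function keeps the decision rule separate from the choice of test, mirroring how steps 1–5 are shared between the two algorithms.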

      The notion of paired vs. independent samples is crucial in NLP. We often compare several algorithms on the same dataset, and hence paired tests are more common. In what follows, we survey prominent parametric and nonparametric tests, emphasizing the paired setup. In addition, Algorithms 3.1 and 3.2 present pseudocode for the general process that is applied when testing for statistical significance. The two processes are equivalent.

      As previously defined, parametric tests are statistical significance tests that assume prior knowledge regarding the test statistic’s distribution under the null hypothesis. When using such tests, we utilize the test statistic’s assumed distribution in order to ensure a bound on the type I error and a low probability of making a type II error. We will now elaborate on several prominent parametric tests that are suitable for the setup of paired samples.

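      Algorithm 3.3 Paired Z-test

      Input : Paired samples $\{x_i\}_{i=1}^{n}$, $\{y_i\}_{i=1}^{n}$; $\sigma$, the standard deviation of the paired differences.

      Output : p, the p-value.

      Notations : $n$ the sample size.

      1:Calculate the mean of the paired differences $\bar{d} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)$.

      2:Calculate the test statistic $z = \frac{\bar{d}}{\sigma/\sqrt{n}}$.

      3:Calculate $p = P(Z \geq z)$ where $Z \sim N(0, 1)$.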

      We begin with tests that are highly relevant to NLP setups, covering cases where the metric values come from a normal distribution. Relevant NLP metrics include sentence-level accuracy, recall, unlabeled attachment score (UAS), and labeled attachment score (LAS) [Yeh, 2000].

      Paired Z-test In this test, the sample is assumed to be normally distributed and the standard deviation of the population is known. The test is used to validate the hypothesis that the sample was drawn from the same population, by checking whether the sample mean equals the population mean. It is rarely applicable in NLP, since the population standard deviation is seldom known, but we define it here for completeness. The test that validates the same hypothesis without assuming a known standard deviation is one of the most commonly used tests in NLP: the t-test, which is described next. The Z-test is defined in Algorithm 3.3.
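      The paired Z-test of Algorithm 3.3 could be written in Python roughly as below. This is a sketch under stated assumptions rather than a reference implementation: the function name `paired_z_test` is hypothetical, the differences are taken as x_i - y_i, the test is one-sided (p = P(Z >= z)), and `sigma` is assumed to be the known standard deviation of the paired differences.

```python
from math import sqrt
from statistics import NormalDist, fmean
from typing import Sequence, Tuple


def paired_z_test(
    x: Sequence[float], y: Sequence[float], sigma: float
) -> Tuple[float, float]:
    """One-sided paired Z-test; returns (z, p) with p = P(Z >= z).

    sigma is the (assumed known) standard deviation of the paired differences.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(x)
    # Step 1: mean of the paired differences.
    d_bar = fmean([a - b for a, b in zip(x, y)])
    # Step 2: test statistic.
    z = d_bar / (sigma / sqrt(n))
    # Step 3: upper-tail probability under N(0, 1).
    p = 1.0 - NormalDist().cdf(z)
    return z, p


if __name__ == "__main__":
    # Made-up per-example scores for two systems, for illustration only.
    system_a = [0.9, 0.8, 0.7, 0.6]
    system_b = [0.8, 0.7, 0.6, 0.5]
    z, p = paired_z_test(system_a, system_b, sigma=0.1)
    print(f"z = {z:.3f}, p = {p:.4f}")
```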

      Paired Student’s t-test This test aims to assess whether the population means of two sets of measurements differ from each other, and is based on the assumption that both samples come from a normal distribution [Fisher, 1937]. The calculations of the test statistic and the p-value for this test are shown in Algorithm 3.4.
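      As a rough illustration of the calculation in Algorithm 3.4, the following Python sketch computes the t-statistic and a one-sided p-value using only the standard library. The function names (`t_pdf`, `t_upper_tail`, `paired_t_test`) are hypothetical, and two choices are assumptions of this sketch rather than the book's: the test is one-sided (p = P(T >= t)), and the tail probability is obtained by numerically integrating the Student's t density with Simpson's rule instead of looking it up in a table.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import fmean, stdev
from typing import Sequence, Tuple


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm) * (1.0 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """Approximate P(T >= t) via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0  # integral of the density from 0 to |t|
    return 0.5 - area if t > 0 else 0.5 + area


def paired_t_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """One-sided paired t-test; returns (t, p) with p = P(T >= t), df = n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    d_bar = fmean(d)   # step 1: sample mean of the differences
    s_d = stdev(d)     # step 2: sample standard deviation (n - 1 denominator)
    if s_d == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = d_bar / (s_d / sqrt(n))  # step 3: test statistic
    return t, t_upper_tail(t, n - 1)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    t, p = paired_t_test([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
    print(f"t = {t:.3f}, p = {p:.4f}")
```

      In practice a library routine would normally be preferred; for example, SciPy provides `scipy.stats.ttest_rel` for paired samples. The hand-rolled version above is meant only to make the individual steps visible.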

      Since this test assumes a normal distribution and is computed over population means, one may argue that, based on the Central Limit Theorem (CLT), it can be applied to compare any sufficiently large sets of measurements; however, in NLP setups the test examples (e.g., sentences from the same document) are often dependent, violating the independence assumption of the CLT.

      In practice, the t-test is often applied with evaluation measures such as accuracy, UAS, and LAS, which compute the mean number of correct predictions per input example. When comparing two dependency parsers, for example, we can apply the test to check whether the average difference of their UAS scores is significantly larger than zero, which can serve as an indication that one parser is better than the other. Using the t-test with such metrics can be justified based on the CLT.

      Algorithm 3.4 Paired Student's t-test

      Input : Paired samples.

      Output : p, the p-value.

      Notations : $D$ the differences between the two paired samples, $d_i$ the $i$th observation in $D$, $n$ the sample size, $\bar{d}$ the sample mean of the differences, $s_d$ the sample standard deviation of the differences, $T$ the critical value of a t-distribution with $n - 1$ degrees of freedom, $t$ the t-statistic (t-test statistic) for a paired sample t-test.

      1:Calculate the sample mean $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$.

      2:Calculate the sample standard deviation $s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2}$.

      3:Calculate the test statistic $t = \frac{\bar{d}}{s_d/\sqrt{n}}$.

      4:Find the p-value in the t-distribution table, using the predefined significance level and $n - 1$ degrees of freedom.

      That is, accuracy measures in structured tasks tend to be normally distributed when the number of individual predictions (e.g., number of words in a sentence when considering sentence-level UAS) is large enough.
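      The effect of the number of individual predictions can be illustrated with a small simulation. The sketch below is purely illustrative and rests on strong assumptions that are not from the book: every word is predicted correctly with the same probability `p_correct`, independently of all other words, and the function names are hypothetical. It compares the skewness of simulated sentence-level accuracies for short and long sentences; a skewness closer to zero is consistent with a more symmetric, more nearly normal distribution.

```python
import random
from statistics import fmean, pstdev
from typing import List, Sequence


def sentence_accuracies(
    num_sentences: int, words_per_sentence: int, p_correct: float, seed: int = 0
) -> List[float]:
    """Simulate per-sentence accuracy with independent word-level predictions."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p_correct for _ in range(words_per_sentence))
        / words_per_sentence
        for _ in range(num_sentences)
    ]


def skewness(values: Sequence[float]) -> float:
    """Sample skewness: mean of cubed standardized values (0 for symmetric data)."""
    mu = fmean(values)
    sd = pstdev(values)
    if sd == 0.0:
        raise ValueError("skewness is undefined for constant data")
    return fmean([((v - mu) / sd) ** 3 for v in values])


if __name__ == "__main__":
    short = sentence_accuracies(2000, words_per_sentence=5, p_correct=0.9)
    long_ = sentence_accuracies(2000, words_per_sentence=200, p_correct=0.9)
    print(f"skewness with 5 words per sentence:   {skewness(short):.3f}")
    print(f"skewness with 200 words per sentence: {skewness(long_):.3f}")
```

      Real test sets rarely satisfy the independence assumption made here, so the simulation shows only the idealized behavior.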
