Statistical Significance Testing for Natural Language Processing. Rotem Dror
The anonymous reviewers of the book and original papers provided detailed comments on various aspects of this work, from minor technical details to valuable suggestions on the structure, that dramatically improved its quality. Graeme Hirst, Michael Morgan, and Christine Kiilerich orchestrated the book-writing effort and provided valuable guidance.
Finally, we would like to thank the Technion Graduate School for its generous support. Rotem Dror has also been supported by a generous Google Ph.D. fellowship.
Needless to say, all the mistakes and shortcomings of the book are ours. Please let us know if you find any.
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
April 2020
CHAPTER 1
Introduction
The field of Natural Language Processing (NLP) has made substantial progress in the last two decades. This progress stems from several factors: the data revolution that has made abundant textual data from a variety of languages and linguistic domains available, the development of increasingly effective predictive statistical models, and the availability of hardware that can apply these models to large datasets. This dramatic improvement in the capabilities of NLP algorithms carries the potential for a great impact.
The extended reach of NLP algorithms has also resulted in NLP papers giving more and more emphasis to the experiment and result sections by showing comparisons between multiple algorithms on various datasets from different languages and domains. It can be safely argued that the ultimate test for the quality of an NLP algorithm is its performance on well-accepted datasets, sometimes referred to as “leader-boards”. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.
The goal of this book is to discuss the main aspects of statistical significance testing in NLP. In particular, we aim to briefly summarize the main concepts so that they are readily available to the interested researcher, address the key challenges of hypothesis testing in the context of NLP tasks and data, and discuss open issues and the main directions for future work.
We start with two introductory chapters that present the basic concepts of statistical significance testing: Chapter 2 provides a brief presentation of the hypothesis testing framework and Chapter 3 introduces common statistical significance tests. Then, Chapter 4 discusses the application of statistical significance testing to NLP. In this chapter we assume that two algorithms are compared on a single dataset, based on a single output that each of them produces, and discuss the relevant significance tests for various NLP tasks and evaluation measures. The chapter emphasizes the aspects in which NLP tasks and data differ from common examples in the statistical literature, such as the non-Gaussian distribution of the data and the dependence between the participating examples (e.g., sentences in the same corpus). This chapter, which extends our ACL 2018 paper [Dror et al., 2018], provides our recommended matching between NLP tasks, their evaluation measures, and statistical significance tests.
The next two chapters relax two of the basic assumptions of Chapter 4: (a) that each of the compared algorithms produces a single output for each test example (e.g., a single parse tree for a given input sentence); and (b) that the comparison between the two algorithms is performed on a single dataset. In particular, Chapter 5 addresses the comparison between two algorithms based on multiple solutions that each of them produces for a single dataset, while Chapter 6 addresses the comparison between two algorithms across several datasets.
The first challenge stems from the recent emergence of Deep Neural Networks (DNNs), which has made data-driven performance comparison much more complicated. This is because these models are non-deterministic due to their non-convex objective functions, complex hyperparameter tuning process, and training heuristics such as random dropouts that are often applied in their implementation. Chapter 5 hence defines a framework for a statistically valid comparison between two DNNs based on multiple solutions each of them produces for a given dataset. The chapter summarizes previous attempts in the NLP literature to perform this comparison task and evaluates them in light of the proposed framework. Then, it presents a new comparison method that is better fitted to the pre-defined framework. This chapter is based on our ACL 2019 paper [Dror et al., 2019].
The second challenge is crucial for the efforts to extend the reach of NLP technology to multiple domains and languages. These well-justified efforts result in a large number of comparisons between algorithms, across corpora from a large number of languages and domains. The goal of Chapter 6 is to provide the NLP community with a statistical analysis framework, termed Replicability Analysis, which will allow us to draw statistically sound conclusions in evaluation setups that involve multiple comparisons. The classical goal of replicability analysis is to examine the consistency of findings across studies in order to address the basic dogma of science, namely that a finding is more convincingly true if it is replicated in at least one more study [Heller et al., 2014, Patil et al., 2016]. We adapt this goal to NLP, where we wish to ascertain the superiority of one algorithm over another across multiple datasets, which may come from different languages, domains, and genres. This chapter is based on our TACL paper [Dror et al., 2017].
Finally, while this book aims to provide a basic framework for proper statistical significance testing in NLP research, it is by no means the final word on this topic. Indeed, Chapter 7 presents a list of open questions that are still to be addressed in future research. We hope that this book will contribute to the evaluation practices in our community and eventually to the development of more effective NLP technology.
CHAPTER 2
Statistical Hypothesis Testing
We begin with a definition of the statistical hypothesis testing framework. This fundamental framework will then allow us to discuss statistical significance tests (Chapter 3) and later on their application to experimental research in NLP.
A statistical hypothesis is defined as a hypothesis that is testable by observing and analyzing a process modeled by a set of random variables. In the basic setting, two datasets are compared and a hypothesis is proposed for the statistical relationship between them. This hypothesis is usually suggested as an alternative to an ideal null hypothesis that (often) proposes no relationship between the two datasets. If the observed relationship between the datasets is sufficiently unlikely under the null hypothesis, that is, if its probability falls below a threshold called the significance level, the null hypothesis is rejected.
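The decision rule described above can be illustrated with a minimal sketch. It is not a procedure taken from this chapter: it assumes a paired sign-flip permutation test as the way to compute how unlikely the observed difference is under the null hypothesis, and the function names, the score lists, and the default significance level of 0.05 are all hypothetical choices made for illustration.

```python
import random


def paired_permutation_p_value(scores_a, scores_b, n_resamples=10000, seed=0):
    """One-sided paired sign-flip permutation test (illustrative sketch).

    Null hypothesis: the per-example differences are symmetric around zero,
    i.e., there is no systematic advantage of A over B.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each difference is arbitrary.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if resampled >= observed:
            at_least_as_extreme += 1
    # Counting the observed statistic itself keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


def reject_null(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is at most the significance level."""
    return p_value <= alpha


# Hypothetical per-example scores of two algorithms on the same 12 test examples.
scores_a = [0.81, 0.79, 0.85, 0.88, 0.90, 0.76, 0.83, 0.87, 0.80, 0.84, 0.86, 0.82]
scores_b = [0.77, 0.74, 0.80, 0.85, 0.84, 0.73, 0.79, 0.81, 0.76, 0.80, 0.83, 0.78]

p_value = paired_permutation_p_value(scores_a, scores_b)
decision = reject_null(p_value)
```

Here `alpha` plays the role of the significance level: a smaller value demands stronger evidence before the null hypothesis is rejected.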
In order to distinguish between the null hypothesis and the alternative hypothesis, we consider two conceptual types of errors. The first type of error occurs when the null hypothesis is wrongly rejected while the second occurs when we wrongfully do not reject the null hypothesis. These two types of errors are known as type I and type II errors, and we will further elaborate on them later on.
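The two error types can be made concrete with a small simulation. The sketch below rests on assumptions that are not in the text: it uses an exact one-sided sign test, treats each test example as an independent "win" for algorithm A with a fixed probability, and uses hypothetical names and parameter values (`true_win_prob`, 50 examples, a significance level of 0.05).

```python
import random
from math import comb


def sign_test_p_value(wins, n):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def rejection_rate(true_win_prob, n=50, alpha=0.05, n_experiments=2000, seed=0):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Each of the n examples is won by algorithm A with probability true_win_prob.
        wins = sum(rng.random() < true_win_prob for _ in range(n))
        if sign_test_p_value(wins, n) <= alpha:
            rejections += 1
    return rejections / n_experiments


# Null hypothesis true (no real difference): every rejection is a type I error.
type_1_rate = rejection_rate(true_win_prob=0.5)

# Null hypothesis false (A really wins 70% of the time):
# every failure to reject is a type II error.
type_2_rate = 1 - rejection_rate(true_win_prob=0.7)
```

In this setup the type I error rate is bounded by the significance level, while the type II error rate depends on how large the true difference is and on how many examples are available.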