Statistical Significance Testing for Natural Language Processing. Rotem Dror


Statistical Significance Testing for Natural Language Processing - Rotem Dror Synthesis Lectures on Human Language Technologies


framework, and Chapter 3 introduces common statistical significance tests. Then, Chapter 4 discusses the application of statistical significance testing to NLP. In Chapter 4, we assume that two algorithms are compared on a single dataset, based on a single output that each of them produces, and discuss the relevant significance tests for various NLP tasks and evaluation measures. The chapter emphasizes the aspects in which NLP tasks and data differ from common examples in the statistical literature, such as the non-Gaussian distribution of the data and the dependence between the participating examples (e.g., sentences in the same corpus). This chapter, which extends our ACL 2018 paper [Dror et al., 2018], provides our recommended matching of NLP tasks and their evaluation measures to statistical significance tests.
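      To make the single-dataset setup concrete, the following is a minimal sketch of one sampling-based test that makes no Gaussian assumption: a one-sided paired sign-flip permutation test on per-example scores. It is an illustration only and is not necessarily the test recommended for any particular task or evaluation measure. The function name, parameters, and toy scores are hypothetical, and the sketch assumes that per-example score differences are exchangeable in sign under the null hypothesis, which may not hold when test examples are dependent.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Approximate one-sided p-value for H0: algorithm A is not better than B.

    scores_a and scores_b are per-example scores of the two algorithms on the
    same test examples (hypothetical inputs for this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both algorithms must be scored on the same examples.")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the sign of each paired difference is assumed arbitrary,
        # so flip each sign with probability 0.5 and recompute the statistic.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy per-example correctness indicators (1 = correct), invented for illustration.
toy_a = [1] * 20
toy_b = [0] * 20
p_value = paired_permutation_test(toy_a, toy_b)
```

      A small p-value here suggests that the observed advantage of A over B on this dataset would be unlikely if the signs of the differences were arbitrary.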

      The next two chapters relax two of the basic assumptions of Chapter 4: (a) that each of the compared algorithms produces a single output for each test example (e.g., a single parse tree for a given input sentence), and (b) that the comparison between the two algorithms is performed on a single dataset. Specifically, Chapter 5 addresses the comparison between two algorithms based on multiple solutions that each of them produces for a single dataset, while Chapter 6 addresses the comparison between two algorithms across several datasets.

      The first challenge stems from the recent emergence of Deep Neural Networks (DNNs), which has made data-driven performance comparison much more complicated. This is because these models are non-deterministic, owing to their non-convex objective functions, complex hyperparameter tuning process, and training heuristics such as random dropout that are often applied in their implementation. Chapter 5, therefore, defines a framework for a statistically valid comparison between two DNNs based on multiple solutions each of them produces for a given dataset. The chapter summarizes previous attempts in the NLP literature to perform this comparison task and evaluates them in light of the proposed framework. Then, it presents a new comparison method that is better fitted to the pre-defined framework. This chapter is based on our ACL 2019 paper [Dror et al., 2019].

      The second challenge is crucial for the efforts to extend the reach of NLP technology to multiple domains and languages. These well-justified efforts result in a large number of comparisons between algorithms, across corpora from a large number of languages and domains. The goal of Chapter 6 is to provide the NLP community with a statistical analysis framework, termed Replicability Analysis, which will allow us to draw statistically sound conclusions in evaluation setups that involve multiple comparisons. The classical goal of replicability analysis is to examine the consistency of findings across studies in order to address the basic dogma of science, namely that a finding is more convincingly true if it is replicated in at least one more study [Heller et al., 2014, Patil et al., 2016]. We adapt this goal to NLP, where we wish to ascertain the superiority of one algorithm over another across multiple datasets, which may come from different languages, domains, and genres. This chapter is based on our TACL paper [Dror et al., 2017].
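      To illustrate why multiple comparisons need special care, the sketch below applies the Holm step-down procedure, a standard correction that controls the family-wise error rate, to a list of per-dataset p-values. This is a generic procedure swapped in for illustration and is not the Replicability Analysis framework itself. The function name and the p-values are hypothetical.

```python
def holm_rejections(p_values, alpha=0.05):
    """Return one boolean per p-value: True if Holm's procedure rejects it."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens from alpha / m up to alpha as rank grows.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            # Once one hypothesis is retained, all larger p-values are too.
            break
    return rejected


# Invented p-values from comparing two algorithms on four datasets.
toy_p_values = [0.01, 0.04, 0.03, 0.005]
decisions = holm_rejections(toy_p_values)
significant_count = sum(decisions)
```

      In this toy example all four p-values fall below 0.05, so counting them naively would report four significant datasets, whereas the corrected procedure rejects the null hypothesis on only two of them.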

      Finally, while this book aims to provide a basic framework for proper statistical significance testing in NLP research, it is by no means the final word on this topic. Indeed, Chapter 7 presents a list of open questions that are still to be addressed in future research. We hope that this book will contribute to the evaluation practices in our community and eventually to the development of more effective NLP technology.

       INTENDED READERSHIP

      The book is intended for researchers and practitioners in NLP who would like to analyze their experimental results in a statistically sound manner. Hence, we assume technical background in computer science and related areas such as statistics and probability, mostly at the undergraduate level. Moreover, while in Chapter 4 we discuss various NLP tasks and their proposed significance tests, our discussion of these tasks is quite shallow. Furthermore, when we analyze experimental results on NLP tasks in Chapters 5 and 6, we do not provide the details of the tasks because we assume the reader is familiar with the basic tasks of NLP. Despite these assumptions about the reader’s background, we try as much as possible to be self-contained when it comes to statistical hypothesis testing and the derived concepts and methodology, as presenting these ideas to the NLP audience is a core objective of this book.

      Further Reading. For broader and more in-depth reading on the fundamental concepts of statistics, we refer the reader to other existing resources such as Montgomery and Runger [2007] (which provides an engineering perspective) and Johnson and Bhattacharyya [2019]. For further reading on the topic of multiple comparisons in statistics, we recommend the book by Bretz et al. [2016], which demonstrates the basic concepts and provides examples with R code.

      This book evolved from a series of conference and journal papers (Dror et al. [2017], Dror et al. [2018], Dror et al. [2019]), which have been greatly expanded in order to form this book. First, we added background chapters that discuss the foundations of statistical hypothesis testing and provide the details of the statistical significance tests that we find most relevant for NLP. Second, we take the handbook approach and provide the pseudocode of the various methods discussed throughout the book, along with concrete recommendations and guidelines; our goal is to allow the practitioner to directly and easily implement the methods described in this book. Finally, in Chapter 7, we critically discuss the ideas presented in this book and point to challenges that are yet to be addressed in order to perform statistically sound analysis of NLP experimental results.

       FOCUS OF THIS BOOK

      This book is intended to be self-contained, presenting the framework of statistical hypothesis testing and its derived concepts and methodology in the context of NLP research. However, the main focus of the book is on this statistical framework and its application to the analysis of NLP experimental results, rather than on providing in-depth coverage of the NLP field.

      Most of the book takes the handbook approach and aims to provide concrete solutions to practical problems. As such, it does not provide in-depth technical coverage of statistical hypothesis testing to a level that will allow the reader to propose alternative solutions to those proposed here, or to solve some of the open challenges we point to. Yet, our hope is that highlighting the challenges of statistically sound evaluation of NLP experiments, both those that already have decent solutions and those that are still open, will attract the attention of the community to these issues and facilitate future development of additional methods and techniques.

      Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart

      April 2020

      This book is an outcome of three years of exploration. The journey started with a course by Dr. Marina Bogomolov on multiple hypothesis testing, which was given in the fall of 2017 at the Faculty of Industrial Engineering and Management (IE&M) of the Technion. Marina, as well as Gili Baumer, her M.Sc. student and the tutor of the course at the time, were instrumental in the research that resulted in Chapter 6 of this book.

      Many people commented on the ideas we discuss in the book, read drafts of the papers that were eventually extended into this book as well as versions of the book itself, and provided valuable feedback. Among these are David Azriel, Eustasio Del Barrio, Yuval Pinter, David Traum (who, as the program chair of ACL 2019, made a substantial
