Automatic Text Simplification. Horacio Saggion



Synthesis Lectures on Human Language Technologies



       Acknowledgments

      I am indebted to my fellow colleagues Stefan, Sanja, Biljana, Susana, Luz, Daniel, Simon, and Montserrat for sharing their knowledge and expertise with me.

      Horacio Saggion

      January 2017

      CHAPTER 1

       Introduction

      Automatic text simplification is a research field in computational linguistics that studies methods and techniques to simplify textual content. Text simplification methods should facilitate, or at least speed up, the adaptation of existing and future textual material, making information genuinely accessible to all. Adapted texts usually (but not necessarily) involve some loss of information and a plainer style, which is not necessarily a bad thing if a message that was originally complicated can in the end be understood by the target reader. Text simplification has also been suggested as a potential pre-processing step for making texts easier to handle by generic text processors, such as parsers, or for use in specific information access tasks, such as information extraction. Simplifying for human readers is the more challenging of these two uses, because even the smallest error can make the output of an automatic system appear inadequate.

      The interest in automatic text simplification has grown in recent years, yet in spite of the many approaches and techniques proposed, automatic text simplification is, as of today, far from perfect. The growing interest in text simplification is evidenced by the number of languages targeted by researchers worldwide. Simplification systems and simplification studies exist at least for English [Carroll et al., 1998, Chandrasekar et al., 1996, Siddharthan, 2002], Brazilian Portuguese [Aluísio and Gasperin, 2010], Japanese [Inui et al., 2003], French [Seretan, 2012], Italian [Barlacchi and Tonelli, 2013, Dell’Orletta et al., 2011], Basque [Aranzabe et al., 2012], and Spanish [Saggion et al.].

      Although there are many text characteristics which can be modified in order to make a text more readable or understandable, including the way in which the text is presented, automatic text simplification has usually concentrated on two different tasks—lexical simplification and syntactic simplification—each addressing different sub-problems.

      Lexical simplification will attempt either to modify the vocabulary of the text by choosing words thought to be more appropriate for the reader (e.g., transforming the sentence “The book was magnificent” into “The book was excellent”) or to include appropriate definitions (e.g., transforming the sentence “The boy had tuberculosis.” into “The boy had tuberculosis, a disease of the lungs.”). Changing words in context is not an easy task, because an ill-chosen substitute can easily distort the original meaning.
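      As a concrete illustration of the substitution variant of lexical simplification, the following minimal Python sketch replaces a word with a more frequent synonym, using corpus frequency as a rough proxy for simplicity; the tiny synonym dictionary and the frequency counts are invented for illustration and are not part of any system discussed in this book.

      # Minimal lexical-simplification sketch: replace a word with a synonym
      # assumed to be simpler because it is more frequent in some large corpus.
      # The synonym dictionary and frequency counts are toy data.
      SYNONYMS = {"magnificent": ["excellent", "splendid"]}
      FREQUENCY = {"magnificent": 1200, "excellent": 85000, "splendid": 9000}

      def simplify_word(word: str) -> str:
          """Return the most frequent known synonym, or the word unchanged."""
          candidates = SYNONYMS.get(word.lower(), [])
          best = max(candidates, key=lambda w: FREQUENCY.get(w, 0), default=None)
          if best and FREQUENCY.get(best, 0) > FREQUENCY.get(word.lower(), 0):
              return best
          return word

      def simplify_sentence(sentence: str) -> str:
          # Naive whitespace tokenization and no context check, so exactly the
          # meaning-distortion risk noted above applies here.
          return " ".join(simplify_word(t) for t in sentence.split())

      print(simplify_sentence("The book was magnificent"))  # The book was excellent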

      Syntactic simplification will try to identify syntactic phenomena in sentences which may hinder readability and comprehension, in an effort to transform those sentences into more readable or understandable equivalents. Relative or subordinate clauses and passive constructions, for instance, may be very difficult for certain readers and could be rewritten as simpler sentences or in active form. For example, the sentence “The festival was held in New Orleans, which was recovering from Hurricane Katrina” could be transformed, without altering the original too much, into “The festival was held in New Orleans. New Orleans was recovering from Hurricane Katrina.”
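      The clause-splitting transformation just illustrated can be sketched with a purely pattern-based rule; the regular expression below and its antecedent heuristic (take the run of capitalized words immediately before the comma) are simplifying assumptions made for illustration, whereas real systems rely on full syntactic parses.

      import re

      # Naive rule: split "X <NP>, which Y" into "X <NP>. <NP> Y.", where <NP>
      # is the run of capitalized words immediately preceding ", which".
      PATTERN = re.compile(
          r"^(?P<main>.*?(?P<np>[A-Z]\w*(?: [A-Z]\w*)*)), which (?P<rel>[^.]+)\.?$")

      def split_relative_clause(sentence: str) -> str:
          m = PATTERN.match(sentence)
          if not m:
              return sentence  # rule does not apply; leave the sentence alone
          return f"{m.group('main')}. {m.group('np')} {m.group('rel')}."

      print(split_relative_clause("The festival was held in New Orleans, "
                                  "which was recovering from Hurricane Katrina"))
      # The festival was held in New Orleans. New Orleans was recovering
      # from Hurricane Katrina.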

      As we shall see later, automatic text simplification is related to other natural language processing tasks such as text summarization and machine translation. The objective of text summarization is to reduce a text to its essential content, which can be useful for simplification when the text to simplify contains many unnecessary details. The objective of machine translation is to transform a text into a semantically equivalent text in another language. A number of recent automatic text simplification approaches cast text simplification as statistical machine translation; however, this approach is currently limited by the scarcity of parallel simplification data.
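      To make the machine translation view concrete, the standard noisy-channel decomposition (a textbook formulation assumed here for illustration, not a formula given in this chapter) treats the complex sentence c as a distorted version of an underlying simple sentence s, and searches for

          \hat{s} = \arg\max_{s} P(s \mid c) = \arg\max_{s} P(c \mid s)\, P(s),

      where P(c | s) is a translation model estimated from parallel complex–simple sentence pairs and P(s) is a language model trained on simple text. The dependence of P(c | s) on parallel pairs is precisely why the scarcity of parallel simplification data limits this family of approaches.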

      There is an important point to mention here: although lexical and syntactic simplification have usually been addressed separately, they are naturally related. If, during syntactic simplification, a particular syntactic structure is chosen to replace a complex construction, it might also be necessary to apply transformations at the lexical level to keep the text grammatical. Furthermore, since a text is a coherent and cohesive unit, any change at a local level (words or sentences) will in one way or another affect textual properties at both the local and the global level: for example, replacing a masculine noun with a feminine synonym during lexical simplification will, in some languages, require repairing local elements such as determiners and adjectives, as well as pronouns or definite expressions in preceding or following sentences. Pragmatic aspects of the text, such as the way in which the original text was crafted to communicate a message to specific audiences, are generally ignored by current systems.
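      A toy sketch of the kind of local repair involved, for a Spanish example, is given below; the miniature gendered lexicon and determiner table are invented for illustration, and a full system would also have to propagate agreement to adjectives and to pronouns in surrounding sentences.

      # Toy agreement repair after lexical substitution in Spanish: replacing
      # a masculine noun with a feminine one forces the determiner to change.
      # The lexicon and determiner table are invented for illustration.
      GENDER = {"coche": "m", "camioneta": "f"}
      DETERMINERS = {("el", "f"): "la", ("la", "m"): "el",
                     ("un", "f"): "una", ("una", "m"): "un"}

      def substitute_noun(tokens, i, new_noun):
          """Replace tokens[i] and repair the immediately preceding determiner."""
          out = list(tokens)
          out[i] = new_noun
          if i > 0:
              fixed = DETERMINERS.get((out[i - 1].lower(), GENDER[new_noun]))
              if fixed:
                  out[i - 1] = fixed
          return out

      print(" ".join(substitute_noun("el coche es rápido".split(), 1, "camioneta")))
      # la camioneta es rápido  -- the adjective "rápido" would still need
      # repair to "rápida"; that propagation is what a real system must do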

      As we shall see in this book, most approaches treat text simplification as a sequence of transformations at the word or sentence level, disregarding the global textual context (the preceding and following text units), thereby affecting important properties such as cohesion and coherence.

      Various studies have investigated the ways in which a given text is transformed into an easier-to-read version. In order to understand which text transformations would be needed and which could be implemented automatically, Petersen and Ostendorf [2007] analyzed a corpus of original and abridged CNN news articles in English (114 pairs), distributed by the Literacyworks organization,1 and aimed at adult learners (i.e., native speakers of English with poor reading skills). They first aligned the original and abridged versions of the news articles, looking for the abridged-version sentences corresponding to each original-version sentence. Having aligned the corpus, they observed that sentences from the original documents can be dropped (around 30%), aligned to exactly one sentence (47%), or aligned to several sentences (19%, splits) in the abridged version. The one-to-one alignments correspond to cases where the original sentence is kept practically untouched, cases where only part of the original sentence is kept, and cases of major rewriting operations. A small fraction of pairs of original sentences were also aligned to a single abridged sentence, accounting for merges. Petersen and Ostendorf’s study also tries to automatically identify sentences in the original document which should be split, since those are good candidates for simplification. Their approach consists of training a decision-tree learning algorithm (C4.5 [Quinlan, 1993]) to classify each sentence as split or non-split. They used various features, including sentence length and several statistics over POS tags and syntactic constructions. Cross-validation experiments showed that it is difficult to differentiate between the two classes; moreover, sentence length is the most informative feature and explains much of the classification performance. Another interesting contribution is their study of dropped sentences, for which they trained a classifier with some features borrowed from summarization research; however, this classifier is only slightly better than a majority baseline (i.e., never drop).
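      A minimal sketch of such a split/non-split classifier is given below. C4.5 itself is not available in scikit-learn, so the CART-based DecisionTreeClassifier stands in for it; the feature set is reduced to sentence length (the feature they found most informative) plus a comma count, and the handful of labeled sentences is invented for illustration.

      # Split/non-split sentence classification in the spirit of Petersen and
      # Ostendorf [2007]; DecisionTreeClassifier (CART) stands in for C4.5,
      # and the tiny labeled data set below is invented for illustration.
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      def features(sentence):
          tokens = sentence.split()
          return [len(tokens), sentence.count(",")]  # length + clause proxy

      data = [
          ("The cat sat on the mat.", 0),
          ("He left early.", 0),
          ("Prices rose sharply last month.", 0),
          ("The festival, which drew thousands of visitors despite the rain, "
           "was held in New Orleans, a city still recovering from the storm.", 1),
          ("The report, published after a lengthy inquiry that interviewed "
           "dozens of witnesses, recommends sweeping changes to the agency.", 1),
          ("The committee, whose members were appointed years earlier, found "
           "that the policy, although well intentioned, had failed.", 1),
      ]
      X = [features(s) for s, _ in data]
      y = [label for _, label in data]

      clf = DecisionTreeClassifier(max_depth=2, random_state=0)
      print(cross_val_score(clf, X, y, cv=3).mean())  # 3-fold cross-validation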

      In a similar way, Bott and Saggion [2011b] and Drndarevic and Saggion [2012a,b] identified a series of transformations that trained editors apply to produce simplified versions of documents. Their case is notably different from that of Petersen and Ostendorf [2007], given the language under study—Spanish—and the target population of the simplified versions: people with cognitive disabilities. Bott and Saggion [2011b] analyzed a sample of sentence-aligned original and simplified documents to identify expected simplification operations such as sentence splitting, sentence deletion, and various types of change operations (syntactic, lexical,
