Automatic Text Simplification. Horacio Saggion

Automatic Text Simplification - Horacio Saggion Synthesis Lectures on Human Language Technologies

as insertion and reordering were also documented. Drndarevic and Saggion [2012a,b] concentrate specifically on identifying lexical changes and find, in addition to synonym substitution, cases of numerical expression re-writing (e.g., rounding), named entity reformulation, and insertion of simple definitions. Like Petersen and Ostendorf [2007], Drndarevic and Saggion train a Support Vector Machine (SVM) algorithm [Joachims, 1998] to identify sentences which could be deleted, improving over a robust baseline that always deletes the last sentence of the document.
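
The baseline mentioned above is simple enough to sketch. The following Python fragment is only an illustration, not the authors' experimental setup: it marks the last sentence of each document for deletion and scores that decision against gold deletion labels. The function names and the toy labels are invented for this example, and the SVM classifier itself is not reproduced.

```python
from typing import List, Tuple


def last_sentence_baseline(num_sentences: int) -> List[bool]:
    """Mark only the final sentence of a document for deletion."""
    if num_sentences <= 0:
        return []
    return [False] * (num_sentences - 1) + [True]


def precision_recall_f1(gold: List[bool], predicted: List[bool]) -> Tuple[float, float, float]:
    """Score predicted deletions against gold deletions (True = deleted)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy gold labels, invented for illustration: one list per document,
# True where a human editor deleted the sentence.
gold_documents = [
    [False, False, True, True],
    [False, True, False],
    [False, False, False, True],
]

gold = [label for doc in gold_documents for label in doc]
predicted = [
    label
    for doc in gold_documents
    for label in last_sentence_baseline(len(doc))
]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A learned classifier would replace `last_sentence_baseline` with per-sentence predictions and be scored with the same function, which is what makes the comparison against the baseline meaningful.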

      The creation of text simplification tools without considering a particular target population could be justifiable in that some aspects of text complexity affect a large range of users with reading difficulties. For example, long and syntactically complex sentences are generally hard to process. Some particular sentence constructions, such as those which do not follow the canonical subject-verb-object order (e.g., passive constructions), may be an obstacle for people with aphasia [Devlin and Unthank, 2006] or an autism spectrum disorder (ASD) [Yaneva et al., 2016b]. The same is true for very difficult or specialized vocabulary and infrequent words, which can also prove difficult to understand for people with aphasia [Carroll et al., 1998, Devlin and Unthank, 2006] and ASD [Norbury, 2005]. Moreover, there are also certain aspects of language that prove difficult for specific groups of readers. Language learners, for example, may have a good capacity to infer information, although they may have a very restricted lexicon and may not be able to understand certain grammatical constructions. Dyslexic readers, in turn, do not have a problem with language understanding per se, but with the understanding of the written representation of language. In addition, readers with dyslexia were found to read better when texts use more frequent and shorter words [Rello et al., 2013b]. Finally, people with intellectual disabilities may have problems processing and retaining large amounts of information [Fajardo et al., 2014, Feng et al., 2009].

      In order to create adapted versions for specific populations, various initiatives exist which promote accessible texts. An early proposal is Basic English, a language with a reduced vocabulary of just over 800 word forms and a restricted number of grammatical rules. It was promoted after World War II as a tool for international communication or a kind of interlingua [Ogden, 1937]. Other initiatives are Plain English (see “Language for Special Purposes” in Crystal [1987]) for English in the U.S. and U.K., and Rational French, a controlled French language designed to make technical documentation more accessible in the context of the aerospace industry [Barthe et al., 1999]. In Europe, there are associations dedicated to the adaptation of text materials (books, leaflets, laws, official documents, etc.) for people with disabilities or low literacy levels, examples of which are the Easy-to-Read Network in Scandinavian countries, the Asociación Lectura Fácil in Spain, and the Centrum för Lättläst in Sweden. These associations usually provide guidance or recommendations about how to prepare or adapt textual material. Some such recommendations are as follows:

      • use simple and direct language;

      • use one idea per sentence;

      • avoid jargon and technical terms;

      • avoid abbreviations;

      • structure text in a clear and coherent way;

      • use one word per concept;

      • use personalization; and

      • use active voice.

      These recommendations, although intuitive, are sometimes difficult to operationalize (for both humans and machines) and sometimes even impossible to follow, especially in the case of adapting an existing piece of text.
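
      To make the difficulty concrete, the following Python sketch turns three of the recommendations into crude surface checks. Everything in it is an assumption made for illustration: the 15-word threshold standing in for "one idea per sentence", the regular expressions for abbreviations and passive voice, and the function names. None of it comes from the guidelines themselves.

```python
import re
from typing import Dict, List

# Assumed threshold: sentences longer than this are flagged as possibly
# carrying more than one idea. The value is arbitrary.
MAX_WORDS = 15

# Crude patterns: all-caps tokens such as "EU", or dotted forms such as "U.K."
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b|\b(?:[A-Za-z]\.){2,}")

# Crude passive detector: a form of "to be" followed by a word ending in -ed/-en.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def check_guidelines(text: str) -> List[Dict[str, object]]:
    """Return, for each sentence, the list of heuristic guideline violations."""
    reports = []
    for sentence in split_sentences(text):
        issues = []
        if len(sentence.split()) > MAX_WORDS:
            issues.append("long sentence")
        if ABBREVIATION.search(sentence):
            issues.append("abbreviation")
        if PASSIVE.search(sentence):
            issues.append("possible passive voice")
        reports.append({"sentence": sentence, "issues": issues})
    return reports


sample = (
    "The form was submitted by the applicant. "
    "The EU published new rules. "
    "We help you."
)
reports = check_guidelines(sample)
for report in reports:
    print(report["sentence"], "->", report["issues"])
```

      Such checks are easy to fool. The passive pattern also fires on "is open", the sentence splitter breaks on "e.g.", and nothing here can tell whether a text is structured coherently or uses one word per concept. Detecting a problem is also much easier than repairing it in an existing text.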

      Although adapted texts have been produced for many years, nowadays there is a plethora of simplified material on the Web. The Swedish “easy-to-read” newspaper 8 Sidor is published by the Centrum för Lättläst to allow people access to “easy news.” Other examples of similarly oriented online newspapers and magazines are the Norwegian Klar Tale, the Belgian l’Essentiel and Wablieft, the Danish Radio Ligetil, the Italian Due Parole, and the Finnish Selo-Uutiset. For Spanish, the Noticias Fácil website provides easy-to-read news for people with disabilities. The Literacyworks website offers CNN news stories in original and abridged (or simplified) formats, which can be used as learning resources for adults with poor reading skills. At the European level, the Inclusion Europe website provides good examples of how full text simplifications and simplified summaries in various European languages can provide improved access to relevant information. The Simple English Wikipedia provides encyclopedic content which is more accessible than standard Wikipedia articles because of the use of simple language and simple grammatical structures. There are also initiatives which aim to make access to easy-to-read material in particular, and web accessibility in general, a legal right.

      The number of websites containing manually simplified material pointed out above clearly indicates a need for simplified texts. However, manual simplification of written documents is very expensive, and manual methods will not be cost-effective, especially if we consider that news is constantly being produced and simplification would therefore need to keep the same pace. Nevertheless, there is a growing need for methods and techniques to make texts more accessible. For example, people with learning disabilities who need simplified text constitute 5% of the population. However, according to data from the Easy-to-Read Network, if we consider people who cannot read documents with heavy information load or documents from authorities or governmental sources, the percentage of people in need of simplification jumps to 25% of the population. In addition, the need for simplified texts is becoming more important because the incidence of disability increases as the population ages.

      Having briefly introduced what automatic text simplification is and why such technology is needed, we devote the rest of the book to a number of relevant research methods in the field, which has been the object of scientific inquiry for more than 20 years. Needless to say, many relevant works will not be addressed here; however, we have tried to cover most of the techniques which have been used, or are being used, at the time of writing. In Chapter 2, we will provide an overview of the topic of readability assessment given its current relevance in many approaches to automatic text simplification. In Chapter 3, we will present techniques which have been proposed to address the problem of replacing words and phrases by simpler equivalents: the lexical simplification problem. In Chapter 4, we will cover techniques which can be used to simplify the syntactic structure of sentences and phrases, with special emphasis on rule-based linguistically motivated approaches. Then in Chapter 5, machine learning techniques, optimization, and other statistical techniques to “learn” simplification systems will be described. Chapters 6 and 7 cover closely related topics: in Chapter 6, we will present fully fledged text simplification systems which have as users specific target populations, while in Chapter 7, we will cover sub-systems or methods specifically based on targeted tasks or user characteristics. In Chapter 8, we will cover two important topics: the available datasets for experimentation in text simplification and the current text
