Linked Lexical Knowledge Bases. Iryna Gurevych

Linked Lexical Knowledge Bases - Iryna Gurevych Synthesis Lectures on Human Language Technologies

professionals using their personal introspection, corpus evidence, or other means to obtain the knowledge. Collaboratively constructed resources, on the other hand, are open for every volunteer to edit, with few or no restrictions (such as registering on a website). Intuitively, the quality of the entries should be lower when laypeople are involved in the creation of a resource, but it has been shown that the collaborative process of correcting errors and extending articles (also known as the “wisdom of the crowds”; Surowiecki [2005]) can lead to results of remarkable quality [Giles, 2005]. The most prominent example is Wikipedia, the largest encyclopedia and one of the largest knowledge sources known. Although originally not meant for that purpose, it has also become a major source of knowledge for all kinds of NLP applications, many of which we will discuss in this book [Medelyan et al., 2009].

      Apart from the basic distinction with regard to the production process, LKBs exist in many flavors. Some focus on encyclopedic knowledge (Wikipedia), while others resemble language dictionaries (Wiktionary) or aim to describe the concepts used in human language and the relationships between them from a psycholinguistic (Princeton WordNet [Fellbaum, 1998a]) or a semantic (FrameNet [Ruppenhofer et al., 2010]) perspective. Another important distinction is between monolingual resources, i.e., those covering only one language, and multilingual ones, which not only feature entries in different languages but usually also provide translations. However, despite the large number of existing LKBs, the growing demand for large-scale LKBs in different languages is still not met. While Princeton WordNet has emerged as a de facto standard for English NLP, for most languages corresponding resources are either considerably smaller or missing altogether. For instance, the Open Multilingual Wordnet project lists only 25 wordnets in languages other than English, and only a few of them (like the Finnish or Polish versions) match or surpass Princeton WordNet’s size [Bond and Foster, 2013]. Multilingual efforts such as Wiktionary or OmegaWiki provide a viable option for such cases and seem especially suitable for smaller languages due to their open construction paradigm and low entry requirements [Matuschek et al., 2013], but there are still considerable gaps in coverage which the corresponding language communities are struggling to fill.

      A closely related problem is that, even if comprehensive resources are available for a specific language, there usually does not exist a single resource which works best for all application scenarios or purposes, as different LKBs cover not only different words and senses, but sometimes even completely different information types. For instance, the knowledge about verb classes (i.e., groups of verbs which share certain properties) contained in VerbNet is not covered by WordNet, although it might be useful depending on the task, for example to provide subcategorization information when parsing low frequency verbs.

      These considerations have led to the insight that, to make the best possible use of the available knowledge, the orchestrated exploitation of different LKBs is necessary. This lets us not only extend the range of covered words and senses, but more importantly, gives us the opportunity to obtain a richer knowledge representation when a particular meaning of a word is covered in more than one resource.

      Examples where such a joint usage of LKBs proved beneficial include WSD using aligned WordNet and Wikipedia in BabelNet [Navigli and Ponzetto, 2012a], semantic role labeling (SRL) using a mapping between PropBank, VerbNet and FrameNet [Palmer, 2009], and the construction of a semantic parser using a combination of FrameNet, WordNet, and VerbNet [Shi and Mihalcea, 2005]. These combined resources, known as Linked Lexical Knowledge Bases (LLKB), are the focus of this book, and we shed light on their different aspects from various angles.

       TARGET AUDIENCE AND FOCUS

      This book is intended to convey a fundamental understanding of Linked Lexical Knowledge Bases, in particular their construction and use, in the context of NLP. Our target audience is students and researchers from NLP and related fields who are interested in knowledge-based approaches. We assume only basic familiarity with NLP methods, and thus this book can be used both for self-study and for teaching at an introductory level.

      Note that the focus of this book is mostly on sense linking between general-purpose LKBs, which are most commonly used in NLP. While we acknowledge that there are many efforts to link LKBs, for instance, to ontologies or domain-specific resources, we only discuss them briefly where appropriate and provide references for readers interested in these more specific linking scenarios. The same is true for the recent efforts in creating ontologies from LKBs and formalizing the relationships between them: while we give an introduction to this topic in Section 1.3, we realize that this diverse area of research deserves a book of its own, which indeed has been published recently [Chiarcos et al., 2012]. Our attention is rather on the actual algorithmic linking process, and the benefits it brings for applications. Furthermore, we put an emphasis on monolingual linking efforts (i.e., between resources in the same language), as the vast majority of algorithms have covered this scenario in the past and cross-lingual approaches were mostly direct derivatives thereof, for instance by introducing machine translation as an intermediate component (cf. Chapter 3). Nevertheless, we recognize the increasing importance of multilingual NLP and thus provide a dedicated chapter covering applications in this area (Chapter 6).

       OUTLINE

      After providing a brief description of the typographic conventions which we applied throughout this book, we start by introducing and comparatively analyzing a selection of LKBs which have been widely used in NLP (Chapter 1). Our description of these LKBs provides a foundation for the main part of this book, where their integration into LLKBs is considered from various angles. We include expert-built LKBs, such as WordNet, as well as collaboratively constructed resources, such as Wikipedia and Wiktionary, and also cover established standards and representation formats which are relevant in this context.

      Then, in Chapter 2, we give a more formal definition of LLKBs, and also of word sense linking, which is crucial for combining different resources semantically and is thus of utmost importance. We go on to describe various LLKBs which have been suggested, putting a focus on current large-scale projects which dominate the field, but also considering smaller, more specialized initiatives which have yielded important insights and paved the way for large-scale resource integration.

      In Chapter 3, we approach the core issue of automatic word sense linking. While the notion of similar or even equivalent word senses in different resources is intuitively understandable and often (but not always) quite easily grasped by humans, it poses a complex challenge for automatic processing due to word ambiguities, different sense granularities and information types [Navigli, 2006]. First, to contextualize the challenge, we describe some related tasks in NLP and other fields, and outline how word sense linking relates to them. Then, we discuss in detail different ways to automatically create sense links between LKBs, based on textual descriptions of senses (i.e., glosses), the structure of the resources, or a combination thereof. The broader context of LLKBs lies of course not in the mere linking of resources for its own sake, but in the potential it holds for NLP applications.
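      To make the idea of gloss-based sense linking more concrete, the following is a minimal sketch in Python, not the method of any particular system discussed in this book. It assumes that each resource can be reduced to a mapping from a sense identifier to a gloss, represents glosses as bags of words, and links each source sense to the most similar target sense if the cosine similarity reaches a threshold. The sense identifiers, the glosses, the function names, and the threshold value of 0.3 are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(gloss):
    """Lowercase a gloss and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", gloss.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_senses(source, target, threshold=0.3):
    """Link each source sense to its most similar target sense.

    A link is only created if the best similarity reaches the
    (hypothetical) threshold; otherwise the sense stays unlinked.
    """
    links = {}
    for source_id, source_gloss in source.items():
        source_vec = Counter(tokenize(source_gloss))
        best_id, best_score = None, 0.0
        for target_id, target_gloss in target.items():
            score = cosine(source_vec, Counter(tokenize(target_gloss)))
            if score > best_score:
                best_id, best_score = target_id, score
        if best_id is not None and best_score >= threshold:
            links[source_id] = best_id
    return links


# Toy data: identifiers and glosses are invented, not taken from a real LKB.
resource_a = {
    "a:bank#1": "a financial institution that accepts deposits and lends money",
    "a:bank#2": "sloping land beside a river or lake",
}
resource_b = {
    "b:bank_(finance)": "an institution that accepts deposits of money and makes loans",
    "b:bank_(geography)": "the land alongside a river",
    "b:bank_(aviation)": "a turn in which an aircraft tilts its wings",
}

links = link_senses(resource_a, resource_b)
for source_id, target_id in links.items():
    print(source_id, "->", target_id)
```

      A sketch like this only uses lexical overlap between glosses. It ignores the structure of the resources and does no stop word filtering or lemmatization, so it should be read as an illustration of the general principle rather than as a competitive linking approach.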

      Thus, in the following chapters, we present a selection of methods and applications where the use of LLKBs leads to particular benefits for NLP. In Chapter 4, we describe how the disambiguation of textual units benefits from the richer structure and combined knowledge, and also how the clustering of fine-grained word senses by exploiting 1:n links improves WSD accuracy. Building on that, we present more advanced disambiguation techniques in Chapter 5, including a discussion of using LLKBs for distant supervision and in neural vector space models, which are two
