
to modeling fine-grained lexical knowledge such as distinct word senses or concrete lexical semantic relationships. Human encoding, on the other hand, provides more precise knowledge at the fine-grained level. The ongoing popular use of LKBs, and particularly of WordNet, seems to indicate that they still provide substantial complementary information relative to corpus-based methods (see Shwartz et al. [2015] for a concrete evaluation showing the complementary behavior of corpus-based word embeddings and information from multiple LKBs).

      While WordNet has been by far the most widely used lexical resource, it does not provide the full spectrum of needed lexical knowledge, which brings us to the theme of the current book. As reviewed in Chapter 2, additional lexical information has been encoded in quite a few LKBs, either by experts or by web communities through collaborative efforts. In particular, collaborative resources provide the opportunity to obtain much larger and more frequently updated resources than is possible with expert work. Knowledge resources like Wikipedia1 or Wikidata2 include vast lexical information about individual entities and domain-specific terminology across many domains, which falls beyond the scope of WordNet. Hence, it would be ideal for NLP technology to utilize in an integrated manner the union of information available in a multitude of lexical resources. As an illustrative example, consider an application setting, like a question answering scenario, which requires knowing that Deep Purple was a group of people. We may find in Wikipedia that it was a “band,” map this term to its correct sense in WordNet, and then follow a hypernymy chain to “organization,” whose definition includes “a group of people.”
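      To make this walk concrete, below is a minimal sketch of such a hypernymy chain lookup using NLTK's WordNet interface. The helper name find_hypernym and the name-prefix stopping criterion are illustrative assumptions, not a method prescribed by the book.

```python
# A minimal sketch of the "band" -> "organization" lookup described above,
# using NLTK's WordNet interface (requires: nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def find_hypernym(lemma, target_prefix):
    """Search the hypernym closure of each noun sense of `lemma` for a
    synset whose name starts with `target_prefix`."""
    for sense in wn.synsets(lemma, pos=wn.NOUN):
        for hyper in sense.closure(lambda s: s.hypernyms()):
            if hyper.name().startswith(target_prefix):
                return sense, hyper
    return None

hit = find_hypernym("band", "organization")
if hit:
    sense, hyper = hit
    print(sense.name(), "->", hyper.name())
    print(hyper.definition())  # a gloss along the lines of "a group of people..."
```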

      As hinted in the above example, to allow such resource integration we need effective methods for linking, or aligning, the word senses or concepts encoded in various resources. Accordingly, the main technical focus of this book is on existing resource integration efforts, resource linking algorithms, and the utility of such algorithms within disambiguation tasks. Hence, this book should be of value first and foremost to researchers interested in creating or linking LKBs, as well as to developers of NLP algorithms and applications who would like to leverage linked lexical resources. An important aspect is the development and use of linked lexical resources in multiple languages, addressed in Chapter 7.

      Looking forward, perhaps the most interesting research prospect for linked lexical knowledge bases is their integration with corpus-based machine learning approaches. A relatively simple form of combining the information in LKBs with corpus-based information is to use the former, via distant supervision, to create training data for the latter (discussed in Section 6.2). A more fundamental research direction is to create a unified knowledge representation framework, which directly integrates the human-encoded information in LKBs with information obtained by corpus-based methods. A promising framework for such integrated representation has emerged recently, under the “embedding” paradigm, where dense continuous vectors are used to represent linguistic objects, as reviewed in Section 6.3. Initially, such representations, i.e., embeddings, were created separately from corpus data, based on corpus co-occurrences, and from knowledge bases, leveraging their rich internal structure. Further research suggested methods for creating unified representations, based on hybrid objective functions that consider both corpus and knowledge base structure. While this research line is still in its initial phases, it has the potential to truly integrate corpus-based and human-encoded knowledge, and thus unify these two research endeavors, which have mostly been pursued separately in the past. From this perspective, and assuming that human-encoded lexical knowledge can provide useful additional information on top of corpus-based information, the current book should be useful for any researcher who aims to advance the state of the art in lexical semantics.
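      One simple, well-known technique in this family is retrofitting [Faruqui et al., 2015], which pulls corpus-trained vectors toward the average of their neighbors in a lexical knowledge base. The sketch below is a minimal illustration under that assumption; the toy vectors and the tiny relation graph are invented for the example, and the book's Section 6.3 surveys this space more broadly.

```python
# A minimal sketch of retrofitting: corpus-trained vectors are iteratively
# pulled toward the average of their LKB neighbors, balancing fidelity to
# the corpus vector (alpha) against agreement with neighbors (beta).
import numpy as np

def retrofit(embeddings, neighbors, alpha=1.0, beta=1.0, iters=10):
    """embeddings: dict word -> corpus-trained vector (np.ndarray)
    neighbors:  dict word -> list of LKB-related words"""
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iters):
        for word, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            # Closed-form update: weighted average of the original corpus
            # vector (weight alpha) and current neighbor vectors (weight beta).
            total = alpha * embeddings[word] + beta * sum(new[n] for n in nbrs)
            new[word] = total / (alpha + beta * len(nbrs))
    return new

# Toy example: "car" and "automobile" are related in the LKB, so their
# retrofitted vectors move closer together.
vecs = {"car": np.array([1.0, 0.0]), "automobile": np.array([0.6, 0.8])}
graph = {"car": ["automobile"], "automobile": ["car"]}
print(retrofit(vecs, graph)["car"])
```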

      While considering the integration of implicit corpus-based and explicit human-encoded information, we may notice that the joint embedding approach goes the “implicit way.” While joint embeddings do encode information coming from both types of resources, this information is encoded in opaque continuous vectors, which are not immediately interpretable, thus losing the transparency of the original symbolically encoded human knowledge. Indeed, developing methods for interpreting embedding-based representations is an actively pursued theme, but it remains to be seen whether such attempts will succeed in preserving the interpretability of LKB information. Alternatively, one might imagine developing integrated corpus-based and knowledge-based representations that inherently involve explicit symbolic representations, even though, currently, this might be seen as wishful thinking.

      Finally, one would hope that the current book, and work on new lexical representations in general, will encourage researchers to better connect the development of knowledge resources with generic aspects of their utility for NLP tasks. Consider, for example, the common use of the lexical semantic relationships in WordNet for lexical inference. Typically, WordNet relations are utilized in an application to infer the meaning of one word from another in order to bridge lexical gaps, such as when different words are used in a question and in an answer passage. While this type of inference has been applied in numerous works, surprisingly there are no well-defined methods that indicate how to optimally exploit WordNet for lexical inference. Instead, each work applies its own heuristics with respect to the types of WordNet links that should be followed, the length of link chains, the senses to be considered, etc. Given this state of affairs, it is hard for LKB developers to assess which components of the knowledge and representations that they create are truly useful. Similar challenges are faced when trying to assess the utility of vector-based representations.3
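      To make the arbitrariness of these heuristics concrete, here is a minimal sketch of one such ad hoc lexical inference rule over WordNet, written with NLTK. The restriction to synonymy and hypernymy links and the fixed depth cutoff are exactly the kind of per-work choices described above; the function name is invented for illustration.

```python
# One ad hoc heuristic of the kind described above: `premise` is taken to
# lexically entail `hypothesis` if they share a synset or if a hypernym
# chain of bounded length connects them. Both the link types followed and
# the depth cutoff are arbitrary choices, tuned differently in each work.
from nltk.corpus import wordnet as wn

def lexically_entails(premise, hypothesis, max_depth=3):
    targets = set(wn.synsets(hypothesis))
    frontier = set(wn.synsets(premise))
    for _ in range(max_depth + 1):  # check depths 0..max_depth
        if frontier & targets:
            return True
        frontier = {h for s in frontier for h in s.hypernyms()}
    return False

# Expected True via a chain like band -> musical organization -> organization,
# assuming the standard WordNet noun hierarchy.
print(lexically_entails("band", "organization"))
```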

      Eventually, one might expect that generic methods for utilizing and assessing lexical knowledge representations would guide their development and reveal their optimal form, based on either implicit or explicit representations, or both.

      Ido Dagan

      Department of Computer Science

      Bar-Ilan University, Israel

       1 https://www.wikipedia.org

       2 https://www.wikidata.org

       3 One effort to address these challenges is the ACL 2016 workshop on Evaluating Vector Space Representations for NLP, whose mission statement is “To develop new and improved ways of measuring the quality or understanding the properties of vector-space representations in NLP.” https://sites.google.com/site/repevalacl16/.

       Preface

       MOTIVATION

      Lexical Knowledge Bases (LKBs) are indispensable in many areas of natural language processing (NLP). They strive to encode human knowledge of language in machine-readable form, and as such they are required as a reference when machines are supposed to interpret natural language in accordance with human perception. Examples of such tasks are word sense disambiguation (WSD) and information retrieval (IR). The aim of WSD is to determine the correct meaning of ambiguous words in context, and formalizing this task requires a so-called sense inventory, i.e., a resource encoding the different meanings a word can express. In IR, the goal is to retrieve, given a user query formulating a specific information need, the documents from a collection which fulfill this need best. Here, knowledge is also necessary to correctly interpret short and often ambiguous queries, and to relate them to the set of documents.
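      As a minimal illustration of WSD against a sense inventory, the sketch below uses WordNet as the inventory and the simplified Lesk algorithm shipped with NLTK; the example sentence and target word are invented for the example, and Lesk is only one of many possible disambiguation strategies.

```python
# A minimal WSD illustration: WordNet serves as the sense inventory, and
# NLTK's simplified Lesk picks the synset whose gloss overlaps most with
# the surrounding context words.
from nltk.wsd import lesk

context = "I went to the bank to deposit my paycheck".split()
sense = lesk(context, "bank", pos="n")  # choose among WordNet's noun senses
print(sense.name())        # the selected synset identifier
print(sense.definition())  # its gloss from the sense inventory
```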

      Nowadays, LKBs exist in many variations. For instance, the META-SHARE repository4 lists over 1,000 different lexical resources, and the LRE Map5 contains more than 3,900 resources which have been proposed as knowledge sources for natural language processing systems. A main distinction, which is also made in this book, is between expert-built and collaboratively constructed resources. While the distinction is not always clean-cut, the former are generally resources which are created by a limited set of expert editors
