Linked Lexical Knowledge Bases. Iryna Gurevych



Linked Lexical Knowledge Bases - Iryna Gurevych Synthesis Lectures on Human Language Technologies


COLLABORATIVELY CONSTRUCTED KNOWLEDGE BASES

      More recently, the rapid development of Web technologies and especially collaborative participation channels (often labeled “Web 2.0”) has offered new possibilities for the construction of language resources. The basic idea is that, instead of a small group of experts, a community of users (“crowd”) collaboratively gathers and edits the lexical information in an open and equitable process. The resulting knowledge is in turn also free for everyone to use, adapt, and extend. This open approach has turned out to be very promising for handling the enormous effort of building language resources, as a large community can quickly adapt to new language phenomena like neologisms while at the same time maintaining high quality through continuous revision—a phenomenon which has become known as the “wisdom of crowds” [Surowiecki, 2005]. The approach also seems to be suitable for multilingual resources, as users speaking any language and from any culture can easily contribute. This is very helpful for smaller, typically resource-poor languages for which expert-built resources are small or not available at all.

      Wikipedia is a collaboratively constructed online encyclopedia and one of the largest freely available knowledge sources. It has long surpassed traditional printed encyclopedias in size, while maintaining a comparable quality [Giles, 2005]. The current English version contains around 4,700,000 articles and is by far the largest one, while there are many language editions of significant size. Some, like the German or French editions, also contain more than 1,000,000 articles, each of which usually describes a particular concept.

      Although Wikipedia has not been designed as a sense inventory, we can interpret the pairing of an article title and the concept described in the article text as a sense. This interpretation is in accordance with the disambiguation provided in Wikipedia, either as part of the title or on separate disambiguation pages. Examples of the former are the articles for Java, whose different meanings are marked by “bracketed disambiguations” in the article title, such as Java (programming language) and Java (town). An example of the latter is the dedicated disambiguation page for Java, which explicitly lists all Java senses contained in Wikipedia.
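This sense interpretation can be sketched as a small helper that splits a bracketed disambiguation off the article title; the function name and title format convention are illustrative, not part of any Wikipedia API.

```python
import re

# Treat an article title as a (lemma, disambiguation marker) pair,
# so that each article corresponds to one word sense.
TITLE_PATTERN = re.compile(r"^(?P<lemma>.+?)\s*\((?P<marker>[^)]+)\)$")

def title_to_sense(title):
    """Split a Wikipedia article title into lemma and bracketed marker.

    Titles without a bracketed disambiguation yield a marker of None.
    """
    match = TITLE_PATTERN.match(title)
    if match:
        return match.group("lemma"), match.group("marker")
    return title, None

# title_to_sense("Java (programming language)") -> ("Java", "programming language")
# title_to_sense("Java") -> ("Java", None)
```

Titles without brackets, such as the disambiguation page itself, are simply treated as unmarked lemmas.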

      Due to its focus on encyclopedic knowledge, Wikipedia almost exclusively contains nouns. As with word senses, the interpretation of Wikipedia as a LKB gives rise to the induction of further lexical information types, such as sense relations or translations. Since the original purpose of Wikipedia is not to serve as a LKB, this induction process might also lead to inaccurate lexical information. For instance, the links to corresponding articles in other languages provided for Wikipedia articles can be used to derive translations (i.e., equivalents) of an article “sense” into other languages. An example where this leads to an inaccurate translation is the English article Vanilla extract, which links to a subsection titled Vanilleextrakt within the German article Vanille (Gewürz); according to our lexical interpretation of Wikipedia, this leads to the inaccurate German equivalent Vanille (Gewürz) for Vanilla extract.
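The source of this inaccuracy can be made concrete in a small sketch. The link table below is a hypothetical simplification of Wikipedia's interlanguage links; the point is that naively taking the linked article title as the equivalent drops the section anchor and yields exactly the mistranslation described above.

```python
# Hypothetical interlanguage-link table: article -> {language: link target}.
# A "#" separates an article title from a section anchor within it.
interlanguage_links = {
    "Vanilla extract": {"de": "Vanille (Gewürz)#Vanilleextrakt"},
}

def derive_equivalents(article, links):
    """Naively derive translation equivalents from interlanguage links.

    Keeping only the article title discards the section anchor, so a
    link into a subsection produces the whole article as "equivalent".
    """
    equivalents = {}
    for lang, target in links.get(article, {}).items():
        equivalents[lang] = target.split("#")[0]  # article title only
    return equivalents

# derive_equivalents("Vanilla extract", interlanguage_links)
# -> {"de": "Vanille (Gewürz)"}  -- the inaccurate equivalent
```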

      Nevertheless, Wikipedia is commonly used as a lexical resource in computational linguistics where it was introduced as such by Zesch et al. [2007], and has subsequently been used for knowledge mining [Erdmann et al., 2009, Medelyan et al., 2009] and various other tasks [Gurevych and Kim, 2012].

      Information Types We can derive the following lexical information types from Wikipedia.

      • Sense definition—While by design one article describes one particular concept, the first paragraph of an article usually gives a concise summary of the concept, which can therefore fulfill the role of a sense definition for NLP purposes.

      • Sense examples—While usage examples are not explicitly encoded in Wikipedia, they are also inferable by considering the Wikipedia link structure. If a term is linked within an article, the surrounding sentence can be considered as a usage example for the target concept of the link.

      • Sense relations—Related articles, i.e., senses, are connected via hyperlinks within the article text. However, since the type of the relation is usually missing, these hyperlinks cannot be considered full-fledged sense relations. Nevertheless, they express a certain degree of semantic relatedness. The same observation holds for the Wikipedia category structure which links articles belonging to particular domains.

      • Equivalents—The different language editions of Wikipedia are interlinked at the article level—the article titles in other languages can thus be used as translation equivalents.
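The four information types above can be sketched as a single extraction step over a simplified article representation. The dictionary layout and field names here are hypothetical; a real implementation would parse the article markup and link structure.

```python
# Simplified, hypothetical representation of a parsed Wikipedia article.
article = {
    "title": "Java (programming language)",
    "paragraphs": [
        "Java is a high-level programming language.",
        "It was originally developed at Sun Microsystems.",
    ],
    # link target -> sentence in which the link occurs
    "links": {
        "Programming language": "Java is a high-level programming language.",
    },
    # language code -> title of the interlinked article
    "langlinks": {"de": "Java (Programmiersprache)"},
}

def lexical_info(article):
    """Derive the four lexical information types discussed above."""
    return {
        "definition": article["paragraphs"][0],       # first paragraph
        "examples": list(article["links"].values()),  # link-context sentences
        "related": list(article["links"].keys()),     # untyped relations
        "equivalents": article["langlinks"],          # per-language titles
    }
```

Note that the `related` entries carry no relation type, mirroring the limitation of Wikipedia hyperlinks discussed above.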

      Related Projects As Wikipedia has become one of the largest and most widely used knowledge sources, there have been numerous efforts to make it more accessible for automatic processing. These include projects such as YAGO [Suchanek et al., 2007], DBPedia [Bizer et al., 2009], WikiNet [Nastase et al., 2010], and MENTA [de Melo and Weikum, 2010]. Most of them aim at deriving a concept network from Wikipedia (“ontologizing”) and making it available for Semantic Web applications. WikiData—a project directly rooted in Wikimedia—has similar goals, but within the framework given by Wikipedia. The goal here is to provide a language-independent repository of structured world knowledge which all language editions can easily integrate.

      These related projects basically contain the same knowledge as Wikipedia, only in a different representation format (e.g., suitable for Semantic Web applications); hence we will not discuss them further in this chapter. However, some of the Wikipedia derivatives have reached a wide audience in different communities, including NLP (e.g., DBPedia), and have also been used in different linking efforts, especially in the domain of ontology construction. We will describe corresponding efforts in Chapter 2.

      Wiktionary is a dictionary “side project” of Wikipedia that was created to represent specific lexicographic knowledge which is not well suited for an encyclopedia, e.g., lexical knowledge about verbs and adjectives. Wiktionary is available in over 500 languages, and currently the English edition of Wiktionary contains almost 4,000,000 lexical entry pages, while many other language editions achieve a considerable size of over 100,000 entries. Meyer and Gurevych [2012b] found that the collaborative construction approach of Wiktionary yields language versions covering the majority of language families and regions of the world, and that it especially covers a vast amount of domain-specific descriptions not found in wordnets for these languages.

      For each lexeme, multiple senses can be encoded, and these are usually described by glosses. Wiktionary contains hyperlinks which lead to semantically related lexemes, such as synonyms, hypernyms, or meronyms, and provides a variety of other information types such as etymology or translations to other languages. However, the link targets are not disambiguated in all language editions; in the English edition, for example, the links merely lead to the pages for the lexical entries, which is problematic for NLP applications, as we will see later on. The ambiguity of the links is due to the fact that Wiktionary has been primarily designed to be used by humans rather than machines. The entries are thus formatted for easy perception using appropriate font sizes and bold, italic, or colored text styles. In contrast, for machines, data needs to be available in a structured and unambiguous manner in order to become directly accessible. For instance, an easily accessible data structure for machines would be a list of all translations of a given sense, and encoding the translations by their corresponding sense identifiers in the target language LKBs would make the representation unambiguous.
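The unambiguous representation suggested above can be sketched as follows. All sense identifiers here are invented for illustration; the point is that each translation is resolved to a sense in the target-language LKB rather than to an ambiguous lemma page.

```python
# Hypothetical sense-keyed translation table: each translation points to
# a sense identifier in the target-language LKB, not just a lemma.
translations = {
    "en:boat#1": {               # sense 1 of English "boat"
        "de": ["de:Boot#1"],
        "fr": ["fr:bateau#1"],
    },
}

def translations_for(sense_id, lang):
    """Look up the disambiguated translations of a sense into one language."""
    return translations.get(sense_id, {}).get(lang, [])

# translations_for("en:boat#1", "de") -> ["de:Boot#1"]
```

A lemma-linked representation, as in the English Wiktionary, would instead map `"en:boat#1"` to the string `"Boot"`, leaving the target sense unresolved.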

      This kind of explicit and unambiguous structure does not exist in Wiktionary, but needs to be inferred from the wiki markup. Although there are guidelines on how to properly structure a Wiktionary entry, Wiktionary editors are permitted to choose from multiple variants or to deviate
