Linked Lexical Knowledge Bases. Iryna Gurevych
• Predicate argument structure information—For predicate-like words, such as verbs, this refers to a definition of the semantic predicate and information on the semantic arguments, including:
– their semantic role according to an inventory of semantic roles given in the context of a particular linguistic theory. There is no standard inventory of semantic roles: some linguistic theories assume small sets of about 40 roles, while others specify very large sets of several hundred roles. Examples of typical semantic roles are Agent or Patient; and
– selectional preference information, which specifies the preferred semantic category of an argument, e.g., whether it is a human or an artifact.
For example, the sense change (“cause to change”) corresponds to a semantic predicate which can be described in natural language as “an Agent causes an Entity to change;” Agent and Entity are semantic roles of this predicate: She[Agent] changed the rules[Entity]; the preferred semantic category of Agent is human.
• Related forms—Word forms that are morphologically related, such as compounds or verbs derived from nouns; for example, the verb buy (“purchase”) is derivationally related to the noun buy, while on the other hand buy (“accept as true” e.g., I can’t buy this story) is not derivationally related to the noun buy.
• Equivalents—Translations of the sense in other languages; for example, kaufen is the German translation of buy (“purchase”), while abkaufen is the German translation of buy (“accept as true”).
• Sense links—Mappings of senses to equivalent senses in other LKBs; for example, the sense change (Cause_change) in FrameNet can be linked to the equivalent sense change (“cause to change”) in WordNet.
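The information types in the list above can be pictured as fields of a single sense record. The following is a minimal Python sketch under stated assumptions, not the schema of any particular LKB: the class and field names (`Sense`, `SemanticArgument`, `sense_links`, and so on) are hypothetical, and the values are taken from the examples in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SemanticArgument:
    """One argument of a semantic predicate (hypothetical structure)."""
    role: str                                 # semantic role, e.g. "Agent"
    preferred_category: Optional[str] = None  # selectional preference, if known


@dataclass
class Sense:
    """A word sense carrying the information types listed above."""
    lemma: str
    gloss: str
    arguments: List[SemanticArgument] = field(default_factory=list)
    related_forms: List[str] = field(default_factory=list)
    equivalents: Dict[str, str] = field(default_factory=dict)        # language code -> translation
    sense_links: List[Tuple[str, str]] = field(default_factory=list)  # (other LKB, sense label)


# change ("cause to change"): an Agent causes an Entity to change.
change = Sense(
    lemma="change",
    gloss="cause to change",
    arguments=[
        SemanticArgument("Agent", preferred_category="human"),
        SemanticArgument("Entity"),
    ],
    sense_links=[("FrameNet", "Cause_change")],
)

# buy ("purchase"): related to the noun buy, translated as kaufen in German.
buy = Sense(
    lemma="buy",
    gloss="purchase",
    related_forms=["buy (noun)"],
    equivalents={"de": "kaufen"},
)


def roles_of(sense: Sense) -> List[str]:
    """Return the semantic role names of a sense's predicate, in order."""
    return [argument.role for argument in sense.arguments]
```

Calling `roles_of(change)` returns the two roles named in the example sentence, Agent and Entity. Fields that a given LKB does not provide simply stay empty in this sketch.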
There are different ways to organize an LKB, for example, by grouping synonymous senses, or by grouping senses with the same lemma. The latter is the traditional headword-based organization used in dictionaries [Atkins and Rundell, 2008], where an LKB consists of lexical entries which group senses under a common headword (the lemma).
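The two ways of organizing senses can be contrasted with a small Python sketch. It is an illustration only: the toy sense list, the function names, and the use of the gloss as a stand-in for "same meaning" are assumptions made for this example, not part of any LKB's actual design.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy senses as (lemma, gloss) pairs. Treating "purchase" as a synonym of
# buy ("purchase") is an assumption made for illustration.
SENSES: List[Tuple[str, str]] = [
    ("buy", "purchase"),
    ("buy", "accept as true"),
    ("purchase", "purchase"),
]


def group_by_lemma(senses: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Headword-based organization: one lexical entry per lemma."""
    entries: Dict[str, List[str]] = defaultdict(list)
    for lemma, gloss in senses:
        entries[lemma].append(gloss)
    return dict(entries)


def group_by_meaning(senses: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Synonym-based organization: one group per meaning.

    The gloss string serves as a simplified identifier of the meaning here.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for lemma, gloss in senses:
        groups[gloss].append(lemma)
    return dict(groups)
```

With this data, `group_by_lemma(SENSES)` places both senses of buy in one entry, whereas `group_by_meaning(SENSES)` places buy and purchase together under the shared meaning and keeps buy (“accept as true”) in a group of its own.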
There is a large number of so-called Machine-readable Dictionaries (MRDs), mostly digitized versions of traditional print dictionaries [Lew, 2011, Soanes and Stevenson, 2003], although some MRDs are available only in digital form, such as DANTE [Kilgarriff, 2010] or DWDS for German [Klein and Geyken, 2010]. We will not include them in our overview for the following reasons: MRDs have traditionally been built by lexicographers and are targeted toward human use, rather than toward use by automatic processing components in NLP. While MRDs provide information useful in NLP, such as sense definitions, sense examples, and grammatical information (e.g., about syntactic behavior), the representation of this information in MRDs usually lacks a strict, formal structure, and thus the information is often ambiguous. Although such ambiguities can easily be resolved by humans, they are a source of noise when the dictionary entries are processed fully automatically.
Our definition of LKBs also covers domain-specific terminology resources (e.g., the Unified Medical Language System (UMLS) metathesaurus of medical terms [Bodenreider, 2004]) that provide domain-specific terms and sense relations between them. However, we do not include these domain-specific resources in our overview, because we used general language LKBs to develop and evaluate the linking algorithms presented in Chapter 3.
1.1 EXPERT-BUILT LEXICAL KNOWLEDGE BASES
Expert-built LKBs, in our definition of this term, are resources which are designed, created and edited by a group of designated experts, e.g., (computational) lexicographers, (computational) linguists, or psycho-linguists. While the editorial process may be influenced from the outside (e.g., via suggestions provided by users or readers), there is usually no direct means of public participation. This form of resource creation has been predominant since the earliest days of lexicography (or, more broadly, creation of language resources). While the reliance on expert knowledge produces high quality resources, an obvious disadvantage is the slow production cycle: for all of the resources discussed in this section, it usually takes months (if not years) until a new version is published, while at the same time most of the information remains unchanged. This is due to the extensive effort needed for the creation of a resource of considerable size, in most cases provided by a very small group of people. Nevertheless, these resources play a major role in NLP. One reason is that until recent years there were no real alternatives available. Another is that some of these LKBs cover aspects of language which are rather specific and not easily accessible to lay editors. We will present the most pertinent examples in this section.
1.1.1 WORDNETS
Wordnets define senses primarily by their relations to other senses, most notably the synonymy relation that is used to group synonymous senses into so-called synsets. Accordingly, synsets are the main organizational units in wordnets. In addition to synonymy, wordnets provide a large variety of additional sense relations. Most of the sense relations are defined on the synset level, i.e., between synsets, such as hypernymy or meronymy. Other sense relations, such as antonymy, are defined between individual senses, rather than between synsets. For example, while evil and unworthy are synonymous (“morally reprehensible” according to WordNet), their antonyms are different; good is the antonym of evil and worthy is the antonym of unworthy.
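The distinction between synset-level and sense-level relations can be made concrete with a short Python sketch. It is a simplified illustration, not WordNet's actual data model: the `Synset` class, the `ANTONYMS` table, and the `antonym_of` helper are hypothetical names, and only the evil/unworthy example from the text is encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Synset:
    """A group of synonymous senses (hypothetical, simplified structure)."""
    gloss: str
    lemmas: List[str]
    # Synset-level relations such as hypernymy would point to other synsets;
    # the list is left empty because the text gives no example for this synset.
    hypernyms: List["Synset"] = field(default_factory=list)


# evil and unworthy are synonymous, so they share one synset.
reprehensible = Synset(gloss="morally reprehensible", lemmas=["evil", "unworthy"])

# Sense-level relation: antonymy is stored per individual sense, identified
# here by (lemma, synset gloss), not once for the whole synset.
ANTONYMS: Dict[Tuple[str, str], str] = {
    ("evil", "morally reprehensible"): "good",
    ("unworthy", "morally reprehensible"): "worthy",
}


def antonym_of(lemma: str, synset: Synset) -> Optional[str]:
    """Look up the antonym of one sense, i.e., one lemma within a synset."""
    if lemma not in synset.lemmas:
        return None
    return ANTONYMS.get((lemma, synset.gloss))
```

Because the lookup key includes the lemma, the two members of the same synset resolve to different antonyms, which would be impossible if antonymy were attached to the synset as a whole.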
The Princeton WordNet for English [Fellbaum, 1998a] was the first such wordnet. It has become the most popular wordnet and is the most widely used LKB today. The creation of the Princeton WordNet is psycholinguistically motivated, i.e., it aims to represent real-world concepts and relations between them as they are commonly perceived. Version 3.0 contains 117,659 synsets. Apart from its richness in sense relations, WordNet also contains coarse information about the syntactic behavior of verbs in the form of sentence frames (e.g., Somebody ----s something).
There are various works based on the Princeton WordNet, such as the eXtended WordNet [Mihalcea and Moldovan, 2001a], where all open class words in the sense definitions have been annotated with their WordNet sense to capture further relations between senses, WordNet Domains [Bentivogli et al., 2004], which includes domain labels for senses, or SentiWordNet [Baccianella et al., 2010], which assigns sentiment scores to each synset of WordNet.
Wordnets in Other Languages. The Princeton WordNet for English inspired the creation of wordnets in many other languages worldwide, and many of them also provide a linking of their senses to the Princeton WordNet. Examples include the Italian wordnet [Toral et al., 2010a], the Japanese wordnet [Isahara et al.], or the German wordnet GermaNet [Hamp and Feldweg, 1997].
Often, wordnets in other languages have particular characteristics that distinguish them from the Princeton WordNet. GermaNet, for example, contains around 70,000 synsets in version 7.0. It originally contained very few sense definitions but, unlike most other wordnets, provides detailed information on the syntactic behavior of verbs. For each verb sense, it lists possible subcategorization (subcat) frames, distinguishing more than 200 different types.
It is important to point out, however, that in general, the Princeton WordNet provides richer information than the other wordnets. For example, it includes not only derivational morphological information, but also inflectional morphological analysis within its associated tools. It also provides an ordering