Natural Language Processing for the Semantic Web. Diana Maynard
Text can contain several types of information of interest. Proper names, also called named entities (NEs), such as persons, locations, and organizations, are often regarded as its key components. Along with proper names, temporal expressions, such as dates and times, are also often considered named entities. Figure 1.1 shows some simple named entities in a sentence. Named entities are connected by means of relations. Furthermore, there can be relations between relations: for example, the relation denoting that someone is CEO of a company is connected to the relation that someone is an employee of a company by means of a sub-property relation, since a CEO is a kind of employee. A more complex type of information is the event, which can be seen as a group of relations grounded in time. Events usually have participants, a start and an end date, and a location, though some of this information may be only implicit. An example of this is the opening of a restaurant. Figure 1.2 shows how entities are connected to form relations, which form events when grounded in time.
Figure 1.1: Examples of named entities.
Figure 1.2: Examples of relations and events.
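The layering of entities, relations, and events described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the relation names `ceoOf` and `employeeOf`, the `SUB_PROPERTY_OF` table, and the example person and company are all invented here and do not come from any particular ontology or toolkit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # e.g. "Person", "Organization", "Location", "Date"


@dataclass(frozen=True)
class Relation:
    name: str  # e.g. "ceoOf" (hypothetical relation name)
    subject: Entity
    object: Entity


@dataclass(frozen=True)
class Event:
    name: str
    relations: tuple                  # the relations the event groups together
    start: Optional[str] = None       # may be only implicit in the text
    end: Optional[str] = None
    location: Optional[Entity] = None


# Hypothetical sub-property table: each relation maps to its more general parent.
SUB_PROPERTY_OF = {"ceoOf": "employeeOf"}


def implied_relations(relation):
    """Return the relation plus every more general relation it entails."""
    result = [relation]
    name = relation.name
    while name in SUB_PROPERTY_OF:
        name = SUB_PROPERTY_OF[name]
        result.append(Relation(name, relation.subject, relation.object))
    return result


# Invented example data, used only to exercise the sketch.
person = Entity("Jane Doe", "Person")
company = Entity("Example Corp", "Organization")
ceo = Relation("ceoOf", person, company)

print([r.name for r in implied_relations(ceo)])
```

Following the sub-property link is what lets a system conclude that the CEO is also an employee, even though the text never says so.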
Information extraction is difficult because there are many ways of expressing the same facts:
• BNC Holdings Inc. named Ms. G. Torretta as its new chairman.
• Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
• Ms. Gina Torretta took the helm at BNC Holdings Inc.
Furthermore, information may need to be combined across several sentences, which may not even be consecutive:
• After a long boardroom struggle, Mr. Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms. Torretta.
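To make the difficulty concrete, the sketch below maps the three single-sentence phrasings above onto one relation using hand-written patterns. It is a deliberately naive illustration, not a recommended method: the patterns, the relation label `headOf`, and the function name are assumptions made here, and each new phrasing would need a new pattern.

```python
import re

SENTENCES = [
    "BNC Holdings Inc. named Ms. G. Torretta as its new chairman.",
    "Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.",
    "Ms. Gina Torretta took the helm at BNC Holdings Inc.",
]

# One hand-written pattern per phrasing (illustrative, far from exhaustive).
PATTERNS = [
    re.compile(r"^(?P<org>.+?) named (?P<person>.+?) as its new \w+\.$"),
    re.compile(r"^.+? was succeeded by (?P<person>.+?) as \w+ of (?P<org>.+)$"),
    re.compile(r"^(?P<person>.+?) took the helm at (?P<org>.+)$"),
]

# Courtesy titles are stripped so that name variants are easier to compare.
TITLE = re.compile(r"^(?:Ms|Mr|Mrs|Dr)\.\s+")


def extract_head_of(sentence):
    """Return a (person, "headOf", organization) triple, or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.match(sentence)
        if match:
            person = TITLE.sub("", match.group("person"))
            return (person, "headOf", match.group("org"))
    return None


triples = [extract_head_of(s) for s in SENTENCES]
for triple in triples:
    print(triple)
```

Even when all three patterns fire, the first sentence yields "G. Torretta" while the others yield "Gina Torretta", so a further step is needed to decide that these are the same person. The two-sentence example is out of reach for this approach altogether, since "He" must first be resolved to Mr. Andrews.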
Information extraction typically consists of a sequence of tasks, comprising:
1. linguistic pre-processing (described in Chapter 2);
2. named entity recognition (described in Chapter 3);
3. relation and/or event extraction (described in Chapter 4).
Named entity recognition (NER) is the task of recognizing that a word or a sequence of words is a proper name. It is often solved jointly with the task of assigning types to named entities, such as Person, Location, or Organization, which is known as named entity classification (NEC). If the tasks are performed at the same time, this is referred to as NERC. NERC can be either an annotation task, i.e., annotating a text with NEs, or a task of populating a knowledge base with these NEs. When the named entities are not simply a flat structure, but are linked to a corresponding entity in an ontology, this is known as semantic annotation or named entity linking (NEL). Semantic annotation is much more powerful than flat NE recognition, because it enables inferences and generalizations to be made, as the linking of information provides access to knowledge not explicit in the text. When semantic annotation is part of the process, the information extraction task is often referred to as Ontology-Based Information Extraction (OBIE) or Ontology Guided Information Extraction (see Chapter 5). Closely associated with this is the process of ontology learning and population (OLP), as described in Chapter 6. Information extraction tasks are also a prerequisite for many opinion mining tasks, especially where these require the identification of relations between opinions and their targets, and where they are based on ontologies, as explained in Chapter 7.
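The difference between recognition, classification, and linking can be shown with a toy gazetteer lookup. This is a minimal sketch under loud assumptions: the gazetteer contents, the `ex:` identifiers, the `annotate` function, and the second sentence of the sample text are invented for illustration, and real systems do not rely on exact string lookup alone.

```python
# Hypothetical gazetteer: surface form -> (entity type, ontology identifier).
GAZETTEER = {
    "Gina Torretta": ("Person", "ex:Gina_Torretta"),
    "Ms. Torretta": ("Person", "ex:Gina_Torretta"),
    "BNC Holdings Inc.": ("Organization", "ex:BNC_Holdings"),
}


def annotate(text):
    """Return one annotation per gazetteer match, sorted by start offset.

    Finding the span is the recognition step (NER), "type" is the
    classification step (NEC), and "link" is the linking step (NEL).
    """
    annotations = []
    for surface, (entity_type, identifier) in GAZETTEER.items():
        start = text.find(surface)
        while start != -1:
            annotations.append({
                "start": start,
                "end": start + len(surface),
                "text": surface,
                "type": entity_type,
                "link": identifier,
            })
            start = text.find(surface, start + len(surface))
    return sorted(annotations, key=lambda a: a["start"])


# Sample text: the second sentence is invented for this example.
text = "Gina Torretta took the helm at BNC Holdings Inc. The board welcomed Ms. Torretta."
annotations = annotate(text)
for a in annotations:
    print(a["text"], a["type"], a["link"])
```

Flat NERC would report two separate Person mentions. With linking, both mentions resolve to the same identifier, which is what allows knowledge attached to that ontology entry to be reused.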
1.2 AMBIGUITY
It is impossible for computers to analyze language correctly 100% of the time, because language is highly ambiguous. Ambiguous language means that more than one interpretation is possible, either syntactically or semantically. As humans, we can often use world knowledge to resolve these ambiguities and pick the correct interpretation. Computers cannot easily rely on world knowledge and common sense, so they have to use statistical or other techniques to resolve ambiguity. Some kinds of text, such as newspaper headlines and messages on social media, are often designed to be deliberately ambiguous for entertainment value or to make them more memorable. Some classic examples of this are shown below:
• Foot Heads Arms Body.
• Hospitals Sued by 7 Foot Doctors.
• British Left Waffles on Falkland Islands.
• Stolen Painting Found by Tree.
In the first headline, there is syntactic ambiguity between the proper noun (Michael) Foot, a person, and the common noun foot, a body part; between the verb and plural noun heads, and the same for arms. There is also semantic ambiguity between two meanings of both arms (weapons and body parts), and body (physical structure and a large collection). In the second headline, there is semantic ambiguity between two meanings of foot (the body part and the measurement), and also syntactic ambiguity in the attachment of modifiers (7 [Foot Doctors] or [7 Foot] Doctors). In the third example, there is both syntactic and semantic ambiguity in the word Left (past tense of the verb, or a collective noun referring to left-wing politicians). In the fourth example, there is ambiguity in the role of the preposition by (as agent or location). In each of these examples, for a human, one meaning is possible, and the other is either impossible or extremely unlikely (doctors who are 7 feet tall, for instance). For a machine, understanding without additional context that leaving pastries in the Falkland Islands, though perfectly possible, is an unlikely news item, is almost impossible.
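The syntactic ambiguity in the first headline can be made tangible by enumerating every combination of parts of speech the words could take. The lexicon below is a hypothetical toy built only from the alternatives named in the paragraph above; a real tagger would use a much larger lexicon and a statistical model to rank the candidates rather than list them.

```python
from itertools import product

# Hypothetical toy lexicon: the parts of speech each headline word could take.
LEXICON = {
    "Foot": ["proper noun", "common noun"],
    "Heads": ["verb", "plural noun"],
    "Arms": ["verb", "plural noun"],
    "Body": ["common noun"],
}


def tag_sequences(words):
    """Return every possible tagging of the words as a list of (word, tag) pairs."""
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


headline = ["Foot", "Heads", "Arms", "Body"]
readings = tag_sequences(headline)

print(len(readings))  # 2 * 2 * 2 * 1 = 8 candidate taggings
for reading in readings:
    print(reading)
```

Only one of the eight candidates corresponds to the intended reading (a person named Foot leads a body concerned with arms), and nothing in the lexicon alone says which. The semantic ambiguity of arms and body multiplies the number of interpretations further.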
1.3 PERFORMANCE
Because of ambiguity and a variety of other issues discussed throughout the book, performance on NLP tasks varies widely, both between different tasks and between different tools. The reasons why tools perform differently will be discussed in the relevant sections, but in general some tools are good at some elements of a task and bad at others, and performance often suffers when a tool is trained on one kind of data and tested on another. The variation between tasks, by contrast, is largely a matter of task complexity.
The influence of domain dependence on the effectiveness of NLP tools is an issue that is all too frequently overlooked. For the technology to be suitable for real-world applications, systems need to be easily customizable to new domains. Some NLP tasks in particular, such as