Natural Language Processing for the Semantic Web. Diana Maynard
An independent, though related, issue concerns the adaptation of existing systems to different text genres. By this we mean not just changes in domain, but different media (e.g., email, spoken text, written text, web pages, social media), text type (e.g., reports, letters, books), and structure (e.g., layout). The genre of a text may be influenced by a number of factors, such as author, intended audience, and degree of formality. For example, less formal texts may not follow standard capitalization, punctuation, or even spelling formats, all of which can be problematic for the intricate mechanisms of IE systems. These issues will be discussed in detail in Chapter 8.
Many natural language processing tasks, especially the more complex ones, only become really accurate and usable when they are tightly focused and restricted to particular applications and domains. Figure 1.3 shows a three-dimensional tradeoff graph between generality vs. specificity of domain, complexity of the task, and performance level. From this we can see that the highest performance levels are achieved in language processing tasks that are focused on a specific domain and that are relatively simple (for example, identifying named entities is much simpler than identifying events).
Figure 1.3: Performance tradeoffs for NLP tasks.
To make the integration of NLP into semantic web applications feasible, semantic web and NLP practitioners must reach some understanding of what constitutes a reasonable expectation. This is of course true for all applications where NLP should be integrated. For example, some applications involving NLP may not be realistically usable in the real world as standalone automatic systems without human intervention. This is not necessarily the case, however, for other kinds of semantic web applications which do not rely on NLP. Some applications are designed to assist a human user rather than to perform the task completely autonomously. There is often a tradeoff to be made in deciding how much autonomy will most benefit the end user. For example, information extraction systems spare the end user from having to read hundreds or even thousands of documents in detail in order to find the information they want. Searching manually through millions of documents is virtually impossible for humans. On the other hand, the user has to bear in mind that a fully automated system will not be 100% accurate, and it is important for the design of the system to be flexible in terms of the tradeoff between precision and recall. For some applications, it may be more important to retrieve everything (high recall), even if some of the information retrieved is incorrect; for others, it may be more important that everything retrieved is accurate (high precision), even if some things are missed.
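The tradeoff between precision and recall can be illustrated with a minimal sketch. It assumes a hypothetical extractor that attaches a confidence score to each candidate item, and a hand-built gold standard; the candidate names, scores, and thresholds below are invented purely for illustration and do not come from any real system.

```python
from typing import Iterable, Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    # Convention assumed here: precision is 0.0 when nothing is predicted,
    # and recall is 0.0 when the gold set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def extract(candidates: Iterable[Tuple[str, float]], threshold: float) -> Set[str]:
    """Keep only the candidates whose confidence score meets the threshold."""
    return {item for item, score in candidates if score >= threshold}


# Hypothetical scored output of an extraction system (illustrative only).
CANDIDATES = [("London", 0.95), ("Acme Corp", 0.80), ("Monday", 0.40), ("Paris", 0.30)]
# Hypothetical gold standard: the items a human annotator marked as correct.
GOLD = {"London", "Acme Corp", "Paris"}

# A strict threshold favors precision; a lenient one favors recall.
strict = precision_recall(extract(CANDIDATES, 0.75), GOLD)
lenient = precision_recall(extract(CANDIDATES, 0.25), GOLD)

print(f"strict:  precision={strict[0]:.2f} recall={strict[1]:.2f}")
print(f"lenient: precision={lenient[0]:.2f} recall={lenient[1]:.2f}")
```

With these invented values, the strict threshold returns only correct items but misses one, while the lenient threshold finds every gold item at the cost of one incorrect result. Exposing the threshold as a parameter is one simple way for a system design to stay flexible about this tradeoff.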
1.4 STRUCTURE OF THE BOOK
Each chapter in the book is designed to introduce a new concept in the NLP pipeline, and to show how each component builds on the previous components described. In each chapter we outline the concept behind the component and give examples of common methods and tools. While each chapter stands alone to some extent, in that it refers to a specific task, the chapters build on each other. The first five chapters are therefore best read sequentially.
Chapter 2 describes the main approaches used for NLP tasks, and explains the concept of an NLP processing pipeline. The linguistic processing components comprising this pipeline—language identification, tokenization, sentence splitting, part-of-speech tagging, morphological analysis, and parsing and chunking—are then described, and examples are given from some of the major NLP toolkits.
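As a rough illustration of the pipeline idea, the sketch below chains two toy components, a tokenizer and a sentence splitter, so that the second consumes the output of the first. The function names, the dictionary-based document representation, and the regular expression are assumptions made for this example; they are far cruder than the components found in real NLP toolkits.

```python
import re
from typing import Any, Callable, Dict, List

# Assumed document representation: a plain dictionary of annotations.
Document = Dict[str, Any]
Component = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Split the raw text into word tokens and single punctuation tokens."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def split_sentences(doc: Document) -> Document:
    """Group tokens into sentences, ending a sentence at '.', '!' or '?'."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in doc["tokens"]:  # relies on the tokenizer having run first
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep any trailing tokens that lack final punctuation
        sentences.append(current)
    doc["sentences"] = sentences
    return doc


def run_pipeline(text: str, components: List[Component]) -> Document:
    """Apply each component in order, passing the document along."""
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


result = run_pipeline("NLP is fun. It builds on earlier steps!",
                      [tokenize, split_sentences])
print(result["sentences"])
```

Because the sentence splitter reads the tokens produced by the tokenizer, swapping the order of the two components would fail, which is the dependency between stages that a pipeline makes explicit.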
Chapter 3 introduces the task of named entity recognition and classification (NERC), which is a key component of information extraction and semantic annotation systems, and discusses its importance and limitations. The main approaches to the task are summarized, and a typical NERC pipeline is described.
Chapter 4 describes the task of extracting relations between entities, explaining how and why this is useful for automatic knowledge base population. The task can involve either extracting binary relations between named entities, or extracting more complex relations, such as events. It describes a variety of methodologies and a typical extraction pipeline, showing the interaction between the tasks of named entity and relation extraction and discussing the major research challenges.
Chapter 5 explains how to perform entity linking by adding semantics into a standard flat information extraction system, of the kind that has been described in the preceding chapters. It discusses why this flat information extraction is not sufficient for many tasks that require greater richness and reasoning and demonstrates how to link the entities found to an ontology and to Linked Open Data resources such as DBpedia and Freebase. Examples of a typical semantic annotation pipeline and of real-world applications are provided.
Chapter 6 introduces the concept of automated ontology development from unstructured text, which comprises three related components: learning, population, and refinement. Some discussion of these terms and their interaction is given, the relationship between ontology development and semantic annotation is discussed, and some typical approaches are described, again building on the notions introduced in the previous chapters.
Chapter 7 describes methods and tools for the detection and classification of various kinds of opinion, sentiment, and emotion, again showing how the NLP processes described in previous chapters can be applied to this task. In particular, aspect-based sentiment analysis (such as which elements of a product are liked and disliked) can benefit from the integration of product ontologies into the processing. Examples of real applications in various domains are given, showing how sentiment analysis can also be slotted into wider applications for social media analysis. Because sentiment analysis is often performed on social media, this chapter is best read in conjunction with Chapter 8.
Chapter 8 discusses the main problems faced when applying traditional NLP techniques to social media texts, given their unusual and inconsistent usage of spelling, grammar, and punctuation, amongst other things. Because traditional tools often perform poorly on such texts, they typically need to be adapted to this genre. In particular, errors introduced by the core pre-processing components described in Chapters 2 and 3 can have a serious knock-on effect on later elements in the processing pipeline. This chapter introduces some state-of-the-art approaches for processing social media and gives examples of some real applications.
Chapter 9 brings together all the components described in the previous chapters by defining and describing a number of application areas in which semantic annotations are required, such as semantically enhanced information retrieval