Domain-Sensitive Temporal Tagging. Jannik Strötgen

Domain-Sensitive Temporal Tagging - Jannik Strötgen Synthesis Lectures on Human Language Technologies

can locally be normalized to XXXX-12 and XXXX-12-25, respectively, that is, without specifying the year. Assuming that the reference time is “November 2013” (2013-11) and the relation to the reference time is “after”, then the two examples can be normalized to 2013-12 and 2013-12-25, respectively.
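The year resolution described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular temporal tagger: the function name `resolve_underspecified`, its parameters, and the handling of the case where expression and reference time coincide (resolved to the reference year) are assumptions made for this example.

```python
def resolve_underspecified(value: str, reference: str, relation: str) -> str:
    """Fill the XXXX year placeholder of a locally normalized date value.

    value:     e.g. "XXXX-12" or "XXXX-12-25" (year not specified)
    reference: reference time in ISO-like form, e.g. "2013-11"
    relation:  "after" or "before", relative to the reference time
    """
    if not value.startswith("XXXX-"):
        return value  # year already specified, nothing to resolve

    rest = value[len("XXXX-"):]   # month or month-day part of the expression
    ref_year = int(reference[:4])
    ref_rest = reference[5:]      # month or month-day part of the reference

    # Compare at the coarsest granularity both values share. Zero-padded
    # strings such as "11" and "12" compare correctly as text.
    n = min(len(rest), len(ref_rest))
    expr_part, ref_part = rest[:n], ref_rest[:n]

    if relation == "after":
        # Assumption: an expression equal to the reference stays in that year.
        year = ref_year if expr_part >= ref_part else ref_year + 1
    elif relation == "before":
        year = ref_year if expr_part <= ref_part else ref_year - 1
    else:
        raise ValueError(f"unknown relation: {relation!r}")

    return f"{year:04d}-{rest}"


print(resolve_underspecified("XXXX-12", "2013-11", "after"))
print(resolve_underspecified("XXXX-12-25", "2013-11", "after"))
```

With the reference time 2013-11 and the relation "after", the two calls yield 2013-12 and 2013-12-25, matching the example in the text. In practice, determining the reference time and the relation is the hard part; the sketch assumes both are already known.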

       ALTERNATIVE NAMINGS

      As mentioned above, the categorization of temporal expressions referring to points in time has quite a long tradition in the literature. The set of expressions that we call explicit expressions is usually fixed, and only the names used for them differ: explicit [e.g., Alonso et al., 2007, Schilder and Habel, 2001], fully specified [e.g., Pustejovsky et al., 2003a], absolute [e.g., Derczynski, 2013, Jurafsky and Martin, 2008], complete [e.g., Hinrichs, 1986], and independent [e.g., Hinrichs, 1986]. Expressions we call implicit are less frequently discussed. Grouping the other expressions (i.e., the ones we refer to as relative and underspecified) results in different, partially overlapping sets with multiple names in the literature.

      In the following, we present Mazur’s [2012] overview of the terminology used in the literature. For this, the following three example expressions are used:

      (i) “tomorrow”,

      (ii) “2 days later”, and

      (iii) “May 21st”.

      While some authors group all three types of expressions together, e.g., as indexical expressions [e.g., Schilder and Habel, 2001] or relative expressions [e.g., Alonso et al., 2007], Smith [1978] and Hinrichs [1986] already separated them into three groups. Expressions such as (i) are frequently referred to as deictic expressions [e.g., Ahn et al., 2005, Busemann et al., 1997, Hinrichs, 1986, Smith, 1978]. Expressions such as (ii) are referred to as anaphoric expressions by some authors [Busemann et al., 1997], while others use the same term for expressions such as both (ii) and (iii) [e.g., Ahn et al., 2005]. In our categorization, we follow Busemann et al. [1997] in referring to expressions such as (iii) as underspecified expressions.

      Some authors include so-called “vague expressions” as a separate group of point expressions. For instance, Mani and Wilson [2000b] use the term for expressions such as “Monday morning” or season names (e.g., “fall”, “winter”), since their boundaries are fuzzy, that is, they have no exact start and end times. However, we agree with Mazur [2012] that the vagueness of such expressions should not result in a separate type of expression since it “is not the expression that is vague […] [but] the entity referred to that has vague boundaries” [Mazur, 2012].

       UNCERTAINTY OF TEMPORAL EXPRESSIONS

      Standard date and time expressions are also often used without referring to the full duration of the expression. That is, their actual meaning is uncertain, or more specifically, it is not clear which exact time interval they refer to [Berberich et al., 2010]. For instance, in “he visited Germany in 2010”, it is rather unlikely that the visit lasted the whole year. The exact point or period in 2010 is not known. Thus, all expressions of a larger granularity than a timestamp could be regarded as fuzzy. As will be described in Chapter 3, annotation standards typically assign date and time expressions a single normalized value, which is why we also refer to them as points in time (with specific granularities). However, as pointed out by Berberich et al. [2010] (and as we will also discuss in Section 3.1 when describing annotation standards), for some applications it may be useful to treat every time and date expression as an interval and to assign lower and upper bounds for its start and end times instead of a single value, that is, to account for the fuzziness.
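To make the interval view concrete, the following Python sketch derives lower and upper bounds for the start and end of a normalized date value. It is one possible reading of the idea, not the model defined by Berberich et al. [2010]: the class `FuzzyInterval`, its field names, the function `bounds`, and the choice of day as the finest granularity are all assumptions made for this example.

```python
import calendar
from datetime import date
from typing import NamedTuple


class FuzzyInterval(NamedTuple):
    """Hypothetical container: bounds for the start and end of an interval."""
    begin_lower: date
    begin_upper: date
    end_lower: date
    end_upper: date


def bounds(value: str) -> FuzzyInterval:
    """Map a normalized value (YYYY, YYYY-MM, or YYYY-MM-DD) to day bounds."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:                      # year granularity
        first, last = date(parts[0], 1, 1), date(parts[0], 12, 31)
    elif len(parts) == 2:                    # month granularity
        year, month = parts
        days_in_month = calendar.monthrange(year, month)[1]
        first, last = date(year, month, 1), date(year, month, days_in_month)
    elif len(parts) == 3:                    # day granularity
        first = last = date(*parts)
    else:
        raise ValueError(f"unsupported value: {value!r}")
    # Without further context, both the start and the end of the actual
    # interval may lie anywhere between the first and last covered day.
    return FuzzyInterval(first, last, first, last)


print(bounds("2010"))
```

For “he visited Germany in 2010”, the sketch states only that the visit started and ended somewhere between 2010-01-01 and 2010-12-31, rather than claiming that it spanned the whole year.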

      Figure 2.4: Different realization types of date expressions in documents.

       EXAMPLES OF DATE EXPRESSIONS IN A NEWS ARTICLE

      In order to become familiar with the naming of realization types of date and time expressions, we give some examples in Figure 2.4. In some excerpts of the news article already shown in Figure 1.1, temporal expressions are marked as either explicit, underspecified, or relative. Since the original article contained no implicit temporal expression, we added the last sentence to the example so that it covers all four realization types of temporal expressions.

      As already pointed out above, there are differences in how temporal expressions of the four realization types are to be normalized. Since these differences are one of the key challenges of temporal tagging, we will cover them in detail in Chapter 4. Before that, we will first lay some further foundations (annotation standards and evaluation methods) and present an overview of relevant research competitions as well as existing annotated data sets in the next chapter.

      The most important characteristic of temporal information in the context of temporal tagging is that it can be normalized. For applications exploiting normalized temporal information, it is furthermore important that temporal information is well defined and that it can be organized hierarchically. While there are four types of temporal expressions (date, time, duration, and set expressions), several names for the realizations of date and time expressions have been suggested in the literature. In the context of temporal tagging, however, we suggest distinguishing between explicit, implicit, relative, and underspecified date and time expressions.

      CHAPTER 3

       Foundations of Temporal Tagging

      In this chapter, we lay the theoretical foundations to fully understand the discipline of temporal tagging and the challenges that approaches to temporal tagging are faced with. For this, we survey annotation standards, evaluation methods, research competitions, and temporally annotated corpora.

      As introduced in the previous chapter, there are different types of temporal expressions: date, time, duration, and set expressions. In addition, temporal expressions can carry their meaning explicitly or implicitly, or they can be underspecified or relative to some context information. When addressing the task of temporal tagging, the following must be well defined: (i) what types of temporal expressions are “markable” [Ferro et al., 2005b] and should thus be annotated; (ii) what extents should be annotated; and (iii) how the semantics of the expressions can be captured by normalization attributes whose values follow a standard format. Thus, annotation standards with precise specifications are a prerequisite when dealing with the task of temporal tagging.

      Currently, there are two widely used annotation standards for annotating temporal expressions in documents: TIDES TIMEX2 [Ferro et al., 2001, 2005b] and TimeML [Pustejovsky et al., 2003a, 2005, 2010], a specification language for temporal annotation using TIMEX3 tags for temporal expressions. Both standards present guidelines for the annotation of temporal expressions, including how to determine the extents of expressions and their normalizations. In both cases, the normalization is defined according to the ISO 8601 standard for temporal information with some extensions. For instance, a date expression of granularity day is
