Domain-Sensitive Temporal Tagging. Jannik Strötgen
TIDES TIMEX2
While there have been several TIMEX definitions, ranging from extent-only coverage [see, e.g., Chinchor, 1998] to the inclusion of some normalization information [see, e.g., Mani and Wilson, 2000a, Setzer and Gaizauskas, 2000], the TIDES TIMEX2 definitions were the first annotation guidelines that were defined in sufficient detail to become broadly accepted as a standard. The annotation guidelines are based on the principles that temporal expressions should be tagged “if a human can determine a value for [it]” and that the value “must be based on evidence internal to the document” [Ferro et al., 2001]. Because they cover both extent and normalization information, they address both questions: “What is a temporal expression?” and “What is the meaning of a temporal expression?” For the normalization, TIMEX2 tags may contain the following attributes [Ferro et al., 2005b]:
• VAL: a normalized form of the date/time [or duration/set];
• MOD: captures temporal modifiers;
• ANCHOR_VAL: a normalized form of an anchoring date/time [of a duration];
• ANCHOR_DIR: the relative direction between VAL and ANCHOR_VAL; and
• SET: identifies expressions denoting sets of times.
Except for the SET attribute, TIMEX2 has no dedicated attribute for the type of a temporal expression. Nevertheless, since the VAL attribute reveals whether an expression is a time, a date, or a duration, and the SET attribute marks sets, the classification of temporal expressions into these four types is implicitly covered by TIMEX2 annotations. However, TIMEX2 annotations are rather difficult to use if only the extraction and classification of temporal expressions is targeted, without their full normalization.
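The implicit classification described above can be illustrated with a small heuristic. This is a minimal sketch and not part of the TIDES guidelines: the function name `infer_timex2_type` and the `is_set` flag (standing in for the presence of the SET attribute) are hypothetical, and the rules only cover simple VAL strings such as those discussed in this section plus the `_REF` tokens.

```python
def infer_timex2_type(val: str, is_set: bool = False) -> str:
    """Guess the implicit type of a TIMEX2 annotation from its VAL string.

    Heuristic sketch only; real VAL strings can be more varied.
    """
    if is_set:                  # the SET attribute marks set expressions
        return "set"
    if val.endswith("_REF"):    # tokens such as PRESENT_REF or PAST_REF
        return "date"
    if val.startswith("P"):     # P3Y, PT5H: durations start with "P"
        return "duration"
    if "T" in val:              # 2014-10-12T07:00: a time of day is given
        return "time"
    return "date"               # 2009-09-13 and other calendar values


for val in ("2009-09-13", "2014-10-12T07:00", "P3Y", "PT5H"):
    print(val, "->", infer_timex2_type(val))
```

The need for such a workaround is exactly the difficulty noted above: without an explicit type attribute, a consumer has to inspect the normalized value to classify an expression.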
TIMEML WITH TIMEX3 TAGS FOR TEMPORAL EXPRESSIONS
TimeML, which has more recently been formalized to create the ISO standard ISO-TimeML [Pustejovsky et al., 2010], is based on the TIDES standard and was developed to capture further types of temporal information in documents. In contrast to TIDES, which has only one tag for temporal expressions, TimeML contains tags for annotating events, temporal links (i.e., temporal relations), and temporal signals in addition to the TIMEX3 tag for temporal expressions [Pustejovsky et al., 2003a, 2005, 2010]. In the following, we focus on a description of TimeML aspects that are relevant for the task of temporal tagging.
Because TimeML focuses on temporal information in general and not only on temporal expressions, there are significant differences between TIMEX2 and TIMEX3. These differences concern both the attributes and the extents of temporal expressions. For example, events can be part of temporal expressions in TIMEX2 (<TIMEX2>two days after the revolution</TIMEX2>), while they are not part of temporal expressions following TimeML (<TIMEX3>two days</TIMEX3> after the revolution).
In particular, specific types of pre- and post-modifiers of temporal expressions are part of TIMEX2 tags, while in TimeML they are outside TIMEX3 tags [Mazur, 2012]. Such constructs are handled using the newly introduced tags for annotating relations between temporal expressions and events. In addition, TIMEX3 tags cannot be nested. However, TIMEX3 tags with no extent are introduced, for example, to deal with unspecified time points, which are sometimes needed to anchor durations. Note that although such abstract tags, that is, annotations without any extent, are described in the TimeML annotation guidelines, they were not used [cf. Mazur, 2012], neither in annotated corpora nor by TIMEX3-compliant temporal taggers, until the Italian temporal tagging challenge EVENTI in 2014 [Caselli et al., 2014]. Abstract tags have also been annotated in the MEANTIME corpus [Minard et al., 2016], which was released in 2016 and developed in the context of the NewsReader project. Before that, empty TIMEX3 tags had been mostly ignored.
To describe the semantics of temporal expressions, the most important attributes of TIMEX3 tags are:
• type: defines whether the expression is of type date, time, duration, or set;
• value: a normalized form of the expression;
• mod: captures temporal modifiers;
• quant and freq: specify the quantity and frequency of set expressions;
• beginpoint and endpoint: anchor begin and end of a duration; and
• tid: automatically assigned id number.
While the attribute type—with possible values “date”, “time”, “duration”, and “set”—is newly introduced in TIMEX3, the attributes value and mod are similar to the VAL and MOD attributes in TIMEX2. These two attributes already capture a large part of the information of temporal expressions, and for many expressions—in particular for many date and time expressions—the value attribute is the only attribute besides type that is needed for normalization. This is also the reason why in several evaluations of temporal taggers, the value attribute is the focus of interest [see, e.g., UzZaman et al., 2013].
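To make the attribute list concrete, the following minimal sketch serializes a TIMEX3 annotation that carries only tid, type, and value. The helper name `make_timex3` is hypothetical, and the upper-case spelling of the type value follows the convention commonly seen in TimeML-annotated corpora rather than anything stated in this section.

```python
import xml.etree.ElementTree as ET


def make_timex3(tid: str, timex_type: str, value: str, text: str) -> str:
    """Serialize one TIMEX3 annotation with the three core attributes."""
    elem = ET.Element("TIMEX3", {"tid": tid, "type": timex_type, "value": value})
    elem.text = text  # the extent, i.e., the annotated surface string
    return ET.tostring(elem, encoding="unicode")


print(make_timex3("t1", "DATE", "2009-09-13", "September 13, 2009"))
# <TIMEX3 tid="t1" type="DATE" value="2009-09-13">September 13, 2009</TIMEX3>
```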
In particular for explicit date and time expressions, forming the value attribute (or the VAL attribute in TIMEX2) is straightforward. For example, the values of the expressions “September 13, 2009” and “Oct 12, 2014 7:00 am” are 2009-09-13 and 2014-10-12T07:00, respectively. For underspecified and relative date and time expressions, setting the value attribute is more challenging, because the information covered by their own extents is not sufficient. Instead, a reference time has to be used along with a temporal function to calculate the content of the value attribute. For instance, in a document published on November 27, 2014 (2014-11-27), the expression “yesterday” can be normalized to 2014-11-26.
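The difference between explicit and relative expressions can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular temporal tagger works: the function names, the two date patterns, and the small offset table are all hypothetical, and the month-name parsing assumes an English (or C) locale.

```python
from datetime import date, datetime, timedelta

# Hypothetical (input pattern, output pattern) pairs for explicit expressions.
EXPLICIT_PATTERNS = (
    ("%B %d, %Y", "%Y-%m-%d"),                 # "September 13, 2009"
    ("%b %d, %Y %I:%M %p", "%Y-%m-%dT%H:%M"),  # "Oct 12, 2014 7:00 am"
)

# Hypothetical temporal functions: day offsets relative to the reference time.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_explicit(expression: str) -> str:
    """Explicit expressions carry everything needed for the value."""
    for in_fmt, out_fmt in EXPLICIT_PATTERNS:
        try:
            return datetime.strptime(expression, in_fmt).strftime(out_fmt)
        except ValueError:
            continue  # try the next pattern
    raise ValueError(f"unsupported expression: {expression!r}")


def normalize_relative(expression: str, reference: date) -> str:
    """Relative expressions need a reference time plus an offset."""
    offset = RELATIVE_OFFSETS[expression.lower()]
    return (reference + timedelta(days=offset)).isoformat()


print(normalize_explicit("September 13, 2009"))                 # 2009-09-13
print(normalize_explicit("Oct 12, 2014 7:00 am"))               # 2014-10-12T07:00
print(normalize_relative("yesterday", date(2014, 11, 27)))      # 2014-11-26
```

Note that `normalize_relative` cannot produce a value without the `reference` argument, which mirrors the point made above: the extent of a relative expression alone is not sufficient.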
Value attributes in TIMEX3 (like VAL attributes in TIMEX2) assigned to duration expressions start with “P” (period), followed by an amount and an abbreviated unit, e.g., the value of “three years” is P3Y. If the unit of the duration is smaller than a day, the value attribute starts with “PT” (period, time), e.g., PT5H for the expression “five hours”. Thus, the value attribute of durations represents the length of the duration. If a duration can be anchored to some point in time, the attribute beginpoint or endpoint can be used to cover this information. Finally, the value attributes of set expressions are often similar to the ones of duration expressions. However, set expressions are additionally assigned at least one of the attributes quant and freq to cover the characteristics of set expressions. For instance, “twice a week” has a value attribute of P1W and a freq attribute of 2X.
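The formation rules for duration and set values can be sketched as follows. The function names and unit tables are hypothetical and cover only a handful of English unit words; the output strings follow the pattern described above (P plus amount plus unit, with PT for units smaller than a day).

```python
# Hypothetical unit tables; only a few English unit words are covered.
DATE_UNITS = {"day": "D", "week": "W", "month": "M", "year": "Y"}
TIME_UNITS = {"second": "S", "minute": "M", "hour": "H"}


def duration_value(amount: int, unit: str) -> str:
    """Build a duration value such as P3Y or PT5H."""
    unit = unit.lower().rstrip("s")  # accept "years" as well as "year"
    if unit in DATE_UNITS:
        return f"P{amount}{DATE_UNITS[unit]}"
    if unit in TIME_UNITS:           # units smaller than a day use "PT"
        return f"PT{amount}{TIME_UNITS[unit]}"
    raise ValueError(f"unsupported unit: {unit!r}")


def set_attributes(times: int, unit: str) -> dict:
    """Attributes for expressions like "twice a week" (times=2, unit="week")."""
    return {"value": duration_value(1, unit), "freq": f"{times}X"}


print(duration_value(3, "years"))   # P3Y
print(duration_value(5, "hours"))   # PT5H
print(set_attributes(2, "week"))    # {'value': 'P1W', 'freq': '2X'}
```

The "T" separator matters because "M" is ambiguous on its own: P5M denotes five months, whereas PT5M denotes five minutes.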
In contrast to the other attributes, the tid attribute does not contain any normalized information about an expression, but is just an id number that is automatically generated. It can be used to refer from other TimeML objects to a particular TIMEX3 object. Due to the relations between annotated instances within TimeML, for example, a temporal relation between an event and a temporal expression, an id is assigned to all objects in TimeML.
For many temporal expressions, only an identifier, a type, and a value are assigned. In addition, although the different attributes and definitions of extents between TIMEX2 and TIMEX3 are significant, the annotations for many temporal expressions are very similar, and an automated conversion works reasonably well