Domain-Sensitive Temporal Tagging. Jannik Strötgen

[Alonso et al., 2011], temporal relation extraction [UzZaman et al., 2013], and (time-related) question answering [Llorens et al., 2015]. Far more common for evaluating temporal taggers, however, are intrinsic evaluations, which use manually annotated corpora to assess a temporal tagger’s extraction and normalization quality directly.


      For intrinsic evaluations, temporal tagging is treated as a specific sequential tagging task, and the confusion matrix (also called contingency table or contingency matrix) can be used to describe a system’s output when compared to a gold standard. As shown in Table 3.1, the confusion matrix assigns each decision of a temporal tagger to one of the following four classes of a binary classification [Manning and Schütze, 2003]:

      • true positives (TP): annotated by the system and in the gold standard;

      • true negatives (TN): neither annotated by the system nor in the gold standard;

      • false positives (FP): annotated by the system but not in the gold standard; and

      • false negatives (FN): not annotated by the system but in the gold standard.

      Note that because many temporal expressions consist of more than one token, it is also common to distinguish between strict and relaxed matching. Details about the differences will be explained at the end of the section (page 29).

                        Gold Standard (Ground Truth)
System Prediction       Positive        Negative
Positive                TP              FP
Negative                FN              TN
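
      The grouping into confusion-matrix classes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: annotations are represented as (start, end) character offsets, matching is strict (a system span counts only if it is identical to a gold span), and the function name `confusion_counts` is hypothetical. True negatives are not counted because the sketch works on annotated spans only.

```python
def confusion_counts(system_spans, gold_spans):
    """Count TP, FP, and FN under strict (exact-span) matching.

    Spans are (start, end) offset tuples; this representation is an
    assumption made for the example, not a format from the text.
    """
    system, gold = set(system_spans), set(gold_spans)
    tp = len(system & gold)  # annotated by the system and in the gold standard
    fp = len(system - gold)  # annotated by the system but not in the gold standard
    fn = len(gold - system)  # in the gold standard but missed by the system
    return tp, fp, fn


gold = {(0, 2), (10, 12), (20, 21)}
system = {(0, 2), (10, 11), (30, 31)}
print(confusion_counts(system, gold))  # (1, 2, 2)
```

      In this toy input, the system span (10, 11) only partially overlaps the gold span (10, 12), so under strict matching it produces both a false positive and a false negative.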


      Both tasks of a temporal tagger, the extraction and the normalization of temporal expressions, can be evaluated based on the confusion matrix. For the extraction, true positives are all instances that are correctly extracted by the system, while for the normalization, only instances that are both correctly extracted and correctly normalized count as true positives. Typically, an evaluation reports the measures of precision, recall, and F1-score.
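
      The difference between the two notions of a true positive can be illustrated with a small sketch. It assumes, purely for the example, that each annotation maps a (start, end) span to a normalized value string; the function name `true_positives` and the sample values are hypothetical.

```python
def true_positives(system, gold):
    """Return (extraction_tp, normalization_tp).

    `system` and `gold` map (start, end) spans to normalized values.
    An extraction TP needs a matching span; a normalization TP needs a
    matching span and an identical normalized value.
    """
    extraction_tp = sum(1 for span in system if span in gold)
    normalization_tp = sum(
        1 for span, value in system.items()
        if span in gold and gold[span] == value
    )
    return extraction_tp, normalization_tp


gold = {(0, 2): "2015-03-01", (10, 12): "P2D"}
system = {(0, 2): "2015-03-01", (10, 12): "P2W", (30, 31): "2015"}
print(true_positives(system, gold))  # (2, 1)
```

      The second system annotation has the correct extent but the wrong value, so it is a true positive for the extraction but not for the normalization.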

      Precision indicates how many of the expressions extracted by the system are correct (Equation 3.1). If all instances marked as positive by the system are correct, precision equals 1; if all of them are incorrectly marked, precision equals 0:

      $$\text{precision} = \frac{TP}{TP + FP} \tag{3.1}$$

      In contrast, recall indicates how many of the expressions that should be extracted are correctly extracted by the system (Equation 3.2). Thus, recall equals 0 if none of the instances that should be marked as positive is marked as positive by the system, and recall equals 1 if all of them are indeed marked as positive:

      $$\text{recall} = \frac{TP}{TP + FN} \tag{3.2}$$
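
      As a worked example, precision and recall can be computed from the confusion-matrix counts using the standard definitions TP / (TP + FP) and TP / (TP + FN). The sketch below is illustrative only. Returning 0.0 when a denominator is zero is a convention assumed here, not something the text prescribes, and the counts are made up.

```python
def precision(tp, fp):
    """Share of system annotations that are correct: TP / (TP + FP)."""
    # Assumed convention: 0.0 if the system annotated nothing.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of gold annotations that the system found: TP / (TP + FN)."""
    # Assumed convention: 0.0 if the gold standard is empty.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 8 correct, 2 spurious, 8 missed expressions.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```

      With these counts the system is right in 8 of its 10 annotations, but it finds only 8 of the 16 expressions in the gold standard, which is why the two measures are reported together.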

