Domain-Sensitive Temporal Tagging. Jannik Strötgen
TEMPORAL TAGGING FOR TOPIC DETECTION AND TRACKING
The goal of topic detection and tracking (TDT) is to organize news documents in an event-based way by building clusters of topics [Allan, 2002]. In this context, a topic is typically defined as “a seminal event or activity, along with all directly related events and activities” [Fiscus and Doddington, 2002]. For instance, the very first news article about a plane crash opens a new topic, and following news articles such as reports about the number of fatalities belong to the same topic. In contrast, news articles reporting about another plane crash do not belong to the same cluster. To decide whether an upcoming news document belongs to an existing cluster or opens a new cluster, the similarity between documents is typically determined based on some information extracted from the documents. For instance, Makkonen et al. [2003] create event vectors consisting of (i) names, (ii) locations, (iii) temporals, and (iv) content words.
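The comparison of such class-wise event vectors can be sketched as follows. This is only a minimal illustration, not the method of Makkonen et al. [2003]: the use of cosine similarity per class, the equal class weights, the function names, and the toy documents are all assumptions made for this example. The temporals are assumed to be already normalized by a temporal tagger, so that different surface forms of the same date become identical values.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

# The four semantic classes named in the text.
CLASSES = ("names", "locations", "temporals", "content")


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of terms (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def event_similarity(
    doc_a: Dict[str, List[str]],
    doc_b: Dict[str, List[str]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-class cosine similarities (equal weights assumed)."""
    weights = weights or {c: 1.0 / len(CLASSES) for c in CLASSES}
    return sum(
        weights[c] * cosine(Counter(doc_a.get(c, [])), Counter(doc_b.get(c, [])))
        for c in CLASSES
    )


# Hypothetical toy documents; temporals are normalized date values.
first_report = {
    "names": ["Germanwings"],
    "locations": ["Alps"],
    "temporals": ["2015-03-24"],
    "content": ["plane", "crash"],
}
follow_up = {
    "names": ["Germanwings"],
    "locations": ["Alps"],
    "temporals": ["2015-03-24"],
    "content": ["crash", "fatalities"],
}
other_crash = {
    "names": ["ExampleAir"],
    "locations": ["Pacific"],
    "temporals": ["2014-12-28"],
    "content": ["plane", "crash"],
}

same_topic = event_similarity(first_report, follow_up)
different_topic = event_similarity(first_report, other_crash)
```

In this toy setting, the follow-up report scores higher against the first report than the article about the other crash does, although the latter shares all of its content words with the first report. A clustering step could then compare such scores against a threshold, which would have to be tuned on real data.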
Figure 1.2: Excerpts of the Wikipedia page about “Heidelberg University” and a timeline to which occurring temporal expressions are mapped. The content is not reported in a chronological order due to different topical sections about Heidelberg University. Thus, temporal tagging is crucial to correctly extract and order event information in a chronological way.
Figure 1.3: Excerpts of the CNNMoney article of Figure 1.1. After reporting on a recent happening, it refers to an event from the past in its last paragraph. Again temporal tagging is crucial to correctly extract and order event information.
In general, ambiguous expressions—such as “Tuesday”, “Friday”, and “March” in the news article shown in Figure 1.3—are quite frequent in news documents. To be able to exploit information about temporal expressions occurring in documents, temporal tagging is again a prerequisite because not just the detection but in particular the normalization of temporal expressions is crucial for successful topic detection and tracking.
TEMPORAL TAGGING FOR INFORMATION RETRIEVAL
During recent years, the value of temporal information has been increasingly exploited in the context of information retrieval research and applications [Alonso et al., 2007, 2011, Campos et al., 2014, Derczynski et al., 2015, Kanhabua et al., 2015]. Note, however, that there are different types of temporal information that can be used in information retrieval scenarios. The two main aspects are (i) time as a dimension of relevance and (ii) time as query topic.
On the one hand, when time is used as a dimension of relevance, temporal tagging is not needed. Instead, information about the document creation time is typically utilized to improve the ranking of documents. For example, for news-related queries, the freshness of search results may be important [see, e.g., Li and Croft, 2003]. In addition to improving search results, time as contextual information can be used to perform time-sensitive query auto-completion [Sengstock and Gertz, 2011, Shokouhi and Radinsky, 2012].
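To illustrate time as a dimension of relevance, the following sketch reranks results by combining a content-based score with a freshness weight derived only from the document creation time. The exponential decay, the decay rate, the multiplicative combination, and all names are assumptions chosen for this example, not a description of any particular system.

```python
import math
from datetime import date
from typing import List, Tuple


def recency_weight(dct: date, query_date: date, rate: float = 0.01) -> float:
    """Exponential decay in the document's age in days (rate is an assumed value)."""
    age_days = max((query_date - dct).days, 0)
    return math.exp(-rate * age_days)


def rerank(
    results: List[Tuple[str, float, date]], query_date: date, rate: float = 0.01
) -> List[Tuple[str, float]]:
    """Multiply each relevance score by its recency weight and sort descending.

    Each result is a (doc_id, relevance_score, document_creation_time) triple.
    """
    scored = [
        (doc_id, score * recency_weight(dct, query_date, rate))
        for doc_id, score, dct in results
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical results: the older document has the higher content-based score.
results = [
    ("old_article", 1.0, date(2015, 1, 1)),
    ("fresh_article", 0.9, date(2015, 3, 25)),
]
ranking = rerank(results, query_date=date(2015, 3, 26))
```

With these assumed values, the fresher article is ranked first despite its lower content-based score. Note that no temporal expression in the document text is inspected at any point.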
On the other hand, temporal tagging plays a crucial role when time is a query topic. No matter whether the temporal part of a query is provided explicitly or implicitly, temporal expressions occurring in potentially relevant documents have to be detected, normalized, and compared to the temporal aspect of the query. Berberich et al. [2010], for instance, integrate temporal expressions into a language modeling approach, and Strötgen and Gertz [2012a] present a query model to explicitly formulate temporal queries in a flexible way. Note that time as query topic must be handled by search engines because temporal queries occur frequently, as several query log analyses of web search engines have shown: Nunes et al. [2008] found that 1.5% of queries contain explicit temporal information, Metzler et al. [2009] determined that 7% of queries have implicit temporal intent, and Zhang et al. [2010] reported 13.8% for queries with explicit time and 17.1% for queries with implicit time.
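The final step, comparing normalized temporal expressions to the temporal aspect of a query, can be sketched as an interval overlap test. The sketch assumes that a temporal tagger has already produced normalized values of year, month, or day granularity in the forms YYYY, YYYY-MM, and YYYY-MM-DD. The function names are hypothetical, and other value types (durations, sets, underspecified values) are deliberately ignored.

```python
import calendar
from datetime import date
from typing import Iterable, Tuple

Interval = Tuple[date, date]


def value_to_interval(value: str) -> Interval:
    """Expand a normalized value (YYYY, YYYY-MM, or YYYY-MM-DD) to a closed interval."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:  # year granularity
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:  # month granularity
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    year, month, day = parts  # day granularity
    return date(year, month, day), date(year, month, day)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


def matches_query(values: Iterable[str], query_interval: Interval) -> bool:
    """True if any normalized expression of a document overlaps the query interval."""
    return any(overlaps(value_to_interval(v), query_interval) for v in values)


# Query interval of interest and hypothetical normalized values of two documents.
query = (date(2015, 3, 1), date(2015, 4, 30))
doc_with_match = ["2015-11-10", "2015-03"]
doc_without_match = ["2015-11-10", "2014"]
```

Here, `matches_query(doc_with_match, query)` holds because the month value 2015-03 lies inside the query interval, whereas `doc_without_match` contains no overlapping value. A real system would typically turn this binary test into a graded score.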
Note that sometimes the document creation time of a document might be a good indicator for detecting whether a document is relevant for a given query. However, using a temporal tagger to analyze the documents’ content is often crucial to successfully find relevant documents. For instance, both documents shown in Figure 1.4 can be considered as relevant for the example information need “Germanwings” with the time interval of interest being set to “1st of March 2015 to 30th of April 2015”. While the first document is a news document also published during the time interval of interest, the second document is a news article published in November 2015, that is, outside of the time interval of interest. However, both documents contain temporal expressions referring to the Germanwings plane crash in March 2015 (“Tuesday” and “March”, respectively), and they thus satisfy the information need.
Figure 1.4: Temporal information retrieval example. Given the query 〈“Germanwings”, “1st of March 2015 to 30th of April 2015”〉, both documents can be identified as relevant if a temporal tagger is used to extract and normalize the temporal expressions in the documents’ content.
A further interesting observation from Figure 1.4 is that the term “Tuesday” in the first document refers to a date within the time interval of interest (March 24, 2015) while the same term in the second document does not (here, it refers to November 10, 2015).
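This observation can be reproduced with a small sketch that resolves a bare weekday name relative to the document creation time. Two assumptions are made here: the exact creation dates of the two documents are not given in the text, so the dates below are hypothetical (one day after the respective Tuesday), and the resolution rule (the most recent such weekday on or before the creation date) is a simplification that fits many news texts but ignores cues such as tense.

```python
from datetime import date, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def resolve_weekday(expression: str, dct: date) -> date:
    """Map a bare weekday name to the most recent such day on or before dct."""
    target = WEEKDAYS.index(expression.lower())
    days_back = (dct.weekday() - target) % 7
    return dct - timedelta(days=days_back)


# Time interval of interest from the example query.
QUERY_START, QUERY_END = date(2015, 3, 1), date(2015, 4, 30)

# Hypothetical document creation times, assumed for illustration only.
tuesday_doc1 = resolve_weekday("Tuesday", dct=date(2015, 3, 25))
tuesday_doc2 = resolve_weekday("Tuesday", dct=date(2015, 11, 11))

doc1_in_interval = QUERY_START <= tuesday_doc1 <= QUERY_END
doc2_in_interval = QUERY_START <= tuesday_doc2 <= QUERY_END
```

Under these assumptions, the same surface form “Tuesday” is normalized to March 24, 2015 in the first document and to November 10, 2015 in the second, so only the first value falls into the interval of interest. The second document is still relevant, but because of its other temporal expression (“March”), not because of “Tuesday”.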
TEMPORAL TAGGING FOR QUESTION ANSWERING
A further area in which time is a crucial dimension is question answering. While this is one commonality with information retrieval, the two tasks share further aspects: In both areas, a user is faced with an information need, and the goal of both information retrieval and question answering is to satisfy this information need. In contrast, the main difference between them is that in information retrieval, the information need is typically formulated as a query consisting of keywords (possibly enriched with time intervals of interest in the area of temporal information retrieval), whereas in question answering, the information need is formulated as a natural language question. Analogously, the presentation of results is also different: in information retrieval, a ranked list of relevant documents is typically presented to the user, while in question answering, the answer to the information need is directly provided.
On the border between both areas lies so-called entity-oriented search [Balog et al., 2012]. A typical information retrieval query is to ask for a specific entity or fact about an entity. Thus, the goal of entity-oriented search is—as in question answering—to directly provide an answer, in the ideal case together with a justification, e.g., in the form of small text nuggets rather than full-length documents [Pasca, 2008]. An example of such a query with a temporal dimension is the query “Golden Gate bridge built” with the answer “1937”.
A research competition dealing with temporal (and geographic) information needs is NTCIR GeoTime [Gey et al., 2010, 2011]. As in question answering, the information needs are formulated as natural language questions. Due to the temporal and geographic focus of the competition, the questions contain “where” and “when” aspects. However, unlike in standard question answering, systems are not evaluated based on whether they provide the correct answer, but on