et al., 1991] proposed measures that compare the phrase-structure bracketings produced by the parser with the bracketings in the annotated corpus (treebank). One computes the number of bracketing matches M with respect to the number of bracketings P returned by the parser (expressed as precision M/P) and with respect to the number C of bracketings in the corpus (expressed as recall M/C). Their harmonic mean, the F-measure, is most often reported for parsers. In addition, the mean number of crossing brackets per sentence can be reported; it counts the cases in which a bracketed sequence from the parser overlaps with one from the treebank (i.e., neither is properly contained in the other). For chunking, the accuracy can be reported as the tag correctness for each chunk (labeled accuracy), or separately for each token in each chunk (token-level accuracy). The former is stricter because it does not give credit to a chunk that is partially correct but incomplete, for example one that is one or more words too short or too long.
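
The bracketing measures described above can be illustrated with a minimal sketch for a single sentence. The representation of a bracketing as a (label, start, end) tuple with an exclusive end index, the function names, and the toy bracketings are assumptions made for illustration; they are not taken from any particular evaluation tool. The crossing count here is the number of parser brackets that cross at least one treebank bracket.

```python
from collections import Counter


def bracket_scores(parser_brackets, gold_brackets):
    """Return (precision, recall, F-measure) for one sentence.

    Each bracketing is a (label, start, end) tuple; `end` is exclusive.
    """
    parser_counts = Counter(parser_brackets)
    gold_counts = Counter(gold_brackets)
    # M: bracketings present in both analyses (multiset intersection).
    m = sum((parser_counts & gold_counts).values())
    p_total = sum(parser_counts.values())  # P: bracketings from the parser
    c_total = sum(gold_counts.values())    # C: bracketings in the treebank
    precision = m / p_total if p_total else 0.0
    recall = m / c_total if c_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def crossing_brackets(parser_brackets, gold_brackets):
    """Count parser brackets that overlap a gold bracket without nesting."""
    def crosses(a, b):
        (s1, e1), (s2, e2) = a, b
        return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)

    gold_spans = {(s, e) for _, s, e in gold_brackets}
    return sum(
        any(crosses((s, e), g) for g in gold_spans)
        for _, s, e in parser_brackets
    )


# Toy five-token sentence (invented for illustration).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parsed = [("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5), ("NP", 3, 5)]
print(bracket_scores(parsed, gold))     # (0.5, 0.5, 0.5)
print(crossing_brackets(parsed, gold))  # 1
```

In the toy example, two of the four parser bracketings match the treebank, and the parser's NP over tokens 0 to 2 crosses the treebank's VP over tokens 2 to 4.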


       Adapting Parsers

      Parsing performance also decreases on social media text. Foster et al. [2011] tested four dependency parsers and showed that their performance decreases from 90% F-score on newspaper text to 70–80% on social media text (70% on Twitter data and 80% on discussion forum texts). After retraining on a small amount of social media training data (1,000 manually corrected parses) plus a large amount of unannotated social media text, the performance increased to 80–83%. Ovrelid and Skjærholt [2012] also show the labeled accuracy of dependency parsers decreasing from newspaper data to Twitter data.

      Ritter et al. [2011] also explored shallow parsing and noun phrase chunking for Twitter data. The token-level accuracy for the shallow parsing of tweets was 83% with the OpenNLP chunker and 87% with their shallow parser T-chunk. Both were re-trained on a small amount of annotated Twitter data plus the Conference on Natural Language Learning (CoNLL) 2000 shared task data [Tjong Kim Sang and Buchholz, 2000].

      Khan et al. [2013] reported experiments on parser adaptation to social media texts and other kinds of Web texts. They found that text normalization helps increase performance by a few percentage points, and that a tree reviser based on grammar comparison helps to a small degree. A dependency parser named TweeboParser was developed specifically on a recently annotated Twitter treebank of 929 tweets [Kong et al., 2014]. It uses the POS tagset from Gimpel et al. [2011] presented in Table 2.4. Table 2.5 shows an example of the parser's output for the tweet: “They say you are what you eat, but it’s Friday and I don’t care! #TGIF (@ Ogalo Crows Nest)”

      The columns represent, in order: ID, the token counter, starting at 1 for each new sentence; FORM, the word form or punctuation symbol; CPOSTAG, the coarse-grained part-of-speech tag, where the tagset depends on the language; POSTAG, the fine-grained part-of-speech tag, where the tagset depends on the language, or a copy of the coarse-grained tag if no fine-grained tag is available; HEAD, the head of the current token, given as an ID (–1 indicates that the word is not included in the parse tree; some treebanks also use zero as an ID); and finally DEPREL, the dependency relation to the HEAD. The set of dependency relations depends on the particular language. Depending on the original treebank annotation, the dependency relation may be meaningful or simply “ROOT.” For this tweet, the named dependency relations are MWE (multiword expression) and CONJ (conjunct); many other relations between the word IDs are present but not named (probably due to the limited training data used when the parser was trained). The dependency relations from the Stanford dependency parser are included if they can be detected in a tweet. If they cannot be named, they are still in the table, but without a label.
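
      The column layout described above can be read with a short sketch such as the following. It assumes a tab-separated file with exactly the six columns listed in the text, in that order, and an underscore for an unnamed relation; the actual file format written by a given parser may contain more columns. The `Token` type, the function names, and the sample rows are hypothetical and are not actual parser output.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    id: int       # ID: token counter, starting at 1
    form: str     # FORM: word form or punctuation symbol
    cpostag: str  # CPOSTAG: coarse-grained POS tag
    postag: str   # POSTAG: fine-grained POS tag
    head: int     # HEAD: ID of the head, or -1 if not in the parse tree
    deprel: str   # DEPREL: relation to the head ("_" if unnamed)


def parse_rows(text: str) -> List[Token]:
    """Parse six tab-separated columns per line into Token records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        if len(cols) != 6:
            raise ValueError(f"expected 6 columns, got {len(cols)}: {line!r}")
        tid, form, cpos, pos, head, deprel = cols
        tokens.append(Token(int(tid), form, cpos, pos, int(head), deprel))
    return tokens


def attached_arcs(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head form, dependent form, label), skipping HEAD = -1."""
    forms = {t.id: t.form for t in tokens}
    arcs = []
    for t in tokens:
        if t.head == -1:
            continue  # token is not included in the parse tree
        head_form = "ROOT" if t.head == 0 else forms[t.head]
        arcs.append((head_form, t.form, t.deprel))
    return arcs


# Illustrative rows only; the values are invented for this sketch.
sample = (
    "1\tI\tO\tO\t2\t_\n"
    "2\tdon't\tV\tV\t0\t_\n"
    "3\tcare\tV\tV\t2\t_\n"
    "4\t!\t,\t,\t-1\t_\n"
)
for arc in attached_arcs(parse_rows(sample)):
    print(arc)
```

      The punctuation token with HEAD equal to -1 is dropped from the list of arcs, which mirrors the convention described above for words that are not part of the parse tree.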


      A named entity recognizer (NER) detects names in the texts, as well as dates, currency amounts, and other kinds of entities. NER tools often focus on three types of names: Person, Organization, and Location, by detecting the boundaries of these phrases. There are a few other types of tools that can be useful in the early stages of NLP applications. One example is a co-reference resolution tool that can be used to detect the noun that a pronoun refers to or to detect different noun phrases that refer to the same entity. In fact, NER is a semantic task, not a linguistic pre-processing task, but we introduce it in this chapter because it became part of many of the recent NLP tools discussed in this chapter. We will talk more about specific kinds of entities in Sections 3.2 and 3.3, in the context of integrating more and more semantic knowledge when solving the respective tasks.

       Methods for NER

      NER is composed of two sub-tasks: detecting entities (the span of text where a name starts and where it ends) and classifying the type of each entity. The methods used in NER are based either on linguistic grammars for each type of entity or on statistical methods. Semi-supervised learning techniques have been proposed, but supervised learning, especially based on CRFs for sequence learning, is the most prevalent. Hand-crafted grammar-based systems typically obtain good precision, but at the cost of lower recall and months of work by experienced computational linguists. Supervised learning techniques were used more recently due to the availability of annotated training datasets, mostly for newspaper texts, such as data from MUC 6, MUC 7, and ACE, and also the CoNLL 2003 English NER dataset [Tjong Kim Sang and De Meulder, 2003].

      Tkachenko et al. [2013] described a supervised learning method for named-entity recognition. Feature engineering and learning algorithm selection are critical factors when designing an NER system. Possible features include word lemmas, part-of-speech tags, and occurrence in a dictionary that encodes characteristic attributes of words relevant for the classification task. Tkachenko et al. [2013] included morphological, dictionary-based, WordNet-based, and global features. For their learning algorithm, the researchers chose CRFs, which are sequential in nature and able to handle a large number of features. As mentioned above, CRFs are widely used for the task of NER. The researchers produced a gold standard NER corpus for Estonian, on which their CRF-based model achieved an overall F1-score of 0.87.
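
      To make the kinds of features mentioned above more concrete, the following is a minimal sketch of per-token feature extraction for a sequence labeler. It is not the feature set of Tkachenko et al. [2013]: the function names, the feature names, the use of the lowercased word form as a stand-in for a lemma, and the tiny gazetteer are all assumptions made for illustration.

```python
def token_features(tokens, pos_tags, i, gazetteer):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),            # stand-in for a lemma (assumption)
        "pos": pos_tags[i],               # part-of-speech tag
        "is_capitalized": word[:1].isupper(),
        "suffix3": word[-3:].lower(),     # crude morphological cue
        "in_gazetteer": word.lower() in gazetteer,  # dictionary lookup
    }
    # Context features from the neighboring tokens.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
        feats["prev_pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
        feats["next_pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens, pos_tags, gazetteer):
    """Return one feature dictionary per token in the sentence."""
    return [
        token_features(tokens, pos_tags, i, gazetteer)
        for i in range(len(tokens))
    ]


# Toy input (invented for illustration).
tokens = ["Anna", "visited", "Tallinn"]
pos_tags = ["NNP", "VBD", "NNP"]
gazetteer = {"tallinn"}  # hypothetical list of location names
for feats in sentence_features(tokens, pos_tags, gazetteer):
    print(feats)
```

      A sequence of such dictionaries, paired with a sequence of entity labels, is the usual shape of training data for a feature-based sequence model such as a CRF.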

      He and Sun [2017] developed a semi-supervised learning model based on deep neural networks (B-LSTM). This system combined transition probabilities with deep learning to train the model directly on F-score and label accuracy. The researchers used a modified labeled corpus that corrected labeling errors in the data developed by Peng and Dredze [2016] for NER in Chinese social media. They evaluated their model on NER and nominal mention tasks. On the dataset of Peng and Dredze [2016], their B-LSTM model achieved an F-score of 0.53, a state-of-the-art result for NER in Chinese social media.

       Evaluation Measures for NER

      The precision, recall, and F-measure can be calculated at sequence level (whole span of text) or at token level. The former is stricter because each named entity that is longer than one word has to have an exact start and end point. Once entities have been determined, the accuracy of assigning them to tags such as Person, Organization, etc., can be calculated.
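
      The difference between the two levels of evaluation can be shown with a small sketch. It assumes that entities are encoded with BIO tags (B-TYPE, I-TYPE, O), which is a common convention but is not prescribed by the text above; the function names and the toy tag sequences are likewise illustrative.

```python
def extract_spans(tags):
    """Convert BIO tags to a set of (type, start, end) spans, end exclusive."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def prf(matches, predicted, gold):
    """Precision, recall, and F-measure from raw counts."""
    p = matches / predicted if predicted else 0.0
    r = matches / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def entity_level_scores(gold_tags, pred_tags):
    """A predicted entity counts only if type, start, and end all match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    return prf(len(gold & pred), len(pred), len(gold))


def token_level_scores(gold_tags, pred_tags):
    """Each entity token is scored on its own, ignoring the B-/I- prefix."""
    def etype(tag):
        return None if tag == "O" else tag[2:]

    matches = sum(
        1 for g, p in zip(gold_tags, pred_tags)
        if etype(g) is not None and etype(g) == etype(p)
    )
    predicted = sum(1 for p in pred_tags if p != "O")
    gold = sum(1 for g in gold_tags if g != "O")
    return prf(matches, predicted, gold)


# Toy example: the predicted Organization is one token too short.
gold_tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
pred_tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
print(entity_level_scores(gold_tags, pred_tags))  # (0.5, 0.5, 0.5)
print(token_level_scores(gold_tags, pred_tags))
```

      In the toy example the truncated Organization receives no credit at the sequence level, while at the token level two of its three tokens still count as correct, so the token-level scores are higher.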

       Adaptation for Named Entity Recognition

      Named entity recognition methods typically have 85–90% accuracy on long and carefully edited texts, but their performance decreases to 30–50% on tweets [Li et al., 2012a, Liu et al., 2012b, Ritter et al., 2011].


