Natural Language Processing for Social Media. Diana Inkpen

Чтение книги онлайн.

Читать онлайн книгу Natural Language Processing for Social Media - Diana Inkpen страница 15

Natural Language Processing for Social Media - Diana  Inkpen Synthesis Lectures on Human Language Technologies

Скачать книгу

and disambiguation for the Arabic language that includes Arabic tokenization, discretization, disambiguation, POS tagging, stemming, and lemmatization. MADA selects the best analysis result within all possible analyses for each word in the current context by using SVM models classifying into 19 weighted morphological features. The selected analyses carry complete diacritic, lexemic, glossary, and morphological information. TOKEN takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats. MADA depends on three resources: BAMA, the SRILM toolkit, and SVMTools [Habash et al., 2009].

      Going back to the problem of AD identification, we give here a detailed example, with results. Sadat et al. [2014c] provided a framework for AD classification using probabilistic models across social media datasets. They incorporated the two popular techniques for language identification: the character n-gram Markov language model and Naïve Bayes classifiers.29

      The Markov model calculates the probability that an input text is derived from a given language model built from training data [Dunning, 1994]. This model enables the computation of the probability P(S) or likelihood, of a sentence S, by using the following chain formula in the following equation:

image

      The sequence (w1, w2, …, wn) represents the sequence of characters in a sentence S. P(wi|w1, …wi−1) represents the probability of the character wi given the sequence w1, …wi−1.

      A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. In text classification, this classifier assigns the most likely category or class to a given document d from a set of pre-defined N classes as c1, c2, …, cN. The classification function f maps a document to a category (f : DC) by maximizing the probability of the following equation [Peng and Schuurmans, 2003]:

image

      where d and c denote the document and the category, respectively. In text classification a document d can be represented by a vector of T attributes d = (t1, t2,…, tT). Assuming that all attributes ti are independent given the category c, we can calculate P(d|c) with the following equation:

image

      The attribute term ti can be a vocabulary term, local n-gram, word average length, or a global syntactic and semantic property [Peng and Schuurmans, 2003].

      Sadat et al. [2014c] presented a set of experiments using these techniques with detailed examination of what models perform best under different conditions in a social media context. Experimental results showed that the Naïve Bayes classifier based on character bigrams can identify the 18 different Arabic dialects considered with an overall accuracy of 98%. The dataset used in the experiments was manually collected from forums and blogs, for each of the 18 dialects.

      To look at the problem in more detail, Sadat et al. [2014a] applied both the n-gram Markov language model and the Naïve Bayes classifier to classify the eighteen Arabic dialects. The results of this study for the n-gram Markov language model is represented in Figure 2.5. This figure shows that the character-based unigram distribution helps the identification of two dialects, the Mauritanian and the Moroccan with an overall F-measure of 60% and an overall accuracy of 96%. Furthermore, the bigram distribution of 2 characters affix helps recognize 4 dialects, the Mauritanian, Moroccan, Tunisian, and Qatari, with an overall F-measure of 70% and overall accuracy of 97%. Lastly, the trigram distribution of three characters affix helps recognize four dialects, the Mauritanian, Tunisian, Qatari, and Kuwaiti, with an overall F-measure of 73% and an overall accuracy of 98%. Overall, for 18 dialects, the bigram model performed better than other models (unigram and trigram models).

image

      Figure 2.5: Accuracies on the character-based n-gram Markov language models for 18 countries [Sadat et al., 2014a].

      Since many dialects are related to a region, and these Arabic dialects are approximately similar, the authors also considered the accuracy of dialects group. Figure 2.6 shows the result on the three different character n-gram Markov language models and a classification on the six groups of divisions that were defined in Figure 2.4. Again, the bigram and trigram character Markov language models performed almost the same as in Figure 2.5, although the F-Measure of the bigram model for all dialect groups was higher than for the trigram model, except for the Egyptian dialect. Therefore, on average, for all dialects, the character-based bigram language model performed better than the character-based unigram and trigram models.

image

      Figure 2.6: Accuracies on the character-based n-gram Markov language models for the six divisions/groups [Sadat et al., 2014a].

      Figure 2.7 shows the results on the n-gram models using Naïve Bayes classifiers for the different countries, while Figure 2.8 shows the results on the n-gram models using Naïve Bayes classifiers for the six divisions according to Figure 2.4. The results show that the Naïve Bayes classifiers based on character unigram, bigram, and trigram have better results than the previous character-based unigram, bigram, and trigram Markov language models, respectively. An overall F-measure of 72% and an accuracy of 97% were noticed for the 18 Arabic dialects. Furthermore, the Naïve Bayes classifier that is based on a bigram model has an overall F-measure of 80% and an accuracy of 98%, except for the Palestinian dialect because of the small size of the data. The Naïve Bayes classifier based on the trigram model showed an overall F-measure of 78% and an accuracy of 98% except for the Palestinian and Bahrain dialects. This classifier could not distinguish between the Bahrain and the Emirati dialects because of the similarities on their three affixes. In addition, the Naïve Bayes classifier based on character bigrams performed better than the classifier based on character trigrams, according to Figure 2.7. Also, as shown in Figure 2.8, the accuracy of dialect groups for the Naïve Bayes classifier based on character bigram model yielded better results than the two other models (unigrams and trigrams).

      Recently, Zaidan and Callison-Burch [2014] created a large monolingual data set rich in dialectal Arabic content called the Arabic Online Commentary Dataset. They used crowdsourcing for annotating the texts with the dialect label. They also presented experiments on the automatic classification of the dialects for this dataset, using similar word and character-based language models. The best results were around 85% accuracy for distinguishing MSA from dialectal data and lower accuracies for identifying the correct dialect for the latter case. Then they applied the classifiers to discover new dialectical data from a large Web crawl consisting of 3.5 million pages mined from online Arabic newspapers.

image

      Figure 2.7: Accuracies on the character-based n-gram Naïve Bayes classifiers for 18 countries [Sadat et al., 2014a].

      Several other projects focused on Arabic dialects: classification [Tillmann et al., 2014], code switching [Elfardy and Diab, 2013], and collecting a Twitter corpus for several dialects [Mubarak and Darwish, 2014].

      This chapter discussed the issue of adapting NLP tools to social media texts. One way is to use text normalization techniques, in order to make

Скачать книгу