Natural Language Processing for Social Media. Diana Inkpen
Only some of the available tools were trained directly on social media data.
• LDIG is an off-the-shelf Java language identification tool targeted specifically at Twitter messages. It has pre-trained models for 47 languages. It uses a document representation based on data structures named tries.
• MSR-LID [Goldszmidt et al., 2013] is based on rank-order statistics over character n-grams, and Spearman’s coefficient to measure correlations. Twitter-specific training data was acquired through a bootstrapping approach.
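The rank-order idea mentioned for MSR-LID can be illustrated with a small sketch. This is not the MSR-LID implementation: the function names, the bigram size, the profile size, and the two toy sample texts are all assumptions made for illustration. Each language gets a profile of its most frequent character n-grams, and a message is assigned to the language whose profile counts correlate best, by Spearman's coefficient, with the counts of the same n-grams in the message.

```python
from collections import Counter
from math import sqrt

# Toy samples, invented for illustration only (not real training data).
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat"
ES_SAMPLE = "el rapido zorro marron salta sobre el perro perezoso y el gato"


def char_ngrams(text, n=2):
    """Return the character n-grams of a lowercased, space-padded text."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(text, n=2, top_k=50):
    """Map the top_k most frequent n-grams of a text to their counts."""
    return dict(Counter(char_ngrams(text, n)).most_common(top_k))


def average_ranks(values):
    """Rank values (1 = largest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # no rank variation, so no usable correlation
    return cov / sqrt(var_x * var_y)


def identify(text, profiles, n=2):
    """Return the best-correlated language and the score for each language."""
    counts = Counter(char_ngrams(text, n))
    scores = {}
    for lang, profile in profiles.items():
        grams = list(profile)
        scores[lang] = spearman([profile[g] for g in grams],
                                [counts[g] for g in grams])
    return max(scores, key=scores.get), scores


profiles = {"en": build_profile(EN_SAMPLE), "es": build_profile(ES_SAMPLE)}
best, scores = identify("the lazy cat", profiles)
```

With profiles this small the scores are noisy; a realistic profile would be built from a large corpus per language.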
Some datasets of social media texts annotated with language labels are available.
• The dataset of Tromp and Pechenizkiy [2011] contains 9,066 Twitter messages, each labeled with one of six languages: German, English, Spanish, French, Italian, and Dutch.
• The Twituser language identification dataset of Lui and Baldwin [2014] covers English, Japanese, and Chinese.
2.8.2 DIALECT IDENTIFICATION
Sometimes it is not enough that a language has been identified correctly. A case in point is Arabic. It is the official language in 22 countries, spoken by more than 350 million people worldwide. Modern Standard Arabic (MSA) is the written form of Arabic used in education; it is also the language of formal communication. Arabic dialects, or colloquial languages, are the spoken varieties of Arabic that Arab people use daily. There are more than 22 dialects; some countries share the same dialect, while several dialects may exist alongside MSA within the same Arab country. Arabic speakers prefer to use their own local dialect. Recently, more attention has been given to the Arabic dialects and the written varieties of Arabic found on social networking platforms such as chats, micro-blogs, blogs, and forums, which are the target of research on sentiment analysis and opinion extraction.
Huang [2015] presents an approach to improving Arabic dialect classification with semi-supervised learning. He trained multiple classifiers using a combination of weakly supervised, strongly supervised, and unsupervised classifiers. These combinations yielded significant and consistent improvements on two test sets. The dialect classification accuracy improved by 5% over the strongly supervised classifier and by 20% over the weakly supervised classifier. Furthermore, when the improved dialect classifier was applied to build an MSA language model (LM), the new model size was reduced by 70%, while the English-Arabic translation quality improved by 0.6 BLEU points.
Arabic dialects (AD), the language of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. Figure 2.3 illustrates the AD distribution.
The dialects can be divided into six regional groups: Egyptian, Levantine, Gulf, Iraqi, Maghrebi, and others, as shown in Figure 2.4.
Dialect identification is closely related to the language identification problem. The dialect identification task attempts to identify the spoken dialect from within a set of texts that use the same character set in a known language.
Due to the similarity of dialects within a language, dialect identification is more difficult than language identification. Machine learning approaches and language models which are used for language identification need to be adapted for dialect identification as well.
Several projects on NLP for MSA have been carried out, but research on Dialectal Arabic NLP is in early stages [Habash, 2010].
When processing Arabic for the purposes of social media analysis, the first step is to identify the dialect and then map the dialect to MSA, because there is a lack of resources and tools for Dialectal Arabic NLP. We can therefore use MSA tools and resources after mapping the dialect to MSA.
Figure 2.3: Arabic dialects distribution and variation across Asia and Africa [Sadat et al., 2014a].
Figure 2.4: Division of Arabic dialects in six groups/divisions [Sadat et al., 2014a].
Diab et al. [2010] have run the COLABA project, a major effort to create resources and processing tools for Dialectal Arabic blogs. They used the BAMA and MAGEAD morphological analyzers. This project focused on four dialects: Egyptian, Iraqi, Levantine, and Moroccan.
Several text processing tools for MSA (BAMA, MAGEAD, and MADA) are described briefly below.
BAMA (Buckwalter Arabic Morphological Analyzer) provides morphological annotation for MSA. The BAMA database contains three tables of Arabic stems, complex prefixes, and complex suffixes and three additional tables used for controlling prefix-stem, stem-suffix, and prefix-suffix combinations [Buckwalter, 2004].
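The role of the six tables can be sketched as follows. This is a minimal illustration, assuming a lookup procedure that tries every split of a word into prefix, stem, and suffix and keeps a split only if all three pairwise category combinations are allowed. The lexicon entries (in Latin transliteration), the category names, and the function name `analyze` are invented for this example and are not taken from the BAMA database.

```python
# Three lexicon tables: surface form -> list of morphological categories.
# All entries are toy data, invented for illustration.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"]}
STEMS = {"ktb": ["PV"], "ktAb": ["N"]}
SUFFIXES = {"": ["Suff-0"], "t": ["PVSuff-t"], "At": ["NSuff-At"]}

# Three compatibility tables: the category pairs allowed to combine.
PREFIX_STEM = {
    ("Pref-0", "PV"), ("Pref-0", "N"),
    ("Pref-Wa", "PV"), ("Pref-Wa", "N"),
}
STEM_SUFFIX = {
    ("PV", "Suff-0"), ("PV", "PVSuff-t"),
    ("N", "Suff-0"), ("N", "NSuff-At"),
}
PREFIX_SUFFIX = {
    ("Pref-0", "Suff-0"), ("Pref-0", "PVSuff-t"), ("Pref-0", "NSuff-At"),
    ("Pref-Wa", "Suff-0"), ("Pref-Wa", "PVSuff-t"), ("Pref-Wa", "NSuff-At"),
}


def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by the six tables."""
    analyses = []
    for i in range(len(word) + 1):             # end of the prefix
        for j in range(i + 1, len(word) + 1):  # end of the (non-empty) stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for p_cat in PREFIXES.get(prefix, []):
                for s_cat in STEMS.get(stem, []):
                    for x_cat in SUFFIXES.get(suffix, []):
                        if ((p_cat, s_cat) in PREFIX_STEM
                                and (s_cat, x_cat) in STEM_SUFFIX
                                and (p_cat, x_cat) in PREFIX_SUFFIX):
                            analyses.append((prefix, stem, suffix))
    return analyses


print(analyze("wktbt"))  # prefix "w" + verb stem "ktb" + verbal suffix "t"
print(analyze("ktbAt"))  # verb stem with a nominal suffix: rejected
```

In this toy setup the second call finds all three pieces in the lexicon tables, but the stem-suffix table does not license the pair, so no analysis is returned. That filtering is the job the three combination tables perform.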
MAGEAD is a morphological analyzer and generator for the Arabic language family, including MSA and the spoken dialects of Arabic. MAGEAD has been adapted to analyze the Levantine dialect [Habash and Rambow, 2006].
MADA+TOKAN is a toolkit for morphological analysis and disambiguation for the Arabic language that includes Arabic tokenization, diacritization, disambiguation, POS tagging, stemming, and lemmatization. For each word in its context, MADA selects the best analysis among all possible analyses, using SVM classifiers for 19 weighted morphological features. The selected analyses carry complete diacritic, lexemic, glossary, and morphological information. TOKAN takes the information provided by MADA and generates tokenized output in a wide variety of customizable formats. MADA depends on three resources: BAMA, the SRILM toolkit, and SVMTools [Habash et al., 2009].
Returning to the problem of AD identification, we give here a detailed example, with results. Sadat et al. [2014c] provided a framework for AD classification using probabilistic models across social media datasets. They incorporated two popular techniques for language identification: the character n-gram Markov language model and Naïve Bayes classifiers.
The Markov model calculates the probability that an input text is derived from a given language model built from training data [Dunning, 1994]. This model enables the computation of the probability P(S), or likelihood, of a sentence S by using the chain rule:

$$P(S) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

The sequence (w1, w2, …, wn) represents the sequence of characters in a sentence S. P(wi | w1, …, wi–1) represents the probability of the character wi given the sequence w1, …, wi–1.
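A character n-gram Markov model of this kind can be sketched in a few lines. This is a minimal illustration and not the implementation of Sadat et al. It assumes that the full history w1, …, wi–1 is approximated by the previous n–1 characters, that add-one smoothing handles unseen n-grams, and that log-probabilities are summed to avoid underflow. The class name, the padding symbols, and the toy English and French sentences (stand-ins for dialect-labeled training data) are all assumptions.

```python
from collections import Counter
from math import log


class CharNgramLM:
    """Character n-gram Markov language model with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()    # counts of context + next character
        self.context_counts = Counter()  # counts of each context
        self.vocab = set()

    def _pad(self, text):
        # "^" marks the sentence start, "$" the sentence end (assumed symbols).
        return "^" * (self.n - 1) + text + "$"

    def train(self, sentences):
        for sentence in sentences:
            padded = self._pad(sentence)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngram_counts[context + padded[i]] += 1
                self.context_counts[context] += 1

    def log_prob(self, sentence):
        """Sum of log P(w_i | previous n-1 characters) over the sentence."""
        padded = self._pad(sentence)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            numerator = self.ngram_counts[context + padded[i]] + 1
            denominator = self.context_counts[context] + vocab_size
            total += log(numerator / denominator)
        return total


def classify(sentence, models):
    """Return the label whose model gives the sentence the highest score."""
    return max(models, key=lambda label: models[label].log_prob(sentence))


# Toy training data, invented for illustration only.
training = {
    "en": ["the cat sat on the mat", "this is the thing"],
    "fr": ["le chat est sur le tapis", "c'est la chose"],
}
models = {}
for label, sentences in training.items():
    models[label] = CharNgramLM(n=2)
    models[label].train(sentences)

predicted = classify("the mat", models)
```

For dialect identification, each label would be a dialect and each model would be trained on text written in that dialect. Because dialects share most of their character sequences, the score differences would be far smaller than between two distinct languages.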
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naïve) independence assumptions. In text classification, this classifier assigns the most likely category or class to a given document d from a set of pre-defined N classes as c1, c2, …, cN. The classification function f maps a document to a category (f : D → C) by maximizing the probability of the following equation [Peng and Schuurmans, 2003]: