Social Monitoring for Public Health. Michael J. Paul
Classifiers learn to distinguish positive and negative instances by analyzing a set of labeled examples, and patterns learned from these “training” examples can then be used to make inferences about new instances in the future. Because the training data consists of examples whose labels are supplied in advance, this approach is called supervised machine learning.
Common classification models include support vector machines (SVMs) and logistic regression, the latter sometimes called a maximum entropy (MaxEnt) classifier in the machine learning literature [Berger et al., 1996]. Logistic regression is widely used in public health, though traditionally as a tool for data analysis (see the discussion of regression analysis in Section 4.1.3) rather than as a classifier that predicts labels for new data. Recent advances in neural networks (loosely, models that stack and combine classifiers into more complex models) have made this type of model attractive for classification [Goldberg, 2017]. While more computationally intensive, neural networks can give state-of-the-art performance for classification.
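To make the supervised setup concrete, the following is a minimal sketch of training a logistic regression classifier on a handful of labeled tweets. It assumes the scikit-learn library, and the example tweets and labels are hypothetical placeholders; a real system would be trained on a much larger annotated dataset.

    # Minimal supervised text classification sketch using scikit-learn.
    # The tweets and labels below are hypothetical placeholders.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Labeled training examples: 1 = reports a flu infection, 0 = does not
    train_texts = [
        "home sick with the flu, fever all night",
        "flu shots are available at the campus clinic",
        "coughing and aching, pretty sure I caught the flu",
        "reading about the history of the 1918 flu pandemic",
    ]
    train_labels = [1, 0, 1, 0]

    # Bag-of-words features fed into a logistic regression (MaxEnt) classifier
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Predict labels for new, unseen messages
    print(model.predict(["I think I'm coming down with the flu"]))

Swapping the classifier for an SVM (e.g., scikit-learn's LinearSVC) requires changing only the final step of the pipeline.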
Classifiers treat each message as a set of predictors, called features in machine learning, typically consisting of the words in a document, and sometimes longer phrases as well. Phrases of length n are called n-grams, while individual words are called unigrams. One can also use additional linguistic information as features. Natural language processing (NLP) is an area of computer science that involves processing human language, and a number of NLP tools exist to parse linguistic information from text. For example, Lamb et al. [2013] showed that classification performance can be improved by including linguistic features in addition to n-grams, like whether “flu” is used as a noun or adjective, or whether it is the subject or object of a verb.
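As a small illustration of n-gram features, the sketch below (again assuming a recent version of scikit-learn, with hypothetical example texts) extracts unigram and bigram counts from two short messages.

    # n-gram feature extraction sketch; ngram_range=(1, 2) yields both
    # unigrams (single words) and bigrams (two-word phrases) as features.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["got the flu shot today", "down with the flu again"]
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    features = vectorizer.fit_transform(docs)

    # Unigrams like "flu" appear alongside bigrams like "flu shot",
    # which can help a classifier separate vaccination from infection.
    print(vectorizer.get_feature_names_out())
    print(features.toarray())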
We won’t get into the technical details of classification in this book, but many of the common toolkits for machine learning (a few of which are described at the end of this section) provide tutorials.
Unsupervised Clustering and Topic Modeling
An alternative to classification is clustering. Clustering has the same goal as classification—organizing messages into categories—but the categories are not known in advance; rather, messages are grouped together automatically based on similarities. This is a type of unsupervised machine learning.
A popular method of clustering for text documents is topic modeling. In particular, probabilistic topic models are statistical models that treat text documents as if they are composed of underlying “topics,” where each topic is defined as a probability distribution over words and each document is associated with a distribution over topics. Topics can be interpreted as clusters of related words. In other words, topic models cluster together words into topics, which then allows documents with similar topics to be clustered. Probabilistic topic models have been applied to social media data for various scientific applications [Ramage et al., 2009], including for health [Brody and Elhadad, 2010, Chen et al., 2015b, Ghosh and Guha, 2013, Paul and Dredze, 2011, 2014, Prier et al., 2011, Wang et al., 2014].
The most commonly used topic model is Latent Dirichlet Allocation (LDA) [Blei et al., 2003], a Bayesian topic model. For the domain of health, Paul and Dredze developed the Ailment Topic Aspect Model (ATAM) [2011, 2014], an extension of LDA that explicitly identifies health concepts. ATAM creates two different types of topics: non-health topics, similar to LDA, as well as special “ailment” word distributions with words that are found in dictionaries of disease names, symptom terms, and treatments. Examples of ATAM ailments are shown in Figure 4.2.
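For readers who want to experiment, the sketch below fits a standard LDA model with scikit-learn's LatentDirichletAllocation and prints the most probable words in each topic. The toy corpus is hypothetical, and this illustrates plain LDA rather than the ATAM extension.

    # Plain LDA topic model sketch on a tiny hypothetical corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "fever cough and sore throat all week",
        "allergies acting up sneezing and itchy eyes",
        "dentist appointment for a toothache tomorrow",
        "cold and flu season is here again",
    ]

    # Bag-of-words counts are the standard input to LDA
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-document topic distributions

    # Show the most probable words in each learned topic
    vocab = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top_words = [vocab[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top_words)}")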
An advantage of topic models over simple phrase-based filtering is that they automatically learn many words related to each concept; for example, words like “cough” and “fever” become associated with “flu.” When inferring the topic composition of a document, the entire context is taken into account, which can help disambiguate words with multiple meanings (e.g., “dance fever”). A disadvantage is that topic models are typically less accurate than supervised machine learning methods, but the tradeoff is that they can learn without requiring annotated data. Another consideration is that topic models tend to discover broad, popular topics, and additional effort may be needed to uncover finer-grained issues [Prier et al., 2011].
Another use of topic models, or unsupervised methods in general, is for exploratory analysis. Unsupervised methods can be used to uncover the prominent themes or patterns in a large dataset of interest to a researcher. Once an unsupervised model has revealed the properties of a dataset, then one might use more precise methods such as supervised classification for specific topics of interest.
The technical details of probabilistic topic models are beyond the scope of this book. For an introduction, we recommend reading Blei and Lafferty [2009].
Which Approach to Use?
We have mentioned a variety of approaches to identifying social media content, including keyword filtering, classification, and topic modeling. These approaches have different uses and tradeoffs, so the choice of technique depends on the data and the task.
Most research using a large, general platform like Twitter will require keyword filtering as a first step, since relevant content makes up only a small portion of the overall data. The keywords may relate to a particular topic like flu or vaccination, or to health in general; for example, Paul and Dredze [2014] used a few hundred health-related keywords to collect a broad range of health tweets, which still constitute only a small sample of Twitter. Keyword filtering can be reasonably reliable for obtaining relevant content, although it may miss data that is relevant but uses terminology not in the keyword list, or it may capture irrelevant data that uses the terms in different ways (e.g., slang usage of “sick”). Classifiers can overcome the limitations of keyword filtering, but they are time consuming to build, so they are generally considered a next step if keywords are insufficient. Topic models, on the other hand, are most often used for exploratory purposes (understanding what the content looks like at a high level) rather than for finding specific content.
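As a rough illustration of the keyword-filtering step, the sketch below keeps only messages containing a term from a small keyword list; the keyword set and tweets are hypothetical, and real studies use far larger lists. It also shows how slang usage of a term like “sick” can slip through the filter.

    # Bare-bones keyword filter of the kind used as a first collection step.
    # Keyword list and tweets are hypothetical placeholders.
    health_keywords = {"flu", "fever", "cough", "vaccine", "sick", "headache"}

    def matches_keywords(text, keywords):
        """Return True if any keyword appears as a token in the text."""
        tokens = set(text.lower().split())
        return bool(tokens & keywords)

    tweets = [
        "stuck in bed with a fever and a cough",
        "that concert last night was sick",       # slang use of "sick"
        "getting my flu vaccine this afternoon",
    ]
    relevant = [t for t in tweets if matches_keywords(t, health_keywords)]
    print(relevant)  # all three match, illustrating the false-positive problem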
Figure 4.2: Examples of ailment clusters discovered from tweets, learned with the Ailment Topic Aspect Model (ATAM) [Paul and Dredze, 2011]. The word clouds show the most probable words in each ailment, corresponding to (clockwise from top left) allergies, dental health, pain, and influenza-like illness.
These techniques are not mutually exclusive, and it is not unreasonable to combine all three. Let’s illustrate this with an example. Suppose you want to use social media to learn how people are responding to the recent outbreak of Zika, a virus that can cause birth defects and that had been rare until a widespread outbreak began in Brazil in 2015. (In fact, several researchers have done just that [Dredze et al., 2016c, Ghenai et al., 2017, Juric et al., 2017, Miller et al., 2017, Muppalla et al., 2017, Stefanidis et al., 2017].)
You decide to study this on Twitter, which captures a large and broad population. The first step is to collect tweets about Zika. There aren’t many ways to refer to Zika without using its name (or perhaps its Portuguese spelling, Zica, or the abbreviation for the virus, ZIKV). You might therefore start with a keyword filter for tweets containing “zika,” “zica,” or “zikv,” which would account for a tiny fraction of Twitter but probably capture nearly all tweets that mention Zika explicitly.
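A minimal sketch of such a filter, assuming Python's standard re module and hypothetical example tweets, might look like the following.

    import re

    # Case-insensitive match for "zika", "zica", or "zikv"; hashtag forms
    # like "#Zika" also match because "#" is not a word character.
    ZIKA_PATTERN = re.compile(r"\b(zika|zica|zikv)\b", re.IGNORECASE)

    tweets = [
        "New travel advisory issued for #Zika affected regions",
        "ZIKV case counts updated by the health ministry",
        "beautiful day at the beach",
    ]
    zika_tweets = [t for t in tweets if ZIKA_PATTERN.search(t)]
    print(zika_tweets)  # keeps the first two tweets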
If you don’t already know what people discuss about Zika on Twitter (since it was not widely discussed until recently, after the outbreak), you might use a topic model as an exploratory first step to discover the prominent themes in the collected tweets.