An Introduction to Text Mining. Gabe Ignatow

not allow access to their entire database, as the subscriptions universities pay for are based on the assumption that researchers want to read a few articles on a subject rather than use large numbers of articles as primary data. Yet despite these limitations, a large and growing number of digital text collections are available for text mining researchers to use (see Appendix A). Among the most useful of these collections is the Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca), the largest public access corpus of English. Created by Davies of Brigham Young University, the corpus contains more than 520 million words of text and is divided equally among spoken language, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990 to 2015 and is updated regularly. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. COCA and related corpora are often used by social scientists as secondary data sources to compare word frequencies between their main data source and “standard” English (e.g., Baker et al., 2008).
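The comparison of word frequencies against a reference corpus can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used by Baker et al.: the function names and the two toy texts are invented for the example, and a real study would substitute its own data and frequency counts drawn from a reference corpus such as COCA.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Map each word to its share of all tokens (empty input gives {})."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}


def frequency_ratios(study_tokens, reference_tokens):
    """Ratio of each word's relative frequency in the study text to its
    relative frequency in the reference text. Words missing from the
    reference are skipped here; smoothing is a common alternative."""
    study = relative_frequencies(study_tokens)
    reference = relative_frequencies(reference_tokens)
    return {w: study[w] / reference[w] for w in study if w in reference}


# Toy texts standing in for a study corpus and a reference corpus.
study_text = "storm storm rain calm"
reference_text = "rain calm calm sun storm sun sun sun"

ratios = frequency_ratios(tokenize(study_text), tokenize(reference_text))
```

A ratio above 1 suggests that a word is overrepresented in the study text relative to the reference, and a ratio below 1 suggests the opposite. With small corpora these ratios are noisy, so a significance measure such as log-likelihood is usually preferred in practice.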

      Social media platforms are another major source of digital data, and many of them provide their own application programming interfaces (APIs) for programmatic access to their data. The Twitter APIs (http://dev.twitter.com), for instance, allow one to access a small set of random tweets every day, or larger keyword-based collections of tweets (e.g., all the recent tweets with the hashtag #amused). If larger collections are necessary, they can be obtained through third-party vendors such as Gnip, which cover several social media sites and often partly curate the data. Twitter also provides limited demographic information on its users, such as location and self-maintained free-text profiles that can sometimes include gender, age, industry, interests, and other details.
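Once tweets have been collected, a keyword-based subset such as the #amused example can be selected locally. The sketch below makes no calls to the Twitter APIs. It assumes, purely for illustration, that each tweet has already been stored as a dictionary with a "text" field, and the function names are hypothetical.

```python
import re


def has_hashtag(text, tag):
    """True if `text` contains the hashtag `tag` as a whole token,
    ignoring case (so #amused matches #Amused but not #amusedly)."""
    pattern = r"(?<!\w)#" + re.escape(tag.lstrip("#")) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_by_hashtag(tweets, tag):
    """Keep only the tweet records whose text carries the hashtag."""
    return [tweet for tweet in tweets if has_hashtag(tweet["text"], tag)]


# Invented sample records; real data would come from an API or a vendor.
tweets = [
    {"text": "So #amused right now"},
    {"text": "That was #amusedly done"},
    {"text": "No hashtag here"},
    {"text": "#Amused!"},
]

amused = filter_by_hashtag(tweets, "#amused")
```

Matching on whole tokens avoids counting longer hashtags that merely begin with the same letters, which would otherwise inflate the collection.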

      Blogs can also be accessed through an API. For instance, the Blogger platform offers programmatic access to blogs and to the profiles of their authors, which include a rich set of fields covering location, gender, age, industry, favorite books and movies, interests, and so on. Other blog sites, such as LiveJournal, include additional information on the bloggers, for instance, their mood when writing a blog post.

      In addition, there are several other social media websites, with different target audiences, such as Instagram (where users upload mainly images they take), Pinterest (with “pins” of interesting things, covering a variety of domains from DIY to fashion to design and decoration), and many review platforms such as Amazon, Yelp, and others.

      If you are interested in assembling your own data set, Chapter 6 provides an overview of software tools for scraping and crawling websites, and Chapter 5 covers data selection and sampling.

      Advantages and Limitations of Online Digital Resources for Social Science Research

      The use of online digital resources, and of social media in particular, has both advantages and drawbacks. Salganik (in press) provided a good summary of the characteristics of big data in general, many of which apply to social media in particular. He grouped these characteristics into those that help research and those that hinder it.

      Among the characteristics that make big data good for research are (a) its size, which can allow for the observation of rare events, for causal inferences, and generally for more advanced statistical processing that is not possible when the data are small; (b) its “always-on” property, which provides a time dimension to the data and makes it suitable for studying unexpected events and producing real-time measurements (e.g., capturing people’s reactions during a tornado by analyzing the tweets from the affected area); and (c) its nonreactive nature, which implies that people behave more naturally because they are not aware of their data being captured (unlike survey respondents, who know they are being studied).

      There are also characteristics that make big data less appealing for research: (a) its incompleteness, in that digital data collections often lack demographics or other information that is important for social studies; (b) its inherent bias, in that the contributors to such online resources are not a random sample of the population. Consider, for instance, the people who post many tweets a day versus those who choose never to tweet; they represent different populations with different interests, personalities, and values, and even the largest collection of tweets will not capture the behaviors of those who are not users of Twitter; (c) its change over time, in terms of users (who generates social media data and how they generate it) and platforms (how the social media data are captured), which makes it difficult to conduct longitudinal studies; and (d) its susceptibility to algorithmic confounds, which are properties that seem to belong to the data being studied but are in fact caused by the underlying system used to collect the data. An example is the seemingly magic number of 20 friends that many people seem to have on Facebook, which turns out to be an effect of the Facebook platform actively encouraging people to make friends until they reach 20 friends (Salganik, in press). In addition, some types of digital data are inaccessible (for example, e-mails, queries sent to search engines, and phone calls), which makes it difficult to conduct research on behaviors associated with those data types.
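One practical way to gauge the bias introduced by heavy posters is to check how much of a collection comes from its most active contributors. The following sketch is not a method proposed by Salganik. It assumes that each post in a collection can be reduced to the identifier of its author, and the function name and sample data are invented for the example.

```python
from collections import Counter


def top_share(user_ids, top_fraction=0.1):
    """Share of all posts written by the most active `top_fraction`
    of users (at least one user is always counted)."""
    counts = Counter(user_ids)
    if not counts:
        return 0.0
    n_top = max(1, int(len(counts) * top_fraction))
    top_posts = sum(count for _, count in counts.most_common(n_top))
    return top_posts / len(user_ids)


# Invented collection: one author ID per post, with user "a" posting most.
authors = ["a"] * 8 + ["b", "c"]

share = top_share(authors)  # 0.8: the single top user wrote 8 of 10 posts
```

A high share indicates that findings drawn from the collection may mostly reflect a handful of prolific users, even before considering the people who never post at all.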

      Examples of Social Science Research Using Digital Data

      Most chapters of this textbook include examples of social science research that uses social media data. If you are interested in using Facebook data for your own project, it is important to review the studies discussed in Chapter 3 on the Facebook ethics controversy. In addition, research by the sociologist Hanna (2013) on using Facebook to study social movements may be a useful starting point. Hanna reviewed procedures for analyzing social movements such as the Arab Spring and Occupy movements by applying text mining methods to Facebook data. Hanna used the Natural Language Toolkit (NLTK; www.nltk.org) and the R package ReadMe (http://gking.harvard.edu/readme) to analyze mobilization patterns of Egypt’s April 6 youth movement. He corroborated results from his text mining methods with in-depth interviews with movement participants.

      If you are interested in using Twitter data, two Twitter-based thematic analysis studies (see Chapter 11) are good places to start. The first is a study of the live Twitter chat of the Centers for Disease Control and Prevention conducted by Lazard, Scheinfeld, Bernhardt, Wilcox, and Suran (2015). Lazard’s team collected, sorted, and analyzed users’ tweets to reveal major themes of public concern with the symptoms and life span of the virus, disease transfer and contraction, safe travel, and protection of one’s body. Lazard and her team used SAS Text Miner (www.sas.com/en_us/software/analytics/text-miner.html)
