An Introduction to Text Mining. Gabe Ignatow


began to adapt text mining tools to use in their research, they spent decades studying transcribed interviews, newspaper articles, speeches, and other forms of textual data, and they developed sophisticated text analysis methods that we review in the chapters in Part IV. So while text mining is a relatively new interdisciplinary field based in computer science, text analysis methods have a long history in the social sciences (see Roberts, 1997).

      Text mining processes typically include information retrieval (methods for acquiring texts) and applications of advanced statistical methods and natural language processing (NLP) such as part-of-speech tagging and syntactic parsing. Text mining also often involves named entity recognition (NER), which is the use of statistical techniques to identify named text features such as people, organizations, and place names; disambiguation, which is the use of contextual clues to decide whether words refer to one or another of their multiple meanings; and sentiment analysis, which involves discerning subjective material and extracting attitudinal information such as sentiment, opinion, mood, and emotion. These techniques are covered in Parts III and V of this book. Text mining also involves more basic techniques for acquiring and processing data. These techniques include tools for web scraping and web crawling, for making use of dictionaries and other lexical resources, and for processing texts and relating words to texts. These techniques are covered in Parts II and III.
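
      To make the idea of sentiment analysis concrete, the following is a minimal sketch of a dictionary-based scorer. It is not one of the tools discussed in this book: the word lists `POSITIVE` and `NEGATIVE` and the functions `tokenize` and `sentiment_score` are hypothetical names invented for illustration, and real sentiment lexicons are far larger and more carefully constructed.

```python
import re

# Hypothetical toy lexicon, invented for illustration only.
POSITIVE = {"good", "great", "happy", "calm"}
NEGATIVE = {"bad", "awful", "sad", "angry"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive - negative) / total tokens, or 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)


print(sentiment_score("Great day, happy and calm"))
```

      A score above zero suggests that positive words outnumber negative ones. A simple counter like this ignores negation, sarcasm, and context, which is why the more sophisticated techniques named above exist.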

      Research in the Spotlight

      Predicting the Stock Market With Twitter

      Bollen, J., Mao, H., & Zeng, X.-J. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.

      The computer scientists Bollen, Mao, and Zeng asked whether societies can experience mood states that affect their collective decision making, and by extension whether the public mood is correlated with or even predictive of economic indicators. Applying sentiment analysis (see Chapter 14) to large-scale Twitter feeds, Bollen and colleagues investigated whether measurements of collective mood states are correlated with the value of the Dow Jones Industrial Average over time. They analyzed the text content of daily Twitter feeds using two tools: OpinionFinder, which measures positive versus negative mood, and Google Profile of Mood States, which measures mood in terms of six dimensions (calm, alert, sure, vital, kind, and happy). They also investigated the hypothesis that public mood states are predictive of changes in Dow Jones Industrial Average closing values, finding that the accuracy of stock market predictions can be significantly improved by the inclusion of some specific public mood dimensions but not others.

      Specialized software used:

      OpinionFinder

       http://mpqa.cs.pitt.edu/opinionfinder
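
      The question of whether a mood measure on one day is related to a market change on a later day can be illustrated with a lagged correlation. The sketch below is a deliberately simple stand-in and should not be read as the procedure used by Bollen and colleagues. The function names `pearson` and `lagged_correlation` are hypothetical, and the two series are made-up numbers constructed so that the index change follows mood by one day.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(mood, index_change, lag):
    """Correlate mood on day t with index change on day t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(mood, index_change)
    return pearson(mood[:-lag], index_change[lag:])


# Made-up numbers for illustration; not data from the study.
mood = [0.1, 0.4, 0.2, 0.8, 0.5, 0.3]
index_change = [0.0, 0.2, 0.8, 0.4, 1.6, 1.0]

for lag in range(3):
    print(lag, round(lagged_correlation(mood, index_change, lag), 3))
```

      In this toy example the correlation is strongest at a lag of one day because the data were built that way. A correlation of this kind does not by itself establish that mood predicts the market.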

      Text analysis involves systematic analysis of word use patterns in texts and typically combines formal statistical methods and less formal, more humanistic interpretive techniques. Text analysis arguably originated as early as the 1200s with the Dominican friar Hugh of Saint-Cher and his team of several hundred fellow friars who created the first biblical concordance, or cross-listing of terms and concepts in the Bible. There is also evidence of European inquisitorial church studies of newspapers in the late 1600s, and the first well-documented quantitative text analysis was performed in Sweden in the 1700s when the Swedish state church analyzed the symbology and ideological content of popular hymns that appeared to challenge church orthodoxy (Krippendorff, 2013, pp. 10–11). The field of text analysis expanded rapidly in the 20th century as researchers in the social sciences and humanities developed a broad spectrum of techniques for analyzing texts, including methods that relied heavily on human interpretation of texts as well as formal statistical methods. Systematic quantitative analysis of newspapers was performed in the late 1800s and early 1900s by researchers including Speed (1893), who showed that in the late 1800s New York newspapers had decreased their coverage of literary, scientific, and religious matters in favor of sports, gossip, and scandals. Similar text analysis studies were performed by Wilcox (1900), Fenton (1911), and White (1924), all of whom quantified newspaper space devoted to different categories of news. In the 1920s through 1940s, Lasswell and his colleagues conducted breakthrough content analysis studies of political messages and propaganda (e.g., Lasswell, 1927). Lasswell’s work inspired large-scale content analysis projects including the General Inquirer project at Harvard, whose lexicon attaches syntactic, semantic, and pragmatic information to part-of-speech tagged words (Stone, Dunphy, Smith, & Ogilvie, 1966).
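
      A concordance of the kind described above can be approximated today in a few lines of code. The following is a minimal sketch of a keyword-in-context listing. The function name `concordance` and its `width` parameter are hypothetical choices made for this illustration, and the sample sentence is invented.

```python
import re


def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context per side."""
    tokens = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample sentence for illustration.
for left, word, right in concordance("The cat sat on the mat", "the", width=2):
    print(f"{left:>10} [{word}] {right}")
```

      Each result shows the keyword with its neighboring words, which lets a reader compare how a term is used across a text without rereading the whole document.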

      While text mining’s roots are in computer science and the roots of text analysis are in the social sciences and humanities, today, as we will see throughout this textbook, the two fields are converging. Social scientists and humanities scholars are adapting text mining tools for their research projects, while text mining specialists are investigating the kinds of social phenomena (e.g., political protests and other forms of collective behavior) that have traditionally been studied within the social sciences.

      Six Approaches to Text Analysis

      The field of text mining is divided mainly in terms of different methodologies, while the field of text analysis can be divided into several different approaches that are each based on a different way of theorizing language use. Before discussing some of the special challenges associated with using online data for social science research, next we review six of the most prominent approaches to text analysis. As we will see, many researchers who work with these approaches are finding ways to make use of the new text mining methodologies and tools that are covered in Parts II, III, and V. These approaches include conversation analysis, analysis of discourse positions, critical discourse analysis (CDA), content analysis, Foucauldian analysis, and analysis of texts as social information. These approaches use different logical strategies and are based on different theoretical foundations and philosophical assumptions (discussed in Chapter 4). They also operate at different levels of analysis (micro, meso, and macro) and employ different selection and sampling strategies (see Chapter 5).

      Conversation Analysis

      Conversation analysts study everyday conversations in terms of how people negotiate the meaning of the conversation in which they are participating and the larger discourse of which the conversation is a part. Conversation analysts focus not only on what is said in daily conversations but also on how people use language pragmatically to define the situations in which they find themselves. These processes go mostly unnoticed until there is disagreement as to the meaning of a particular situation. An example of conversation analysis is the educational researcher Evison’s (2013) study of “academic talk,” which used corpus linguistic techniques (see Appendix F) on both a corpus of 250,000 words of spoken academic discourse and a benchmark corpus of casual conversation to explore conversational turn openings. The corpus of academic discourse included 13,337 turns taken by tutors and students in a range of social interactions. In seeking to better understand the unique language of academia and of specific academic disciplines, Evison identified six items that have a particularly strong affinity with the turn-opening position (mhm, mm, yes, laughter, oh, no) as key characteristics of academic talk.
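
      Counting which items open conversational turns is straightforward to sketch in code. The example below is only an illustration of the general idea and is not Evison's procedure, which compared frequencies against a benchmark corpus. The function name `turn_openers` is hypothetical, and the sample turns are invented rather than drawn from any corpus.

```python
import re
from collections import Counter


def turn_openers(turns):
    """Count the first word token of each turn, ignoring case."""
    counts = Counter()
    for turn in turns:
        tokens = re.findall(r"[a-z']+", turn.lower())
        if tokens:  # skip turns with no word tokens
            counts[tokens[0]] += 1
    return counts


# Invented turns for illustration; not drawn from Evison's corpus.
sample_turns = [
    "Mhm, so the second reading argues that...",
    "Yes, I think so.",
    "Oh, I had not thought of that.",
    "Yes but the sample is small.",
    "",
]

for word, count in turn_openers(sample_turns).most_common():
    print(word, count)
```

      To judge whether an opener is characteristic of a setting, these raw counts would need to be compared with counts from a reference corpus of a different kind of talk.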

      Further examples of conversation analysis research include studies of conversation in educational settings by O’Keeffe and Walsh (2012); in health care settings by Heath and Luff (2000), Heritage and Raymond (2005), and Silverman (2016); and in online environments among Wikipedia editors by Danescu-Niculescu-Mizil, Lee, Pang, and Kleinberg
