An Introduction to Text Mining. Gabe Ignatow



       If you were interested in conducting a CDA of a contemporary discourse, what discourse would you study? Where would you find data for your analysis?

       How do researchers choose between collecting data from offline sources, such as in-person interviews, and online sources, such as social media platforms?

       What are the most critical problems with using data from online sources?

       If you already have an idea for a research project, what are likely to be the most critical advantages and disadvantages of using online data for your project?

       What are some ways text mining research can be used to benefit science and society?

      Developing a Research Proposal

      Select a social issue that interests you. How might you analyze how people talk about this issue? Are there differences between people from different communities and backgrounds in terms of how they think about this issue? Where (e.g., offline, online) do people talk about this issue, and how could you collect data from them?

      Further Reading

      Ayers, E. L. (1999). The pasts and futures of digital history. Retrieved June 17, 2015, from

      Bauer, M. W., Bicquelet, A., & Suerdem, A. K. (Eds.), Textual analysis. SAGE benchmarks in social research methods (Vol. 1). Thousand Oaks, CA: Sage.

      Krippendorff, K. (2013). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage.

      Kuckartz, U. (2014). Qualitative text analysis: A guide to methods, practice, and using software. Thousand Oaks, CA: Sage.

      Roberts, C. W. (1997). Text analysis for the social sciences: Methods for drawing statistical inferences from texts and transcripts. Mahwah, NJ: Lawrence Erlbaum.

      2 Acquiring Data

      Learning Objectives

      The goals of Chapter 2 are to help you to do the following:

      1 Recognize the role data plays in text mining and the characteristics of ideal data sets for text mining applications.

      2 Identify a variety of different data sources used to compile text mining data sets.

      3 Assess the advantages and limitations of using social media to acquire data.

      4 Analyze examples of social science research using data sets drawn from different sources.


      While social scientists have for decades made use of data from attitude surveys, today researchers are attempting to leverage the growing volume of naturally occurring unstructured data generated by people, such as text or images. Some of these unstructured data are referred to as “big data,” although that term has become a bit of a faddish buzzword. Naturally, questions arise from the use of textual data sets as a way to learn about social groups and communities. There are, of course, advantages and disadvantages to both surveys and unstructured data, and there are also ways to leverage the two together.

      Surveys are the traditional mechanism for gathering information on people, and entire fields have developed around these data collection instruments. Surveys can collect clear, targeted information, so the information obtained from them is significantly “cleaner” and easier to process than the information extracted from unstructured data sources. Surveys also have the advantage that they can be run in controlled settings, with complete information on the survey takers. These controlled settings can, however, also be a disadvantage. It has been argued, for instance, that survey research is often biased because of the typical places where surveys are run, for example, among large student populations from Introduction to Psychology courses. Another challenge associated with surveys is that they exclude people who do not like to provide information, and there is an entire body of research on methodologies for removing such participation bias. Above all, the main difficulty associated with survey instruments is that they are expensive to run, in terms of both time and money.

      The alternative to surveys that has been extensively explored in recent years is the extraction of information from unstructured sources. For instance, rather than creating maps of “optimism” by surveying a group of people on whether they are optimistic or pessimistic and asking for their location, one could achieve the same goal by collecting Twitter or blog data, extracting the writers’ locations from their profiles, and using automatic text classification tools to infer their level of optimism (Ruan, Wilson, & Mihalcea, 2016). The main advantage of gathering information on people from such data sources is their “always on” property, which allows one to collect information continuously and inexpensively. These digital resources also eliminate some of the biases that come with survey instruments, but they introduce other kinds of biases. For instance, most of these data-driven collections of information on people rely on social media or on crowdsourcing platforms such as Amazon Mechanical Turk, but these sources cover only the part of the population that is open to posting on social media or participating in online crowdsourcing experiments. Even more important, a major difficulty associated with unstructured data sources is the lack of exactness in the process of extracting information. This process often relies on automatic tools for text mining and classification, which, even if generally very good, are not perfect. This effect can, however, be counteracted by using large quantities of data: Whereas the data that one can get from surveys are often limited by the number of participants (which in turn is limited by time and cost), that limit is much higher for the information that one can gather from digital data sources. Thus, if cleverly used, the richness of the information obtained from unstructured data can rival, if not exceed, that obtained with surveys.
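
      The optimism-mapping idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by Ruan, Wilson, and Mihalcea (2016): the word lists, the `classify_optimism` and `optimism_by_location` functions, and the sample posts are all hypothetical, and a simple word-counting rule stands in for a trained text classifier.

```python
from collections import defaultdict

# Hypothetical word lists (an assumption for illustration only);
# a real study would use a trained text classifier instead.
OPTIMISTIC_WORDS = {"hope", "hopeful", "excited", "better", "great"}
PESSIMISTIC_WORDS = {"worse", "hopeless", "afraid", "never", "doomed"}


def classify_optimism(text):
    """Return +1 (optimistic), -1 (pessimistic), or 0 (neutral) for one post."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    positive = sum(t in OPTIMISTIC_WORDS for t in tokens)
    negative = sum(t in PESSIMISTIC_WORDS for t in tokens)
    score = positive - negative
    return (score > 0) - (score < 0)  # sign of the score


def optimism_by_location(posts):
    """Average the per-post labels for each self-reported profile location."""
    labels = defaultdict(list)
    for post in posts:
        location = post.get("location")
        if not location:  # posts without a profile location cannot be mapped
            continue
        labels[location].append(classify_optimism(post["text"]))
    return {loc: sum(vals) / len(vals) for loc, vals in labels.items()}


# Invented sample data standing in for collected Twitter or blog posts.
posts = [
    {"location": "Austin", "text": "Feeling hopeful, things will get better!"},
    {"location": "Austin", "text": "Traffic is never going to improve."},
    {"location": "Detroit", "text": "So excited about the new library."},
    {"location": None, "text": "Great day."},
]

print(optimism_by_location(posts))
```

      Averaging many imperfect per-post labels within each location is one simple way to act on the point that individual classification errors matter less as the number of posts grows. The post with no profile location is dropped, which mirrors the coverage bias noted above.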

      Online Data Sources

      Researchers often prefer to use ready-made data rather than, or often in addition to, constructing their own data sets using crawling and scraping tools. While many sources of data are in the public domain, some require access through a university subscription. For example, sources of news data include the websites of local and regional news outlets as well as private databases such as EBSCO, Factiva, and LexisNexis, which provide access to tens of thousands of global news sources, including blogs, television and radio transcripts, and traditional print news. One example of the use of such databases is a study of academic research on international entrepreneurship by the management researchers Jones, Coviello, and Tang (2011). Jones and colleagues used EBSCO and ABI/INFORM search tools to select their final data set of 323 journal articles on international entrepreneurship published between 1989 and 2009. They then used thematic analysis (see Chapter 11) to identify themes and subthemes in their data.

      In addition to digitized news sources, researchers have access to writing produced by organizations, including political statements, organizational calendars, and event reports. These data include recent online writing as well as digitized historical archives. Unfortunately, many online data sources are not simple to access. Most news databases allow access
