Computational Text Analysis (262F24)

Aarhus University, Department of Political Science


Tobias Widmann

General Information

Time: Spring semester 2024, Mondays 15:00 - 18:00

Location: 1330-024 Undervisningslokale

Instructor: Tobias Widmann (Office: 1340-133)

Exam Format: 7-day take-home exam

Course Content

With the rise of the internet, the availability of new data sources has changed vastly over the past decades. Social scientists can now draw on huge data sets consisting of videos, images or text to answer pressing societal questions. In particular, the amount of available text data has exploded due to the growth of websites such as Twitter, Facebook, Google and Wikipedia. This increase has been reinforced by the digitisation of historical archives, journalistic corpora and administrative records. However, collecting and analysing large amounts of text data also present new challenges for researchers and students.

The aim of this class is to introduce students to the computational analysis of text from a social science perspective, with a special focus on politics. The course covers both the theoretical foundations and the practical application of text analysis approaches, but it is predominantly practical in nature and aims to give students the tools to perform their own analyses. We therefore focus on empirical questions we can ask with text-as-data and learn how to answer them. To this end, students work through hands-on exercises during class using the R statistical programming language. Furthermore, we discuss recent examples of empirical research that rely on text analysis techniques.

Overall, the course will cover a range of popular techniques for collecting, processing and analysing text-based data. These range from data collection techniques to supervised and unsupervised approaches. Among others, the course will cover topics such as:

  • Web scraping techniques
  • Cleaning and pre-processing of text data
  • Dictionary approaches
  • Machine learning approaches
  • Topic modelling
  • Scaling
  • Word embeddings
  • Transformer models

The individual lessons of this course cover different aspects and build on each other. We start with a basic introduction, continue with data collection and data preparation, and finally move on to more complex forms of text analysis. Basic R knowledge is beneficial for following this course, but advanced programming skills are not required. We will learn how to use R/RStudio and the necessary packages together in class.

Each class consists of two parts. In the lecture part, the instructor introduces new text analysis techniques and presents example studies. In the practical part, students work individually or in groups on weekly assignments. Students should finish these assignments at home between classes; the solutions are discussed together in class the following week.

Learning Outcomes

At the end of the course the student:

  • is able to use the programming language R to explore, analyse and communicate text data
  • is familiar with and able to account for contemporary debates and empirical examples in relation to computational text analysis
  • is able to understand and explain strengths and weaknesses of different computational text analysis approaches
  • is able to assess and discuss the usefulness of different text analysis methods and apply them practically
  • is able to independently collect, clean and prepare large amounts of text data to answer empirical questions
  • is able to apply theory and methods presented throughout the course to answer specific research questions
  • is able to effectively present and interpret the results of empirical text analyses
  • is able to discuss the ethical implications of using text-as-data methods

Course Structure

Week 1: Introduction (29.01.2024)

  • General introduction (course structure & requirements)
  • What is “Computational Text Analysis” and what are its assumptions?
  • What do we need to pay attention to when we analyse text computationally?
  • Practical Exercise:
    • How to use R & RStudio / general commands and functions
    • Introduction to Quarto

Readings Week 1:

  • Chapter 1 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • J. Wilkerson and A. Casas (2017). “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges”. Annual Review of Political Science 20: 529–544.

Optional Readings:

  • J. Grimmer and B. M. Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”. Political Analysis 21 (3): 267–297.
  • K. Benoit (2020). “Text as Data: An Overview”. Handbook of Research Methods in Political Science and International Relations. Ed. by L. Curini and R. Franzese. Thousand Oaks: Sage: 461–497.

Week 2: The Basics I (05.02.2024)

  • Examples of Computational Text Analysis in Social Sciences
  • What are potential ‘flaws’ of computational text analysis?
  • Learning terms often used in CTA
  • Practical Exercise: First steps of CTA: regular expressions, tokenization, …

Readings Week 2:

  • M. Schoonvelde, G. Schumacher, and B. N. Bakker (2019). “Friends with Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology”. Journal of Social and Political Psychology 7 (1): 124–143.
  • Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 0, no. 0 (December 2021): 1–18.

Week 3: The Basics II (12.02.2024)

  • The “Bag-of-Words” approach
  • What is pre-processing of text and why do we need it?
  • What do we need to pay attention to during pre-processing?
  • Practical Exercise: removing stopwords, stemming, lowercasing, removing features, creating DFMs, …

Readings Week 3:

  • Chapter 5 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • M. W. Denny and A. Spirling (2018). “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It”. Political Analysis 26 (2): 168–189.

Week 4: Dictionaries (19.02.2024)

  • Introduction to dictionaries
  • How do we apply off-the-shelf dictionaries to text?
  • How do we create and apply our own dictionaries?
  • Practical Exercise: creating and applying dictionaries to text data

Readings Week 4:

  • Chapter 16 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • Young, L., & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231.
  • S.-O. Proksch, W. Lowe, J. Wäckerle, and S. N. Soroka (2019). “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches”. Legislative Studies Quarterly 44 (1): 97–131.

Week 5: Data Collection: Web Scraping (26.02.2024)

  • How do we collect information from static websites?
  • How do we use APIs to collect text data?
  • Practical Exercise: using ‘rvest’, scraping websites and tables, using loops & functions, connecting to APIs, …

Readings Week 5:

  • No readings this week

Week 6: Unsupervised & Supervised Topic Models (04.03.2024)

  • What are unsupervised topic models and how do they work?
  • What assumptions underlie topic models? What are their weaknesses?
  • Practical Exercise: applying Structural Topic Models (STM) and Keyword-Assisted Topic Models to text

Readings Week 6:

  • Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082.
  • Eshima, S., Imai, K., & Sasaki, T. (2020). Keyword-Assisted Topic Models. American Journal of Political Science.

Week 7: Machine Learning I (11.03.2024)

  • How does machine learning work and how can we classify documents into pre-defined categories?
  • How do we create a test and training data set?
  • Practical Exercise: applying supervised machine learning algorithms to classify documents

Readings Week 7:

  • Domingos, Pedro. “A Few Useful Things to Know about Machine Learning.” Communications of the ACM 55, no. 10 (October 2012): 78–87.
  • Chapter 17 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Week 8: Validation (18.03.2024)

  • How do we validate the results of automated text analysis?
  • What different approaches exist?
  • Practical Exercise: understanding and calculating inter-coder agreement, accuracy, precision, and F1 scores

Readings Week 8:

  • Chapter 20 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • Ying, Luwei, Jacob M. Montgomery, and Brandon M. Stewart. “Topics, Concepts, and Measurement: A Crowdsourced Procedure for Validating Topics as Measures.” Political Analysis, September 27, 2021, 1–20.

Week 9: Ethics, Biases, and Machine Learning II (03.04.2024)

  • Ethical considerations in computational text analysis
  • What biases do we need to keep in mind?
  • Practical Exercise: improving the performance of machine learning algorithms to classify documents

Readings Week 9:

  • Chapter 3 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • Chapter 2 of Salganik, M. J. (2019). Bit by bit: Social research in the digital age. Princeton University Press.

Week 10: Do your own text-as-data project (08.04.2024)

  • Think about a research question, collect your data and choose a tool that helps you answer this question

Readings Week 10:

  • No readings this week

Week 11: Word Embeddings I (15.04.2024)

  • What are word embeddings?
  • How do they improve on other approaches to text analysis?
  • Practical Exercise: Training word embeddings locally and using pre-trained embeddings

Readings Week 11:

  • Rodriguez, Pedro L., and Arthur Spirling. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics, May 2021, 000–000.

  • Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905–949.

  • Kroon, Anne C., Damian Trilling, Toni G. L. A. van der Meer, and Jeroen G. F. Jonkman. “Clouded Reality: News Representations of Culturally Close and Distant Ethnic Outgroups.” Communications 0, no. 0 (November 19, 2019).

Week 12: Word Embeddings II (22.04.2024)

  • How can we use word embeddings for statistical inference?
  • Practical Exercise: using word embeddings in downstream tasks

Readings Week 12:

  • Hebbelstrup Rye Rasmussen, S., Bor, A., Osmundsen, M., & Petersen, M. B. (2023). ‘Super-Unsupervised’ Classification for Labelling Text: Online Political Hostility as an Illustration. British Journal of Political Science, 1–22.

  • Rheault, L., & Cochrane, C. (2020). Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora. Political Analysis, 28(1), 112–133.

Week 13: From Word Embeddings to Transformers (29.04.2024)

  • Getting to know ‘state-of-the-art’ transformer-based models
  • How do they differ from previous approaches, such as word embeddings?
  • Practical Exercise: Contextualized word embeddings, applying pre-trained transformer-based models in text analysis

Readings Week 13:

  • Wankmüller, S. (2022). Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis. Sociological Methods & Research, 00491241221134527.
  • Shanahan, M. (2023). Talking About Large Language Models (arXiv:2212.03551). arXiv.

Week 14: Q & A, Visualization & Communication (06.05.2024)

  • Q & A
  • How do we visualize our results effectively?
  • How do we use Quarto to communicate our results?

Readings Week 14:

  • No readings this week