Computational Text Analysis (AUG3E25)

Aarhus University, Department of Political Science

Author

Tobias Widmann

General Information

Time: August seminar 2025, 29.07 - 13.08 (11:00 - 14:00)

Location: 1330-038 Undervisningslokale

Instructor: Tobias Widmann

Exam Format: Oral exam

Course Content

With the rise of the internet, the availability of new data sources has changed vastly over the past decades. Social scientists can nowadays rely on huge data sets consisting of videos, images or text to answer pressing societal questions. In particular, the amount of available text data has exploded due to the growth of websites such as Twitter, Facebook, Google and Wikipedia. This increase has been further strengthened by the digitisation of historical archives, journalistic corpora and administrative records. However, collecting and analysing large amounts of text data also present new challenges for researchers and students.

The aim of this class is to introduce students to the computational analysis of text from a social science perspective, with a special focus on politics. To this end, the course covers the theoretical foundations as well as the practical application of text analysis approaches. The course is, however, predominantly practical in nature and aims to give students the tools to perform their own analyses. We therefore focus on empirical questions that can be asked with text-as-data and learn how to answer them. Students work on hands-on exercises during class using the R statistical programming language. Furthermore, we discuss recent examples of empirical research that rely on text analysis techniques.

Overall, the course will cover a range of popular techniques for collecting, processing and analysing text-based data. These range from data collection techniques to supervised and unsupervised approaches. Among others, the course will cover topics such as:

Cleaning and pre-processing of text data
Web scraping techniques
Dictionary approaches
Machine learning approaches
Topic modelling
Word embeddings
Transformer models + LLMs

The individual lessons of this course cover different aspects and build on each other. We start out with a basic introduction, followed by data collection and data preparation, and finally move on to more complex forms of text analysis. Basic R knowledge is beneficial for following this course, but advanced programming skills are not required. We will learn how to use R/RStudio and the necessary packages together in class.

Each class consists of two different formats. In the lecture part of the class, the instructor introduces new text analysis techniques and presents example studies. In the practical part of the class, students work individually or in groups on exercises. Students should finish these assignments at home between classes, and the solutions to these assignments are discussed together in the following class.

Learning Outcomes

At the end of the course the student:

is able to use the programming language R to explore, analyse and communicate text data
is familiar with and able to account for contemporary debates and empirical examples in relation to computational text analysis
is able to understand and explain strengths and weaknesses of different computational text analysis approaches
is able to assess and discuss the usefulness of different text analysis methods and apply them practically
is able to independently collect, clean and prepare large amounts of text data to answer empirical questions
is able to apply theory and methods presented throughout the course to answer specific research questions
is able to effectively present and interpret the results of empirical text analyses
is able to discuss the ethical implications of using text-as-data methods

Course Structure

Class 1: Introduction

General introduction
What is “Computational Text Analysis” and what are its assumptions?
What do we need to pay attention to when we analyse text computationally?

Readings Class 1:

Chapter 1 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
J. Wilkerson and A. Casas (2017). “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges”. Annual Review of Political Science 20: 529–544.
K. Benoit (2020). “Text as Data: An Overview”. Handbook of Research Methods in Political Science and International Relations. Ed. by L. Curini and R. Franzese. Thousand Oaks: Sage: 461–497.

Optional Readings:

J. Grimmer and B. M. Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”. Political Analysis 21 (3): 267–297.

Class 2: The Basics of CTA

Examples of Computational Text Analysis in Social Sciences
Common terms used in CTA
Practical Exercise: First steps of CTA: regular expressions, tokenization, …
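
As a rough illustration of the first steps named above, a regular expression can split raw text into lowercase word tokens. The course exercises use R; this minimal sketch is written in Python purely for illustration, and the function name `tokenize` and the example sentence are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens as a list."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # (including apostrophes and hyphens) separates tokens in this scheme.
    return re.findall(r"\w+", text.lower())


# Hypothetical example sentence, for illustration only.
example = "The Parliament debated the budget; the budget passed."
tokens = tokenize(example)
print(tokens)
```

Real tokenizers make more careful choices about contractions, hyphenated words and numbers; this pattern is only the simplest starting point.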

Readings Class 2:

M. Schoonvelde, G. Schumacher, and B. N. Bakker (2019). “Friends with Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology”. Journal of Social and Political Psychology 7 (1): 124–143.
Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 0, no. 0 (December 2021): 1–18.
Chapter 2 of D. Jurafsky and J. H. Martin (2021). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Link: https://web.stanford.edu/~jurafsky/slp3/

Class 3: Dictionaries + Language Complexity

Introduction to dictionaries + bag-of-words
How do we apply off-the-shelf dictionaries to text?
How do we create and apply our own dictionaries?
How can we measure language complexity?
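
To make the dictionary idea concrete, a dictionary approach on a bag-of-words representation can be sketched as counting how many tokens of a text fall into each category of a word list. The course exercises use R; the Python sketch below is only an illustration, and the toy dictionary, the function name `dictionary_scores` and the example sentence are hypothetical rather than taken from any published dictionary.

```python
import re
from collections import Counter

# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {
    "positive": {"good", "great", "support"},
    "negative": {"bad", "crisis", "oppose"},
}


def dictionary_scores(text, dictionary):
    """Return, per category, the share of tokens that match the dictionary."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)  # bag-of-words: word order is discarded
    total = len(tokens)
    scores = {}
    for category, words in dictionary.items():
        hits = sum(counts[word] for word in words)
        # Normalise by document length so long and short texts are comparable.
        scores[category] = hits / total if total else 0.0
    return scores


scores = dictionary_scores(
    "We support this great reform and oppose the crisis talk.",
    TOY_DICTIONARY,
)
print(scores)
```

Because word order is discarded, a sketch like this cannot handle negation ("not good"), which is one reason dictionary results need validation.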

Readings Class 3:

Chapter 5 & 16 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Young, L., & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231.
S.-O. Proksch, W. Lowe, J. Wäckerle, and S. N. Soroka (2019). “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches”. Legislative Studies Quarterly 44 (1): 97–131.
Spirling, Arthur. “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” The Journal of Politics 78, no. 1 (January 2016): 120–36.
Schoonvelde, Martijn, Anna Brosius, Gijs Schumacher, and Bert N. Bakker. “Liberals Lecture, Conservatives Communicate: Analyzing Complexity and Ideology in 381,609 Political Speeches.” PLoS ONE 14, no. 2 (February 6, 2019).

Class 4: Data Collection using Web Scraping

How do we collect information from static websites?
How can we use APIs?
Practical Exercise: using ‘rvest’, scraping websites and tables, using loops & functions

Readings Class 4:

Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons, 2014. Chapter 9 (270 – 293).
Chapter 2 of Salganik, M. J. (2019). Bit by bit: Social research in the digital age. Princeton University Press. https://www.bitbybitbook.com/
Bradley, Alex, and Richard J. E. James. “Web Scraping Using R.” Advances in Methods and Practices in Psychological Science 2, no. 3 (September 1, 2019): 264–70.
Barberá, Pablo, and Gonzalo Rivero. “Understanding the Political Representativeness of Twitter Users.” Social Science Computer Review 33, no. 6 (December 2015): 712–29.

Class 5: Unsupervised & Supervised Topic Models

What are unsupervised topic models and how do they work?
What assumptions underlie topic models? What are their weaknesses?
Practical Exercise: apply Structural Topic Models (STM) to text

Readings Class 5:

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082.
Eshima, S., Imai, K., & Sasaki, T. (2020). Keyword-Assisted Topic Models. American Journal of Political Science.

Class 6: Machine Learning

How does machine learning work and how can we classify documents into pre-defined categories?
How do we create a test and training data set?
Practical Exercise: applying supervised machine learning algorithms to classify documents

Readings Class 6:

Domingos, Pedro. “A Few Useful Things to Know about Machine Learning.” Communications of the ACM 55, no. 10 (October 2012): 78–87.
Chapter 17 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Class 7: Validation

How do we validate the results of automated text analysis?
What different approaches exist?
Practical Exercise: Understand and calculate different forms of validity, calculate accuracy, precision, and F1 scores.
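
The metrics named above can be computed directly from a comparison of human-coded labels and model predictions. The course exercises use R; the following Python sketch is only an illustration under the assumption of a binary classification task, and the function name `classification_metrics` and the example labels are hypothetical. Recall is included because the F1 score is the harmonic mean of precision and recall.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical human-coded labels and model predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```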

Readings Class 7:

Chapter 20 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Olteanu, Alexandra, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.” Frontiers in Big Data 2 (2019): 13.

Class 8: Ethics + Biases in CTA

Ethical considerations in computational text analysis
What biases do we need to keep in mind?

Readings Class 8:

Chapter 3 of Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Talks at Google, (2016). Weapons of Math Destruction | Cathy O’Neil | Talks at Google. https://www.youtube.com/watch?v=TQHs8SA1qpk.
Olteanu, Alexandra, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.” Frontiers in Big Data 2 (2019): 13.
Bianchi, F., Kalluri, P., Durmus, E., Ladhak, F., Cheng, M., Nozza, D., … & Caliskan, A. (2023, June). Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (pp. 1493-1504).
Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P. S., Mellor, J., … & Gabriel, I. (2022, June). Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 214-229).

Class 9: Do your own text-as-data project

Q & A
ggplot2
Think about a research question, collect your data and choose a tool that helps you answer this question

Readings Class 9:

No readings for this class

Class 10: Word Embeddings

What are word embeddings?
How do they compare to other approaches in CTA?
Practical Exercise: Training word embeddings locally and using pre-trained embeddings

Readings Class 10:

Rodriguez, Pedro L., and Arthur Spirling. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics, May 2021, 000–000.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905-949.
Kroon, Anne C., Damian Trilling, Toni G. L. A. van der Meer, and Jeroen G. F. Jonkman. “Clouded Reality: News Representations of Culturally Close and Distant Ethnic Outgroups.” Communications 0, no. 0 (November 19, 2019).
Rheault, L., & Cochrane, C. (2020). Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora. Political Analysis, 28(1), 112–133.
Aroyehun, Segun T., Almog Simchon, Fabio Carrella, Jana Lasser, Stephan Lewandowsky, and David Garcia. 2025. “Computational Analysis of US Congressional Speeches Reveals a Shift from Evidence to Intuition.” Nature Human Behaviour, April, 1–12.

Class 11: Transformers

Getting to know ‘state-of-the-art’ transformer-based models
How do they differ from previous approaches, such as word embeddings?
Practical Exercise: Fine-tuning a transformer model

Readings Class 11:

Wankmüller, S. (2022). Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis. Sociological Methods & Research, 00491241221134527.
Timoneda, Joan C., and Sebastián Vallejo Vera. 2025. “BERT, RoBERTa, or DeBERTa? Comparing Performance Across Transformers Models in Political Science Text.” The Journal of Politics 87 (1): 347–64.

Class 12: LLMs

Getting to know Large Language Models and their application in R
Practical Exercise: OpenAI, rollama, …

Readings Class 12:

Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences 120 (30): e2305016120.
Shanahan, M. (2023). Talking About Large Language Models (arXiv:2212.03551). arXiv.
Le Mens, Gaël, and Aina Gallego. 2025. “Positioning Political Texts with Large Language Models by Asking and Averaging.” Political Analysis 33 (3): 274–82.
Bisbee, James, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson. 2024. “Synthetic Replacements for Human Survey Data? The Perils of Large Language Models.” Political Analysis 32 (4): 401–16.
Törnberg, Petter. 2024. “Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages.” Social Science Computer Review, September, 08944393241286471.