Computational Text Analysis

General Information

Course Content

Learning Outcomes

Course Structure

Week 1: Introduction
Week 2: The Basics 1
Week 3: The Basics 2
Week 4: Data Collection: Web Scraping
Week 5: Data Collection: Social Media Scraping and Dynamic Webpages
Week 6: Dictionaries
Week 7: Language Complexity & Sophistication
Week 8: Scaling
Week 9: Unsupervised Topic Models
Week 10: Machine Learning
Week 11: Validation
Week 12: Word Embeddings 1
Week 13: Word Embeddings 2
Week 14: Transformer-Based Models

General Information

Time: Spring semester 2022, Tuesdays 11:00 - 14:00

Location: 1330-024 Undervisningslokale

Instructor: Tobias Widmann (Office: 1341-122)

Exam Format: 7-day take-home exam

Course Content

With the rise of the internet, the availability of data sources has changed vastly over the past decades. Social scientists can now rely on huge data sets consisting of videos, images or text to answer pressing societal questions. In particular, the amount of available text data has exploded due to the growth of websites such as Twitter, Facebook, Google and Wikipedia. This increase has been further strengthened by the digitisation of historical archives, journalistic corpora and administrative records. However, collecting and analysing large amounts of text data also present new challenges for researchers and students.

The aim of this class is to introduce students to the computational analysis of text from a social science perspective, with a special focus on politics. The course covers the theoretical foundations as well as the practical application of text analysis approaches. It is, however, predominantly practical in nature and aims to give students the tools to perform their own analyses. We therefore focus on empirical questions that can be asked with text-as-data and learn how to answer them. Students work through hands-on exercises during class using the R statistical programming language. Furthermore, we discuss recent examples of empirical research that rely on text analysis techniques.

Overall, the course will cover a range of popular techniques for collecting, processing and analysing text-based data. These range from data collection techniques to supervised and unsupervised approaches. Among others, the course will cover topics such as:

Web scraping techniques
Cleaning and pre-processing of text data
Dictionary approaches
Machine learning approaches
Topic modelling
Scaling
Word embeddings

The individual lessons of this course cover different aspects and build on one another. We start with a basic introduction, followed by data collection and data preparation, and finally move on to more complex forms of text analysis. Basic R knowledge is beneficial for following this course, but advanced programming skills are not required. We will learn how to use R/RStudio and the necessary packages together in class.

Each class consists of two parts. In the lecture part, the instructor introduces new text analysis techniques and presents example studies. In the practical part, students work individually or in groups on weekly assignments. Students should finish these assignments at home between classes; the solutions are discussed together in class the following week.

Learning Outcomes

At the end of the course the student:

is familiar with and able to account for contemporary debates and empirical examples in relation to computational text analysis
is able to understand and explain strengths and weaknesses of different computational text analysis approaches
is able to assess and discuss the usefulness of different text analysis methods and apply them practically
is able to independently collect, clean and prepare large amounts of text data to answer empirical questions
is able to independently formulate a research question that can be answered by applying theory and methods presented throughout the course
is able to apply theory and methods presented throughout the course to answer specific research questions
is able to effectively present and interpret the results of empirical text analyses.

Course Structure

Week 1: Introduction

General introduction (course structure & requirements)
What is “Computational Text Analysis” and what are its assumptions?
What do we need to pay attention to when we analyze text computationally?
Practical Exercise: How to use R & RStudio, general commands and functions

Readings Week 1:

J. Wilkerson and A. Casas (2017). “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges”. Annual Review of Political Science 20: 529–544.
J. Grimmer and B. M. Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”. Political Analysis 21 (3): 267–297.
K. Benoit (2020). “Text as Data: An Overview”. Handbook of Research Methods in Political Science and International Relations. Ed. by L. Curini and R. Franzese. Thousand Oaks: Sage: 461–497.

Week 2: The Basics 1

Examples of Computational Text Analysis in Social Sciences
What are potential ‘flaws’ of computational text analysis?
Key terms often used in computational text analysis (CTA)
Practical Exercise: First steps of CTA: regular expressions, tokenization
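
As a rough illustration of the two techniques named in the practical exercise, the following sketch uses base R only. The example sentence and the variable names are hypothetical and not taken from the course material; the in-class exercises may use different packages and data.

```r
# Hypothetical example sentence; any character vector would work.
text <- "In 2021, the EU adopted 3 new rules. Critics called them 'too weak'!"

# Regular expression: extract every run of digits from the text.
numbers <- regmatches(text, gregexpr("[0-9]+", text))[[1]]

# Tokenization: lowercase the text, then split on anything that is
# not a lowercase letter or a digit.
tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
tokens <- tokens[tokens != ""]  # drop empty strings left by leading separators

print(numbers)
print(tokens)
```

Splitting on non-alphanumeric characters is a deliberately crude tokenizer: it discards punctuation and would break words containing accented or other non-ASCII letters.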

Readings Week 2:

M. Schoonvelde, G. Schumacher, and B. N. Bakker (2019). “Friends with Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology”. Journal of Social and Political Psychology 7 (1): 124–143. https://doi.org/10.5964/jspp.v7i1.964
Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G van der Velden. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 0, no. 0 (December 2021): 1–18. https://doi.org/10.1080/19312458.2021.2015574
Chapter 2 in D. Jurafsky and J. H. Martin (2021). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. https://web.stanford.edu/~jurafsky/slp3/

Week 3: The Basics 2

What is pre-processing of text and why do we need it?
What do we need to pay attention to during pre-processing?
Practical Exercise: removing stopwords, stemming, lowercasing, removing features, creating DFMs, introducing quanteda
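
To give a sense of what these pre-processing steps produce, here is a minimal base R sketch that lowercases two documents, removes stopwords and builds a small document-feature matrix (DFM) of word counts. The documents, the four-word stopword list and all object names are made up for illustration. The course exercise introduces quanteda, which this sketch does not use.

```r
# Two hypothetical documents.
docs <- c(d1 = "The parliament debates the new budget.",
          d2 = "The budget debate divides the parliament!")

# Tiny illustrative stopword list (real lists are much longer).
stop_words <- c("the", "a", "an", "of")

# Lowercase, split on non-letters, drop empty strings and stopwords.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[toks != "" & !(toks %in% stop_words)]
}
toks <- lapply(docs, tokenize)

# Vocabulary: all distinct features across the documents.
vocab <- sort(unique(unlist(toks)))

# Document-feature matrix: one row per document, one column per feature.
dfm_counts <- t(vapply(
  toks,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dfm_counts) <- vocab

print(dfm_counts)
```

Because no stemming is applied here, "debate" and "debates" end up as separate columns, which is one reason stemming appears among the pre-processing steps.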

Readings Week 3:

K. Welbers, W. Van Atteveldt, and K. Benoit (2017). “Text Analysis in R”. Communication Methods and Measures 11 (4): 245–265.
M. W. Denny and A. Spirling (2018). “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It”. Political Analysis 26 (2): 168–189.

Week 4: Data Collection: Web Scraping

How do we collect information from static websites?
How do we collect information from tables online?
Practical Exercise: using R package ‘rvest’, scraping websites and tables, using loops & functions

Readings Week 4:

Olteanu, Alexandra, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.” Frontiers in Big Data 2 (2019): 13. https://doi.org/10.3389/fdata.2019.00013.
Bradley, Alex, and Richard J. E. James. “Web Scraping Using R.” Advances in Methods and Practices in Psychological Science 2, no. 3 (September 1, 2019): 264–70. https://doi.org/10.1177/2515245919859535.
Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons, 2014. Chapter 9 (pp. 270–293). Access via AUL: https://soeg.kb.dk/permalink/45KBDK_KGL/1pioq0f/alma99123014837905763

Week 6: Dictionaries

What are dictionaries? How do we use them?
How do we apply off-the-shelf dictionaries to text?
How do we create and apply our own dictionaries?
Practical Exercise: creating and applying dictionaries to text data

Readings Week 6:

Pennebaker, James W., Matthias R. Mehl, and Kate G. Niederhoffer. “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54, no. 1 (2003): 547–77. https://doi.org/10.1146/annurev.psych.54.101601.145041
S.-O. Proksch, W. Lowe, J. Wäckerle, and S. N. Soroka (2019). “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches”. Legislative Studies Quarterly 44(1): 97–131.
C. Rauh (2018). “Validating a Sentiment Dictionary for German Political Language: A Workbench Note”. Journal of Information Technology & Politics 15 (4): 319–343.

Week 7: Language Complexity & Sophistication

How do texts differ in terms of ‘complexity’ and ‘sophistication’?
How do we measure linguistic complexity/sophistication? What do we need to pay attention to?
Practical Exercise: estimating text complexity and comparing texts

Readings Week 7:

Benoit, Kenneth, Kevin Munger, and Arthur Spirling. “Measuring and Explaining Political Sophistication through Textual Complexity.” American Journal of Political Science 63, no. 2 (2019): 491–508. https://doi.org/10.1111/ajps.12423.
Spirling, Arthur. “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” The Journal of Politics 78, no. 1 (January 2016): 120–36. https://doi.org/10.1086/683612.
Schoonvelde, Martijn, Anna Brosius, Gijs Schumacher, and Bert N. Bakker. “Liberals Lecture, Conservatives Communicate: Analyzing Complexity and Ideology in 381,609 Political Speeches.” PLoS ONE 14, no. 2 (February 6, 2019). https://doi.org/10.1371/journal.pone.0208450.

Week 8: Scaling

What is supervised and unsupervised scaling?
What assumptions underlie scaling techniques?
What is scaling useful for?
Practical Exercise: applying different scaling techniques to text (wordfish, wordscores, lss)

Readings Week 8:

M. Laver, J. Garry, and K. Benoit (2003). “Extracting Policy Positions from Political Texts Using Words as Data”. American Political Science Review 97 (2): 311–331.
Slapin, Jonathan B. and Sven-Oliver Proksch. 2008. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52(3): 705–722.
Watanabe, Kohei. “Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages.” Communication Methods and Measures 0, no. 0 (November 1, 2020): 1–23.
Hjorth, Frederik, Robert Klemmensen, Sara Hobolt, Martin Ejnar Hansen, and Peter Kurrild-Klitgaard. “Computers, Coders, and Voters: Comparing Automated Methods for Estimating Party Positions.” Research & Politics 2, no. 2 (June 2, 2015): 205316801558047. https://doi.org/10.1177/2053168015580476.

Week 9: Unsupervised Topic Models

What are unsupervised topic models and how do they work?
What assumptions underlie topic models? What are their weaknesses?
Practical Exercise: apply Structural Topic Models (STM) to text

Readings Week 9:

Blei, David M. “Probabilistic Topic Models.” Communications of the ACM 55, no. 4 (April 2012): 77–84. https://doi.org/10.1145/2133806.2133826.
Roberts, Margaret E, Brandon M Stewart, and Dustin Tingley. “Stm: R Package for Structural Topic Models.” Journal of Statistical Software, 2014, 42. https://doi.org/10.18637/jss.v091.i02

Week 10: Machine Learning

What exactly is machine learning and how can we classify documents into pre-defined categories?
How do we create a test and training data set?
Practical Exercise: applying different machine learning algorithms to classify documents

Readings Week 10:

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. “Machine Learning for Social Science: An Agnostic Approach.” Annual Review of Political Science 24, no. 1 (2021): 395–419. https://doi.org/10.1146/annurev-polisci-053119-015921.
Domingos, Pedro. “A Few Useful Things to Know about Machine Learning.” Communications of the ACM 55, no. 10 (October 2012): 78–87. https://doi.org/10.1145/2347736.2347755.
D’Orazio, Vito, Steven T. Landis, Glenn Palmer, and Philip Schrodt. “Separating the Wheat from the Chaff: Applications of Automated Document Classification Using Support Vector Machines.” Political Analysis 22, no. 2 (2014): 224–42.

Week 11: Validation

How do we validate the results of automated text analysis?
What different approaches exist?
Practical Exercise: Understand and calculate inter-coder agreement, accuracy, precision, and F1 scores.
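
The classification metrics named in the exercise can be computed by hand from a confusion matrix. The sketch below uses base R and invented labels for ten documents. It also computes recall, which is not listed above but is needed because F1 is the harmonic mean of precision and recall.

```r
# Hypothetical gold-standard (human-coded) labels and model predictions
# for ten documents; 1 = positive class, 0 = negative class.
gold <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)
pred <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)

# Cells of the confusion matrix.
tp <- sum(pred == 1 & gold == 1)  # true positives
fp <- sum(pred == 1 & gold == 0)  # false positives
fn <- sum(pred == 0 & gold == 1)  # false negatives
tn <- sum(pred == 0 & gold == 0)  # true negatives

accuracy  <- (tp + tn) / length(gold)   # share of correct predictions
precision <- tp / (tp + fp)             # share of predicted positives that are correct
recall    <- tp / (tp + fn)             # share of actual positives that are found
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```

With these invented labels all four metrics happen to equal 0.8; on real data they usually differ, especially when the classes are imbalanced.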

Readings Week 11:

Ying, Luwei, Jacob M. Montgomery, and Brandon M. Stewart. “Topics, Concepts, and Measurement: A Crowdsourced Procedure for Validating Topics as Measures.” Political Analysis, September 27, 2021, 1–20. https://doi.org/10.1017/pan.2021.33.
Benoit, Kenneth, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov. “Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political Data.” American Political Science Review 110, no. 2 (May 2016): 278–95. https://doi.org/10.1017/S0003055416000058.
Lind, Fabienne, Jakob-Moritz Eberl, Tobias Heidenreich, and Hajo G. Boomgaarden. “When the Journey Is as Important as the Goal: A Roadmap to Multilingual Dictionary Construction.” International Journal of Communication 13 (2019): 21.
Lind, Fabienne, Maria Gruber, and Hajo G. Boomgaarden. “Content Analysis by the Crowd: Assessing the Usability of Crowdsourcing for Coding Latent Constructs.” Communication Methods and Measures 11, no. 3 (July 3, 2017): 191–209. https://doi.org/10.1080/19312458.2017.1317338.

Week 12: Word Embeddings 1

What are word embeddings and how can we use them?
How do they improve other approaches of text analysis?
Practical Exercise: Training word embeddings locally and using pre-trained embeddings

Readings Week 12:

Rudkowsky, Elena, Martin Haselmayer, Matthias Wastian, and Marcelo Jenny. “More than Bags of Words: Sentiment Analysis with Word Embeddings.” Communication Methods and Measures, 2018, 19.
Rodriguez, Pedro L., and Arthur Spirling. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics, May 2021, 000–000. https://doi.org/10.1086/715162.
Rheault, Ludovic, and Christopher Cochrane. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis (2019): 1-22. https://doi.org/10.1017/pan.2019.26
Kroon, Anne C., Damian Trilling, Toni G. L. A. van der Meer, and Jeroen G. F. Jonkman. “Clouded Reality: News Representations of Culturally Close and Distant Ethnic Outgroups.” Communications 0, no. 0 (November 19, 2019). https://doi.org/10.1515/commun-2019-2069.

Week 13: Word Embeddings 2

How can we use word embeddings for statistical inference?
Practical Exercise: using word embeddings in a regression framework

Readings Week 13:

Rodriguez, Pedro L, Arthur Spirling, and Brandon M Stewart. “Models for Context-Specific Description and Inference in Political Science,” n.d., 43. https://arthurspirling.org/documents/embedregression.pdf

Week 14: Transformer-Based Models

Getting to know ‘state-of-the-art’ transformer-based models
How do they differ from previous approaches, such as ‘simple’ word embeddings?
Practical Exercise: applying pre-trained transformer-based models in text analysis

Readings Week 14:

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” ArXiv:1810.04805 [Cs], May 24, 2019. http://arxiv.org/abs/1810.04805.
Widmann, Tobias and Maximilian Wich. “Creating and Comparing Dictionary, Word Embedding, and Transformer-based Models to Measure Discrete Emotions in German Political Text” (2020).
Terechshenko, Zhanna, Fridolin Linder, Vishakh Padmakumar, Michael Liu, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. “A Comparison of Methods in Political Science Text Classification: Transfer Learning Language Models for Politics.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, October 20, 2020. https://doi.org/10.2139/ssrn.3724644.