TextAsDataCourse

Materials for "Text as Data" classes at Penn State and Essex.

Course schedule - Essex 2P Advanced Text as Data / NLP

Day 1 (Jul 26) - Introduction and Overview

Kenneth Benoit. 2020. “Text as Data: An Overview.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. here

Jacob Eisenstein. 2018. “Introduction.” Natural Language Processing here

Notes: Open Source Tools for Text as Data / NLP in R

Notes: Open Source Tools for Text as Data / NLP in Python

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day1-Introduction

R Notebooks for Day 1: https://rstudio.cloud/

Python notebook on string manipulation and regular expressions: here

Day 2 (Jul 27) - Language Models and NLP Pipelines for Sequence Labeling

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapters 3, 8, 14. “N-gram Language Models,” “Sequence Labeling for Parts of Speech and Named Entities,” “Dependency Parsing.” here

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day2-NLPPipelines.pdf

R Tutorials for Day 2 on RStudio Cloud: NLP Pipelines in R + spaCy in R (RStudio Cloud)

Python Tutorials on NLP Annotation Pipelines (spaCy and Stanza for now) here

Day 3 (Jul 28) - Word Embeddings

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 6, “Vector Semantics and Embeddings.” here

Pedro Rodriguez and Arthur Spirling (Forthcoming) “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research.” Journal of Politics. here

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day3-WordEmbeddings.pdf

Python Tutorial on Estimating Word Embeddings with gensim: https://colab.research.google.com/drive/1eSzd2z5B3CDeTxpdMXCIh3bm1L-gYzCr?usp=sharing

Third-party tutorial on estimating your own GloVe embeddings in R with text2vec: http://text2vec.org/glove.html. A replication of the same example from within quanteda: https://quanteda.io/articles/pkgdown/replication/text2vec.html

Day 4-5 (Jul 29-30) - Neural Networks and Deep Learning for NLP

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 7, “Neural Networks and Neural Language Models.” here

Jay Alammar. 2016. “A Visual and Interactive Guide to the Basics of Neural Networks.” here

Jay Alammar. 2016. “A Visual and Interactive Look at Basic Neural Network Math.” here

Christopher Olah. 2014. “Deep Learning, NLP, and Representations.” here

Kakia Chatsiou and Slava Jankin Mikhaylov. 2020. “Deep Learning for Political Science.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. here

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day4-DeepLearningNLP.pdf

R Tutorial on Text Classification with Keras and Tensorflow, on RStudio Cloud in Day 4 Project

Python Tutorial on Text Classification with Keras and Tensorflow: https://colab.research.google.com/drive/1MG2_5Hx5dwN77hmVNY0aUiGo99k2mPGb?usp=sharing

Aug 2

Study guide for exams: https://docs.google.com/document/d/1eZGUUzqTJCfQjQiJAn1DqhaVWeDoH4RSvOU8TC3WIqE/edit?usp=sharing

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day5-DeepLearningNLP2.pdf

Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (Python): https://colab.research.google.com/drive/1kGhXArEbWDP_A4TtlB1cgSubekIsX4VP?usp=sharing

Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (R): https://colab.research.google.com/drive/1hq9eCrWjDOkpMUY0QJ9fAOHWagcBSXU7?usp=sharing

Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (Python): https://colab.research.google.com/drive/1pkJNzWDdqTaVzZFQ1RnkAxx87Wkyr31T?usp=sharing

Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (R): Not currently functional.

Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (Python): https://colab.research.google.com/drive/1_6m2DVFQJPZH5UENZDs7jkrOU6kjyuCu?usp=sharing

Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (R): https://colab.research.google.com/drive/1n1Al0lplHxY78P5vPATBp6kLUUYz6maA?usp=sharing

Aug 3 - From Recurrent Neural Networks to Transformers

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 9, “Deep Learning Architectures for Sequence Processing.” here

Jay Alammar. 2018. “The Illustrated Transformer.” here

Suggested: Andrew Halterman. 2019. “Geolocating Political Events.” here

Suggested: Han Zhang and Jennifer Pan. 2019. “CASM: A Deep-Learning Approach for Identifying Collective Action Events from Text and Image Data.” Sociological Methodology 49(1): 1-57. here

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day6-RNNsToTransformers.pdf

Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (Python): https://colab.research.google.com/drive/1TDYGoskrMCbWGyS4X_kj8Ftzc6hURgz8?usp=sharing

Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (R)

Text Classification with Keras and Tensorflow 6: CNNs (Python)

Text Classification with Keras and Tensorflow 6: CNNs (R)

Third party notebooks on Transformers that may be of interest:

Aug 4 - Contextual Embeddings, Pretrained Language Models, and Transfer Learning

Noah Smith. 2019. “Contextual Word Vectors: A Contextual Introduction.”

Jay Alammar. 2018. “The Illustrated BERT, ELMo and Co. (How NLP Cracked Transfer Learning).” here

Jay Alammar. 2019. “A Visual Guide to Using BERT for the First Time.” here

Zhanna Terechshenko, Fridolin Linder, Vishakh Padmakumar, Michael Liu, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. 2020. “A Comparison of Methods in Political Science Text Classification: Transfer Learning Language Models for Politics.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3724644

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day7-TransferLearning.pdf

Text Classification with Keras and Tensorflow - BERT: https://colab.research.google.com/drive/1OQbZQZtoOB7Kg3RR_nqh52gDivuPkaEU?usp=sharing

Aug 5 - Multilingual Text as Data and Machine Translation

Mitchell Goist and Burt L. Monroe. 2020. “Scaling the Tower of Babel: Common-Space Analysis of Political Text in Multiple Languages.”

Leah C. Windsor, James G. Cupit, and Alistair J. Windsor. 2019. “Automated content analysis across six languages.” PLoS ONE 14(11):e0224425. here

Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day8-Multilingual.pdf

Text Translation Using Pretrained Transformer (Encoder-Decoder) Language Models: https://colab.research.google.com/drive/1d6SZzl1Rnxr25e8_ecR1vZGG156aOUk-?usp=sharing

Aug 6 - Natural Language Understanding / Semantic Change / Fairness & Bias in NLP

Emily M. Bender and Alexander Koller. 2020. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.” https://aclanthology.org/2020.acl-main.463.pdf

Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. “Bias in Word Embeddings.” https://dl.acm.org/doi/pdf/10.1145/3351095.3372843

Suggested on semantic change (“classic”, canonical cite): William L. Hamilton, Jure Leskovec, Dan Jurafsky. 2016. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.” ACL-2016 https://cs.stanford.edu/people/jure/pubs/diachronic-acl16.pdf

Suggested on semantic change (State of the Art): Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, Nina Tahmasebi. 2020. “SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection.” https://arxiv.org/pdf/2007.11464.pdf

(I’ll squeeze in the requested task of “custom named entity recognition” if I can.)

Slides: Part 1 and Part 2