Materials for "Text as Data" classes at Penn State and Essex.
Kenneth Benoit. 2020. “Text as Data: An Overview.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. here
Jacob Eisenstein. 2018. “Introduction.” Natural Language Processing here
Notes: Open Source Tools for Text as Data / NLP in R
Notes: Open Source Tools for Text as Data / NLP in Python
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day1-Introduction
R Notebooks for Day 1 on RStudio Cloud: https://rstudio.cloud/
Python notebook on string manipulation and regular expressions: here
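A minimal taste of the sort of string manipulation and regular-expression work the notebook covers, in base Python (the example strings are illustrative, not drawn from the notebook):

```python
import re

text = "Benoit (2020) builds on Eisenstein (2018) and Jurafsky & Martin (2020)."

# Pull out all four-digit years
years = re.findall(r"\b\d{4}\b", text)
print(years)  # ['2020', '2018', '2020']

# A common preprocessing step: collapse whitespace and lowercase
clean = re.sub(r"\s+", " ", text).strip().lower()
print(clean)
```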
Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapters 3, 8, 14. “N-gram Language Models,” “Sequence Labeling for Parts of Speech and Named Entities,” “Dependency Parsing.” here
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day2-NLPPipelines.pdf
R Tutorials for Day 2 on RStudio Cloud: NLP Pipelines in R + spaCy in R
Python Tutorials on NLP Annotation Pipelines (spaCy and Stanza for now) here
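A minimal sketch of the kind of annotation pipeline these tutorials build, using spaCy (assumes the small English model has been downloaded with `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load a pretrained pipeline: tokenizer, tagger, parser, NER, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Kenneth Benoit wrote the overview chapter for SAGE.")

# Token-level annotations: lemma, part of speech, dependency relation
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Document-level annotations: named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
```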
Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 6, “Vector Semantics and Embeddings.” here
Pedro Rodriguez and Arthur Spirling (Forthcoming) “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research.” Journal of Politics. here
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day3-WordEmbeddings.pdf
Python Tutorial on Estimating Word Embeddings with gensim: https://colab.research.google.com/drive/1eSzd2z5B3CDeTxpdMXCIh3bm1L-gYzCr?usp=sharing
Third-party tutorial on estimating your own GloVe embeddings in R with text2vec: http://text2vec.org/glove.html. A replication of the same example from within quanteda: https://quanteda.io/articles/pkgdown/replication/text2vec.html
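To complement the gensim tutorial above, a minimal sketch of estimating word2vec embeddings (toy corpus and hyperparameters are purely illustrative; `vector_size` is the gensim 4 argument name, older versions call it `size`):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized "sentences"
sentences = [
    ["the", "president", "signed", "the", "bill"],
    ["the", "senate", "passed", "the", "bill"],
    ["the", "court", "struck", "down", "the", "law"],
]

# Train embeddings; with a corpus this small the resulting vectors
# are noise, but the API is the same at scale
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv.most_similar("bill", topn=3))
```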
Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 7, “Neural Networks and Neural Language Models.” here
Jay Alammar. 2016. “A Visual and Interactive Guide to the Basics of Neural Networks.” here
Jay Alammar. 2016. “A Visual and Interactive Look at Basic Neural Network Math.” here
Christopher Olah. 2014. “Deep Learning, NLP, and Representations.” here
Kakia Chatsiou and Slava Jankin Mikhaylov. 2020. “Deep Learning for Political Science.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. here
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day4-DeepLearningNLP.pdf
R Tutorial on Text Classification with Keras and Tensorflow, on RStudio Cloud in the Day 4 project
Python Tutorial on Text Classification with Keras and Tensorflow: https://colab.research.google.com/drive/1MG2_5Hx5dwN77hmVNY0aUiGo99k2mPGb?usp=sharing
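A minimal sketch of the kind of feed-forward Keras classifier those tutorials build, with random data standing in for a real document-term matrix:

```python
import numpy as np
from tensorflow import keras

# Stand-in data: 100 "documents" over a 1,000-term vocabulary, binary labels
rng = np.random.default_rng(42)
x = rng.random((100, 1000)).astype("float32")
y = rng.integers(0, 2, 100)

# Two hidden layers, sigmoid output for binary classification
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(1000,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32, validation_split=0.2)
```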
Study guide for exams: https://docs.google.com/document/d/1eZGUUzqTJCfQjQiJAn1DqhaVWeDoH4RSvOU8TC3WIqE/edit?usp=sharing
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day5-DeepLearningNLP2.pdf
Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (Python): https://colab.research.google.com/drive/1kGhXArEbWDP_A4TtlB1cgSubekIsX4VP?usp=sharing
Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (R): https://colab.research.google.com/drive/1hq9eCrWjDOkpMUY0QJ9fAOHWagcBSXU7?usp=sharing
Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (Python): https://colab.research.google.com/drive/1pkJNzWDdqTaVzZFQ1RnkAxx87Wkyr31T?usp=sharing
Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (R): Not currently functional.
Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (Python): https://colab.research.google.com/drive/1_6m2DVFQJPZH5UENZDs7jkrOU6kjyuCu?usp=sharing
Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (R): https://colab.research.google.com/drive/1n1Al0lplHxY78P5vPATBp6kLUUYz6maA?usp=sharing
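A minimal Keras sketch tying the Day 5 pieces together: an embedding layer, dropout, and L2 weight regularization (sizes are illustrative, not taken from the notebooks):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

vocab_size, seq_len = 10_000, 200  # illustrative sizes

model = keras.Sequential([
    # Learned embedding layer; to use pretrained vectors instead,
    # initialize its weights from a pretrained matrix and freeze it
    # with trainable=False
    layers.Embedding(vocab_size, 100, input_length=seq_len),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),  # randomly zero half the activations in training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```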
Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 9, “Deep Learning Architectures for Sequence Processing.” here
Jay Alammar. 2018. “The Illustrated Transformer.” here
Suggested: Andrew Halterman. 2019. “Geolocating Political Events.” here
Suggested: Han Zhang and Jennifer Pan. 2019. “CASM: A Deep-Learning Approach for Identifying Collective Action Events from Text and Image Data.” Sociological Methodology 49(1): 1-57. here
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day6-RNNsToTransformers.pdf
Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (Python): https://colab.research.google.com/drive/1TDYGoskrMCbWGyS4X_kj8Ftzc6hURgz8?usp=sharing
Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (R)
Text Classification with Keras and Tensorflow 6: CNNs (Python)
Text Classification with Keras and Tensorflow 6: CNNs (R)
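In the same spirit as the LSTM/Bi-LSTM notebooks above, a minimal bidirectional LSTM classifier in Keras (sizes illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10_000  # illustrative

model = keras.Sequential([
    layers.Embedding(vocab_size, 100),
    # Run the LSTM over the sequence forward and backward and
    # concatenate the two final states
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```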
Third-party notebooks on Transformers that may be of interest:
Original Tensor2Tensor notebook (deprecated; includes an illustration of self-attention): https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb
Successor Trax notebook: https://colab.research.google.com/github/google/trax/blob/master/trax/intro.ipynb
Text Classification with Transformer (Apoorv Nandan, 2020; IMDB sentiment): https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/text_classification_with_transformer.ipynb
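The core computation behind all of these notebooks is scaled dot-product self-attention; a minimal NumPy sketch with toy dimensions and random vectors:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output is a weighted mix of the values

# Five "tokens," each an 8-dimensional vector; self-attention sets Q = K = V
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
print(scaled_dot_product_attention(X, X, X).shape)  # (5, 8)
```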
Noah Smith. 2019. “Contextual Word Representations: A Contextual Introduction.”
Jay Alammar. 2018. “The Illustrated BERT, ELMo and Co. (How NLP Cracked Transfer Learning).” here
Jay Alammar. 2019. “A Visual Guide to Using BERT for the First Time.” here
Zhanna Terechshenko, Fridolin Linder, Vishakh Padmakumar, Michael Liu, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. 2020. “A Comparison of Methods in Political Science Text Classification: Transfer Learning Language Models for Politics.” [here](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3724644)
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day7-TransferLearning.pdf
Text Classification with Keras and Tensorflow - BERT: https://colab.research.google.com/drive/1OQbZQZtoOB7Kg3RR_nqh52gDivuPkaEU?usp=sharing
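If you just want to poke at a pretrained BERT-family model outside the notebook, a minimal sketch using the Hugging Face transformers library (these checkpoint names are common public ones, not necessarily those used in the course materials):

```python
from transformers import AutoTokenizer, pipeline

# Inspect BERT's WordPiece subword tokenization
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Multilingualism complicates tokenization."))

# An off-the-shelf classifier built on a fine-tuned BERT-family checkpoint
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("This course packs a lot into eight days."))
```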
Mitchell Goist and Burt L. Monroe. 2020. “Scaling the Tower of Babel: Common-Space Analysis of Political Text in Multiple Languages.”
Leah C. Windsor, James G. Cupit, and Alistair J. Windsor. 2019. “Automated content analysis across six languages.” PLoS ONE 14(11): e0224425. here
Slides: https://burtmonroe.github.io/TextAsDataCourse/Essex/EssexNLP-Day8-Multilingual.pdf
Text Translation Using Pretrained Transformer (Encoder-Decoder) Language Models: https://colab.research.google.com/drive/1d6SZzl1Rnxr25e8_ecR1vZGG156aOUk-?usp=sharing
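A minimal version of the same idea using the Hugging Face transformers library and a published MarianMT (OPUS-MT) encoder-decoder checkpoint; this assumes PyTorch is installed and is not necessarily the model used in the notebook:

```python
from transformers import MarianMTModel, MarianTokenizer

# One published English-to-German checkpoint from Helsinki-NLP's OPUS-MT family
name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["Text as data crosses language boundaries."],
                  return_tensors="pt", padding=True)
translated = model.generate(**batch)  # encoder-decoder generation
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```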
Emily M. Bender and Alexander Koller. 2020. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.” https://aclanthology.org/2020.acl-main.463.pdf
Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. “Bias in Word Embeddings.” https://dl.acm.org/doi/pdf/10.1145/3351095.3372843
Suggested on semantic change (the canonical “classic” cite): William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.” ACL 2016. https://cs.stanford.edu/people/jure/pubs/diachronic-acl16.pdf
Suggested on semantic change (state of the art): Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. “SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection.” https://arxiv.org/pdf/2007.11464.pdf
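To see for yourself the kind of bias Papakyriakopoulos et al. document, a minimal probe of pretrained GloVe vectors via gensim's downloader (the analogy pairs are illustrative; the first run downloads roughly 130 MB):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained GloVe KeyedVectors

# The classic analogy: king - man + woman
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same vector arithmetic can surface social bias absorbed from the
# training corpus, e.g. gendered associations with occupation terms
print(wv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))
```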
(I’ll squeeze in the requested task of “custom named entity recognition” if I can.)