Tutorials / Notebooks / Code
Burt Monroe (Penn State)
Produced for Penn State and Essex Courses in “Text as Data”
String Processing and Regular Expressions in R & Python
- Introduction to String Manipulation and Regular Expressions in R
- Notebook html: here
- Notebook .Rmd here
- Available on Essex RStudioCloud (Day 1 - Review project)
- Introduction to String Manipulation and Regular Expressions in Python
NLP / Text-as-Data Frameworks in R & Python
In R
In Python
Scraping and Data Wrangling:
- Scraping with rvest (R) (Example: United Nations meeting summaries)
- Notebook nb.html: here
- Notebook Rmd: here
- Scraping with RSelenium (R) (Example: UN Document Search)
- Notebook nb.html: here
- Notebook Rmd: here
- Scraping with Requests and BeautifulSoup (Python)
- Scraping with Scrapy (Python)
- Scraping with pattern (Python)
- Scraping with Selenium (Python)
- Dealing with PDFs (pdftools, tabulizer, and textreadr in R; xpdf/pdftotext in Unix; PyPDF2/PyPDF4, PDFQuery/Slate/PDFminer, xpdf, and tabula-py in Python)
- R Notebook nb.html: here
- R Notebook Rmd: here
- Dealing with .doc, .docx, .rtf files (textreadr in R; python-docx and python-docx2txt in Python)
- Dealing with XML files
- Dealing with JSON files
- Introduction to encoding, Unicode, UTF-8 and similar concepts
Measuring, Modeling, and Representation
- Introduction to Cosine Similarity (R)
- Notebook nb.html: here
- Notebook Rmd: here
- Introduction to Dictionary-based Analysis in R
- Introduction to Text Classification (Naive Bayes, Logistic/ridge/LASSO, Support Vector Machine, Random Forests, and ensembling) (R)
- Notebook nb.html: here
- Notebook Rmd: here
- Latent Dirichlet Allocation in R (topicmodels, lda, and MALLET)
- Latent Dirichlet Allocation in Python’s “lda” package.
- LDA and related analyses in gensim (Python).
- Introduction to the Structural Topic Model (R)
- Notebook nb.html: here
- Notebook Rmd: here
- Topic models and unsupervised learning with gensim (Python)
- Code for Fightin Words and Demo (R)
- Notebook nb.html: here
- Notebook Rmd: here
- Introduction to Scaling with Wordfish (R)
- Notebook nb.html: here
- Notebook Rmd: here
- Introduction to Estimating Word Embeddings with gensim (word2vec and fasttext) (Python)
- https://colab.research.google.com/drive/1eSzd2z5B3CDeTxpdMXCIh3bm1L-gYzCr?usp=sharing#scrollTo=54KJAKL0OD5Q
- (3rd party tutorials) Estimating GloVe embeddings in R with:
- text2vec: http://text2vec.org/glove.html
- quanteda: https://quanteda.io/articles/pkgdown/replication/text2vec.html
Neural NLP / Deep Learning
- (3rd party demo) Interactive Demo, (Feedforward) Neural Networks (Daniel Smilkov and Shan Carter, TensorFlow)
- Text Classification with Keras and Tensorflow in Python:
- https://colab.research.google.com/drive/1MG2_5Hx5dwN77hmVNY0aUiGo99k2mPGb?usp=sharing
- Text Classification with Keras and Tensorflow in R - needs to be ported from RStudio Cloud.
- Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (Python):
- https://colab.research.google.com/drive/1kGhXArEbWDP_A4TtlB1cgSubekIsX4VP?usp=sharing
- Text Classification with Keras and Tensorflow 2: Dropout and Weight Regularization (R):
- https://colab.research.google.com/drive/1hq9eCrWjDOkpMUY0QJ9fAOHWagcBSXU7?usp=sharing
- Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (Python):
- https://colab.research.google.com/drive/1pkJNzWDdqTaVzZFQ1RnkAxx87Wkyr31T?usp=sharing
- Text Classification with Keras and Tensorflow 3: Pretrained Embeddings (R): Not currently functional.
- Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (Python):
- https://colab.research.google.com/drive/1_6m2DVFQJPZH5UENZDs7jkrOU6kjyuCu?usp=sharing
- Text Classification with Keras and Tensorflow 4: Incorporating an Embedding Layer (R):
- https://colab.research.google.com/drive/1n1Al0lplHxY78P5vPATBp6kLUUYz6maA?usp=sharing
- (Older) Introduction to Deep Learning with Keras and TensorFlow in R
- Builds deep and shallow feed-forward ANN models for classification of IMDB data. Discusses interpretation. Compares to classic classifiers. Adds embedding layer with embeddings learned during estimation. Adds pretrained (GloVe) embeddings.
- Notebook nb.html: here
- Notebook Rmd: here
- Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (Python):
- https://colab.research.google.com/drive/1TDYGoskrMCbWGyS4X_kj8Ftzc6hURgz8?usp=sharing
- Text Classification with Keras and Tensorflow 5: LSTMs and Bi-LSTMs (R)
- Text Classification with Keras and Tensorflow 6: CNNs (Python)
- Text Classification with Keras and Tensorflow 6: CNNs (R)
- Third party notebooks on Transformers that may be of interest:
- Original Tensor2Tensor notebook (deprecated) (has illustration of self-attention): https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb
- Successor Trax notebook: https://colab.research.google.com/github/google/trax/blob/master/trax/intro.ipynb
- Text Classification with Transformer (Apoorv Nandan, 2020) - https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/text_classification_with_transformer.ipynb#scrollTo=anLSsILXyULq (IMDB sentiment)
- Text Classification with Keras and Tensorflow - BERT (Python):
- https://colab.research.google.com/drive/1OQbZQZtoOB7Kg3RR_nqh52gDivuPkaEU?usp=sharing
- Text Translation Using Pretrained Transformer (Encoder-Decoder) Language Models (Python):
- https://colab.research.google.com/drive/1d6SZzl1Rnxr25e8_ecR1vZGG156aOUk-?usp=sharing