TextAsDataCourse

Materials for "Text as Data" classes at Penn State and Essex.

View the Project on GitHub burtmonroe/TextAsDataCourse

Open Source Tools for Text as Data / NLP in R

Notes for students in my Text as Data/NLP courses at Penn State / Essex

R generics

Text processing and string manipulation

String manipulation operations, with particular focus on pattern matching with regular expressions in the stringr and stringi libraries, are addressed in this tutorial: https://burtmonroe.github.io/TextAsDataCourse/Tutorials/TADA-IntroToTextManipulation.nb.html. (There is a similar notebook for Python.)
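As a quick taste of the kind of pattern matching covered there, here is a minimal sketch using stringr. The example sentences and variable names are made up for illustration and do not come from the tutorial.

```r
library(stringr)

# Hypothetical example strings (not taken from the course materials).
docs <- c("The U.S. Senate voted 52-48 on March 3, 2021.",
          "No numbers here, just words!")

# Does each string contain at least one digit?
has_digit <- str_detect(docs, "[0-9]")

# Extract every run of digits from each string (returns a list,
# one character vector per input string).
numbers <- str_extract_all(docs, "[0-9]+")

# A simple cleaning step: lowercase, replace anything that is not a
# lowercase letter or whitespace with a space, then collapse whitespace.
clean <- str_squish(str_replace_all(str_to_lower(docs), "[^a-z\\s]", " "))

print(has_digit)
print(numbers)
print(clean)
```

The stringi package offers close equivalents (for example `stri_detect_regex()` and `stri_extract_all_regex()`); stringr is largely a friendlier wrapper around it.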

You may also be interested in the packages ore (https://github.com/jonclayden/ore) and rex (https://github.com/kevinushey/rex), which provide alternative regular expression engines or syntax, in both cases modeled on how regular expressions work in the programming language Ruby.

Also of note is the stringdist package (https://github.com/markvanderloo/stringdist), which provides string distance metrics (see Jurafsky and Martin, Chapter 2, for a good discussion of “edit distance”) and approximate (“fuzzy”) string matching.
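To make "edit distance" concrete, here is a minimal base R sketch of Levenshtein distance (insertions, deletions, and substitutions each cost 1) computed by dynamic programming. The function name `levenshtein` and the toy vocabulary are my own illustrations, not part of any package. In practice you would call `stringdist::stringdist(a, b, method = "lv")`, which computes the same quantity far more efficiently, and `stringdist::amatch()` for fuzzy lookup.

```r
# Levenshtein distance between two strings, unit cost for each edit.
# Illustrative only; hypothetical helper, not from the stringdist package.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[n + 1, m + 1]
}

levenshtein("kitten", "sitting")  # 3

# A toy fuzzy lookup: find the closest entry in a small vocabulary.
vocab <- c("senate", "senator", "sentence")
dists <- vapply(vocab, function(w) levenshtein("senat", w), numeric(1))
best_match <- names(which.min(dists))
print(best_match)
```

Note that Jurafsky and Martin also discuss a variant in which substitutions cost 2, so the numbers in their worked example differ from the unit-cost version sketched here.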

Text-as-data frameworks/ecosystems

Quanteda, tm, and tidytext are general – partially overlapping, interrelated, and interconnected – frameworks for text-as-data analysis, and most social scientific text work in R is managed through one of these. Their primary strengths are in the data science aspects of managing/wrangling text data as quantitative data for statistical / machine learning analysis.

quanteda (https://quanteda.io)

tm (“text mining”) - https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

tidytext - https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

corpustools

corpus

TextMiningGUI

RcmdrPlugin.temis

RTextTools
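To illustrate what "text data as quantitative data" means in practice, here is a minimal sketch that turns two made-up sentences into a document-feature matrix with quanteda. The toy texts and object names are assumptions for illustration only. Roughly the same result could be had in tm with `DocumentTermMatrix()` or in tidytext with `unnest_tokens()` followed by a count.

```r
library(quanteda)

# Two hypothetical toy documents.
texts <- c(doc1 = "Text as data: text is data.",
           doc2 = "Data science treats text as numbers.")

corp <- corpus(texts)                      # wrap the texts in a corpus object
toks <- tokens(corp, remove_punct = TRUE)  # split into word tokens
toks <- tokens_tolower(toks)               # normalize case
dfmat <- dfm(toks)                         # documents in rows, features in columns

print(dim(dfmat))
print(as.matrix(dfmat)[, c("text", "data")])
```

Once the text is in this matrix form, each row is an ordinary numeric observation that can be passed on to statistical or machine learning models.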

NLP pipelines

udpipe

spacyr and spaCy from R

Stanza / StanfordNLP / Stanford CoreNLP / coreNLP

OpenNLP

sparkNLP

cleanNLP

koRpus

Other popular libraries accessible through Python

All of the following are described in the Python notes: https://burtmonroe.github.io/TextAsDataCourse. I imagine you can access nltk in R through reticulate, although I don’t think I’ve tried. I doubt the others are even hypothetically accessible in R through reticulate.
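For what it is worth, an attempt at calling nltk through reticulate might look something like the sketch below. This is untested against any particular setup and assumes a Python installation with nltk that reticulate can find; the wrapper name `tokenize_with_nltk` is hypothetical. It uses nltk's `wordpunct_tokenize()`, which is regex-based and needs no extra data downloads, and returns `NULL` rather than failing when nltk is unavailable.

```r
library(reticulate)

# Hypothetical wrapper: tokenize one string with nltk if it is available.
tokenize_with_nltk <- function(text) {
  # Guard: bail out quietly if Python or the nltk module cannot be found.
  if (!py_module_available("nltk")) {
    message("nltk is not available to reticulate; skipping.")
    return(NULL)
  }
  tok <- import("nltk.tokenize")
  # reticulate converts the returned Python list of strings
  # into an R character vector.
  tok$wordpunct_tokenize(text)
}

result <- tokenize_with_nltk("Hello, world!")
print(result)
```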

Text-as-data modeling / topic models / embeddings

topicmodels

stm (Structural Topic Model) - https://www.structuraltopicmodel.com

lda (Latent Dirichlet Allocation)

tidylda

text2vec

wordspace

Deep Learning

keras / tensorflow

torch

Utility packages

SnowballC / Rstem

stopwords

tokenizers

tokenizers.bpe

NLP

tif - Text Interchange Format

hunspell

wordnet

tau

tokenbrowser

textreuse