TextAsDataCourse

Materials for "Text as Data" classes at Penn State and Essex.


Open Source Tools for Text as Data / NLP in Python

Notes for students in my Text as Data/NLP courses at Penn State / Essex

Python generics

Text processing libraries

See the Python text manipulation notebook for basic operations with str-typed variables, common string operations with module string, pattern matching with regular expressions in the re module, and manipulation and normalization of unicode strings with module unicodedata.
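
As a quick orientation before opening the notebook, the following is a minimal sketch of the four standard-library pieces mentioned above: str methods, the string module, re, and unicodedata. The sample sentence and the variable names are made up for illustration and are not taken from the notebook; accented characters are written as escapes so the example does not depend on file encoding.

```python
import re
import string
import unicodedata

# Hypothetical sample text (not from the course notebook).
# \u00e9 is "e with acute", \u00e0 is "a with grave".
text = "  Caf\u00e9 d\u00e9j\u00e0 vu: The U.S. Senate voted 52-48, again!  "

# Basic str operations: trim whitespace, lowercase, split on whitespace.
cleaned = text.strip().lower()
tokens = cleaned.split()

# The string module supplies constants such as string.punctuation,
# used here to build a translation table that deletes ASCII punctuation.
no_punct = cleaned.translate(str.maketrans("", "", string.punctuation))

# Regular expressions with re: find every run of digits.
numbers = re.findall(r"\d+", text)

# unicodedata: decompose accented characters (NFKD), then drop the
# combining marks, leaving only the base letters.
decomposed = unicodedata.normalize("NFKD", cleaned)
ascii_folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(tokens)
print(no_punct)
print(numbers)
print(ascii_folded)
```

Stripping accents this way is a lossy normalization that suits some languages and tasks better than others; whether to apply it is a preprocessing choice, not a default.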

NLP & text modeling

spaCy - https://spacy.io

NLTK (Natural Language Toolkit) - http://www.nltk.org

Stanza (formerly StanfordNLP) - https://stanfordnlp.github.io/stanza/

Stanford CoreNLP - https://stanfordnlp.github.io/CoreNLP/ (a Java toolkit; Stanza, above, provides a Python client for it)

UDPipe - https://ufal.mff.cuni.cz/udpipe

Apache OpenNLP - https://opennlp.apache.org/

Flair - https://github.com/flairNLP/flair

AllenNLP - https://github.com/allenai/allennlp (guide: https://guide.allennlp.org/)

SparkNLP - https://nlp.johnsnowlabs.com/

NLP Architect - https://github.com/IntelLabs/nlp-architect

torchtext - https://pytorch.org/text/stable/index.html

Polyglot - https://polyglot.readthedocs.io/en/latest/

PyNLPl - https://pynlpl.readthedocs.io/

fastText - https://fasttext.cc

gensim - https://radimrehurek.com/gensim

TextBlob - https://textblob.readthedocs.io/

pattern - https://clips.uantwerpen.be/pages/pattern

MontyLingua

Vocabulary

Web crawling and scraping

Requests

Scrapy

BeautifulSoup

Selenium

See also pattern (above)

Different filetypes

Deep learning frameworks

TensorFlow - https://www.tensorflow.org

PyTorch - https://pytorch.org

Keras

fastai

Theano - https://pypi.org/project/Theano

H2O.ai

Chainer - https://chainer.org