Updated September 2021.
The quanteda package (https://quanteda.io) is a very general and well-documented ecosystem for text analysis in R. A very large percentage of what is typically done in social science text-as-data research can be done with, or at least through, quanteda. Among the "competitors" to quanteda are the classic package tm and the tidyverse-consistent package tidytext. These actually are interrelated, with shared code and conversion utilities available, so they aren't necessarily in conflict.
Official description:
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
In addition to the extensive documentation, Stefan Muller and Ken Benoit have a very helpful cheatsheet here: https://muellerstefan.net/files/quanteda-cheatsheet.pdf.
In this notebook, we will use quanteda to turn a collection of texts, a corpus, into quantitative data, with each document represented by the counts of the "words" in it. Since we do away with word order, this is called a bag-of-words representation.
Install the following packages if you haven't.
# install.packages("quanteda", dependencies=TRUE)
# install.packages("tokenizers", dependencies=TRUE)
# install.packages("quanteda.textplots", dependencies=TRUE)
# install.packages("RColorBrewer", dependencies=TRUE)
Note that quanteda has in recent versions moved analysis and plotting functions to new packages quanteda.textplots, quanteda.textmodels (classification and scaling models), and quanteda.textstats.
Now load quanteda:
library(quanteda)
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
If you are working on RStudio Cloud, you may have received a warning message about the "locale." You can set the locale to British English ("en_GB") with the stri_locale_set command from the already loaded stringi package. You may wish to set it to assume you are working in a different context (e.g., "en_US" for US English) or language (e.g., "pt_BR" for Brazilian Portuguese). This seems to happen every time an RStudio Cloud project with quanteda loaded is reopened, so you have to reissue this command to make the warning message go away.
# stringi::stri_locale_set("en_GB")
Quanteda comes with several corpora included. Let's load in the corpus of US presidential inaugural addresses and see what it looks like:
corp <- quanteda::data_corpus_inaugural
summary(corp)
What does a document look like? Let's look at one document (George Washington's first inaugural), which can be accessed with the as.character method. (The previous command, texts(), has been deprecated.)
as.character(corp[1])
## 1789-Washington
## "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. 
These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
The first task is tokenizing. You can apply a tokenizer in quanteda with the tokens command, turning a "corpus" object -- or just a vector of texts -- into a "tokens" object. In the latest versions of quanteda, most commands operate on a tokens object.
The examples from the help file will be used to show a few of the options:
txt <- c(doc1 = "A sentence, showing how tokens() works.",
doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
doc3 = "Self-documenting code??",
doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
## Tokens consisting of 4 documents.
## doc1 :
## [1] "A" "sentence" "," "showing" "how" "tokens"
## [7] "(" ")" "works" "."
##
## doc2 :
## [1] "@quantedainit" "and"
## [3] "#textanalysis" "https://example.com?p=123."
##
## doc3 :
## [1] "Self-documenting" "code" "?" "?"
##
## doc4 :
## [1] "£" "1,000,000" "for" "50" "¢"
## [6] "is" "gr8" "4ever" "\U0001f600"
The what option selects different tokenizers. The default is "word", which replaces the slower and less subtle legacy tokenizer "word1".
tokens(txt, what = "word1")
## Tokens consisting of 4 documents.
## doc1 :
## [1] "A" "sentence" "," "showing" "how" "tokens"
## [7] "(" ")" "works" "."
##
## doc2 :
## [1] "@" "quantedainit" "and" "#" "textanalysis"
## [6] "https" ":" "/" "/" "example.com"
## [11] "?" "p"
## [ ... and 3 more ]
##
## doc3 :
## [1] "Self-documenting" "code" "?" "?"
##
## doc4 :
## [1] "£" "1,000,000" "for" "50" "¢"
## [6] "is" "gr8" "4ever" "\U0001f600"
For some purposes you may wish to tokenize by characters:
tokens(txt[1], what = "character")
## Tokens consisting of 1 document.
## doc1 :
## [1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
## [ ... and 22 more ]
You can "tokenize" (the usual term is "segment") by sentence in Quanteda, but note that they recommend the spacyr package (discussed in a separate notebook) for better sentence segmentation. Let's try it on Washington's inaugural:
tokens(corp[1], what = "sentence")
## Tokens consisting of 1 document and 4 docvars.
## 1789-Washington :
## [1] "Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month."
## [2] "On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time."
## [3] "On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies."
## [4] "In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected."
## [5] "All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated."
## [6] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge."
## [7] "In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either."
## [8] "No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States."
## [9] "Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage."
## [10] "These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed."
## [11] "You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence."
## [12] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\""
## [ ... and 11 more ]
Wow, those are long sentences. Out of curiosity, let's look at Trump's:
tokens(corp[58], what = "sentence")
## Tokens consisting of 1 document and 4 docvars.
## 2017-Trump :
## [1] "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you."
## [2] "We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people."
## [3] "Together, we will determine the course of America and the world for many, many years to come."
## [4] "We will face challenges."
## [5] "We will confront hardships."
## [6] "But we will get the job done."
## [7] "Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition."
## [8] "They have been magnificent."
## [9] "Thank you."
## [10] "Today's ceremony, however, has very special meaning."
## [11] "Because today we are not merely transferring power from one Administration to another, or from one party to another - but we are transferring power from Washington DC and giving it back to you, the people."
## [12] "For too long, a small group in our nation's Capital has reaped the rewards of government while the people have borne the cost."
## [ ... and 76 more ]
Those are ... shorter.
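To put a rough number on that impression, we can divide each address's word count by its sentence count. This is just a sketch: ntoken() counts tokens per document, and for a corpus it word-tokenizes with the defaults (so punctuation tokens are included in the count).
# Approximate average sentence length (tokens per sentence) for the two addresses
ntoken(corp[1]) / ntoken(tokens(corp[1], what = "sentence"))    # Washington 1789
ntoken(corp[58]) / ntoken(tokens(corp[58], what = "sentence"))  # Trump 2017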
There are a number of options you can apply with the tokens command, controlling how the tokenizer deals with punctuation, numbers, symbols, hyphenation, etc. Again, just the help file examples:
# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "A" "sentence" "showing" "how" "tokens" "works"
##
## doc2 :
## [1] "@quantedainit" "and"
## [3] "#textanalysis" "https://example.com?p=123."
# splitting hyphenated words
tokens(txt[3])
## Tokens consisting of 1 document.
## doc3 :
## [1] "Self-documenting" "code" "?" "?"
tokens(txt[3], split_hyphens = TRUE)
## Tokens consisting of 1 document.
## doc3 :
## [1] "Self" "-" "documenting" "code" "?"
## [6] "?"
# symbols and numbers
tokens(txt[4])
## Tokens consisting of 1 document.
## doc4 :
## [1] "£" "1,000,000" "for" "50" "¢"
## [6] "is" "gr8" "4ever" "\U0001f600"
tokens(txt[4], remove_numbers = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "£" "for" "¢" "is" "gr8"
## [6] "4ever" "\U0001f600"
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "for" "is" "gr8" "4ever"
You can use other tokenizers, like those from the "tokenizers" package. The output of a command like tokenizers::tokenize_words
can be passed to the tokens command:
# install.packages("tokenizers")
library(tokenizers)
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "1,000,000" "for" "50" "is" "gr8" "4ever"
# using pipe notation
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) %>%
tokens(remove_symbols = TRUE)
## Tokens consisting of 4 documents.
## doc1 :
## [1] "A" "sentence" "," "showing" "how" "tokens"
## [7] "(" ")" "works" "."
##
## doc2 :
## [1] "@" "quantedainit" "and" "#" "textanalysis"
## [6] "https" ":" "/" "/" "example.com"
## [11] "?" "p"
## [ ... and 2 more ]
##
## doc3 :
## [1] "Self" "-" "documenting" "code" "?"
## [6] "?"
##
## doc4 :
## [1] "1,000,000" "for" "50" "is" "gr8" "4ever"
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) %>%
tokens(remove_punct = TRUE)
## Tokens consisting of 1 document.
## doc3 :
## [1] "s" "e" "l" "f" "d" "o" "c" "u" "m" "e" "n" "t"
## [ ... and 7 more ]
tokenizers::tokenize_sentences(
"The quick brown fox. It jumped over the lazy dog.") %>%
tokens()
## Tokens consisting of 1 document.
## text1 :
## [1] "The quick brown fox." "It jumped over the lazy dog."
Look carefully -- what did it do differently?
Let's make a fairly generic tokens object from our inaugural speeches corpus.
inaugural_tokens <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE, # default FALSE
remove_symbols = TRUE, # default FALSE
remove_numbers = FALSE,
remove_url = TRUE, # default FALSE
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE,
verbose = quanteda_options("verbose")
)
This produces a tokens class object. Expand the object in your RStudio Environment tab to take a look at it. At its core, it's a list with one entry per document, each a character vector of that document's tokens.
inaugural_tokens[["2017-Trump"]][1:30]
## [1] "Chief" "Justice" "Roberts" "President" "Carter" "President"
## [7] "Clinton" "President" "Bush" "President" "Obama" "fellow"
## [13] "Americans" "and" "people" "of" "the" "world"
## [19] "thank" "you" "We" "the" "citizens" "of"
## [25] "America" "are" "now" "joined" "in" "a"
It also has a vector of the "types" -- the vocabulary of tokens in the whole corpus/object. This attribute can be accessed through the attr
function.
attr(inaugural_tokens,"types")[1:30]
## [1] "Fellow-Citizens" "of" "the" "Senate"
## [5] "and" "House" "Representatives" "Among"
## [9] "vicissitudes" "incident" "to" "life"
## [13] "no" "event" "could" "have"
## [17] "filled" "me" "with" "greater"
## [21] "anxieties" "than" "that" "which"
## [25] "notification" "was" "transmitted" "by"
## [29] "your" "order"
length(attr(inaugural_tokens, "types"))
## [1] 10147
Just over 10,000 unique types have been used. Notice that "the" appears third in the type vector and never again. But "The" appears as well:
which(attr(inaugural_tokens,"types")=="The")
## [1] 339
Why are "the" and "The" different types? Why is "Fellow-Citizens" a single type?
Under the hood, the tokens vector isn't a vector of strings. It's a vector of integers indicating each token's index in the type vector. So every time "the" appears, it is stored as the integer 3.
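We can peek at this directly. A minimal sketch, assuming the quanteda 3.x internal representation (a list of integer vectors plus a "types" attribute -- an implementation detail that could change in later versions):
# Strip the class to see the raw integer codes, then map them back to types
head(unclass(inaugural_tokens)[["1789-Washington"]], 10)
attr(inaugural_tokens, "types")[head(unclass(inaugural_tokens)[["1789-Washington"]], 10)]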
By default, the tokens object also retains all of the document metadata that came with the corpus.
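For example, the docvars() accessor returns the document-level variables that came with the inaugural corpus (Year, President, FirstName, and Party) as a data frame:
# The corpus docvars travel with the tokens object
head(docvars(inaugural_tokens))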
The tokens object also provides access to a variety of quanteda utilities. For example, a very helpful traditional qualitative tool is the Key Words in Context or kwic
command:
kwic(inaugural_tokens, "humble", window=3)
kwic(inaugural_tokens, "tombstones", window=4)
Hmmm. Moving on.
Stemming is the truncation of words in an effort to associate related words with a common token, e.g., "baby" and "babies" -> "babi".
The tokenizers package provides a wrapper to the wordStem function from the SnowballC package, which applies the standard Porter stemmer. (The function takes a vector of texts or a corpus as input and returns a list, with each element a vector of the stems for the corresponding text.)
tokenizers::tokenize_word_stems(corp)$`2017-Trump`[1:50]
## [1] "chief" "justic" "robert" "presid" "carter" "presid"
## [7] "clinton" "presid" "bush" "presid" "obama" "fellow"
## [13] "american" "and" "peopl" "of" "the" "world"
## [19] "thank" "you" "we" "the" "citizen" "of"
## [25] "america" "are" "now" "join" "in" "a"
## [31] "great" "nation" "effort" "to" "rebuild" "our"
## [37] "countri" "and" "restor" "it" "promis" "for"
## [43] "all" "of" "our" "peopl" "togeth" "we"
## [49] "will" "determin"
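Quanteda can do the same thing directly on a tokens object with tokens_wordstem(), which also wraps the SnowballC stemmer. A minimal sketch -- the result should largely match the output above, up to tokenization differences, since inaugural_tokens already had punctuation removed:
# Stemming within quanteda itself, on the tokens object we built earlier
tokens_wordstem(tokens_tolower(inaugural_tokens))[["2017-Trump"]][1:10]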
Quanteda is focused largely on bag-of-words (or bag-of-tokens or bag-of-terms) models that work from a document-term matrix, where each row represents a document, each column represents a type (a "term" in the vocabulary), and the entries are the counts of tokens matching that term in the given document.
For this we will use quanteda's dfm command with some commonly chosen preprocessing options. In older versions of quanteda, the dfm function was applied to a corpus, with tokenizing and normalizing options applied there. It is now applied to a tokens object where most of that has already been done. Here, we'll add case-folding, merging "the" and "The", among other things, into a single type.
doc_term_matrix <- quanteda::dfm(inaugural_tokens,
tolower = TRUE # case-fold
)
What kind of object is doc_term_matrix?
class(doc_term_matrix)
## [1] "dfm"
## attr(,"package")
## [1] "quanteda"
Typing the dfm's name will show an object summary. This is a matrix, so how many rows does it have? How many columns? What does "91.89% sparse" mean?
doc_term_matrix
## Document-feature matrix of: 59 documents, 9,422 features (91.89% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives
## 1789-Washington 1 71 116 1 48 2 2
## 1793-Washington 0 11 13 0 2 0 0
## 1797-Adams 3 140 163 1 130 0 2
## 1801-Jefferson 2 104 130 0 81 0 0
## 1805-Jefferson 0 101 143 0 93 0 0
## 1809-Madison 1 69 104 0 43 0 0
## features
## docs among vicissitudes incident
## 1789-Washington 1 1 1
## 1793-Washington 0 0 0
## 1797-Adams 4 0 0
## 1801-Jefferson 1 0 0
## 1805-Jefferson 7 0 0
## 1809-Madison 0 0 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,412 more features ]
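If you want to verify the sparsity figure, it's the share of cells in the matrix that are zero. quanteda's sparsity() reports it directly; as a sketch, you can also reconstruct it from docfreq(), which counts, for each feature, the number of documents it appears in:
sparsity(doc_term_matrix)
# By hand: 1 minus the share of nonzero cells
# (the sum of document frequencies equals the number of nonzero cells)
1 - sum(docfreq(doc_term_matrix)) / (ndoc(doc_term_matrix) * nfeat(doc_term_matrix))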
You can peek inside it, indexing it like you would a matrix or Matrix object:
doc_term_matrix[1:5,1:5]
## Document-feature matrix of: 5 documents, 5 features (20.00% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and
## 1789-Washington 1 71 116 1 48
## 1793-Washington 0 11 13 0 2
## 1797-Adams 3 140 163 1 130
## 1801-Jefferson 2 104 130 0 81
## 1805-Jefferson 0 101 143 0 93
What are the most frequent terms?
topfeatures(doc_term_matrix,40)
## the of and to in a our
## 10183 7180 5406 4591 2827 2292 2224
## we that be is it for by
## 1827 1813 1502 1491 1398 1230 1091
## have which not with as will this
## 1031 1007 980 970 966 944 874
## i all are their but has people
## 871 836 828 761 670 631 584
## from its government or on my us
## 578 573 564 563 544 515 505
## been can no they so
## 496 487 470 463 397
You can get the same thing by sorting the column sums of the dtm:
word_freq <- colSums(doc_term_matrix)
sort(word_freq,decreasing=TRUE)[1:40]
## the of and to in a our
## 10183 7180 5406 4591 2827 2292 2224
## we that be is it for by
## 1827 1813 1502 1491 1398 1230 1091
## have which not with as will this
## 1031 1007 980 970 966 944 874
## i all are their but has people
## 871 836 828 761 670 631 584
## from its government or on my us
## 578 573 564 563 544 515 505
## been can no they so
## 496 487 470 463 397
For some purposes, you may wish to remove "stopwords." There are stopword lists accessible through the stopwords function, exported from the automatically loaded stopwords package. The default is English from the Snowball collection. Get a list of sources with stopwords_getsources() and a list of languages for a given source with stopwords_getlanguages(). The default English list is fairly short.
stopwords('en')[1:10] #Snowball
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you" "your"
length(stopwords('en'))
## [1] 175
This one's more than three times as long.
stopwords('en', source='smart')[1:10]
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
length(stopwords('en', source='smart'))
## [1] 571
This one's over seven times as long and is ... interesting.
stopwords('en', source='stopwords-iso')[1:10]
## [1] "'ll" "'tis" "'twas" "'ve" "10" "39"
## [7] "a" "a's" "able" "ableabout"
length(stopwords('en', source='stopwords-iso'))
## [1] 1298
The beginning of a German list.
stopwords('de')[1:10]
## [1] "aber" "alle" "allem" "allen" "aller" "alles" "als" "also" "am"
## [10] "an"
A slice from an Ancient Greek list:
stopwords('grc',source='ancient')[264:288]
## [1] "xxx" "xxxi" "xxxii" "xxxiii" "xxxiv" "xxxix" "xxxv"
## [8] "xxxvi" "xxxvii" "xxxviii" "y" "z" "α" "ἅ"
## [15] "ἃ" "ᾇ" "ἄγαν" "ἄγε" "ἄγχι" "ἀγχοῦ" "ἁγώ"
## [22] "ἁγὼ" "ἅγωγ" "ἁγών" "ἁγὼν"
Let's case-fold our tokens object to lowercase, remove the stopwords, then make a new dtm and see how it's different.
inaugural_tokens.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.nostop <- dfm(inaugural_tokens.nostop)
dtm.nostop
## Document-feature matrix of: 59 documents, 9,284 features (92.70% sparse) and 4 docvars.
## features
## docs fellow-citizens senate house representatives among
## 1789-Washington 1 1 2 2 1
## 1793-Washington 0 0 0 0 0
## 1797-Adams 3 1 0 2 4
## 1801-Jefferson 2 0 0 0 1
## 1805-Jefferson 0 0 0 0 7
## 1809-Madison 1 0 0 0 0
## features
## docs vicissitudes incident life event filled
## 1789-Washington 1 1 1 2 1
## 1793-Washington 0 0 0 0 0
## 1797-Adams 0 0 2 0 0
## 1801-Jefferson 0 0 1 0 0
## 1805-Jefferson 0 0 2 0 0
## 1809-Madison 0 0 1 0 1
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,274 more features ]
We've got about 140 fewer features, and it is slightly more sparse. Why?
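A quick check of which features were dropped -- these should be exactly the stopwords that actually occur in the corpus:
dropped <- setdiff(featnames(doc_term_matrix), featnames(dtm.nostop))
length(dropped)
head(dropped, 20)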
What are the most frequent tokens now?
topfeatures(dtm.nostop,40)
## people government us can must upon
## 584 564 505 487 376 371
## great may states world shall country
## 344 343 334 319 316 308
## nation every one peace new power
## 305 300 267 258 250 241
## now public time citizens constitution united
## 229 225 220 209 209 203
## america nations union freedom free war
## 202 199 190 185 183 181
## american let national made good make
## 172 160 158 156 149 147
## years justice men without
## 143 142 140 140
I'm just curious. Besides "tombstones," what other words made their inaugural debut in 2017?
unique_to_trump <- as.vector(colSums(doc_term_matrix) == doc_term_matrix["2017-Trump",])
colnames(doc_term_matrix)[unique_to_trump]
## [1] "obama" "hardships" "lady" "michelle"
## [5] "transferring" "dc" "reaped" "politicians"
## [9] "2017" "listening" "likes" "neighborhoods"
## [13] "trapped" "rusted-out" "tombstones" "landscape"
## [17] "flush" "stolen" "robbed" "unrealized"
## [21] "carnage" "stops" "we've" "subsidized"
## [25] "allowing" "sad" "depletion" "trillions"
## [29] "overseas" "infrastructure" "disrepair" "ripped"
## [33] "redistributed" "issuing" "ravages" "stealing"
## [37] "tunnels" "hire" "goodwill" "shine"
## [41] "reinforce" "islamic" "bedrock" "disagreements"
## [45] "solidarity" "unstoppable" "complaining" "arrives"
## [49] "mysteries" "brown" "bleed" "sprawl"
## [53] "windswept" "nebraska" "ignored"
OK!
We can also change the settings. What happens if we don't remove punctuation?
inaugural_tokens.wpunct <- quanteda::tokens(corp,
what = "word",
remove_punct = FALSE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.wpunct <- dfm(inaugural_tokens.wpunct)
dtm.wpunct
## Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
## features
## docs fellow-citizens senate house representatives : among
## 1789-Washington 1 1 2 2 1 1
## 1793-Washington 0 0 0 0 1 0
## 1797-Adams 3 1 0 2 0 4
## 1801-Jefferson 2 0 0 0 1 1
## 1805-Jefferson 0 0 0 0 0 7
## 1809-Madison 1 0 0 0 0 0
## features
## docs vicissitudes incident life event
## 1789-Washington 1 1 1 2
## 1793-Washington 0 0 0 0
## 1797-Adams 0 0 2 0
## 1801-Jefferson 0 0 1 0
## 1805-Jefferson 0 0 2 0
## 1809-Madison 0 0 1 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]
topfeatures(dtm.wpunct,40)
## , . people ; government us
## 7173 5155 584 565 564 505
## can - must upon great may
## 487 403 376 371 344 343
## states world shall country nation every
## 334 319 316 308 305 300
## one peace " new power now
## 267 258 256 250 241 229
## public time citizens constitution united america
## 225 220 209 209 203 202
## nations union freedom free war american
## 199 190 185 183 181 172
## let national made good
## 160 158 156 149
How big is it now? How sparse is it now?
What happens if we lower case and stem?
inaugural_tokens.stems <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_wordstem()
dtm.stems <- dfm(inaugural_tokens.stems)
dtm.stems
## Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
## features
## docs fellow-citizen senat hous repres among vicissitud incid life
## 1789-Washington 1 1 2 2 1 1 1 1
## 1793-Washington 0 0 0 0 0 0 0 0
## 1797-Adams 3 1 3 3 4 0 0 2
## 1801-Jefferson 2 0 0 1 1 0 0 1
## 1805-Jefferson 0 0 0 0 7 0 0 2
## 1809-Madison 1 0 0 1 0 1 0 1
## features
## docs event fill
## 1789-Washington 2 1
## 1793-Washington 0 0
## 1797-Adams 0 0
## 1801-Jefferson 0 0
## 1805-Jefferson 1 0
## 1809-Madison 0 1
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
topfeatures(dtm.stems,40)
## nation govern peopl us can state great must
## 691 657 632 505 487 452 378 376
## power upon countri world may shall everi constitut
## 375 371 359 347 343 316 300 289
## peac one right time law citizen american new
## 288 279 279 271 271 265 257 250
## america public now unit duti war make interest
## 242 229 229 225 212 204 202 197
## union freedom free secur hope year good let
## 190 190 184 178 176 176 163 160
It's somewhat difficult to get your head around these sorts of things, but there are statistical regularities here. For example, these frequencies tend to be distributed according to "Zipf's Law" and a (related) "power law."
plot(1:ncol(doc_term_matrix),sort(colSums(doc_term_matrix),dec=T), main = "Zipf's Law?", ylab="Frequency", xlab = "Frequency Rank")
That makes the "long tail" clear. The overall relationship becomes clearer on a logarithmic scale:
plot(1:ncol(doc_term_matrix),sort(colSums(doc_term_matrix),dec=T), main = "Zipf's Law?", ylab="Frequency", xlab = "Frequency Rank", log="xy")
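Under Zipf's law the log-log relationship is roughly linear with a slope near -1. A rough check, fitting ordinary least squares on the log-log scale (a sketch, not a serious estimator for power laws):
# Fit log(frequency) on log(rank); the slope should be in the neighborhood of -1
zipf_freq <- sort(colSums(doc_term_matrix), decreasing = TRUE)
zipf_fit <- lm(log(zipf_freq) ~ log(seq_along(zipf_freq)))
coef(zipf_fit)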
For the power law, we need the number of words that appear at any given frequency. We'll turn word_freq into a categorical variable by making it a "factor" -- the categories are "1", "2", ... "17", etc. -- and then use summary to count each "category." (The maxsum option ensures it doesn't stop at 100 categories and lump everything else together as "Other.")
words_with_freq <- summary(as.factor(word_freq),maxsum=10000)
freq_bin <- as.integer(names(words_with_freq))
plot(freq_bin, words_with_freq, main="Power Law?", xlab="Word Frequency", ylab="Number of Words", log="xy")
Zipf's law implies that, in a new corpus say, a small number of terms will be very common (we'll know a lot about them, but they won't help us distinguish documents), a large number of terms will be very rare (we'll know very little about them), and that there will be some number of terms we have never seen before. This "out-of-vocabulary" (OOV) problem is an important one in some applications.
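As a rough illustration with the dfm we already have: how many types in the most recent address (the last document in this version of the corpus, the 2021 inaugural) never appeared in any earlier inaugural? A sketch:
# Features used in the last address but in none of the earlier ones
earlier <- doc_term_matrix[1:(ndoc(doc_term_matrix) - 1), ]
latest <- doc_term_matrix[ndoc(doc_term_matrix), ]
new_types <- featnames(latest)[colSums(latest) > 0 & colSums(earlier) == 0]
length(new_types)
head(new_types, 20)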
Let's go back to preprocessing choices. What happens if we count bigrams? Let's first do it without removing stopwords.
inaugural_tokens.2grams <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_ngrams(n=2)
dtm.2grams <- dfm(inaugural_tokens.2grams)
dtm.2grams
## Document-feature matrix of: 59 documents, 65,497 features (97.10% sparse) and 4 docvars.
## features
## docs fellow-citizens_of my_fellow of_the the_senate
## 1789-Washington 1 2 20 1
## 1793-Washington 0 0 4 0
## 1797-Adams 0 0 29 0
## 1801-Jefferson 0 1 28 0
## 1805-Jefferson 0 2 17 0
## 1809-Madison 0 0 20 0
## features
## docs fellow_citizens senate_and and_of the_house house_of
## 1789-Washington 2 1 2 2 2
## 1793-Washington 1 0 1 0 0
## 1797-Adams 0 0 2 0 0
## 1801-Jefferson 5 0 3 0 0
## 1805-Jefferson 8 0 1 0 0
## 1809-Madison 0 0 2 0 0
## features
## docs of_representatives
## 1789-Washington 2
## 1793-Washington 0
## 1797-Adams 0
## 1801-Jefferson 0
## 1805-Jefferson 0
## 1809-Madison 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 65,487 more features ]
topfeatures(dtm.2grams,40)
## of_the in_the to_the of_our
## 1772 821 727 628
## and_the it_is by_the for_the
## 474 324 322 315
## to_be the_people we_have of_a
## 313 270 264 262
## with_the that_the the_world have_been
## 240 221 214 205
## will_be has_been on_the is_the
## 199 185 183 183
## we_are from_the and_to the_government
## 178 168 166 165
## the_united united_states all_the in_our
## 164 158 157 156
## the_constitution we_will of_all of_this
## 156 156 153 142
## of_their should_be in_a and_in
## 141 139 138 132
## those_who we_must of_my may_be
## 129 128 126 126
How big is it? How sparse? It doesn't give us a lot of sense of content, but it does offer some rudimentary insights into how English is structured.
For example, we can create a rudimentary statistical language model that "predicts" the next word based on bigram frequencies. We estimate the conditional probability of each candidate next word by taking the frequency of the bigram that starts with the current word and dividing it by the total frequency of all bigrams starting with that word (see Jurafsky and Martin, Speech and Language Processing, Chapter 3: https://web.stanford.edu/~jurafsky/slp3/ for more detail / nuance).
If the current word is "american" what is probably next, in this corpus?
First we find the right bigrams using a regular expression. See the regular expressions notebook for more detail if that is unfamiliar.
american_bigrams <- grep("^american_",colnames(dtm.2grams),value=TRUE)
american_bigrams
## [1] "american_people" "american_renewal" "american_on"
## [4] "american_covenant" "american_freemen" "american_lives"
## [7] "american_treasure" "american_belief" "american_heart"
## [10] "american_business" "american_century" "american_dream"
## [13] "american_policy" "american_here" "american_in"
## [16] "american_to" "american_promise" "american_spirit"
## [19] "american_story" "american_today" "american_conscience"
## [22] "american_auspices" "american_above" "american_freedom"
## [25] "american_interests" "american_a" "american_instinct"
## [28] "american_family" "american_citizenship" "american_rights"
## [31] "american_industries" "american_labor" "american_citizens"
## [34] "american_flag" "american_market" "american_that"
## [37] "american_is" "american_soldiers" "american_she"
## [40] "american_states" "american_name" "american_carnage"
## [43] "american_industry" "american_workers" "american_being"
## [46] "american_families" "american_hands" "american_and"
## [49] "american_we" "american_merchant" "american_navy"
## [52] "american_products" "american_destiny" "american_enjoys"
## [55] "american_way" "american_arms" "american_revolution"
## [58] "american_ideal" "american_slavery" "american_who"
## [61] "american_emancipation" "american_steamship" "american_history"
## [64] "american_life" "american_i" "american_control"
## [67] "american_opportunity" "american_if" "american_sovereignty"
## [70] "american_citizen" "american_anthem" "american_sound"
## [73] "american_he" "american_must" "american_failure"
## [76] "american_sense" "american_manliness" "american_character"
## [79] "american_achievement" "american_democracy" "american_standards"
## [82] "american_bottoms" "american_childhood" "american_experiment"
## [85] "american_statesmen" "american_ideals" "american_statesmanship"
## [88] "american_subjects" "american_political"
Most likely bigrams starting with "american":
freq_american_bigrams <- colSums(dtm.2grams[,american_bigrams])
most_likely_bigrams <- sort(freq_american_bigrams/sum(freq_american_bigrams),dec=TRUE)[1:10]
most_likely_bigrams
## american_people american_story american_citizen
## 0.23255814 0.03488372 0.03488372
## american_dream american_to american_citizenship
## 0.02906977 0.02325581 0.02325581
## american_citizens american_policy american_labor
## 0.02325581 0.01744186 0.01744186
## american_that
## 0.01744186
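If you wanted to "generate" from this toy model, you could draw the next word in proportion to those estimates. A sketch using base R's sample() and the objects built above:
# Draw one "next word" after "american", weighted by the estimated probabilities
set.seed(42)
sample(sub("^american_", "", names(freq_american_bigrams)), size = 1,
       prob = freq_american_bigrams)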
Let's see what happens if we remove the stopwords first.
inaugural_tokens.2grams.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_ngrams(n=2)
dtm.2grams.nostop <- dfm(inaugural_tokens.2grams.nostop)
dtm.2grams.nostop
## Document-feature matrix of: 59 documents, 57,723 features (98.12% sparse) and 4 docvars.
## features
## docs fellow-citizens_senate senate_house house_representatives
## 1789-Washington 1 1 2
## 1793-Washington 0 0 0
## 1797-Adams 0 0 0
## 1801-Jefferson 0 0 0
## 1805-Jefferson 0 0 0
## 1809-Madison 0 0 0
## features
## docs representatives_among among_vicissitudes
## 1789-Washington 1 1
## 1793-Washington 0 0
## 1797-Adams 0 0
## 1801-Jefferson 0 0
## 1805-Jefferson 0 0
## 1809-Madison 0 0
## features
## docs vicissitudes_incident incident_life life_event event_filled
## 1789-Washington 1 1 1 1
## 1793-Washington 0 0 0 0
## 1797-Adams 0 0 0 0
## 1801-Jefferson 0 0 0 0
## 1805-Jefferson 0 0 0 0
## 1809-Madison 0 0 0 0
## features
## docs filled_greater
## 1789-Washington 1
## 1793-Washington 0
## 1797-Adams 0
## 1801-Jefferson 0
## 1805-Jefferson 0
## 1809-Madison 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 57,713 more features ]
topfeatures(dtm.2grams.nostop,40)
## united_states let_us fellow_citizens
## 158 105 78
## american_people federal_government men_women
## 40 32 28
## years_ago four_years upon_us
## 27 26 25
## general_government one_another constitution_united
## 25 22 20
## government_can every_citizen fellow_americans
## 20 20 19
## vice_president great_nation government_people
## 18 17 17
## among_nations god_bless people_can
## 17 16 15
## people_united foreign_nations may_well
## 15 15 15
## almighty_god form_government among_people
## 15 14 14
## chief_justice nations_world peace_world
## 14 14 14
## national_life free_people every_american
## 13 13 13
## can_never administration_government public_debt
## 12 12 12
## constitution_laws people_world within_limits
## 12 12 12
## one_nation
## 11
How big is it? How sparse? It gives some interesting content -- "great_nation", "almighty_god", "public_debt" -- but some confusing constructions, e.g., "people_world", which is really things like "people of the world."
Ugh, well, yes, if you must. Wordclouds are an abomination -- I'll rant about that at a later date -- but here's Trump's first inaugural in a wordcloud ...
library(quanteda.textplots)
set.seed(100)
textplot_wordcloud(dtm.nostop["2017-Trump",], min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
Save a copy of the notebook and use it to answer the questions below. Those labeled "Challenge" require more than what is demonstrated above.
1) Use the inaugural_tokens.nostop
object. Define a word's "context" as a window of five words/tokens before and after a word's usage. In what contexts does the word "Roman" appear in this corpus?
2) Using dtm.wpunct
, which president used the most exclamation points in his inaugural address?
3) Use dtm.nostop
for these questions.
a) Do any terms appear only in the document containing Abraham Lincoln's first inaugural address?
b) Challenge: How many terms appeared first in Abraham Lincoln's first inaugural address?
c) How many times has the word "slave" been used in inaugural addresses?
d) Challenge: How many times has a word that included "slave" (like "slavery" or "enslaved") been used in inaugural addresses?
4) Construct a dtm of trigrams (lower case, not stemmed, no stop words removed).
a) How big is the matrix? How sparse is it?
b) What are the 50 most frequent trigrams?
c) Challenge How many trigrams appear only once?
5) Tokenize the following string of tweets using the built-in word
tokenizer, the tokenize_words
tokenizer from the tokenizers
package, and the tokenize_tweets
tokenizer from the tokenizers
package, and explain what's different.
https://t.co/9z2J3P33Uc FB needs to hurry up and add a laugh/cry button 😬😭😓🤢🙄😱 Since eating my feelings has not fixed the world's problems, I guess I'll try to sleep... HOLY CRAP: DeVos questionnaire appears to include passages from uncited sources https://t.co/FNRoOlfw9s well played, Senator Murray Keep the pressure on: https://t.co/4hfOsmdk0l @datageneral thx Mr Taussig It's interesting how many people contact me about applying for a PhD and don't spell my name right.