Updated September 2021.

The quanteda package (https://quanteda.io) is a very general and well-documented ecosystem for text analysis in R. A very large percentage of what is typically done in social science text-as-data research can be done with, or at least through, quanteda. Among the "competitors" to quanteda are the classic package tm and the tidyverse-consistent package tidytext. These actually are interrelated, with shared code and conversion utilities available, so they aren't necessarily in conflict.

Official description:

The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.

In addition to the extensive documentation, Stefan Müller and Ken Benoit have a very helpful cheatsheet here: https://muellerstefan.net/files/quanteda-cheatsheet.pdf.

In this notebook, we will use quanteda to turn a collection of texts, a corpus, into quantitative data, with each document represented by the counts of the "words" in it. Since we do away with word order, this is called a bag-of-words representation.

Install the following packages if you haven't already:

# install.packages("quanteda", dependencies=TRUE)
# install.packages("tokenizers", dependencies=TRUE)
# install.packages("quanteda.textplots", dependencies=TRUE)
# install.packages("RColorBrewer", dependencies=TRUE)

Note that quanteda has in recent versions moved analysis and plotting functions to new packages quanteda.textplots, quanteda.textmodels (classification and scaling models), and quanteda.textstats.

Now load quanteda:

library(quanteda)
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

If you are working on RStudio Cloud, you may have received a warning message about the "locale." You can set the locale to British English ("en_GB") with the stri_locale_set command from the already-loaded stringi package. You may wish to set it instead for a different context (e.g., "en_US" for US English) or language (e.g., "pt_BR" for Brazilian Portuguese). This warning seems to appear every time an RStudio Cloud project with quanteda loaded is reopened, so you have to reissue this command to make it go away.

# stringi::stri_locale_set("en_GB")

A first corpus

Quanteda comes with several corpora included. Let's load in the corpus of US presidential inaugural addresses and see what it looks like:

corp <- quanteda::data_corpus_inaugural

summary(corp)
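
Besides summary(), you can query the corpus object directly. For instance (ndoc and docvars are standard quanteda accessors; the inaugural corpus carries docvars such as Year and President):

ndoc(corp)          # number of documents
head(docvars(corp)) # document-level metadata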

What does a document look like? Let's look at one document (George Washington's first inaugural), which can be accessed with the as.character method. (The texts() command used in previous versions has been deprecated.)

as.character(corp[1])
## 1789-Washington
## "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years  -  a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. 
These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "

Tokenizing - what is in the bag of words?

The first task is tokenizing. You can apply a tokenizer in quanteda with the tokens command, turning a "corpus" object -- or just a character vector of texts -- into a "tokens" object. In the latest versions of quanteda, most commands operate on a tokens object.

The examples from the help file will be used to show a few of the options:

txt <- c(doc1 = "A sentence, showing how tokens() works.",
         doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
         doc3 = "Self-documenting code??",
         doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
## Tokens consisting of 4 documents.
## doc1 :
##  [1] "A"        "sentence" ","        "showing"  "how"      "tokens"  
##  [7] "("        ")"        "works"    "."       
## 
## doc2 :
## [1] "@quantedainit"              "and"                       
## [3] "#textanalysis"              "https://example.com?p=123."
## 
## doc3 :
## [1] "Self-documenting" "code"             "?"                "?"               
## 
## doc4 :
## [1] "£"          "1,000,000"  "for"        "50"         "¢"         
## [6] "is"         "gr8"        "4ever"      "\U0001f600"

The what option selects different tokenizers. The default is "word", which replaces a slower and less subtle legacy version, "word1":

tokens(txt, what = "word1")
## Tokens consisting of 4 documents.
## doc1 :
##  [1] "A"        "sentence" ","        "showing"  "how"      "tokens"  
##  [7] "("        ")"        "works"    "."       
## 
## doc2 :
##  [1] "@"            "quantedainit" "and"          "#"            "textanalysis"
##  [6] "https"        ":"            "/"            "/"            "example.com" 
## [11] "?"            "p"           
## [ ... and 3 more ]
## 
## doc3 :
## [1] "Self-documenting" "code"             "?"                "?"               
## 
## doc4 :
## [1] "£"          "1,000,000"  "for"        "50"         "¢"         
## [6] "is"         "gr8"        "4ever"      "\U0001f600"

For some purposes you may wish to tokenize by characters:

tokens(txt[1], what = "character")
## Tokens consisting of 1 document.
## doc1 :
##  [1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
## [ ... and 22 more ]

You can "tokenize" (the usual term is "segment") by sentence in Quanteda, but note that they recommend the spacyr package (discussed in a separate notebook) for better sentence segmentation. Let's try it on Washington's inaugural:

tokens(corp[1], what = "sentence")
## Tokens consisting of 1 document and 4 docvars.
## 1789-Washington :
##  [1] "Fellow-Citizens of the Senate and of the House of Representatives:  Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month."                                                                                                                                                                                                                                                                                                                                                                                        
##  [2] "On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years  -  a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time."                                                                                                                                                                                
##  [3] "On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies."                                                                                                                                                                                                                          
##  [4] "In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
##  [5] "All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated."                                                                                                                                           
##  [6] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge."
##  [7] "In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either."                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [8] "No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [9] "Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage."                                                                                    
## [10] "These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [11] "You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [12] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [ ... and 11 more ]

Wow, those are long sentences. Out of curiosity, let's look at Trump's:

tokens(corp[58], what = "sentence")
## Tokens consisting of 1 document and 4 docvars.
## 2017-Trump :
##  [1] "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you."                                                                         
##  [2] "We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people."                                                                               
##  [3] "Together, we will determine the course of America and the world for many, many years to come."                                                                                                                              
##  [4] "We will face challenges."                                                                                                                                                                                                   
##  [5] "We will confront hardships."                                                                                                                                                                                                
##  [6] "But we will get the job done."                                                                                                                                                                                              
##  [7] "Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition."
##  [8] "They have been magnificent."                                                                                                                                                                                                
##  [9] "Thank you."                                                                                                                                                                                                                 
## [10] "Today's ceremony, however, has very special meaning."                                                                                                                                                                       
## [11] "Because today we are not merely transferring power from one Administration to another, or from one party to another - but we are transferring power from Washington DC and giving it back to you, the people."              
## [12] "For too long, a small group in our nation's Capital has reaped the rewards of government while the people have borne the cost."                                                                                             
## [ ... and 76 more ]

Those are ... shorter.

There are a number of options you can apply with the tokens command, controlling how the tokenizer deals with punctuation, numbers, symbols, hyphenation, etc. Again, just the help file examples:

# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "A"        "sentence" "showing"  "how"      "tokens"   "works"   
## 
## doc2 :
## [1] "@quantedainit"              "and"                       
## [3] "#textanalysis"              "https://example.com?p=123."
# splitting hyphenated words
tokens(txt[3])
## Tokens consisting of 1 document.
## doc3 :
## [1] "Self-documenting" "code"             "?"                "?"
tokens(txt[3], split_hyphens = TRUE)
## Tokens consisting of 1 document.
## doc3 :
## [1] "Self"        "-"           "documenting" "code"        "?"          
## [6] "?"
# symbols and numbers
tokens(txt[4])
## Tokens consisting of 1 document.
## doc4 :
## [1] "£"          "1,000,000"  "for"        "50"         "¢"         
## [6] "is"         "gr8"        "4ever"      "\U0001f600"
tokens(txt[4], remove_numbers = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "£"          "for"        "¢"          "is"         "gr8"       
## [6] "4ever"      "\U0001f600"
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "for"   "is"    "gr8"   "4ever"

External tokenizers

You can use other tokenizers, like those from the "tokenizers" package. The output of a command like tokenizers::tokenize_words can be passed to the tokens command:

# install.packages("tokenizers")
library(tokenizers)
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
## Tokens consisting of 1 document.
## doc4 :
## [1] "1,000,000" "for"       "50"        "is"        "gr8"       "4ever"
# using pipe notation
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) %>%
  tokens(remove_symbols = TRUE)
## Tokens consisting of 4 documents.
## doc1 :
##  [1] "A"        "sentence" ","        "showing"  "how"      "tokens"  
##  [7] "("        ")"        "works"    "."       
## 
## doc2 :
##  [1] "@"            "quantedainit" "and"          "#"            "textanalysis"
##  [6] "https"        ":"            "/"            "/"            "example.com" 
## [11] "?"            "p"           
## [ ... and 2 more ]
## 
## doc3 :
## [1] "Self"        "-"           "documenting" "code"        "?"          
## [6] "?"          
## 
## doc4 :
## [1] "1,000,000" "for"       "50"        "is"        "gr8"       "4ever"
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) %>%
    tokens(remove_punct = TRUE)
## Tokens consisting of 1 document.
## doc3 :
##  [1] "s" "e" "l" "f" "d" "o" "c" "u" "m" "e" "n" "t"
## [ ... and 7 more ]
tokenizers::tokenize_sentences(
    "The quick brown fox.  It jumped over the lazy dog.") %>%
    tokens()
## Tokens consisting of 1 document.
## text1 :
## [1] "The quick brown fox."         "It jumped over the lazy dog."

Look carefully -- what did it do differently?

Let's make a fairly generic tokens object from our inaugural speeches corpus.

inaugural_tokens <- quanteda::tokens(corp,
                       what = "word",
                       remove_punct = TRUE, # default FALSE
                       remove_symbols = TRUE, # default FALSE
                       remove_numbers = FALSE,
                       remove_url = TRUE, # default FALSE
                       remove_separators = TRUE,
                       split_hyphens = FALSE,
                       include_docvars = TRUE,
                       padding = FALSE,
                       verbose = quanteda_options("verbose")
                       )

This produces a tokens class object. Expand the object in your RStudio Environment tab to take a look at it.

Foremost, it's a list with one entry per document consisting of a character vector of the document's tokens.

inaugural_tokens[["2017-Trump"]][1:30]
##  [1] "Chief"     "Justice"   "Roberts"   "President" "Carter"    "President"
##  [7] "Clinton"   "President" "Bush"      "President" "Obama"     "fellow"   
## [13] "Americans" "and"       "people"    "of"        "the"       "world"    
## [19] "thank"     "you"       "We"        "the"       "citizens"  "of"       
## [25] "America"   "are"       "now"       "joined"    "in"        "a"

Tokens vs. types

It also has a vector of the "types" -- the vocabulary of tokens in the whole corpus/object. This attribute can be accessed through the attr function.

attr(inaugural_tokens,"types")[1:30]
##  [1] "Fellow-Citizens" "of"              "the"             "Senate"         
##  [5] "and"             "House"           "Representatives" "Among"          
##  [9] "vicissitudes"    "incident"        "to"              "life"           
## [13] "no"              "event"           "could"           "have"           
## [17] "filled"          "me"              "with"            "greater"        
## [21] "anxieties"       "than"            "that"            "which"          
## [25] "notification"    "was"             "transmitted"     "by"             
## [29] "your"            "order"
length(attr(inaugural_tokens, "types"))
## [1] 10147

Just over 10,000 unique types have been used. Notice that "the" appears third in the types vector and never again -- each type is listed exactly once. But "The" does appear, as a separate type:

which(attr(inaugural_tokens,"types")=="The")
## [1] 339

Why are they "the" and "The" different types? Why is "Fellow-Citizens" one type?

Under the hood, the tokens vector isn't a vector of strings. It's a vector of integers, each indicating the index of the token in the types vector. So every time "the" appears, it is stored as the integer 3.
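
You can see this by stripping the class and inspecting the raw storage. (This peeks at quanteda's internal representation, which is not part of the public API and could change across versions.)

unclass(inaugural_tokens)[["1789-Washington"]][1:10]
# the integer 3 should appear wherever "the" appeared in the text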

By default, the tokens object also retains all of the document metadata that came with the corpus.
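
For example, docvars works on a tokens object just as it does on a corpus:

head(docvars(inaugural_tokens))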

Key Words in Context

The tokens object also provides access to a variety of quanteda utilities. For example, a very helpful traditional qualitative tool is Key Words in Context, via the kwic command:

kwic(inaugural_tokens, "humble", window=3)
kwic(inaugural_tokens, "tombstones", window=4)

Hmmm. Moving on.

Stemming

Stemming is the truncation of words in an effort to associate related words with a common token, e.g., "baby" and "babies" -> "babi".

The tokenizers package provides a wrapper to the wordStem function from the SnowballC package, which applies a standard stemmer called the Porter stemmer. (The function takes a vector of texts or a corpus as input, and returns a list with one element per text, each a character vector of the stems for the corresponding text.)

tokenizers::tokenize_word_stems(corp)$`2017-Trump`[1:50]
##  [1] "chief"    "justic"   "robert"   "presid"   "carter"   "presid"  
##  [7] "clinton"  "presid"   "bush"     "presid"   "obama"    "fellow"  
## [13] "american" "and"      "peopl"    "of"       "the"      "world"   
## [19] "thank"    "you"      "we"       "the"      "citizen"  "of"      
## [25] "america"  "are"      "now"      "join"     "in"       "a"       
## [31] "great"    "nation"   "effort"   "to"       "rebuild"  "our"     
## [37] "countri"  "and"      "restor"   "it"       "promis"   "for"     
## [43] "all"      "of"       "our"      "peopl"    "togeth"   "we"      
## [49] "will"     "determin"

From text to data - the document-term-matrix

Quanteda is focused largely on bag-of-words (or bag-of-tokens or bag-of-terms) models that work from a document-term matrix, where each row represents a document, each column represents a type (a "term" in the vocabulary), and each entry is the count of tokens in that document matching the column's term.

For this we will use quanteda's "dfm" command with some commonly chosen preprocessing options. In older versions of quanteda, the dfm function was applied to a corpus, with tokenizing and normalizing options applied there. It is now applied to a tokens object where most of that has already been done. Here, we'll add case-folding, merging "the" and "The", among other things, into a single type.

doc_term_matrix <- quanteda::dfm(inaugural_tokens,
                                 tolower = TRUE  # case-fold
                                 )

What kind of object is doc_term_matrix?

class(doc_term_matrix)
## [1] "dfm"
## attr(,"package")
## [1] "quanteda"

Typing the dfm's name will show an object summary. This is a matrix, so how many rows does it have? How many columns? What does "91.89% sparse" mean?

doc_term_matrix
## Document-feature matrix of: 59 documents, 9,422 features (91.89% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives
##   1789-Washington               1  71 116      1  48     2               2
##   1793-Washington               0  11  13      0   2     0               0
##   1797-Adams                    3 140 163      1 130     0               2
##   1801-Jefferson                2 104 130      0  81     0               0
##   1805-Jefferson                0 101 143      0  93     0               0
##   1809-Madison                  1  69 104      0  43     0               0
##                  features
## docs              among vicissitudes incident
##   1789-Washington     1            1        1
##   1793-Washington     0            0        0
##   1797-Adams          4            0        0
##   1801-Jefferson      1            0        0
##   1805-Jefferson      7            0        0
##   1809-Madison        0            0        0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,412 more features ]
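
You can answer those questions directly: dim gives the matrix dimensions, and quanteda's sparsity function returns the proportion of cells that are zero:

dim(doc_term_matrix)      # documents x features
sparsity(doc_term_matrix) # proportion of zero entries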

You can peek inside it, indexing it like you would a matrix or Matrix object:

doc_term_matrix[1:5,1:5]
## Document-feature matrix of: 5 documents, 5 features (20.00% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and
##   1789-Washington               1  71 116      1  48
##   1793-Washington               0  11  13      0   2
##   1797-Adams                    3 140 163      1 130
##   1801-Jefferson                2 104 130      0  81
##   1805-Jefferson                0 101 143      0  93

What are the most frequent terms?

topfeatures(doc_term_matrix,40)
##        the         of        and         to         in          a        our 
##      10183       7180       5406       4591       2827       2292       2224 
##         we       that         be         is         it        for         by 
##       1827       1813       1502       1491       1398       1230       1091 
##       have      which        not       with         as       will       this 
##       1031       1007        980        970        966        944        874 
##          i        all        are      their        but        has     people 
##        871        836        828        761        670        631        584 
##       from        its government         or         on         my         us 
##        578        573        564        563        544        515        505 
##       been        can         no       they         so 
##        496        487        470        463        397

You can get the same thing by sorting the column sums of the dtm:

word_freq <- colSums(doc_term_matrix)
sort(word_freq,decreasing=TRUE)[1:40]
##        the         of        and         to         in          a        our 
##      10183       7180       5406       4591       2827       2292       2224 
##         we       that         be         is         it        for         by 
##       1827       1813       1502       1491       1398       1230       1091 
##       have      which        not       with         as       will       this 
##       1031       1007        980        970        966        944        874 
##          i        all        are      their        but        has     people 
##        871        836        828        761        670        631        584 
##       from        its government         or         on         my         us 
##        578        573        564        563        544        515        505 
##       been        can         no       they         so 
##        496        487        470        463        397

Stopwords

For some purposes, you may wish to remove "stopwords." There are stopword lists accessible through the stopwords function, exported from the automatically loaded stopwords package. The default is English from the Snowball collection. Get a list of sources with stopwords_getsources() and a list of languages for a given source with stopwords_getlanguages().
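
For example (both helpers come from the stopwords package):

stopwords::stopwords_getsources()
stopwords::stopwords_getlanguages("snowball")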

The default English list is fairly short.

stopwords('en')[1:10] #Snowball
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"
length(stopwords('en'))
## [1] 175

This one's three times longer.

stopwords('en', source='smart')[1:10]
##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"
length(stopwords('en', source='smart'))
## [1] 571

This one's almost ten times as long and is ... interesting.

stopwords('en', source='stopwords-iso')[1:10]
##  [1] "'ll"       "'tis"      "'twas"     "'ve"       "10"        "39"       
##  [7] "a"         "a's"       "able"      "ableabout"
length(stopwords('en', source='stopwords-iso'))
## [1] 1298

The beginning of a German list.

stopwords('de')[1:10]
##  [1] "aber"  "alle"  "allem" "allen" "aller" "alles" "als"   "also"  "am"   
## [10] "an"

A slice from an Ancient Greek list:

stopwords('grc',source='ancient')[264:288]
##  [1] "xxx"     "xxxi"    "xxxii"   "xxxiii"  "xxxiv"   "xxxix"   "xxxv"   
##  [8] "xxxvi"   "xxxvii"  "xxxviii" "y"       "z"       "α"       "ἅ"      
## [15] "ἃ"       "ᾇ"       "ἄγαν"    "ἄγε"     "ἄγχι"    "ἀγχοῦ"   "ἁγώ"    
## [22] "ἁγὼ"     "ἅγωγ"    "ἁγών"    "ἁγὼν"

Let's case-fold our tokens object to lowercase, remove the stopwords, then make a new dtm and see how it's different.

inaugural_tokens.nostop <- inaugural_tokens %>%
                            tokens_tolower() %>%
                            tokens_remove(stopwords('en'))
dtm.nostop <- dfm(inaugural_tokens.nostop)
dtm.nostop
## Document-feature matrix of: 59 documents, 9,284 features (92.70% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens senate house representatives among
##   1789-Washington               1      1     2               2     1
##   1793-Washington               0      0     0               0     0
##   1797-Adams                    3      1     0               2     4
##   1801-Jefferson                2      0     0               0     1
##   1805-Jefferson                0      0     0               0     7
##   1809-Madison                  1      0     0               0     0
##                  features
## docs              vicissitudes incident life event filled
##   1789-Washington            1        1    1     2      1
##   1793-Washington            0        0    0     0      0
##   1797-Adams                 0        0    2     0      0
##   1801-Jefferson             0        0    1     0      0
##   1805-Jefferson             0        0    2     0      0
##   1809-Madison               0        0    1     0      1
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,274 more features ]

We've got about 140 fewer features, and the matrix is slightly more sparse. Why?
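
One way to check is to compare the two vocabularies directly (featnames is quanteda's accessor for a dfm's feature names):

setdiff(featnames(doc_term_matrix), featnames(dtm.nostop))[1:10]
# the dropped features should all be stopwords like "of", "the", "and"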

What are the most frequent tokens now?

topfeatures(dtm.nostop,40)
##       people   government           us          can         must         upon 
##          584          564          505          487          376          371 
##        great          may       states        world        shall      country 
##          344          343          334          319          316          308 
##       nation        every          one        peace          new        power 
##          305          300          267          258          250          241 
##          now       public         time     citizens constitution       united 
##          229          225          220          209          209          203 
##      america      nations        union      freedom         free          war 
##          202          199          190          185          183          181 
##     american          let     national         made         good         make 
##          172          160          158          156          149          147 
##        years      justice          men      without 
##          143          142          140          140

How is this document different from those documents?

I'm just curious. Besides "tombstones," what other words made their inaugural debut in 2017?

# TRUE for terms whose total count across all speeches equals their count in the
# 2017 document, i.e., terms that appear only in Trump's 2017 address
unique_to_trump <- as.vector(colSums(doc_term_matrix) == doc_term_matrix["2017-Trump",])
colnames(doc_term_matrix)[unique_to_trump]
##  [1] "obama"          "hardships"      "lady"           "michelle"      
##  [5] "transferring"   "dc"             "reaped"         "politicians"   
##  [9] "2017"           "listening"      "likes"          "neighborhoods" 
## [13] "trapped"        "rusted-out"     "tombstones"     "landscape"     
## [17] "flush"          "stolen"         "robbed"         "unrealized"    
## [21] "carnage"        "stops"          "we've"          "subsidized"    
## [25] "allowing"       "sad"            "depletion"      "trillions"     
## [29] "overseas"       "infrastructure" "disrepair"      "ripped"        
## [33] "redistributed"  "issuing"        "ravages"        "stealing"      
## [37] "tunnels"        "hire"           "goodwill"       "shine"         
## [41] "reinforce"      "islamic"        "bedrock"        "disagreements" 
## [45] "solidarity"     "unstoppable"    "complaining"    "arrives"       
## [49] "mysteries"      "brown"          "bleed"          "sprawl"        
## [53] "windswept"      "nebraska"       "ignored"

OK!

The impact of preprocessing decisions

We can also change the settings. What happens if we don't remove punctuation?

inaugural_tokens.wpunct <- quanteda::tokens(corp,
                          what = "word",
                          remove_punct = FALSE) %>%
                          tokens_tolower() %>%
                          tokens_remove(stopwords('en'))
  
dtm.wpunct <- dfm(inaugural_tokens.wpunct)
dtm.wpunct
## Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens senate house representatives : among
##   1789-Washington               1      1     2               2 1     1
##   1793-Washington               0      0     0               0 1     0
##   1797-Adams                    3      1     0               2 0     4
##   1801-Jefferson                2      0     0               0 1     1
##   1805-Jefferson                0      0     0               0 0     7
##   1809-Madison                  1      0     0               0 0     0
##                  features
## docs              vicissitudes incident life event
##   1789-Washington            1        1    1     2
##   1793-Washington            0        0    0     0
##   1797-Adams                 0        0    2     0
##   1801-Jefferson             0        0    1     0
##   1805-Jefferson             0        0    2     0
##   1809-Madison               0        0    1     0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]
topfeatures(dtm.wpunct,40)
##            ,            .       people            ;   government           us 
##         7173         5155          584          565          564          505 
##          can            -         must         upon        great          may 
##          487          403          376          371          344          343 
##       states        world        shall      country       nation        every 
##          334          319          316          308          305          300 
##          one        peace            "          new        power          now 
##          267          258          256          250          241          229 
##       public         time     citizens constitution       united      america 
##          225          220          209          209          203          202 
##      nations        union      freedom         free          war     american 
##          199          190          185          183          181          172 
##          let     national         made         good 
##          160          158          156          149

How big is it now? How sparse is it now?

What happens if we lower case and stem?

inaugural_tokens.stems <- quanteda::tokens(corp,
                          what = "word",
                          remove_punct = TRUE) %>%
                          tokens_tolower() %>%
                          tokens_remove(stopwords('en')) %>%
                          tokens_wordstem()
  
dtm.stems <- dfm(inaugural_tokens.stems)
dtm.stems
## Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
##                  features
## docs              fellow-citizen senat hous repres among vicissitud incid life
##   1789-Washington              1     1    2      2     1          1     1    1
##   1793-Washington              0     0    0      0     0          0     0    0
##   1797-Adams                   3     1    3      3     4          0     0    2
##   1801-Jefferson               2     0    0      1     1          0     0    1
##   1805-Jefferson               0     0    0      0     7          0     0    2
##   1809-Madison                 1     0    0      1     0          1     0    1
##                  features
## docs              event fill
##   1789-Washington     2    1
##   1793-Washington     0    0
##   1797-Adams          0    0
##   1801-Jefferson      0    0
##   1805-Jefferson      1    0
##   1809-Madison        0    1
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
topfeatures(dtm.stems,40)
##    nation    govern     peopl        us       can     state     great      must 
##       691       657       632       505       487       452       378       376 
##     power      upon   countri     world       may     shall     everi constitut 
##       375       371       359       347       343       316       300       289 
##      peac       one     right      time       law   citizen  american       new 
##       288       279       279       271       271       265       257       250 
##   america    public       now      unit      duti       war      make  interest 
##       242       229       229       225       212       204       202       197 
##     union   freedom      free     secur      hope      year      good       let 
##       190       190       184       178       176       176       163       160
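
Stemming collapses inflectional variants, so the vocabulary shrinks. A quick way to quantify the reduction, assuming the unstemmed, stopword-free dtm.nostop object built earlier is still in your session:

nfeat(dtm.nostop)  # features before stemming
nfeat(dtm.stems)   # features after stemming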

Zipf's Law and a power law

It's somewhat difficult to get your head around objects this large and sparse, but there are statistical regularities here. For example, these frequencies tend to be distributed according to "Zipf's Law" and a (related) "power law."

plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE),
     main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank")

That makes the "long tail" clear. The overall relationship becomes clearer on a log-log scale:

plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE),
     main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank", log = "xy")

For the power law, we need the number of words that appear at any given frequency. We'll turn word_freq into a categorical variable by making it a "factor", whose categories are "1", "2", ... "17", and so on, and then use summary to count each "category." (The maxsum option makes sure summary doesn't stop at 100 categories and lump everything else together as "Other".)

# Count how many words occur at each observed frequency
words_with_freq <- summary(as.factor(word_freq), maxsum = 10000)
# The factor labels are the frequencies themselves; recover them as integers
freq_bin <- as.integer(names(words_with_freq))

plot(freq_bin, words_with_freq, main = "Power Law?", xlab = "Word Frequency",
     ylab = "Number of Words", log = "xy")

Zipf's law implies that, in a new corpus, say, a small number of terms will be very common (we'll know a lot about them, but they won't help us distinguish documents), a large number of terms will be very rare (we'll know very little about them), and some terms will not have been seen before at all. This "out-of-vocabulary" (OOV) problem is an important one in some applications.
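
To put a rough number on the Zipf regularity, we can regress log frequency on log rank; a slope near -1 is the classic Zipf pattern. A minimal sketch, reusing the doc_term_matrix object from above:

# Total frequency of each word, sorted from most to least common
word_freq_sorted <- sort(colSums(doc_term_matrix), decreasing = TRUE)
# Zipf's law predicts log(frequency) is roughly linear in log(rank)
zipf_fit <- lm(log(word_freq_sorted) ~ log(seq_along(word_freq_sorted)))
coef(zipf_fit)  # the slope estimates the (negative) Zipf exponent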

A step toward word order mattering: n-grams

Let's go back to preprocessing choices. What happens if we count bigrams? Let's first do it without removing stopwords.

inaugural_tokens.2grams <- inaugural_tokens %>%
                          tokens_tolower() %>%
                          tokens_ngrams(n=2)
  
dtm.2grams <- dfm(inaugural_tokens.2grams)
dtm.2grams
## Document-feature matrix of: 59 documents, 65,497 features (97.10% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens_of my_fellow of_the the_senate
##   1789-Washington                  1         2     20          1
##   1793-Washington                  0         0      4          0
##   1797-Adams                       0         0     29          0
##   1801-Jefferson                   0         1     28          0
##   1805-Jefferson                   0         2     17          0
##   1809-Madison                     0         0     20          0
##                  features
## docs              fellow_citizens senate_and and_of the_house house_of
##   1789-Washington               2          1      2         2        2
##   1793-Washington               1          0      1         0        0
##   1797-Adams                    0          0      2         0        0
##   1801-Jefferson                5          0      3         0        0
##   1805-Jefferson                8          0      1         0        0
##   1809-Madison                  0          0      2         0        0
##                  features
## docs              of_representatives
##   1789-Washington                  2
##   1793-Washington                  0
##   1797-Adams                       0
##   1801-Jefferson                   0
##   1805-Jefferson                   0
##   1809-Madison                     0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 65,487 more features ]
topfeatures(dtm.2grams,40)
##           of_the           in_the           to_the           of_our 
##             1772              821              727              628 
##          and_the            it_is           by_the          for_the 
##              474              324              322              315 
##            to_be       the_people          we_have             of_a 
##              313              270              264              262 
##         with_the         that_the        the_world        have_been 
##              240              221              214              205 
##          will_be         has_been           on_the           is_the 
##              199              185              183              183 
##           we_are         from_the           and_to   the_government 
##              178              168              166              165 
##       the_united    united_states          all_the           in_our 
##              164              158              157              156 
## the_constitution          we_will           of_all          of_this 
##              156              156              153              142 
##         of_their        should_be             in_a           and_in 
##              141              139              138              132 
##        those_who          we_must            of_my           may_be 
##              129              128              126              126

How big is it? How sparse? It doesn't give us a lot of sense of content, but it does offer some rudimentary insights into how English is structured.
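
As before, the exact size and sparsity are easy to check:

ndoc(dtm.2grams)
nfeat(dtm.2grams)
sparsity(dtm.2grams)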

One such insight: we can build a rudimentary statistical language model that "predicts" the next word from bigram frequencies. For each candidate next word, we estimate the conditional probability P(next | current) as the count of the bigram current_next divided by the total count of all bigrams that start with the current word (this relative-frequency estimate is the maximum likelihood estimate; see Jurafsky and Martin, Speech and Language Processing, Chapter 3: https://web.stanford.edu/~jurafsky/slp3/ for more detail / nuance).

If the current word is "american", what word is most likely to come next in this corpus?

First we find the right bigrams using a regular expression. See the regular expressions notebook for more detail if that is unfamiliar.

american_bigrams <- grep("^american_",colnames(dtm.2grams),value=TRUE)
american_bigrams
##  [1] "american_people"        "american_renewal"       "american_on"           
##  [4] "american_covenant"      "american_freemen"       "american_lives"        
##  [7] "american_treasure"      "american_belief"        "american_heart"        
## [10] "american_business"      "american_century"       "american_dream"        
## [13] "american_policy"        "american_here"          "american_in"           
## [16] "american_to"            "american_promise"       "american_spirit"       
## [19] "american_story"         "american_today"         "american_conscience"   
## [22] "american_auspices"      "american_above"         "american_freedom"      
## [25] "american_interests"     "american_a"             "american_instinct"     
## [28] "american_family"        "american_citizenship"   "american_rights"       
## [31] "american_industries"    "american_labor"         "american_citizens"     
## [34] "american_flag"          "american_market"        "american_that"         
## [37] "american_is"            "american_soldiers"      "american_she"          
## [40] "american_states"        "american_name"          "american_carnage"      
## [43] "american_industry"      "american_workers"       "american_being"        
## [46] "american_families"      "american_hands"         "american_and"          
## [49] "american_we"            "american_merchant"      "american_navy"         
## [52] "american_products"      "american_destiny"       "american_enjoys"       
## [55] "american_way"           "american_arms"          "american_revolution"   
## [58] "american_ideal"         "american_slavery"       "american_who"          
## [61] "american_emancipation"  "american_steamship"     "american_history"      
## [64] "american_life"          "american_i"             "american_control"      
## [67] "american_opportunity"   "american_if"            "american_sovereignty"  
## [70] "american_citizen"       "american_anthem"        "american_sound"        
## [73] "american_he"            "american_must"          "american_failure"      
## [76] "american_sense"         "american_manliness"     "american_character"    
## [79] "american_achievement"   "american_democracy"     "american_standards"    
## [82] "american_bottoms"       "american_childhood"     "american_experiment"   
## [85] "american_statesmen"     "american_ideals"        "american_statesmanship"
## [88] "american_subjects"      "american_political"

Most likely bigrams starting with "american":

freq_american_bigrams <- colSums(dtm.2grams[,american_bigrams])
most_likely_bigrams <- sort(freq_american_bigrams/sum(freq_american_bigrams),dec=TRUE)[1:10]
most_likely_bigrams
##      american_people       american_story     american_citizen 
##           0.23255814           0.03488372           0.03488372 
##       american_dream          american_to american_citizenship 
##           0.02906977           0.02325581           0.02325581 
##    american_citizens      american_policy       american_labor 
##           0.02325581           0.01744186           0.01744186 
##        american_that 
##           0.01744186
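
The same arithmetic works for any starting word, so it is worth wrapping in a small helper. This is just a sketch: next_word_probs is our own illustrative name, not a quanteda function, and it assumes the word contains no regular-expression metacharacters.

# Relative frequencies of bigrams beginning with `word`: the maximum
# likelihood estimate of P(next word | current word) described above
next_word_probs <- function(dtm, word, n = 10) {
  bigrams <- grep(paste0("^", word, "_"), colnames(dtm), value = TRUE)
  freqs <- colSums(dtm[, bigrams])
  head(sort(freqs / sum(freqs), decreasing = TRUE), n)
}

next_word_probs(dtm.2grams, "united")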

Let's see what happens if we remove the stopwords first.

inaugural_tokens.2grams.nostop <- inaugural_tokens %>%
                          tokens_tolower() %>%
                          tokens_remove(stopwords('en')) %>%
                          tokens_ngrams(n=2)
  
dtm.2grams.nostop <- dfm(inaugural_tokens.2grams.nostop)
dtm.2grams.nostop
## Document-feature matrix of: 59 documents, 57,723 features (98.12% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens_senate senate_house house_representatives
##   1789-Washington                      1            1                     2
##   1793-Washington                      0            0                     0
##   1797-Adams                           0            0                     0
##   1801-Jefferson                       0            0                     0
##   1805-Jefferson                       0            0                     0
##   1809-Madison                         0            0                     0
##                  features
## docs              representatives_among among_vicissitudes
##   1789-Washington                     1                  1
##   1793-Washington                     0                  0
##   1797-Adams                          0                  0
##   1801-Jefferson                      0                  0
##   1805-Jefferson                      0                  0
##   1809-Madison                        0                  0
##                  features
## docs              vicissitudes_incident incident_life life_event event_filled
##   1789-Washington                     1             1          1            1
##   1793-Washington                     0             0          0            0
##   1797-Adams                          0             0          0            0
##   1801-Jefferson                      0             0          0            0
##   1805-Jefferson                      0             0          0            0
##   1809-Madison                        0             0          0            0
##                  features
## docs              filled_greater
##   1789-Washington              1
##   1793-Washington              0
##   1797-Adams                   0
##   1801-Jefferson               0
##   1805-Jefferson               0
##   1809-Madison                 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 57,713 more features ]
topfeatures(dtm.2grams.nostop,40)
##             united_states                    let_us           fellow_citizens 
##                       158                       105                        78 
##           american_people        federal_government                 men_women 
##                        40                        32                        28 
##                 years_ago                four_years                   upon_us 
##                        27                        26                        25 
##        general_government               one_another       constitution_united 
##                        25                        22                        20 
##            government_can             every_citizen          fellow_americans 
##                        20                        20                        19 
##            vice_president              great_nation         government_people 
##                        18                        17                        17 
##             among_nations                 god_bless                people_can 
##                        17                        16                        15 
##             people_united           foreign_nations                  may_well 
##                        15                        15                        15 
##              almighty_god           form_government              among_people 
##                        15                        14                        14 
##             chief_justice             nations_world               peace_world 
##                        14                        14                        14 
##             national_life               free_people            every_american 
##                        13                        13                        13 
##                 can_never administration_government               public_debt 
##                        12                        12                        12 
##         constitution_laws              people_world             within_limits 
##                        12                        12                        12 
##                one_nation 
##                        11

How big is it? How sparse? It gives some interesting content -- "great_nation", "almighty_god", "public_debt" -- but also some confusing constructions, e.g. "people_world", which is really things like "people of the world" with the stopwords removed.
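
If a collapsed bigram is puzzling, a keyword-in-context search on the unfiltered tokens recovers the original wording. A quick sketch using quanteda's kwic() and phrase(), assuming the inaugural_tokens object created earlier is still available:

# One phrase that stopword removal would collapse into "people_world"
kwic(inaugural_tokens, phrase("people of the world"), window = 3)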

Can I draw those slick wordclouds?

Ugh, well, yes, if you must. Wordclouds are an abomination -- I'll rant about that at a later date -- but here's Trump's first inaugural in a wordcloud ...

library(quanteda.textplots)

set.seed(100)
textplot_wordcloud(dtm.nostop["2017-Trump",], min_count = 1, random_order = FALSE,
                   rotation = .25, 
                   color = RColorBrewer::brewer.pal(8,"Dark2"))
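
If the cloud is too dense to read, one option is to trim the rarest terms before plotting; here is a sketch using quanteda's dfm_trim() with an arbitrary threshold:

# Keep only terms used at least twice before drawing the cloud
textplot_wordcloud(dfm_trim(dtm.nostop["2017-Trump", ], min_termfreq = 2),
                   min_count = 1, random_order = FALSE, rotation = .25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))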

Practice Exercises

Save a copy of the notebook and use it to answer the questions below. Those labeled "Challenge" require more than the techniques demonstrated above.

1) Use the inaugural_tokens.nostop object. Define a word's "context" as a window of five words/tokens before and after a word's usage. In what contexts does the word "Roman" appear in this corpus?

2) Using dtm.wpunct, which president used the most exclamation points in his inaugural address?

3) Use dtm.nostop for these questions.

a) Do any terms appear only in the document containing Abraham Lincoln's first inaugural address?

b) Challenge: How many terms appeared first in Abraham Lincoln's first inaugural address?

c) How many times has the word "slave" been used in inaugural addresses?

d) Challenge: How many times has a word that included "slave" (like "slavery" or "enslaved") been used in inaugural addresses?

4) Construct a dtm of trigrams (lower case, not stemmed, no stop words removed).

a) How big is the matrix? How sparse is it?

b) What are the 50 most frequent trigrams?

c) Challenge: How many trigrams appear only once?

5) Tokenize the following string of tweets using quanteda's built-in word tokenizer, and then using the tokenize_words and tokenize_tweets tokenizers from the tokenizers package, and explain what's different.

https://t.co/9z2J3P33Uc FB needs to hurry up and add a laugh/cry button 😬😭😓🤢🙄😱 Since eating my feelings has not fixed the world's problems, I guess I'll try to sleep... HOLY CRAP: DeVos questionnaire appears to include passages from uncited sources https://t.co/FNRoOlfw9s well played, Senator Murray Keep the pressure on: https://t.co/4hfOsmdk0l @datageneral thx Mr Taussig It's interesting how many people contact me about applying for a PhD and don't spell my name right.