Updated September 2021.
The quanteda package (https://quanteda.io) is a very general and well-documented ecosystem for text analysis in R. A very large percentage of what is typically done in social science text-as-data research can be done with, or at least through, quanteda. Among the "competitors" to quanteda are the classic package tm and the tidyverse-consistent package tidytext. These actually are interrelated, with shared code and conversion utilities available, so they aren't necessarily in conflict.
Official description:
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
In addition to the extensive documentation, Stefan Müller and Ken Benoit have a very helpful cheatsheet here: https://muellerstefan.net/files/quanteda-cheatsheet.pdf.
In this notebook, we will use quanteda to turn a collection of texts -- a corpus -- into quantitative data, with each document represented by the counts of the "words" in it. Since we do away with word order, this is called a bag-of-words representation.
Install the following packages if you haven't already:
# install.packages("quanteda", dependencies=TRUE)
# install.packages("tokenizers", dependencies=TRUE)
# install.packages("quanteda.textplots", dependencies=TRUE)
# install.packages("RColorBrewer", dependencies=TRUE)
Note that recent versions of quanteda have moved analysis and plotting functions into the new packages quanteda.textplots, quanteda.textmodels (classification and scaling models), and quanteda.textstats.
Now load quanteda:
library(quanteda)
If you are working on RStudio Cloud, you may have received a warning message about the "locale." You can set the locale to British English ("en_GB") with the stri_locale_set command in the already-loaded stringi package. You may wish to set it to assume you are working in a different context (e.g., "en_US" for US English) or language (e.g., "pt_BR" for Brazilian Portuguese). This seems to happen every time an RStudio Cloud project with quanteda loaded is reopened, so you have to reissue this command to make the warning message go away.
# stringi::stri_locale_set("en_GB")
Quanteda comes with several corpora included. Let's load in the corpus of US presidential inaugural addresses and see what it looks like:
corp <- quanteda::data_corpus_inaugural
summary(corp)
Corpus consisting of 59 documents, showing 59 documents:
Text Types Tokens Sentences Year President FirstName Party
1789-Washington 625 1537 23 1789 Washington George none
1793-Washington 96 147 4 1793 Washington George none
1797-Adams 826 2577 37 1797 Adams John Federalist
1801-Jefferson 717 1923 41 1801 Jefferson Thomas Democratic-Republican
1805-Jefferson 804 2380 45 1805 Jefferson Thomas Democratic-Republican
1809-Madison 535 1261 21 1809 Madison James Democratic-Republican
1813-Madison 541 1302 33 1813 Madison James Democratic-Republican
1817-Monroe 1040 3677 121 1817 Monroe James Democratic-Republican
1821-Monroe 1259 4886 131 1821 Monroe James Democratic-Republican
1825-Adams 1003 3147 74 1825 Adams John Quincy Democratic-Republican
1829-Jackson 517 1208 25 1829 Jackson Andrew Democratic
1833-Jackson 499 1267 29 1833 Jackson Andrew Democratic
1837-VanBuren 1315 4158 95 1837 Van Buren Martin Democratic
1841-Harrison 1898 9123 210 1841 Harrison William Henry Whig
1845-Polk 1334 5186 153 1845 Polk James Knox Whig
1849-Taylor 496 1178 22 1849 Taylor Zachary Whig
1853-Pierce 1165 3636 104 1853 Pierce Franklin Democratic
1857-Buchanan 945 3083 89 1857 Buchanan James Democratic
1861-Lincoln 1075 3999 135 1861 Lincoln Abraham Republican
1865-Lincoln 360 775 26 1865 Lincoln Abraham Republican
1869-Grant 485 1229 40 1869 Grant Ulysses S. Republican
1873-Grant 552 1472 43 1873 Grant Ulysses S. Republican
1877-Hayes 831 2707 59 1877 Hayes Rutherford B. Republican
1881-Garfield 1021 3209 111 1881 Garfield James A. Republican
1885-Cleveland 676 1816 44 1885 Cleveland Grover Democratic
1889-Harrison 1352 4721 157 1889 Harrison Benjamin Republican
1893-Cleveland 821 2125 58 1893 Cleveland Grover Democratic
1897-McKinley 1232 4353 130 1897 McKinley William Republican
1901-McKinley 854 2437 100 1901 McKinley William Republican
1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore Republican
1909-Taft 1437 5821 158 1909 Taft William Howard Republican
1913-Wilson 658 1882 68 1913 Wilson Woodrow Democratic
1917-Wilson 549 1652 59 1917 Wilson Woodrow Democratic
1921-Harding 1169 3719 148 1921 Harding Warren G. Republican
1925-Coolidge 1220 4440 196 1925 Coolidge Calvin Republican
1929-Hoover 1090 3860 158 1929 Hoover Herbert Republican
1933-Roosevelt 743 2057 85 1933 Roosevelt Franklin D. Democratic
1937-Roosevelt 725 1989 96 1937 Roosevelt Franklin D. Democratic
1941-Roosevelt 526 1519 68 1941 Roosevelt Franklin D. Democratic
1945-Roosevelt 275 633 27 1945 Roosevelt Franklin D. Democratic
1949-Truman 781 2504 116 1949 Truman Harry S. Democratic
1953-Eisenhower 900 2743 119 1953 Eisenhower Dwight D. Republican
1957-Eisenhower 621 1907 92 1957 Eisenhower Dwight D. Republican
1961-Kennedy 566 1541 52 1961 Kennedy John F. Democratic
1965-Johnson 568 1710 93 1965 Johnson Lyndon Baines Democratic
1969-Nixon 743 2416 103 1969 Nixon Richard Milhous Republican
1973-Nixon 544 1995 68 1973 Nixon Richard Milhous Republican
1977-Carter 527 1369 52 1977 Carter Jimmy Democratic
1981-Reagan 902 2780 129 1981 Reagan Ronald Republican
1985-Reagan 925 2909 123 1985 Reagan Ronald Republican
1989-Bush 795 2673 141 1989 Bush George Republican
1993-Clinton 642 1833 81 1993 Clinton Bill Democratic
1997-Clinton 773 2436 111 1997 Clinton Bill Democratic
2001-Bush 621 1806 97 2001 Bush George W. Republican
2005-Bush 772 2312 99 2005 Bush George W. Republican
2009-Obama 938 2689 110 2009 Obama Barack Democratic
2013-Obama 814 2317 88 2013 Obama Barack Democratic
2017-Trump 582 1660 88 2017 Trump Donald J. Republican
2021-Biden.txt 811 2766 216 2021 Biden Joseph R. Democratic
What does a document look like? Let's look at one document (George Washington's first inaugural), which can be accessed with the as.character method. (The previous texts command has been deprecated.)
as.character(corp[1])
1789-Washington
"Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. 
These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
The first task is tokenizing. You can apply a tokenizer in quanteda with the tokens command, turning a "corpus" object -- or just a vector of texts -- into a "tokens" object. In the latest version of quanteda, most commands operate on a tokens object.
The examples from the help file will be used to show a few of the options:
txt <- c(doc1 = "A sentence, showing how tokens() works.",
doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
doc3 = "Self-documenting code??",
doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@quantedainit"
[2] "and"
[3] "#textanalysis"
[4] "https://example.com?p=123."
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
The what option selects different tokenizers. The default is word, which replaces a slower and less subtle legacy version, word1.
tokens(txt, what = "word1")
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@" "quantedainit" "and"
[4] "#" "textanalysis" "https"
[7] ":" "/" "/"
[10] "example.com" "?" "p"
[ ... and 3 more ]
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
For some purposes you may wish to tokenize by characters:
tokens(txt[1], what = "character")
Tokens consisting of 1 document.
doc1 :
[1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
[ ... and 22 more ]
You can "tokenize" (the usual term is "segment") by sentence in Quanteda, but note that they recommend the spacyr package (discussed in a separate notebook) for better sentence segmentation. Let's try it on Washington's inaugural:
tokens(corp[1], what = "sentence")
Tokens consisting of 1 document and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month."
[2] "On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time."
[3] "On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies."
[4] "In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected."
[5] "All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated."
[6] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge."
[7] "In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either."
[8] "No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States."
[9] "Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage."
[10] "These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed."
[11] "You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence."
[12] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\""
[ ... and 11 more ]
Wow, those are long sentences. Out of curiosity, let's look at Trump's:
tokens(corp[58], what = "sentence")
Tokens consisting of 1 document and 4 docvars.
2017-Trump :
[1] "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you."
[2] "We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people."
[3] "Together, we will determine the course of America and the world for many, many years to come."
[4] "We will face challenges."
[5] "We will confront hardships."
[6] "But we will get the job done."
[7] "Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition."
[8] "They have been magnificent."
[9] "Thank you."
[10] "Today's ceremony, however, has very special meaning."
[11] "Because today we are not merely transferring power from one Administration to another, or from one party to another - but we are transferring power from Washington DC and giving it back to you, the people."
[12] "For too long, a small group in our nation's Capital has reaped the rewards of government while the people have borne the cost."
[ ... and 76 more ]
Those are ... shorter.
There are a number of options you can apply with the tokens command, controlling how the tokenizer deals with punctuation, numbers, symbols, hyphenation, etc. Again, just the help file examples:
# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
Tokens consisting of 2 documents.
doc1 :
[1] "A" "sentence" "showing" "how"
[5] "tokens" "works"
doc2 :
[1] "@quantedainit"
[2] "and"
[3] "#textanalysis"
[4] "https://example.com?p=123."
# splitting hyphenated words
tokens(txt[3])
Tokens consisting of 1 document.
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
tokens(txt[3], split_hyphens = TRUE)
Tokens consisting of 1 document.
doc3 :
[1] "Self" "-" "documenting"
[4] "code" "?" "?"
# symbols and numbers
tokens(txt[4])
Tokens consisting of 1 document.
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
tokens(txt[4], remove_numbers = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "£" "for" "¢"
[4] "is" "gr8" "4ever"
[7] "\U0001f600"
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "for" "is" "gr8" "4ever"
You can use other tokenizers, like those from the "tokenizers" package. The output of a command like tokenizers::tokenize_words can be passed to the tokens command:
# install.packages("tokenizers")
library(tokenizers)
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "1,000,000" "for" "50" "is"
[5] "gr8" "4ever"
# using pipe notation
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) %>%
tokens(remove_symbols = TRUE)
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@" "quantedainit" "and"
[4] "#" "textanalysis" "https"
[7] ":" "/" "/"
[10] "example.com" "?" "p"
[ ... and 2 more ]
doc3 :
[1] "Self" "-" "documenting"
[4] "code" "?" "?"
doc4 :
[1] "1,000,000" "for" "50" "is"
[5] "gr8" "4ever"
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) %>%
tokens(remove_punct = TRUE)
Tokens consisting of 1 document.
doc3 :
[1] "s" "e" "l" "f" "d" "o" "c" "u" "m" "e" "n" "t"
[ ... and 7 more ]
tokenizers::tokenize_sentences(
"The quick brown fox. It jumped over the lazy dog.") %>%
tokens()
Tokens consisting of 1 document.
text1 :
[1] "The quick brown fox."
[2] "It jumped over the lazy dog."
Look carefully -- what did it do differently?
Let's make a fairly generic tokens object from our inaugural speeches corpus.
inaugural_tokens <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE, # default FALSE
remove_symbols = TRUE, # default FALSE
remove_numbers = FALSE,
remove_url = TRUE, # default FALSE
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE,
verbose = quanteda_options("verbose")
)
This produces a tokens class object. Expand the object in your RStudio Environment tab to take a look at it. Foremost, it's a list with one entry per document, consisting of a character vector of the document's tokens.
inaugural_tokens[["2017-Trump"]][1:30]
[1] "Chief" "Justice" "Roberts" "President"
[5] "Carter" "President" "Clinton" "President"
[9] "Bush" "President" "Obama" "fellow"
[13] "Americans" "and" "people" "of"
[17] "the" "world" "thank" "you"
[21] "We" "the" "citizens" "of"
[25] "America" "are" "now" "joined"
[29] "in" "a"
It also has a vector of the "types" -- the vocabulary of tokens in the whole corpus/object. This attribute can be accessed through the attr function.
attr(inaugural_tokens,"types")[1:30]
[1] "Fellow-Citizens" "of"
[3] "the" "Senate"
[5] "and" "House"
[7] "Representatives" "Among"
[9] "vicissitudes" "incident"
[11] "to" "life"
[13] "no" "event"
[15] "could" "have"
[17] "filled" "me"
[19] "with" "greater"
[21] "anxieties" "than"
[23] "that" "which"
[25] "notification" "was"
[27] "transmitted" "by"
[29] "your" "order"
length(attr(inaugural_tokens, "types"))
[1] 10147
Just over 10,000 unique tokens have been used. Notice that "the" appears third and never again. But ... "The" does:
which(attr(inaugural_tokens,"types")=="The")
[1] 339
Why are "the" and "The" different types? Why is "Fellow-Citizens" one type?
Under the hood, the tokens vector isn't a vector of strings. It's a vector of integers, indicating the index of each token in the type vector. So every time "the" appears, it is stored as the integer 3.
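If you're curious, you can peek at this internal representation by stripping the class (this is an implementation detail, so it may change across quanteda versions):
# integer codes for the first tokens of Washington's first inaugural
unclass(inaugural_tokens)[["1789-Washington"]][1:10]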
By default, the tokens object also retains all of the document metadata that came with the corpus.
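You can pull that metadata out with the docvars function:
head(docvars(inaugural_tokens))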
The tokens object also provides access to a variety of quanteda utilities. For example, a very helpful traditional qualitative tool is the Key Words in Context, or kwic, command:
kwic(inaugural_tokens, "humble", window=3)
Keyword-in-context with 13 matches.
[1789-Washington, 572] along with an | humble | anticipation of the
[1789-Washington, 1359] Human Race in | humble | supplication that since
[1797-Adams, 2123] age and with | humble | reverence I feel
[1801-Jefferson, 169] the contemplation and | humble | myself before the
[1821-Monroe, 173] favor of my | humble | pretensions the difficulties
[1825-Adams, 2902] I commit with | humble | but fearless confidence
[1829-Jackson, 85] dedication of my | humble | abilities to their
[1833-Jackson, 91] extent of my | humble | abilities in continued
[1853-Pierce, 3174] in the nation's | humble | acknowledged dependence upon
[1857-Buchanan, 1204] I feel an | humble | confidence that the
[1953-Eisenhower, 765] of the most | humble | and of the
[1997-Clinton, 586] a new century | humble | enough not to
[2009-Obama, 1760] we remember with | humble | gratitude those brave
kwic(inaugural_tokens, "tombstones", window=4)
Keyword-in-context with 1 match.
[2017-Trump, 456] rusted-out factories scattered like | tombstones | across the landscape of
Hmmm. Moving on.
Stemming is the truncation of words in an effort to associate related words with a common token, e.g., "baby" and "babies" -> "babi".
The tokenizers package provides a wrapper to the wordStem function from the SnowballC package, which applies a standard stemmer called the Porter stemmer. (The function takes as input a vector of texts or a corpus, and returns a list, with each element a vector of the stems for the corresponding text.)
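For instance, calling the stemmer directly on a couple of words (a quick check, assuming the SnowballC package is installed):
# Porter stemmer: related words map to the same stem
SnowballC::wordStem(c("baby", "babies"), language = "porter")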
tokenizers::tokenize_word_stems(corp)$`2017-Trump`[1:50]
[1] "chief" "justic" "robert" "presid"
[5] "carter" "presid" "clinton" "presid"
[9] "bush" "presid" "obama" "fellow"
[13] "american" "and" "peopl" "of"
[17] "the" "world" "thank" "you"
[21] "we" "the" "citizen" "of"
[25] "america" "are" "now" "join"
[29] "in" "a" "great" "nation"
[33] "effort" "to" "rebuild" "our"
[37] "countri" "and" "restor" "it"
[41] "promis" "for" "all" "of"
[45] "our" "peopl" "togeth" "we"
[49] "will" "determin"
Quanteda is focused largely on bag-of-words (or bag-of-tokens or bag-of-terms) models that work from a document-term matrix, where each row represents a document, each column represents a type (a "term" in the vocabulary), and the entries are the counts of tokens matching the term in the given document.
For this we will use quanteda's dfm command with some commonly chosen preprocessing options. In older versions of quanteda, the dfm function was applied to a corpus, with tokenizing and normalizing options applied there. It is now applied to a tokens object where most of that has already been done. Here, we'll add case-folding, merging "the" and "The", among other things, into a single type.
doc_term_matrix <- quanteda::dfm(inaugural_tokens,
tolower = TRUE # case-fold
)
What kind of object is doc_term_matrix?
class(doc_term_matrix)
[1] "dfm"
attr(,"package")
[1] "quanteda"
Typing the dfm's name will show an object summary. This is a matrix, so how many rows does it have? How many columns? What does "91.89% sparse" mean?
doc_term_matrix
Document-feature matrix of: 59 documents, 9,422 features (91.89% sparse) and 4 docvars.
docs fellow-citizens of the senate and house representatives among vicissitudes incident
1789-Washington 1 71 116 1 48 2 2 1 1 1
1793-Washington 0 11 13 0 2 0 0 0 0 0
1797-Adams 3 140 163 1 130 0 2 4 0 0
1801-Jefferson 2 104 130 0 81 0 0 1 0 0
1805-Jefferson 0 101 143 0 93 0 0 7 0 0
1809-Madison 1 69 104 0 43 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,412 more features ]
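That "91.89% sparse" is the share of cells in the matrix that are zero -- most terms never appear in most documents. You can verify it with quanteda's sparsity helper, or by hand:
sparsity(doc_term_matrix)
# same thing by hand: zero cells over total cells
sum(doc_term_matrix == 0) / (ndoc(doc_term_matrix) * nfeat(doc_term_matrix))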
You can peek inside it, indexing it like you would a matrix or Matrix object:
doc_term_matrix[1:5,1:5]
Document-feature matrix of: 5 documents, 5 features (20.00% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and
1789-Washington 1 71 116 1 48
1793-Washington 0 11 13 0 2
1797-Adams 3 140 163 1 130
1801-Jefferson 2 104 130 0 81
1805-Jefferson 0 101 143 0 93
What are the most frequent terms?
topfeatures(doc_term_matrix,40)
the of and to
10183 7180 5406 4591
in a our we
2827 2292 2224 1827
that be is it
1813 1502 1491 1398
for by have which
1230 1091 1031 1007
not with as will
980 970 966 944
this i all are
874 871 836 828
their but has people
761 670 631 584
from its government or
578 573 564 563
on my us been
544 515 505 496
can no they so
487 470 463 397
You can get the same thing by sorting the column sums of the dtm:
word_freq <- colSums(doc_term_matrix)
sort(word_freq,decreasing=TRUE)[1:40]
the of and to
10183 7180 5406 4591
in a our we
2827 2292 2224 1827
that be is it
1813 1502 1491 1398
for by have which
1230 1091 1031 1007
not with as will
980 970 966 944
this i all are
874 871 836 828
their but has people
761 670 631 584
from its government or
578 573 564 563
on my us been
544 515 505 496
can no they so
487 470 463 397
For some purposes, you may wish to remove "stopwords." There are stopword lists accessible through the stopwords function, exported from the automatically loaded stopwords package. The default is English from the Snowball collection. Get a list of sources with stopwords_getsources() and a list of languages available for a given source with stopwords_getlanguages().
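For example:
stopwords_getsources() # sources available in the stopwords package
stopwords_getlanguages('snowball') # languages covered by one source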
The default English list is fairly short.
stopwords('en')[1:10] #Snowball
[1] "i" "me" "my" "myself"
[5] "we" "our" "ours" "ourselves"
[9] "you" "your"
length(stopwords('en'))
[1] 175
This one's more than three times as long.
stopwords('en', source='smart')[1:10]
[1] "a" "a's" "able"
[4] "about" "above" "according"
[7] "accordingly" "across" "actually"
[10] "after"
length(stopwords('en', source='smart'))
[1] 571
This one's more than seven times as long and is ... interesting:
stopwords('en', source='stopwords-iso')[1:10]
[1] "'ll" "'tis" "'twas" "'ve"
[5] "10" "39" "a" "a's"
[9] "able" "ableabout"
length(stopwords('en', source='stopwords-iso'))
[1] 1298
The beginning of a German list.
stopwords('de')[1:10]
[1] "aber" "alle" "allem" "allen" "aller" "alles"
[7] "als" "also" "am" "an"
A slice from an Ancient Greek list:
stopwords('grc',source='ancient')[264:288]
[1] "xxx" "xxxi" "xxxii" "xxxiii"
[5] "xxxiv" "xxxix" "xxxv" "xxxvi"
[9] "xxxvii" "xxxviii" "y" "z"
[13] "α" "ἅ" "ἃ" "ᾇ"
[17] "ἄγαν" "ἄγε" "ἄγχι" "ἀγχοῦ"
[21] "ἁγώ" "ἁγὼ" "ἅγωγ" "ἁγών"
[25] "ἁγὼν"
Let's case-fold our tokens object to lowercase, remove the stopwords, then make a new dtm and see how it's different.
inaugural_tokens.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.nostop <- dfm(inaugural_tokens.nostop)
dtm.nostop
Document-feature matrix of: 59 documents, 9,284 features (92.70% sparse) and 4 docvars.
docs fellow-citizens senate house representatives among vicissitudes incident life event filled
1789-Washington 1 1 2 2 1 1 1 1 2 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 3 1 0 2 4 0 0 2 0 0
1801-Jefferson 2 0 0 0 1 0 0 1 0 0
1805-Jefferson 0 0 0 0 7 0 0 2 0 0
1809-Madison 1 0 0 0 0 0 0 1 0 1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,274 more features ]
We've got about 140 fewer features, and the matrix is slightly more sparse. Why?
What are the most frequent tokens now?
topfeatures(dtm.nostop,40)
people government us can
584 564 505 487
must upon great may
376 371 344 343
states world shall country
334 319 316 308
nation every one peace
305 300 267 258
new power now public
250 241 229 225
time citizens constitution united
220 209 209 203
america nations union freedom
202 199 190 185
free war american let
183 181 172 160
national made good make
158 156 149 147
years justice men without
143 142 140 140
I'm just curious. Besides "tombstones," what other words made their inaugural debut in 2017?
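# TRUE for terms whose total count across all addresses equals their count in 2017-Trump, i.e., terms appearing only in that document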
unique_to_trump <- as.vector(colSums(doc_term_matrix) == doc_term_matrix["2017-Trump",])
colnames(doc_term_matrix)[unique_to_trump]
[1] "obama" "hardships"
[3] "lady" "michelle"
[5] "transferring" "dc"
[7] "reaped" "politicians"
[9] "2017" "listening"
[11] "likes" "neighborhoods"
[13] "trapped" "rusted-out"
[15] "tombstones" "landscape"
[17] "flush" "stolen"
[19] "robbed" "unrealized"
[21] "carnage" "stops"
[23] "we've" "subsidized"
[25] "allowing" "sad"
[27] "depletion" "trillions"
[29] "overseas" "infrastructure"
[31] "disrepair" "ripped"
[33] "redistributed" "issuing"
[35] "ravages" "stealing"
[37] "tunnels" "hire"
[39] "goodwill" "shine"
[41] "reinforce" "islamic"
[43] "bedrock" "disagreements"
[45] "solidarity" "unstoppable"
[47] "complaining" "arrives"
[49] "mysteries" "brown"
[51] "bleed" "sprawl"
[53] "windswept" "nebraska"
[55] "ignored"
OK!
We can also change the settings. What happens if we don't remove punctuation?
inaugural_tokens.wpunct <- quanteda::tokens(corp,
what = "word",
remove_punct = FALSE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.wpunct <- dfm(inaugural_tokens.wpunct)
dtm.wpunct
Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
docs fellow-citizens senate house representatives : among vicissitudes incident life event
1789-Washington 1 1 2 2 1 1 1 1 1 2
1793-Washington 0 0 0 0 1 0 0 0 0 0
1797-Adams 3 1 0 2 0 4 0 0 2 0
1801-Jefferson 2 0 0 0 1 1 0 0 1 0
1805-Jefferson 0 0 0 0 0 7 0 0 2 0
1809-Madison 1 0 0 0 0 0 0 0 1 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]
topfeatures(dtm.wpunct,40)
, . people ;
7173 5155 584 565
government us can -
564 505 487 403
must upon great may
376 371 344 343
states world shall country
334 319 316 308
nation every one peace
305 300 267 258
" new power now
256 250 241 229
public time citizens constitution
225 220 209 209
united america nations union
203 202 199 190
freedom free war american
185 183 181 172
let national made good
160 158 156 149
How big is it now? How sparse is it now?
What happens if we lower case and stem?
inaugural_tokens.stems <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_wordstem()
dtm.stems <- dfm(inaugural_tokens.stems)
dtm.stems
Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
docs fellow-citizen senat hous repres among vicissitud incid life event fill
1789-Washington 1 1 2 2 1 1 1 1 2 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 3 1 3 3 4 0 0 2 0 0
1801-Jefferson 2 0 0 1 1 0 0 1 0 0
1805-Jefferson 0 0 0 0 7 0 0 2 1 0
1809-Madison 1 0 0 1 0 1 0 1 0 1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
topfeatures(dtm.stems,40)
nation govern peopl us can
691 657 632 505 487
state great must power upon
452 378 376 375 371
countri world may shall everi
359 347 343 316 300
constitut peac one right time
289 288 279 279 271
law citizen american new america
271 265 257 250 242
public now unit duti war
229 229 225 212 204
make interest union freedom free
202 197 190 190 184
secur hope year good let
178 176 176 163 160
It's somewhat difficult to get your head around these sorts of things, but there are statistical regularities here. For example, these frequencies tend to be distributed according to "Zipf's law" and a (related) "power law."
plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE), main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank")
That makes the "long tail" clear. The grand relationship becomes clearer on a logarithmic scale:
plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE), main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank", log = "xy")
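If Zipf's law held exactly, these points would fall on a straight line with a slope near -1. A quick regression on the logged values gives a rough estimate of that slope (a casual check, not a rigorous power-law fit):
freq_sorted <- sort(colSums(doc_term_matrix), decreasing = TRUE)
zipf_fit <- lm(log(freq_sorted) ~ log(seq_along(freq_sorted)))
coef(zipf_fit) # a slope near -1 is consistent with Zipf's law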
For the power law, we need the number of words that appear at any given frequency. We'll turn word_freq into a categorical variable by making it a "factor" -- the categories are "1", "2", ... "17", etc. -- and then use summary to give us counts of each "category." (The maxsum option is used to be sure it doesn't stop at 100 categories and lump everything else together as "Other".)
words_with_freq <- summary(as.factor(word_freq),maxsum=10000)
freq_bin <- as.integer(names(words_with_freq))
plot(freq_bin, words_with_freq, main="Power Law?", xlab="Word Frequency", ylab="Number of Words", log="xy")
Zipf's law implies that, in a new corpus, say, a small number of terms will be very common (we'll know a lot about them, but they won't help us distinguish documents), a large number of terms will be very rare (we'll know very little about them), and some terms will not have been seen before at all. This "out-of-vocabulary" (OOV) problem is an important one in some applications.
Let's go back to preprocessing choices. What happens if we count bigrams? Let's first do it without removing stopwords.
inaugural_tokens.2grams <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_ngrams(n=2)
dtm.2grams <- dfm(inaugural_tokens.2grams)
dtm.2grams
Document-feature matrix of: 59 documents, 65,497 features (97.10% sparse) and 4 docvars.
docs fellow-citizens_of of_the the_senate senate_and and_of the_house house_of of_representatives representatives_among among_the
1789-Washington 1 20 1 1 2 2 2 2 1 1
1793-Washington 0 4 0 0 1 0 0 0 0 0
1797-Adams 0 29 0 0 2 0 0 0 0 2
1801-Jefferson 0 28 0 0 3 0 0 0 0 0
1805-Jefferson 0 17 0 0 1 0 0 0 0 1
1809-Madison 0 20 0 0 2 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 65,487 more features ]
topfeatures(dtm.2grams,40)
of_the in_the to_the
1772 821 727
of_our and_the it_is
628 474 324
by_the for_the to_be
322 315 313
the_people we_have of_a
270 264 262
with_the that_the the_world
240 221 214
have_been will_be has_been
205 199 185
on_the is_the we_are
183 183 178
from_the and_to the_government
168 166 165
the_united united_states all_the
164 158 157
in_our the_constitution we_will
156 156 156
of_all of_this of_their
153 142 141
should_be in_a and_in
139 138 132
those_who we_must of_my
129 128 126
may_be
126
How big is it? How sparse? It doesn't give us a lot of sense of content, but it does offer some rudimentary insights into how English is structured.
For example, we can create a rudimentary statistical language model that "predicts" the next word based on bigram frequencies. We estimate the conditional probability of each candidate next word by taking the frequency of the bigrams that start with the current word and dividing by the total frequency of those bigrams (see Jurafsky and Martin, Speech and Language Processing, Chapter 3: https://web.stanford.edu/~jurafsky/slp3/ for more detail / nuance).
If the current word is "american" what is probably next, in this corpus?
First we find the right bigrams using a regular expression. See the regular expressions notebook for more detail if that is unfamiliar.
american_bigrams <- grep("^american_",colnames(dtm.2grams),value=TRUE)
american_bigrams
[1] "american_covenant"
[2] "american_lives"
[3] "american_treasure"
[4] "american_belief"
[5] "american_heart"
[6] "american_spirit"
[7] "american_people"
[8] "american_business"
[9] "american_to"
[10] "american_policy"
[11] "american_citizen"
[12] "american_democracy"
[13] "american_auspices"
[14] "american_instinct"
[15] "american_interests"
[16] "american_freemen"
[17] "american_industries"
[18] "american_labor"
[19] "american_life"
[20] "american_citizens"
[21] "american_dream"
[22] "american_if"
[23] "american_name"
[24] "american_citizenship"
[25] "american_rights"
[26] "american_flag"
[27] "american_above"
[28] "american_merchant"
[29] "american_navy"
[30] "american_market"
[31] "american_family"
[32] "american_products"
[33] "american_arms"
[34] "american_states"
[35] "american_that"
[36] "american_he"
[37] "american_steamship"
[38] "american_control"
[39] "american_failure"
[40] "american_experiment"
[41] "american_story"
[42] "american_sovereignty"
[43] "american_renewal"
[44] "american_on"
[45] "american_enjoys"
[46] "american_revolution"
[47] "american_must"
[48] "american_slavery"
[49] "american_sense"
[50] "american_standards"
[51] "american_manliness"
[52] "american_bottoms"
[53] "american_childhood"
[54] "american_character"
[55] "american_achievement"
[56] "american_century"
[57] "american_we"
[58] "american_here"
[59] "american_in"
[60] "american_promise"
[61] "american_today"
[62] "american_conscience"
[63] "american_carnage"
[64] "american_industry"
[65] "american_workers"
[66] "american_way"
[67] "american_families"
[68] "american_hands"
[69] "american_and"
[70] "american_statesmen"
[71] "american_ideals"
[72] "american_statesmanship"
[73] "american_freedom"
[74] "american_ideal"
[75] "american_a"
[76] "american_destiny"
[77] "american_history"
[78] "american_who"
[79] "american_emancipation"
[80] "american_i"
[81] "american_opportunity"
[82] "american_subjects"
[83] "american_is"
[84] "american_anthem"
[85] "american_political"
[86] "american_soldiers"
[87] "american_sound"
[88] "american_she"
[89] "american_being"
Most likely bigrams starting with "american":
freq_american_bigrams <- colSums(dtm.2grams[,american_bigrams])
most_likely_bigrams <- sort(freq_american_bigrams/sum(freq_american_bigrams),dec=TRUE)[1:10]
most_likely_bigrams
american_people american_citizen
0.23255814 0.03488372
american_story american_dream
0.03488372 0.02906977
american_to american_citizens
0.02325581 0.02325581
american_citizenship american_policy
0.02325581 0.01744186
american_labor american_that
0.01744186 0.01744186
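The same logic generalizes to any starting word. Here is a minimal sketch of a next-word "predictor" built on these bigram counts (next_word_probs is our own throwaway helper, not a quanteda function):
next_word_probs <- function(word, dtm) {
  # bigram features that begin with the given word
  bigrams <- grep(paste0("^", word, "_"), colnames(dtm), value = TRUE)
  counts <- colSums(dtm[, bigrams])
  sort(counts / sum(counts), decreasing = TRUE)
}
head(next_word_probs("people", dtm.2grams), 5)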
Let's see what happens if we remove the stopwords first.
inaugural_tokens.2grams.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_ngrams(n=2)
dtm.2grams.nostop <- dfm(inaugural_tokens.2grams.nostop)
dtm.2grams.nostop
Document-feature matrix of: 59 documents, 57,723 features (98.12% sparse) and 4 docvars.
docs fellow-citizens_senate senate_house house_representatives representatives_among among_vicissitudes vicissitudes_incident incident_life life_event event_filled filled_greater
1789-Washington 1 1 2 1 1 1 1 1 1 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 0 0 0 0 0 0 0 0 0 0
1801-Jefferson 0 0 0 0 0 0 0 0 0 0
1805-Jefferson 0 0 0 0 0 0 0 0 0 0
1809-Madison 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 57,713 more features ]
topfeatures(dtm.2grams.nostop,40)
united_states let_us
158 105
fellow_citizens american_people
78 40
federal_government men_women
32 28
years_ago four_years
27 26
upon_us general_government
25 25
one_another government_can
22 20
every_citizen constitution_united
20 20
fellow_americans vice_president
19 18
great_nation among_nations
17 17
government_people god_bless
17 16
people_can people_united
15 15
foreign_nations may_well
15 15
almighty_god peace_world
15 14
form_government chief_justice
14 14
nations_world among_people
14 14
national_life every_american
13 13
free_people can_never
13 12
administration_government people_world
12 12
within_limits constitution_laws
12 12
public_debt one_nation
12 11
How big is it? How sparse? It gives some interesting content -- "great_nation", "almighty_god", "public_debt" -- but also some confusing constructions, e.g., "people_world", which is really things like "people of the world": the bigrams jump the gaps left by removed stopwords.
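One way to avoid those false adjacencies is to leave a placeholder ("pad") wherever a stopword was removed, so that n-grams are not formed across the gaps -- a sketch:
inaugural_tokens.2grams.padded <- inaugural_tokens %>%
  tokens_tolower() %>%
  tokens_remove(stopwords('en'), padding = TRUE) %>% # keep empty pads
  tokens_ngrams(n = 2) # n-grams will not span the pads
dfm(inaugural_tokens.2grams.padded)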
Ugh, well, yes, if you must. Wordclouds are an abomination -- I'll rant about that at a later date -- but here's Trump's first inaugural in a wordcloud ...
library(quanteda.textplots)
set.seed(100)
textplot_wordcloud(dtm.nostop["2017-Trump",], min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
Save a copy of the notebook and use it to answer the questions below. Those labeled "Challenge" require more than is demonstrated above.
1) Use the inaugural_tokens.nostop object. Define a word's "context" as a window of five words/tokens before and after a word's usage. In what contexts does the word "Roman" appear in this corpus?
2) Using dtm.wpunct, which president used the most exclamation points in his inaugural address?
3) Use dtm.nostop for these questions.
a) Do any terms appear only in the document containing Abraham Lincoln's first inaugural address?
b) Challenge: How many terms appeared first in Abraham Lincoln's first inaugural address?
c) How many times has the word "slave" been used in inaugural addresses?
d) Challenge: How many times has a word that included "slave" (like "slavery" or "enslaved") been used in inaugural addresses?
4) Construct a dtm of trigrams (lower case, not stemmed, no stop words removed).
a) How big is the matrix? How sparse is it?
b) What are the 50 most frequent trigrams?
c) Challenge: How many trigrams appear only once?
5) Tokenize the following string of tweets using the built-in word tokenizer, the tokenize_words tokenizer from the tokenizers package, and the tokenize_tweets tokenizer from the tokenizers package, and explain what's different.
https://t.co/9z2J3P33Uc FB needs to hurry up and add a laugh/cry button 😬😭😓🤢🙄😱 Since eating my feelings has not fixed the world's problems, I guess I'll try to sleep... HOLY CRAP: DeVos questionnaire appears to include passages from uncited sources https://t.co/FNRoOlfw9s well played, Senator Murray Keep the pressure on: https://t.co/4hfOsmdk0l @datageneral thx Mr Taussig It's interesting how many people contact me about applying for a PhD and don't spell my name right.