Updated September 2021.
The quanteda package (https://quanteda.io) is a very general and well-documented ecosystem for text analysis in R. A very large percentage of what is typically done in social science text-as-data research can be done with, or at least through, quanteda. Among the "competitors" to quanteda are the classic package tm and the tidyverse-consistent package tidytext. These actually are interrelated, with shared code and conversion utilities available, so they aren't necessarily in conflict.
Official description:
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
In addition to the extensive documentation, Stefan Müller and Ken Benoit have a very helpful cheatsheet here: https://muellerstefan.net/files/quanteda-cheatsheet.pdf.
In this notebook, we will use quanteda to turn a collection of texts -- a corpus -- into quantitative data, with each document represented by the counts of the "words" in it. Since we do away with word order, this is called a bag-of-words representation.
Install the following packages if you haven't already:
# install.packages("quanteda", dependencies=TRUE)
# install.packages("tokenizers", dependencies=TRUE)
# install.packages("quanteda.textplots", dependencies=TRUE)
# install.packages("RColorBrewer", dependencies=TRUE)
Note that recent versions of quanteda have moved analysis and plotting functions into the new packages quanteda.textplots, quanteda.textmodels (classification and scaling models), and quanteda.textstats.
Now load quanteda:
library(quanteda)
If you are working on RStudio Cloud, you may have received a warning message about the "locale." You can set the locale to British English ("en_GB") with the stri_locale_set command in the already-loaded stringi package. You may wish to set it to assume you are working in a different context (e.g., "en_US" for US English) or language (e.g., "pt_BR" for Brazilian Portuguese). This seems to happen every time an RStudio Cloud project with quanteda loaded is reopened, so you have to reissue this command to make the warning message go away.
# stringi::stri_locale_set("en_GB")
Quanteda comes with several corpora included. Let's load in the corpus of US presidential inaugural addresses and see what it looks like:
corp <- quanteda::data_corpus_inaugural
summary(corp)
Corpus consisting of 59 documents, showing 59 documents:
Text Types Tokens Sentences Year President FirstName Party
1789-Washington 625 1537 23 1789 Washington George none
1793-Washington 96 147 4 1793 Washington George none
1797-Adams 826 2577 37 1797 Adams John Federalist
1801-Jefferson 717 1923 41 1801 Jefferson Thomas Democratic-Republican
1805-Jefferson 804 2380 45 1805 Jefferson Thomas Democratic-Republican
1809-Madison 535 1261 21 1809 Madison James Democratic-Republican
1813-Madison 541 1302 33 1813 Madison James Democratic-Republican
1817-Monroe 1040 3677 121 1817 Monroe James Democratic-Republican
1821-Monroe 1259 4886 131 1821 Monroe James Democratic-Republican
1825-Adams 1003 3147 74 1825 Adams John Quincy Democratic-Republican
1829-Jackson 517 1208 25 1829 Jackson Andrew Democratic
1833-Jackson 499 1267 29 1833 Jackson Andrew Democratic
1837-VanBuren 1315 4158 95 1837 Van Buren Martin Democratic
1841-Harrison 1898 9123 210 1841 Harrison William Henry Whig
1845-Polk 1334 5186 153 1845 Polk James Knox Whig
1849-Taylor 496 1178 22 1849 Taylor Zachary Whig
1853-Pierce 1165 3636 104 1853 Pierce Franklin Democratic
1857-Buchanan 945 3083 89 1857 Buchanan James Democratic
1861-Lincoln 1075 3999 135 1861 Lincoln Abraham Republican
1865-Lincoln 360 775 26 1865 Lincoln Abraham Republican
1869-Grant 485 1229 40 1869 Grant Ulysses S. Republican
1873-Grant 552 1472 43 1873 Grant Ulysses S. Republican
1877-Hayes 831 2707 59 1877 Hayes Rutherford B. Republican
1881-Garfield 1021 3209 111 1881 Garfield James A. Republican
1885-Cleveland 676 1816 44 1885 Cleveland Grover Democratic
1889-Harrison 1352 4721 157 1889 Harrison Benjamin Republican
1893-Cleveland 821 2125 58 1893 Cleveland Grover Democratic
1897-McKinley 1232 4353 130 1897 McKinley William Republican
1901-McKinley 854 2437 100 1901 McKinley William Republican
1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore Republican
1909-Taft 1437 5821 158 1909 Taft William Howard Republican
1913-Wilson 658 1882 68 1913 Wilson Woodrow Democratic
1917-Wilson 549 1652 59 1917 Wilson Woodrow Democratic
1921-Harding 1169 3719 148 1921 Harding Warren G. Republican
1925-Coolidge 1220 4440 196 1925 Coolidge Calvin Republican
1929-Hoover 1090 3860 158 1929 Hoover Herbert Republican
1933-Roosevelt 743 2057 85 1933 Roosevelt Franklin D. Democratic
1937-Roosevelt 725 1989 96 1937 Roosevelt Franklin D. Democratic
1941-Roosevelt 526 1519 68 1941 Roosevelt Franklin D. Democratic
1945-Roosevelt 275 633 27 1945 Roosevelt Franklin D. Democratic
1949-Truman 781 2504 116 1949 Truman Harry S. Democratic
1953-Eisenhower 900 2743 119 1953 Eisenhower Dwight D. Republican
1957-Eisenhower 621 1907 92 1957 Eisenhower Dwight D. Republican
1961-Kennedy 566 1541 52 1961 Kennedy John F. Democratic
1965-Johnson 568 1710 93 1965 Johnson Lyndon Baines Democratic
1969-Nixon 743 2416 103 1969 Nixon Richard Milhous Republican
1973-Nixon 544 1995 68 1973 Nixon Richard Milhous Republican
1977-Carter 527 1369 52 1977 Carter Jimmy Democratic
1981-Reagan 902 2780 129 1981 Reagan Ronald Republican
1985-Reagan 925 2909 123 1985 Reagan Ronald Republican
1989-Bush 795 2673 141 1989 Bush George Republican
1993-Clinton 642 1833 81 1993 Clinton Bill Democratic
1997-Clinton 773 2436 111 1997 Clinton Bill Democratic
2001-Bush 621 1806 97 2001 Bush George W. Republican
2005-Bush 772 2312 99 2005 Bush George W. Republican
2009-Obama 938 2689 110 2009 Obama Barack Democratic
2013-Obama 814 2317 88 2013 Obama Barack Democratic
2017-Trump 582 1660 88 2017 Trump Donald J. Republican
2021-Biden.txt 811 2766 216 2021 Biden Joseph R. Democratic
What does a document look like? Let's look at one document (George Washington's first inaugural), which can be accessed with the as.character method. (The previous texts command has been deprecated.)
as.character(corp[1])
1789-Washington
"Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. 
These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
The first task is tokenizing. You can apply a tokenizer in quanteda with the tokens command, turning a "corpus" object -- or just a vector of texts -- into a "tokens" object. In the latest version of quanteda, most commands operate on a tokens object.
The examples from the help file will be used to show a few of the options:
txt <- c(doc1 = "A sentence, showing how tokens() works.",
doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
doc3 = "Self-documenting code??",
doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@quantedainit"
[2] "and"
[3] "#textanalysis"
[4] "https://example.com?p=123."
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
The what option selects different tokenizers. The default is word, which replaces a slower and less subtle legacy version, word1.
tokens(txt, what = "word1")
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@" "quantedainit" "and"
[4] "#" "textanalysis" "https"
[7] ":" "/" "/"
[10] "example.com" "?" "p"
[ ... and 3 more ]
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
For some purposes you may wish to tokenize by characters:
tokens(txt[1], what = "character")
Tokens consisting of 1 document.
doc1 :
[1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
[ ... and 22 more ]
You can "tokenize" (the usual term is "segment") by sentence in Quanteda, but note that they recommend the spacyr package (discussed in a separate notebook) for better sentence segmentation. Let's try it on Washington's inaugural:
tokens(corp[1], what = "sentence")
Tokens consisting of 1 document and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month."
[2] "On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years - a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time."
[3] "On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies."
[4] "In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected."
[5] "All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated."
[6] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge."
[7] "In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either."
[8] "No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States."
[9] "Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage."
[10] "These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed."
[11] "You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence."
[12] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\""
[ ... and 11 more ]
Wow, those are long sentences. Out of curiosity, let's look at Trump's:
tokens(corp[58], what = "sentence")
Tokens consisting of 1 document and 4 docvars.
2017-Trump :
[1] "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you."
[2] "We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people."
[3] "Together, we will determine the course of America and the world for many, many years to come."
[4] "We will face challenges."
[5] "We will confront hardships."
[6] "But we will get the job done."
[7] "Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition."
[8] "They have been magnificent."
[9] "Thank you."
[10] "Today's ceremony, however, has very special meaning."
[11] "Because today we are not merely transferring power from one Administration to another, or from one party to another - but we are transferring power from Washington DC and giving it back to you, the people."
[12] "For too long, a small group in our nation's Capital has reaped the rewards of government while the people have borne the cost."
[ ... and 76 more ]
Those are ... shorter.
There are a number of options you can apply with the tokens command, controlling how the tokenizer deals with punctuation, numbers, symbols, hyphenation, etc. Again, just the help file examples:
# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
Tokens consisting of 2 documents.
doc1 :
[1] "A" "sentence" "showing" "how"
[5] "tokens" "works"
doc2 :
[1] "@quantedainit"
[2] "and"
[3] "#textanalysis"
[4] "https://example.com?p=123."
# splitting hyphenated words
tokens(txt[3])
Tokens consisting of 1 document.
doc3 :
[1] "Self-documenting" "code"
[3] "?" "?"
tokens(txt[3], split_hyphens = TRUE)
Tokens consisting of 1 document.
doc3 :
[1] "Self" "-" "documenting"
[4] "code" "?" "?"
# symbols and numbers
tokens(txt[4])
Tokens consisting of 1 document.
doc4 :
[1] "£" "1,000,000" "for"
[4] "50" "¢" "is"
[7] "gr8" "4ever" "\U0001f600"
tokens(txt[4], remove_numbers = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "£" "for" "¢"
[4] "is" "gr8" "4ever"
[7] "\U0001f600"
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "for" "is" "gr8" "4ever"
You can use other tokenizers, like those from the "tokenizers" package. The output of a command like tokenizers::tokenize_words can be passed to the tokens command:
# install.packages("tokenizers")
library(tokenizers)
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
Tokens consisting of 1 document.
doc4 :
[1] "1,000,000" "for" "50" "is"
[5] "gr8" "4ever"
# using pipe notation
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) %>%
tokens(remove_symbols = TRUE)
Tokens consisting of 4 documents.
doc1 :
[1] "A" "sentence" "," "showing"
[5] "how" "tokens" "(" ")"
[9] "works" "."
doc2 :
[1] "@" "quantedainit" "and"
[4] "#" "textanalysis" "https"
[7] ":" "/" "/"
[10] "example.com" "?" "p"
[ ... and 2 more ]
doc3 :
[1] "Self" "-" "documenting"
[4] "code" "?" "?"
doc4 :
[1] "1,000,000" "for" "50" "is"
[5] "gr8" "4ever"
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) %>%
tokens(remove_punct = TRUE)
Tokens consisting of 1 document.
doc3 :
[1] "s" "e" "l" "f" "d" "o" "c" "u" "m" "e" "n" "t"
[ ... and 7 more ]
tokenizers::tokenize_sentences(
"The quick brown fox. It jumped over the lazy dog.") %>%
tokens()
Tokens consisting of 1 document.
text1 :
[1] "The quick brown fox."
[2] "It jumped over the lazy dog."
Look carefully -- what did it do differently?
Let's make a fairly generic tokens object from our inaugural speeches corpus.
inaugural_tokens <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE, # default FALSE
remove_symbols = TRUE, # default FALSE
remove_numbers = FALSE,
remove_url = TRUE, # default FALSE
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE,
verbose = quanteda_options("verbose")
)
This produces a tokens class object. Expand the object in your RStudio Environment tab to take a look at it. Foremost, it's a list with one entry per document, consisting of a character vector of the document's tokens.
inaugural_tokens[["2017-Trump"]][1:30]
[1] "Chief" "Justice" "Roberts" "President"
[5] "Carter" "President" "Clinton" "President"
[9] "Bush" "President" "Obama" "fellow"
[13] "Americans" "and" "people" "of"
[17] "the" "world" "thank" "you"
[21] "We" "the" "citizens" "of"
[25] "America" "are" "now" "joined"
[29] "in" "a"
It also has a vector of the "types" -- the vocabulary of tokens in the whole corpus/object. This attribute can be accessed through the attr function.
attr(inaugural_tokens,"types")[1:30]
[1] "Fellow-Citizens" "of"
[3] "the" "Senate"
[5] "and" "House"
[7] "Representatives" "Among"
[9] "vicissitudes" "incident"
[11] "to" "life"
[13] "no" "event"
[15] "could" "have"
[17] "filled" "me"
[19] "with" "greater"
[21] "anxieties" "than"
[23] "that" "which"
[25] "notification" "was"
[27] "transmitted" "by"
[29] "your" "order"
length(attr(inaugural_tokens, "types"))
[1] 10147
Just over 10,000 unique tokens have been used. Notice that "the" appears third and never again. But ... "The" does:
which(attr(inaugural_tokens,"types")=="The")
[1] 339
Why are "the" and "The" different types? Why is "Fellow-Citizens" one type?
Under the hood, the tokens vector isn't a vector of strings. It's a vector of integers, indicating the index of each token in the type vector. So every time "the" appears, it is stored as the integer 3.
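If you're curious, you can peek at this internal representation by stripping the class (this is an implementation detail, so it may change across quanteda versions):
# integer codes for the first tokens of Washington's first inaugural
unclass(inaugural_tokens)[["1789-Washington"]][1:10]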
By default, the tokens object also retains all of the document metadata that came with the corpus.
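You can pull that metadata out with the docvars function:
head(docvars(inaugural_tokens))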
The tokens object also provides access to a variety of quanteda utilities. For example, a very helpful traditional qualitative tool is the Key Words in Context, or kwic, command:
kwic(inaugural_tokens, "humble", window=3)
Keyword-in-context with 13 matches.
[1789-Washington, 572] along with an | humble | anticipation of the
[1789-Washington, 1359] Human Race in | humble | supplication that since
[1797-Adams, 2123] age and with | humble | reverence I feel
[1801-Jefferson, 169] the contemplation and | humble | myself before the
[1821-Monroe, 173] favor of my | humble | pretensions the difficulties
[1825-Adams, 2902] I commit with | humble | but fearless confidence
[1829-Jackson, 85] dedication of my | humble | abilities to their
[1833-Jackson, 91] extent of my | humble | abilities in continued
[1853-Pierce, 3174] in the nation's | humble | acknowledged dependence upon
[1857-Buchanan, 1204] I feel an | humble | confidence that the
[1953-Eisenhower, 765] of the most | humble | and of the
[1997-Clinton, 586] a new century | humble | enough not to
[2009-Obama, 1760] we remember with | humble | gratitude those brave
kwic(inaugural_tokens, "tombstones", window=4)
Keyword-in-context with 1 match.
[2017-Trump, 456] rusted-out factories scattered like | tombstones | across the landscape of
Hmmm. Moving on.
Stemming is the truncation of words in an effort to associate related words with a common token, e.g., "baby" and "babies" -> "babi".
The tokenizers package provides a wrapper to the wordStem function from the SnowballC package, which applies a standard stemmer called the Porter stemmer. (The function takes as input a vector of texts or a corpus, and returns a list, with each element a vector of the stems for the corresponding text.)
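For instance, calling the stemmer directly on a couple of words (a quick check, assuming the SnowballC package is installed):
# Porter stemmer: related words map to the same stem
SnowballC::wordStem(c("baby", "babies"), language = "porter")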
tokenizers::tokenize_word_stems(corp)$`2017-Trump`[1:50]
[1] "chief" "justic" "robert" "presid"
[5] "carter" "presid" "clinton" "presid"
[9] "bush" "presid" "obama" "fellow"
[13] "american" "and" "peopl" "of"
[17] "the" "world" "thank" "you"
[21] "we" "the" "citizen" "of"
[25] "america" "are" "now" "join"
[29] "in" "a" "great" "nation"
[33] "effort" "to" "rebuild" "our"
[37] "countri" "and" "restor" "it"
[41] "promis" "for" "all" "of"
[45] "our" "peopl" "togeth" "we"
[49] "will" "determin"
Quanteda is focused largely on bag-of-words (or bag-of-tokens or bag-of-terms) models that work from a document-term matrix, where each row represents a document, each column represents a type (a "term" in the vocabulary), and the entries are the counts of tokens matching the term in the given document.
For this we will use quanteda's dfm command with some commonly chosen preprocessing options. In older versions of quanteda, the dfm function was applied to a corpus, with tokenizing and normalizing options applied there. It is now applied to a tokens object where most of that has already been done. Here, we'll add case-folding, merging "the" and "The", among other things, into a single type.
doc_term_matrix <- quanteda::dfm(inaugural_tokens,
tolower = TRUE # case-fold
)
What kind of object is doc_term_matrix?
class(doc_term_matrix)
[1] "dfm"
attr(,"package")
[1] "quanteda"
Typing the dfm's name will show an object summary. This is a matrix, so how many rows does it have? How many columns? What does "91.89% sparse" mean?
doc_term_matrix
Document-feature matrix of: 59 documents, 9,422 features (91.89% sparse) and 4 docvars.
docs fellow-citizens of the senate and house representatives among vicissitudes incident
1789-Washington 1 71 116 1 48 2 2 1 1 1
1793-Washington 0 11 13 0 2 0 0 0 0 0
1797-Adams 3 140 163 1 130 0 2 4 0 0
1801-Jefferson 2 104 130 0 81 0 0 1 0 0
1805-Jefferson 0 101 143 0 93 0 0 7 0 0
1809-Madison 1 69 104 0 43 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,412 more features ]
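That "91.89% sparse" is the share of cells in the matrix that are zero -- most terms never appear in most documents. You can verify it with quanteda's sparsity helper, or by hand:
sparsity(doc_term_matrix)
# same thing by hand: zero cells over total cells
sum(doc_term_matrix == 0) / (ndoc(doc_term_matrix) * nfeat(doc_term_matrix))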
You can peek inside it, indexing it like you would a matrix or Matrix object:
doc_term_matrix[1:5,1:5]
Document-feature matrix of: 5 documents, 5 features (20.00% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and
1789-Washington 1 71 116 1 48
1793-Washington 0 11 13 0 2
1797-Adams 3 140 163 1 130
1801-Jefferson 2 104 130 0 81
1805-Jefferson 0 101 143 0 93
What are the most frequent terms?
topfeatures(doc_term_matrix,40)
the of and to
10183 7180 5406 4591
in a our we
2827 2292 2224 1827
that be is it
1813 1502 1491 1398
for by have which
1230 1091 1031 1007
not with as will
980 970 966 944
this i all are
874 871 836 828
their but has people
761 670 631 584
from its government or
578 573 564 563
on my us been
544 515 505 496
can no they so
487 470 463 397
You can get the same thing by sorting the column sums of the dtm:
word_freq <- colSums(doc_term_matrix)
sort(word_freq,decreasing=TRUE)[1:40]
the of and to
10183 7180 5406 4591
in a our we
2827 2292 2224 1827
that be is it
1813 1502 1491 1398
for by have which
1230 1091 1031 1007
not with as will
980 970 966 944
this i all are
874 871 836 828
their but has people
761 670 631 584
from its government or
578 573 564 563
on my us been
544 515 505 496
can no they so
487 470 463 397
For some purposes, you may wish to remove "stopwords." There are stopword lists accessible through the stopwords function, exported from the automatically loaded stopwords package. The default is English from the Snowball collection. Get a list of sources with stopwords_getsources() and a list of languages available for a given source with stopwords_getlanguages().
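For example:
stopwords_getsources() # sources available in the stopwords package
stopwords_getlanguages('snowball') # languages covered by one source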
The default English list is fairly short.
stopwords('en')[1:10] #Snowball
[1] "i" "me" "my" "myself"
[5] "we" "our" "ours" "ourselves"
[9] "you" "your"
length(stopwords('en'))
[1] 175
This one's more than three times as long.
stopwords('en', source='smart')[1:10]
[1] "a" "a's" "able"
[4] "about" "above" "according"
[7] "accordingly" "across" "actually"
[10] "after"
length(stopwords('en', source='smart'))
[1] 571
This one's more than seven times as long and is ... interesting:
stopwords('en', source='stopwords-iso')[1:10]
[1] "'ll" "'tis" "'twas" "'ve"
[5] "10" "39" "a" "a's"
[9] "able" "ableabout"
length(stopwords('en', source='stopwords-iso'))
[1] 1298
The beginning of a German list.
stopwords('de')[1:10]
[1] "aber" "alle" "allem" "allen" "aller" "alles"
[7] "als" "also" "am" "an"
A slice from an Ancient Greek list:
stopwords('grc',source='ancient')[264:288]
[1] "xxx" "xxxi" "xxxii" "xxxiii"
[5] "xxxiv" "xxxix" "xxxv" "xxxvi"
[9] "xxxvii" "xxxviii" "y" "z"
[13] "α" "ἅ" "ἃ" "ᾇ"
[17] "ἄγαν" "ἄγε" "ἄγχι" "ἀγχοῦ"
[21] "ἁγώ" "ἁγὼ" "ἅγωγ" "ἁγών"
[25] "ἁγὼν"
Let's case-fold our tokens object to lowercase, remove the stopwords, then make a new dtm and see how it's different.
inaugural_tokens.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.nostop <- dfm(inaugural_tokens.nostop)
dtm.nostop
Document-feature matrix of: 59 documents, 9,284 features (92.70% sparse) and 4 docvars.
docs fellow-citizens senate house representatives among vicissitudes incident life event filled
1789-Washington 1 1 2 2 1 1 1 1 2 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 3 1 0 2 4 0 0 2 0 0
1801-Jefferson 2 0 0 0 1 0 0 1 0 0
1805-Jefferson 0 0 0 0 7 0 0 2 0 0
1809-Madison 1 0 0 0 0 0 0 1 0 1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,274 more features ]
We've got about 140 fewer features, and the matrix is slightly more sparse. Why?
What are the most frequent tokens now?
topfeatures(dtm.nostop,40)
people government us can
584 564 505 487
must upon great may
376 371 344 343
states world shall country
334 319 316 308
nation every one peace
305 300 267 258
new power now public
250 241 229 225
time citizens constitution united
220 209 209 203
america nations union freedom
202 199 190 185
free war american let
183 181 172 160
national made good make
158 156 149 147
years justice men without
143 142 140 140
I'm just curious. Besides "tombstones," what other words made their inaugural debut in 2017?
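# TRUE for terms whose total count across all addresses equals their count in 2017-Trump, i.e., terms appearing only in that document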
unique_to_trump <- as.vector(colSums(doc_term_matrix) == doc_term_matrix["2017-Trump",])
colnames(doc_term_matrix)[unique_to_trump]
[1] "obama" "hardships"
[3] "lady" "michelle"
[5] "transferring" "dc"
[7] "reaped" "politicians"
[9] "2017" "listening"
[11] "likes" "neighborhoods"
[13] "trapped" "rusted-out"
[15] "tombstones" "landscape"
[17] "flush" "stolen"
[19] "robbed" "unrealized"
[21] "carnage" "stops"
[23] "we've" "subsidized"
[25] "allowing" "sad"
[27] "depletion" "trillions"
[29] "overseas" "infrastructure"
[31] "disrepair" "ripped"
[33] "redistributed" "issuing"
[35] "ravages" "stealing"
[37] "tunnels" "hire"
[39] "goodwill" "shine"
[41] "reinforce" "islamic"
[43] "bedrock" "disagreements"
[45] "solidarity" "unstoppable"
[47] "complaining" "arrives"
[49] "mysteries" "brown"
[51] "bleed" "sprawl"
[53] "windswept" "nebraska"
[55] "ignored"
OK!
We can also change the settings. What happens if we don't remove punctuation?
inaugural_tokens.wpunct <- quanteda::tokens(corp,
what = "word",
remove_punct = FALSE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm.wpunct <- dfm(inaugural_tokens.wpunct)
dtm.wpunct
Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
docs fellow-citizens senate house representatives : among vicissitudes incident life event
1789-Washington 1 1 2 2 1 1 1 1 1 2
1793-Washington 0 0 0 0 1 0 0 0 0 0
1797-Adams 3 1 0 2 0 4 0 0 2 0
1801-Jefferson 2 0 0 0 1 1 0 0 1 0
1805-Jefferson 0 0 0 0 0 7 0 0 2 0
1809-Madison 1 0 0 0 0 0 0 0 1 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]
topfeatures(dtm.wpunct,40)
, . people ;
7173 5155 584 565
government us can -
564 505 487 403
must upon great may
376 371 344 343
states world shall country
334 319 316 308
nation every one peace
305 300 267 258
" new power now
256 250 241 229
public time citizens constitution
225 220 209 209
united america nations union
203 202 199 190
freedom free war american
185 183 181 172
let national made good
160 158 156 149
How big is it now? How sparse is it now?
What happens if we lower case and stem?
inaugural_tokens.stems <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_wordstem()
dtm.stems <- dfm(inaugural_tokens.stems)
dtm.stems
Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
docs fellow-citizen senat hous repres among vicissitud incid life event fill
1789-Washington 1 1 2 2 1 1 1 1 2 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 3 1 3 3 4 0 0 2 0 0
1801-Jefferson 2 0 0 1 1 0 0 1 0 0
1805-Jefferson 0 0 0 0 7 0 0 2 1 0
1809-Madison 1 0 0 1 0 1 0 1 0 1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
topfeatures(dtm.stems,40)
nation govern peopl us can
691 657 632 505 487
state great must power upon
452 378 376 375 371
countri world may shall everi
359 347 343 316 300
constitut peac one right time
289 288 279 279 271
law citizen american new america
271 265 257 250 242
public now unit duti war
229 229 225 212 204
make interest union freedom free
202 197 190 190 184
secur hope year good let
178 176 176 163 160
It's somewhat difficult to get your head around these sorts of things, but there are statistical regularities here. For example, these frequencies tend to be distributed according to "Zipf's law" and a (related) "power law."
plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE), main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank")
That makes the "long tail" clear. The grand relationship becomes clearer on a logarithmic scale:
plot(1:ncol(doc_term_matrix), sort(colSums(doc_term_matrix), decreasing = TRUE), main = "Zipf's Law?", ylab = "Frequency", xlab = "Frequency Rank", log = "xy")
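If Zipf's law held exactly, these points would fall on a straight line with a slope near -1. A quick regression on the logged values gives a rough estimate of that slope (a casual check, not a rigorous power-law fit):
freq_sorted <- sort(colSums(doc_term_matrix), decreasing = TRUE)
zipf_fit <- lm(log(freq_sorted) ~ log(seq_along(freq_sorted)))
coef(zipf_fit) # a slope near -1 is consistent with Zipf's law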
For the power law, we need the number of words that appear at any given frequency. We'll turn word_freq into a categorical variable by making it a "factor" -- the categories are "1", "2", ... "17", etc. -- and then use summary to give us counts of each "category." (The maxsum option is used to be sure it doesn't stop at 100 categories and lump everything else together as "Other".)
words_with_freq <- summary(as.factor(word_freq),maxsum=10000)
freq_bin <- as.integer(names(words_with_freq))
plot(freq_bin, words_with_freq, main="Power Law?", xlab="Word Frequency", ylab="Number of Words", log="xy")
Zipf's law implies that, in a new corpus, say, a small number of terms will be very common (we'll know a lot about them, but they won't help us distinguish documents), a large number of terms will be very rare (we'll know very little about them), and some terms will not have been seen before at all. This "out-of-vocabulary" (OOV) problem is an important one in some applications.
Let's go back to preprocessing choices. What happens if we count bigrams? Let's first do it without removing stopwords.
inaugural_tokens.2grams <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_ngrams(n=2)
dtm.2grams <- dfm(inaugural_tokens.2grams)
dtm.2grams
Document-feature matrix of: 59 documents, 65,497 features (97.10% sparse) and 4 docvars.
docs fellow-citizens_of of_the the_senate senate_and and_of the_house house_of of_representatives representatives_among among_the
1789-Washington 1 20 1 1 2 2 2 2 1 1
1793-Washington 0 4 0 0 1 0 0 0 0 0
1797-Adams 0 29 0 0 2 0 0 0 0 2
1801-Jefferson 0 28 0 0 3 0 0 0 0 0
1805-Jefferson 0 17 0 0 1 0 0 0 0 1
1809-Madison 0 20 0 0 2 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 65,487 more features ]
topfeatures(dtm.2grams,40)
of_the in_the to_the
1772 821 727
of_our and_the it_is
628 474 324
by_the for_the to_be
322 315 313
the_people we_have of_a
270 264 262
with_the that_the the_world
240 221 214
have_been will_be has_been
205 199 185
on_the is_the we_are
183 183 178
from_the and_to the_government
168 166 165
the_united united_states all_the
164 158 157
in_our the_constitution we_will
156 156 156
of_all of_this of_their
153 142 141
should_be in_a and_in
139 138 132
those_who we_must of_my
129 128 126
may_be
126
How big is it? How sparse? It doesn't give us a lot of sense of content, but it does offer some rudimentary insights into how English is structured.
For example, we can create a rudimentary statistical language model that "predicts" the next word based on bigram frequencies. We estimate the conditional probability of each candidate next word by taking the frequency of the bigrams that start with the current word and dividing by the total frequency of those bigrams (see Jurafsky and Martin, Speech and Language Processing, Chapter 3: https://web.stanford.edu/~jurafsky/slp3/ for more detail / nuance).
If the current word is "american" what is probably next, in this corpus?
First we find the right bigrams using a regular expression. See the regular expressions notebook for more detail if that is unfamiliar.
american_bigrams <- grep("^american_",colnames(dtm.2grams),value=TRUE)
american_bigrams
[1] "american_covenant"
[2] "american_lives"
[3] "american_treasure"
[4] "american_belief"
[5] "american_heart"
[6] "american_spirit"
[7] "american_people"
[8] "american_business"
[9] "american_to"
[10] "american_policy"
[11] "american_citizen"
[12] "american_democracy"
[13] "american_auspices"
[14] "american_instinct"
[15] "american_interests"
[16] "american_freemen"
[17] "american_industries"
[18] "american_labor"
[19] "american_life"
[20] "american_citizens"
[21] "american_dream"
[22] "american_if"
[23] "american_name"
[24] "american_citizenship"
[25] "american_rights"
[26] "american_flag"
[27] "american_above"
[28] "american_merchant"
[29] "american_navy"
[30] "american_market"
[31] "american_family"
[32] "american_products"
[33] "american_arms"
[34] "american_states"
[35] "american_that"
[36] "american_he"
[37] "american_steamship"
[38] "american_control"
[39] "american_failure"
[40] "american_experiment"
[41] "american_story"
[42] "american_sovereignty"
[43] "american_renewal"
[44] "american_on"
[45] "american_enjoys"
[46] "american_revolution"
[47] "american_must"
[48] "american_slavery"
[49] "american_sense"
[50] "american_standards"
[51] "american_manliness"
[52] "american_bottoms"
[53] "american_childhood"
[54] "american_character"
[55] "american_achievement"
[56] "american_century"
[57] "american_we"
[58] "american_here"
[59] "american_in"
[60] "american_promise"
[61] "american_today"
[62] "american_conscience"
[63] "american_carnage"
[64] "american_industry"
[65] "american_workers"
[66] "american_way"
[67] "american_families"
[68] "american_hands"
[69] "american_and"
[70] "american_statesmen"
[71] "american_ideals"
[72] "american_statesmanship"
[73] "american_freedom"
[74] "american_ideal"
[75] "american_a"
[76] "american_destiny"
[77] "american_history"
[78] "american_who"
[79] "american_emancipation"
[80] "american_i"
[81] "american_opportunity"
[82] "american_subjects"
[83] "american_is"
[84] "american_anthem"
[85] "american_political"
[86] "american_soldiers"
[87] "american_sound"
[88] "american_she"
[89] "american_being"
Most likely bigrams starting with "american":
freq_american_bigrams <- colSums(dtm.2grams[,american_bigrams])
most_likely_bigrams <- sort(freq_american_bigrams/sum(freq_american_bigrams),dec=TRUE)[1:10]
most_likely_bigrams
american_people american_citizen
0.23255814 0.03488372
american_story american_dream
0.03488372 0.02906977
american_to american_citizens
0.02325581 0.02325581
american_citizenship american_policy
0.02325581 0.01744186
american_labor american_that
0.01744186 0.01744186
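The same logic generalizes to any starting word. Here is a minimal sketch of a next-word "predictor" built on these bigram counts (next_word_probs is our own throwaway helper, not a quanteda function):
next_word_probs <- function(word, dtm) {
  # bigram features that begin with the given word
  bigrams <- grep(paste0("^", word, "_"), colnames(dtm), value = TRUE)
  counts <- colSums(dtm[, bigrams])
  sort(counts / sum(counts), decreasing = TRUE)
}
head(next_word_probs("people", dtm.2grams), 5)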
Let's see what happens if we remove the stopwords first.
inaugural_tokens.2grams.nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en')) %>%
tokens_ngrams(n=2)
dtm.2grams.nostop <- dfm(inaugural_tokens.2grams.nostop)
dtm.2grams.nostop
Document-feature matrix of: 59 documents, 57,723 features (98.12% sparse) and 4 docvars.
docs fellow-citizens_senate senate_house house_representatives representatives_among among_vicissitudes vicissitudes_incident incident_life life_event event_filled filled_greater
1789-Washington 1 1 2 1 1 1 1 1 1 1
1793-Washington 0 0 0 0 0 0 0 0 0 0
1797-Adams 0 0 0 0 0 0 0 0 0 0
1801-Jefferson 0 0 0 0 0 0 0 0 0 0
1805-Jefferson 0 0 0 0 0 0 0 0 0 0
1809-Madison 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 57,713 more features ]
topfeatures(dtm.2grams.nostop,40)
united_states let_us
158 105
fellow_citizens american_people
78 40
federal_government men_women
32 28
years_ago four_years
27 26
upon_us general_government
25 25
one_another government_can
22 20
every_citizen constitution_united
20 20
fellow_americans vice_president
19 18
great_nation among_nations
17 17
government_people god_bless
17 16
people_can people_united
15 15
foreign_nations may_well
15 15
almighty_god peace_world
15 14
form_government chief_justice
14 14
nations_world among_people
14 14
national_life every_american
13 13
free_people can_never
13 12
administration_government people_world
12 12
within_limits constitution_laws
12 12
public_debt one_nation
12 11
How big is it? How sparse? It gives some interesting content -- "great_nation", "almighty_god", "public_debt" -- but also some confusing constructions, e.g., "people_world", which is really things like "people of the world": the bigrams jump the gaps left by removed stopwords.
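One way to avoid those false adjacencies is to leave a placeholder ("pad") wherever a stopword was removed, so that n-grams are not formed across the gaps -- a sketch:
inaugural_tokens.2grams.padded <- inaugural_tokens %>%
  tokens_tolower() %>%
  tokens_remove(stopwords('en'), padding = TRUE) %>% # keep empty pads
  tokens_ngrams(n = 2) # n-grams will not span the pads
dfm(inaugural_tokens.2grams.padded)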
Ugh, well, yes, if you must. Wordclouds are an abomination -- I'll rant about that at a later date -- but here's Trump's first inaugural in a wordcloud ...
library(quanteda.textplots)
set.seed(100)
textplot_wordcloud(dtm.nostop["2017-Trump",], min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
Save a copy of the notebook and use it to answer the questions below. Those labeled "Challenge" require more than is demonstrated above.
1) Use the inaugural_tokens.nostop object. Define a word's "context" as a window of five words/tokens before and after a word's usage. In what contexts does the word "Roman" appear in this corpus?
2) Using dtm.wpunct, which president used the most exclamation points in his inaugural address?
3) Use dtm.nostop for these questions.
a) Do any terms appear only in the document containing Abraham Lincoln's first inaugural address?
b) Challenge: How many terms appeared first in Abraham Lincoln's first inaugural address?
c) How many times has the word "slave" been used in inaugural addresses?
d) Challenge: How many times has a word that included "slave" (like "slavery" or "enslaved") been used in inaugural addresses?
4) Construct a dtm of trigrams (lower case, not stemmed, no stop words removed).
a) How big is the matrix? How sparse is it?
b) What are the 50 most frequent trigrams?
c) Challenge: How many trigrams appear only once?
5) Tokenize the following string of tweets using the built-in word tokenizer, the tokenize_words tokenizer from the tokenizers package, and the tokenize_tweets tokenizer from the tokenizers package, and explain what's different.
https://t.co/9z2J3P33Uc FB needs to hurry up and add a laugh/cry button 😬😭😓🤢🙄😱 Since eating my feelings has not fixed the world's problems, I guess I'll try to sleep... HOLY CRAP: DeVos questionnaire appears to include passages from uncited sources https://t.co/FNRoOlfw9s well played, Senator Murray Keep the pressure on: https://t.co/4hfOsmdk0l @datageneral thx Mr Taussig It's interesting how many people contact me about applying for a PhD and don't spell my name right.