Updated September 2021
This is a brief look at how document similarity, especially cosine similarity, is calculated, how it can be used to compare documents, and the impact of term weighting procedures, including tf-idf.
Within quanteda, the dfm_weight and dfm_tfidf commands provide easy access to various weighting schemes. Within the quanteda ecosystem, the quanteda.textstats package provides easy access to efficient calculation of similarities with its dfm objects. Both of these maintain the computational advantages of the underlying sparse matrix objects. We will also calculate some of these less efficiently "by hand" using dense matrices to show the component steps.
It is worth noting that the quanteda.textstats package provides a variety of other statistics, such as Flesch readability scores, which I don't discuss here but which you may find of use.
To keep things simple, we'll again use the inaugural speech corpus that comes with quanteda.
If you need to, run this chunk to install quanteda and quanteda.textstats:
# install.packages("quanteda", dependencies = TRUE)
# install.packages("quanteda.textstats", dependencies = TRUE)
Now load them:
library(quanteda)
library(quanteda.textstats)
Let's again load the corpus of presidential inaugural addresses and see what it looks like:
corp <- quanteda::data_corpus_inaugural
summary(corp)
Corpus consisting of 59 documents, showing 59 documents:
Text Types Tokens Sentences Year
1789-Washington 625 1537 23 1789
1793-Washington 96 147 4 1793
1797-Adams 826 2577 37 1797
1801-Jefferson 717 1923 41 1801
1805-Jefferson 804 2380 45 1805
1809-Madison 535 1261 21 1809
1813-Madison 541 1302 33 1813
1817-Monroe 1040 3677 121 1817
1821-Monroe 1259 4886 131 1821
1825-Adams 1003 3147 74 1825
1829-Jackson 517 1208 25 1829
1833-Jackson 499 1267 29 1833
1837-VanBuren 1315 4158 95 1837
1841-Harrison 1898 9123 210 1841
1845-Polk 1334 5186 153 1845
1849-Taylor 496 1178 22 1849
1853-Pierce 1165 3636 104 1853
1857-Buchanan 945 3083 89 1857
1861-Lincoln 1075 3999 135 1861
1865-Lincoln 360 775 26 1865
1869-Grant 485 1229 40 1869
1873-Grant 552 1472 43 1873
1877-Hayes 831 2707 59 1877
1881-Garfield 1021 3209 111 1881
1885-Cleveland 676 1816 44 1885
1889-Harrison 1352 4721 157 1889
1893-Cleveland 821 2125 58 1893
1897-McKinley 1232 4353 130 1897
1901-McKinley 854 2437 100 1901
1905-Roosevelt 404 1079 33 1905
1909-Taft 1437 5821 158 1909
1913-Wilson 658 1882 68 1913
1917-Wilson 549 1652 59 1917
1921-Harding 1169 3719 148 1921
1925-Coolidge 1220 4440 196 1925
1929-Hoover 1090 3860 158 1929
1933-Roosevelt 743 2057 85 1933
1937-Roosevelt 725 1989 96 1937
1941-Roosevelt 526 1519 68 1941
1945-Roosevelt 275 633 27 1945
1949-Truman 781 2504 116 1949
1953-Eisenhower 900 2743 119 1953
1957-Eisenhower 621 1907 92 1957
1961-Kennedy 566 1541 52 1961
1965-Johnson 568 1710 93 1965
1969-Nixon 743 2416 103 1969
1973-Nixon 544 1995 68 1973
1977-Carter 527 1369 52 1977
1981-Reagan 902 2780 129 1981
1985-Reagan 925 2909 123 1985
1989-Bush 795 2673 141 1989
1993-Clinton 642 1833 81 1993
1997-Clinton 773 2436 111 1997
2001-Bush 621 1806 97 2001
2005-Bush 772 2312 99 2005
2009-Obama 938 2689 110 2009
2013-Obama 814 2317 88 2013
2017-Trump 582 1660 88 2017
2021-Biden.txt 811 2766 216 2021
President FirstName Party
Washington George none
Washington George none
Adams John Federalist
Jefferson Thomas Democratic-Republican
Jefferson Thomas Democratic-Republican
Madison James Democratic-Republican
Madison James Democratic-Republican
Monroe James Democratic-Republican
Monroe James Democratic-Republican
Adams John Quincy Democratic-Republican
Jackson Andrew Democratic
Jackson Andrew Democratic
Van Buren Martin Democratic
Harrison William Henry Whig
Polk James Knox Whig
Taylor Zachary Whig
Pierce Franklin Democratic
Buchanan James Democratic
Lincoln Abraham Republican
Lincoln Abraham Republican
Grant Ulysses S. Republican
Grant Ulysses S. Republican
Hayes Rutherford B. Republican
Garfield James A. Republican
Cleveland Grover Democratic
Harrison Benjamin Republican
Cleveland Grover Democratic
McKinley William Republican
McKinley William Republican
Roosevelt Theodore Republican
Taft William Howard Republican
Wilson Woodrow Democratic
Wilson Woodrow Democratic
Harding Warren G. Republican
Coolidge Calvin Republican
Hoover Herbert Republican
Roosevelt Franklin D. Democratic
Roosevelt Franklin D. Democratic
Roosevelt Franklin D. Democratic
Roosevelt Franklin D. Democratic
Truman Harry S. Democratic
Eisenhower Dwight D. Republican
Eisenhower Dwight D. Republican
Kennedy John F. Democratic
Johnson Lyndon Baines Democratic
Nixon Richard Milhous Republican
Nixon Richard Milhous Republican
Carter Jimmy Democratic
Reagan Ronald Republican
Reagan Ronald Republican
Bush George Republican
Clinton Bill Democratic
Clinton Bill Democratic
Bush George W. Republican
Bush George W. Republican
Obama Barack Democratic
Obama Barack Democratic
Trump Donald J. Republican
Biden Joseph R. Democratic
As before, we can use quanteda's tokens command to tokenize and the dfm command to generate a document-term matrix from this tokens object, e.g.:
inaugural_tokens <- quanteda::tokens(corp,
what = "word",
remove_punct = TRUE, # default FALSE
remove_symbols = TRUE, # default FALSE
remove_numbers = FALSE,
remove_url = TRUE, # default FALSE
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE,
verbose = quanteda_options("verbose")
)
dtm <- quanteda::dfm(inaugural_tokens,
tolower = TRUE # casefold
)
The most frequent terms can be seen a couple of ways. One is the textstat_frequency command, which calculates overall (i.e., corpus) frequency and also shows document frequency (the number of documents in which a term appears).
dtm %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 the 10183 1 59 all
2 of 7180 2 59 all
3 and 5406 3 59 all
4 to 4591 4 59 all
5 in 2827 5 59 all
6 a 2292 6 58 all
7 our 2224 7 58 all
8 we 1827 8 58 all
9 that 1813 9 59 all
10 be 1502 10 59 all
The most frequent terms are essentially the "stop words." Most of these terms are used in every inaugural, although there are exceptions ("a", "our", and "we" are left out of one each).
You can, of course, remove words on a stopwords list and get a more substantive-sounding list of most frequent terms:
inaugural_tokens_nostop <- inaugural_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('en'))
dtm_nostop <- dfm(inaugural_tokens_nostop)
dtm_nostop %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 people 584 1 57 all
2 government 564 2 52 all
3 us 505 3 56 all
4 can 487 4 56 all
5 must 376 5 52 all
6 upon 371 6 47 all
7 great 344 7 56 all
8 may 343 8 54 all
9 states 334 9 47 all
10 world 319 10 53 all
The variable dtm, as calculated above, represents each document as a row of the document-term matrix, with values equal to the raw count of the term indexed by each column.
For example, the 601st through 610th entries in the representation of Obama's 2009 inaugural (document 56) are:
dtm[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
features
docs endeavor express high sense entertain
2009-Obama 0 0 1 0 0
features
docs honor reposed america previous
2009-Obama 1 0 8 0
features
docs execution
2009-Obama 0
You may, for some purposes, wish to represent a document just by the presence or absence of terms. This is sometimes called a document-incidence matrix, and it often makes sense for a corpus of short documents, like tweets.
dim <- dfm_weight(dtm, scheme="boolean")
dim[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
features
docs endeavor express high sense entertain
2009-Obama 0 0 1 0 0
features
docs honor reposed america previous
2009-Obama 1 0 1 0
features
docs execution
2009-Obama 0
Here, the corpus frequency and document frequency of a given term will be the same.
dim %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 of 59 1 59 all
2 the 59 1 59 all
3 and 59 1 59 all
4 to 59 1 59 all
5 have 59 1 59 all
6 that 59 1 59 all
7 by 59 1 59 all
8 in 59 1 59 all
9 it 59 1 59 all
10 be 59 1 59 all
You can see that the entry for "america" is now simply "1", indicating that the term is present at least once, but not keeping the information that it appears eight times in the speech.
We can think of both the original document-frequency matrix and the document-incidence matrix as "weighted" representations, in which terms are weighted by frequency in the former case, and equally in the latter.
In the document-frequency matrix, the most heavily weighted terms are obviously the most common terms:
dtm %>%
dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 the 135 1 1 all
2 and 111 2 1 all
3 of 82 3 1 all
4 to 70 4 1 all
5 our 67 5 1 all
6 we 62 6 1 all
7 that 49 7 1 all
8 a 47 8 1 all
9 is 36 9 1 all
10 in 25 10 1 all
In the document-incidence matrix, all terms are weighted equally and the "most frequent" terms list is essentially random (in this case it is the order in which the terms are encountered in the corpus):
dim %>%
dfm_subset(subset= docid(dim)=="2009-Obama") %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 of 1 1 1 all
2 the 1 1 1 all
3 and 1 1 1 all
4 to 1 1 1 all
5 life 1 1 1 all
6 no 1 1 1 all
7 could 1 1 1 all
8 have 1 1 1 all
9 filled 1 1 1 all
10 with 1 1 1 all
For many purposes, it makes sense to transform count data to logged count data. This can also be done with the dfm_weight command. Note that \(\log (f_{dw})\) is ill-defined where counts equal zero. Smoothing, or addition of a prior, solves this problem, but at the expense of creating a dense matrix. A common approach that maintains sparsity is to calculate either \(\log(1+f_{dw})\), which equals zero when the count equals zero, or \(1+\log(f_{dw})\) for non-zero values only, leaving zero counts as zeros. The "logcount" weighting scheme in the dfm_weight command does the latter. (Note that if a base isn't specified, it assumes base 10, whereas the generic log function assumes the natural logarithm, i.e., base=exp(1).)
dtm %>%
dfm_weight(scheme="logcount",base=exp(1)) %>%
dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
textstat_frequency() %>%
head(10)
feature frequency rank docfreq group
1 the 5.905275 1 1 all
2 and 5.709530 2 1 all
3 of 5.406719 3 1 all
4 to 5.248495 4 1 all
5 our 5.204693 5 1 all
6 we 5.127134 6 1 all
7 that 4.891820 7 1 all
8 a 4.850148 8 1 all
9 is 4.583519 9 1 all
10 in 4.218876 10 1 all
The most commonly used weighting scheme is tf-idf. The general idea is to take a term frequency or logged term frequency and downweight that according to (logged) document frequency. The intuition is that the most important words are those that are used a lot in a given document but relatively rare in the corpus overall.
Note that many sources allow for a wide variety of interpretations of both "tf" and "idf." In particular, Manning & Schütze are inconsistent about the "default" for "tf," stating \(\log(1+f_{dw})\) in some instances and \(1+\log(f_{dw})\) (for \(f_{dw} > 0\)) in others. The most common implementation of "idf" is \(\log\frac{D}{d_w}\), where \(d_w\) is the number of documents in which word \(w\) appears, and \(D\) is the total number of documents. These implementations maintain or even increase sparsity: zero counts remain zeros, and new zeros appear for any words that occur in all documents.
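To make those definitions concrete, the default dfm_tfidf weighting can be reproduced by hand on a dense matrix. This is a minimal sketch, assuming the dtm object created above and quanteda's documented defaults (raw counts for "tf," base-10 inverse document frequency); dfm_tfidf itself stays sparse and is what you would use in practice:

```r
m <- as.matrix(dtm)                     # dense counts f_dw (fine at this corpus size)
d_w <- colSums(m > 0)                   # d_w: number of documents containing each word
idf <- log10(nrow(m) / d_w)             # base-10 idf, quanteda's default
tfidf_by_hand <- sweep(m, 2, idf, "*")  # tf (raw count) times idf, column by column
# should agree with the sparse implementation, entry by entry:
all.equal(tfidf_by_hand, as.matrix(dfm_tfidf(dtm)), check.attributes = FALSE)
```

Note that any word appearing in all 59 documents gets idf = log10(59/59) = 0, which is why new zeros can appear.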
The little run from the Obama speech is then:
dtm.w <- dfm_tfidf(dtm)
dtm.w[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
features
docs endeavor express high sense
2009-Obama 0 0 0.1910684 0
features
docs entertain honor reposed america
2009-Obama 0 0.2523381 0 2.235923
features
docs previous execution
2009-Obama 0 0
The most heavily weighted words in the Obama speech become:
dtm.w %>%
dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
textstat_frequency(force=TRUE) %>%
head(10)
feature frequency rank docfreq group
1 icy 3.541704 1 1 all
2 jobs 2.978102 2 1 all
3 waters 2.939644 3 1 all
4 storms 2.939644 3 1 all
5 generation 2.702015 5 1 all
6 father 2.603286 6 1 all
7 journey 2.603286 6 1 all
8 traveled 2.587462 8 1 all
9 winter 2.587462 8 1 all
10 you 2.517024 10 1 all
(The force=TRUE argument is needed for quanteda to summarize "frequencies" that have been weighted.)
You may wish to access the weighted values more directly, without using the textstat_frequency command.
obama_tfidf <- dtm.w %>%
dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
as.numeric()
names(obama_tfidf) <- colnames(dtm)
sort(obama_tfidf,dec=T)[1:10]
icy jobs waters storms
3.541704 2.978102 2.939644 2.939644
generation father journey traveled
2.702015 2.603286 2.603286 2.587462
winter you
2.587462 2.517024
You can calculate cosine similarity efficiently with the textstat_simil command:
cos_dtm <- textstat_simil(dtm, method="cosine")
dim(cos_dtm)
[1] 59 59
This is a \(D \times D\) symmetric matrix.
You can find the most similar documents to a given document using the given row (or column) of the similarity matrix. For example, the most similar speeches to Kennedy's (1961) inaugural are as follows:
sort(cos_dtm[,"1961-Kennedy"],dec=T)
1961-Kennedy 1925-Coolidge 1953-Eisenhower
1.0000000 0.9341794 0.9335726
1969-Nixon 1937-Roosevelt 1997-Clinton
0.9234509 0.9220879 0.9205719
1957-Eisenhower 1889-Harrison 1933-Roosevelt
0.9176332 0.9167532 0.9163688
1837-VanBuren 1941-Roosevelt 1949-Truman
0.9162733 0.9161075 0.9155213
1981-Reagan 1985-Reagan 1929-Hoover
0.9154402 0.9153323 0.9150125
1921-Harding 1853-Pierce 1801-Jefferson
0.9139960 0.9132859 0.9124118
2009-Obama 1909-Taft 1857-Buchanan
0.9115104 0.9095213 0.9084186
1905-Roosevelt 1897-McKinley 1817-Monroe
0.9052003 0.9046669 0.9040922
2005-Bush 1845-Polk 1877-Hayes
0.9033540 0.9029782 0.9029498
1913-Wilson 1917-Wilson 1893-Cleveland
0.9017107 0.9012049 0.9006467
1973-Nixon 1901-McKinley 1841-Harrison
0.9001845 0.8988335 0.8967638
1965-Johnson 1833-Jackson 1881-Garfield
0.8947204 0.8941043 0.8933200
1809-Madison 1873-Grant 1821-Monroe
0.8932276 0.8908578 0.8900598
1805-Jefferson 1829-Jackson 1813-Madison
0.8898084 0.8890661 0.8889099
1885-Cleveland 1825-Adams 2013-Obama
0.8870841 0.8866805 0.8865157
1989-Bush 1861-Lincoln 1797-Adams
0.8859851 0.8850230 0.8835635
2021-Biden.txt 1849-Taylor 1993-Clinton
0.8799674 0.8782968 0.8770482
1977-Carter 1869-Grant 1945-Roosevelt
0.8711636 0.8663567 0.8614835
1865-Lincoln 1789-Washington 2001-Bush
0.8591446 0.8559081 0.8481015
2017-Trump 1793-Washington
0.8316804 0.7518340
This doesn't make much sense. Let's see what happens if we use tf-idf.
cos_dtm.w <- textstat_simil(dtm.w, method="cosine")
sort(cos_dtm.w[,"1961-Kennedy"],dec=T)
1961-Kennedy 1981-Reagan 2005-Bush
1.00000000 0.13913142 0.12881338
2009-Obama 1993-Clinton 1973-Nixon
0.12204132 0.12106314 0.12010097
1957-Eisenhower 1953-Eisenhower 1989-Bush
0.11332337 0.11302037 0.11283880
1969-Nixon 1997-Clinton 1985-Reagan
0.11090511 0.10903508 0.10842288
2013-Obama 1965-Johnson 2001-Bush
0.10561653 0.10163961 0.10023038
2021-Biden.txt 1977-Carter 1925-Coolidge
0.09247814 0.08895975 0.08326778
1949-Truman 1853-Pierce 1881-Garfield
0.08180853 0.08041863 0.07808556
1929-Hoover 1801-Jefferson 2017-Trump
0.07759004 0.07758295 0.07561688
1841-Harrison 1933-Roosevelt 1941-Roosevelt
0.07446879 0.07191991 0.07089801
1905-Roosevelt 1921-Harding 1861-Lincoln
0.06938615 0.06522225 0.06379865
1837-VanBuren 1917-Wilson 1909-Taft
0.06353748 0.06256859 0.06071223
1857-Buchanan 1937-Roosevelt 1913-Wilson
0.06065495 0.05901388 0.05847482
1825-Adams 1877-Hayes 1817-Monroe
0.05746515 0.05724274 0.05602361
1805-Jefferson 1845-Polk 1821-Monroe
0.05581724 0.05576344 0.05554536
1889-Harrison 1897-McKinley 1789-Washington
0.05467709 0.05401196 0.05309571
1945-Roosevelt 1813-Madison 1865-Lincoln
0.05287510 0.05286289 0.05226755
1797-Adams 1885-Cleveland 1833-Jackson
0.04893401 0.04764906 0.04674641
1829-Jackson 1901-McKinley 1869-Grant
0.04560525 0.04557633 0.04063747
1873-Grant 1893-Cleveland 1849-Taylor
0.04021104 0.03742261 0.03704218
1809-Madison 1793-Washington
0.03101232 0.01493663
It's still not completely intuitive -- Kennedy and Reagan don't seem superficially to be similar -- but note that the similarities are greater with every post-war (WWII) president, except Trump, than with any pre-war president. We'll look in more detail at the calculation to determine what's going on.
Let's again calculate the cosine similarity between documents using just counts, but this time doing it step by step. First, let's make a regular matrix object out of our dtm. (It's easier to understand what's going on if we make this a "dense" matrix, but it's not something we would normally do.)
dtmat <- as.matrix(dtm)
Now let's "norm" the documents to length 1 using the \(L_2\) norm. The \(L_2\) norm is the square root of the sum of squares for each document -- the "length" of the document vector, or the "Euclidean distance" of its tip from the origin:
l2.dtmat <- sqrt(rowSums(dtmat^2))
Now divide the rows by the norm. The row sum of squares should now be one.
dtmat.l2normed <- sweep(dtmat,1,l2.dtmat,"/")
summary(rowSums(dtmat.l2normed^2))
To find the cosine similarity between any two, calculate the dot product of these vectors (multiply the two vectors element by element, and then sum those up).
cos.obama1.obama2 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2013-Obama",])
cos.obama1.obama2
[1] 0.9604997
cos.obama1.trump1 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2017-Trump",])
cos.obama1.trump1
[1] 0.9112379
To find the cosine similarity for all pairs, take the matrix crossproduct. (That is, calculate the dot product of every row / document with every other row / document -- this will result in a 59 \(\times\) 59 matrix.) This matrix has all ones on its diagonal -- why? This matrix is symmetric -- why?
cos.dtmat <- dtmat.l2normed %*% t(dtmat.l2normed)
dim(cos.dtmat)
[1] 59 59
sort(cos.dtmat[,"1961-Kennedy"],dec=T)
1961-Kennedy 1925-Coolidge 1953-Eisenhower
1.0000000 0.9341794 0.9335726
1969-Nixon 1937-Roosevelt 1997-Clinton
0.9234509 0.9220879 0.9205719
1957-Eisenhower 1889-Harrison 1933-Roosevelt
0.9176332 0.9167532 0.9163688
1837-VanBuren 1941-Roosevelt 1949-Truman
0.9162733 0.9161075 0.9155213
1981-Reagan 1985-Reagan 1929-Hoover
0.9154402 0.9153323 0.9150125
1921-Harding 1853-Pierce 1801-Jefferson
0.9139960 0.9132859 0.9124118
2009-Obama 1909-Taft 1857-Buchanan
0.9115104 0.9095213 0.9084186
1905-Roosevelt 1897-McKinley 1817-Monroe
0.9052003 0.9046669 0.9040922
2005-Bush 1845-Polk 1877-Hayes
0.9033540 0.9029782 0.9029498
1913-Wilson 1917-Wilson 1893-Cleveland
0.9017107 0.9012049 0.9006467
1973-Nixon 1901-McKinley 1841-Harrison
0.9001845 0.8988335 0.8967638
1965-Johnson 1833-Jackson 1881-Garfield
0.8947204 0.8941043 0.8933200
1809-Madison 1873-Grant 1821-Monroe
0.8932276 0.8908578 0.8900598
1805-Jefferson 1829-Jackson 1813-Madison
0.8898084 0.8890661 0.8889099
1885-Cleveland 1825-Adams 2013-Obama
0.8870841 0.8866805 0.8865157
1989-Bush 1861-Lincoln 1797-Adams
0.8859851 0.8850230 0.8835635
2021-Biden.txt 1849-Taylor 1993-Clinton
0.8799674 0.8782968 0.8770482
1977-Carter 1869-Grant 1945-Roosevelt
0.8711636 0.8663567 0.8614835
1865-Lincoln 1789-Washington 2001-Bush
0.8591446 0.8559081 0.8481015
2017-Trump 1793-Washington
0.8316804 0.7518340
As we should, we get the same answer for the speeches most similar to the Kennedy inaugural.
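To answer the two "why" questions above: the diagonal is all ones because each document vector, once normed to length 1, has a dot product of exactly 1 with itself; and the matrix is symmetric because the dot product is commutative, so cos(a, b) = cos(b, a). A quick check, using the cos.dtmat matrix computed above:

```r
all(abs(diag(cos.dtmat) - 1) < 1e-12)  # every document has similarity 1 with itself
isSymmetric(cos.dtmat)                 # cos(a, b) equals cos(b, a)
```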
Now let's focus on the strangeness. It looks like most inaugurals are relatively similar to one another (.8 + seems like a pretty high number) and it seems odd that Coolidge would be the most similar to Kennedy. Let's break down what words are contributing the most to the similarity rankings by looking at the contributions of individual words to the dot-product:
sort(dtmat.l2normed["1961-Kennedy",]*dtmat.l2normed["1925-Coolidge",], dec=T)[1:20]
the of and to
0.323609353 0.193983954 0.086301594 0.083692074
we a in that
0.038061512 0.032193695 0.026614224 0.018742411
not our but is
0.016709580 0.016651911 0.007684389 0.006559844
for be it are
0.006458923 0.005651558 0.005420882 0.004930696
can which all this
0.003373634 0.003128541 0.002984369 0.002941117
Ahhhh! The cosine similarity is being driven by the relative use of common words ... the, of, and, to, and so on. This is arguably what we want in some applications, like stylometry, where we are trying to guess authorship, for example, but it is almost certainly not what we're after here.
Let's look instead at the tf-idf cosine similarities.
dtmat.w <- as.matrix(dtm.w)
l2.dtmat.w <- sqrt(rowSums(dtmat.w^2))
dtmat.w.l2normed <- sweep(dtmat.w,1,l2.dtmat.w,"/")
cos.dtmat.w <- dtmat.w.l2normed %*% t(dtmat.w.l2normed)
dim(cos.dtmat.w)
[1] 59 59
sort(cos.dtmat.w[,"1961-Kennedy"],dec=T)
1961-Kennedy 1981-Reagan 2005-Bush
1.00000000 0.13913142 0.12881338
2009-Obama 1993-Clinton 1973-Nixon
0.12204132 0.12106314 0.12010097
1957-Eisenhower 1953-Eisenhower 1989-Bush
0.11332337 0.11302037 0.11283880
1969-Nixon 1997-Clinton 1985-Reagan
0.11090511 0.10903508 0.10842288
2013-Obama 1965-Johnson 2001-Bush
0.10561653 0.10163961 0.10023038
2021-Biden.txt 1977-Carter 1925-Coolidge
0.09247814 0.08895975 0.08326778
1949-Truman 1853-Pierce 1881-Garfield
0.08180853 0.08041863 0.07808556
1929-Hoover 1801-Jefferson 2017-Trump
0.07759004 0.07758295 0.07561688
1841-Harrison 1933-Roosevelt 1941-Roosevelt
0.07446879 0.07191991 0.07089801
1905-Roosevelt 1921-Harding 1861-Lincoln
0.06938615 0.06522225 0.06379865
1837-VanBuren 1917-Wilson 1909-Taft
0.06353748 0.06256859 0.06071223
1857-Buchanan 1937-Roosevelt 1913-Wilson
0.06065495 0.05901388 0.05847482
1825-Adams 1877-Hayes 1817-Monroe
0.05746515 0.05724274 0.05602361
1805-Jefferson 1845-Polk 1821-Monroe
0.05581724 0.05576344 0.05554536
1889-Harrison 1897-McKinley 1789-Washington
0.05467709 0.05401196 0.05309571
1945-Roosevelt 1813-Madison 1865-Lincoln
0.05287510 0.05286289 0.05226755
1797-Adams 1885-Cleveland 1833-Jackson
0.04893401 0.04764906 0.04674641
1829-Jackson 1901-McKinley 1869-Grant
0.04560525 0.04557633 0.04063747
1873-Grant 1893-Cleveland 1849-Taylor
0.04021104 0.03742261 0.03704218
1809-Madison 1793-Washington
0.03101232 0.01493663
Same answer as before. Similarities are lower, but they reflect similarity among distinctive content. That is, they are similar in what makes them different from the others. So, why is Reagan the most similar to Kennedy?
sort(dtmat.w.l2normed["1961-Kennedy",]*dtmat.w.l2normed["1981-Reagan",], dec=T)[1:20]
begin americans sides loyalty
0.008795749 0.007535292 0.006083563 0.004495064
deeds negotiate let you
0.003817306 0.003817306 0.003538579 0.003354278
mr help back prey
0.003273326 0.003020558 0.002930115 0.002463591
unleashed burden price your
0.002463591 0.002420332 0.001968381 0.001932147
cultural special vice freedom
0.001908653 0.001780228 0.001780228 0.001679748
This suggests they both framed their presidencies as an opportunity to "begin" something new, for example, and were relatively unusual in doing so.
kwic(inaugural_tokens,"begin", window=4)
Keyword-in-context with 12 matches.
[1953-Eisenhower, 5] My friends before I | begin | the expression of those
[1961-Kennedy, 665] request that both sides | begin | anew the quest for
[1961-Kennedy, 773] war So let us | begin | anew remembering on both
[1961-Kennedy, 1009] planet But let us | begin | In your hands my
[1981-Reagan, 395] we are going to | begin | to act beginning today
[1981-Reagan, 761] world So as we | begin | let us take inventory
[1981-Reagan, 1137] our command let us | begin | an era of national
[1985-Reagan, 1381] budget We can then | begin | reducing the national debt
[1993-Clinton, 1510] 21st Century let us | begin | anew with energy and
[2001-Bush, 37] new beginnings As I | begin | I thank President Clinton
[2009-Obama, 722] dust ourselves off and | begin | again the work of
[2009-Obama, 1376] between nations We will | begin | to responsibly leave Iraq
Finally, I will note that an alternative approach to tf-idf weighting is to represent each document by a vector of "keyness" statistics. The textstat_keyness command provides several, such as chi-squared and PMI (pointwise mutual information). I personally often use the "Fightin' Words" statistic (zeta), calculated with each document as a "group" to be compared to the others. In this example, this gives results that are correlated with tf-idf, but with some minor qualitative differences.
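As an illustration of the keyness approach, here is a sketch using textstat_keyness's default chi-squared measure, treating the 2009 Obama address as the "target" group and all other inaugurals as the reference. (It assumes the dtm object from above; the "Fightin' Words" zeta statistic mentioned here is not built into quanteda.textstats and would have to be computed separately.)

```r
library(quanteda)
library(quanteda.textstats)
# compare the 2009 Obama address (target) against all other inaugurals (reference)
key <- textstat_keyness(dtm, target = "2009-Obama", measure = "chi2")
head(key, 10)   # terms most distinctive of the 2009 address
```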