Updated September 2021

Introduction

This is a brief look at how document similarity, especially cosine similarity, is calculated, how it can be used to compare documents, and the impact of term weighting procedures, including tf-idf.

Within quanteda, the dfm_weight and dfm_tfidf commands provide easy access to various weighting schemes. Within the quanteda ecosystem, the quanteda.textstats package provides efficient calculation of similarities on its dfm objects. Both maintain the computational advantages of the underlying sparse matrix objects. We will also calculate some of these less efficiently "by hand" using dense matrices to show the component steps.

It is worth noting that the quanteda.textstats package provides a variety of other statistics, such as Flesch-Kincaid readability scores, which I don't discuss here but which you may find of use.

To keep things simple, we'll again use the inaugural speech corpus that comes with quanteda.

If you need to, run this chunk to install quanteda and quanteda.textstats:

# install.packages("quanteda", dependencies = TRUE)
# install.packages("quanteda.textstats", dependencies = TRUE)

Now load them:

library(quanteda)
library(quanteda.textstats)

Let's again load the corpus of presidential inaugural addresses and see what it looks like:

corp <- quanteda::data_corpus_inaugural
summary(corp)
Corpus consisting of 59 documents, showing 59 documents:

            Text Types Tokens Sentences Year
 1789-Washington   625   1537        23 1789
 1793-Washington    96    147         4 1793
      1797-Adams   826   2577        37 1797
  1801-Jefferson   717   1923        41 1801
  1805-Jefferson   804   2380        45 1805
    1809-Madison   535   1261        21 1809
    1813-Madison   541   1302        33 1813
     1817-Monroe  1040   3677       121 1817
     1821-Monroe  1259   4886       131 1821
      1825-Adams  1003   3147        74 1825
    1829-Jackson   517   1208        25 1829
    1833-Jackson   499   1267        29 1833
   1837-VanBuren  1315   4158        95 1837
   1841-Harrison  1898   9123       210 1841
       1845-Polk  1334   5186       153 1845
     1849-Taylor   496   1178        22 1849
     1853-Pierce  1165   3636       104 1853
   1857-Buchanan   945   3083        89 1857
    1861-Lincoln  1075   3999       135 1861
    1865-Lincoln   360    775        26 1865
      1869-Grant   485   1229        40 1869
      1873-Grant   552   1472        43 1873
      1877-Hayes   831   2707        59 1877
   1881-Garfield  1021   3209       111 1881
  1885-Cleveland   676   1816        44 1885
   1889-Harrison  1352   4721       157 1889
  1893-Cleveland   821   2125        58 1893
   1897-McKinley  1232   4353       130 1897
   1901-McKinley   854   2437       100 1901
  1905-Roosevelt   404   1079        33 1905
       1909-Taft  1437   5821       158 1909
     1913-Wilson   658   1882        68 1913
     1917-Wilson   549   1652        59 1917
    1921-Harding  1169   3719       148 1921
   1925-Coolidge  1220   4440       196 1925
     1929-Hoover  1090   3860       158 1929
  1933-Roosevelt   743   2057        85 1933
  1937-Roosevelt   725   1989        96 1937
  1941-Roosevelt   526   1519        68 1941
  1945-Roosevelt   275    633        27 1945
     1949-Truman   781   2504       116 1949
 1953-Eisenhower   900   2743       119 1953
 1957-Eisenhower   621   1907        92 1957
    1961-Kennedy   566   1541        52 1961
    1965-Johnson   568   1710        93 1965
      1969-Nixon   743   2416       103 1969
      1973-Nixon   544   1995        68 1973
     1977-Carter   527   1369        52 1977
     1981-Reagan   902   2780       129 1981
     1985-Reagan   925   2909       123 1985
       1989-Bush   795   2673       141 1989
    1993-Clinton   642   1833        81 1993
    1997-Clinton   773   2436       111 1997
       2001-Bush   621   1806        97 2001
       2005-Bush   772   2312        99 2005
      2009-Obama   938   2689       110 2009
      2013-Obama   814   2317        88 2013
      2017-Trump   582   1660        88 2017
  2021-Biden.txt   811   2766       216 2021
  President       FirstName                 Party
 Washington          George                  none
 Washington          George                  none
      Adams            John            Federalist
  Jefferson          Thomas Democratic-Republican
  Jefferson          Thomas Democratic-Republican
    Madison           James Democratic-Republican
    Madison           James Democratic-Republican
     Monroe           James Democratic-Republican
     Monroe           James Democratic-Republican
      Adams     John Quincy Democratic-Republican
    Jackson          Andrew            Democratic
    Jackson          Andrew            Democratic
  Van Buren          Martin            Democratic
   Harrison   William Henry                  Whig
       Polk      James Knox                  Whig
     Taylor         Zachary                  Whig
     Pierce        Franklin            Democratic
   Buchanan           James            Democratic
    Lincoln         Abraham            Republican
    Lincoln         Abraham            Republican
      Grant      Ulysses S.            Republican
      Grant      Ulysses S.            Republican
      Hayes   Rutherford B.            Republican
   Garfield        James A.            Republican
  Cleveland          Grover            Democratic
   Harrison        Benjamin            Republican
  Cleveland          Grover            Democratic
   McKinley         William            Republican
   McKinley         William            Republican
  Roosevelt        Theodore            Republican
       Taft  William Howard            Republican
     Wilson         Woodrow            Democratic
     Wilson         Woodrow            Democratic
    Harding       Warren G.            Republican
   Coolidge          Calvin            Republican
     Hoover         Herbert            Republican
  Roosevelt     Franklin D.            Democratic
  Roosevelt     Franklin D.            Democratic
  Roosevelt     Franklin D.            Democratic
  Roosevelt     Franklin D.            Democratic
     Truman        Harry S.            Democratic
 Eisenhower       Dwight D.            Republican
 Eisenhower       Dwight D.            Republican
    Kennedy         John F.            Democratic
    Johnson   Lyndon Baines            Democratic
      Nixon Richard Milhous            Republican
      Nixon Richard Milhous            Republican
     Carter           Jimmy            Democratic
     Reagan          Ronald            Republican
     Reagan          Ronald            Republican
       Bush          George            Republican
    Clinton            Bill            Democratic
    Clinton            Bill            Democratic
       Bush       George W.            Republican
       Bush       George W.            Republican
      Obama          Barack            Democratic
      Obama          Barack            Democratic
      Trump       Donald J.            Republican
      Biden       Joseph R.            Democratic

As before, we can use quanteda's tokens command to tokenize and the dfm command to generate a document-term matrix from this tokens object, e.g.:

inaugural_tokens <- quanteda::tokens(corp,
                       what = "word",
                       remove_punct = TRUE, # default FALSE
                       remove_symbols = TRUE, # default FALSE
                       remove_numbers = FALSE,
                       remove_url = TRUE, # default FALSE
                       remove_separators = TRUE,
                       split_hyphens = FALSE,
                       include_docvars = TRUE,
                       padding = FALSE,
                       verbose = quanteda_options("verbose")
                       )
dtm <- quanteda::dfm(inaugural_tokens,
                                 tolower = TRUE    # casefold
)

The most frequent terms can be seen a couple of ways. One is the textstat_frequency command, which calculates overall (i.e., corpus) frequency and also shows document frequency (the number of documents in which each term appears).

dtm %>%
  textstat_frequency() %>% 
  head(10)
   feature frequency rank docfreq group
1      the     10183    1      59   all
2       of      7180    2      59   all
3      and      5406    3      59   all
4       to      4591    4      59   all
5       in      2827    5      59   all
6        a      2292    6      58   all
7      our      2224    7      58   all
8       we      1827    8      58   all
9     that      1813    9      59   all
10      be      1502   10      59   all

The most frequent terms are essentially the "stop words." Most of these terms are used in every inaugural, although there are exceptions ("a", "our", and "we" are left out of one each).

You can, of course, remove words on a stopwords list and get a more substantive-sounding list of most frequent terms:

inaugural_tokens_nostop <- inaugural_tokens %>%
                            tokens_tolower() %>%
                          tokens_remove(stopwords('en'))
dtm_nostop <- dfm(inaugural_tokens_nostop)
dtm_nostop %>%
  textstat_frequency() %>% 
  head(10)
      feature frequency rank docfreq group
1      people       584    1      57   all
2  government       564    2      52   all
3          us       505    3      56   all
4         can       487    4      56   all
5        must       376    5      52   all
6        upon       371    6      47   all
7       great       344    7      56   all
8         may       343    8      54   all
9      states       334    9      47   all
10      world       319   10      53   all

Document representation and weighting

The variable dtm, as calculated above, represents each document as a row of the document-term matrix, with values equal to the raw count of the term indexed by each column.

For example, the 601st to 610th entries in the representation of Obama's 2009 inaugural (document 56) are:

dtm[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
            features
docs         endeavor express high sense entertain
  2009-Obama        0       0    1     0         0
            features
docs         honor reposed america previous
  2009-Obama     1       0       8        0
            features
docs         execution
  2009-Obama         0

Document-incidence matrix

You may, for some purposes, wish to represent a document simply by the presence or absence of terms. This is sometimes called a document-incidence matrix, and it often makes sense for a corpus of short documents, like tweets.

dim <- dfm_weight(dtm, scheme="boolean")
dim[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
            features
docs         endeavor express high sense entertain
  2009-Obama        0       0    1     0         0
            features
docs         honor reposed america previous
  2009-Obama     1       0       1        0
            features
docs         execution
  2009-Obama         0

Here, the corpus frequency and document frequency of a given term will be the same.

dim %>%
  textstat_frequency() %>%
  head(10)
   feature frequency rank docfreq group
1       of        59    1      59   all
2      the        59    1      59   all
3      and        59    1      59   all
4       to        59    1      59   all
5     have        59    1      59   all
6     that        59    1      59   all
7       by        59    1      59   all
8       in        59    1      59   all
9       it        59    1      59   all
10      be        59    1      59   all

You can see that the entry for "america" is now simply "1", indicating that the term is present at least once, but not keeping the information that it appears eight times in the speech.
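The same booleanization can be sketched in base R on a toy count matrix (hypothetical counts, not drawn from the corpus), which also shows why corpus frequency and document frequency coincide under incidence weighting:

```r
# Toy document-term count matrix (hypothetical counts, not from the corpus)
m <- matrix(c(8, 0, 1,
              2, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"),
                            c("america", "winter", "honor")))

# Boolean ("incidence") weighting: 1 if the term appears at all, 0 otherwise
inc <- (m > 0) * 1L

inc["doc1", "america"]  # the count of 8 becomes 1

# With incidence weights, a term's corpus frequency (column sum) is
# exactly its document frequency
colSums(inc)
```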

We can think of both the original document-frequency matrix and the document-incidence matrix as "weighted" representations, in which terms are weighted by frequency in the former case, and equally in the latter.

In the document-frequency matrix, the most heavily weighted terms are obviously the most common terms:

dtm %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
   feature frequency rank docfreq group
1      the       135    1       1   all
2      and       111    2       1   all
3       of        82    3       1   all
4       to        70    4       1   all
5      our        67    5       1   all
6       we        62    6       1   all
7     that        49    7       1   all
8        a        47    8       1   all
9       is        36    9       1   all
10      in        25   10       1   all

In the document-incidence matrix, all terms are weighted equally and the "most frequent" terms list is essentially arbitrary (in this case, it is the order in which the terms are encountered in the corpus):

dim %>%
  dfm_subset(subset= docid(dim)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
   feature frequency rank docfreq group
1       of         1    1       1   all
2      the         1    1       1   all
3      and         1    1       1   all
4       to         1    1       1   all
5     life         1    1       1   all
6       no         1    1       1   all
7    could         1    1       1   all
8     have         1    1       1   all
9   filled         1    1       1   all
10    with         1    1       1   all

Term frequency and logged term frequency

For many purposes, it makes sense to transform count data to logged count data. This can be done here with the dfm_weight command as well. Note that \(\log (f_{dw})\) is ill-defined where counts equal zero. Smoothing, or addition of a prior, solves this problem, but at the expense of needing to create a dense matrix. A common approach to maintain sparsity is to calculate either \(\log(1+f_{dw})\), which equals zero when the count equals zero, or \(1+\log(f_{dw})\) only for non-zero values, leaving zero counts as zeros. The "logcount" weighting scheme in the dfm_weight command does the latter. (Note that if a base isn't specified, it assumes base 10, whereas the generic log function would assume natural logarithm, i.e., base=exp(1).)

dtm %>%
  dfm_weight(scheme="logcount",base=exp(1)) %>%
  dfm_subset(subset= docid(dim)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
   feature frequency rank docfreq group
1      the  5.905275    1       1   all
2      and  5.709530    2       1   all
3       of  5.406719    3       1   all
4       to  5.248495    4       1   all
5      our  5.204693    5       1   all
6       we  5.127134    6       1   all
7     that  4.891820    7       1   all
8        a  4.850148    8       1   all
9       is  4.583519    9       1   all
10      in  4.218876   10       1   all
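A quick base-R check of that arithmetic: with base e, the logcount scheme assigns 1 + log(f) to nonzero counts and leaves zeros untouched. Applying this to the count of 135 that "the" has in the 2009 speech reproduces the 5.905275 shown above.

```r
# logcount with base e: 1 + log(f) for f > 0, zero otherwise (preserving sparsity)
f <- c(the = 135, unused = 0)
w <- ifelse(f > 0, 1 + log(f), 0)

round(w["the"], 6)   # 5.905275, matching the weighted value above
w["unused"]          # still 0
```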

Document frequency and tf-idf

The most commonly used weighting scheme is tf-idf. The general idea is to take a term frequency or logged term frequency and downweight that according to (logged) document frequency. The intuition is that the most important words are those that are used a lot in a given document but relatively rare in the corpus overall.

Note that many sources allow for a wide variety of interpretations of both "tf" and "idf." In particular, Manning & Schutze are inconsistent about the "default" for "tf," stating \(\log(1+f_{dw})\) in some instances and \(1+\log(f_{dw})\) (for \(f_{dw} > 0\)) in others. The most common implementation of "idf" is \(\log \frac{D}{d_w}\), where \(d_w\) is the number of documents in which word \(w\) appears, and \(D\) is the total number of documents. These implementations maintain or even increase sparsity: zero counts remain zeros, and there are new zeros for any words that appear in all documents.
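These defaults can be checked on a toy count matrix in base R (hypothetical counts; dfm_tfidf uses raw counts for tf and base-10 idf by default). Note that "common", which appears in all three toy documents, gets an idf of zero and so drops out entirely:

```r
# Hypothetical counts: 3 documents, 3 terms
counts <- matrix(c(8, 1, 0,
                   0, 2, 5,
                   1, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("rare", "common", "unique")))

D   <- nrow(counts)
d_w <- colSums(counts > 0)     # document frequency of each term
idf <- log10(D / d_w)          # idf = log10(D / d_w)

tfidf <- sweep(counts, 2, idf, "*")   # tf-idf with raw counts as tf
tfidf
```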

The little run from the Obama speech is then:

dtm.w <- dfm_tfidf(dtm)
dtm.w[56,601:610]
Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 4 docvars.
            features
docs         endeavor express      high sense
  2009-Obama        0       0 0.1910684     0
            features
docs         entertain     honor reposed  america
  2009-Obama         0 0.2523381       0 2.235923
            features
docs         previous execution
  2009-Obama        0         0

The most heavily weighted words in the Obama speech become:

dtm.w %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  textstat_frequency(force=TRUE) %>%
  head(10)
      feature frequency rank docfreq group
1         icy  3.541704    1       1   all
2        jobs  2.978102    2       1   all
3      waters  2.939644    3       1   all
4      storms  2.939644    3       1   all
5  generation  2.702015    5       1   all
6      father  2.603286    6       1   all
7     journey  2.603286    6       1   all
8    traveled  2.587462    8       1   all
9      winter  2.587462    8       1   all
10        you  2.517024   10       1   all
# force=TRUE is needed for quanteda to summarize "frequencies" that have been weighted

You may wish to access the weighted values more directly, without using the textstat_frequency command.

obama_tfidf <- dtm.w %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  as.numeric()
names(obama_tfidf) <- colnames(dtm)
sort(obama_tfidf,dec=T)[1:10]
       icy       jobs     waters     storms 
  3.541704   2.978102   2.939644   2.939644 
generation     father    journey   traveled 
  2.702015   2.603286   2.603286   2.587462 
    winter        you 
  2.587462   2.517024 

Cosine similarity

With quanteda

You can calculate cosine similarity efficiently with the textstat_simil command:

cos_dtm <- textstat_simil(dtm, method="cosine")
dim(cos_dtm)
[1] 59 59

This is a \(D \times D\) symmetric matrix.

You can find the most similar documents to a given document using the given row (or column) of the similarity matrix. For example, the most similar speeches to Kennedy's (1961) inaugural are as follows:

sort(cos_dtm[,"1961-Kennedy"],dec=T)
   1961-Kennedy   1925-Coolidge 1953-Eisenhower 
      1.0000000       0.9341794       0.9335726 
     1969-Nixon  1937-Roosevelt    1997-Clinton 
      0.9234509       0.9220879       0.9205719 
1957-Eisenhower   1889-Harrison  1933-Roosevelt 
      0.9176332       0.9167532       0.9163688 
  1837-VanBuren  1941-Roosevelt     1949-Truman 
      0.9162733       0.9161075       0.9155213 
    1981-Reagan     1985-Reagan     1929-Hoover 
      0.9154402       0.9153323       0.9150125 
   1921-Harding     1853-Pierce  1801-Jefferson 
      0.9139960       0.9132859       0.9124118 
     2009-Obama       1909-Taft   1857-Buchanan 
      0.9115104       0.9095213       0.9084186 
 1905-Roosevelt   1897-McKinley     1817-Monroe 
      0.9052003       0.9046669       0.9040922 
      2005-Bush       1845-Polk      1877-Hayes 
      0.9033540       0.9029782       0.9029498 
    1913-Wilson     1917-Wilson  1893-Cleveland 
      0.9017107       0.9012049       0.9006467 
     1973-Nixon   1901-McKinley   1841-Harrison 
      0.9001845       0.8988335       0.8967638 
   1965-Johnson    1833-Jackson   1881-Garfield 
      0.8947204       0.8941043       0.8933200 
   1809-Madison      1873-Grant     1821-Monroe 
      0.8932276       0.8908578       0.8900598 
 1805-Jefferson    1829-Jackson    1813-Madison 
      0.8898084       0.8890661       0.8889099 
 1885-Cleveland      1825-Adams      2013-Obama 
      0.8870841       0.8866805       0.8865157 
      1989-Bush    1861-Lincoln      1797-Adams 
      0.8859851       0.8850230       0.8835635 
 2021-Biden.txt     1849-Taylor    1993-Clinton 
      0.8799674       0.8782968       0.8770482 
    1977-Carter      1869-Grant  1945-Roosevelt 
      0.8711636       0.8663567       0.8614835 
   1865-Lincoln 1789-Washington       2001-Bush 
      0.8591446       0.8559081       0.8481015 
     2017-Trump 1793-Washington 
      0.8316804       0.7518340 

This doesn't make much sense. Let's see what happens if we use tf-idf.

cos_dtm.w <- textstat_simil(dtm.w, method="cosine")
sort(cos_dtm.w[,"1961-Kennedy"],dec=T)
   1961-Kennedy     1981-Reagan       2005-Bush 
     1.00000000      0.13913142      0.12881338 
     2009-Obama    1993-Clinton      1973-Nixon 
     0.12204132      0.12106314      0.12010097 
1957-Eisenhower 1953-Eisenhower       1989-Bush 
     0.11332337      0.11302037      0.11283880 
     1969-Nixon    1997-Clinton     1985-Reagan 
     0.11090511      0.10903508      0.10842288 
     2013-Obama    1965-Johnson       2001-Bush 
     0.10561653      0.10163961      0.10023038 
 2021-Biden.txt     1977-Carter   1925-Coolidge 
     0.09247814      0.08895975      0.08326778 
    1949-Truman     1853-Pierce   1881-Garfield 
     0.08180853      0.08041863      0.07808556 
    1929-Hoover  1801-Jefferson      2017-Trump 
     0.07759004      0.07758295      0.07561688 
  1841-Harrison  1933-Roosevelt  1941-Roosevelt 
     0.07446879      0.07191991      0.07089801 
 1905-Roosevelt    1921-Harding    1861-Lincoln 
     0.06938615      0.06522225      0.06379865 
  1837-VanBuren     1917-Wilson       1909-Taft 
     0.06353748      0.06256859      0.06071223 
  1857-Buchanan  1937-Roosevelt     1913-Wilson 
     0.06065495      0.05901388      0.05847482 
     1825-Adams      1877-Hayes     1817-Monroe 
     0.05746515      0.05724274      0.05602361 
 1805-Jefferson       1845-Polk     1821-Monroe 
     0.05581724      0.05576344      0.05554536 
  1889-Harrison   1897-McKinley 1789-Washington 
     0.05467709      0.05401196      0.05309571 
 1945-Roosevelt    1813-Madison    1865-Lincoln 
     0.05287510      0.05286289      0.05226755 
     1797-Adams  1885-Cleveland    1833-Jackson 
     0.04893401      0.04764906      0.04674641 
   1829-Jackson   1901-McKinley      1869-Grant 
     0.04560525      0.04557633      0.04063747 
     1873-Grant  1893-Cleveland     1849-Taylor 
     0.04021104      0.03742261      0.03704218 
   1809-Madison 1793-Washington 
     0.03101232      0.01493663 

It's still not completely intuitive -- Kennedy and Reagan don't seem superficially to be similar -- but note that the similarities are greater with every post-war (WWII) president, except Trump, than with any pre-war president. We'll look in more detail at the calculation to determine what's going on.

Cosine similarity step by step

Let's again calculate the cosine similarity between documents using just counts, but this time doing it step by step. First, let's make a regular matrix object out of our dtm. (It's easier to understand what's going on if we make this a "dense" matrix, but it's not something we would normally do.)

dtmat <- as.matrix(dtm)

The L2-norm

Now let's "norm" the documents to length 1 using the \(L_2\) norm. The \(L_2\) norm is the square root of the sum of squares for each document -- the "length" of the document vector, or the "Euclidean distance" of its tip from the origin:
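On a toy vector, the arithmetic looks like this: (3, 4) has \(L_2\) norm \(\sqrt{3^2 + 4^2} = 5\), and dividing through by the norm yields a unit-length vector whose squared entries sum to one.

```r
v <- c(3, 4)
l2 <- sqrt(sum(v^2))   # sqrt(9 + 16) = 5
v_normed <- v / l2     # (0.6, 0.8)
sum(v_normed^2)        # 1
```

The same operations apply row by row to the document-term matrix below.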

l2.dtmat <- sqrt(rowSums(dtmat^2))

Now divide the rows by the norm. The row sum of squares should now be one.

dtmat.l2normed <- sweep(dtmat,1,l2.dtmat,"/")
summary(rowSums(dtmat.l2normed^2))

The dot product

To find the cosine similarity between any two documents, calculate the dot product of their normed vectors (multiply the two vectors element by element, and then sum the results).

cos.obama1.obama2 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2013-Obama",])
cos.obama1.obama2
[1] 0.9604997
cos.obama1.trump1 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2017-Trump",])
cos.obama1.trump1
[1] 0.9112379

To find the cosine similarity for all pairs, take the matrix crossproduct. (That is, calculate the dot product of every row / document with every other row / document -- this will result in a 59 \(\times\) 59 matrix.) This matrix has all ones on its diagonal -- why? This matrix is symmetric -- why?

cos.dtmat <- dtmat.l2normed %*% t(dtmat.l2normed)
dim(cos.dtmat)
[1] 59 59
sort(cos.dtmat[,"1961-Kennedy"],dec=T)
   1961-Kennedy   1925-Coolidge 1953-Eisenhower 
      1.0000000       0.9341794       0.9335726 
     1969-Nixon  1937-Roosevelt    1997-Clinton 
      0.9234509       0.9220879       0.9205719 
1957-Eisenhower   1889-Harrison  1933-Roosevelt 
      0.9176332       0.9167532       0.9163688 
  1837-VanBuren  1941-Roosevelt     1949-Truman 
      0.9162733       0.9161075       0.9155213 
    1981-Reagan     1985-Reagan     1929-Hoover 
      0.9154402       0.9153323       0.9150125 
   1921-Harding     1853-Pierce  1801-Jefferson 
      0.9139960       0.9132859       0.9124118 
     2009-Obama       1909-Taft   1857-Buchanan 
      0.9115104       0.9095213       0.9084186 
 1905-Roosevelt   1897-McKinley     1817-Monroe 
      0.9052003       0.9046669       0.9040922 
      2005-Bush       1845-Polk      1877-Hayes 
      0.9033540       0.9029782       0.9029498 
    1913-Wilson     1917-Wilson  1893-Cleveland 
      0.9017107       0.9012049       0.9006467 
     1973-Nixon   1901-McKinley   1841-Harrison 
      0.9001845       0.8988335       0.8967638 
   1965-Johnson    1833-Jackson   1881-Garfield 
      0.8947204       0.8941043       0.8933200 
   1809-Madison      1873-Grant     1821-Monroe 
      0.8932276       0.8908578       0.8900598 
 1805-Jefferson    1829-Jackson    1813-Madison 
      0.8898084       0.8890661       0.8889099 
 1885-Cleveland      1825-Adams      2013-Obama 
      0.8870841       0.8866805       0.8865157 
      1989-Bush    1861-Lincoln      1797-Adams 
      0.8859851       0.8850230       0.8835635 
 2021-Biden.txt     1849-Taylor    1993-Clinton 
      0.8799674       0.8782968       0.8770482 
    1977-Carter      1869-Grant  1945-Roosevelt 
      0.8711636       0.8663567       0.8614835 
   1865-Lincoln 1789-Washington       2001-Bush 
      0.8591446       0.8559081       0.8481015 
     2017-Trump 1793-Washington 
      0.8316804       0.7518340 

As expected, we get the same answer for the speeches most similar to the Kennedy inaugural.

Now let's focus on the strangeness. It looks like most inaugurals are relatively similar to one another (0.8+ seems like a pretty high number), and it seems odd that Coolidge's would be the most similar to Kennedy's. Let's break down which words contribute the most to the similarity rankings by looking at the contributions of individual words to the dot product:

sort(dtmat.l2normed["1961-Kennedy",]*dtmat.l2normed["1925-Coolidge",], dec=T)[1:20]
        the          of         and          to 
0.323609353 0.193983954 0.086301594 0.083692074 
         we           a          in        that 
0.038061512 0.032193695 0.026614224 0.018742411 
        not         our         but          is 
0.016709580 0.016651911 0.007684389 0.006559844 
        for          be          it         are 
0.006458923 0.005651558 0.005420882 0.004930696 
        can       which         all        this 
0.003373634 0.003128541 0.002984369 0.002941117 

Ahhhh! The cosine similarity is being driven by the relative use of common words: the, of, and, to, and so on. This is arguably what we want in some applications, such as stylometry, where we may be trying to guess authorship, but it is almost certainly not what we're after here.

Let's look instead at the tf-idf cosine similarities.

dtmat.w <- as.matrix(dtm.w)
l2.dtmat.w <- sqrt(rowSums(dtmat.w^2))
dtmat.w.l2normed <- sweep(dtmat.w,1,l2.dtmat.w,"/")
cos.dtmat.w <- dtmat.w.l2normed %*% t(dtmat.w.l2normed)
dim(cos.dtmat.w)
[1] 59 59
sort(cos.dtmat.w[,"1961-Kennedy"],dec=T)
   1961-Kennedy     1981-Reagan       2005-Bush 
     1.00000000      0.13913142      0.12881338 
     2009-Obama    1993-Clinton      1973-Nixon 
     0.12204132      0.12106314      0.12010097 
1957-Eisenhower 1953-Eisenhower       1989-Bush 
     0.11332337      0.11302037      0.11283880 
     1969-Nixon    1997-Clinton     1985-Reagan 
     0.11090511      0.10903508      0.10842288 
     2013-Obama    1965-Johnson       2001-Bush 
     0.10561653      0.10163961      0.10023038 
 2021-Biden.txt     1977-Carter   1925-Coolidge 
     0.09247814      0.08895975      0.08326778 
    1949-Truman     1853-Pierce   1881-Garfield 
     0.08180853      0.08041863      0.07808556 
    1929-Hoover  1801-Jefferson      2017-Trump 
     0.07759004      0.07758295      0.07561688 
  1841-Harrison  1933-Roosevelt  1941-Roosevelt 
     0.07446879      0.07191991      0.07089801 
 1905-Roosevelt    1921-Harding    1861-Lincoln 
     0.06938615      0.06522225      0.06379865 
  1837-VanBuren     1917-Wilson       1909-Taft 
     0.06353748      0.06256859      0.06071223 
  1857-Buchanan  1937-Roosevelt     1913-Wilson 
     0.06065495      0.05901388      0.05847482 
     1825-Adams      1877-Hayes     1817-Monroe 
     0.05746515      0.05724274      0.05602361 
 1805-Jefferson       1845-Polk     1821-Monroe 
     0.05581724      0.05576344      0.05554536 
  1889-Harrison   1897-McKinley 1789-Washington 
     0.05467709      0.05401196      0.05309571 
 1945-Roosevelt    1813-Madison    1865-Lincoln 
     0.05287510      0.05286289      0.05226755 
     1797-Adams  1885-Cleveland    1833-Jackson 
     0.04893401      0.04764906      0.04674641 
   1829-Jackson   1901-McKinley      1869-Grant 
     0.04560525      0.04557633      0.04063747 
     1873-Grant  1893-Cleveland     1849-Taylor 
     0.04021104      0.03742261      0.03704218 
   1809-Madison 1793-Washington 
     0.03101232      0.01493663 

Same answer as before. The similarities are lower, but they reflect similarity in distinctive content. That is, these speeches are similar in what makes them different from the others. So why is Reagan's the most similar to Kennedy's?

sort(dtmat.w.l2normed["1961-Kennedy",]*dtmat.w.l2normed["1981-Reagan",], dec=T)[1:20]
      begin   americans       sides     loyalty 
0.008795749 0.007535292 0.006083563 0.004495064 
      deeds   negotiate         let         you 
0.003817306 0.003817306 0.003538579 0.003354278 
         mr        help        back        prey 
0.003273326 0.003020558 0.002930115 0.002463591 
  unleashed      burden       price        your 
0.002463591 0.002420332 0.001968381 0.001932147 
   cultural     special        vice     freedom 
0.001908653 0.001780228 0.001780228 0.001679748 

This suggests they both framed their presidencies as an opportunity to "begin" something new, for example, and were relatively unusual in doing so.

kwic(inaugural_tokens,"begin", window=4)
Keyword-in-context with 12 matches.                                               
 [1953-Eisenhower, 5]     My friends before I |
  [1961-Kennedy, 665] request that both sides |
  [1961-Kennedy, 773]           war So let us |
 [1961-Kennedy, 1009]       planet But let us |
   [1981-Reagan, 395]         we are going to |
   [1981-Reagan, 761]          world So as we |
  [1981-Reagan, 1137]      our command let us |
  [1985-Reagan, 1381]      budget We can then |
 [1993-Clinton, 1510]     21st Century let us |
      [2001-Bush, 37]     new beginnings As I |
    [2009-Obama, 722]  dust ourselves off and |
   [2009-Obama, 1376] between nations We will |
                                   
 begin | the expression of those   
 begin | anew the quest for        
 begin | anew remembering on both  
 begin | In your hands my          
 begin | to act beginning today    
 begin | let us take inventory     
 begin | an era of national        
 begin | reducing the national debt
 begin | anew with energy and      
 begin | I thank President Clinton 
 begin | again the work of         
 begin | to responsibly leave Iraq 

PMI, Fightin' Words, and "keyness" approaches to weighting

Finally, I will note that an alternative approach to tf-idf weighting is to represent each document by a vector of "keyness" statistics. The textstat_keyness command provides several, such as chi-squared and PMI (pointwise mutual information). I personally often use the "Fightin' Words" statistic (zeta), calculated with each document as a "group" to be compared to the others. In this example, this gives results that are correlated with tf-idf, but with some minor qualitative differences.
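For intuition on the chi-squared variety, here is a base-R sketch with hypothetical counts (not a substitute for textstat_keyness, which computes this across the whole vocabulary): keyness compares a word's frequency in the target document against its frequency in the rest of the corpus via a 2x2 contingency table.

```r
# Hypothetical counts for one word: 50 occurrences in a 2,700-token target
# document versus 400 occurrences in the remaining 132,000 tokens
tab <- matrix(c(50,   2650,     # target doc:     word, all other tokens
                400, 131600),   # rest of corpus: word, all other tokens
              nrow = 2, byrow = TRUE)

# A large chi-squared statistic marks the word as "key" for the target
chisq.test(tab)$statistic
```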

---
title: "Term Weighting (including tf-idf) and Cosine Similarity"
subtitle: "PLSC 597, Text as Data, Penn State"
author: "Burt L. Monroe"
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    code_folding: show
    highlight: tango
    theme: united
    toc: yes
    df_print: paged
---
Updated September 2021

# Introduction

This is a brief look at how document similarity, especially cosine similarity, is calculated, how it can be used to compare documents, and the impact of term weighting procedures, including tf-idf.

Within `quanteda`, the `dfm_weight` and `dfm_tfidf` commands provide easy access to various weighting schemes. Within the `quanteda` ecosystem, the `quanteda.textstats` package provides easy access to efficient calculation of similarities with its `dfm` objects. Both of these maintain the computational advantages of the underlying sparse matrix objects. We will also calculate some of these less efficiently "by hand" using dense matrices to show the component steps.

It is worth noting that the `quanteda.textstats` package provides a variety of other statistics, such as Flesch readability scores, which I don't discuss here but which you may find of use.

To keep things simple, we'll again use the inaugural speech corpus that comes with `quanteda`.

If you need to, run this chunk to install `quanteda` and `quanteda.textstats`:

```{r}
# install.packages("quanteda", dependencies = TRUE)
# install.packages("quanteda.textstats", dependencies = TRUE)
```

Now load them:
```{r}
library(quanteda)
library(quanteda.textstats)
```

Let's again load in the corpus of presidential inaugural addresses and see what it looks like:

```{r}
corp <- quanteda::data_corpus_inaugural

summary(corp)
```

As before, we can use quanteda's `tokens` command to tokenize and the `dfm` command to generate a document-term matrix from this tokens object, e.g.:

```{r}
inaugural_tokens <- quanteda::tokens(corp,
                       what = "word",
                       remove_punct = TRUE, # default FALSE
                       remove_symbols = TRUE, # default FALSE
                       remove_numbers = FALSE,
                       remove_url = TRUE, # default FALSE
                       remove_separators = TRUE,
                       split_hyphens = FALSE,
                       include_docvars = TRUE,
                       padding = FALSE,
                       verbose = quanteda_options("verbose")
                       )

dtm <- quanteda::dfm(inaugural_tokens,
                                 tolower = TRUE    # casefold
)
```

The most frequent terms can be seen a couple of ways. One is the `textstat_frequency` command, which calculates overall (i.e., corpus) frequency and also shows document frequencies (the number of documents in which it appears).

```{r}
dtm %>%
  textstat_frequency() %>% 
  head(10)
```

The most frequent terms are essentially the "stop words." Most of these terms are used in every inaugural, although there are exceptions ("a", "our", and "we" are left out of one each).

You can, of course, remove words on a stopwords list and get a more substantive-sounding list of most frequent terms:

```{r}
inaugural_tokens_nostop <- inaugural_tokens %>%
                            tokens_tolower() %>%
                          tokens_remove(stopwords('en'))
dtm_nostop <- dfm(inaugural_tokens_nostop)

dtm_nostop %>%
  textstat_frequency() %>% 
  head(10)
```


# Document representation and weighting

The variable `dtm`, as calculated above, represents each document as a row of the document-term matrix, with values equal to the raw count of the term indexed by each column.

For example, the 601st to 610th entries in the representation of Obama's 2009 inaugural (document 56) are:

```{r}
dtm[56,601:610]
```

## Document-incidence matrix

You may, for some purposes, wish to represent a document just by the presence or absence of terms. This is sometimes called a **document-incidence matrix** and often makes sense for a corpus of short documents, like tweets.
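The boolean scheme simply maps every nonzero count to one. As a minimal base-R sketch of the idea (the toy matrix here is hypothetical, not drawn from the corpus):

```r
# Toy document-term count matrix: 2 documents x 3 terms
counts <- matrix(c(8, 0, 2,
                   0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("d1", "d2"),
                                 c("america", "hope", "new")))

# Boolean (incidence) weighting: 1 if the term appears at all, 0 otherwise
incidence <- (counts > 0) * 1
incidence
```

Note that the count of 8 for "america" in the first document becomes a 1, just like the count of 2 for "new."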

```{r}
dim <- dfm_weight(dtm, scheme="boolean")
dim[56,601:610]
```

Here, the corpus frequency and document frequency of a given term will be the same.

```{r}
dim %>%
  textstat_frequency() %>%
  head(10)
```
  
You can see that the entry for "america" is now simply "1", indicating that the term is present at least once, but not keeping the information that it appears eight times in the speech.

We can think of both the original document-frequency matrix and the document-incidence matrix as "weighted" representations, in which terms are weighted by frequency in the former case, and equally in the latter.

In the document-frequency matrix, the most heavily weighted terms are obviously the most common terms:

```{r}
dtm %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
```

In the document-incidence matrix, all terms are weighted equally and the "most frequent" terms list is essentially random (in this case it is the order in which the terms are encountered in the corpus):

```{r}
dim %>%
  dfm_subset(subset= docid(dim)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
```

## Term frequency and logged term frequency

For many purposes, it makes sense to transform count data to logged count data. This can be done here with the `dfm_weight` command as well. Note that $\log (f_{dw})$ is ill-defined where counts equal zero. Smoothing, or addition of a prior, solves this problem, but at the expense of needing to create a dense matrix. A common approach to maintain sparsity is to calculate either $\log(1+f_{dw})$, which equals zero when the count equals zero, or $1+\log(f_{dw})$ *only for non-zero values*, leaving zero counts as zeros. The "logcount" weighting scheme in the `dfm_weight` command does the latter. (Note that if a base isn't specified, it assumes base 10, whereas the generic `log` function would assume the natural logarithm, i.e., `base=exp(1)`.)
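To see what that transformation does, here is a base-R sketch of $1+\log(f_{dw})$ applied only to non-zero counts, using the natural logarithm (the toy counts are hypothetical):

```r
# Toy term counts for one document, including a zero
f <- c(the = 120, america = 8, hope = 1, unicorn = 0)

# "logcount"-style weighting: 1 + log(f) for nonzero counts; zeros stay zero
logcount <- ifelse(f > 0, 1 + log(f), 0)
round(logcount, 3)
```

A count of 1 maps to exactly 1, the count of 120 is compressed to about 5.8, and the zero stays zero, preserving sparsity.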

```{r}
dtm %>%
  dfm_weight(scheme="logcount",base=exp(1)) %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  textstat_frequency() %>% 
  head(10)
```

## Document frequency and tf-idf

The most commonly used weighting scheme is **tf-idf**. The general idea is to take a term frequency or logged term frequency and *downweight* that according to (logged) document frequency. The intuition is that the most important words are those that are used a lot in a given document but relatively rare in the corpus overall.

Note that many sources allow for a wide variety of interpretations of both "tf" and "idf." In particular, Manning & Schütze are inconsistent about the "default" for "tf", stating $\log(1+f_{dw})$ in some instances and $1+\log(f_{dw})$ (for $f_{dw} > 0$) in others. The most common implementation of "idf" is $\log\frac{D}{d_w}$, where $d_w$ is the number of documents in which word $w$ appears, and $D$ is the total number of documents. These implementations maintain or even increase sparsity, with zero counts remaining zeros, and new zeros for any words that appear in all documents.
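The calculation can be sketched by hand in base R, using raw counts for "tf" and $\log_{10}\frac{D}{d_w}$ for "idf" (the combination `dfm_tfidf` uses by default); the toy matrix is hypothetical:

```r
# Toy document-term count matrix: 3 documents x 3 terms
counts <- matrix(c(2, 1, 0,
                   0, 1, 3,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("d", 1:3),
                                 c("begin", "the", "anew")))

D   <- nrow(counts)             # number of documents
dw  <- colSums(counts > 0)      # document frequency of each term
idf <- log10(D / dw)            # inverse document frequency, base 10

# tf-idf: raw count times idf, applied column by column
tfidf <- sweep(counts, 2, idf, "*")
round(tfidf, 3)
```

Here "the" appears in all three documents, so its idf is $\log_{10}(3/3)=0$ and its tf-idf weights are all zero, while "anew" appears in only one document and gets the largest weight.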

The little run from the Obama speech is then:
```{r}
dtm.w <- dfm_tfidf(dtm)
dtm.w[56,601:610]
```

The most heavily weighted words in the Obama speech become:

```{r}
dtm.w %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  textstat_frequency(force=TRUE) %>%
  head(10)

# force=TRUE is needed for quanteda to summarize "frequencies" that have been weighted
```

You may wish to access the weighted values more directly, without using the `textstat_frequency` command.
```{r}
obama_tfidf <- dtm.w %>%
  dfm_subset(subset= docid(dtm)=="2009-Obama") %>%
  as.numeric()
names(obama_tfidf) <- colnames(dtm)

sort(obama_tfidf,dec=T)[1:10]
```

# Cosine similarity

## With quanteda

You can calculate cosine similarity efficiently with the `textstat_simil` command:

```{r}
cos_dtm <- textstat_simil(dtm, method="cosine")
dim(cos_dtm)
```

This is a $D \times D$ symmetric matrix.

You can find the most similar documents to a given document using the given row (or column) of the similarity matrix. For example, the most similar speeches to Kennedy's (1961) inaugural are as follows:

```{r}
sort(cos_dtm[,"1961-Kennedy"],dec=T)
```

This doesn't make much sense. Let's see what happens if we use tf-idf.

```{r}
cos_dtm.w <- textstat_simil(dtm.w, method="cosine")
sort(cos_dtm.w[,"1961-Kennedy"],dec=T)
```

It's still not completely intuitive -- Kennedy and Reagan don't seem superficially to be similar -- but note that the similarities are greater with *every* post-war (WWII) president, except Trump, than with *any* pre-war president. We'll look in more detail at the calculation to determine what's going on.

## Cosine similarity step by step

Let's again calculate the cosine similarity between documents using just counts, but this time doing it step by step. First, let's make a regular matrix object out of our dtm. (It's easier to understand what's going on if we make this a "dense" matrix, but it's not something we would normally do.)

```{r}
dtmat <- as.matrix(dtm)
```

### The L2-norm

Now let's "norm" the documents to length 1 using the $L_2$ norm. The $L_2$ norm is the square root of the sum of squares for each document -- the "length" of the document vector, or the "Euclidean distance" of its tip from the origin:

```{r}
 l2.dtmat <- sqrt(rowSums(dtmat^2))
```

Now divide the rows by the norm. The row sum of squares should now be one.

```{r}
dtmat.l2normed <- sweep(dtmat,1,l2.dtmat,"/")
summary(rowSums(dtmat.l2normed^2))
```

### The dot product

To find the cosine similarity between any two, calculate the dot product of these vectors (multiply the two vectors element by element, and then sum those up).

```{r}
cos.obama1.obama2 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2013-Obama",])
cos.obama1.obama2

cos.obama1.trump1 <- sum(dtmat.l2normed["2009-Obama",]*dtmat.l2normed["2017-Trump",])
cos.obama1.trump1
```

To find the cosine similarity for all pairs, take the matrix crossproduct. (That is, calculate the dot product of every row / document with every other row / document -- this will result in a 59 $\times$ 59 matrix.) This matrix has all ones on its diagonal -- why? This matrix is symmetric -- why?
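Both of those questions can be checked on a toy example: after $L_2$-norming, each row's dot product with itself is 1 (the diagonal), and the $(i,j)$ and $(j,i)$ entries are the same dot product (symmetry). The values here are hypothetical:

```r
# Toy 3 x 2 matrix of row vectors
X <- matrix(c(1, 2,
              3, 0,
              1, 1), nrow = 3, byrow = TRUE)

# Norm each row to unit length, then take the crossproduct
Xn   <- sweep(X, 1, sqrt(rowSums(X^2)), "/")
cosX <- Xn %*% t(Xn)

diag(cosX)            # numerically all ones: each unit vector dotted with itself
all(cosX == t(cosX))  # symmetric
```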



```{r}
cos.dtmat <- dtmat.l2normed %*% t(dtmat.l2normed)
dim(cos.dtmat)

sort(cos.dtmat[,"1961-Kennedy"],dec=T)
```

As we should, we get the same answer for the speeches most similar to the Kennedy inaugural.

Now let's focus on the strangeness. It looks like most inaugurals are relatively similar to one another (0.8+ seems like a pretty high number), and it seems odd that Coolidge would be the most similar to Kennedy. Let's break down what words are contributing the most to the similarity rankings by looking at the contributions of individual words to the dot-product:

```{r}
sort(dtmat.l2normed["1961-Kennedy",]*dtmat.l2normed["1925-Coolidge",], dec=T)[1:20]
```

Ahhhh! The cosine similarity is being driven by the relative use of common words ... "the", "of", "and", "to", and so on. This is arguably what we want in some applications, like stylometry, where we are trying to guess authorship, but it is almost definitely not what we're after here.

Let's look instead at the tf-idf cosine similarities.

```{r}
dtmat.w <- as.matrix(dtm.w)

l2.dtmat.w <- sqrt(rowSums(dtmat.w^2))
dtmat.w.l2normed <- sweep(dtmat.w,1,l2.dtmat.w,"/")

cos.dtmat.w <- dtmat.w.l2normed %*% t(dtmat.w.l2normed)
dim(cos.dtmat.w)

sort(cos.dtmat.w[,"1961-Kennedy"],dec=T)
```

Same answer as before. Similarities are lower, but they reflect similarity among distinctive content. That is, they are similar in what makes them different from the others. So, why is Reagan the most similar to Kennedy?

```{r}
sort(dtmat.w.l2normed["1961-Kennedy",]*dtmat.w.l2normed["1981-Reagan",], dec=T)[1:20]
```

This suggests they both framed their presidencies as an opportunity to "begin" something new, for example, and were relatively unusual in doing so.

```{r}
kwic(inaugural_tokens, "begin", window = 4)
```


# PMI, Fightin' Words, and "keyness" approaches to weighting

Finally, I will note that an alternative approach to tf-idf weighting is to represent each document by a vector of "keyness" statistics. The `textstat_keyness` command provides several, such as chi-squared and PMI (pointwise mutual information). I personally often use the "Fightin' Words" statistic (zeta), calculated with each document as a "group" to be compared to the others. In this example, this gives results that are correlated with tf-idf, but with some minor qualitative differences.
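A minimal sketch of that approach, assuming the `dtm` object from above and treating 2009-Obama as the target "group" against all other inaugurals (chi-squared shown here; `measure = "pmi"` gives PMI instead):

```r
library(quanteda)
library(quanteda.textstats)

dtm <- dfm(tokens(data_corpus_inaugural))

# Keyness of each term for Obama 2009 vs. all other inaugurals combined
key_obama <- textstat_keyness(dtm,
                              target = docid(dtm) == "2009-Obama",
                              measure = "chi2")
head(key_obama, 10)
```

The result is a data frame of features sorted by the keyness statistic, which could then be assembled, one document at a time, into a keyness-weighted document-term representation.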
