We are going to use the quanteda R package. If you need to install it, run this command:
#install.packages("quanteda", dependencies = TRUE)
Now load it:
library(quanteda)
Package version: 1.3.14
Parallel computing: 2 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Attaching package: ‘quanteda’
The following object is masked from ‘package:utils’:
View
Let's again load the corpus of presidential inaugural addresses and see what it looks like:
corp <- quanteda::data_corpus_inaugural
summary(corp)
Corpus consisting of 58 documents:
Text Types Tokens Sentences Year President FirstName
1789-Washington 625 1538 23 1789 Washington George
1793-Washington 96 147 4 1793 Washington George
1797-Adams 826 2578 37 1797 Adams John
1801-Jefferson 717 1927 41 1801 Jefferson Thomas
1805-Jefferson 804 2381 45 1805 Jefferson Thomas
1809-Madison 535 1263 21 1809 Madison James
1813-Madison 541 1302 33 1813 Madison James
1817-Monroe 1040 3680 121 1817 Monroe James
1821-Monroe 1259 4886 129 1821 Monroe James
1825-Adams 1003 3152 74 1825 Adams John Quincy
1829-Jackson 517 1210 25 1829 Jackson Andrew
1833-Jackson 499 1269 29 1833 Jackson Andrew
1837-VanBuren 1315 4165 95 1837 Van Buren Martin
1841-Harrison 1896 9144 210 1841 Harrison William Henry
1845-Polk 1334 5193 153 1845 Polk James Knox
1849-Taylor 496 1179 22 1849 Taylor Zachary
1853-Pierce 1165 3641 104 1853 Pierce Franklin
1857-Buchanan 945 3086 89 1857 Buchanan James
1861-Lincoln 1075 4006 135 1861 Lincoln Abraham
1865-Lincoln 360 776 26 1865 Lincoln Abraham
1869-Grant 485 1235 40 1869 Grant Ulysses S.
1873-Grant 552 1475 43 1873 Grant Ulysses S.
1877-Hayes 831 2716 59 1877 Hayes Rutherford B.
1881-Garfield 1021 3212 111 1881 Garfield James A.
1885-Cleveland 676 1820 44 1885 Cleveland Grover
1889-Harrison 1352 4722 157 1889 Harrison Benjamin
1893-Cleveland 821 2125 58 1893 Cleveland Grover
1897-McKinley 1232 4361 130 1897 McKinley William
1901-McKinley 854 2437 100 1901 McKinley William
1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore
1909-Taft 1437 5822 159 1909 Taft William Howard
1913-Wilson 658 1882 68 1913 Wilson Woodrow
1917-Wilson 549 1656 59 1917 Wilson Woodrow
1921-Harding 1169 3721 148 1921 Harding Warren G.
1925-Coolidge 1220 4440 196 1925 Coolidge Calvin
1929-Hoover 1090 3865 158 1929 Hoover Herbert
1933-Roosevelt 743 2062 85 1933 Roosevelt Franklin D.
1937-Roosevelt 725 1997 96 1937 Roosevelt Franklin D.
1941-Roosevelt 526 1544 68 1941 Roosevelt Franklin D.
1945-Roosevelt 275 647 26 1945 Roosevelt Franklin D.
1949-Truman 781 2513 116 1949 Truman Harry S.
1953-Eisenhower 900 2757 119 1953 Eisenhower Dwight D.
1957-Eisenhower 621 1931 92 1957 Eisenhower Dwight D.
1961-Kennedy 566 1566 52 1961 Kennedy John F.
1965-Johnson 568 1723 93 1965 Johnson Lyndon Baines
1969-Nixon 743 2437 103 1969 Nixon Richard Milhous
1973-Nixon 544 2012 68 1973 Nixon Richard Milhous
1977-Carter 527 1376 52 1977 Carter Jimmy
1981-Reagan 902 2790 128 1981 Reagan Ronald
1985-Reagan 925 2921 123 1985 Reagan Ronald
1989-Bush 795 2681 141 1989 Bush George
1993-Clinton 642 1833 81 1993 Clinton Bill
1997-Clinton 773 2449 111 1997 Clinton Bill
2001-Bush 621 1808 97 2001 Bush George W.
2005-Bush 773 2319 100 2005 Bush George W.
2009-Obama 938 2711 110 2009 Obama Barack
2013-Obama 814 2317 88 2013 Obama Barack
2017-Trump 582 1660 88 2017 Trump Donald J.
Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
Created: Tue Jun 13 14:51:47 2017
Notes: http://www.presidency.ucsb.edu/inaugurals.php
As a reminder, we previously used quanteda’s dfm
command to generate a document-term matrix from this corpus, e.g.:
dtm <- quanteda::dfm(corp,
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
remove = stopwords("english"), # ignore common words on a "stop" list
ngrams = 1) # count unigrams
For illustration purposes, we’re going to create a second copy of the corpus and apply Lexicoder’s recommended preprocessing scripts to it:
#source("http://www.lexicoder.com/docs/LSDprep_jan2018.R")
source("LSDprep_jan2018.R")
corp.pp <- corp
texts.pp <- texts(corp)
texts.pp <- LSDprep_contr(texts.pp) # expand contractions
texts.pp <- LSDprep_dict_punct(texts.pp) # remove misleading words that are marked by punctuation
texts.pp <- remove_punctuation_from_acronyms(texts.pp) # strip periods from acronyms
texts.pp <- remove_punctuation_from_abbreviations(texts.pp) # strip periods from abbreviations
texts.pp <- mark_proper_nouns(texts.pp) # mark proper nouns
texts.pp <- LSDprep_punctspace(texts.pp) # put spaces around punctuation
texts.pp <- LSDprep_negation(texts.pp) # normalize negations, e.g., "not very" -> "not"
texts.pp <- LSDprep_dict(texts.pp) # alter misleading words, e.g., "may very xwell"
texts(corp.pp) <- texts.pp
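The negation step is the conceptually interesting one here. As a rough illustration of the idea (this is not the Lexicoder script, just a toy sketch that joins a negator to the following word):

```r
# Toy negation marker (illustrative only; LSDprep_negation is more elaborate):
# join a few negators to the word that follows, so that "not good" becomes
# the single token "not_good" and is no longer counted as positive.
mark_negations <- function(x, negators = c("not", "never", "no")) {
  pattern <- paste0("\\b(", paste(negators, collapse = "|"), ")\\s+(\\w+)")
  gsub(pattern, "\\1_\\2", x, ignore.case = TRUE)
}
mark_negations("It is not good and never fair.")
# [1] "It is not_good and never_fair."
```

The real scripts handle many more patterns (contractions, intensifiers like "not very", punctuation), but the principle is the same: rewrite the text so the dictionary matches the intended meaning.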
dfm.lsd <- dfm(corp,
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
#remove = stopwords("english"), # ignore common words on a "stop" list
#ngrams = 1
dictionary = data_dictionary_LSD2015)
dfm.lsd.pp <- dfm(corp.pp,
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
#remove = stopwords("english"), # ignore common words on a "stop" list
#ngrams = 1
dictionary = data_dictionary_LSD2015)
dfmat.lsd <- as.matrix(dfm.lsd)
dfmat.lsd.pp <- as.matrix(dfm.lsd.pp)
lsd.df <- data.frame(cbind(dfmat.lsd,dfmat.lsd.pp))
names(lsd.df) <- c("neg","pos","neg_pos","neg_neg","neg.pp","pos.pp","neg_pos.pp", "neg_neg.pp")
rownames(lsd.df) <- docnames(corp)
lsd.df$adj_pos <- lsd.df$pos - lsd.df$neg_pos
lsd.df$adj_neg <- lsd.df$neg - lsd.df$neg_neg
lsd.df$adj_pos.pp <- lsd.df$pos.pp - lsd.df$neg_pos.pp
lsd.df$adj_neg.pp <- lsd.df$neg.pp - lsd.df$neg_neg.pp
summary(lsd.df)
neg pos neg_pos neg_neg
Min. : 3.00 Min. : 10.0 Min. :0.000 Min. :0.000
1st Qu.: 52.25 1st Qu.:122.2 1st Qu.:0.000 1st Qu.:0.000
Median : 77.00 Median :176.0 Median :1.000 Median :1.000
Mean : 86.95 Mean :199.6 Mean :1.052 Mean :1.707
3rd Qu.:117.75 3rd Qu.:246.5 3rd Qu.:2.000 3rd Qu.:2.000
Max. :290.00 Max. :571.0 Max. :5.000 Max. :8.000
neg.pp pos.pp neg_pos.pp neg_neg.pp
Min. : 3.00 Min. : 10.0 Min. : 0.000 Min. : 0.000
1st Qu.: 51.25 1st Qu.:116.0 1st Qu.: 2.000 1st Qu.: 2.000
Median : 76.50 Median :171.5 Median : 4.000 Median : 5.000
Mean : 85.97 Mean :192.0 Mean : 4.828 Mean : 5.828
3rd Qu.:117.50 3rd Qu.:237.5 3rd Qu.: 6.000 3rd Qu.: 8.000
Max. :288.00 Max. :540.0 Max. :20.000 Max. :22.000
adj_pos adj_neg adj_pos.pp adj_neg.pp
Min. : 10.0 Min. : 3.00 Min. : 10.0 Min. : 3.00
1st Qu.:122.2 1st Qu.: 52.25 1st Qu.:114.8 1st Qu.: 47.50
Median :175.0 Median : 74.50 Median :168.0 Median : 69.50
Mean :198.5 Mean : 85.24 Mean :187.2 Mean : 80.14
3rd Qu.:246.5 3rd Qu.:116.00 3rd Qu.:236.8 3rd Qu.:108.75
Max. :566.0 Max. :289.00 Max. :520.0 Max. :270.00
plot(c(1789,2017),c(0,600),type="n", main="Raw Dictionary Counts", xlab="Year", ylab = "Count")
lines(docvars(corp)$Year,lsd.df$pos, col="blue", lty=1)
lines(docvars(corp)$Year,lsd.df$pos.pp, col="blue", lty=2)
lines(docvars(corp)$Year,lsd.df$adj_pos, col="blue", lty=3)
lines(docvars(corp)$Year,lsd.df$adj_pos.pp, col="blue", lty=4)
lines(docvars(corp)$Year,lsd.df$neg, col="red", lty=1)
lines(docvars(corp)$Year,lsd.df$neg.pp, col="red", lty=2)
lines(docvars(corp)$Year,lsd.df$adj_neg, col="red", lty=3)
lines(docvars(corp)$Year,lsd.df$adj_neg.pp, col="red", lty=4)
In this corpus, then, it makes virtually no difference whether you apply the recommended preprocessing scripts, account for negations, or both: the dictionary counts are correlated at .997 or higher. So, for the remainder of this example, we will keep it simple and use the unadjusted counts from the texts without the preprocessing scripts (pos and neg).
cor(lsd.df[,c('pos','adj_pos','pos.pp','adj_pos.pp')])
pos adj_pos pos.pp adj_pos.pp
pos 1.0000000 0.9999388 0.9990616 0.9984557
adj_pos 0.9999388 1.0000000 0.9989778 0.9985275
pos.pp 0.9990616 0.9989778 1.0000000 0.9995618
adj_pos.pp 0.9984557 0.9985275 0.9995618 1.0000000
cor(lsd.df[,c('neg','adj_neg','neg.pp','adj_neg.pp')])
neg adj_neg neg.pp adj_neg.pp
neg 1.0000000 0.9995111 0.9993675 0.9974758
adj_neg 0.9995111 1.0000000 0.9987653 0.9982477
neg.pp 0.9993675 0.9987653 1.0000000 0.9980181
adj_neg.pp 0.9974758 0.9982477 0.9980181 1.0000000
So, our first possible measure of positive sentiment is simply the count of positive tokens. By that measure, the most positive speech was William Henry Harrison's in 1841. Harrison died 31 days into his presidency.
sent.absposcount <- lsd.df$pos
names(sent.absposcount) <- docnames(corp)
sort(sent.absposcount, dec=T)[1:10]
1841-Harrison 1845-Polk 1897-McKinley 1837-VanBuren 1889-Harrison
571 388 375 354 354
1909-Taft 1925-Coolidge 1821-Monroe 1817-Monroe 1853-Pierce
353 350 344 341 333
Likewise, using raw negative counts suggests that the most negative speech was also William Henry Harrison's.
sent.absnegcount <- lsd.df$neg
names(sent.absnegcount) <- docnames(corp)
sort(sent.absnegcount, dec=T)[1:10]
1841-Harrison 1837-VanBuren 1845-Polk 1921-Harding 1909-Taft
290 183 174 168 167
1861-Lincoln 1889-Harrison 1821-Monroe 1929-Hoover 1925-Coolidge
160 157 150 139 127
Of course, Harrison's was the longest inaugural speech. All else equal, a longer speech has more positive, more negative, and more neutral tokens. This also means the positive and negative counts are highly correlated with each other, and with length.
lsd.df$tot_affect <- lsd.df$pos + lsd.df$neg
lsd.df$tot_tokens <- rowSums(dtm)
cor(lsd.df[,c('pos','neg','tot_affect','tot_tokens')])
pos neg tot_affect tot_tokens
pos 1.0000000 0.8913067 0.9885632 0.9529750
neg 0.8913067 1.0000000 0.9494891 0.9355812
tot_affect 0.9885632 0.9494891 1.0000000 0.9707433
tot_tokens 0.9529750 0.9355812 0.9707433 1.0000000
We tend to think of positive and negative affect as being on the same scale, so perhaps we can just use the difference in absolute counts (pos − neg) as a measure of sentiment?
sent_absdiff <- lsd.df$pos - lsd.df$neg
names(sent_absdiff) <- docnames(corp)
sort(sent_absdiff, dec=T)[1:10] # Most positive?
1841-Harrison 1897-McKinley 1925-Coolidge 1817-Monroe 1853-Pierce
281 257 223 217 217
1845-Polk 1949-Truman 1889-Harrison 1821-Monroe 1909-Taft
214 199 197 194 186
sort(sent_absdiff, dec=F)[1:10] # Most negative?
1865-Lincoln 1793-Washington 1813-Madison 1965-Johnson
-6 7 18 31
1961-Kennedy 1945-Roosevelt 1905-Roosevelt 1993-Clinton
33 35 41 48
1861-Lincoln 1941-Roosevelt
49 55
Harrison again. At least this time he's not both the most positive and the most negative. But the shortest speech, Washington's second inaugural at just 135 tokens, is the second most negative? And the most negative is Lincoln's second inaugural, the third shortest?
Length is having a couple of effects here. The most obvious is that the base rates for positive and negative tokens differ (inaugurals are more positive than negative, which makes sense), so the longer the speech, the larger the expected gap between the positive and negative counts:
plot(lsd.df$tot_tokens,sent_absdiff,pch=19, col=rgb(0,0,0,.5), log="x",main="Sentiment Measured by Absolute Difference in Counts", xlab = "Total Tokens", ylab = "Sentiment")
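The mechanics are easy to see with a toy simulation (made-up rates, not the inaugural data): if every token is independently positive with some probability and negative with a smaller one, the expected gap between the counts grows linearly with length.

```r
# Toy illustration: with fixed per-token rates, the expected pos - neg gap
# scales with document length. Rates here are made up for illustration.
p_pos <- 0.10
p_neg <- 0.05
lengths <- c(150, 1500, 9000)  # roughly Washington 1793 up to Harrison 1841

set.seed(42)
diffs <- sapply(lengths, function(n) {
  tokens <- sample(c("pos", "neg", "other"), n, replace = TRUE,
                   prob = c(p_pos, p_neg, 1 - p_pos - p_neg))
  sum(tokens == "pos") - sum(tokens == "neg")
})
rbind(length = lengths, pos_minus_neg = diffs)
```

Even with identical sentiment "rates," the longest document shows by far the largest raw difference.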
This leads us to the first constructed measure of sentiment actually recommended by Soroka and Young: the fraction of affect tokens that are positive minus the fraction of affect tokens that are negative (a percentage would be the same, just multiplied by 100).
lsd.df$posfrac <- lsd.df$pos/lsd.df$tot_affect
lsd.df$negfrac <- lsd.df$neg/lsd.df$tot_affect
sent_fracdiff <- lsd.df$posfrac - lsd.df$negfrac
names(sent_fracdiff) <- docnames(corp)
sort(sent_fracdiff, dec=T)[1:10]
1849-Taylor 1797-Adams 1977-Carter 1829-Jackson
0.7315436 0.5933333 0.5612245 0.5602837
1949-Truman 1989-Bush 1877-Hayes 1885-Cleveland
0.5574230 0.5529412 0.5516014 0.5438596
1793-Washington 1973-Nixon
0.5384615 0.5371179
sort(sent_fracdiff, dec=F)[1:10]
1865-Lincoln 1813-Madison 1861-Lincoln 1961-Kennedy
-0.07500000 0.09278351 0.13279133 0.16582915
1965-Johnson 2009-Obama 1805-Jefferson 1993-Clinton
0.20000000 0.23717949 0.26640927 0.27586207
1893-Cleveland 1913-Wilson
0.27976190 0.28301887
summary(sent_fracdiff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0750 0.3225 0.4052 0.3971 0.4779 0.7315
mn.sent_fracdiff <- mean(sent_fracdiff)
plot(docvars(corp)$Year,sent_fracdiff, type="l", main="Sentiment by Fraction Difference", xlab="Year", ylab = "Sentiment", ylim=c(-1,1))
lines(c(1700,3000),c(0,0), col="gray")
lines(c(1700,3000),c(mn.sent_fracdiff,mn.sent_fracdiff), col="gray", lty=2)
(Note that since affect tokens = positive tokens + negative tokens, this measure is functionally equivalent to just the fraction of affect tokens that are positive. The fraction difference is centered on 0 and runs from -1 to 1; the positive fraction is centered on .5 and runs from 0 to 1. They are correlated, by definition, at exactly +1.)
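That functional equivalence is easy to verify with made-up counts, since negfrac = 1 − posfrac implies posfrac − negfrac = 2 × posfrac − 1:

```r
# With affect = pos + neg, the fraction difference is an affine function
# of the positive fraction (made-up counts for illustration).
pos <- c(100, 50, 10)
neg <- c(20, 50, 40)
posfrac  <- pos / (pos + neg)
fracdiff <- posfrac - neg / (pos + neg)
all.equal(fracdiff, 2 * posfrac - 1)  # TRUE: perfectly linearly related
```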
sent_frac <- lsd.df$posfrac
names(sent_frac) <- docnames(corp)
sort(sent_frac, dec=T)[1:10]
1849-Taylor 1797-Adams 1977-Carter 1829-Jackson
0.8657718 0.7966667 0.7806122 0.7801418
1949-Truman 1989-Bush 1877-Hayes 1885-Cleveland
0.7787115 0.7764706 0.7758007 0.7719298
1793-Washington 1973-Nixon
0.7692308 0.7685590
sort(sent_frac, dec=F)[1:10]
1865-Lincoln 1813-Madison 1861-Lincoln 1961-Kennedy
0.4625000 0.5463918 0.5663957 0.5829146
1965-Johnson 2009-Obama 1805-Jefferson 1993-Clinton
0.6000000 0.6185897 0.6332046 0.6379310
1893-Cleveland 1913-Wilson
0.6398810 0.6415094
summary(sent_frac)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4625 0.6613 0.7026 0.6986 0.7390 0.8658
mn.sent_frac <- mean(sent_frac)
plot(docvars(corp)$Year,sent_frac, type="l", main="Sentiment by Fraction", xlab="Year", ylab = "Sentiment", ylim=c(0,1))
lines(c(1700,3000),c(0.5,0.5), col="gray")
lines(c(1700,3000),c(mn.sent_frac,mn.sent_frac), col="gray", lty=2)
plot(lsd.df$tot_tokens,sent_fracdiff,pch=19, col=rgb(0,0,0,.5), ylim = c(-1,1), log="x",main="Sentiment Measured by Fraction Difference", xlab = "Total Tokens", ylab = "Sentiment")
lines(c(1,10000),c(0,0), col="gray")
lines(c(1,10000),c(mn.sent_fracdiff,mn.sent_fracdiff), col="gray", lty=2)
Or we can start down the “Fightin’ Words” path and look at the logratio:
sent_logratio <- log(lsd.df$pos+1) - log(lsd.df$neg +1)
names(sent_logratio) <- docnames(corp)
sort(sent_logratio, dec=T)[1:10]
1849-Taylor 1797-Adams 1977-Carter 1949-Truman
1.823012 1.353505 1.252763 1.249185
1829-Jackson 1989-Bush 1877-Hayes 1885-Cleveland
1.243794 1.232862 1.230189 1.205858
1973-Nixon 1897-McKinley
1.187166 1.150466
sort(sent_logratio, dec=F)[1:10]
1865-Lincoln 1813-Madison 1861-Lincoln 1961-Kennedy
-0.1466035 0.1841925 0.2657032 0.3313571
1965-Johnson 2009-Obama 1805-Jefferson 1993-Clinton
0.4001601 0.4803664 0.5415973 0.5596158
1893-Cleveland 1913-Wilson
0.5712574 0.5761755
summary(sent_logratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1466 0.6663 0.8517 0.8479 1.0137 1.8230
mn.sent_logratio <- mean(sent_logratio)
plot(docvars(corp)$Year,sent_logratio, type="l", main="Sentiment by Logratio", xlab="Year", ylab = "Sentiment", ylim=c(-2,2))
lines(c(1700,3000),c(0,0), col="gray")
lines(c(1700,3000),c(mn.sent_logratio,mn.sent_logratio), col="gray", lty=2)
The logratio measure is correlated with the fraction measure at .99; the differences become more substantial when the fractions involved are more extreme.
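Where the two diverge is visible in hypothetical documents with a fixed number of affect tokens: the fraction difference is bounded in [-1, 1], while the logratio keeps growing as the mix gets more lopsided.

```r
# Hypothetical documents with 100 affect tokens each, from balanced to
# extremely positive (made-up counts for illustration).
pos <- c(50, 70, 90, 99)
neg <- 100 - pos
fracdiff <- (pos - neg) / (pos + neg)    # bounded in [-1, 1]
logratio <- log(pos + 1) - log(neg + 1)  # unbounded
round(cbind(pos, neg, fracdiff, logratio), 2)
```

Between 90/10 and 99/1 the fraction difference barely moves (0.80 to 0.98), while the logratio more than doubles.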
One advantage of the logratio is that we can subtract out the corpus-wide base rates, if that is a desirable thing to do.
tot_pos_count <- sum(lsd.df$pos+1)
tot_neg_count <- sum(lsd.df$neg+1)
sent_rellogratio <- log(lsd.df$pos+1) - log(tot_pos_count) - (log(lsd.df$neg +1) - log(tot_neg_count))
names(sent_rellogratio) <- docnames(corp)
sort(sent_rellogratio, dec=T)[1:10]
1849-Taylor 1797-Adams 1977-Carter 1949-Truman
0.9986027 0.5290953 0.4283537 0.4247759
1829-Jackson 1989-Bush 1877-Hayes 1885-Cleveland
0.4193850 0.4084525 0.4057794 0.3814485
1973-Nixon 1897-McKinley
0.3627564 0.3260564
sort(sent_rellogratio, dec=F)[1:10]
1865-Lincoln 1813-Madison 1861-Lincoln 1961-Kennedy
-0.9710128 -0.6402168 -0.5587061 -0.4930522
1965-Johnson 2009-Obama 1805-Jefferson 1993-Clinton
-0.4242492 -0.3440429 -0.2828120 -0.2647935
1893-Cleveland 1913-Wilson
-0.2531519 -0.2482338
summary(sent_rellogratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.97101 -0.15812 0.02727 0.02344 0.18925 0.99860
mn.sent_rellogratio <- mean(sent_rellogratio)
plot(docvars(corp)$Year,sent_rellogratio, type="l", main="Sentiment by Relative Logratio", xlab="Year", ylab = "Sentiment", ylim=c(-1,1))
lines(c(1700,3000),c(0,0), col="gray")
lines(c(1700,3000),c(mn.sent_rellogratio,mn.sent_rellogratio), col="gray", lty=2)
A bigger advantage is the “Fightin’ Words” logic, which lets us correct for the heteroskedasticity arising from document length. When a, b, c, and d are Poisson-distributed counts, the standard error of log((a/b)/(c/d)) is approximately sqrt(1/a + 1/b + 1/c + 1/d).
se.sent_rellogratio = sqrt(1/(lsd.df$pos+1) + 1/tot_pos_count + 1/(lsd.df$neg+1) + 1/tot_neg_count)
sent_zlogratio = sent_rellogratio / se.sent_rellogratio
summary(sent_zlogratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.2663 -1.6414 0.1917 0.0249 1.2690 4.2353
plot(lsd.df$tot_tokens,sent_zlogratio,pch=19, col=rgb(0,0,0,.5), ylim = c(-6,6), log="x",main="Sentiment Measured by Relative Logratio Z-Score", xlab = "Total Tokens", ylab = "Sentiment")
lines(c(1,10000),c(0,0), col="gray")
#lines(c(1,10000),c(mn.sent_fracdiff,mn.sent_fracdiff), col="gray", lty=2)
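As a quick sanity check on that standard-error approximation (a simulation with made-up Poisson rates, not the inaugural counts), the empirical standard deviation of the log-odds-ratio should come out close to the formula:

```r
# Simulate Poisson counts a, b, c, d and compare the empirical SD of
# log((a/b)/(c/d)) to sqrt(1/a + 1/b + 1/c + 1/d) at the expected counts.
set.seed(123)
rates <- c(a = 200, b = 90, c = 5000, d = 2500)  # made-up expected counts
draws <- replicate(10000, rpois(4, rates))
lors  <- log(draws[1, ] / draws[2, ]) - log(draws[3, ] / draws[4, ])
c(empirical = sd(lors), approx = sqrt(sum(1 / rates)))
```

The two numbers agree to two decimal places, which is what licenses treating sent_rellogratio / se as an approximate z-score.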
sort(sent_zlogratio,dec=T)[1:10]
1849-Taylor 1797-Adams 1949-Truman 1897-McKinley
4.235272 3.688378 3.320362 3.061199
1877-Hayes 1989-Bush 1977-Carter 1885-Cleveland
2.835939 2.720086 2.493854 2.422218
1973-Nixon 1829-Jackson
2.319923 2.082887
sort(sent_zlogratio,dec=F)[1:10]
1861-Lincoln 1813-Madison 1865-Lincoln 1961-Kennedy
-5.266341 -4.432318 -4.372110 -3.424157
2009-Obama 1965-Johnson 1893-Cleveland 1805-Jefferson
-2.931879 -2.591838 -2.211094 -2.184591
1841-Harrison 1881-Garfield
-2.009732 -1.837473
So … four of the five most negative (or least positive) inaugural addresses are Lincoln’s two, Kennedy’s, and Obama’s first? All of those are considered among the most inspirational, uplifting, hopeful inaugurals ever. What gives?
Maybe it’s Lexicoder. OK, let’s drop $10 and try the LIWC dictionaries …
time passes … money flows
liwc.df <- read.csv("LIWC-Inaugurals.csv", header=TRUE)
sent_liwc <- liwc.df$Tone
names(sent_liwc) <- docnames(corp)
plot(lsd.df$tot_tokens,sent_liwc,pch=19, col=rgb(0,0,0,.5), ylim = c(0,100), log="x",main="Tone Measured by LIWC", xlab = "Total Tokens", ylab = "Sentiment")
lines(c(1,10000),c(50,50), col="gray")
This is somewhat similar to sent_fracdiff from Lexicoder: correlated at .67, more variable for shorter documents, and in agreement that Lincoln's and Kennedy's inaugurals were among the most negative. LIWC appears to saturate at the positive end, making it difficult to see relative differences among the most positive speeches.
cor(sent_liwc, sent_fracdiff) # correlated .67
sort(sent_liwc,dec=T)[1:10]
sort(sent_liwc,dec=F)[1:10]
So, again, what gives? What gives is that dictionary-based sentiment analysis, especially with dictionaries built for contexts different from the application at hand, is often so noisy as to be effectively useless.
lincolnlines <- c("With malice toward none;", "with charity for all;","let us strive on to finish the work we are in;", "to bind up the nation’s wounds;","to care for him who shall have borne the battle, and for his widow, and his orphan—")
lincolnlines
[1] "With malice toward none;"
[2] "with charity for all;"
[3] "let us strive on to finish the work we are in;"
[4] "to bind up the nation’s wounds;"
[5] "to care for him who shall have borne the battle, and for his widow, and his orphan—"
dfm(lincolnlines,
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
dictionary = data_dictionary_LSD2015)
Document-feature matrix of: 5 documents, 4 features (75.0% sparse).
5 x 4 sparse Matrix of class "dfm"
features
docs negative positive neg_positive neg_negative
text1 1 0 0 0
text2 0 1 0 0
text3 0 0 0 0
text4 1 0 0 0
text5 2 1 0 0
Consider this Kennedy line:
dfm("Let us never negotiate out of fear. But let us never fear to negotiate.",
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
dictionary = data_dictionary_LSD2015)
Document-feature matrix of: 1 document, 4 features (75.0% sparse).
1 x 4 sparse Matrix of class "dfm"
features
docs negative positive neg_positive neg_negative
text1 2 0 0 0
Or these:
jfklines <- c("The graves of young Americans who answered the call to service surround the globe.", "Now the trumpet summons us again", "-- not as a call to bear arms, though arms we need;", "not as a call to battle, though embattled we are"," -- but a call to bear the burden of a long twilight struggle, year in and year out,", '\"rejoicing in hope, patient in tribulation\"', "-- a struggle against the common enemies of man: tyranny, poverty, disease, and war itself.","Can we forge against these enemies a grand and global alliance, North and South, East and West, that can assure a more fruitful life for all mankind?")
jfklines
[1] "The graves of young Americans who answered the call to service surround the globe."
[2] "Now the trumpet summons us again"
[3] "-- not as a call to bear arms, though arms we need;"
[4] "not as a call to battle, though embattled we are"
[5] " -- but a call to bear the burden of a long twilight struggle, year in and year out,"
[6] "\"rejoicing in hope, patient in tribulation\""
[7] "-- a struggle against the common enemies of man: tyranny, poverty, disease, and war itself."
[8] "Can we forge against these enemies a grand and global alliance, North and South, East and West, that can assure a more fruitful life for all mankind?"
dfm(jfklines,
tolower = TRUE, # casefold
stem = FALSE, # do not stem
remove_punct = TRUE, # remove punctuation
dictionary = data_dictionary_LSD2015)
Document-feature matrix of: 8 documents, 4 features (78.1% sparse).
8 x 4 sparse Matrix of class "dfm"
features
docs negative positive neg_positive neg_negative
text1 1 0 0 0
text2 0 0 0 0
text3 0 0 0 0
text4 1 0 0 0
text5 3 0 0 0
text6 0 2 0 0
text7 7 0 0 0
text8 2 4 0 0
It’s very difficult to build a dictionary that captures more signal than noise, especially across different sorts of contexts.