In this notebook we will work through a basic classification problem using the movie reviews data set. We know the "negative" or "positive" label for each review. We'll set some of these aside as a test set and train our models on the remainder, using unigram presence or counts as the features. Then we'll evaluate the predictions quantitatively as well as look at some ways to interpret what the models tell us.

We'll start with Naive Bayes, move to logistic regression and its ridge and LASSO variants, then support vector machines and finally random forests. We'll also combine the models to examine an ensemble prediction.

We'll use these packages (install if necessary):

library(dplyr)
library(quanteda)
library(quanteda.textmodels)
library(caret)

We'll start with the example given in the quanteda documentation. Read in the Pang and Lee dataset of 2000 movie reviews. (This appears to be the same 2000 reviews you used in the dictionary exercise, but in a different order.)

corpus <- data_corpus_moviereviews
summary(corpus,5)
Corpus consisting of 2000 documents, showing 5 documents:

            Text Types Tokens Sentences sentiment   id1   id2
 cv000_29416.txt   354    841         9       neg cv000 29416
 cv001_19502.txt   156    278         1       neg cv001 19502
 cv002_17424.txt   276    553         3       neg cv002 17424
 cv003_12683.txt   313    555         2       neg cv003 12683
 cv004_12641.txt   380    841         2       neg cv004 12641

Shuffle the rows to randomize the order.

set.seed(1234)
id_train <- sample(1:2000,1500, replace=F)
head(id_train, 10)
 [1]  228 1244 1218 1245 1719 1278   19  464 1327 1024

Use the 1500 for a training set and the other 500 as your test set. Create dfms for each.

docvars(corpus, "id_numeric") <- 1:ndoc(corpus)
dfmat_train <- corpus_subset(corpus, id_numeric %in% id_train) %>% tokens() %>% dfm() #%>% dfm_weight(scheme="boolean")
dfmat_test <- corpus_subset(corpus, !(id_numeric %in% id_train)) %>% tokens() %>% dfm() #%>% dfm_weight(scheme="boolean")

Naive Bayes

Naive Bayes is a built-in model in quanteda.textmodels, so it's easy to use:

sentmod.nb <- textmodel_nb(dfmat_train, docvars(dfmat_train, "sentiment"), distribution = "Bernoulli")
summary(sentmod.nb)

Call:
textmodel_nb.dfm(x = dfmat_train, y = docvars(dfmat_train, "sentiment"), 
    distribution = "Bernoulli")

Class Priors:
(showing first 2 elements)
neg pos 
0.5 0.5 

Estimated Feature Scores:
      plot      :    two    teen couples     go     to
neg 0.5129 0.6360 0.4980 0.05683 0.01488 0.3545 0.9959
pos 0.3634 0.6157 0.5072 0.03399 0.01307 0.3752 0.9974
         a  church   party      ,   drink    and   then
neg 0.9946 0.01894 0.05683 0.9959 0.01488 0.9973 0.4763
pos 0.9987 0.02353 0.04706 0.9974 0.01699 0.9974 0.4052
      drive      .   they    get   into     an accident
neg 0.04871 0.9986 0.7212 0.5507 0.6428 0.8836  0.03924
pos 0.03529 0.9987 0.7085 0.5346 0.6654 0.9007  0.04052
       one     of    the    guys    dies    but    his
neg 0.8714 0.9959 0.9973 0.12720 0.04871 0.9499 0.8539
pos 0.8784 0.9987 0.9987 0.09935 0.05098 0.9477 0.9007
    girlfriend continues
neg    0.08390   0.02842
pos    0.07974   0.05752

Use the dfm_match command to limit dfmat_test to features (words) that appeared in the training data:

dfmat_matched <- dfm_match(dfmat_test, features=featnames(dfmat_train))

How did we do? Let's look at a "confusion" matrix.

actual_class <- docvars(dfmat_matched, "sentiment")
predicted_class <- predict(sentmod.nb, newdata=dfmat_matched)
tab_class <- table(actual_class,predicted_class)
tab_class
            predicted_class
actual_class neg pos
         neg 225  38
         pos  54 183

Not bad, considering. Let's put some numbers on that:

caret::confusionMatrix(tab_class, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class
actual_class neg pos
         neg 225  38
         pos  54 183
                                         
               Accuracy : 0.816          
                 95% CI : (0.7792, 0.849)
    No Information Rate : 0.558          
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.6298         
                                         
 Mcnemar's Test P-Value : 0.1179         
                                         
            Sensitivity : 0.8281         
            Specificity : 0.8065         
         Pos Pred Value : 0.7722         
         Neg Pred Value : 0.8555         
              Precision : 0.7722         
                 Recall : 0.8281         
                     F1 : 0.7991         
             Prevalence : 0.4420         
         Detection Rate : 0.3660         
   Detection Prevalence : 0.4740         
      Balanced Accuracy : 0.8173         
                                         
       'Positive' Class : pos            
                                         

Given the balance in the data among negatives and positives, "Accuracy" isn't a bad place to start. Here we have Accuracy of 81.6%.

The most commonly reported alternatives I see are "F1" (the harmonic mean of Precision and Recall, see below) and "AUC" ("area under the curve," referring to the "ROC" or "Receiver Operating Characteristic" curve, discussed in the ROC curve section at the end of this notebook).

If the classes are imbalanced, accuracy becomes a less useful indicator. If, for example, you have 95% positive cases and 5% negative cases in your test set, a model which simply predicts "positive" without even looking at the data would get an accuracy of 0.95.

Similarly, if you care about positive and negative cases differently -- say, your classifier is a Covid test or a jury decision, or negative is a "null hypothesis" -- you will care about precision/recall or specificity/sensitivity or TPR/FPR/TNR/FNR or PDR/NDR.

"Sensitivity" and "recall" and "true positive rate" (TPR) and "hit rate" and "probability of detection" are all the same thing -- the probability that a positive case is classified as a positive. That is, the extent to which the classifier captures the positive cases it should capture. This is also the same as 1-FNR, the false negative rate. If negative cases are the "null," then FNR is the probability of "Type II error."

"Specificity" or "selectivity" is the "true negative rate," the probability that negative cases are classified as negative. This is also 1-FPR, where FPR is the "false positive rate," the probability of "false alarm" -- classifying a negative case as positive -- also known as a "Type I error."

"Precision" or "positive detection rate" (PDR) is the probability that a case classified as positive is actually positive. "Negative detection rate" (NDR) is the probability that a case classified as negative is actually negative.

Let's wave our hands at validation by doing some sniff tests. What are the most positive and negative words?

The conditional posterior estimates, the "estimated feature scores" displayed by the summary command above, are stored in the model object's param attribute as a \(K \times V\) matrix, where \(K\) is the number of classes and \(V\) is the vocabulary size. Here are the first 10 columns.

sentmod.nb$param[,1:10]
         plot         :       two       teen    couples
neg 0.5128552 0.6359946 0.4979702 0.05683356 0.01488498
pos 0.3633987 0.6156863 0.5071895 0.03398693 0.01307190
           go        to         a     church      party
neg 0.3545332 0.9959405 0.9945873 0.01894452 0.05683356
pos 0.3751634 0.9973856 0.9986928 0.02352941 0.04705882

These are the posterior probabilities that a word will appear in a document of a given class, \(p(w|k)\). What we want is the posterior probability that a document is of a given class, given that it contains that word, \(p(k|w)\). This is obtained by Bayes Theorem; since the class priors here are equal (0.5 each), they cancel: \(p(\text{pos}|w) = \frac{p(w|\text{pos})}{p(w|\text{pos}) + p(w|\text{neg})}\)

#Most positive words
sort(sentmod.nb$param["pos",]/colSums(sentmod.nb$param),dec=T)[1:20]
outstanding    seamless spielberg's    lovingly 
  0.9609883   0.9354430   0.9311493   0.9262437 
   flawless  astounding     winslet      winter 
  0.9205855   0.9205855   0.9205855   0.9205855 
    recalls        lore     gattaca      annual 
  0.9139870   0.9139870   0.9139870   0.9139870 
  addresses       mulan masterfully        deft 
  0.9139870   0.9139870   0.9139870   0.9139870 
     online  continuing    missteps   discussed 
  0.9061925   0.9061925   0.9061925   0.9061925 

There's reasonable stuff there: "outstanding", "seamless", "lovingly", "flawless". There's also some evidence of overfitting: "spielberg's", "winslet", "gattaca", "mulan". We'll see support for the overfitting conclusion below.

#Most negative words
sort(sentmod.nb$param["neg",]/colSums(sentmod.nb$param),dec=T)[1:20]
  ludicrous     spoiled         pen   insulting 
  0.9430619   0.9308311   0.9192703   0.9192703 
     racing degenerates perfunctory      bounce 
  0.9192703   0.9192703   0.9119085   0.9030693 
    misfire      feeble      horrid    weaponry 
  0.9030693   0.9030693   0.8922583   0.8922583 
       1982        3000      bursts    wielding 
  0.8922583   0.8922583   0.8922583   0.8922583 
  campiness   macdonald         wee      stalks 
  0.8922583   0.8922583   0.8922583   0.8922583 

Let's get a bird's-eye view.

post.pos <- sentmod.nb$param["pos",]/colSums(sentmod.nb$param)
# Plot weights
plot(colSums(dfmat_train),post.pos, pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Posterior Probabilities, Naive Bayes Classifier, IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances")
text(colSums(dfmat_train),post.pos, colnames(dfmat_train),pos=4,cex=5*abs(.5-post.pos), col=rgb(0,0,0,1.5*abs(.5-post.pos)))

Let's look a little more closely at the negative words.

# Plot weights
plot(colSums(dfmat_train),post.pos, pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Posterior Probabilities, Naive Bayes Classifier, IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim=c(10,1000),ylim=c(0,.25))
text(colSums(dfmat_train),post.pos, colnames(dfmat_train),pos=4,cex=5*abs(.5-post.pos), col=rgb(0,0,0,1.5*abs(.5-post.pos)))

And a little more closely at the positive words:

# Plot weights
plot(colSums(dfmat_train),post.pos, pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Posterior Probabilities, Naive Bayes Classifier, IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim=c(10,1000),ylim=c(0.75,1.0))
text(colSums(dfmat_train),post.pos, colnames(dfmat_train),pos=4,cex=5*abs(.5-post.pos), col=rgb(0,0,0,1.5*abs(.5-post.pos)))

Let's look a little more closely at the document predictions.

predicted_prob <- predict(sentmod.nb, newdata=dfmat_matched, type="probability")
dim(predicted_prob)
[1] 500   2
head(predicted_prob)
                         neg          pos
cv007_4992.txt  1.0000000000 1.148809e-17
cv008_29326.txt 0.0002205452 9.997795e-01
cv011_13044.txt 0.9999997633 2.366626e-07
cv014_15600.txt 1.0000000000 2.289090e-13
cv016_4348.txt  1.0000000000 7.290510e-19
cv022_14227.txt 1.0000000000 1.498903e-14
summary(predicted_prob)
      neg              pos          
 Min.   :0.0000   Min.   :0.000000  
 1st Qu.:0.0000   1st Qu.:0.000000  
 Median :0.9943   Median :0.005718  
 Mean   :0.5593   Mean   :0.440708  
 3rd Qu.:1.0000   3rd Qu.:1.000000  
 Max.   :1.0000   Max.   :1.000000  
plot(density(predicted_prob[,"pos"]), main = "Predicted Probabilities from Naive Bayes classifier", xlab="", ylab = "")
rug(predicted_prob[,"pos"])

You can see one problem there with the "naive" part of Naive Bayes. By treating all of the features (words) as independent, the model thinks it has seen far more information than it really has, and is therefore far more confident about its predictions than is warranted.
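A toy illustration of the problem: suppose a review contains 100 words that are each only mildly positive, say \(p(w|\text{pos}) = 0.6\) versus \(p(w|\text{neg}) = 0.4\). Treating them as 100 independent pieces of evidence multiplies those small edges into near-certainty:

# 100 weakly informative words treated as independent (equal class priors)
lik_pos <- prod(rep(0.6, 100))
lik_neg <- prod(rep(0.4, 100))
lik_pos / (lik_pos + lik_neg)  # essentially 1: the model is "certain"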

What's the most positive review in the test set according to this?

# sort by *least negative* since near zero aren't rounded
sort.list(predicted_prob[,"neg"], dec=F)[1]
[1] 440
id_test <- !((1:2000) %in% id_train)
writeLines(as.character(corpus)[id_test][440])
here's a word analogy : amistad is to the lost world as schindler's list is to jurassic park . 
in 1993 , after steven spielberg made the monster dino hit , many critics described schindler's list as the director's " penance " ( as if there was a need for him to apologize for making a crowd-pleasing blockbuster ) . 
now , after a three-year layoff , spielberg is back with a vengeance . 
once again , his summer release was special effects-loaded action/adventure flick with dinosaurs munching on human appetizers . 
now , following his 1993 pattern , he has fashioned another serious , inspirational christmas release about the nature of humanity . 
that film is amistad . 
although not as masterful as schindler's list , amistad is nevertheless a gripping motion picture . 
thematically rich , impeccably crafted , and intellectually stimulating , the only area where this movie falls a little short is in its emotional impact . 
watching schindler's list was a powerful , almost spiritual , experience . 
spielberg pulled us into the narrative , absorbed us in the drama , then finally let us go , exhausted and shattered , three-plus hours later . 
aspects of the movie have stayed with me ever since . 
amistad , while a fine example of film making , is not as transcendent . 
the incident of the ship la amistad is not found in any history books , but , considering who writes the texts , that's not a surprise . 
however , the event is a part of the american social and legal fabric , and , while amistad does not adhere rigorously to the actual account , most of the basic facts are in order . 
several , mostly minor changes have been made to enhance the film's dramatic force . 
on the whole , while amistad may not be faithful to all of the details of the situation , it is true to the spirit and meaning of what transpired . 
one stormy night during the summer of 1839 , the 53 men imprisoned on the spanish slave ship la amistad escape . 
led by the lion-hearted cinque ( djimon hounsou ) , they take control of the vessel , killing most of the crew . 
adrift somewhere off the coast of cuba and uncertain how to make their way back to africa , they rely on the two surviving spaniards to navigate the eastward journey . 
they are tricked , however , and the la amistad , which makes its way northward off the united states' eastern coastline , is eventually captured by an american naval ship near connecticut . 
the kidnapped africans are shackled and thrown into prison , charged with murder and piracy . 
the first men to come to the africans' defense are abolitionists theodore joadson ( morgan freeman ) and lewis tappan ( stellan skarsgard ) . 
they are soon joined by roger baldwin ( matthew mcconaughey ) , a property attorney of little repute . 
aided by advice from former president john quincy adams ( anthony hopkins ) , baldwin proves a more persuasive orator than anyone gave him credit for , and his central argument  -  that the prisoners were illegally kidnapped free men , not property  -  convinces the judge . 
but powerful forces have aligned against baldwin's cause . 
current president martin van buren ( nigel hawthorne ) , eager to please southern voters and 11-year old queen isabella of spain ( anna paquin ) , begins pulling strings behind-the-scenes to ensure that none of the africans goes free . 
at its heart , amistad is a tale of human courage . 
cinque is a heroic figure whose spirit remains unbreakable regardless of the pain and indignity he is subjected to . 
he is a free man , not a slave , and , while he recognizes that he may die as a result of his struggle , he will not give it up . 
effectively portrayed by newcomer djimon hounsou , whose passion and screen presence arrest our attention , cinque is the key to viewers seeing the amistad africans as more than symbols in a battle of ideologies . 
they are individuals , and our ability to make that distinction is crucial to the movie's success . 
to amplify this point , spielberg presents many scenes from the africans' point-of-view , detailing their occasionally-humorous observations about some of the white man's seemingly-strange " rituals " . 
the larger struggle is , of course , one of defining humanity . 
as the nazis felt justified in slaughtering jews because they viewed their victims as " sub-human , " so the pro-slavery forces of amistad use a similar defense . 
the abolitionists regard the africans as men , but the slavers and their supporters see them as animals or property . 
in a sense , the morality of slavery is on trial here with the specter of civil war , which would break out less than three decades later , looming over everything . 
amistad's presentation of the legal and political intricacies surrounding the trial are fascinating , making this movie one of the most engrossing courtroom dramas in recent history . 
four claimants come forward against the africans : the state , which wants them tried for murder ; the queen of spain , who wants them handed over to her under the provision of an american/spanish treaty ; two american naval officers , who claim the right of high seas salvage ; and the two surviving spaniards from la amistad , who demand that their property be returned to them . 
baldwin must counter all of these claims , while facing a challenge to his own preconceived notions as the result of a relationship he develops with cinque . 
even though attorney and client are divided by a language barrier , they gradually learn to communicate . 
aside from cinque , who is a fully-realized individual , characterization is spotty , but the acting is top-notch . 
matthew mcconaughey successfully overcomes his " pretty boy " image to become baldwin , but the lawyer is never particularly well-defined outside of his role in the la amistad case . 
likewise , while morgan freeman and stellan skarsgard are effective as joadson and tappan , they are never anything more than " abolitionists . " 
nigel hawthorne , who played the title character in the madness of king george , presents martin van buren as a spineless sycophant to whom justice means far less than winning an election . 
finally , there's anthony hopkins , whose towering portrayal of john quincy adams is as compelling as anything the great actor has recently done . 
hopkins , who can convincingly play such diverse figures as a serial killer , an emotionally-crippled english butler , and richard nixon , makes us believe that he is adams . 
his ten-minute speech about freedom and human values is unforgettable . 
one point of difference worth noting between amistad and schindler's list is this film's lack of a well-defined human villain . 
schindler's list had ralph fiennes' superbly-realized amon goeth , who was not only a three-dimensional character , but a personification of all that the nazis stood for . 
there is no such figure in amistad . 
the villain is slavery , but an ideology , no matter how evil , is rarely the best adversary . 
it is to spielberg's credit that he has fashioned such a compelling motion picture without a prominent antagonist . 
amistad's trek to the screen , which encountered some choppy waters ( author barbara chase-riboud has cried plagiarism , a charge denied by the film makers ) , comes in the midst of an upsurge of interest in the incident . 
an opera of the same name opened in chicago on november 29 , 1997 . 
numerous books about the subject are showing up on bookstore shelves . 
it remains to be seen how much longevity the amistad phenomena has , but one thing is certain  -  with spielberg's rousing , substantive film leading the way , the spotlight has now illuminated this chapter of american history . 

Looks like "Amistad." A genuinely positive review, but note how many times "spielberg" is mentioned. The prediction is biased toward positive just because Spielberg had positive reviews in the training set. We may not want that behavior.

Note also that this is a very long review.

# sort by *least pos* since near zero aren't rounded
sort.list(predicted_prob[,"pos"], dec=F)[1]
[1] 211
writeLines(as.character(corpus)[id_test][211])
and i thought " stigmata " would be the worst religiously-oriented thriller released this year . 
turns out i was wrong , because while " stigmata " was merely boring and self-important , " end of days " is completely inept on all fronts . 
it's a silly , incomprehensible , endlessly stupid mess . 
for a guy like me who grew up watching arnold schwarzenegger at his best , it's extremely disconcerting to see where the big man has ended up . 
for the first time in recent memory , an arnold action movie ( and " batman & robin " doesn't count ) is no fun at all . 
 " end of days " is a major stinker . 
the movie opens in vatican city , 1979 . 
some catholic priests have observed an ancient prophecy , which says that a girl will be born on that night that satan will have targeted for impregnation . 
if he impregnates her between 11 and midnight on december 31 , 1999 , the world will be destroyed . 
the pope orders protection of this girl , though some priests believe she ought to be killed . 
in new york , that very night , a girl is born to fulfill the prophecy . 
twenty years later , we meet jericho cane ( schwarzenegger ) , a suicidal ex-cop with a drinking problem . 
now working as a security guard for hire , he is protecting a local businessman ( gabriel byrne ) , who is actually possessed by the devil . 
an assassination attempt on the businessman by a crazed former priest leads him to the girl satan is after , christine york ( robin tunney ) . 
recognizing elements of his own murdered daughter in christine ( including ownership of the same music box , apparently ) , jericho swears to protect her against the devil and the faction of priests looking to kill her . 
there are so many problems with this film it's hard to know where to begin , but how about starting with the concept ? 
casting arnold in a role like this was a mistake to begin with . 
schwarzenegger is a persona , not an actor , so putting him in a role that contradicts his usual strong personality is a bad idea . 
arnold has neither the dramatic range nor the speaking ability to pull off a character tormented by conflicting emotions . 
in other words , trying to give him dimension was a mistake . 
harrison ford , mel gibson , or even bruce willis could have played this role ( they've all played noble and flawed heroes ) , but not schwarzenegger . 
there are several scenes that attempt to establish jericho's character ; one has him contemplating suicide , another crying over the loss of his wife and daughter , and even one in which the devil tries to tempt him into revealing christine's location by offering him his old life back . 
none of these scenes really work , because arnie isn't up to the task . 
the filmmakers would have been better off making jericho a strong , confident character ( like the terminator , for example ) , the likes of which schwarzenegger has excelled in before . 
this one isn't at all believable the way arnold plays him . 
the supporting cast tries their hardest , and only gabriel byrne makes any impact at all . 
as the prince of darkness , he's suave and confident . 
he acts like one would expect the devil to act . 
the problem is that the script has him doing things that make no sense ( more on that later ) and that undermines him as a powerful villain . 
byrne out-performs arnold in every scene they have together ( including the aforementioned temptation bit ) , but this is problematic when it causes the audience to start doing the unthinkable : root for the devil . 
byrne's speech about the bible being " overrated " actually starts to make sense , mainly because arnold's attempts at refuting it ( mostly of the " 'tis not ! " 
variety ) are feeble at best . 
the only problem is , arnold has to win , so in the end , nobody really cares . 
kevin pollack plays jericho's security guard sidekick and tries to liven things up with some comic asides , but like most bad action movie sidekicks , he disappears after about an hour . 
robin tunney isn't given much to do except look scared . 
in fact , all of the supporting players are good actors , but none , save for byrne , is given anything interesting to do . 
performances aside , it would be really hard to enjoy this film no matter who starred in it . 
this being an action blockbuster , it's no surprise that the worst thing about it is the script , which starts off totally confusing , and when some of it is explained ( and not much of it is ) , it's utterly ridiculous . 
why is the devil coming on new year's eve , 1999 ? 
because it's exactly 1000 years after the year of the devil , which isn't 666 , it turns out . 
some nutty priest accidentally read it upside down , so the real year is 999 , so just add a 1 to the beginning , and you've got 1999 ! 
if you don't buy this explanation , you're not alone . 
it's convoluted and silly at the same time . 
the method by which jericho locates christine york is equally ludicrous ( she's christine , see , and she lives in new york , see . 
 . 
 . ) , and if that weren't bad enough , there's plenty of bothersome stuff in this film that isn't explained at all . 
why can satan kill everyone he passes on the street , but when it comes to snuffing out one drunk ex-cop , he's powerless ? 
is he impervious to only one kind of bullet ? 
how come he can't control jericho or christine ? 
and how did those gregorian monks deal with time zones in their prophecies ? 
a clumsy attempt at a joke is made about this , but it's never actually explained . 
usually , this sort of thing wouldn't matter in a schwarzenegger flick ( i mean , don't get me started on the time paradoxes offered up by the terminator movies ) , but this time the plot inconsistencies stand out even more than usual because the action is rarely exciting . 
there are several predictable horror film clich ? s present in " end of days , " complete with the old " black cat hiding in a cabinet " bit , not that we ever find out what the cat was doing in there . 
it gets so formulaic that it's possible for those uninterested in being scared to close their eyes at the precise moment a " boo " will come . 
their predictions will rarely be wrong . 
the more grandiose action sequences are utterly charmless , partially because we don't care about these characters ( due to the script's pathetic attempts at characterization and setup ) , and also because they , too , don't make any sense . 
there's a scene where schwarzenegger gets thrown around a room by a little old lady . 
it's good for a few chuckles , but not much else . 
supposedly we're to believe she now has super strength by virtue of being controlled by satan , but the script never sets that up , so the scene is merely silly . 
none of this is terribly exciting , because all the action sequences are so badly framed that it's often hard to tell why it's happening in the first place , not to mention that they're edited in full-on incomprehensible mtv quick-cut style . 
most of them had me scratching my head , rather than saying , " wow , cool ! " 
 " end of days " is not only silly and confusing , but it's also distinctly unpleasant to watch . 
the devil apparently doesn't operate in the more subtle , i'll-convince-people-to-kill-each-other fashion outlined in the bible , but instead enjoys killing people gruesomely in broad daylight . 
this doesn't only make him an awfully predictable sort , but it also means that not a single scene in " end of days " goes by without unnecessarily graphic violence , or the odd kinky sexual encounter ( yet another bit that had me scratching my head ) . 
if violence is supposed to be shocking , it's not a good idea to throw so much of it into a movie that the audience goes numb . 
scenes aren't connected through any reasonable means , so a lot of the time , stuff gets blown up , or people get killed , and i had no idea why . 
reasons ? 
to hell with reasons ! 
let's just blow stuff up ! 
isn't it cool ? 
nope , not by a long shot . 
this film is thoroughly unwatchable . 
it's dull , interminable , and unrelenting in its stupidity . 
perhaps arnold needs to make some movies with james cameron to revive his career , because it's not happening with hack peter hyams here . 
 " end of days " might have had camp value , if only it didn't top itself off with an overly pious ending that nobody's going to buy . 
if the movie is going to be serious , the filmmakers should have come up with a decent script . 
if it's going to be campy , arnold shouldn't be taking himself so damn seriously ( i didn't actually see him put up on a cross , did i ? ) , and his character shouldn't be such a sad sack . 
as it stands , " end of days " is just a bad movie , and an awfully gloomy one at that . 

Schwarzenegger's "End of Days."

It also should be clear enough that more words mean more votes, so longer documents are more clearly positive or negative. There's an argument for that. It also would underplay a review that read, in its entirety, "terrible," even though the review is 100% clear in its sentiment.
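As a rough sketch of that flip side, here is how you could see what the fitted Naive Bayes model makes of a hypothetical one-word review (the exact probabilities depend on the trained model):

# A hypothetical one-word review, matched to the training vocabulary
tiny_dfm <- dfm_match(dfm(tokens("terrible")), features = featnames(dfmat_train))
predict(sentmod.nb, newdata = tiny_dfm, type = "probability")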

What is it most confused about?

sort.list(abs(predicted_prob[,"pos"] - .5), dec=F)[1]
[1] 400
predicted_prob[400,]
      neg       pos 
0.5244155 0.4755845 

So ... the model says 52% chance negative, 48% positive.

writeLines(as.character(corpus)[id_test][400])
capsule : this is a 1950s or 1960s style heist film , set in the present . 
robert deniro stars as a risk-adverse safecracker who wants to retire form crime but takes one last job at the request of a personal friend ( played by marlon brando ) . 
edward norton plays a hotshot young sharpster who is also in on the crime . 
the plot is mostly straightforward suspense with little nonsense . 
 , +2 ( -4 to +4 ) 
i am sure i must have seen almost the identical plot before . 
this is a heist film made for an adult audience who probably wanted a crime film like they had seen in theaters when they were teens . 
there are no superhuman acrobats taking nosedives off of buildings like in entrapment . 
there is no rock score . 
there are no ballet- like martial arts . 
this is just a basic heist film with a decent and distinctly credible and un-flashy script . 
nick ( played by robert deniro ) is a safecracker who has managed to be successful by never taking risks . 
if a job is not a safe bet ( pun intended ) , he backs out . 
sometimes even the safe bets turn out not to be so safe . 
when one job very nearly goes wrong nick is unnerved enough to decide that it is nature telling him that it is time to get out of the game . 
he returns to his home in montreal where he owns a jazz club , and decides to manage it full time . 
he proposes to his girl friend diane ( angela bassett ) . 
she has one condition . 
he must stay retired from crime . 
but before the deal can be cemented , max , a montreal kingpin and personal friend , has one last supposedly easy job for nick . 
nick wants no part particularly because the heist will be right in his hometown of montreal . 
more and more details seem to complicate the job . 
nick's partner in the crime is to be a smart , but uncontrollable young crook , jack ( edward norton ) . 
jack treats a locked front door like a welcome mat , even at his associates' homes . 
the young crook is a know-it-all who seems good at everything he does but at avoiding rubbing people the wrong way . 
together they plan to steal a priceless historic artifact from the montreal customs house . 
the script by kario salem , lem dobbs , and scott marshall smith works like an episode of the old " mission impossible " television series . 
we see pieces of the heist being put together , last minute changes , and things that go wrong , much like a good episode of " mission impossible . " 
this team might not be bad choices to write scripts for the tom cruise " mission impossible " films . 
the complications are , however no more and no fewer than are needed to make the story believable . 
the telling is cold and noirish , which is just what it is supposed to be . 
director frank oz , the voices of yoda and miss piggy proves surprisingly good at directing a serious crime film . 
the score has a more than adequate cast with little flashy or scene-stealing acting . 
edward norton probably has the flashiest role and even that is low-key by today's standards . 
he plays what is nearly a double role . 
jack pretends to be a brain damage victim to be hired for a job in the customs house . 
one nice ( ? ) 
character i have not mentioned is stephen ( jamie harrold ) . 
stephen is a master hacker who lives in his mother's basement in a house with a lot of screaming in both directions . 
he seems like the last person the risk adverse nick would want to depend upon . 
the film itself remains low-key up until the time of the climactic heist . 
then the pace really picks up . 
before that the plot even stops twice for jazz interludes . 
though oz never lets the music steal time from the story the way woody allen does in sweet and lowdown . 
on the subject of music , the score of the score is by howard shore . 
it adds tension to the suspense scenes , but never seems to have much of a melody . 
angela bassett is the one misused celebrity in a totally minor role that should have been played by a less famous actress who needed a break . 
she has nothing to do in the film but demand that nick give up crime and to look like an attractive reward if he does . 
speaking of being attractive the score seems to be attracting an older audience who learned to appreciate much the same sort of film in the 1950s and 1960s . 
it does the job . 
i rate it a 7 on the 0 to 10 scale and a +2 on the -4 to +4 scale . 

This is a review of "The Score," a deliberately low-key heist film. The reviewer ultimately comes down positive (a 7 out of 10, +2 on the -4 to +4 scale), but the praise is measured and hedged throughout ("low-key," "no rock score," "nothing flashy"), so it's not hard to see why the model sits on the fence.

Let's look at a clear mistake.

# among the first 250 test documents (all actually negative), find the one predicted most positively
sort.list(predicted_prob[1:250,"neg"],dec=F)[1]
[1] 196
predicted_prob[196,]
        neg         pos 
3.67913e-17 1.00000e+00 

So ... the model says DEFINITELY positive.

writeLines(as.character(corpus)[id_test][196])
weighed down by tired plot lines and spielberg's reliance on formulas , _saving private ryan_ is a mediocre film which nods in the direction of realism before descending into an abyss of cliches . 
there ought to be a law against steven spielberg making movies about truly serious topics . 
spielberg's greatest strength as a director is the polished , formulaic way in which every aspect of the film falls carefully into place to make a perfect story . 
but for a topic of such weight as combat in the second world war ( or the holocaust ) this technique backfires , for it creates coherent , comprehensible and redemptive narratives out of events whose size , complexity and evil are utterly beyond the reach of human ken . 
in this way spielberg trivializes the awesome evil of the stories he films . 
_saving private ryan_ tells the story of eight men who have been detailed on a " pr mission " to pull a young man , ryan ( whose three other brothers were just killed in fighting elsewhere ) out of combat on the normandy front just after d-day . 
ryan is a paratrooper who dropped behind enemy lines the night before the landings and became separated from his fellow soldiers . 
the search for him takes the eight soldiers across the hellish terrain of world war ii combat in france . 
there's no denying spielberg came within shouting distance of making a great war movie . 
the equipment , uniforms and weapons are superbly done . 
the opening sequence , in which captain miller ( tom hanks ) leads his men onto omaha beach , is quite possibly the closest anyone has come to actually capturing the unendurably savage intensity of modern infantry combat . 
another pleasing aspect of the film is spielberg's brave depiction of scenes largely unknown to american audiences , such as the shooting of prisoners by allied soldiers , the banality of death in combat , the routine foul-ups in the execution of the war , and the cynicism of the troops . 
the technical side of the film is peerless , as always . 
the camera work is magnificent , the pacing perfect , the sets convincing , the directing without flaw . 
hanks will no doubt be nominated for an oscar for his performance , which was utterly convincing , and the supporting cast was excellent , though ted danson seems a mite out of place as a paratroop colonel . 
yet the attempt at a realistic depiction of combat falls flat on its face because realism is not something which can be represented by single instances or events . 
it has to thoroughly permeate the context at every level of the film , or the story fails to convince . 
throughout the movie spielberg repeatedly showed only single examples of the grotesque wounds produced by modern mechanized devices ( exception : men are shown burning to death with relative frequency ) . 
for example , we see only one man with guts spilled out on the ground . 
here and there men lose limbs ; in one scene miller is pulling a man to safety , there's an explosion , and miller looks back to see he is only pulling half a man . 
but the rest of the corpses are remarkably intact . 
there are no shoes with only feet in them , no limbs scattered everywhere , no torsos without limbs , no charred corpses , and most importantly , all corpses have heads ( in fairness there are a smattering of wicked head wounds ) . 
the relentless dehumanization of the war , in which even corpses failed to retain any indentity , is soft-pedaled in the film . 
ultimately , _saving private ryan_ bows to both hollywood convention and the unwritten rules of wartime photography in its portrayal of wounds and death in war . 
rather than saying _saving private ryan_ is " realistic , " it would be better to describe it as " having realistic moments . " 
another aspect of the " hollywoodization " of the war is the lack of realistic dialogue and in particular , the lack of swearing . 
anyone familiar with the literature on the behavior of the men during the war , such as fussell's superb _wartime : understanding and behavior in the second world war_ ( which has an extensive discussion on swearing ) , knows that the troops swore fluently and without letup . 
 " who is this private ryan that we have to die for him ? " 
asks one infantrymen in the group of eight . 
rendered in wartime demotic , that should have been expressed as " who is this little pecker that we have to get our dicks shot off for him ? " 
or some variant thereof . 
conversations should have been literally sprinkled with the " f " word , and largely about ( the search for ) food and sex . 
this is all the more inexplicable because the movie already had an " r " rating due to violence , so swearing could not possibly have been eliminated to make it a family film . 
however , the most troubling aspect of the film is the spielbergization of the topic . 
the most intense hell humans have ever created for themselves is not emotionally wrenching enough for steven spielberg . 
he cannot just cede control to the material ; he has to be bigger than it . 
as if afraid to let the viewer find their own ( perhaps unsettled and not entirely clear ) emotional foothold in the material , spielberg has to package it in hallmark moments to give the war a meaning and coherence it never had : the opening and closing scenes of ryan and his family in the war cemetary ( reminscent of the closing scene from _schindler's list ) , the saccharine exchange between ryan and his wife at the close ( every bit as bad as schindler's monologue about how his car , tiepin or ring could have saved another jew ) , quotes from abraham lincoln and emerson , captain miller's last words to private ryan , and an unbelievable storyline in which a prisoner whom they free earlier in the movie comes back to kill the captain . 
that particular subplot is so hokey , so predictable , it nigh on ruins the film . 
nowhere in the film is there a resolute depiction of the meaninglessness , stupidity and waste which characterized the experience of war to the men who actually fought in combat ( imagine if miller had been killed by friendly fire or collateral damage ) . 
because of its failure to mine deeply into the terrible realities of world war ii , _saving private ryan_ can only pan for small truths in the shallows . 
 . 

Aha! A clearly negative review of "Saving Private Ryan."

This is at least partly an "overfitting" mistake. The model probably learned other "Saving Private Ryan" or "Spielberg movie" words -- "spielberg's" was #3 on our most-positive list above -- and effectively learned that "reviews that talk about Saving Private Ryan are probably positive."
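We can spot-check that with the word-level posteriors computed earlier (this assumes the tokens below are in the training vocabulary; any that are not will come back as NA):

# Posterior p(pos|w) for some movie-specific words the model may have latched onto
post.pos[c("spielberg's", "spielberg", "ryan")]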

Below, I'll give brief examples of some other classification models for this data.

Logistic regression, ridge regression, LASSO, and elastic net

We'll look at three (well, really only two) variants of the relatively straightforward regularized logistic regression model. These add a "regularization penalty" to the usual loss function of logistic regression. There are two basic types of regularization penalty: L1 and L2.

The L2 penalty imposes loss proportional to the sum of the squared coefficients. That is, it biases the model toward finding smaller coefficients. In Bayesian terms, this is also equivalent to using a Gaussian prior. Logistic regression with an L2 penalty is known as "ridge regression."

The L1 penalty imposes loss proportional to the sum of the absolute values of the coefficients. That is, it biases the model toward finding more coefficients equal exactly to zero. It has a feature selection effect. In Bayesian terms, this is also equivalent to using a Laplace prior. Logistic regression with an L1 penalty is known as the "LASSO" (Least Absolute Shrinkage and Selection Operator).

How much of a penalty is applied is determined by a weight, \(\lambda\). This is a "hyperparameter" that can be tuned during the training stage.

The elastic net combines both penalties, distributing the total regularization cost according to a second hyperparameter, \(\alpha\); both \(\lambda\) and \(\alpha\) can be tuned by cross-validation.
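Concretely, glmnet minimizes a penalized objective of roughly this form (written here for the binomial case, where the \(\beta_j\) are the feature coefficients and \(N\) is the number of training documents):

\[
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i;\beta_0,\beta)\;+\;\lambda\left[\frac{1-\alpha}{2}\sum_j \beta_j^2 \;+\; \alpha\sum_j |\beta_j|\right]
\]

Setting \(\alpha = 0\) gives the pure ridge (L2) penalty and \(\alpha = 1\) the pure LASSO (L1) penalty.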

library(glmnet)
library(doMC)

Ridge regression (Logistic with L2-regularization)

registerDoMC(cores=2) # parallelize to speed up
sentmod.ridge <- cv.glmnet(x=dfmat_train,
                   y=docvars(dfmat_train)$sentiment,
                   family="binomial", 
                   alpha=0,  # alpha = 0: ridge regression
                   nfolds=5, # 5-fold cross-validation
                   parallel=TRUE, 
                   intercept=TRUE,
                   type.measure="class")
plot(sentmod.ridge)

This shows classification error as \(\lambda\) (the total weight of the regularization penalty) is increased. The minimum error is at the leftmost dotted line, at about \(\log(\lambda) \approx 3\); the corresponding value of \(\lambda\) is stored in lambda.min.

# actual_class <- docvars(dfmat_matched, "sentiment")
predicted_value.ridge <- predict(sentmod.ridge, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.ridge <- rep(NA,length(predicted_value.ridge))
predicted_class.ridge[predicted_value.ridge>0] <- "pos"
predicted_class.ridge[predicted_value.ridge<0] <- "neg"
tab_class.ridge <- table(actual_class,predicted_class.ridge)
tab_class.ridge
            predicted_class.ridge
actual_class neg pos
         neg 214  49
         pos  42 195
confusionMatrix(tab_class.ridge, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.ridge
actual_class neg pos
         neg 214  49
         pos  42 195
                                          
               Accuracy : 0.818           
                 95% CI : (0.7813, 0.8509)
    No Information Rate : 0.512           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6355          
                                          
 Mcnemar's Test P-Value : 0.5294          
                                          
            Sensitivity : 0.7992          
            Specificity : 0.8359          
         Pos Pred Value : 0.8228          
         Neg Pred Value : 0.8137          
              Precision : 0.8228          
                 Recall : 0.7992          
                     F1 : 0.8108          
             Prevalence : 0.4880          
         Detection Rate : 0.3900          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.8176          
                                          
       'Positive' Class : pos             
                                          

Accuracy of .818, about the same as Naive Bayes. The misses are a little more evenly distributed across classes, with the model slightly more successful at identifying positive reviews and slightly less successful at identifying negative reviews.

At first blush, the coefficients should tell us what the model learned:

plot(colSums(dfmat_train),coef(sentmod.ridge)[-1,1], pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Ridge Regression Coefficients, IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
#text(colSums(dfmat_train),coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=200*abs(coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,.3))
text(colSums(dfmat_train),coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=200*abs(coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,75*abs(coef(sentmod.ridge)[-1,1]))) # note: rgb() requires alpha in [0,1]; if knitting errors here, cap the last argument with pmin(1, ...)

That's both confusing and misleading since the variance of coefficients is largest with the most obscure terms. (And, for plotting, the 40,000+ features include some very long one-off "tokens" that overlap with more common ones, e.g., "boy-drinks-entire-bottle-of-shampoo-and-may-or-may-not-get-girl-back," "props-strategically-positioned-between-naked-actors-and-camera," and "____________________________________________")

With this model, it would be more informative to look at which coefficients have the most impact when making a prediction, by having larger coefficients and occurring more, or alternatively to look at which coefficients we are most certain of, downweighting by the inherent error. The impact will be proportional to \(\log(n_w)\) and the error will be roughly proportional to \(1/\sqrt{n_w}\).

So, impact:

plot(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Ridge Regression Coefficients (Impact Weighted), IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
text(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=50*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,25*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))

#text(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=50*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,.3))

Most positive and negative features by impact:

sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
 outstanding     seamless     flawless     chilling 
  0.02986523   0.02405951   0.02369958   0.02258045 
   perfectly   astounding         deft    memorable 
  0.02240629   0.02240350   0.02239351   0.02236545 
   feel-good  wonderfully      offbeat    fantastic 
  0.02204303   0.02201251   0.02169243   0.02166363 
 masterfully       superb  understands         lore 
  0.02160334   0.02159813   0.02140946   0.02139079 
      finest          gem breathtaking   refreshing 
  0.02129761   0.02116775   0.02105908   0.02097314 
sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
       wasted     ludicrous         waste        poorly 
  -0.03357349   -0.03190920   -0.02923099   -0.02870431 
         mess    ridiculous          lame         awful 
  -0.02796097   -0.02768335   -0.02696895   -0.02676743 
    insulting         worst       spoiled        boring 
  -0.02548997   -0.02499573   -0.02470123   -0.02451342 
    stupidity          dull     laughable unintentional 
  -0.02332056   -0.02317028   -0.02300540   -0.02295658 
      unfunny       idiotic         sucks     painfully 
  -0.02284695   -0.02241219   -0.02238895   -0.02155078 

Regularization and cross-validation bought us a much more general -- less overfit -- model than we saw with Naive Bayes.

Alternatively, by certainty:

plot(colSums(dfmat_train),sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Ridge Regression Coefficients (Error Weighted), IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
#text(colSums(dfmat_train),sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=30*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,.3))
text(colSums(dfmat_train),sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1], colnames(dfmat_train),pos=4,cex=30*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]), col=rgb(0,0,0,10*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))

Most positive and negative terms:

sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
 outstanding    perfectly performances    memorable 
  0.05576915   0.05101881   0.05047507   0.04937696 
       great         also    hilarious         life 
  0.04658153   0.04586310   0.04494460   0.04461705 
        best     deserves     terrific  wonderfully 
  0.04295549   0.04288505   0.04208678   0.04164480 
      superb       allows         both           as 
  0.04154602   0.04142025   0.04105762   0.04055521 
        most       strong       others       subtle 
  0.04045270   0.04038688   0.03987612   0.03985545 
sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
       wasted         worst           bad        boring 
  -0.07034255   -0.07027883   -0.06857515   -0.06613638 
        waste          mess    ridiculous      supposed 
  -0.06365334   -0.06335187   -0.06061957   -0.05816944 
        awful        poorly          lame        stupid 
  -0.05660674   -0.05652766   -0.05409580   -0.05350318 
    ludicrous          dull         worse unfortunately 
  -0.05208293   -0.04988310   -0.04752929   -0.04642705 
         plot       nothing       unfunny     laughable 
  -0.04553899   -0.04516937   -0.04482198   -0.04370796 

This view implies some would-be "stop words" are important, and these seem to make sense on inspection. For example, "as" is indicative of phrases in positive reviews comparing movies to well-known and well-liked movies, e.g., "as good as." There's not a parallel "as bad as" that is as common in negative reviews.
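One way to check that intuition is a quick keyword-in-context search over the training reviews (a rough sketch; I haven't verified the counts, which will vary with the sample):

# Where does the phrase "as good as" show up, and in which class of review?
toks_train <- tokens(corpus_subset(corpus, id_numeric %in% id_train))
kw <- kwic(toks_train, pattern = phrase("as good as"))
head(kw)  # a few examples in context
table(docvars(toks_train, "sentiment")[match(kw$docname, docnames(toks_train))])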

LASSO (Logistic with L1-regularization)

Ridge regression gives you a coefficient for every feature. At the other extreme, we can use the LASSO to get some feature selection.

registerDoMC(cores=2) # parallelize to speed up
sentmod.lasso <- cv.glmnet(x=dfmat_train,
                   y=docvars(dfmat_train)$sentiment,
                   family="binomial", 
                   alpha=1,  # alpha = 1: LASSO
                   nfolds=5, # 5-fold cross-validation
                   parallel=TRUE, 
                   intercept=TRUE,
                   type.measure="class")
plot(sentmod.lasso)

# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_value.lasso <- predict(sentmod.lasso, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.lasso <- rep(NA,length(predicted_value.lasso))
predicted_class.lasso[predicted_value.lasso>0] <- "pos"
predicted_class.lasso[predicted_value.lasso<0] <- "neg"
tab_class.lasso <- table(actual_class,predicted_class.lasso)
tab_class.lasso
            predicted_class.lasso
actual_class neg pos
         neg 202  61
         pos  29 208
confusionMatrix(tab_class.lasso, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.lasso
actual_class neg pos
         neg 202  61
         pos  29 208
                                          
               Accuracy : 0.82            
                 95% CI : (0.7835, 0.8527)
    No Information Rate : 0.538           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6414          
                                          
 Mcnemar's Test P-Value : 0.001084        
                                          
            Sensitivity : 0.7732          
            Specificity : 0.8745          
         Pos Pred Value : 0.8776          
         Neg Pred Value : 0.7681          
              Precision : 0.8776          
                 Recall : 0.7732          
                     F1 : 0.8221          
             Prevalence : 0.5380          
         Detection Rate : 0.4160          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.8238          
                                          
       'Positive' Class : pos             
                                          

Slightly better accuracy of .82. The pattern of misses goes further in the other direction from Naive Bayes, overpredicting positive reviews.

plot(colSums(dfmat_train),coef(sentmod.lasso)[-1,1], pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="LASSO Coefficients, IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
#text(colSums(dfmat_train),coef(sentmod.lasso)[-1,1], colnames(dfmat_train),pos=4,cex=2*abs(coef(sentmod.lasso)[-1,1]), col=rgb(0,0,0,.3))
text(colSums(dfmat_train),coef(sentmod.lasso)[-1,1], colnames(dfmat_train),pos=4,cex=2*abs(coef(sentmod.lasso)[-1,1]), col=rgb(0,0,0,1*abs(coef(sentmod.lasso)[-1,1])))

As we want when running the LASSO, the vast majority of our coefficients are zero ... most features have no influence on the predictions.
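We can count how many features the LASSO actually kept (the exact number depends on the cross-validation folds):

# Non-zero coefficients at lambda.min, excluding the intercept
sum(coef(sentmod.lasso, s = "lambda.min")[-1, 1] != 0)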

It's less necessary here, but let's look at impact anyway:

plot(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1], pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="LASSO Coefficients (Impact Weighted), IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
#text(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1], colnames(dfmat_train),pos=4,cex=.8*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1]), col=rgb(0,0,0,.3))
text(colSums(dfmat_train),log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1], colnames(dfmat_train),pos=4,cex=.8*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1]), col=rgb(0,0,0,.25*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1])))

Most positive and negative features by impact:

sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=T)[1:20]
 outstanding    perfectly    memorable    hilarious 
   2.0168340    1.7973050    1.7462205    1.3514679 
    terrific     deserves   refreshing       finest 
   1.2847580    1.2552130    1.2195209    1.1194808 
performances breathtaking       others       allows 
   0.9978078    0.8788957    0.8011013    0.7975994 
       great     chilling  wonderfully      overall 
   0.7869102    0.7079751    0.6910283    0.6883786 
       flaws   commanding         lore          war 
   0.6604843    0.6523092    0.6155854    0.6112613 

Interestingly, there are a few there that would be negative indicators in most sentiment dictionaries, like "flaws" and "war".

sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=F)[1:20]
   ridiculous        wasted        boring          mess 
    -3.183920     -3.157582     -2.934914     -2.873280 
    ludicrous        poorly         worst         awful 
    -2.825649     -2.485490     -2.397311     -2.373415 
        waste          lame           bad      supposed 
    -2.173870     -2.080114     -2.049388     -2.002295 
 embarrassing       runtime       tedious     laughable 
    -1.826169     -1.669473     -1.529704     -1.526047 
      unfunny       nothing unfortunately   degenerates 
    -1.278365     -1.254600     -1.244573     -1.239855 

Both lists also have words that indicate a transition from a particular negative or positive aspect, followed by a holistic sentiment in the opposite direction. "The pace dragged at times, but overall it is an astonishing act of filmmaking." "The performances are tremendous, but unfortunately the poor writing makes this movie fall flat."

Elastic net

The elastic net estimates not just \(\lambda\) (the overall amount of regularization) but also \(\alpha\) (the weight of the L1 penalty relative to the L2 penalty). In R, this can also be done with the glmnet package. "I leave that as an exercise."
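
If you want to try it, here is one way it could look, as a rough sketch: cv.glmnet fits at a fixed \(\alpha\), so we search over a small grid of \(\alpha\) values ourselves and keep the cross-validated fit with the lowest error. (Object names like sentmod.enet are illustrative only; fixing foldid across fits makes the comparison fairer.)

alphas <- c(0, .25, .5, .75, 1)   # grid of L1/L2 mixing weights to try
foldid <- sample(rep(1:5, length.out=nrow(dfmat_train)))   # same folds for every alpha
enet.fits <- lapply(alphas, function(a)
  cv.glmnet(x=dfmat_train, y=docvars(dfmat_train)$sentiment,
            family="binomial", alpha=a, foldid=foldid, type.measure="class"))
cv.err <- sapply(enet.fits, function(f) min(f$cvm))   # best CV error for each alpha
sentmod.enet <- enet.fits[[which.min(cv.err)]]        # keep the winning fit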

A first ensemble

We've got three sets of predictions now, so let's try a simple ensemble in which the prediction for each review is a majority vote of the three -- sort of like a Rotten Tomatoes rating. Each model learned slightly different things, so perhaps the whole is better than its parts.

predicted_class.ensemble3 <- rep("neg",length(actual_class))
num_predicted_pos3 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos")
predicted_class.ensemble3[num_predicted_pos3>1] <- "pos"
tab_class.ensemble3 <- table(actual_class,predicted_class.ensemble3)
tab_class.ensemble3
            predicted_class.ensemble3
actual_class neg pos
         neg 223  40
         pos  42 195
confusionMatrix(tab_class.ensemble3, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.ensemble3
actual_class neg pos
         neg 223  40
         pos  42 195
                                          
               Accuracy : 0.836           
                 95% CI : (0.8006, 0.8674)
    No Information Rate : 0.53            
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.671           
                                          
 Mcnemar's Test P-Value : 0.9121          
                                          
            Sensitivity : 0.8298          
            Specificity : 0.8415          
         Pos Pred Value : 0.8228          
         Neg Pred Value : 0.8479          
              Precision : 0.8228          
                 Recall : 0.8298          
                     F1 : 0.8263          
             Prevalence : 0.4700          
         Detection Rate : 0.3900          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.8356          
                                          
       'Positive' Class : pos             
                                          

Hey, that is better! Accuracy 83.6%!

Support vector machine

Without explaining SVM in any depth, let's try a simple one. (Very briefly: an SVM looks for the separating hyperplane that maximizes the margin between the classes, subject to a cost for points on the wrong side; with a linear kernel it is a linear classifier broadly comparable to the regularized logistic regressions above.)

library(e1071)
sentmod.svm <- svm(x=dfmat_train,
                   y=as.factor(docvars(dfmat_train)$sentiment),
                   kernel="linear", 
                   cost=10,  # arbitrary cost for margin violations (higher cost = less regularization)
                   probability=TRUE)

Ideally, we would tune the cost parameter via cross-validation or similar, as we did with \(\lambda\) above.
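
For example, e1071 has a tune.svm() helper that cross-validates over a grid of costs. Something like the following sketch (not run here, and slow with this many features) would be the natural next step:

svm.tune <- tune.svm(x=dfmat_train,
                     y=as.factor(docvars(dfmat_train)$sentiment),
                     kernel="linear",
                     cost=10^(-2:2),                      # candidate costs from .01 to 100
                     tunecontrol=tune.control(cross=5))   # 5-fold cross-validation
svm.tune$best.parameters                                  # cost with the lowest CV error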

# actual_class <- docvars(dfmat_matched, "sentiment")
predicted_class.svm <- predict(sentmod.svm, newdata=dfmat_matched)
tab_class.svm <- table(actual_class,predicted_class.svm)
tab_class.svm
            predicted_class.svm
actual_class neg pos
         neg 208  55
         pos  31 206
confusionMatrix(tab_class.svm, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.svm
actual_class neg pos
         neg 208  55
         pos  31 206
                                         
               Accuracy : 0.828          
                 95% CI : (0.792, 0.8601)
    No Information Rate : 0.522          
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6568         
                                         
 Mcnemar's Test P-Value : 0.01313        
                                         
            Sensitivity : 0.7893         
            Specificity : 0.8703         
         Pos Pred Value : 0.8692         
         Neg Pred Value : 0.7909         
              Precision : 0.8692         
                 Recall : 0.7893         
                     F1 : 0.8273         
             Prevalence : 0.5220         
         Detection Rate : 0.4120         
   Detection Prevalence : 0.4740         
      Balanced Accuracy : 0.8298         
                                         
       'Positive' Class : pos            
                                         

That's actually a bit better than the other individual models, if not the three-model ensemble, with accuracy of .828 and a bias toward overpredicting positives.

For a linear kernel, we can back out interpretable coefficients: the fitted decision function is \(f(x) = w \cdot x + b\), where \(w = \sum_i \alpha_i y_i x_i\) summed over the support vectors, and the next line reconstructs \(w\) from the model's coefs (the \(\alpha_i y_i\)) and index (the support vectors). This is not possible with nonlinear kernels such as the "radial basis function."

beta.svm <- drop(t(sentmod.svm$coefs)%*%dfmat_train[sentmod.svm$index,])

(Note the signs are reversed from our expected pos-neg.)

plot(colSums(dfmat_train),-beta.svm, pch=19, col=rgb(0,0,0,.3), cex=.5, log="x", main="Support Vector Machine Coefficients (Linear Kernel), IMDB", ylab="<--- Negative Reviews --- Positive Reviews --->", xlab="Total Appearances", xlim = c(1,50000))
text(colSums(dfmat_train),-beta.svm, colnames(dfmat_train),pos=4,cex=10*abs(beta.svm), col=rgb(0,0,0,5*abs(beta.svm)))

sort(-beta.svm,dec=T)[1:20]
   excellent    perfectly          job         seen 
  0.10174038   0.08249379   0.08239108   0.07982601 
      jackie        great       laughs         well 
  0.07796253   0.07625078   0.07433246   0.07290207 
        life     american        takes       sherri 
  0.07014216   0.06798416   0.06700347   0.06666700 
    together          son performances      hopkins 
  0.06603345   0.06554653   0.06551056   0.06537682 
         war     terrific          fun        comic 
  0.06507929   0.06492881   0.06272680   0.06098881 
sort(-beta.svm,dec=F)[1:20]
          waste             bad         nothing 
    -0.13829148     -0.11982861     -0.11831565 
           only   unfortunately          boring 
    -0.11097562     -0.10948674     -0.10597857 
           poor          anyway          script 
    -0.09202704     -0.09050548     -0.08665645 
            any           worst           awful 
    -0.08614869     -0.08549709     -0.08353324 
           plot      horrendous           maybe 
    -0.07762370     -0.07756367     -0.07716360 
         should        supposed extraordinarily 
    -0.07520975     -0.07422758     -0.07078764 
           mess           looks 
    -0.06984777     -0.06962532 

Looks a bit overfit to me; in further iterations I would probably regularize more heavily, which for svm() means lowering the cost parameter (the penalty on margin violations), not raising it.

Random Forests

library(randomForest)

Random forests are computationally intensive, so I will cut the number of features way down just so this can run in a reasonable amount of time.

dfmat.rf <- corpus %>%
  tokens() %>%
  dfm() %>%
  dfm_trim(min_docfreq=50,max_docfreq=300,verbose=TRUE)
Removing features occurring: 
  - in fewer than 50 documents: 46,213
  - in more than 300 documents: 359
  Total features removed: 46,572 (96.3%).
dfmatrix.rf <- as.matrix(dfmat.rf)
set.seed(1234)
sentmod.rf <- randomForest(dfmatrix.rf[id_train,], 
                   y=as.factor(docvars(dfmat.rf)$sentiment)[id_train],
                   xtest=dfmatrix.rf[id_test,],
                   ytest=as.factor(docvars(dfmat.rf)$sentiment)[id_test],
                   importance=TRUE,
                   mtry=20,
                   ntree=100
                   )
#sentmod.rf
predicted_class.rf <- sentmod.rf$test[['predicted']]
tab_class.rf <- table(actual_class,predicted_class.rf)
confusionMatrix(tab_class.rf, mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.rf
actual_class neg pos
         neg 193  70
         pos  30 207
                                          
               Accuracy : 0.8             
                 95% CI : (0.7622, 0.8342)
    No Information Rate : 0.554           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6022          
                                          
 Mcnemar's Test P-Value : 9.619e-05       
                                          
            Sensitivity : 0.7473          
            Specificity : 0.8655          
         Pos Pred Value : 0.8734          
         Neg Pred Value : 0.7338          
              Precision : 0.8734          
                 Recall : 0.7473          
                     F1 : 0.8054          
             Prevalence : 0.5540          
         Detection Rate : 0.4140          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.8064          
                                          
       'Positive' Class : pos             
                                          

That did a bit worse -- Accuracy .8 -- but we did give it considerably less information.

Getting marginal effects from a random forest model requires more finesse than I'm willing to apply here. We can get the "importance" of the different features, but this alone does not tell us in what direction the feature pushes the predictions.

varImpPlot(sentmod.rf)

Some usual suspects there, but we need our brains to fill in which ones are positive and negative. Some are ambiguous ("town") and some we have seen are subtle ("overall") or likely to be misleading ("war").
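
If we did want the direction for a specific feature, say "war", one rough option is a partial dependence plot from the randomForest package, sketched here. (Because we supplied xtest above, randomForest dropped the forest by default, so this would require refitting with keep.forest=TRUE.)

sentmod.rf2 <- randomForest(dfmatrix.rf[id_train,],
                   y=as.factor(docvars(dfmat.rf)$sentiment)[id_train],
                   mtry=20, ntree=100,
                   keep.forest=TRUE)   # retain the forest so partialPlot can use it
partialPlot(sentmod.rf2, dfmatrix.rf[id_train,], x.var="war", which.class="pos")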

Another ensemble

Now we've got five sets of predictions, so let's ensemble those.

predicted_class.ensemble5 <- rep("neg",length(actual_class))
num_predicted_pos5 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos") + 
  1*(predicted_class.svm=="pos") + 
  1*(predicted_class.rf=="pos")
predicted_class.ensemble5[num_predicted_pos5>2] <- "pos"
tab_class.ensemble5 <- table(actual_class,predicted_class.ensemble5)
tab_class.ensemble5
            predicted_class.ensemble5
actual_class neg pos
         neg 224  39
         pos  31 206
confusionMatrix(tab_class.ensemble5,mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.ensemble5
actual_class neg pos
         neg 224  39
         pos  31 206
                                          
               Accuracy : 0.86            
                 95% CI : (0.8265, 0.8892)
    No Information Rate : 0.51            
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7197          
                                          
 Mcnemar's Test P-Value : 0.4028          
                                          
            Sensitivity : 0.8408          
            Specificity : 0.8784          
         Pos Pred Value : 0.8692          
         Neg Pred Value : 0.8517          
              Precision : 0.8692          
                 Recall : 0.8408          
                     F1 : 0.8548          
             Prevalence : 0.4900          
         Detection Rate : 0.4120          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.8596          
                                          
       'Positive' Class : pos             
                                          

And, like magic, now we're up over 85% Accuracy in the test set.
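
Majority vote is the simplest possible ensemble. Another common variant averages the classifiers' predicted probabilities rather than their hard votes; a sketch (illustrative, not run above) might look like this:

prob.nb    <- predict(sentmod.nb, newdata=dfmat_matched, type="probability")[,"pos"]
prob.ridge <- predict(sentmod.ridge, newx=dfmat_matched, s="lambda.min", type="response")[,1]
prob.lasso <- predict(sentmod.lasso, newx=dfmat_matched, s="lambda.min", type="response")[,1]
prob.svm   <- attr(predict(sentmod.svm, newdata=dfmat_matched, probability=TRUE),"probabilities")[,"pos"]
prob.rf    <- sentmod.rf$test$votes[,"pos"]
predicted_class.avg <- ifelse((prob.nb + prob.ridge + prob.lasso + prob.svm + prob.rf)/5 > .5, "pos", "neg")
table(actual_class, predicted_class.avg)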

Note that most of these agree with each other on about 80% of cases (Naive Bayes and Ridge Regression agree on 93.4% of cases, and every other pair is between 78.8% and 81.0% agreement). The ensembling works because of this general disagreement, not in spite of it. If the models made the same mistakes -- agreed with each other -- there would be no gain from ensembling. Ensembling works best when you have different sorts of models that learn different sorts of things.
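
Those agreement figures can be checked directly with a quick sketch like this:

preds <- list(nb=predicted_class, ridge=predicted_class.ridge, lasso=predicted_class.lasso,
              svm=predicted_class.svm, rf=predicted_class.rf)
round(sapply(preds, function(a) sapply(preds, function(b) mean(as.character(a) == as.character(b)))), 3)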

Compare to a dictionary

library(quanteda.sentiment)
polarity.lsd <- textstat_polarity(dfmat_test, dictionary = data_dictionary_LSD2015, fun=sent_relpropdiff)
predicted_class.lsd <- rep("neg",nrow(polarity.lsd))
predicted_class.lsd[polarity.lsd$sentiment>0] <- "pos"
tab_class.lsd <- table(actual_class,predicted_class.lsd)
confusionMatrix(tab_class.lsd,mode="everything", positive="pos")
Confusion Matrix and Statistics

            predicted_class.lsd
actual_class neg pos
         neg 172  91
         pos  70 167
                                          
               Accuracy : 0.678           
                 95% CI : (0.6351, 0.7188)
    No Information Rate : 0.516           
    P-Value [Acc > NIR] : 1.512e-13       
                                          
                  Kappa : 0.3571          
                                          
 Mcnemar's Test P-Value : 0.115           
                                          
            Sensitivity : 0.6473          
            Specificity : 0.7107          
         Pos Pred Value : 0.7046          
         Neg Pred Value : 0.6540          
              Precision : 0.7046          
                 Recall : 0.6473          
                     F1 : 0.6747          
             Prevalence : 0.5160          
         Detection Rate : 0.3340          
   Detection Prevalence : 0.4740          
      Balanced Accuracy : 0.6790          
                                          
       'Positive' Class : pos             
                                          

Accuracy 67.8%. Yeah. Better than flipping a coin, but ...

ROC Curves

Another common approach to evaluating classifiers is the ROC curve, which looks in more depth at the underlying scores / probability statements that indicate how sure the classifier is. There are multiple libraries that provide ROC curve functionality. We'll use ROCit.

library(ROCit)

For these, we will use the underlying "score" emitted from the classifiers -- the predicted probabilities, or the dictionary polarity score.

There are multiple versions, but a typical one plots "Sensitivity" (the true positive rate) on the vertical axis against "1 - Specificity" (the false positive rate) on the horizontal axis, as the threshold for predicting class labels moves across the range of scores received.
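
As a concrete illustration of a single point on such a curve, here is one threshold worked by hand for the Lexicoder scores (a sketch):

threshold <- 0                                   # the same cutoff used for the dictionary above
pred_pos <- polarity.lsd$sentiment > threshold
c(TPR = mean(pred_pos[actual_class == "pos"]),   # sensitivity at this threshold
  FPR = mean(pred_pos[actual_class == "neg"]))   # 1 - specificity at this threshold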

roc_empirical.lsd <- rocit(score = polarity.lsd$sentiment, class = actual_class,negref = "neg")
roc_empirical.nb <- rocit(score =predict(sentmod.nb, newdata=dfmat_matched, type="probability")[,"pos"] , class = actual_class,negref = "neg")
roc_empirical.ridge <- rocit(score = predicted_value.ridge, class = actual_class,negref = "neg")
roc_empirical.lasso <- rocit(score = predicted_value.lasso, class = actual_class,negref = "neg")
predicted_prob.svm <- attr(predict(sentmod.svm, newdata=dfmat_matched, probability=TRUE),"probabilities")[,"pos"]
roc_empirical.svm <- rocit(score = predicted_prob.svm, class = actual_class,negref = "neg")
roc_empirical.rf <- rocit(score = sentmod.rf$test$votes[,2], class = actual_class,negref = "neg")
plot(c(0,1),c(0,1), type="n",main = "ROC Curves",ylab = "Sensitivity (TPR)", xlab = "1-Specificity (FPR)")
lines(c(0,1),c(0,1), col="grey", lty=2)
lines(roc_empirical.lsd$FPR,roc_empirical.lsd$TPR,col="black",lwd=2)
lines(roc_empirical.nb$FPR,roc_empirical.nb$TPR,col="red",lwd=2)
lines(roc_empirical.ridge$FPR,roc_empirical.ridge$TPR,col="blue",lwd=2)
lines(roc_empirical.lasso$FPR,roc_empirical.lasso$TPR,col="green",lwd=2)
lines(roc_empirical.svm$FPR,roc_empirical.svm$TPR,col="orange",lwd=2)
lines(roc_empirical.rf$FPR,roc_empirical.rf$TPR,col="purple",lwd=2)
legend(.6,.2,legend=c("Lexicoder","Naive Bayes","Ridge Regression","LASSO","Support Vector Machine","Random Forests"), col=c("black","red","blue","green","orange","purple"),lty=1,lwd=2)

A good ROC curve steps up steeply from 0, nearing 1 quickly, with its "elbow" as close to the upper left corner as possible.

A random classifier will have the ROC curve shown in gray.

This shows clearly that the analysis using the Lexicoder dictionary is by far the poorest. The other five are close together: SVM, LASSO, and Naive Bayes are essentially indistinguishable at the top, with Ridge Regression just behind. Random Forest is the poorest of the five, but as previously noted we used much less information in building that model because of computation time.

This can be summarized with the AUC (Area Under the Curve) statistic, which can be interpreted as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one:

print(paste("AUC, Lexicoder: ",sprintf("%1.3f",roc_empirical.lsd$AUC)))
[1] "AUC, Lexicoder:  0.757"
print(paste("AUC, Naive Bayes: ",sprintf("%1.3f",roc_empirical.nb$AUC)))
[1] "AUC, Naive Bayes:  0.905"
print(paste("AUC, Ridge Regression: ",sprintf("%1.3f",roc_empirical.ridge$AUC)))
[1] "AUC, Ridge Regression:  0.895"
print(paste("AUC, LASSO: ",sprintf("%1.3f",roc_empirical.lasso$AUC)))
[1] "AUC, LASSO:  0.904"
print(paste("AUC, SVM: ",sprintf("%1.3f",roc_empirical.svm$AUC)))
[1] "AUC, SVM:  0.909"
print(paste("AUC, Random Forests: ",sprintf("%1.3f",roc_empirical.rf$AUC)))
[1] "AUC, Random Forests:  0.882"