In this notebook we will work through a basic classification problem, using the movie reviews data set. We know the “negative” or “positive” labels for each of the movies. We’ll set some of these aside for a test set and train our models on the remainder as a training set, using unigram presence or counts as the features. Then we’ll evaluate the predictions quantitatively as well as look at some ways to interpret what the models tell us.
We’ll start with Naive Bayes, move to logistic regression and its ridge and LASSO variants, then support vector machines and finally random forests. We’ll also combine the models to examine an ensemble prediction.
Remove the comment and install the quanteda.corpora package from GitHub:
# devtools::install_github("quanteda/quanteda.corpora")
We’ll use these packages:
library(dplyr)
library(quanteda)
library(quanteda.corpora)
library(caret)
We’ll start with the example given in the quanteda documentation. Read in the Pang and Lee dataset of 2000 movie reviews. (This appears to be the same 2000 reviews you used in the dictionary exercise, but in a different order.)
corpus <- data_corpus_movies
summary(corpus,5)
Warning: nsentence() does not correctly count sentences in all lower-cased text
Corpus consisting of 2000 documents, showing 5 documents:
            Text Types Tokens Sentences Sentiment   id1   id2
 neg_cv000_29416   354    841         9       neg cv000 29416
 neg_cv001_19502   156    278         1       neg cv001 19502
 neg_cv002_17424   276    553         3       neg cv002 17424
 neg_cv003_12683   314    564         2       neg cv003 12683
 neg_cv004_12641   380    842         2       neg cv004 12641
Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
Created: Sat Nov 15 18:43:25 2014
Notes:
Randomly sample 1500 document ids to use as the training set; the corpus is ordered with the negative reviews first, so a random sample also mixes the classes.
set.seed(1234)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
[1] 228 1244 1218 1245 1719 1278 19 464 1327 1024
Use the 1500 for a training set and the other 500 as your test set. Create dfms for each.
docvars(corpus, "id_numeric") <- 1:ndoc(corpus)
dfmat_train <- corpus_subset(corpus, id_numeric %in% id_train) %>% dfm() #%>% dfm_weight(scheme="boolean")
dfmat_test <- corpus_subset(corpus, !(id_numeric %in% id_train)) %>% dfm() #%>% dfm_weight(scheme="boolean")
Naive Bayes is a built-in model in quanteda, so it’s easy to use:
sentmod.nb <- textmodel_nb(dfmat_train, docvars(dfmat_train, "Sentiment"), distribution = "Bernoulli")
summary(sentmod.nb)
Call:
textmodel_nb.dfm(x = dfmat_train, y = docvars(dfmat_train, "Sentiment"),
distribution = "Bernoulli")
Class Priors:
(showing first 2 elements)
neg pos
0.5 0.5
Estimated Feature Scores:
plot : two teen couples go to
neg 0.5853 0.5081 0.4954 0.6258 0.5324 0.4859 0.4996
pos 0.4147 0.4919 0.5046 0.3742 0.4676 0.5141 0.5004
a church party , drink and then
neg 0.499 0.446 0.547 0.4996 0.4669 0.5 0.5403
pos 0.501 0.554 0.453 0.5004 0.5331 0.5 0.4597
drive . they get into an accident
neg 0.5799 0.5 0.5045 0.5074 0.4914 0.4952 0.492
pos 0.4201 0.5 0.4955 0.4926 0.5086 0.5048 0.508
one of the guys dies but his
neg 0.498 0.4993 0.4996 0.5615 0.4886 0.5006 0.4867
pos 0.502 0.5007 0.5004 0.4385 0.5114 0.4994 0.5133
girlfriend continues
neg 0.5127 0.3307
pos 0.4873 0.6693
Use the dfm_match command to limit dfmat_test to features (words) that appeared in the training data:
dfmat_matched <- dfm_match(dfmat_test, features=featnames(dfmat_train))
How did we do? Let’s look at a “confusion” matrix.
actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_class <- predict(sentmod.nb, newdata=dfmat_matched)
tab_class <- table(actual_class,predicted_class)
tab_class
predicted_class
actual_class neg pos
neg 225 38
pos 53 184
Not bad, considering. Let’s put some numbers on that:
confusionMatrix(tab_class, mode="everything")
Confusion Matrix and Statistics
predicted_class
actual_class neg pos
neg 225 38
pos 53 184
Accuracy : 0.818
95% CI : (0.7813, 0.8509)
No Information Rate : 0.556
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6339
Mcnemar's Test P-Value : 0.1422
Sensitivity : 0.8094
Specificity : 0.8288
Pos Pred Value : 0.8555
Neg Pred Value : 0.7764
Precision : 0.8555
Recall : 0.8094
F1 : 0.8318
Prevalence : 0.5560
Detection Rate : 0.4500
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8191
'Positive' Class : neg
Given the rough balance between negative and positive reviews in the data, “Accuracy” isn’t a bad place to start. Here we have an Accuracy of 81.8%.
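Accuracy is just the share of the diagonal in that confusion table, which we can verify directly:
# Accuracy by hand: correct predictions over total predictions
sum(diag(tab_class)) / sum(tab_class)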
Let’s do some sniff tests. What are the most positive and negative words?
#Most positive words
sort(sentmod.nb$PcGw[2,],dec=T)[1:20]
outstanding seamless spielberg's lovingly
0.9609883 0.9354430 0.9311493 0.9262437
flawless astounding winslet winter
0.9205855 0.9205855 0.9205855 0.9205855
recalls lore gattaca annual
0.9139870 0.9139870 0.9139870 0.9139870
addresses mulan masterfully deft
0.9139870 0.9139870 0.9139870 0.9139870
online continuing missteps discussed
0.9061925 0.9061925 0.9061925 0.9061925
There’s reasonable stuff there: “outstanding”, “seamless”, “lovingly”, “flawless”. There’s also some evidence of overfitting: “spielberg’s”, “winslet”, “gattaca”, “mulan”. We’ll see support for the overfitting conclusion below.
#Most negative words
sort(sentmod.nb$PcGw[2,],dec=F)[1:20]
ludicrous spoiled pen insulting
0.05693813 0.06916885 0.08072974 0.08072974
racing degenerates perfunctory bounce
0.08072974 0.08072974 0.08809155 0.09693075
misfire feeble horrid weaponry
0.09693075 0.09693075 0.10774165 0.10774165
1982 3000 bursts wielding
0.10774165 0.10774165 0.10774165 0.10774165
campiness macdonald wee stalks
0.10774165 0.10774165 0.10774165 0.10774165
Let’s get a bird’s-eye view.
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances")
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
Let’s look a little more closely at the negative end.
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(10,1000), ylim=c(0,.25))
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
And a little more closely at the positive words:
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(10,1000), ylim=c(0.75,1.0))
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
Let’s look a little more closely at the document predictions.
predicted_prob <- predict(sentmod.nb, newdata=dfmat_matched, type="probability")
dim(predicted_prob)
[1] 500 2
head(predicted_prob)
neg pos
neg_cv007_4992 1.0000000000 1.284030e-17
neg_cv008_29326 0.0002319693 9.997680e-01
neg_cv011_13044 0.9999995397 4.603311e-07
neg_cv014_15600 1.0000000000 2.380867e-13
neg_cv016_4348 1.0000000000 1.026186e-18
neg_cv022_14227 1.0000000000 1.448269e-14
summary(predicted_prob)
neg pos
Min. :0.0000 Min. :0.000000
1st Qu.:0.0000 1st Qu.:0.000000
Median :0.9938 Median :0.006158
Mean :0.5580 Mean :0.442003
3rd Qu.:1.0000 3rd Qu.:1.000000
Max. :1.0000 Max. :1.000000
You can see here one problem with the “naive” part of Naive Bayes. By treating all of the features (words) as independent, it thinks it has seen far more information than it really has, and is therefore far more confident in its predictions than is warranted.
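A quick look at the distribution of those posteriors makes the point – as the quartiles above suggest, they pile up against 0 and 1:
# Posterior P(pos) for the test documents
hist(predicted_prob[,"pos"], breaks=50,
     main="Naive Bayes Posterior Probabilities, Test Set", xlab="P(pos)")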
What’s the most positive review in the test set according to this?
# sort by *least negative* since near zero aren't rounded
sort.list(predicted_prob[,1], dec=F)[1]
[1] 440
id_test <- !((1:2000) %in% id_train)
texts(corpus)[id_test][440]
pos_cv738_10116
"here's a word analogy : amistad is to the lost world as schindler's list is to jurassic park . \nin 1993 , after steven spielberg made the monster dino hit , many critics described schindler's list as the director's \" penance \" ( as if there was a need for him to apologize for making a crowd-pleasing blockbuster ) . \nnow , after a three-year layoff , spielberg is back with a vengeance . \nonce again , his summer release was special effects-loaded action/adventure flick with dinosaurs munching on human appetizers . \nnow , following his 1993 pattern , he has fashioned another serious , inspirational christmas release about the nature of humanity . \nthat film is amistad . \nalthough not as masterful as schindler's list , amistad is nevertheless a gripping motion picture . \nthematically rich , impeccably crafted , and intellectually stimulating , the only area where this movie falls a little short is in its emotional impact . \nwatching schindler's list was a powerful , almost spiritual , experience . \nspielberg pulled us into the narrative , absorbed us in the drama , then finally let us go , exhausted and shattered , three-plus hours later . \naspects of the movie have stayed with me ever since . \namistad , while a fine example of film making , is not as transcendent . \nthe incident of the ship la amistad is not found in any history books , but , considering who writes the texts , that's not a surprise . \nhowever , the event is a part of the american social and legal fabric , and , while amistad does not adhere rigorously to the actual account , most of the basic facts are in order . \nseveral , mostly minor changes have been made to enhance the film's dramatic force . \non the whole , while amistad may not be faithful to all of the details of the situation , it is true to the spirit and meaning of what transpired . \none stormy night during the summer of 1839 , the 53 men imprisoned on the spanish slave ship la amistad escape . \nled by the lion-hearted cinque ( djimon hounsou ) , they take control of the vessel , killing most of the crew . \nadrift somewhere off the coast of cuba and uncertain how to make their way back to africa , they rely on the two surviving spaniards to navigate the eastward journey . \nthey are tricked , however , and the la amistad , which makes its way northward off the united states' eastern coastline , is eventually captured by an american naval ship near connecticut . \nthe kidnapped africans are shackled and thrown into prison , charged with murder and piracy . \nthe first men to come to the africans' defense are abolitionists theodore joadson ( morgan freeman ) and lewis tappan ( stellan skarsgard ) . \nthey are soon joined by roger baldwin ( matthew mcconaughey ) , a property attorney of little repute . \naided by advice from former president john quincy adams ( anthony hopkins ) , baldwin proves a more persuasive orator than anyone gave him credit for , and his central argument -- that the prisoners were illegally kidnapped free men , not property -- convinces the judge . \nbut powerful forces have aligned against baldwin's cause . \ncurrent president martin van buren ( nigel hawthorne ) , eager to please southern voters and 11-year old queen isabella of spain ( anna paquin ) , begins pulling strings behind-the-scenes to ensure that none of the africans goes free . \nat its heart , amistad is a tale of human courage . \ncinque is a heroic figure whose spirit remains unbreakable regardless of the pain and indignity he is subjected to . 
\nhe is a free man , not a slave , and , while he recognizes that he may die as a result of his struggle , he will not give it up . \neffectively portrayed by newcomer djimon hounsou , whose passion and screen presence arrest our attention , cinque is the key to viewers seeing the amistad africans as more than symbols in a battle of ideologies . \nthey are individuals , and our ability to make that distinction is crucial to the movie's success . \nto amplify this point , spielberg presents many scenes from the africans' point-of-view , detailing their occasionally-humorous observations about some of the white man's seemingly-strange \" rituals \" . \nthe larger struggle is , of course , one of defining humanity . \nas the nazis felt justified in slaughtering jews because they viewed their victims as \" sub-human , \" so the pro-slavery forces of amistad use a similar defense . \nthe abolitionists regard the africans as men , but the slavers and their supporters see them as animals or property . \nin a sense , the morality of slavery is on trial here with the specter of civil war , which would break out less than three decades later , looming over everything . \namistad's presentation of the legal and political intricacies surrounding the trial are fascinating , making this movie one of the most engrossing courtroom dramas in recent history . \nfour claimants come forward against the africans : the state , which wants them tried for murder ; the queen of spain , who wants them handed over to her under the provision of an american/spanish treaty ; two american naval officers , who claim the right of high seas salvage ; and the two surviving spaniards from la amistad , who demand that their property be returned to them . \nbaldwin must counter all of these claims , while facing a challenge to his own preconceived notions as the result of a relationship he develops with cinque . \neven though attorney and client are divided by a language barrier , they gradually learn to communicate . \naside from cinque , who is a fully-realized individual , characterization is spotty , but the acting is top-notch . \nmatthew mcconaughey successfully overcomes his \" pretty boy \" image to become baldwin , but the lawyer is never particularly well-defined outside of his role in the la amistad case . \nlikewise , while morgan freeman and stellan skarsgard are effective as joadson and tappan , they are never anything more than \" abolitionists . \" \nnigel hawthorne , who played the title character in the madness of king george , presents martin van buren as a spineless sycophant to whom justice means far less than winning an election . \nfinally , there's anthony hopkins , whose towering portrayal of john quincy adams is as compelling as anything the great actor has recently done . \nhopkins , who can convincingly play such diverse figures as a serial killer , an emotionally-crippled english butler , and richard nixon , makes us believe that he is adams . \nhis ten-minute speech about freedom and human values is unforgettable . \none point of difference worth noting between amistad and schindler's list is this film's lack of a well-defined human villain . \nschindler's list had ralph fiennes' superbly-realized amon goeth , who was not only a three-dimensional character , but a personification of all that the nazis stood for . \nthere is no such figure in amistad . \nthe villain is slavery , but an ideology , no matter how evil , is rarely the best adversary . 
\nit is to spielberg's credit that he has fashioned such a compelling motion picture without a prominent antagonist . \namistad's trek to the screen , which encountered some choppy waters ( author barbara chase-riboud has cried plagiarism , a charge denied by the film makers ) , comes in the midst of an upsurge of interest in the incident . \nan opera of the same name opened in chicago on november 29 , 1997 . \nnumerous books about the subject are showing up on bookstore shelves . \nit remains to be seen how much longevity the amistad phenomena has , but one thing is certain -- with spielberg's rousing , substantive film leading the way , the spotlight has now illuminated this chapter of american history . "
Looks like “Amistad.” A genuinely positive review, but note how many times “spielberg” is mentioned. The prediction is biased toward positive just because Spielberg had positive reviews in the training set. We may not want that behavior.
Note also that this is a very long review.
# sort by *least positive* since near zero aren't rounded
sort.list(predicted_prob[,2], dec=F)[1]
[1] 211
texts(corpus)[id_test][211]
neg_cv807_23024
"and i thought \" stigmata \" would be the worst religiously-oriented thriller released this year . \nturns out i was wrong , because while \" stigmata \" was merely boring and self-important , \" end of days \" is completely inept on all fronts . \nit's a silly , incomprehensible , endlessly stupid mess . \nfor a guy like me who grew up watching arnold schwarzenegger at his best , it's extremely disconcerting to see where the big man has ended up . \nfor the first time in recent memory , an arnold action movie ( and \" batman & robin \" doesn't count ) is no fun at all . \n \" end of days \" is a major stinker . \nthe movie opens in vatican city , 1979 . \nsome catholic priests have observed an ancient prophecy , which says that a girl will be born on that night that satan will have targeted for impregnation . \nif he impregnates her between 11 and midnight on december 31 , 1999 , the world will be destroyed . \nthe pope orders protection of this girl , though some priests believe she ought to be killed . \nin new york , that very night , a girl is born to fulfill the prophecy . \ntwenty years later , we meet jericho cane ( schwarzenegger ) , a suicidal ex-cop with a drinking problem . \nnow working as a security guard for hire , he is protecting a local businessman ( gabriel byrne ) , who is actually possessed by the devil . \nan assassination attempt on the businessman by a crazed former priest leads him to the girl satan is after , christine york ( robin tunney ) . \nrecognizing elements of his own murdered daughter in christine ( including ownership of the same music box , apparently ) , jericho swears to protect her against the devil and the faction of priests looking to kill her . \nthere are so many problems with this film it's hard to know where to begin , but how about starting with the concept ? \ncasting arnold in a role like this was a mistake to begin with . \nschwarzenegger is a persona , not an actor , so putting him in a role that contradicts his usual strong personality is a bad idea . \narnold has neither the dramatic range nor the speaking ability to pull off a character tormented by conflicting emotions . \nin other words , trying to give him dimension was a mistake . \nharrison ford , mel gibson , or even bruce willis could have played this role ( they've all played noble and flawed heroes ) , but not schwarzenegger . \nthere are several scenes that attempt to establish jericho's character ; one has him contemplating suicide , another crying over the loss of his wife and daughter , and even one in which the devil tries to tempt him into revealing christine's location by offering him his old life back . \nnone of these scenes really work , because arnie isn't up to the task . \nthe filmmakers would have been better off making jericho a strong , confident character ( like the terminator , for example ) , the likes of which schwarzenegger has excelled in before . \nthis one isn't at all believable the way arnold plays him . \nthe supporting cast tries their hardest , and only gabriel byrne makes any impact at all . \nas the prince of darkness , he's suave and confident . \nhe acts like one would expect the devil to act . \nthe problem is that the script has him doing things that make no sense ( more on that later ) and that undermines him as a powerful villain . 
\nbyrne out-performs arnold in every scene they have together ( including the aforementioned temptation bit ) , but this is problematic when it causes the audience to start doing the unthinkable : root for the devil . \nbyrne's speech about the bible being \" overrated \" actually starts to make sense , mainly because arnold's attempts at refuting it ( mostly of the \" 'tis not ! \" \nvariety ) are feeble at best . \nthe only problem is , arnold has to win , so in the end , nobody really cares . \nkevin pollack plays jericho's security guard sidekick and tries to liven things up with some comic asides , but like most bad action movie sidekicks , he disappears after about an hour . \nrobin tunney isn't given much to do except look scared . \nin fact , all of the supporting players are good actors , but none , save for byrne , is given anything interesting to do . \nperformances aside , it would be really hard to enjoy this film no matter who starred in it . \nthis being an action blockbuster , it's no surprise that the worst thing about it is the script , which starts off totally confusing , and when some of it is explained ( and not much of it is ) , it's utterly ridiculous . \nwhy is the devil coming on new year's eve , 1999 ? \nbecause it's exactly 1000 years after the year of the devil , which isn't 666 , it turns out . \nsome nutty priest accidentally read it upside down , so the real year is 999 , so just add a 1 to the beginning , and you've got 1999 ! \nif you don't buy this explanation , you're not alone . \nit's convoluted and silly at the same time . \nthe method by which jericho locates christine york is equally ludicrous ( she's christine , see , and she lives in new york , see . \n . \n . ) , and if that weren't bad enough , there's plenty of bothersome stuff in this film that isn't explained at all . \nwhy can satan kill everyone he passes on the street , but when it comes to snuffing out one drunk ex-cop , he's powerless ? \nis he impervious to only one kind of bullet ? \nhow come he can't control jericho or christine ? \nand how did those gregorian monks deal with time zones in their prophecies ? \na clumsy attempt at a joke is made about this , but it's never actually explained . \nusually , this sort of thing wouldn't matter in a schwarzenegger flick ( i mean , don't get me started on the time paradoxes offered up by the terminator movies ) , but this time the plot inconsistencies stand out even more than usual because the action is rarely exciting . \nthere are several predictable horror film clich ? s present in \" end of days , \" complete with the old \" black cat hiding in a cabinet \" bit , not that we ever find out what the cat was doing in there . \nit gets so formulaic that it's possible for those uninterested in being scared to close their eyes at the precise moment a \" boo \" will come . \ntheir predictions will rarely be wrong . \nthe more grandiose action sequences are utterly charmless , partially because we don't care about these characters ( due to the script's pathetic attempts at characterization and setup ) , and also because they , too , don't make any sense . \nthere's a scene where schwarzenegger gets thrown around a room by a little old lady . \nit's good for a few chuckles , but not much else . \nsupposedly we're to believe she now has super strength by virtue of being controlled by satan , but the script never sets that up , so the scene is merely silly . 
\nnone of this is terribly exciting , because all the action sequences are so badly framed that it's often hard to tell why it's happening in the first place , not to mention that they're edited in full-on incomprehensible mtv quick-cut style . \nmost of them had me scratching my head , rather than saying , \" wow , cool ! \" \n \" end of days \" is not only silly and confusing , but it's also distinctly unpleasant to watch . \nthe devil apparently doesn't operate in the more subtle , i'll-convince-people-to-kill-each-other fashion outlined in the bible , but instead enjoys killing people gruesomely in broad daylight . \nthis doesn't only make him an awfully predictable sort , but it also means that not a single scene in \" end of days \" goes by without unnecessarily graphic violence , or the odd kinky sexual encounter ( yet another bit that had me scratching my head ) . \nif violence is supposed to be shocking , it's not a good idea to throw so much of it into a movie that the audience goes numb . \nscenes aren't connected through any reasonable means , so a lot of the time , stuff gets blown up , or people get killed , and i had no idea why . \nreasons ? \nto hell with reasons ! \nlet's just blow stuff up ! \nisn't it cool ? \nnope , not by a long shot . \nthis film is thoroughly unwatchable . \nit's dull , interminable , and unrelenting in its stupidity . \nperhaps arnold needs to make some movies with james cameron to revive his career , because it's not happening with hack peter hyams here . \n \" end of days \" might have had camp value , if only it didn't top itself off with an overly pious ending that nobody's going to buy . \nif the movie is going to be serious , the filmmakers should have come up with a decent script . \nif it's going to be campy , arnold shouldn't be taking himself so damn seriously ( i didn't actually see him put up on a cross , did i ? ) , and his character shouldn't be such a sad sack . \nas it stands , \" end of days \" is just a bad movie , and an awfully gloomy one at that . "
Schwarzenegger’s “End of Days.”
It should also be clear that more words mean more votes, so longer documents get more extreme predictions in one direction or the other. There’s an argument for that. But this logic would also underweight a review that read, in its entirety, “terrible” – even though that review is 100% clear in its sentiment.
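We can eyeball that intuition by plotting review length against the model’s confidence (a quick sketch, using token counts from the matched test dfm):
# Longer reviews should sit farther from the 50/50 boundary
plot(ntoken(dfmat_matched), abs(predicted_prob[,"pos"] - .5), pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     xlab="Review Length (tokens)", ylab="|P(pos) - .5|")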
What is it most confused about?
sort.list(abs(predicted_prob - .5), dec=F)[1]
[1] 212
predicted_prob[212,]
neg pos
0.4496432 0.5503568
So … the model says 45% chance negative, 55% positive.
texts(corpus)[id_test][212]
neg_cv808_13773
"stephen , please post if appropriate . \n \" mafia ! \" - crime isn't a funny business by homer yen ( c ) 1998 \non a particular night when i found myself having some free time , i had a chance to either go to sleep early or to see \" mafia ! \" , a spoof of mafia and crime films such as \" the godfather , \" \" goodfellas \" and \" casino \" . \nat 84 minutes in length , i thought that i could enjoy a few laughs before getting a good nights sleep . \nbut by my account , i think that my laff-o-meter only registered a few grins , one giggle , and maybe one chortle . \ni suppose that you could justify your time as homage to the venerable hollywood star , lloyd bridges , who just recently passed away and whose last performance was in this film . \n \" mafia ! \" \nchronicles vincenzo cortino's ( lloyd bridges ) life . \nseparated from his family when he was young , he escapes to america and tries to live an honest life . \nbut as fate would have it , vincenzo grows up to be a powerful and klutzy crime lord . \nfollowing in his footsteps are his two sons , joey ( billy burke ) and anthony ( jay mohr ) . \nlike all siblings in powerful crime families , they squabble over power , the future of the family , fortune , and women . \n \" mafia ! \" is co-written by jim abrahams , who also contributed to some gut-busting funny spoofs such as \" airplane \" and \" the naked gun . \" \nbut these previous movies were funny because the jokes seemed more universally understood and there was more of a manic silliness at work . \nas i write this , i also wonder how many people have actually seen the movies on which this spoof is based . \ncrime movies in general contain a lot of profanity and violence . \nit's a tough genre to parody . \ni was kind of hoping that they could somehow spoof the profanity used in all of those crime movies , maybe by having all of the tough crime lords say \" please \" as they decide which sector to take over , but this opportunity was never explored . \nthere were one or two moments that made me smile such as the scene where vincenzo is dancing with his newly wed daughter-in-law . \na gunman shoots him several times . \nthe impact of the bullets cause him to make these wild contortions that force the wedding band to change music styles to keep up with him , from the samba to disco to the macarena . \ni think that i just gave away the best part of the film . \noh well , that just means that you can go to sleep a little earlier . "
A negative review of “Mafia!”, a spoof movie I’d never heard of. Satire, parody, and sarcasm are notoriously difficult to classify correctly, so perhaps that’s what happened here.
Let’s look at a clear mistake.
# among the first 250 test documents (all actually negative), find the
# one predicted *least* negative
sort.list(predicted_prob[1:250,1],dec=F)[1]
[1] 196
predicted_prob[196,]
neg pos
3.967294e-17 1.000000e+00
So … the model says DEFINITELY positive.
texts(corpus)[id_test][196]
neg_cv761_13769
"weighed down by tired plot lines and spielberg's reliance on formulas , _saving private ryan_ is a mediocre film which nods in the direction of realism before descending into an abyss of cliches . \nthere ought to be a law against steven spielberg making movies about truly serious topics . \nspielberg's greatest strength as a director is the polished , formulaic way in which every aspect of the film falls carefully into place to make a perfect story . \nbut for a topic of such weight as combat in the second world war ( or the holocaust ) this technique backfires , for it creates coherent , comprehensible and redemptive narratives out of events whose size , complexity and evil are utterly beyond the reach of human ken . \nin this way spielberg trivializes the awesome evil of the stories he films . \n_saving private ryan_ tells the story of eight men who have been detailed on a \" pr mission \" to pull a young man , ryan ( whose three other brothers were just killed in fighting elsewhere ) out of combat on the normandy front just after d-day . \nryan is a paratrooper who dropped behind enemy lines the night before the landings and became separated from his fellow soldiers . \nthe search for him takes the eight soldiers across the hellish terrain of world war ii combat in france . \nthere's no denying spielberg came within shouting distance of making a great war movie . \nthe equipment , uniforms and weapons are superbly done . \nthe opening sequence , in which captain miller ( tom hanks ) leads his men onto omaha beach , is quite possibly the closest anyone has come to actually capturing the unendurably savage intensity of modern infantry combat . \nanother pleasing aspect of the film is spielberg's brave depiction of scenes largely unknown to american audiences , such as the shooting of prisoners by allied soldiers , the banality of death in combat , the routine foul-ups in the execution of the war , and the cynicism of the troops . \nthe technical side of the film is peerless , as always . \nthe camera work is magnificent , the pacing perfect , the sets convincing , the directing without flaw . \nhanks will no doubt be nominated for an oscar for his performance , which was utterly convincing , and the supporting cast was excellent , though ted danson seems a mite out of place as a paratroop colonel . \nyet the attempt at a realistic depiction of combat falls flat on its face because realism is not something which can be represented by single instances or events . \nit has to thoroughly permeate the context at every level of the film , or the story fails to convince . \nthroughout the movie spielberg repeatedly showed only single examples of the grotesque wounds produced by modern mechanized devices ( exception : men are shown burning to death with relative frequency ) . \nfor example , we see only one man with guts spilled out on the ground . \nhere and there men lose limbs ; in one scene miller is pulling a man to safety , there's an explosion , and miller looks back to see he is only pulling half a man . \nbut the rest of the corpses are remarkably intact . \nthere are no shoes with only feet in them , no limbs scattered everywhere , no torsos without limbs , no charred corpses , and most importantly , all corpses have heads ( in fairness there are a smattering of wicked head wounds ) . \nthe relentless dehumanization of the war , in which even corpses failed to retain any indentity , is soft-pedaled in the film . 
\nultimately , _saving private ryan_ bows to both hollywood convention and the unwritten rules of wartime photography in its portrayal of wounds and death in war . \nrather than saying _saving private ryan_ is \" realistic , \" it would be better to describe it as \" having realistic moments . \" \nanother aspect of the \" hollywoodization \" of the war is the lack of realistic dialogue and in particular , the lack of swearing . \nanyone familiar with the literature on the behavior of the men during the war , such as fussell's superb _wartime : understanding and behavior in the second world war_ ( which has an extensive discussion on swearing ) , knows that the troops swore fluently and without letup . \n \" who is this private ryan that we have to die for him ? \" \nasks one infantrymen in the group of eight . \nrendered in wartime demotic , that should have been expressed as \" who is this little pecker that we have to get our dicks shot off for him ? \" \nor some variant thereof . \nconversations should have been literally sprinkled with the \" f \" word , and largely about ( the search for ) food and sex . \nthis is all the more inexplicable because the movie already had an \" r \" rating due to violence , so swearing could not possibly have been eliminated to make it a family film . \nhowever , the most troubling aspect of the film is the spielbergization of the topic . \nthe most intense hell humans have ever created for themselves is not emotionally wrenching enough for steven spielberg . \nhe cannot just cede control to the material ; he has to be bigger than it . \nas if afraid to let the viewer find their own ( perhaps unsettled and not entirely clear ) emotional foothold in the material , spielberg has to package it in hallmark moments to give the war a meaning and coherence it never had : the opening and closing scenes of ryan and his family in the war cemetary ( reminscent of the closing scene from _schindler's list ) , the saccharine exchange between ryan and his wife at the close ( every bit as bad as schindler's monologue about how his car , tiepin or ring could have saved another jew ) , quotes from abraham lincoln and emerson , captain miller's last words to private ryan , and an unbelievable storyline in which a prisoner whom they free earlier in the movie comes back to kill the captain . \nthat particular subplot is so hokey , so predictable , it nigh on ruins the film . \nnowhere in the film is there a resolute depiction of the meaninglessness , stupidity and waste which characterized the experience of war to the men who actually fought in combat ( imagine if miller had been killed by friendly fire or collateral damage ) . \nbecause of its failure to mine deeply into the terrible realities of world war ii , _saving private ryan_ can only pan for small truths in the shallows . \n . "
Aha! A clearly negative review of “Saving Private Ryan.”
This is at least partly an “overfitting” mistake. It probably learned other “Saving Private Ryan” or “Spielberg movies” words – it looks like “spielberg’s” was #3 on our most-positive list above – and concluded that reviews that talk about Saving Private Ryan are probably positive.
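We can check that claim directly:
# Where does "spielberg's" rank among the most positive features?
which(names(sort(sentmod.nb$PcGw[2,], dec=TRUE)) == "spielberg's")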
Below, I’ll give brief examples of some other classification models for this data.
We’ll look at three (well really only two) variants of the relatively straightforward regularized logistic regression model.
library(glmnet)
library(doMC)
registerDoMC(cores=2) # parallelize to speed up
sentmod.ridge <- cv.glmnet(x=dfmat_train,
y=docvars(dfmat_train)$Sentiment,
family="binomial",
alpha=0, # alpha = 0: ridge regression
nfolds=5, # 5-fold cross-validation
parallel=TRUE,
intercept=TRUE,
type.measure="class")
plot(sentmod.ridge)
This shows classification error as \(\lambda\) (the total weight of the regularization penalty) is increased from 0. The minimum error is at the leftmost dotted line, about \(\log(\lambda) \approx 3\). This value is stored in lambda.min.
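We can pull that value out directly:
# CV-minimizing penalty (and glmnet's more conservative 1-SE alternative)
log(sentmod.ridge$lambda.min)
log(sentmod.ridge$lambda.1se)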
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_value.ridge <- predict(sentmod.ridge, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.ridge <- rep(NA,length(predicted_value.ridge))
predicted_class.ridge[predicted_value.ridge>0] <- "pos"
predicted_class.ridge[predicted_value.ridge<0] <- "neg"
tab_class.ridge <- table(actual_class,predicted_class.ridge)
tab_class.ridge
predicted_class.ridge
actual_class neg pos
neg 214 49
pos 42 195
Accuracy of .818, exactly as with Naive Bayes. The misses are a little more even, though: this model is slightly better at identifying positive reviews and slightly worse at identifying negative ones.
confusionMatrix(tab_class.ridge, mode="everything")
Confusion Matrix and Statistics
predicted_class.ridge
actual_class neg pos
neg 214 49
pos 42 195
Accuracy : 0.818
95% CI : (0.7813, 0.8509)
No Information Rate : 0.512
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6355
Mcnemar's Test P-Value : 0.5294
Sensitivity : 0.8359
Specificity : 0.7992
Pos Pred Value : 0.8137
Neg Pred Value : 0.8228
Precision : 0.8137
Recall : 0.8359
F1 : 0.8247
Prevalence : 0.5120
Detection Rate : 0.4280
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8176
'Positive' Class : neg
At first blush, the coefficients should tell us what the model learned:
plot(colSums(dfmat_train), coef(sentmod.ridge)[-1,1], pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), coef(sentmod.ridge)[-1,1], colnames(dfmat_train),
     pos=4, cex=200*abs(coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,75*abs(coef(sentmod.ridge)[-1,1])))
That’s both confusing and misleading since the variance of coefficients is largest with the most obscure terms. (And, for plotting, the 40,000+ features include some very long one-off “tokens” that overlap with more common ones, e.g., “boy-drinks-entire-bottle-of-shampoo-and-may-or-may-not-get-girl-back,” “props-strategically-positioned-between-naked-actors-and-camera,” and “____________________________________________”)
With this model, it would be more informative to look at which coefficients have the most impact when making a prediction, by having larger coefficients and occurring more, or alternatively to look at which coefficients we are most certain of, downweighting by the inherent error. The impact will be proportional to \(\log(n_w)\) and the error will be roughly proportional to \(1/\sqrt{n_w}\).
So, impact:
plot(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients (Impact Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=50*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,25*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))
Most positive and negative features by impact:
sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
outstanding seamless flawless chilling
0.03221820 0.02591475 0.02556874 0.02437356
perfectly deft astounding memorable
0.02422200 0.02420305 0.02416697 0.02412884
feel-good wonderfully offbeat fantastic
0.02384193 0.02367749 0.02339370 0.02337778
superb masterfully lore understands
0.02332856 0.02332021 0.02307940 0.02301901
finest gem breathtaking missteps
0.02295212 0.02283362 0.02265643 0.02263633
sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
wasted ludicrous waste
-0.03628220 -0.03443045 -0.03160819
poorly mess ridiculous
-0.03097066 -0.03019796 -0.02981018
lame awful insulting
-0.02907724 -0.02888791 -0.02746581
worst spoiled boring
-0.02697566 -0.02669386 -0.02651112
stupidity dull laughable
-0.02512044 -0.02505126 -0.02484610
unintentional unfunny sucks
-0.02479505 -0.02468694 -0.02417994
idiotic lifeless
-0.02417114 -0.02337420
Regularization and cross-validation bought us a much more general – less overfit – model than we saw with Naive Bayes.
Alternatively, by certainty:
plot(colSums(dfmat_train), sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients (Error Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=30*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,10*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))
Most positive and negative terms:
sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
outstanding perfectly performances memorable
0.06016299 0.05515314 0.05487387 0.05327005
great also hilarious life
0.05023645 0.04935470 0.04867251 0.04791490
deserves best terrific superb
0.04616259 0.04591025 0.04549550 0.04487466
wonderfully allows both as
0.04479474 0.04477793 0.04405678 0.04358100
strong most overall others
0.04343652 0.04338336 0.04306128 0.04303831
sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
wasted worst bad
-0.07601778 -0.07584566 -0.07389321
boring waste mess
-0.07152609 -0.06882991 -0.06842027
ridiculous supposed awful
-0.06527679 -0.06283200 -0.06109105
poorly lame stupid
-0.06099081 -0.05832472 -0.05720734
ludicrous dull worse
-0.05619818 -0.05393264 -0.05132278
unfortunately plot nothing
-0.05025006 -0.04881129 -0.04866758
unfunny laughable
-0.04843174 -0.04720511
This view implies some would-be “stop words” are important, and these seem to make sense on inspection. For example, “as” is indicative of phrases in positive reviews comparing movies to well-known and well-liked movies, e.g., “as good as.” There’s not a parallel “as bad as” that is as common in negative reviews.
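We can spot-check that with keyword-in-context (a sketch – kwic takes a tokens object in recent quanteda versions):
# "as good as" in the training reviews
toks_train <- tokens(corpus_subset(corpus, id_numeric %in% id_train))
head(kwic(toks_train, pattern=phrase("as good as"), window=3))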
Ridge regression gives you a coefficient for every feature. At the other extreme, we can use the LASSO to get some feature selection.
registerDoMC(cores=2) # parallelize to speed up
sentmod.lasso <- cv.glmnet(x=dfmat_train,
y=docvars(dfmat_train)$Sentiment,
family="binomial",
alpha=1, # alpha = 1: LASSO
nfolds=5, # 5-fold cross-validation
parallel=TRUE,
intercept=TRUE,
type.measure="class")
plot(sentmod.lasso)
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_value.lasso <- predict(sentmod.lasso, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.lasso <- rep(NA,length(predicted_value.lasso))
predicted_class.lasso[predicted_value.lasso>0] <- "pos"
predicted_class.lasso[predicted_value.lasso<0] <- "neg"
tab_class.lasso <- table(actual_class,predicted_class.lasso)
tab_class.lasso
predicted_class.lasso
actual_class neg pos
neg 202 61
pos 29 208
This gets one more right than the others for an accuracy of .82. The pattern of misses goes further in the other direction from Naive Bayes, overpredicting positive reviews.
confusionMatrix(tab_class.lasso, mode="everything")
Confusion Matrix and Statistics
predicted_class.lasso
actual_class neg pos
neg 202 61
pos 29 208
Accuracy : 0.82
95% CI : (0.7835, 0.8527)
No Information Rate : 0.538
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6414
Mcnemar's Test P-Value : 0.001084
Sensitivity : 0.8745
Specificity : 0.7732
Pos Pred Value : 0.7681
Neg Pred Value : 0.8776
Precision : 0.7681
Recall : 0.8745
F1 : 0.8178
Prevalence : 0.4620
Detection Rate : 0.4040
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8238
'Positive' Class : neg
plot(colSums(dfmat_train), coef(sentmod.lasso)[-1,1], pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     main="LASSO Coefficients, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), coef(sentmod.lasso)[-1,1], colnames(dfmat_train),
     pos=4, cex=2*abs(coef(sentmod.lasso)[-1,1]),
     col=rgb(0,0,0,1*abs(coef(sentmod.lasso)[-1,1])))
As we expect when we run the LASSO, the vast majority of our coefficients are zero: most features have no influence on the predictions.
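We can count the surviving features directly, at the same lambda.min used for the predictions:
# How many features does the LASSO keep?
beta.lasso <- coef(sentmod.lasso, s="lambda.min")[-1,1]
sum(beta.lasso != 0)  # features with nonzero coefficients
mean(beta.lasso == 0) # share of features zeroed out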
It’s less necessary here, but let’s look at impact:
plot(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="LASSO Coefficients (Impact Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=.8*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1]),
     col=rgb(0,0,0,.25*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1])))
Most positive and negative features by impact:
sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=T)[1:20]
outstanding perfectly memorable deserves
1.8774578 1.6077352 1.5402021 1.1174430
hilarious terrific finest performances
1.1040564 1.0820637 0.9907954 0.9659045
refreshing great others breathtaking
0.8944540 0.7419333 0.6173077 0.5688596
wonderfully allows overall war
0.5686626 0.5611135 0.5484030 0.5313923
also world life flaws
0.5164209 0.4605382 0.4560559 0.4285162
Interestingly, there are a few there that would be negative indicators in most sentiment dictionaries, like “flaws” and “war”.
sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=F)[1:20]
ridiculous wasted boring
-2.902012 -2.883715 -2.645762
ludicrous mess worst
-2.476602 -2.470577 -2.270073
poorly awful bad
-2.219431 -2.022392 -2.008826
waste lame supposed
-1.957973 -1.913403 -1.840086
embarrassing runtime tedious
-1.517301 -1.283180 -1.276298
dull unfunny nothing
-1.187166 -1.131177 -1.068824
laughable unfortunately
-1.047326 -1.030627
Both lists also have words that indicate a transition from a particular negative or positive aspect, followed by a holistic sentiment in the opposite direction. “The pace dragged at times, but overall it is an astonishing act of filmmaking.” “The performances are tremendous, but unfortunately the poor writing makes this movie fall flat.”
The elastic net estimates not just \(\lambda\) (the overall amount of regularization) but also \(\alpha\) (the relative weight of the L1 penalty versus the L2 penalty). In R, this can also be done with the glmnet package. I leave that as an exercise.
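For the curious, one way the exercise might go (a sketch – the alpha grid is arbitrary, and fixing foldid keeps the cross-validation fits comparable across alphas):
# Grid-search alpha, letting cv.glmnet pick lambda at each value
foldid <- sample(rep(1:5, length.out=ndoc(dfmat_train)))
alphas <- seq(0, 1, by=.25)
cvfits <- lapply(alphas, function(a)
  cv.glmnet(x=dfmat_train, y=docvars(dfmat_train)$Sentiment,
            family="binomial", alpha=a, foldid=foldid,
            type.measure="class"))
alphas[which.min(sapply(cvfits, function(f) min(f$cvm)))]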
We’ve got three sets of predictions now, so why don’t we try a simple ensemble in which our prediction for each review is based on a majority vote of the three. Sort of like a Rotten Tomatoes rating. They each learned slightly different things, so perhaps the whole is better than its parts.
predicted_class.ensemble3 <- rep("neg",length(actual_class))
num_predicted_pos3 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos")
predicted_class.ensemble3[num_predicted_pos3>1] <- "pos"
tab_class.ensemble3 <- table(actual_class,predicted_class.ensemble3)
tab_class.ensemble3
predicted_class.ensemble3
actual_class neg pos
neg 223 40
pos 41 196
Hey, that is better! Accuracy 83.8%!
confusionMatrix(tab_class.ensemble3, mode="everything")
Confusion Matrix and Statistics
predicted_class.ensemble3
actual_class neg pos
neg 223 40
pos 41 196
Accuracy : 0.838
95% CI : (0.8027, 0.8692)
No Information Rate : 0.528
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6751
Mcnemar's Test P-Value : 1
Sensitivity : 0.8447
Specificity : 0.8305
Pos Pred Value : 0.8479
Neg Pred Value : 0.8270
Precision : 0.8479
Recall : 0.8447
F1 : 0.8463
Prevalence : 0.5280
Detection Rate : 0.4460
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8376
'Positive' Class : neg
Without explaining SVM at all, let’s try a simple one.
library(e1071)
sentmod.svm <- svm(x=dfmat_train,
y=as.factor(docvars(dfmat_train)$Sentiment),
kernel="linear",
cost=10, # arbitrary regularization cost
probability=TRUE)
Ideally, we would tune the cost parameter via cross-validation or similar, as we did with \(\lambda\) above.
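A sketch of what that might look like with e1071’s tune.svm – the cost grid here is arbitrary, it assumes the dfm passes through to svm() as in the call above, and it is slow on the full feature set:
# 5-fold CV over a small grid of costs
svm.tune <- tune.svm(x=dfmat_train,
                     y=as.factor(docvars(dfmat_train)$Sentiment),
                     kernel="linear", cost=10^(-2:2),
                     tunecontrol=tune.control(cross=5))
svm.tune$best.parameters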
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_class.svm <- predict(sentmod.svm, newdata=dfmat_matched)
tab_class.svm <- table(actual_class,predicted_class.svm)
tab_class.svm
predicted_class.svm
actual_class neg pos
neg 209 54
pos 29 208
That’s actually a bit better than the others, individually if not combined, with accuracy of .834, and a bias toward overpredicting positives.
confusionMatrix(tab_class.svm, mode="everything")
Confusion Matrix and Statistics
predicted_class.svm
actual_class neg pos
neg 209 54
pos 29 208
Accuracy : 0.834
95% CI : (0.7984, 0.8656)
No Information Rate : 0.524
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6688
Mcnemar's Test P-Value : 0.00843
Sensitivity : 0.8782
Specificity : 0.7939
Pos Pred Value : 0.7947
Neg Pred Value : 0.8776
Precision : 0.7947
Recall : 0.8782
F1 : 0.8343
Prevalence : 0.4760
Detection Rate : 0.4180
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8360
'Positive' Class : neg
For a linear kernel, we can back out interpretable coefficients. This is not true with nonlinear kernels such as the “radial basis function.”
# Recover feature weights as the weighted sum of the support vectors
beta.svm <- drop(t(sentmod.svm$coefs) %*% dfmat_train[sentmod.svm$index,])
(Note the signs are reversed from our expected pos-neg.)
plot(colSums(dfmat_train), -beta.svm, pch=19, col=rgb(0,0,0,.3), cex=.5,
     log="x",
     main="Support Vector Machine Coefficients (Linear Kernel), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), -beta.svm, colnames(dfmat_train), pos=4,
     cex=10*abs(beta.svm), col=rgb(0,0,0,5*abs(beta.svm)))
sort(-beta.svm,dec=T)[1:20]
excellent perfectly job seen
0.10159567 0.08328699 0.08199813 0.07928320
jackie great well laughs
0.07850642 0.07773142 0.07410808 0.07354421
life american sherri takes
0.07060505 0.06731434 0.06718588 0.06656558
together war son hopkins
0.06613371 0.06551154 0.06545866 0.06507787
terrific performances fun twister
0.06477875 0.06422575 0.06368690 0.06119385
sort(-beta.svm,dec=F)[1:20]
waste bad nothing
-0.13844821 -0.11973782 -0.11914619
only unfortunately boring
-0.11092148 -0.11052668 -0.10550711
poor anyway script
-0.09114519 -0.09077601 -0.08812765
any worst awful
-0.08732612 -0.08544895 -0.08318521
maybe plot horrendous
-0.07819739 -0.07793267 -0.07724877
should supposed extraordinarily
-0.07511486 -0.07394538 -0.07042833
looks mess
-0.06942303 -0.06895524
Looks a bit overfit to me, and I would probably regularize more heavily (a smaller cost parameter) in further iterations.
library(randomForest)
The random forest is a very computationally intensive algorithm, so I will cut the number of features way down just so this can run in a reasonable amount of time.
dfmat.rf <- corpus %>%
dfm() %>%
dfm_trim(min_docfreq=50,max_docfreq=300,verbose=TRUE)
dfmatrix.rf <- as.matrix(dfmat.rf)
set.seed(1234)
sentmod.rf <- randomForest(dfmatrix.rf[id_train,],
y=as.factor(docvars(dfmat.rf)$Sentiment)[id_train],
xtest=dfmatrix.rf[id_test,],
ytest=as.factor(docvars(dfmat.rf)$Sentiment)[id_test],
importance=TRUE,
mtry=20,
ntree=100
)
#sentmod.rf
predicted_class.rf <- sentmod.rf$test[['predicted']]
tab_class.rf <- table(actual_class,predicted_class.rf)
confusionMatrix(tab_class.rf, mode="everything")
Confusion Matrix and Statistics
predicted_class.rf
actual_class neg pos
neg 187 76
pos 36 201
Accuracy : 0.776
95% CI : (0.7369, 0.8118)
No Information Rate : 0.554
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5545
Mcnemar's Test P-Value : 0.0002286
Sensitivity : 0.8386
Specificity : 0.7256
Pos Pred Value : 0.7110
Neg Pred Value : 0.8481
Precision : 0.7110
Recall : 0.8386
F1 : 0.7695
Prevalence : 0.4460
Detection Rate : 0.3740
Detection Prevalence : 0.5260
Balanced Accuracy : 0.7821
'Positive' Class : neg
That did a bit worse – Accuracy .776 – but we did give it considerably less information.
Getting marginal effects from a random forest model requires more finesse than I’m willing to apply here. We can get the “importance” of the different features, but this alone does not tell us in what direction the feature pushes the predictions.
varImpPlot(sentmod.rf)
Some usual suspects there, but we need our brains to fill in which ones are positive and negative. Some are ambiguous (“town”) and some we have seen are subtle (“overall”) or likely to be misleading (“war”).
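One crude sketch for recovering direction: sign each feature’s importance by whether it appears more, on average, in positive or negative training reviews.
# Sign importances by the difference in mean counts across classes (crude)
y_train <- as.factor(docvars(dfmat.rf)$Sentiment)[id_train]
mean_diff <- colMeans(dfmatrix.rf[id_train,][y_train=="pos",]) -
  colMeans(dfmatrix.rf[id_train,][y_train=="neg",])
imp <- importance(sentmod.rf)[,"MeanDecreaseAccuracy"]
sort(sign(mean_diff)*imp, dec=TRUE)[1:10] # most important positive-leaning features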
Now we’ve got five classifiers, so let’s ensemble those.
predicted_class.ensemble5 <- rep("neg",length(actual_class))
num_predicted_pos5 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos") +
1*(predicted_class.svm=="pos") +
1*(predicted_class.rf=="pos")
predicted_class.ensemble5[num_predicted_pos5>2] <- "pos"
tab_class.ensemble5 <- table(actual_class,predicted_class.ensemble5)
tab_class.ensemble5
predicted_class.ensemble5
actual_class neg pos
neg 220 43
pos 28 209
confusionMatrix(tab_class.ensemble5,mode="everything")
Confusion Matrix and Statistics
predicted_class.ensemble5
actual_class neg pos
neg 220 43
pos 28 209
Accuracy : 0.858
95% CI : (0.8243, 0.8874)
No Information Rate : 0.504
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7161
Mcnemar's Test P-Value : 0.09661
Sensitivity : 0.8871
Specificity : 0.8294
Pos Pred Value : 0.8365
Neg Pred Value : 0.8819
Precision : 0.8365
Recall : 0.8871
F1 : 0.8611
Prevalence : 0.4960
Detection Rate : 0.4400
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8582
'Positive' Class : neg
And like magic, now we’re up to 85.8% Accuracy in the test set.