In this notebook we will work through a basic classification problem, using the movie reviews data set. We know the “negative” or “positive” labels for each of the movies. We’ll set some of these aside for a test set and train our models on the remainder as a training set, using unigram presence or counts as the features. Then we’ll evaluate the predictions quantitatively as well as look at some ways to interpret what the models tell us.
We’ll start with Naive Bayes, move to logistic regression and its ridge and LASSO variants, then support vector machines and finally random forests. We’ll also combine the models to examine an ensemble prediction.
Remove the comment and install the quanteda.corpora package from GitHub:
# devtools::install_github("quanteda/quanteda.corpora")
We’ll use these packages:
library(dplyr)
library(quanteda)
library(quanteda.corpora)
library(caret)
We’ll start with the example given in the quanteda documentation. Read in the Pang and Lee dataset of 2000 movie reviews. (This appears to be the same 2000 reviews you used in the dictionary exercise, but in a different order.)
corpus <- data_corpus_movies
summary(corpus,5)
Warning: nsentence() does not correctly count sentences in all lower-cased text
Corpus consisting of 2000 documents, showing 5 documents:
            Text Types Tokens Sentences Sentiment   id1   id2
 neg_cv000_29416   354    841         9       neg cv000 29416
 neg_cv001_19502   156    278         1       neg cv001 19502
 neg_cv002_17424   276    553         3       neg cv002 17424
 neg_cv003_12683   314    564         2       neg cv003 12683
 neg_cv004_12641   380    842         2       neg cv004 12641
Source: /Users/kbenoit/Dropbox/QUANTESS/quantedaData_kenlocal_gh/* on x86_64 by kbenoit
Created: Sat Nov 15 18:43:25 2014
Notes:
Randomly sample 1500 document ids to use as the training set; the corpus is ordered with the negative reviews first, so a random sample also mixes the classes.
set.seed(1234)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
[1] 228 1244 1218 1245 1719 1278 19 464 1327 1024
Use the 1500 for a training set and the other 500 as your test set. Create dfms for each.
docvars(corpus, "id_numeric") <- 1:ndoc(corpus)
dfmat_train <- corpus_subset(corpus, id_numeric %in% id_train) %>% dfm() #%>% dfm_weight(scheme="boolean")
dfmat_test <- corpus_subset(corpus, !(id_numeric %in% id_train)) %>% dfm() #%>% dfm_weight(scheme="boolean")
Naive Bayes is a built-in model in quanteda, so it’s easy to use:
sentmod.nb <- textmodel_nb(dfmat_train, docvars(dfmat_train, "Sentiment"), distribution = "Bernoulli")
summary(sentmod.nb)
Call:
textmodel_nb.dfm(x = dfmat_train, y = docvars(dfmat_train, "Sentiment"),
distribution = "Bernoulli")
Class Priors:
(showing first 2 elements)
neg pos
0.5 0.5
Estimated Feature Scores:
plot : two teen couples go to
neg 0.5853 0.5081 0.4954 0.6258 0.5324 0.4859 0.4996
pos 0.4147 0.4919 0.5046 0.3742 0.4676 0.5141 0.5004
a church party , drink and then
neg 0.499 0.446 0.547 0.4996 0.4669 0.5 0.5403
pos 0.501 0.554 0.453 0.5004 0.5331 0.5 0.4597
drive . they get into an accident
neg 0.5799 0.5 0.5045 0.5074 0.4914 0.4952 0.492
pos 0.4201 0.5 0.4955 0.4926 0.5086 0.5048 0.508
one of the guys dies but his
neg 0.498 0.4993 0.4996 0.5615 0.4886 0.5006 0.4867
pos 0.502 0.5007 0.5004 0.4385 0.5114 0.4994 0.5133
girlfriend continues
neg 0.5127 0.3307
pos 0.4873 0.6693
Use the dfm_match command to limit dfmat_test to features (words) that appeared in the training data:
dfmat_matched <- dfm_match(dfmat_test, features=featnames(dfmat_train))
How did we do? Let’s look at a “confusion” matrix.
actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_class <- predict(sentmod.nb, newdata=dfmat_matched)
tab_class <- table(actual_class,predicted_class)
tab_class
predicted_class
actual_class neg pos
neg 225 38
pos 53 184
Not bad, considering. Let’s put some numbers on that:
confusionMatrix(tab_class, mode="everything")
Confusion Matrix and Statistics
predicted_class
actual_class neg pos
neg 225 38
pos 53 184
Accuracy : 0.818
95% CI : (0.7813, 0.8509)
No Information Rate : 0.556
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6339
Mcnemar's Test P-Value : 0.1422
Sensitivity : 0.8094
Specificity : 0.8288
Pos Pred Value : 0.8555
Neg Pred Value : 0.7764
Precision : 0.8555
Recall : 0.8094
F1 : 0.8318
Prevalence : 0.5560
Detection Rate : 0.4500
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8191
'Positive' Class : neg
Given the rough balance between negative and positive reviews in the data, “Accuracy” isn’t a bad place to start. Here we have an Accuracy of 81.8%.
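Accuracy is just the share of the diagonal in that confusion table, which we can verify directly:
# Accuracy by hand: correct predictions over total predictions
sum(diag(tab_class)) / sum(tab_class)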
Let’s do some sniff tests. What are the most positive and negative words?
#Most positive words
sort(sentmod.nb$PcGw[2,],dec=T)[1:20]
outstanding seamless spielberg's lovingly
0.9609883 0.9354430 0.9311493 0.9262437
flawless astounding winslet winter
0.9205855 0.9205855 0.9205855 0.9205855
recalls lore gattaca annual
0.9139870 0.9139870 0.9139870 0.9139870
addresses mulan masterfully deft
0.9139870 0.9139870 0.9139870 0.9139870
online continuing missteps discussed
0.9061925 0.9061925 0.9061925 0.9061925
There’s reasonable stuff there: “outstanding”, “seamless”, “lovingly”, “flawless”. There’s also some evidence of overfitting: “spielberg’s”, “winslet”, “gattaca”, “mulan”. We’ll see support for the overfitting conclusion below.
#Most negative words
sort(sentmod.nb$PcGw[2,],dec=F)[1:20]
ludicrous spoiled pen insulting
0.05693813 0.06916885 0.08072974 0.08072974
racing degenerates perfunctory bounce
0.08072974 0.08072974 0.08809155 0.09693075
misfire feeble horrid weaponry
0.09693075 0.09693075 0.10774165 0.10774165
1982 3000 bursts wielding
0.10774165 0.10774165 0.10774165 0.10774165
campiness macdonald wee stalks
0.10774165 0.10774165 0.10774165 0.10774165
Let’s get a bird’s-eye view.
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances")
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
Let’s look a little more closely at the negative end.
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(10,1000), ylim=c(0,.25))
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
And a little more closely at the positive words:
# Plot weights
plot(colSums(dfmat_train), sentmod.nb$PcGw[2,], pch=19, col=rgb(0,0,0,.3),
     cex=.5, log="x",
     main="Posterior Probabilities, Naive Bayes Classifier, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(10,1000), ylim=c(0.75,1.0))
text(colSums(dfmat_train), sentmod.nb$PcGw[2,], colnames(dfmat_train), pos=4,
     cex=5*abs(.5-sentmod.nb$PcGw[2,]),
     col=rgb(0,0,0,1.5*abs(.5-sentmod.nb$PcGw[2,])))
Let’s look a little more closely at the document predictions.
predicted_prob <- predict(sentmod.nb, newdata=dfmat_matched, type="probability")
dim(predicted_prob)
[1] 500 2
head(predicted_prob)
neg pos
neg_cv007_4992 1.0000000000 1.284030e-17
neg_cv008_29326 0.0002319693 9.997680e-01
neg_cv011_13044 0.9999995397 4.603311e-07
neg_cv014_15600 1.0000000000 2.380867e-13
neg_cv016_4348 1.0000000000 1.026186e-18
neg_cv022_14227 1.0000000000 1.448269e-14
summary(predicted_prob)
neg pos
Min. :0.0000 Min. :0.000000
1st Qu.:0.0000 1st Qu.:0.000000
Median :0.9938 Median :0.006158
Mean :0.5580 Mean :0.442003
3rd Qu.:1.0000 3rd Qu.:1.000000
Max. :1.0000 Max. :1.000000
You can see here one problem with the “naive” part of Naive Bayes. By treating all of the features (words) as independent, it thinks it has seen far more information than it really has, and is therefore far more confident in its predictions than is warranted.
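A quick look at the distribution of those posteriors makes the point – as the quartiles above suggest, they pile up against 0 and 1:
# Posterior P(pos) for the test documents
hist(predicted_prob[,"pos"], breaks=50,
     main="Naive Bayes Posterior Probabilities, Test Set", xlab="P(pos)")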
What’s the most positive review in the test set according to this?
# sort by *least negative* since near zero aren't rounded
sort.list(predicted_prob[,1], dec=F)[1]
[1] 440
id_test <- !((1:2000) %in% id_train)
texts(corpus)[id_test][440]
pos_cv738_10116
"here's a word analogy : amistad is to the lost world as schindler's list is to jurassic park . \nin 1993 , after steven spielberg made the monster dino hit , many critics described schindler's list as the director's \" penance \" ( as if there was a need for him to apologize for making a crowd-pleasing blockbuster ) . \nnow , after a three-year layoff , spielberg is back with a vengeance . \nonce again , his summer release was special effects-loaded action/adventure flick with dinosaurs munching on human appetizers . \nnow , following his 1993 pattern , he has fashioned another serious , inspirational christmas release about the nature of humanity . \nthat film is amistad . \nalthough not as masterful as schindler's list , amistad is nevertheless a gripping motion picture . \nthematically rich , impeccably crafted , and intellectually stimulating , the only area where this movie falls a little short is in its emotional impact . \nwatching schindler's list was a powerful , almost spiritual , experience . \nspielberg pulled us into the narrative , absorbed us in the drama , then finally let us go , exhausted and shattered , three-plus hours later . \naspects of the movie have stayed with me ever since . \namistad , while a fine example of film making , is not as transcendent . \nthe incident of the ship la amistad is not found in any history books , but , considering who writes the texts , that's not a surprise . \nhowever , the event is a part of the american social and legal fabric , and , while amistad does not adhere rigorously to the actual account , most of the basic facts are in order . \nseveral , mostly minor changes have been made to enhance the film's dramatic force . \non the whole , while amistad may not be faithful to all of the details of the situation , it is true to the spirit and meaning of what transpired . \none stormy night during the summer of 1839 , the 53 men imprisoned on the spanish slave ship la amistad escape . \nled by the lion-hearted cinque ( djimon hounsou ) , they take control of the vessel , killing most of the crew . \nadrift somewhere off the coast of cuba and uncertain how to make their way back to africa , they rely on the two surviving spaniards to navigate the eastward journey . \nthey are tricked , however , and the la amistad , which makes its way northward off the united states' eastern coastline , is eventually captured by an american naval ship near connecticut . \nthe kidnapped africans are shackled and thrown into prison , charged with murder and piracy . \nthe first men to come to the africans' defense are abolitionists theodore joadson ( morgan freeman ) and lewis tappan ( stellan skarsgard ) . \nthey are soon joined by roger baldwin ( matthew mcconaughey ) , a property attorney of little repute . \naided by advice from former president john quincy adams ( anthony hopkins ) , baldwin proves a more persuasive orator than anyone gave him credit for , and his central argument -- that the prisoners were illegally kidnapped free men , not property -- convinces the judge . \nbut powerful forces have aligned against baldwin's cause . \ncurrent president martin van buren ( nigel hawthorne ) , eager to please southern voters and 11-year old queen isabella of spain ( anna paquin ) , begins pulling strings behind-the-scenes to ensure that none of the africans goes free . \nat its heart , amistad is a tale of human courage . \ncinque is a heroic figure whose spirit remains unbreakable regardless of the pain and indignity he is subjected to . 
\nhe is a free man , not a slave , and , while he recognizes that he may die as a result of his struggle , he will not give it up . \neffectively portrayed by newcomer djimon hounsou , whose passion and screen presence arrest our attention , cinque is the key to viewers seeing the amistad africans as more than symbols in a battle of ideologies . \nthey are individuals , and our ability to make that distinction is crucial to the movie's success . \nto amplify this point , spielberg presents many scenes from the africans' point-of-view , detailing their occasionally-humorous observations about some of the white man's seemingly-strange \" rituals \" . \nthe larger struggle is , of course , one of defining humanity . \nas the nazis felt justified in slaughtering jews because they viewed their victims as \" sub-human , \" so the pro-slavery forces of amistad use a similar defense . \nthe abolitionists regard the africans as men , but the slavers and their supporters see them as animals or property . \nin a sense , the morality of slavery is on trial here with the specter of civil war , which would break out less than three decades later , looming over everything . \namistad's presentation of the legal and political intricacies surrounding the trial are fascinating , making this movie one of the most engrossing courtroom dramas in recent history . \nfour claimants come forward against the africans : the state , which wants them tried for murder ; the queen of spain , who wants them handed over to her under the provision of an american/spanish treaty ; two american naval officers , who claim the right of high seas salvage ; and the two surviving spaniards from la amistad , who demand that their property be returned to them . \nbaldwin must counter all of these claims , while facing a challenge to his own preconceived notions as the result of a relationship he develops with cinque . \neven though attorney and client are divided by a language barrier , they gradually learn to communicate . \naside from cinque , who is a fully-realized individual , characterization is spotty , but the acting is top-notch . \nmatthew mcconaughey successfully overcomes his \" pretty boy \" image to become baldwin , but the lawyer is never particularly well-defined outside of his role in the la amistad case . \nlikewise , while morgan freeman and stellan skarsgard are effective as joadson and tappan , they are never anything more than \" abolitionists . \" \nnigel hawthorne , who played the title character in the madness of king george , presents martin van buren as a spineless sycophant to whom justice means far less than winning an election . \nfinally , there's anthony hopkins , whose towering portrayal of john quincy adams is as compelling as anything the great actor has recently done . \nhopkins , who can convincingly play such diverse figures as a serial killer , an emotionally-crippled english butler , and richard nixon , makes us believe that he is adams . \nhis ten-minute speech about freedom and human values is unforgettable . \none point of difference worth noting between amistad and schindler's list is this film's lack of a well-defined human villain . \nschindler's list had ralph fiennes' superbly-realized amon goeth , who was not only a three-dimensional character , but a personification of all that the nazis stood for . \nthere is no such figure in amistad . \nthe villain is slavery , but an ideology , no matter how evil , is rarely the best adversary . 
\nit is to spielberg's credit that he has fashioned such a compelling motion picture without a prominent antagonist . \namistad's trek to the screen , which encountered some choppy waters ( author barbara chase-riboud has cried plagiarism , a charge denied by the film makers ) , comes in the midst of an upsurge of interest in the incident . \nan opera of the same name opened in chicago on november 29 , 1997 . \nnumerous books about the subject are showing up on bookstore shelves . \nit remains to be seen how much longevity the amistad phenomena has , but one thing is certain -- with spielberg's rousing , substantive film leading the way , the spotlight has now illuminated this chapter of american history . "
Looks like “Amistad.” A genuinely positive review, but note how many times “spielberg” is mentioned. The prediction is biased toward positive just because Spielberg had positive reviews in the training set. We may not want that behavior.
Note also that this is a very long review.
# sort by *least positive* since near zero aren't rounded
sort.list(predicted_prob[,2], dec=F)[1]
[1] 211
texts(corpus)[id_test][211]
neg_cv807_23024
"and i thought \" stigmata \" would be the worst religiously-oriented thriller released this year . \nturns out i was wrong , because while \" stigmata \" was merely boring and self-important , \" end of days \" is completely inept on all fronts . \nit's a silly , incomprehensible , endlessly stupid mess . \nfor a guy like me who grew up watching arnold schwarzenegger at his best , it's extremely disconcerting to see where the big man has ended up . \nfor the first time in recent memory , an arnold action movie ( and \" batman & robin \" doesn't count ) is no fun at all . \n \" end of days \" is a major stinker . \nthe movie opens in vatican city , 1979 . \nsome catholic priests have observed an ancient prophecy , which says that a girl will be born on that night that satan will have targeted for impregnation . \nif he impregnates her between 11 and midnight on december 31 , 1999 , the world will be destroyed . \nthe pope orders protection of this girl , though some priests believe she ought to be killed . \nin new york , that very night , a girl is born to fulfill the prophecy . \ntwenty years later , we meet jericho cane ( schwarzenegger ) , a suicidal ex-cop with a drinking problem . \nnow working as a security guard for hire , he is protecting a local businessman ( gabriel byrne ) , who is actually possessed by the devil . \nan assassination attempt on the businessman by a crazed former priest leads him to the girl satan is after , christine york ( robin tunney ) . \nrecognizing elements of his own murdered daughter in christine ( including ownership of the same music box , apparently ) , jericho swears to protect her against the devil and the faction of priests looking to kill her . \nthere are so many problems with this film it's hard to know where to begin , but how about starting with the concept ? \ncasting arnold in a role like this was a mistake to begin with . \nschwarzenegger is a persona , not an actor , so putting him in a role that contradicts his usual strong personality is a bad idea . \narnold has neither the dramatic range nor the speaking ability to pull off a character tormented by conflicting emotions . \nin other words , trying to give him dimension was a mistake . \nharrison ford , mel gibson , or even bruce willis could have played this role ( they've all played noble and flawed heroes ) , but not schwarzenegger . \nthere are several scenes that attempt to establish jericho's character ; one has him contemplating suicide , another crying over the loss of his wife and daughter , and even one in which the devil tries to tempt him into revealing christine's location by offering him his old life back . \nnone of these scenes really work , because arnie isn't up to the task . \nthe filmmakers would have been better off making jericho a strong , confident character ( like the terminator , for example ) , the likes of which schwarzenegger has excelled in before . \nthis one isn't at all believable the way arnold plays him . \nthe supporting cast tries their hardest , and only gabriel byrne makes any impact at all . \nas the prince of darkness , he's suave and confident . \nhe acts like one would expect the devil to act . \nthe problem is that the script has him doing things that make no sense ( more on that later ) and that undermines him as a powerful villain . 
\nbyrne out-performs arnold in every scene they have together ( including the aforementioned temptation bit ) , but this is problematic when it causes the audience to start doing the unthinkable : root for the devil . \nbyrne's speech about the bible being \" overrated \" actually starts to make sense , mainly because arnold's attempts at refuting it ( mostly of the \" 'tis not ! \" \nvariety ) are feeble at best . \nthe only problem is , arnold has to win , so in the end , nobody really cares . \nkevin pollack plays jericho's security guard sidekick and tries to liven things up with some comic asides , but like most bad action movie sidekicks , he disappears after about an hour . \nrobin tunney isn't given much to do except look scared . \nin fact , all of the supporting players are good actors , but none , save for byrne , is given anything interesting to do . \nperformances aside , it would be really hard to enjoy this film no matter who starred in it . \nthis being an action blockbuster , it's no surprise that the worst thing about it is the script , which starts off totally confusing , and when some of it is explained ( and not much of it is ) , it's utterly ridiculous . \nwhy is the devil coming on new year's eve , 1999 ? \nbecause it's exactly 1000 years after the year of the devil , which isn't 666 , it turns out . \nsome nutty priest accidentally read it upside down , so the real year is 999 , so just add a 1 to the beginning , and you've got 1999 ! \nif you don't buy this explanation , you're not alone . \nit's convoluted and silly at the same time . \nthe method by which jericho locates christine york is equally ludicrous ( she's christine , see , and she lives in new york , see . \n . \n . ) , and if that weren't bad enough , there's plenty of bothersome stuff in this film that isn't explained at all . \nwhy can satan kill everyone he passes on the street , but when it comes to snuffing out one drunk ex-cop , he's powerless ? \nis he impervious to only one kind of bullet ? \nhow come he can't control jericho or christine ? \nand how did those gregorian monks deal with time zones in their prophecies ? \na clumsy attempt at a joke is made about this , but it's never actually explained . \nusually , this sort of thing wouldn't matter in a schwarzenegger flick ( i mean , don't get me started on the time paradoxes offered up by the terminator movies ) , but this time the plot inconsistencies stand out even more than usual because the action is rarely exciting . \nthere are several predictable horror film clich ? s present in \" end of days , \" complete with the old \" black cat hiding in a cabinet \" bit , not that we ever find out what the cat was doing in there . \nit gets so formulaic that it's possible for those uninterested in being scared to close their eyes at the precise moment a \" boo \" will come . \ntheir predictions will rarely be wrong . \nthe more grandiose action sequences are utterly charmless , partially because we don't care about these characters ( due to the script's pathetic attempts at characterization and setup ) , and also because they , too , don't make any sense . \nthere's a scene where schwarzenegger gets thrown around a room by a little old lady . \nit's good for a few chuckles , but not much else . \nsupposedly we're to believe she now has super strength by virtue of being controlled by satan , but the script never sets that up , so the scene is merely silly . 
\nnone of this is terribly exciting , because all the action sequences are so badly framed that it's often hard to tell why it's happening in the first place , not to mention that they're edited in full-on incomprehensible mtv quick-cut style . \nmost of them had me scratching my head , rather than saying , \" wow , cool ! \" \n \" end of days \" is not only silly and confusing , but it's also distinctly unpleasant to watch . \nthe devil apparently doesn't operate in the more subtle , i'll-convince-people-to-kill-each-other fashion outlined in the bible , but instead enjoys killing people gruesomely in broad daylight . \nthis doesn't only make him an awfully predictable sort , but it also means that not a single scene in \" end of days \" goes by without unnecessarily graphic violence , or the odd kinky sexual encounter ( yet another bit that had me scratching my head ) . \nif violence is supposed to be shocking , it's not a good idea to throw so much of it into a movie that the audience goes numb . \nscenes aren't connected through any reasonable means , so a lot of the time , stuff gets blown up , or people get killed , and i had no idea why . \nreasons ? \nto hell with reasons ! \nlet's just blow stuff up ! \nisn't it cool ? \nnope , not by a long shot . \nthis film is thoroughly unwatchable . \nit's dull , interminable , and unrelenting in its stupidity . \nperhaps arnold needs to make some movies with james cameron to revive his career , because it's not happening with hack peter hyams here . \n \" end of days \" might have had camp value , if only it didn't top itself off with an overly pious ending that nobody's going to buy . \nif the movie is going to be serious , the filmmakers should have come up with a decent script . \nif it's going to be campy , arnold shouldn't be taking himself so damn seriously ( i didn't actually see him put up on a cross , did i ? ) , and his character shouldn't be such a sad sack . \nas it stands , \" end of days \" is just a bad movie , and an awfully gloomy one at that . "
Schwarzenegger’s “End of Days.”
It should also be clear that more words mean more votes, so longer documents get more extreme predictions in one direction or the other. There’s an argument for that. But this logic would also underweight a review that read, in its entirety, “terrible” – even though that review is 100% clear in its sentiment.
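We can eyeball that intuition by plotting review length against the model’s confidence (a quick sketch, using token counts from the matched test dfm):
# Longer reviews should sit farther from the 50/50 boundary
plot(ntoken(dfmat_matched), abs(predicted_prob[,"pos"] - .5), pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     xlab="Review Length (tokens)", ylab="|P(pos) - .5|")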
What is it most confused about?
sort.list(abs(predicted_prob - .5), dec=F)[1]
[1] 212
predicted_prob[212,]
neg pos
0.4496432 0.5503568
So … the model says 45% chance negative, 55% positive.
texts(corpus)[id_test][212]
neg_cv808_13773
"stephen , please post if appropriate . \n \" mafia ! \" - crime isn't a funny business by homer yen ( c ) 1998 \non a particular night when i found myself having some free time , i had a chance to either go to sleep early or to see \" mafia ! \" , a spoof of mafia and crime films such as \" the godfather , \" \" goodfellas \" and \" casino \" . \nat 84 minutes in length , i thought that i could enjoy a few laughs before getting a good nights sleep . \nbut by my account , i think that my laff-o-meter only registered a few grins , one giggle , and maybe one chortle . \ni suppose that you could justify your time as homage to the venerable hollywood star , lloyd bridges , who just recently passed away and whose last performance was in this film . \n \" mafia ! \" \nchronicles vincenzo cortino's ( lloyd bridges ) life . \nseparated from his family when he was young , he escapes to america and tries to live an honest life . \nbut as fate would have it , vincenzo grows up to be a powerful and klutzy crime lord . \nfollowing in his footsteps are his two sons , joey ( billy burke ) and anthony ( jay mohr ) . \nlike all siblings in powerful crime families , they squabble over power , the future of the family , fortune , and women . \n \" mafia ! \" is co-written by jim abrahams , who also contributed to some gut-busting funny spoofs such as \" airplane \" and \" the naked gun . \" \nbut these previous movies were funny because the jokes seemed more universally understood and there was more of a manic silliness at work . \nas i write this , i also wonder how many people have actually seen the movies on which this spoof is based . \ncrime movies in general contain a lot of profanity and violence . \nit's a tough genre to parody . \ni was kind of hoping that they could somehow spoof the profanity used in all of those crime movies , maybe by having all of the tough crime lords say \" please \" as they decide which sector to take over , but this opportunity was never explored . \nthere were one or two moments that made me smile such as the scene where vincenzo is dancing with his newly wed daughter-in-law . \na gunman shoots him several times . \nthe impact of the bullets cause him to make these wild contortions that force the wedding band to change music styles to keep up with him , from the samba to disco to the macarena . \ni think that i just gave away the best part of the film . \noh well , that just means that you can go to sleep a little earlier . "
A negative review of “Mafia!”, a spoof movie I’d never heard of. Satire, parody, and sarcasm are notoriously difficult to classify correctly, so perhaps that’s what happened here.
Let’s look at a clear mistake.
# among the first 250 test documents (all actually negative), find the
# one predicted *least* negative
sort.list(predicted_prob[1:250,1],dec=F)[1]
[1] 196
predicted_prob[196,]
neg pos
3.967294e-17 1.000000e+00
So … the model says DEFINITELY positive.
texts(corpus)[id_test][196]
neg_cv761_13769
"weighed down by tired plot lines and spielberg's reliance on formulas , _saving private ryan_ is a mediocre film which nods in the direction of realism before descending into an abyss of cliches . \nthere ought to be a law against steven spielberg making movies about truly serious topics . \nspielberg's greatest strength as a director is the polished , formulaic way in which every aspect of the film falls carefully into place to make a perfect story . \nbut for a topic of such weight as combat in the second world war ( or the holocaust ) this technique backfires , for it creates coherent , comprehensible and redemptive narratives out of events whose size , complexity and evil are utterly beyond the reach of human ken . \nin this way spielberg trivializes the awesome evil of the stories he films . \n_saving private ryan_ tells the story of eight men who have been detailed on a \" pr mission \" to pull a young man , ryan ( whose three other brothers were just killed in fighting elsewhere ) out of combat on the normandy front just after d-day . \nryan is a paratrooper who dropped behind enemy lines the night before the landings and became separated from his fellow soldiers . \nthe search for him takes the eight soldiers across the hellish terrain of world war ii combat in france . \nthere's no denying spielberg came within shouting distance of making a great war movie . \nthe equipment , uniforms and weapons are superbly done . \nthe opening sequence , in which captain miller ( tom hanks ) leads his men onto omaha beach , is quite possibly the closest anyone has come to actually capturing the unendurably savage intensity of modern infantry combat . \nanother pleasing aspect of the film is spielberg's brave depiction of scenes largely unknown to american audiences , such as the shooting of prisoners by allied soldiers , the banality of death in combat , the routine foul-ups in the execution of the war , and the cynicism of the troops . \nthe technical side of the film is peerless , as always . \nthe camera work is magnificent , the pacing perfect , the sets convincing , the directing without flaw . \nhanks will no doubt be nominated for an oscar for his performance , which was utterly convincing , and the supporting cast was excellent , though ted danson seems a mite out of place as a paratroop colonel . \nyet the attempt at a realistic depiction of combat falls flat on its face because realism is not something which can be represented by single instances or events . \nit has to thoroughly permeate the context at every level of the film , or the story fails to convince . \nthroughout the movie spielberg repeatedly showed only single examples of the grotesque wounds produced by modern mechanized devices ( exception : men are shown burning to death with relative frequency ) . \nfor example , we see only one man with guts spilled out on the ground . \nhere and there men lose limbs ; in one scene miller is pulling a man to safety , there's an explosion , and miller looks back to see he is only pulling half a man . \nbut the rest of the corpses are remarkably intact . \nthere are no shoes with only feet in them , no limbs scattered everywhere , no torsos without limbs , no charred corpses , and most importantly , all corpses have heads ( in fairness there are a smattering of wicked head wounds ) . \nthe relentless dehumanization of the war , in which even corpses failed to retain any indentity , is soft-pedaled in the film . 
\nultimately , _saving private ryan_ bows to both hollywood convention and the unwritten rules of wartime photography in its portrayal of wounds and death in war . \nrather than saying _saving private ryan_ is \" realistic , \" it would be better to describe it as \" having realistic moments . \" \nanother aspect of the \" hollywoodization \" of the war is the lack of realistic dialogue and in particular , the lack of swearing . \nanyone familiar with the literature on the behavior of the men during the war , such as fussell's superb _wartime : understanding and behavior in the second world war_ ( which has an extensive discussion on swearing ) , knows that the troops swore fluently and without letup . \n \" who is this private ryan that we have to die for him ? \" \nasks one infantrymen in the group of eight . \nrendered in wartime demotic , that should have been expressed as \" who is this little pecker that we have to get our dicks shot off for him ? \" \nor some variant thereof . \nconversations should have been literally sprinkled with the \" f \" word , and largely about ( the search for ) food and sex . \nthis is all the more inexplicable because the movie already had an \" r \" rating due to violence , so swearing could not possibly have been eliminated to make it a family film . \nhowever , the most troubling aspect of the film is the spielbergization of the topic . \nthe most intense hell humans have ever created for themselves is not emotionally wrenching enough for steven spielberg . \nhe cannot just cede control to the material ; he has to be bigger than it . \nas if afraid to let the viewer find their own ( perhaps unsettled and not entirely clear ) emotional foothold in the material , spielberg has to package it in hallmark moments to give the war a meaning and coherence it never had : the opening and closing scenes of ryan and his family in the war cemetary ( reminscent of the closing scene from _schindler's list ) , the saccharine exchange between ryan and his wife at the close ( every bit as bad as schindler's monologue about how his car , tiepin or ring could have saved another jew ) , quotes from abraham lincoln and emerson , captain miller's last words to private ryan , and an unbelievable storyline in which a prisoner whom they free earlier in the movie comes back to kill the captain . \nthat particular subplot is so hokey , so predictable , it nigh on ruins the film . \nnowhere in the film is there a resolute depiction of the meaninglessness , stupidity and waste which characterized the experience of war to the men who actually fought in combat ( imagine if miller had been killed by friendly fire or collateral damage ) . \nbecause of its failure to mine deeply into the terrible realities of world war ii , _saving private ryan_ can only pan for small truths in the shallows . \n . "
Aha! A clearly negative review of “Saving Private Ryan.”
This is at least partly an “overfitting” mistake. It probably learned other “Saving Private Ryan” or “Spielberg movies” words – it looks like “spielberg’s” was #3 on our most-positive list above – and concluded that reviews that talk about Saving Private Ryan are probably positive.
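We can check that claim directly:
# Where does "spielberg's" rank among the most positive features?
which(names(sort(sentmod.nb$PcGw[2,], dec=TRUE)) == "spielberg's")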
Below, I’ll give brief examples of some other classification models for this data.
We’ll look at three (well really only two) variants of the relatively straightforward regularized logistic regression model.
library(glmnet)
library(doMC)
registerDoMC(cores=2) # parallelize to speed up
sentmod.ridge <- cv.glmnet(x=dfmat_train,
y=docvars(dfmat_train)$Sentiment,
family="binomial",
alpha=0, # alpha = 0: ridge regression
nfolds=5, # 5-fold cross-validation
parallel=TRUE,
intercept=TRUE,
type.measure="class")
plot(sentmod.ridge)
This shows classification error as \(\lambda\) (the total weight of the regularization penalty) is increased from 0. The minimum error is at the leftmost dotted line, about \(\log(\lambda) \approx 3\). This value is stored in lambda.min.
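We can pull that value out directly:
# CV-minimizing penalty (and glmnet's more conservative 1-SE alternative)
log(sentmod.ridge$lambda.min)
log(sentmod.ridge$lambda.1se)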
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_value.ridge <- predict(sentmod.ridge, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.ridge <- rep(NA,length(predicted_value.ridge))
predicted_class.ridge[predicted_value.ridge>0] <- "pos"
predicted_class.ridge[predicted_value.ridge<0] <- "neg"
tab_class.ridge <- table(actual_class,predicted_class.ridge)
tab_class.ridge
predicted_class.ridge
actual_class neg pos
neg 214 49
pos 42 195
Accuracy of .818, exactly as with Naive Bayes. The misses are a little more even, though: this model is slightly better at identifying positive reviews and slightly worse at identifying negative ones.
confusionMatrix(tab_class.ridge, mode="everything")
Confusion Matrix and Statistics
predicted_class.ridge
actual_class neg pos
neg 214 49
pos 42 195
Accuracy : 0.818
95% CI : (0.7813, 0.8509)
No Information Rate : 0.512
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6355
Mcnemar's Test P-Value : 0.5294
Sensitivity : 0.8359
Specificity : 0.7992
Pos Pred Value : 0.8137
Neg Pred Value : 0.8228
Precision : 0.8137
Recall : 0.8359
F1 : 0.8247
Prevalence : 0.5120
Detection Rate : 0.4280
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8176
'Positive' Class : neg
At first blush, the coefficients should tell us what the model learned:
plot(colSums(dfmat_train), coef(sentmod.ridge)[-1,1], pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), coef(sentmod.ridge)[-1,1], colnames(dfmat_train),
     pos=4, cex=200*abs(coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,75*abs(coef(sentmod.ridge)[-1,1])))
That’s both confusing and misleading since the variance of coefficients is largest with the most obscure terms. (And, for plotting, the 40,000+ features include some very long one-off “tokens” that overlap with more common ones, e.g., “boy-drinks-entire-bottle-of-shampoo-and-may-or-may-not-get-girl-back,” “props-strategically-positioned-between-naked-actors-and-camera,” and “____________________________________________”)
With this model, it would be more informative to look at which coefficients have the most impact when making a prediction, by having larger coefficients and occurring more, or alternatively to look at which coefficients we are most certain of, downweighting by the inherent error. The impact will be proportional to \(\log(n_w)\) and the error will be roughly proportional to \(1/\sqrt{n_w}\).
So, impact:
plot(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients (Impact Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=50*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,25*abs(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))
Most positive and negative features by impact:
sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
outstanding seamless flawless chilling
0.03221820 0.02591475 0.02556874 0.02437356
perfectly deft astounding memorable
0.02422200 0.02420305 0.02416697 0.02412884
feel-good wonderfully offbeat fantastic
0.02384193 0.02367749 0.02339370 0.02337778
superb masterfully lore understands
0.02332856 0.02332021 0.02307940 0.02301901
finest gem breathtaking missteps
0.02295212 0.02283362 0.02265643 0.02263633
sort(log(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
wasted ludicrous waste
-0.03628220 -0.03443045 -0.03160819
poorly mess ridiculous
-0.03097066 -0.03019796 -0.02981018
lame awful insulting
-0.02907724 -0.02888791 -0.02746581
worst spoiled boring
-0.02697566 -0.02669386 -0.02651112
stupidity dull laughable
-0.02512044 -0.02505126 -0.02484610
unintentional unfunny sucks
-0.02479505 -0.02468694 -0.02417994
idiotic lifeless
-0.02417114 -0.02337420
Regularization and cross-validation bought us a much more general – less overfit – model than we saw with Naive Bayes.
Alternatively, by certainty:
plot(colSums(dfmat_train), sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="Ridge Regression Coefficients (Error Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=30*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1]),
     col=rgb(0,0,0,10*abs(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1])))
Most positive and negative terms:
sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=T)[1:20]
outstanding perfectly performances memorable
0.06016299 0.05515314 0.05487387 0.05327005
great also hilarious life
0.05023645 0.04935470 0.04867251 0.04791490
deserves best terrific superb
0.04616259 0.04591025 0.04549550 0.04487466
wonderfully allows both as
0.04479474 0.04477793 0.04405678 0.04358100
strong most overall others
0.04343652 0.04338336 0.04306128 0.04303831
sort(sqrt(colSums(dfmat_train))*coef(sentmod.ridge)[-1,1],dec=F)[1:20]
wasted worst bad
-0.07601778 -0.07584566 -0.07389321
boring waste mess
-0.07152609 -0.06882991 -0.06842027
ridiculous supposed awful
-0.06527679 -0.06283200 -0.06109105
poorly lame stupid
-0.06099081 -0.05832472 -0.05720734
ludicrous dull worse
-0.05619818 -0.05393264 -0.05132278
unfortunately plot nothing
-0.05025006 -0.04881129 -0.04866758
unfunny laughable
-0.04843174 -0.04720511
This view implies some would-be “stop words” are important, and these seem to make sense on inspection. For example, “as” is indicative of phrases in positive reviews comparing movies to well-known and well-liked movies, e.g., “as good as.” There’s not a parallel “as bad as” that is as common in negative reviews.
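We can spot-check that with keyword-in-context (a sketch – kwic takes a tokens object in recent quanteda versions):
# "as good as" in the training reviews
toks_train <- tokens(corpus_subset(corpus, id_numeric %in% id_train))
head(kwic(toks_train, pattern=phrase("as good as"), window=3))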
Ridge regression gives you a coefficient for every feature. At the other extreme, we can use the LASSO to get some feature selection.
registerDoMC(cores=2) # parallelize to speed up
sentmod.lasso <- cv.glmnet(x=dfmat_train,
y=docvars(dfmat_train)$Sentiment,
family="binomial",
alpha=1, # alpha = 1: LASSO
nfolds=5, # 5-fold cross-validation
parallel=TRUE,
intercept=TRUE,
type.measure="class")
plot(sentmod.lasso)
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_value.lasso <- predict(sentmod.lasso, newx=dfmat_matched,s="lambda.min")[,1]
predicted_class.lasso <- rep(NA,length(predicted_value.lasso))
predicted_class.lasso[predicted_value.lasso>0] <- "pos"
predicted_class.lasso[predicted_value.lasso<0] <- "neg"
tab_class.lasso <- table(actual_class,predicted_class.lasso)
tab_class.lasso
predicted_class.lasso
actual_class neg pos
neg 202 61
pos 29 208
This gets one more right than the others for an accuracy of .82. The pattern of misses goes further in the other direction from Naive Bayes, overpredicting positive reviews.
confusionMatrix(tab_class.lasso, mode="everything")
Confusion Matrix and Statistics
predicted_class.lasso
actual_class neg pos
neg 202 61
pos 29 208
Accuracy : 0.82
95% CI : (0.7835, 0.8527)
No Information Rate : 0.538
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6414
Mcnemar's Test P-Value : 0.001084
Sensitivity : 0.8745
Specificity : 0.7732
Pos Pred Value : 0.7681
Neg Pred Value : 0.8776
Precision : 0.7681
Recall : 0.8745
F1 : 0.8178
Prevalence : 0.4620
Detection Rate : 0.4040
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8238
'Positive' Class : neg
plot(colSums(dfmat_train), coef(sentmod.lasso)[-1,1], pch=19,
     col=rgb(0,0,0,.3), cex=.5, log="x",
     main="LASSO Coefficients, IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), coef(sentmod.lasso)[-1,1], colnames(dfmat_train),
     pos=4, cex=2*abs(coef(sentmod.lasso)[-1,1]),
     col=rgb(0,0,0,1*abs(coef(sentmod.lasso)[-1,1])))
As we expect when we run the LASSO, the vast majority of our coefficients are zero: most features have no influence on the predictions.
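We can count the surviving features directly, at the same lambda.min used for the predictions:
# How many features does the LASSO keep?
beta.lasso <- coef(sentmod.lasso, s="lambda.min")[-1,1]
sum(beta.lasso != 0)  # features with nonzero coefficients
mean(beta.lasso == 0) # share of features zeroed out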
It’s less necessary here, but let’s look at impact:
plot(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],
     pch=19, col=rgb(0,0,0,.3), cex=.5, log="x",
     main="LASSO Coefficients (Impact Weighted), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],
     colnames(dfmat_train), pos=4,
     cex=.8*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1]),
     col=rgb(0,0,0,.25*abs(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1])))
Most positive and negative features by impact:
sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=T)[1:20]
outstanding perfectly memorable deserves
1.8774578 1.6077352 1.5402021 1.1174430
hilarious terrific finest performances
1.1040564 1.0820637 0.9907954 0.9659045
refreshing great others breathtaking
0.8944540 0.7419333 0.6173077 0.5688596
wonderfully allows overall war
0.5686626 0.5611135 0.5484030 0.5313923
also world life flaws
0.5164209 0.4605382 0.4560559 0.4285162
Interestingly, there are a few there that would be negative indicators in most sentiment dictionaries, like “flaws” and “war”.
sort(log(colSums(dfmat_train))*coef(sentmod.lasso)[-1,1],dec=F)[1:20]
ridiculous wasted boring
-2.902012 -2.883715 -2.645762
ludicrous mess worst
-2.476602 -2.470577 -2.270073
poorly awful bad
-2.219431 -2.022392 -2.008826
waste lame supposed
-1.957973 -1.913403 -1.840086
embarrassing runtime tedious
-1.517301 -1.283180 -1.276298
dull unfunny nothing
-1.187166 -1.131177 -1.068824
laughable unfortunately
-1.047326 -1.030627
Both lists also have words that indicate a transition from a particular negative or positive aspect, followed by a holistic sentiment in the opposite direction. “The pace dragged at times, but overall it is an astonishing act of filmmaking.” “The performances are tremendous, but unfortunately the poor writing makes this movie fall flat.”
The elastic net estimates not just \(\lambda\) (the overall amount of regularization) but also \(\alpha\) (the relative weight of the L1 penalty versus the L2 penalty). In R, this can also be done with the glmnet package. I leave that as an exercise.
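For the curious, one way the exercise might go (a sketch – the alpha grid is arbitrary, and fixing foldid keeps the cross-validation fits comparable across alphas):
# Grid-search alpha, letting cv.glmnet pick lambda at each value
foldid <- sample(rep(1:5, length.out=ndoc(dfmat_train)))
alphas <- seq(0, 1, by=.25)
cvfits <- lapply(alphas, function(a)
  cv.glmnet(x=dfmat_train, y=docvars(dfmat_train)$Sentiment,
            family="binomial", alpha=a, foldid=foldid,
            type.measure="class"))
alphas[which.min(sapply(cvfits, function(f) min(f$cvm)))]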
We’ve got three sets of predictions now, so why don’t we try a simple ensemble in which our prediction for each review is based on a majority vote of the three. Sort of like a Rotten Tomatoes rating. They each learned slightly different things, so perhaps the whole is better than its parts.
predicted_class.ensemble3 <- rep("neg",length(actual_class))
num_predicted_pos3 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos")
predicted_class.ensemble3[num_predicted_pos3>1] <- "pos"
tab_class.ensemble3 <- table(actual_class,predicted_class.ensemble3)
tab_class.ensemble3
predicted_class.ensemble3
actual_class neg pos
neg 223 40
pos 41 196
Hey, that is better! Accuracy 83.8%!
confusionMatrix(tab_class.ensemble3, mode="everything")
Confusion Matrix and Statistics
predicted_class.ensemble3
actual_class neg pos
neg 223 40
pos 41 196
Accuracy : 0.838
95% CI : (0.8027, 0.8692)
No Information Rate : 0.528
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6751
Mcnemar's Test P-Value : 1
Sensitivity : 0.8447
Specificity : 0.8305
Pos Pred Value : 0.8479
Neg Pred Value : 0.8270
Precision : 0.8479
Recall : 0.8447
F1 : 0.8463
Prevalence : 0.5280
Detection Rate : 0.4460
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8376
'Positive' Class : neg
Without explaining SVM at all, let’s try a simple one.
library(e1071)
sentmod.svm <- svm(x=dfmat_train,
y=as.factor(docvars(dfmat_train)$Sentiment),
kernel="linear",
cost=10, # arbitrary regularization cost
probability=TRUE)
Ideally, we would tune the cost parameter via cross-validation or similar, as we did with \(\lambda\) above.
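A sketch of what that might look like with e1071’s tune.svm – the cost grid here is arbitrary, it assumes the dfm passes through to svm() as in the call above, and it is slow on the full feature set:
# 5-fold CV over a small grid of costs
svm.tune <- tune.svm(x=dfmat_train,
                     y=as.factor(docvars(dfmat_train)$Sentiment),
                     kernel="linear", cost=10^(-2:2),
                     tunecontrol=tune.control(cross=5))
svm.tune$best.parameters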
# actual_class <- docvars(dfmat_matched, "Sentiment")
predicted_class.svm <- predict(sentmod.svm, newdata=dfmat_matched)
tab_class.svm <- table(actual_class,predicted_class.svm)
tab_class.svm
predicted_class.svm
actual_class neg pos
neg 209 54
pos 29 208
That’s actually a bit better than the others, individually if not combined, with accuracy of .834, and a bias toward overpredicting positives.
confusionMatrix(tab_class.svm, mode="everything")
Confusion Matrix and Statistics
predicted_class.svm
actual_class neg pos
neg 209 54
pos 29 208
Accuracy : 0.834
95% CI : (0.7984, 0.8656)
No Information Rate : 0.524
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6688
Mcnemar's Test P-Value : 0.00843
Sensitivity : 0.8782
Specificity : 0.7939
Pos Pred Value : 0.7947
Neg Pred Value : 0.8776
Precision : 0.7947
Recall : 0.8782
F1 : 0.8343
Prevalence : 0.4760
Detection Rate : 0.4180
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8360
'Positive' Class : neg
For a linear kernel, we can back out interpretable coefficients. This is not true with nonlinear kernels such as the “radial basis function.”
# Recover feature weights as the weighted sum of the support vectors
beta.svm <- drop(t(sentmod.svm$coefs) %*% dfmat_train[sentmod.svm$index,])
(Note the signs are reversed from our expected pos-neg.)
plot(colSums(dfmat_train), -beta.svm, pch=19, col=rgb(0,0,0,.3), cex=.5,
     log="x",
     main="Support Vector Machine Coefficients (Linear Kernel), IMDB",
     ylab="<--- Negative Reviews --- Positive Reviews --->",
     xlab="Total Appearances", xlim=c(1,50000))
text(colSums(dfmat_train), -beta.svm, colnames(dfmat_train), pos=4,
     cex=10*abs(beta.svm), col=rgb(0,0,0,5*abs(beta.svm)))
sort(-beta.svm,dec=T)[1:20]
excellent perfectly job seen
0.10159567 0.08328699 0.08199813 0.07928320
jackie great well laughs
0.07850642 0.07773142 0.07410808 0.07354421
life american sherri takes
0.07060505 0.06731434 0.06718588 0.06656558
together war son hopkins
0.06613371 0.06551154 0.06545866 0.06507787
terrific performances fun twister
0.06477875 0.06422575 0.06368690 0.06119385
sort(-beta.svm,dec=F)[1:20]
waste bad nothing
-0.13844821 -0.11973782 -0.11914619
only unfortunately boring
-0.11092148 -0.11052668 -0.10550711
poor anyway script
-0.09114519 -0.09077601 -0.08812765
any worst awful
-0.08732612 -0.08544895 -0.08318521
maybe plot horrendous
-0.07819739 -0.07793267 -0.07724877
should supposed extraordinarily
-0.07511486 -0.07394538 -0.07042833
looks mess
-0.06942303 -0.06895524
Looks a bit overfit to me, and I would probably regularize more heavily (a smaller cost parameter) in further iterations.
library(randomForest)
The random forest is a very computationally intensive algorithm, so I will cut the number of features way down just so this can run in a reasonable amount of time.
dfmat.rf <- corpus %>%
dfm() %>%
dfm_trim(min_docfreq=50,max_docfreq=300,verbose=TRUE)
dfmatrix.rf <- as.matrix(dfmat.rf)
set.seed(1234)
sentmod.rf <- randomForest(dfmatrix.rf[id_train,],
y=as.factor(docvars(dfmat.rf)$Sentiment)[id_train],
xtest=dfmatrix.rf[id_test,],
ytest=as.factor(docvars(dfmat.rf)$Sentiment)[id_test],
importance=TRUE,
mtry=20,
ntree=100
)
#sentmod.rf
predicted_class.rf <- sentmod.rf$test[['predicted']]
tab_class.rf <- table(actual_class,predicted_class.rf)
confusionMatrix(tab_class.rf, mode="everything")
Confusion Matrix and Statistics
predicted_class.rf
actual_class neg pos
neg 187 76
pos 36 201
Accuracy : 0.776
95% CI : (0.7369, 0.8118)
No Information Rate : 0.554
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5545
Mcnemar's Test P-Value : 0.0002286
Sensitivity : 0.8386
Specificity : 0.7256
Pos Pred Value : 0.7110
Neg Pred Value : 0.8481
Precision : 0.7110
Recall : 0.8386
F1 : 0.7695
Prevalence : 0.4460
Detection Rate : 0.3740
Detection Prevalence : 0.5260
Balanced Accuracy : 0.7821
'Positive' Class : neg
That did a bit worse – Accuracy .776 – but we did give it considerably less information.
Getting marginal effects from a random forest model requires more finesse than I’m willing to apply here. We can get the “importance” of the different features, but this alone does not tell us in what direction the feature pushes the predictions.
varImpPlot(sentmod.rf)
Some usual suspects there, but we need our brains to fill in which ones are positive and negative. Some are ambiguous (“town”) and some we have seen are subtle (“overall”) or likely to be misleading (“war”).
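One crude sketch for recovering direction: sign each feature’s importance by whether it appears more, on average, in positive or negative training reviews.
# Sign importances by the difference in mean counts across classes (crude)
y_train <- as.factor(docvars(dfmat.rf)$Sentiment)[id_train]
mean_diff <- colMeans(dfmatrix.rf[id_train,][y_train=="pos",]) -
  colMeans(dfmatrix.rf[id_train,][y_train=="neg",])
imp <- importance(sentmod.rf)[,"MeanDecreaseAccuracy"]
sort(sign(mean_diff)*imp, dec=TRUE)[1:10] # most important positive-leaning features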
Now we’ve got five classifiers, so let’s ensemble those.
predicted_class.ensemble5 <- rep("neg",length(actual_class))
num_predicted_pos5 <- 1*(predicted_class=="pos") + 1*(predicted_class.ridge=="pos") + 1*(predicted_class.lasso=="pos") +
1*(predicted_class.svm=="pos") +
1*(predicted_class.rf=="pos")
predicted_class.ensemble5[num_predicted_pos5>2] <- "pos"
tab_class.ensemble5 <- table(actual_class,predicted_class.ensemble5)
tab_class.ensemble5
predicted_class.ensemble5
actual_class neg pos
neg 220 43
pos 28 209
confusionMatrix(tab_class.ensemble5,mode="everything")
Confusion Matrix and Statistics
predicted_class.ensemble5
actual_class neg pos
neg 220 43
pos 28 209
Accuracy : 0.858
95% CI : (0.8243, 0.8874)
No Information Rate : 0.504
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7161
Mcnemar's Test P-Value : 0.09661
Sensitivity : 0.8871
Specificity : 0.8294
Pos Pred Value : 0.8365
Neg Pred Value : 0.8819
Precision : 0.8365
Recall : 0.8871
F1 : 0.8611
Prevalence : 0.4960
Detection Rate : 0.4400
Detection Prevalence : 0.5260
Balanced Accuracy : 0.8582
'Positive' Class : neg
And like magic, now we’re up to 85.8% Accuracy in the test set.