THIS IS INCOMPLETE AND HAS A FEW VERY LONG RUN TIME COMMANDS. PROCEED WITH CAUTION.
This tutorial uses the example from the stm vignette as a starting point.
# install the following packages as necessary
library(lda)
library(slam)
library(stm)
stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
For this example, we will use a variant of the “PoliBlogs08” data set (Eisenstein and Xing 2010), an example used in the stm vignette. The original data consists of 13246 blogposts scraped from 6 blogs, along with a “rating” as “Conservative” or “Liberal.”
poliblogsFull <- read.csv("poliblogs2008.csv", colClasses = c("character", "character", "character", "factor", "integer", "factor"))
summary(poliblogsFull)
      X              documents           docname
 Length:13246       Length:13246       Length:13246
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
          rating          day           blog
 Conservative:7582   Min.   :  1.0   at :3197
 Liberal     :5664   1st Qu.:110.0   db :1879
                     Median :201.0   ha :3708
                     Mean   :193.5   mm : 677
                     3rd Qu.:277.0   tp :2080
                     Max.   :366.0   tpm:1705
For exposition purposes later, it will help to note some basics here. The day variable indicates the day of 2008 on which the post was written, running from 1 to 366 since 2008 was a leap year. The blog variable indicates which of the following six blogs posted the text for that observation: at (American Thinker), db (Digby), ha (Hot Air), mm (Michelle Malkin), tp (Think Progress), and tpm (Talking Points Memo).
Variants of this example are available in different places, for STM as well as LDA, SAGE, and a variety of other topic models, so it’s easy to get confused. There are a variety of prefit STM models for this data that you can download from http://www.princeton.edu/~bms4/VignetteObjects.RData. These are stored in poliblogPrevFit, poliblogContent, and poliblogInteraction, along with a prerun STM model selection object stored in poliblogSelect. They are run on the object out that is created on page 9 of the stm vignette, which includes 13246 documents and 9244 terms (the terms that appear in more than 15 documents). We will avoid reusing these variable names (running commands directly from the provided vignette will do so).
Also provided in the package are the stm input objects for a sample of 5000 blog posts, which we will use here for illustration purposes. These objects are stored in poliblog5k.docs (word counts in stm format), poliblog5k.voc (the vocabulary), and poliblog5k.meta (the metadata).
The meta object also contains a small snippet - up to 50 characters - of the text in a (confusing) list column, poliblog5k.meta$text. We will create our own slightly larger snippets - 200 characters - for use in diagnostics later.
Let’s take a look at the first three documents in all three of these formats:
poliblog5k.sample <- as.integer(rownames(poliblog5k.meta))
poliblog5k.fulltext <- poliblogsFull$documents[poliblog5k.sample]
poliblog5k.shorttext <- substr(poliblog5k.fulltext,1,200)
poliblog5k.meta$text[1:3]
[[1]]
[1] "How happy do you think Team Barry is that, thanks"
[[2]]
[1] "Tough stuff, but conservative passions this year"
[[3]]
[1] "Epic Ideological Failby digbyBefore you listen to"
poliblog5k.shorttext[1:3]
[1] "How happy do you think Team Barry is that, thanks to Clark, the big soundbite from Obama’s latest peroration is going to be his tribute to McCain’s patriotism instead of his own? Here’s the clip, in "
[2] "Tough stuff, but conservative passions this year have always been more anti-Obama than pro-McCain (at least until Palin joined the ticket) so I’m curious what you guys think. Nate Silver, a lefty but"
[3] "Epic Ideological Failby digbyBefore you listen to one more wingnut try to tell you that the financial crisis and global recession was caused by irresponsible black people, illegal immigrants and Fanni"
poliblog5k.fulltext[1:3]
[1] "How happy do you think Team Barry is that, thanks to Clark, the big soundbite from Obama’s latest peroration is going to be his tribute to McCain’s patriotism instead of his own? Here’s the clip, in which Clark isn’t named but is clearly enough the target. Today’s disquisition follows in the grand Obama tradition of dressing up what is, essentially, a “Checkers” speech designed to get him out of a jam politically as some meditation on the American character for the leg-tingling pleasure of media fanboys across the spectrum. I wonder if any part of this one will be repudiated at a later date when it becomes inconvenient.Hotline has the full transcript; the part on service is down near the bottom, after the salute to dissidents and his slap at MoveOn.org for the “General Betray Us” smear, which, you’ll recall, His Holiness couldn’t be bothered to condemn in a senate resolution given that he was in the thick of a primary fight at the time. The takeaway: “I will never question the patriotism of others in this campaign.” Except, of course, he already has. He acknowledges at one point in the speech that part of the reason people question his patriotism is due to his own “carelessness,” an oblique reference to his choice to stop wearing the stupid flag pin. What he doesn’t acknowledge is the reason he stopped wearing it — namely, to draw a distinction between “true” and “false” patriots as he defines them. He was happy to play patriot games (just like he was happy with his church) until he got to the general election and realized they may not be a winner for him, which is why that pin’s clearly visible on his lapel in the video below. Out: “New politics.” In: Machiavellian cunning!The one other interesting line, near the very end, was spotted by DrewM Equality of opportunity, not equality of results: Consider that another general-election pander from candidate Obama on which President Obama might differ."
[2] "Tough stuff, but conservative passions this year have always been more anti-Obama than pro-McCain (at least until Palin joined the ticket) so I’m curious what you guys think. Nate Silver, a lefty but one who usually plays it straight in his poll analyses, gives Maverick a five percent chance at this point and only then if he gives up on Pennsylvania and starts targeting New Hampshire and New Mexico. Compare that to the thin spreads in various Senate races (as compiled at Silver’s FiveThirtyEight site) that the GOP desperately needs to win to preserve the filibuster: Mitch McConnell, Saxby Chambliss, and Roger Wicker are all clinging to leads of just a few points while Norm Coleman, Liddy Dole, Gordon Smith, and Ted Stevens trail narrowly. Every last one of them’s an incumbent. If the RNC pulls the plug on McCain, they could shower those seven with cash for the last week and try to put them over the top. Or, alternatively, they could stick with Maverick and hope for the best. How lucky do you feel?The stakes according to Frum:He goes so far as to suggest that Senate candidates concede the likelihood of Obama’s victory and run on the sort of divided government platform McCain himself intends to push this week. Exit question: You’re the chairman of the RNC and your phone’s ringing off the hook with demands for money. What do you do? After you print up a few million copies of Treacher’s post and mail it to Republicans, I mean.Update (Ed): What do I do? I do basic math. The Republicans are defending 23 seats in the Senate, and the Democrats 13. There’s no way on God’s green Earth that the GOP will have enough seats to block the Democratic agenda no matter how much the RNC spends; they’ll be lucky to get 43 seats, and they can’t spend the next two years filibustering everything if they plan to win seats back in 2010. They’re better off spending the money on McCain — his odds are much better than the Senate Republicans."
[3] "Epic Ideological Failby digbyBefore you listen to one more wingnut try to tell you that the financial crisis and global recession was caused by irresponsible black people, illegal immigrants and Fannienfreddie, take this short, simple trip down memory lane with economist Joseph Stiglitz in the latest issue of Vanity Fair. It concludes with this:The truth is most of the individual mistakes boil down to just one: a belief that markets are self-adjusting and that the role of government should be minimal. Looking back at that belief during hearings this fall on Capitol Hill, Alan Greenspan said out loud, “I have found a flaw.” Congressman Henry Waxman pushed him, responding, “In other words, you found that your view of the world, your ideology, was not right; it was not working.” “Absolutely, precisely,” Greenspan said. The embrace by America—and much of the rest of the world—of this flawed economic philosophy made it inevitable that we would eventually arrive at the place we are today.Democrats are working very hard to discredit the very concept of ideology in favor of technocratic competence. And I would guess most Americans find that to be something of a relief by now. But I think it's as much a mistake to sweep this under the rug as it is to let bygones be bygones on the torture regime. There is ideology and then there is ideology and people should know the difference. These dogmatic deregulators and market fundamentalists ran a decades long experiment that failed on an epic scale. If the country doesn't understand what went wrong here -- if they get confused by complexity and propaganda --- there is every reason that the free lunch mentality these ideologues promoted will make a comeback the minute we see the light at the end of the tunnel. Ideology matters.Update: Some of us have talking about this for a long, long time.digby 12/10/2008 08:00:00 PM LinktoComments('8559403618191707814')postCount('8559403618191707814'); | postCountTB('8559403618191707814');"
Note from the full text, especially the third one there from the “Digby” blog, that these were not parsed very carefully (by the original researchers in 2008). Most will tell you not to worry about that. They’re wrong.
The stm package converts a vector of text and a dataframe of metadata into stm formatted objects using the command textProcessor, which calls the package tm for its preprocessing routines.
# Arguments marked #* are set to their default values
poliblog5k.proc <- textProcessor(documents=poliblog5k.fulltext,
metadata = poliblog5k.meta,
lowercase = TRUE, #*
removestopwords = TRUE, #*
removenumbers = TRUE, #*
removepunctuation = TRUE, #*
stem = TRUE, #*
wordLengths = c(3,Inf), #*
sparselevel = 1, #*
language = "en", #*
verbose = TRUE, #*
onlycharacter = TRUE, # not the default (default is FALSE)
striphtml = FALSE, #*
customstopwords = NULL, #*
v1 = FALSE) #*
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Stemming...
provided 5000 variables to replace 1 variables
Creating Output...
The processed object is a list of four objects: documents, vocab, meta, and docs.removed. The documents object is a list, one element per document, of 2-row matrices; the first row gives the vocabulary indices of the words found in the document, and the second row gives their (nonzero) counts. If preprocessing causes any documents to be empty, they are removed, as are the corresponding rows of the meta object.
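To make the layout of the documents object concrete, here is a minimal base R sketch that builds the same kind of structure from a tiny made-up count matrix. The data, the vocabulary, and the helper name to_stm_docs are all hypothetical; this illustrates the format described above and is not how textProcessor is implemented.

```r
# Toy document-term counts: 3 documents, 4 vocabulary terms (made-up data).
vocab <- c("campaign", "iraq", "obama", "tax")
counts <- matrix(c(2, 0, 1, 0,
                   0, 3, 0, 0,
                   0, 0, 0, 0),   # third document is empty after preprocessing
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), vocab))

# Drop empty documents first (the matching metadata rows would be dropped too).
keep <- rowSums(counts) > 0
counts <- counts[keep, , drop = FALSE]

# Hypothetical helper: one 2-row integer matrix per document.
# Row 1 holds vocabulary indices, row 2 holds the corresponding nonzero counts.
to_stm_docs <- function(m) {
  lapply(seq_len(nrow(m)), function(i) {
    idx <- which(m[i, ] > 0)
    matrix(as.integer(c(idx, m[i, idx])), nrow = 2, byrow = TRUE)
  })
}

docs <- to_stm_docs(counts)
docs[[1]]  # indices 1 and 3 ("campaign", "obama") with counts 2 and 1
```

Each column of a document's matrix is one (term index, count) pair, so the list stays compact even when the vocabulary is large.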
These objects are in turn passed to the prepDocuments function, which filters the vocabulary and again removes empty documents and the corresponding rows in the metadata. The authors of stm say it struggles with extremely large vocabularies, and in the vignette example they filter to under 10000 terms by eliminating terms that don’t appear in more than 15 documents. The data objects provided in the package for poliblog5k seem to filter out terms that don’t appear in more than 50 documents, leaving about 2600-2800 terms.
poliblog5k.out <- prepDocuments(poliblog5k.proc$documents, poliblog5k.proc$vocab, poliblog5k.proc$meta, lower.thresh=50)
Removing 27200 of 29917 terms (148730 of 850617 tokens) due to frequency
Your corpus now has 5000 documents, 2717 terms and 701887 tokens.
(The number of terms, 2717, is more than the 2632 in the provided poliblog5k.voc. The mismatches seem to be in words with punctuation in the middle – like “re-elect” or “you’re” – and I can’t seem to make them match exactly with textProcessor options.)
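The document-frequency filter itself is simple to picture. The following base R sketch applies the same kind of rule (keep a term only if it appears in more than a threshold number of documents) to a made-up count matrix. The data and the names lower_thresh, doc_freq, and trimmed are hypothetical, and the sketch illustrates the rule rather than the internals of prepDocuments.

```r
# Made-up document-term counts: 4 documents, 3 terms.
counts <- matrix(c(1, 0, 2,
                   1, 0, 1,
                   3, 1, 0,
                   0, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("obama", "rezko", "tax")))

lower_thresh <- 2  # hypothetical threshold; the tutorial uses 50

# Document frequency: the number of documents in which each term appears.
doc_freq <- colSums(counts > 0)

# Keep only terms that appear in MORE than lower_thresh documents.
trimmed <- counts[, doc_freq > lower_thresh, drop = FALSE]

# Removing terms can leave a document empty, so drop those rows as well.
trimmed <- trimmed[rowSums(trimmed) > 0, , drop = FALSE]
dim(trimmed)
```

Note that the comparison is strict, which is why an inclusive minimum count of 51 corresponds to "more than 50 documents."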
We’ve mostly read in and processed data with quanteda, so it’s worth noting that you can do that and then use the convert function to convert to stm and a variety of other formats.
library(quanteda)
Package version: 1.4.1
Parallel computing: 2 of 8 threads used.
See https://quanteda.io for tutorials and examples.
Attaching package: ‘quanteda’
The following object is masked from ‘package:utils’:
View
poliblog5k.fullmeta <- data.frame(doc_id=rownames(poliblog5k.meta), poliblog5k.meta, shorttext=poliblog5k.shorttext, fulltext=poliblog5k.fulltext, stringsAsFactors=FALSE)
poliblog5k.corpus <- quanteda::corpus(poliblog5k.fullmeta, docid_field="doc_id",text_field="fulltext")
poliblog5k.dfm <- quanteda::dfm(poliblog5k.corpus,
tolower=TRUE,
stem=TRUE,
remove=stopwords("english"),
remove_numbers=TRUE,
remove_punct=TRUE,
remove_symbols=TRUE,
ngrams=1)
dim(poliblog5k.dfm)
[1] 5000 50853
Again, let’s trim that to words appearing in more than 50 documents.
poliblog5k.dfm <- dfm_trim(poliblog5k.dfm, min_docfreq=51, docfreq_type="count")
dim(poliblog5k.dfm)
[1] 5000 2659
Yet another slightly different count.
In any case, we convert to stm format using
poliblog5k.dfm2stm <- quanteda::convert(poliblog5k.dfm, to = "stm")
names(poliblog5k.dfm2stm)
[1] "documents" "vocab" "meta"
Let’s start by running this as a topic model without structure, that is, without using any of the metadata as covariates. It’s not exact, but this is very similar to the SAGE (Eisenstein et al.) sparse estimation of a model with a correlated topic model (CTM) generative process (Blei et al.).
# Spectral initialization is advised by the authors
# Should replicate exactly under spectral initialization
#
# This takes about 30 seconds.
poliblog5k.fit.nometa <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral")
Beginning Spectral Initialization
Calculating the gram matrix...
Finding anchor words...
....................
Recovering initialization...
..........................
Initialization complete.
....................................................................................................
Completed E-Step (4 seconds).
Completed M-Step.
Completing Iteration 1 (approx. per word bound = -7.057)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 2 (approx. per word bound = -6.964, relative change = 1.323e-02)
....................................................................................................
Completed E-Step (4 seconds).
Completed M-Step.
Completing Iteration 3 (approx. per word bound = -6.935, relative change = 4.205e-03)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 4 (approx. per word bound = -6.921, relative change = 1.924e-03)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 5 (approx. per word bound = -6.914, relative change = 1.062e-03)
Topic 1: legisl, bill, presid, court, law
Topic 2: democrat, republican, parti, conserv, gop
Topic 3: think, peopl, like, know, one
Topic 4: health, school, abort, care, women
Topic 5: get, one, can, show, will
Topic 6: obama, barack, campaign, polit, wright
Topic 7: race, campaign, franken, vote, senat
Topic 8: iran, israel, attack, will, terrorist
Topic 9: obama, senat, said, biden, joe
Topic 10: will, american, tax, economi, energi
Topic 11: obama, mccain, poll, state, campaign
Topic 12: iraq, war, iraqi, troop, militari
Topic 13: one, time, report, year, new
Topic 14: bush, presid, said, news, white
Topic 15: mccain, john, campaign, palin, said
Topic 16: world, nation, will, russia, war
Topic 17: elect, vote, voter, state, immigr
Topic 18: investig, report, senat, state, case
Topic 19: tax, billion, money, hous, govern
Topic 20: hillari, clinton, obama, will, primari
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 6 (approx. per word bound = -6.910, relative change = 6.441e-04)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 7 (approx. per word bound = -6.907, relative change = 4.201e-04)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 8 (approx. per word bound = -6.905, relative change = 2.849e-04)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 9 (approx. per word bound = -6.903, relative change = 1.956e-04)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 10 (approx. per word bound = -6.902, relative change = 1.361e-04)
Topic 1: court, legisl, law, bill, presid
Topic 2: democrat, republican, parti, conserv, elect
Topic 3: think, peopl, like, know, say
Topic 4: school, women, health, care, abort
Topic 5: get, one, can, show, don
Topic 6: obama, barack, campaign, polit, wright
Topic 7: race, campaign, senat, franken, gop
Topic 8: iran, israel, attack, terrorist, will
Topic 9: obama, senat, said, biden, joe
Topic 10: will, american, tax, economi, energi
Topic 11: obama, mccain, poll, state, voter
Topic 12: iraq, war, iraqi, militari, troop
Topic 13: time, report, new, stori, one
Topic 14: bush, presid, said, news, white
Topic 15: mccain, palin, john, campaign, sarah
Topic 16: world, nation, will, countri, america
Topic 17: elect, vote, voter, state, immigr
Topic 18: report, investig, offici, senat, offic
Topic 19: money, financi, billion, govern, million
Topic 20: hillari, clinton, will, primari, obama
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 11 (approx. per word bound = -6.902, relative change = 9.636e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 12 (approx. per word bound = -6.901, relative change = 6.891e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 13 (approx. per word bound = -6.901, relative change = 4.963e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 14 (approx. per word bound = -6.901, relative change = 3.610e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 15 (approx. per word bound = -6.901, relative change = 2.613e-05)
Topic 1: court, law, legisl, bill, tortur
Topic 2: democrat, republican, parti, elect, will
Topic 3: think, peopl, like, know, say
Topic 4: women, school, children, care, health
Topic 5: get, one, show, don, can
Topic 6: obama, barack, campaign, polit, wright
Topic 7: race, senat, campaign, gop, rep
Topic 8: iran, israel, attack, terrorist, will
Topic 9: obama, senat, said, biden, joe
Topic 10: will, tax, american, economi, energi
Topic 11: obama, mccain, poll, state, voter
Topic 12: iraq, war, militari, iraqi, troop
Topic 13: time, report, new, stori, one
Topic 14: bush, presid, said, news, white
Topic 15: mccain, palin, john, campaign, sarah
Topic 16: world, nation, will, america, countri
Topic 17: elect, vote, voter, state, immigr
Topic 18: report, investig, offici, offic, depart
Topic 19: money, financi, million, govern, billion
Topic 20: hillari, clinton, will, primari, campaign
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 16 (approx. per word bound = -6.900, relative change = 1.869e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 17 (approx. per word bound = -6.900, relative change = 1.369e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Completing Iteration 18 (approx. per word bound = -6.900, relative change = 1.002e-05)
....................................................................................................
Completed E-Step (3 seconds).
Completed M-Step.
Model Converged
## suppress all this output with verbose=FALSE
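The log above reports an approximate per-word bound at each iteration together with its relative change, and the run ends once that change becomes small. Here is a minimal sketch of such a stopping rule, assuming the relative change is computed as (new - old) / |old| and compared against a tolerance of 1e-5. The names bound, rel_change, and tol_assumed are hypothetical, and the bounds are the rounded values printed in the log, so the results only approximately match the printed relative changes.

```r
# Per-word bounds copied (rounded to 3 decimals) from iterations 1-5 of the log.
bound <- c(-7.057, -6.964, -6.935, -6.921, -6.914)

# Relative change between consecutive iterations: (new - old) / |old|.
old <- bound[-length(bound)]
new <- bound[-1]
rel_change <- (new - old) / abs(old)

tol_assumed <- 1e-5  # assumed convergence tolerance
converged <- rel_change < tol_assumed

signif(rel_change, 3)
any(converged)  # none of these early iterations meets the tolerance
```

In the log, the relative change is still about 1.002e-05 at iteration 18, and the model is reported as converged on the following iteration.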
We can get a detailed overview of key terms for every topic (or some) using the labelTopics command. This command provides the top words according to four different statistics: “highest probability” (just \(\beta\)), “FREX”, “Lift”, and “Score.” FREX ranks words by a combination of their frequency in the topic and their exclusivity to it; it is very similar to PMI, and I typically find it the most helpful.
labelTopics(poliblog5k.fit.nometa)
OMP: Warning #96: Cannot form a team with 8 threads, using 2 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
Topic 1 Top Words:
Highest Prob: court, law, legisl, tortur, bill, presid, bush
FREX: legisl, tortur, court, constitut, law, detaine, suprem
Lift: legisl, surveil, detaine, judici, interrog, detent, guantanamo
Score: legisl, tortur, court, detaine, law, interrog, cia
Topic 2 Top Words:
Highest Prob: democrat, republican, parti, elect, will, conserv, vote
FREX: republican, parti, democrat, conserv, pelosi, gop, progress
Lift: lean, pelosi, centrist, nanci, republican, speaker, boehner
Score: lean, republican, democrat, parti, gop, conserv, vote
Topic 3 Top Words:
Highest Prob: think, peopl, like, say, know, one, just
FREX: thing, linktocommentspostcount, postcounttb, guy, think, realli, someth
Lift: sorri, digbyi, idiot, linktocommentspostcount, postcounttb, digbi, dday
Score: sorri, think, linktocommentspostcount, postcounttb, guy, know, thing
Topic 4 Top Words:
Highest Prob: women, school, children, care, abort, life, educ
FREX: abort, school, children, gay, women, religi, educ
Lift: stem, abort, gay, religi, birth, teach, cathol
Score: stem, abort, school, church, women, gay, children
Topic 5 Top Words:
Highest Prob: get, one, show, don, can, like, doesn
FREX: film, doesn, isn, video, eastern, didn, don
Lift: chat, seedubya, film, hollywood, clip, movi, scroll
Score: chat, film, don, movi, eastern, didn, doesn
Topic 6 Top Words:
Highest Prob: obama, barack, campaign, polit, wright, black, ayer
FREX: wright, ayer, barack, obama, black, chicago, rezko
Lift: rezko, wright, ayer, jeremiah, reverend, racist, racism
Score: rezko, obama, wright, barack, ayer, campaign, jeremiah
Topic 7 Top Words:
Highest Prob: race, senat, gop, campaign, rep, new, dem
FREX: franken, coleman, rep, smith, minnesota, dem, race
Lift: franken, coleman, smith, minnesota, mitch, mcconnel, norm
Score: franken, coleman, dem, ballot, gop, minnesota, rep
Topic 8 Top Words:
Highest Prob: iran, israel, attack, terrorist, will, nuclear, iranian
FREX: israel, iran, isra, hama, iranian, terrorist, palestinian
Lift: gaza, hama, isra, palestinian, israel, jew, osama
Score: hama, iran, israel, isra, iranian, palestinian, nuclear
Topic 9 Top Words:
Highest Prob: obama, senat, said, biden, debat, joe, polici
FREX: biden, joe, debat, lieberman, senat, foreign, polici
Lift: lieberman, biden, joe, plumber, graham, debat, readi
Score: lieberman, obama, biden, senat, joe, debat, foreign
Topic 10 Top Words:
Highest Prob: will, tax, american, economi, energi, oil, year
FREX: oil, energi, tax, economi, price, drill, econom
Lift: coal, drill, unemploy, gas, offshor, oil, effici
Score: coal, tax, oil, economi, energi, drill, price
Topic 11 Top Words:
Highest Prob: obama, poll, mccain, state, voter, lead, percent
FREX: poll, pennsylvania, percent, virginia, margin, ralli, ohio
Lift: gallup, undecid, rasmussen, pollster, pennsylvania, poll, colorado
Score: gallup, poll, obama, mccain, voter, rasmussen, percent
Topic 12 Top Words:
Highest Prob: iraq, war, militari, troop, iraqi, forc, afghanistan
FREX: iraqi, troop, iraq, afghanistan, pentagon, armi, withdraw
Lift: sadr, maliki, almaliki, basra, baghdad, petraeus, militia
Score: sadr, iraq, iraqi, troop, militari, afghanistan, taliban
Topic 13 Top Words:
Highest Prob: report, time, new, stori, year, one, york
FREX: warm, global, publish, newspap, editor, stori, dead
Lift: ice, scientist, warm, flight, editor, newspap, publish
Score: ice, global, warm, stori, polic, report, scientist
Topic 14 Top Words:
Highest Prob: bush, presid, said, news, white, administr, hous
FREX: fox, rove, bush, cheney, white, interview, watch
Lift: rove, olbermann, flv, karl, cheney, fox, keith
Score: rove, bush, cheney, fox, presid, administr, rice
Topic 15 Top Words:
Highest Prob: mccain, palin, john, campaign, sarah, attack, say
FREX: palin, mccain, sarah, john, alaska, clark, romney
Lift: clark, palin, sarah, mccain, giuliani, rudi, pow
Score: clark, mccain, palin, sarah, romney, john, campaign
Topic 16 Top Words:
Highest Prob: world, nation, will, america, countri, war, state
FREX: russia, russian, world, georgia, democraci, china, europ
Lift: russian, russia, soviet, germani, europ, german, korea
Score: russian, russia, world, georgia, soviet, nato, china
Topic 17 Top Words:
Highest Prob: elect, vote, voter, state, immigr, group, organ
FREX: immigr, acorn, illeg, fraud, union, registr, counti
Lift: registr, acorn, immigr, alien, fraud, licens, counti
Score: registr, acorn, voter, immigr, fraud, vote, illeg
Topic 18 Top Words:
Highest Prob: report, investig, offici, offic, depart, case, state
FREX: investig, blagojevich, attorney, depart, staff, appoint, prosecut
Lift: blagojevich, rod, fbi, investig, probe, indict, prosecut
Score: blagojevich, investig, attorney, depart, prosecut, fbi, prosecutor
Topic 19 Top Words:
Highest Prob: money, financi, million, govern, billion, hous, market
FREX: financi, bailout, mortgag, loan, earmark, taxpay, billion
Lift: earmark, paulson, freddi, mae, subprim, treasuri, bailout
Score: earmark, mortgag, bailout, billion, loan, paulson, market
Topic 20 Top Words:
Highest Prob: hillari, clinton, primari, will, campaign, win, democrat
FREX: hillari, clinton, deleg, primari, edward, nomin, michigan
Lift: deleg, superdeleg, hillari, dnc, edward, clinton, super
Score: deleg, hillari, clinton, superdeleg, primari, romney, edward
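To see why FREX can surface different words than raw probability, here is a simplified base R sketch of a FREX-style score: a weighted harmonic mean of a word's within-topic frequency rank and its exclusivity rank. The matrix beta, the function name frex_score, and the weight w are hypothetical, and this is an illustration of the idea rather than the exact computation in the stm package.

```r
# Made-up topic-word probabilities: 2 topics (rows) x 4 words (columns).
beta <- matrix(c(0.40, 0.30, 0.20, 0.10,
                 0.40, 0.05, 0.05, 0.50),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("will", "wright", "ayer", "poll")))

frex_score <- function(beta, w = 0.5) {
  # Exclusivity: the share of a word's total probability held by each topic.
  excl <- sweep(beta, 2, colSums(beta), "/")
  # Convert both quantities to within-topic ranks scaled to (0, 1].
  freq_rank <- t(apply(beta, 1, rank)) / ncol(beta)
  excl_rank <- t(apply(excl, 1, rank)) / ncol(beta)
  # Weighted harmonic mean of the two rank scores.
  1 / (w / excl_rank + (1 - w) / freq_rank)
}

scores <- frex_score(beta)

# Top word for topic 1 by raw probability vs. by the FREX-style score.
names(which.max(beta["topic1", ]))
names(which.max(scores["topic1", ]))
```

In this toy example the most probable word in topic 1 is "will," which is equally probable in topic 2, while the FREX-style score prefers "wright," which is both frequent in topic 1 and nearly exclusive to it. The same pattern shows up in the real output, where generic stems like "will" appear under Highest Prob but not under FREX.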
Going through these, we can make rough guesses at what each topic is about.
Some of these look a bit “undercooked,” meaning a model with more topics might have separated them (e.g., 4 and 13). Some may be spurious (e.g., 9, which may be about Senators, may be about vice presidential possibilities, may be about the Democratic primary, or may just be triggering on correlations with the word “Joe”). Some of these appear at first blush to be “junk” (e.g., 3 and 5).
We can get an overview of the distribution of these topics by plotting the fit object:
plot(poliblog5k.fit.nometa)
We can find the top documents associated with a topic with the findThoughts function:
findThoughts(poliblog5k.fit.nometa,texts = poliblog5k.fulltext, n = 2, topics = c(6))
Topic 6:
The New York Sun is reporting that Barack Obama repudiated the views of Nation of Islam leader Louis Farrakhan that were discussed in Richard Cohen's Washington Post column. Cohen's criticism regarding Obama's ties to the Church and the Pastor that gave an award to Farrakhan were reaching a large audience that included potential Democrat voters who might be swayed to withdraw support from Obama. This statement by Obama is a political maneuver that should be given little credence. Obama is very actively involved in his church; he knew of this award long before Richard Cohen publicized its grant to Farrakhan. Furthermore, Pastor Wright has had a long relationship and alliance with Louis Farrakhan. Obama did not object to these ties between Pastor Wright and Farrakhan before; nor has Obama rejected the anti-Israel diatribes of Wright. Regardless, Obama adheres to a church and a minister that have long espoused positions inimical to the American-Israel relationship, let alone the trumpeting of black values and racial exclusiveness. This follows a pattern for Obama: he shows extreme loyalty to a church and pastor whose controversial views eventually become publicized. Then Obama "disappears" the Minister and Obama's campaign (not Obama himself) issues a statement that Obama does not agree with everything that Wright espouses. He solicits and gains support from the controversial George Soros, a man whose anti-Israel passions and allegations regarding America's Jewish community and Congress are well-known. When these ties become publicized, Obama's campaign (not Obama himself) issues a statement that Obama does not agree with Soros on this topic. 
When Obama articulates anti-Israel positions in off-the cuff remarks, his campaign (not Obama himself-stop me if you have heard this before) issues clarifications that attempt to explain away the plain English import of Obama's (the supreme orator) expressed views.In other words, Obama only disavows when it is politically opportune to do so. He seems to have never objected to these views before they become publicized and create a political firestorm because they belie his image of peace, compassion, unity. Obama is not a profile in courage and his disavowals are political pabulum.For a review of Obama's troubling stance toward Israel, see my article today, "Barack Obama and Israel."
How far can 527 groups go before they create more of a backlash than forward momentum? Opposing 527s have unveiled ads that may answer that question. A pro-Obama 527 makes an issue of McCain’s health and Sarah Palin’s inexperience, while a pro-McCain group talks about Obama’s connections with William Ayers, Tony Rezko, and Jeremiah Wright.Let’s give the opposition first shot:By comparison, this ad is much more mild — more fact-based, for one thing, and even understated. Obama didn’t just “associate” with William Ayers, he worked with Ayers for years at the Chicago Annenberg Challenge and the Woods Fund. Rezko raised over $250,000 for Obama, who lied twice about the funding during the course of this campaign. The most controversial part of the ad will be the Jeremiah Wright link. Obama already threw his former pastor under the bus and quit Trinity United Church of Christ, and his supporters will argue that this is old news. Still, Obama sat in his church for over 20 years while Wright offered his radical, conspiratorial theories on race and American politics, and Obama didn’t break those ties until Wright suggested that Obama was just playing politics by publicly rebuking him.JCW decided to take the high road in its approach. National Nurses Organizing Committee took a decidedly different approach.
These both appear to be conservative discussions of Obama, particularly of his associations with controversial people like Ayers and Wright. So that looks ok.
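Conceptually, retrieving the top documents for a topic amounts to sorting documents by their estimated proportion of that topic. Here is a minimal base R sketch of that idea on made-up data. The matrix theta, the vector texts, and the helper top_docs are hypothetical; it illustrates the ranking step and is not the findThoughts implementation.

```r
# Made-up document-topic proportions: 4 documents x 2 topics; rows sum to 1.
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4,
                  0.3, 0.7),
                nrow = 4, byrow = TRUE)
texts <- c("doc A", "doc B", "doc C", "doc D")

# Hypothetical helper: return the n texts with the largest share of a topic.
top_docs <- function(theta, texts, topic, n = 2) {
  idx <- order(theta[, topic], decreasing = TRUE)[seq_len(n)]
  texts[idx]
}

top_docs(theta, texts, topic = 2)
```

Because the ranking only looks at topic proportions, short or boilerplate-heavy documents can rank highly, which is one reason to read the retrieved texts rather than trust the labels alone.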
We can look at multiple, or all, topics this way as well. For this we’ll just look at the shorttext.
findThoughts(poliblog5k.fit.nometa,texts = poliblog5k.shorttext, n = 3, topics = 1:20)
Topic 1:
As you may have heard by now, Barack Obama voted for the FISA cave-in bill in the Senate today, and Hillary voted against it. Hillary has now explained her vote in a new statement... The legislation
New wiretapping bill dubbed ‘repugnant’ and ‘a capitulation.’ Under a “compromise” wiretapping bill the House is expected to approve tomorrow, U.S. phone companies that cooperated with President
In Radio Address, Bush Hypes Consequences of Wiretapping Law Expiration In his weekly radio address, President Bush not only blames Congress for tonight’s expiration of the Protect America Act,
Topic 2:
Two successive national-election losses still hasn’t clued Republicans into the need for dramatic change in their direction. The House GOP caucus rejected a plan by John Boehner and Eric Cantor to im
Here's yet more evidence that the Dems are poised for huge gains in Congress: The Cook Report has released a new set of updated rankings on 25 House races -- and all 25 are shifts in the Dems' directi
This is really something. We already knew that House Dems are expected to rack up major victories this fall, but this latest development is really eye-opening. The Cook Political Report, whose rating
Topic 3:
What's Wrong With This Picture?by digby Here are who the Telegraph considers to be the 50 most influential political pundits in America. The following are the top choices starting with number 10, Mark
Tears Of A Clownby digby“I don’t think people look at me as the establishment, do you?” Matthews asked me. “Am I part of the winner’s circle in American life? I don’t think so.”I just read the Chris M
“I think we’ve all been demeaned.”by tristeroIndeed we all have,, whether or not we ever saw a Swift Boat. But this is what movement conservatives do with emotionally weighty situations or actions. Th
Topic 4:
Washington University refuses to back down on Schlafly award. Today, Washington University chancellor Mark Wrighton finally responded to the intense criticism the school has been receiving over
Right-Wing Ad Tries To Frighten Minorities By Comparing Embryonic Stem Cell Research To Tuskegee Study Up for consideration in Michigan is Proposal 2, a measure to permit embryonic stem cell res
McCain spent about half of his speech yesterday to the N.A.A.C.P. outlining his education plans for America. The audience heard an approach much different than the plan proposed by Senator Obama.McCai
Topic 5:
The Northern Alliance Radio Network will be on the air today, with our eight-hour-long broadcast schedule starting at 9 am CT. If you’re in the Twin Cities, you can hear us on AM 1280 The Patriot, or
Today, on the Ed Morrissey Show (3 pm ET), we’re thrilled to welcome back Mary Katharine Ham. MK and I will talk about the latest developments in the Blago-Rahma scandal, the Joe Scarborough takedown
Just buying the tickets to this film felt like a guilty pleasure. What could be more cheesy than a movie musical with serious actors like Meryl Streep, Julie Walters, Pierce Brosnan, Colin Firth, and
Topic 6:
The New York Sun is reporting that Barack Obama repudiated the views of Nation of Islam leader Louis Farrakhan that were discussed in Richard Cohen's Washington Post column. Cohen's criticism regardin
How far can 527 groups go before they create more of a backlash than forward momentum? Opposing 527s have unveiled ads that may answer that question. A pro-Obama 527 makes an issue of McCain’s health
The Washington Post editorial board follows the lead in some ways of the New York Times, which pretends that Jeremiah Wright suddenly popped out of the ground this week, offering lunatic conspiracy th
Topic 7:
Here's tonight's run-down on the Congressional races: Coleman Suspends Negative Ads, Sort Of Sen. Norm Coleman (R-MN), who has fallen behind in the polls against Al Franken thanks to the economic cri
It now looks like the Senate GOP could end up trying to block the seating of Al Franken, assuming he is declared the winner next week in the Minnesota recount. NRSC chairman John Cornyn put out a stat
So will any of the wrongly-rejected absentee ballots in Minnesota, which have been the subject of copious litigation between the Franken and Coleman campaigns, actually get counted? The latest report
Topic 8:
The Israelis have sent a warning to Gaza and its Hamas leadership after the latest rocket attack on Ashkelon. If the attacks continue, Israel will invade Gaza and conduct large-scale military opera
Three “Middle Eastern militants” are in custody in the Philippines for plotting to attack the US embassy in Manila, as well as three other embassies in the capital. Authorities suspect them of belong
Those of us who questioned Jimmy Carter's ability to bring peace to the Israeli-Palestinian conflict, we should be ashamed of ourselves. The former President, it turns out, has everything all worked o
Topic 9:
Barack Obama is reportedly sending signals that he wants Joe Lieberman to stay in the Dem caucus. But let's not get distracted. The question isn't whether Lieberman gets to "stay in the Dem caucus"
The Obama and McCain campaigns just jointly announced that they've reached an agreement on the format for three presidential debates and one veep one. The agreement features an interesting variety of
We now have a third Senator stepping up and strongly condemning the idea of Joe Lieberman remaining as Homeland Security chair: Senator Byron Dorgan of North Dakota... Dorgan hammered Lieberman f
Topic 10:
In a speech just now in Grand Rapids, Michigan, Barack Obama departed from the prepared remarks and unleashed some of his most empathetic language yet about people's economic distress. Obama reiterat
In a speech going on right now in Montgomery County, Pennsylvania, a Philadelphia suburb, Barack Obama wades a bit deeper into the action on the bailout package, urging the House to pass it today and
Truckers rolled into Washington DC to protest the price of a fill-up, while Barack Obama continued to oppose both Hillary Clinton and John McCain on a gas-tax “holiday”. Obama’s opposition to the ga
Topic 11:
Here's our daily composite of the five major national tracking polls. Barack Obama's lead may have contracted slightly since yesterday -- but the overall difference is very small, and he remains well
A new set of Rasmussen polls, all conducted yesterday in the middle of John McCain's post-convention bounce, suggests that this race remains close on the state-by-state level. • In Colorado, Obama
Here's our daily composite of the five major national tracking polls. Barack Obama is holding on to his big lead over John McCain, which has remained relatively unchanged since the financial meltdown
Topic 12:
The fighting that has erupted in Basra should come as no surprise to anyone who has followed the course of the war in Iraq. While the US has spent the last year increasing force size in western Iraq
Thanks to our good friends in the Pakistani government, several hundred Taliban fighters have inflitrated across the border from their bases in the northwest frontier provinces and are seizing village
In a joint US-Iraqi operation that began in May in the northern Iraqi city of Mosul directed at al-Qaeda's last redoubt, the Times OnLine reports that the Iraqis have achieved a spectacular success an
Topic 13:
Climate change report forecasts global sea levels to rise up to 4 feet by 2100. According to a new report led by the U.S. Geological Survey, the U.S. “faces the possibility of much more rapid cl
The Australian reports a few inconvenient truths regarding global climate change that have yet to receive much attention from a media sold on global warming. Not only has the Earth cooled since its p
Kudos to John L. Daly, who has written a very interesting study of ice at the North Pole. Global Warmists are once again observing cyclical changes and declaring them "proof" of the dire effects of gl
Topic 14:
Karl Rove orchestrating the ‘Bush Legacy project.’ President Bush’s interview with ABC’s Charlie Gibson this week was the “first of several planned ‘exit interviews.’” According to White House p
MSNBC: White House Has Had A Copy Of McClellan’s Memoir For ‘At Least A Month’ MSNBC correspondent Jeannie Ohm reported breaking news this afternoon that former White House Press Secretary Scott
Bush Is ‘Puzzled’ By McClellan’s Book, Didn’t Think It Would Be So ‘Harsh’ Earlier today, White House Press Secretary Dana Perino put out a statement bashing former press secretary Scott McClell
Topic 15:
The Bridge To Nowhere Lie Returns: McCain Claims Palin ‘Stood Up Against’ The Project After the talking point was thoroughly debunked, the McCain campaign slowly backed away from the claim that
Media Embrace McCain’s ‘Maverick’ Re-Branding Effort By choosing Gov. Sarah Palin (R-AK) as his running mate, Sen. John McCain (R-AZ) has tried to reinvigorate the perception that he is a “maver
The McCain campaign, keeping up the pressure over Wes Clark's comments, is holding its second conference call on this topic in two days -- but now the story has taken a new turn, with a McCain surroga
Topic 16:
Barack Obama has released the following statement on the Russia-Georgia War: Good morning. The situation in Georgia continues to deteriorate because of the escalation of Russia's use of military forc
It's after the jump. Dig in. Video and more soon. Late Update: Here's the vid... Thank you to the citizens of Berlin and to the people of Germany. Let me thank Chancellor Merkel and Foreign Min
This is kind of fun. In a big speech John McCain just delivered on nuclear proliferation, there was an amusing little nugget that seemed like a pretty obvious effort to dispel worries about his age:
Topic 17:
ACORN and its “affiliate”, Project Vote, claimed that they have registered over 1.3 million new voters in this election cycle. Even when counting Mickey Mouse and the starting lineup of the Dallas Co
The laughably and ironically nicknamed Free State (lose the r and it would be more accurate) is already one of only 8 states that allow illegal aliens to obtain drivers licenses. Several local govern
Right Wing Rages Against New Voter Registrations: The ‘Purpose’ Of ACORN Is To Commit ‘Voter Fraud’ This week, the New York Times reported that “tens of thousands of eligible voters in at least
Topic 18:
It's all over but the weeping for the corrupt governor of Illinois. The Chicago Tribune is reporting that Governor Rod Blagojevich's most trusted advisor and friend wore a wiretap authorized by a fed
The Anchorage Daily News is reporting that an FBI agent who worked on the criminal case against Ted Stevens has filed an 8 page complaint charging his fellow agents and the prosecution with misconduct
It seems that Attorney General Designee Eric Holder forgot to list that Illinois Governor Rod Blagojevich tried to hire him to sort out the state's long dormant casino license on his 47-page response
Topic 19:
The bailout is ballooning. Imagine that. Some estimates now put the total bill at over 8 trillion dollars.Those of you who thought that the money was going to the poor folks holding bad mortgages need
When the New York Times refers to the bailout plan for Citigroup as "radical," it must be somewhere out near Mars: Federal regulators approved a radical plan to stabilize Citigroup in an arrangement
IndyMac seized by regulators, marking second largest bank failure in U.S. history. Late yesterday, the Federal Deposit Insurance Corporation (FDIC) and the Office of Thrift Supervision (OTS) “to
Topic 20:
The news nets are still tabulating delegates won by all the candidates last night and we probably won't have any firm numbers until later this afternoon. But NBC and the Obama campaign are in near agr
The Hillary campaign has a new statement out responding to the Obama camp's claim that they won more delegates in Nevada: Hillary Clinton won the Nevada Caucuses today by winning a majority of the de
Senator Barack Obama won all three Democratic contests on Saturday, sweeping caucuses in Washington State and Nebraska while thumping Hillary Clinton in the Louisiana primary:While Mr. Obama’s victori
The first three in Topic 1 are about FISA and wiretapping … not torture … so that may be a more general “legislation / law” topic. The first three in Topic 13 are all about global warming, so that merits a closer inspection.
The plotQuote function will give you similar information in a more graphical format.
firstdocs.13 <- findThoughts(poliblog5k.fit.nometa,texts = poliblog5k.shorttext, n = 5, topics = c(13))$docs[[1]]
plotQuote(firstdocs.13, main="Top Documents, Topic 13 - Global Warming?")
The default label of “report, new, time” – which probably indicates a lot of “A New York Times report …” – may be misleading. These five are all criticisms of “global warmists” with several discussing the “coming ice age.”
Or we can go back to words and our old friend the wordcloud:
cloud(poliblog5k.fit.nometa, topic=13, scale=c(2,.25))
Boy. It’s tough to call that “global warming.” The vast majority of those words are media related. We would really need to look closely at the documents to see what’s going on.
But let’s move on. STM’s bread and butter is in incorporating “structure” by modeling on metadata.
In the vignette example, the authors model “prevalence” of topics on “rating” and “s(day)”. The latter fits a smooth function (a B-spline) of the variable, which is appropriate for a variable like day that is continuous or takes on many values.
Let’s do that.
poliblog5k.fit.rat_day <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ rating + s(day),
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
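To make the s(day) term less opaque, here is a minimal sketch of the kind of basis expansion it produces. It assumes s() behaves like a thin wrapper around splines::bs() with 10 degrees of freedom, which is consistent with the ten s(day) coefficients that estimateEffect reports in this tutorial. The names day and day_basis are hypothetical stand-ins, not objects from the tutorial.

```r
library(splines)  # ships with base R

# Hypothetical stand-in for poliblog5k.meta$day: days 1..366 of 2008
day <- 1:366

# Cubic B-spline basis with 10 columns (assumed to mirror what s(day) builds)
day_basis <- bs(day, df = 10)

dim(day_basis)                          # one row per day, one column per basis function
round(day_basis[c(1, 183, 366), ], 3)   # each day loads on only a few columns
```

Each column is a smooth bump covering part of the year, so the model estimates one coefficient per bump rather than a single linear trend in day.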
Or we might think just rating matters.
poliblog5k.fit.rat_only <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ rating,
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
Or we might think just day matters.
poliblog5k.fit.day_only <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ s(day),
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
Or maybe it’s blog and day.
poliblog5k.fit.blog_day <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ blog + s(day),
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
It turns out this makes almost no difference to the model. These documents are long enough that the influence of the prior from the covariate structure is imperceptible.
# Log beta (log word probabilities) for topic 1
cor(poliblog5k.fit.nometa$beta[[1]][[1]][1,],poliblog5k.fit.rat_day$beta[[1]][[1]][1,])
[1] 0.9999778
cor(poliblog5k.fit.nometa$beta[[1]][[1]][1,],poliblog5k.fit.blog_day$beta[[1]][[1]][1,])
[1] 0.9999262
cor(poliblog5k.fit.nometa$beta[[1]][[1]][1,],poliblog5k.fit.day_only$beta[[1]][[1]][1,])
[1] 0.9999967
cor(poliblog5k.fit.nometa$beta[[1]][[1]][1,],poliblog5k.fit.rat_only$beta[[1]][[1]][1,])
[1] 0.9999822
# Theta (document-topic proportions) for topic 1
cor(poliblog5k.fit.nometa$theta[,1],poliblog5k.fit.rat_day$theta[,1])
[1] 0.999578
cor(poliblog5k.fit.nometa$theta[,1],poliblog5k.fit.blog_day$theta[,1])
[1] 0.9989984
cor(poliblog5k.fit.nometa$theta[,1],poliblog5k.fit.rat_only$theta[,1])
[1] 0.9998582
cor(poliblog5k.fit.nometa$theta[,1],poliblog5k.fit.day_only$theta[,1])
[1] 0.9996832
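The checks above only look at topic 1. A small helper can run the same comparison for every topic at once. This is a sketch on simulated matrices, not on the fitted models. The name topic_cor is hypothetical, and the helper assumes topic k in one fit corresponds to topic k in the other, which seems plausible here because all of the fits use the same spectral initialization.

```r
# Column-by-column correlation between two document-topic matrices.
# Assumes both have the same dimensions and that topics line up by index.
topic_cor <- function(theta_a, theta_b) {
  stopifnot(identical(dim(theta_a), dim(theta_b)))
  vapply(seq_len(ncol(theta_a)),
         function(k) cor(theta_a[, k], theta_b[, k]),
         numeric(1))
}

# Simulated stand-ins for two theta matrices (100 documents, 4 topics)
set.seed(1)
theta_a <- matrix(runif(400), nrow = 100)
theta_a <- theta_a / rowSums(theta_a)            # rows sum to 1, like theta
noise   <- matrix(rnorm(400, sd = 0.001), nrow = 100)
theta_b <- theta_a + noise                       # a slightly perturbed copy

round(topic_cor(theta_a, theta_b), 4)
```

With the fits from this tutorial, the call would be topic_cor(poliblog5k.fit.nometa$theta, poliblog5k.fit.rat_day$theta).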
The STM model does differ considerably in this example if you also model the content of topics as a function of covariates.
(This takes a bit longer, 2-3 minutes in this case.)
poliblog5k.fit.cont.rat <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ rating + s(day),
content =~ rating,
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
poliblog5k.labels_rat <- labelTopics(poliblog5k.fit.cont.rat)
poliblog5k.labels_rat
Topic Words:
Topic 1: detaine, fisa, guantanamo, detent, surveil, judici, interrog
Topic 2: republican, gop, boehner, pelosi, democrat, parti, centrist
Topic 3: crap, conservat, elit, guy, blogospher, gender, tale
Topic 4: abort, church, prolif, teach, birth, health, cathol
Topic 5: movi, hollywood, song, limbaugh, mom, music, clip
Topic 6: ayer, wright, obama, jeremiah, reverend, barack, rezko
Topic 7: franken, coleman, minnesota, ballot, norm, incumb, recount
Topic 8: iran, iranian, tehran, gaza, arabia, ahmadinejad, osama
Topic 9: biden, joe, plumber, lieberman, graham, foreign, segment
Topic 10: drill, coal, prosper, tax, economi, worker, gas
Topic 11: gallup, rasmussen, poll, battleground, pennsylvania, undecid, nevada
Topic 12: maliki, baghdad, iraqi, sadr, basra, shiit, petraeus
Topic 13: ice, scientist, scienc, warm, emiss, anim, scientif
Topic 14: olbermann, cbs, nbc, gibson, bias, keith, editor
Topic 15: rudi, mccain, giuliani, maverick, palin, sarah, pow
Topic 16: nato, russia, soviet, europ, russian, georgia, africa
Topic 17: registr, acorn, fraud, worker, california, missouri, counti
Topic 18: blagojevich, prosecutor, prosecut, attorney, investig, probe, impeach
Topic 19: mae, bailout, paulson, treasuri, freddi, mortgag, earmark
Topic 20: superdeleg, deleg, hillari, caucus, edward, iowa, clinton
Covariate Words:
Group Conservative: latter, afterward, especi, buri, surviv, finger, appar
Group Liberal: matt, today, excerpt, shape, swift, current, overwhelm
Topic-Covariate Interactions:
Topic 1, Group Conservative: legislatur, jone, properti, democraci, appoint, raz, applic
Topic 1, Group Liberal: mcconnel, prohibit, illeg, armi, nixon, human, graham
Topic 2, Group Conservative: reid, mcconnel, mitch, drill, hurrican, pork, fiscal
Topic 2, Group Liberal: digbi, turnout, elit, prolif, primari, cash, repudi
Topic 3, Group Conservative: seedubya, exit, color, barri, religi, bless, religion
Topic 3, Group Liberal: digbyi, digbi, bias, postcounttb, linktocommentspostcount, obsess, regim
Topic 4, Group Conservative: park, neighborhood, wife, doctrin, dream, intellectu, signatur
Topic 4, Group Liberal: huckabe, stem, pastor, evangel, christian, rev, thompson
Topic 5, Group Conservative: morrissey, scroll, denver, seedubya, mitch, updat, indiana
Topic 5, Group Liberal: matthew, msm, thompson, digbyi, africanamerican, classic, rant
Topic 6, Group Conservative: church, toni, africanamerican, francisco, liar, kennedi, bus
Topic 6, Group Liberal: spokesperson, hillari, outlet, click, camp, dem, undecid
Topic 7, Group Conservative: parliament, turnout, johnson, inflat, london, secretari, elect
Topic 7, Group Liberal: mitch, mcconnel, smith, steven, gop, rep, jeff
Topic 8, Group Conservative: moham, civilian, peac, inspector, technolog, infrastructur, cell
Topic 8, Group Liberal: pakistan, saddam, alqaeda, pakistani, parliament, navi, hussein
Topic 9, Group Conservative: gaff, advis, almaliki, stabl, advisor, iraq, afghanistan
Topic 9, Group Liberal: reid, caucus, stimulus, harri, packag, skip, dem
Topic 10, Group Conservative: revenu, investor, deficit, europ, germani, commerc, emiss
Topic 10, Group Liberal: mortgag, debt, homeown, teacher, lender, auto, taxpay
Topic 11, Group Conservative: maverick, oregon, gaff, centrist, gap, exit, gender
Topic 11, Group Liberal: ralli, biden, sarah, palin, joe, nyt, plumber
Topic 12, Group Conservative: pakistani, pakistan, islamist, navi, virginia, osama, alqaeda
Topic 12, Group Liberal: thinkfast, flv, korea, inspector, ope, raz, ambassador
Topic 13, Group Conservative: book, trend, britain, circul, copi, green, modern
Topic 13, Group Liberal: film, spi, tale, music, drill, johnson, scroll
Topic 14, Group Conservative: palin, msm, sarah, alaska, blogger, mainstream, circul
Topic 14, Group Liberal: hanniti, flv, recess, cia, digg, dana, don
Topic 15, Group Conservative: huckabe, fred, eastern, kerri, evangel, smile, gibson
Topic 15, Group Liberal: gaff, rak, raz, client, cbs, lobbi, digg
Topic 16, Group Conservative: provinc, ship, ambassador, spi, gulf, villag, arrest
Topic 16, Group Liberal: nuclear, weapon, afghanistan, enrich, taliban, global, pakistan
Topic 17, Group Conservative: hispan, enforc, border, san, park, personnel, lobbi
Topic 17, Group Liberal: gay, battleground, enact, turnout, africanamerican, plumber, court
Topic 18, Group Conservative: lobbyist, immun, fundrais, reelect, congressman, mayor, alaska
Topic 18, Group Liberal: moham, homeland, guantanamo, applic, bay, permiss, award
Topic 19, Group Conservative: crap, enrich, johnson, inspector, graham, discrimin, perman
Topic 19, Group Liberal: deficit, infrastructur, unemploy, thinkfast, economist, oil, consum
Topic 20, Group Conservative: ohio, nbc, ambassador, racial, gender, gore, defeat
Topic 20, Group Liberal: romney, turnout, huckabe, oregon, fundrais, circul, grassroot
plot(poliblog5k.fit.cont.rat)
The overall topic words seem to indicate the topics are now:
We can estimate for multiple groups, like blog here. Be aware this takes considerably longer: about 8 minutes in the example below.
system.time(
poliblog5k.fit.cont.blog <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ blog + s(day),
content =~ blog,
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
)
user system elapsed
389.264 70.575 459.867
We can also estimate metadata interactions, with a binary moderating variable.
poliblog5k.fit.ratday.int <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 20,
prevalence =~ rating * s(day),
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=FALSE)
poliblog5k.ratday.int.prep <- estimateEffect(formula = c(19) ~ rating*day, stmobj = poliblog5k.fit.ratday.int, metadata = poliblog5k.meta, uncertainty="None")
plot(poliblog5k.ratday.int.prep, covariate = "day", model = poliblog5k.fit.ratday.int, method = "continuous", xlab = "Days", moderator = "rating", moderator.value = "Liberal", linecol = "blue", ylim = c(0, .12), printlegend = FALSE)
plot(poliblog5k.ratday.int.prep, covariate = "day", model = poliblog5k.fit.ratday.int, method = "continuous", xlab = "Days", moderator = "rating", moderator.value = "Conservative", linecol = "red", add = TRUE, printlegend = FALSE)
legend(0, .08, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
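The rating * day formula passed to estimateEffect expands to both main effects plus their product, which is what lets the two lines in the plot have different slopes. A minimal sketch with base R's model.matrix on a made-up data frame (toy_meta is hypothetical) shows the columns involved:

```r
# Hypothetical metadata with the same two variables used above
toy_meta <- data.frame(
  rating = factor(c("Conservative", "Liberal", "Conservative", "Liberal")),
  day    = c(10, 10, 200, 200)
)

# rating * day is shorthand for rating + day + rating:day
design <- model.matrix(~ rating * day, data = toy_meta)
colnames(design)
design
```

The ratingLiberal:day column is zero for Conservative posts, so its coefficient is the difference in the day slope for Liberal posts.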
estEffpoliblog <- estimateEffect(1:20 ~ rating + s(day), poliblog5k.fit.rat_day, meta = poliblog5k.meta, uncertainty = "Global")
summary(estEffpoliblog, topics=1)
Call:
estimateEffect(formula = 1:20 ~ rating + s(day), stmobj = poliblog5k.fit.rat_day,
metadata = poliblog5k.meta, uncertainty = "Global")
Topic 1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.647e-05 1.432e-02 0.005 0.996297
ratingLiberal 2.008e-02 3.687e-03 5.446 5.41e-08 ***
s(day)1 7.033e-02 2.761e-02 2.547 0.010887 *
s(day)2 4.447e-02 1.798e-02 2.473 0.013430 *
s(day)3 1.032e-02 1.974e-02 0.523 0.601301
s(day)4 6.038e-02 1.810e-02 3.337 0.000855 ***
s(day)5 4.927e-02 1.830e-02 2.692 0.007133 **
s(day)6 -5.632e-03 1.683e-02 -0.335 0.737944
s(day)7 3.364e-02 1.834e-02 1.834 0.066684 .
s(day)8 8.047e-03 2.058e-02 0.391 0.695760
s(day)9 6.098e-02 2.228e-02 2.737 0.006231 **
s(day)10 1.347e-02 2.167e-02 0.622 0.534047
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
poliblog5k.eff.day.nometa <- estimateEffect(1:20 ~ s(day), poliblog5k.fit.nometa, meta = poliblog5k.meta, uncertainty = "Global")
summary(poliblog5k.eff.day.nometa, topics=19)
Call:
estimateEffect(formula = 1:20 ~ s(day), stmobj = poliblog5k.fit.nometa,
metadata = poliblog5k.meta, uncertainty = "Global")
Topic 19:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.013862 0.015954 0.869 0.38495
s(day)1 0.065560 0.030989 2.116 0.03443 *
s(day)2 0.037224 0.017392 2.140 0.03238 *
s(day)3 0.003956 0.021425 0.185 0.85350
s(day)4 0.054261 0.019237 2.821 0.00481 **
s(day)5 0.036220 0.019670 1.841 0.06563 .
s(day)6 -0.003282 0.018759 -0.175 0.86112
s(day)7 0.029519 0.019765 1.494 0.13536
s(day)8 0.007867 0.022743 0.346 0.72944
s(day)9 0.053945 0.023438 2.302 0.02140 *
s(day)10 0.009722 0.022540 0.431 0.66624
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(poliblog5k.eff.day.nometa, "day", method = "continuous", topics = 19,model = poliblog5k.fit.nometa, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1,at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames)
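One detail in the axis code: day runs from 1 to 366, while as.numeric(monthseq) - min(as.numeric(monthseq)) starts at 0, so the month ticks appear to sit one day early. That is invisible at this scale, but if exact placement matters, one alternative is to take the day of year directly. This is a sketch, and month_ticks is a hypothetical name:

```r
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")

# Day of year for the first of each month (%j), on the same 1..366 scale as `day`
month_ticks <- as.numeric(format(monthseq, "%j"))
month_ticks
```

These values could then be passed as at = month_ticks in the axis() call.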
poliblog5k.eff.day.rat_day <- estimateEffect(1:20 ~ s(day), poliblog5k.fit.rat_day, meta = poliblog5k.meta, uncertainty = "Global")
summary(poliblog5k.eff.day.rat_day, topics=19)
Call:
estimateEffect(formula = 1:20 ~ s(day), stmobj = poliblog5k.fit.rat_day,
metadata = poliblog5k.meta, uncertainty = "Global")
Topic 19:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009106 0.014966 0.608 0.54293
s(day)1 0.075717 0.029789 2.542 0.01106 *
s(day)2 0.040852 0.018231 2.241 0.02508 *
s(day)3 0.008178 0.020515 0.399 0.69018
s(day)4 0.060850 0.018664 3.260 0.00112 **
s(day)5 0.047174 0.018575 2.540 0.01112 *
s(day)6 -0.005778 0.018954 -0.305 0.76048
s(day)7 0.037943 0.018143 2.091 0.03655 *
s(day)8 0.008792 0.023797 0.369 0.71179
s(day)9 0.061950 0.023306 2.658 0.00788 **
s(day)10 0.012290 0.022760 0.540 0.58922
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(poliblog5k.eff.day.rat_day, "day", method = "continuous", topics = 19,model = poliblog5k.fit.rat_day, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1,at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames)
There is in fact a pattern to the garbage topics, if we model by blog.
poliblog5k.eff.blog_day <- estimateEffect(1:20 ~ blog + s(day), poliblog5k.fit.blog_day, meta = poliblog5k.meta, uncertainty = "Global")
summary(poliblog5k.eff.blog_day, topics=3)
Call:
estimateEffect(formula = 1:20 ~ blog + s(day), stmobj = poliblog5k.fit.blog_day,
metadata = poliblog5k.meta, uncertainty = "Global")
Topic 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0030404 0.0146461 -0.208 0.835559
blogdb 0.0343221 0.0063726 5.386 7.54e-08 ***
blogha 0.0002643 0.0045484 0.058 0.953662
blogmm 0.0145629 0.0083330 1.748 0.080592 .
blogtp 0.0418508 0.0059159 7.074 1.71e-12 ***
blogtpm -0.0144186 0.0054358 -2.652 0.008015 **
s(day)1 0.0718233 0.0289446 2.481 0.013119 *
s(day)2 0.0467922 0.0180917 2.586 0.009727 **
s(day)3 0.0125426 0.0196402 0.639 0.523101
s(day)4 0.0625965 0.0178221 3.512 0.000448 ***
s(day)5 0.0510657 0.0179250 2.849 0.004406 **
s(day)6 -0.0029779 0.0179131 -0.166 0.867975
s(day)7 0.0394827 0.0178074 2.217 0.026654 *
s(day)8 0.0103715 0.0217261 0.477 0.633117
s(day)9 0.0637732 0.0229270 2.782 0.005430 **
s(day)10 0.0162424 0.0208451 0.779 0.435900
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(poliblog5k.eff.blog_day, covariate = "blog", topics = c(3), model = poliblog5k.fit.blog_day, method = "pointestimate",
main = "Effect of Blog on Topic Proportion",
xlim = c(0, .25), labeltype = "custom",
custom.labels = c("Hot Air", "Digby", "American Thinker", "Talking Points Memo", "Michelle Malkin", "Think Progress"))
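Custom labels are matched to the plotted values by position, so it is worth confirming the order before trusting the plot. Which order the plot uses is not shown here. The sketch below, on a hypothetical vector, only illustrates that the alphabetical factor levels and the order of first appearance in the data can differ, so the label vector should be checked against whichever one applies.

```r
# Hypothetical blog codes in the order they happen to appear in the data
blog <- factor(c("ha", "db", "at", "tpm", "mm", "tp", "ha", "at"))

levels(blog)                 # alphabetical order of the codes
as.character(unique(blog))   # order of first appearance in the data
```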
plot(poliblog5k.eff.blog_day, covariate = "blog", topics = c(5), model = poliblog5k.fit.blog_day, method = "pointestimate",
main = "Effect of Blog on Topic Proportion",
xlim = c(0, .25), labeltype = "custom",
custom.labels = c("Hot Air", "Digby", "American Thinker", "Talking Points Memo", "Michelle Malkin", "Think Progress"))
What does this imply for our topics as “topics”?
What does this imply for our estimates of “topic proportion”?
What does this imply for topic concentration parameters?
This takes about 6 minutes in this example.
## requires installation of packages: Rtsne, rsvd, geometry
system.time(
poliblog5k.fit.lee_mimno <- stm(documents = poliblog5k.docs,
vocab = poliblog5k.voc,
K = 0, # K=0 instructs STM to run Lee-Mimno
seed = 1234, # randomness now, seed matters
prevalence =~ rating + s(day),
max.em.its = 75,
data = poliblog5k.meta,
init.type = "Spectral",
verbose=TRUE)
)
This finds a 55-topic model. I’m a little surprised they look as good as they do.
plot(poliblog5k.fit.lee_mimno)
labelTopics(poliblog5k.fit.lee_mimno)
Topic 1 Top Words:
Highest Prob: women, right, abort, gay, marriag, issu, religi
FREX: abort, gay, women, marriag, sex, religi, woman
Lift: abort, gay, sex, marriag, sexual, women, prolif
Score: abort, gay, women, marriag, sex, religi, sexual
Topic 2 Top Words:
Highest Prob: world, america, nation, countri, american, foreign, will
FREX: world, america, foreign, democraci, must, countri, freedom
Lift: ahmadinejad, democraci, world, europ, diplomaci, abroad, freedom
Score: ahmadinejad, world, foreign, america, democraci, countri, secur
Topic 3 Top Words:
Highest Prob: health, care, american, famili, will, work, job
FREX: health, care, union, worker, famili, insur, job
Lift: auto, health, insur, worker, union, wage, minimum
Score: auto, health, insur, worker, care, union, famili
Topic 4 Top Words:
Highest Prob: tax, cut, plan, billion, spend, bailout, pay
FREX: tax, bailout, cut, billion, spend, budget, pay
Lift: bailout, tax, deficit, revenu, stimulus, budget, fiscal
Score: bailout, tax, billion, spend, budget, stimulus, deficit
Topic 5 Top Words:
Highest Prob: win, will, vote, state, elect, victori, can
FREX: win, victori, elector, lose, michigan, won, pennsylvania
Lift: battleground, win, elector, turnout, indiana, victori, momentum
Score: battleground, win, vote, elector, victori, pennsylvania, turnout
Topic 6 Top Words:
Highest Prob: time, stori, new, post, york, articl, read
FREX: publish, page, stori, blog, piec, newspap, editor
Lift: birth, publish, editor, newspap, editori, circul, page
Score: birth, stori, publish, articl, book, newspap, editor
Topic 7 Top Words:
Highest Prob: blagojevich, senat, governor, illinoi, seat, polit, rezko
FREX: blagojevich, illinoi, rezko, appoint, corrupt, scandal, seat
Lift: blagojevich, rezko, rod, jackson, probe, illinoi, toni
Score: blagojevich, rezko, illinoi, rod, seat, chicago, governor
Topic 8 Top Words:
Highest Prob: hous, white, congress, pelosi, committe, rep, member
FREX: pelosi, hous, white, speaker, hill, congress, rep
Lift: boehner, pelosi, capitol, speaker, nanci, hill, hous
Score: boehner, hous, pelosi, white, congress, rep, speaker
Topic 9 Top Words:
Highest Prob: bush, presid, administr, georg, cheney, year, vice
FREX: bush, presid, cheney, georg, administr, vice, execut
Lift: branch, bush, cheney, presid, dick, vice, georg
Score: branch, bush, presid, cheney, administr, vice, georg
Topic 10 Top Words:
Highest Prob: fund, money, want, need, get, project, use
FREX: fund, earmark, project, money, bridg, system, didn
Lift: bridg, earmark, pork, infrastructur, project, wast, nowher
Score: bridg, earmark, pork, fund, money, spend, congress
Topic 11 Top Words:
Highest Prob: obama, barack, polici, will, team, chang, advis
FREX: team, obama, barack, advis, transit, presidentelect, polici
Lift: cabinet, presidentelect, transit, team, advisor, german, gate
Score: cabinet, obama, barack, presidentelect, advis, team, transit
Topic 12 Top Words:
Highest Prob: one, will, film, love, day, life, year
FREX: film, cathol, love, movi, hollywood, father, god
Lift: cathol, film, hollywood, movi, music, beauti, soul
Score: cathol, film, movi, hollywood, music, god, love
Topic 13 Top Words:
Highest Prob: show, get, don, see, doesn, can, video
FREX: video, radio, rush, clip, don, doesn, show
Lift: chat, limbaugh, morrissey, clip, rush, youtub, radio
Score: chat, radio, limbaugh, video, clip, show, don
Topic 14 Top Words:
Highest Prob: obama, wright, black, church, white, america, american
FREX: wright, church, black, pastor, jeremiah, racist, racial
Lift: church, wright, jeremiah, reverend, pastor, racist, racial
Score: church, wright, obama, jeremiah, black, pastor, racist
Topic 15 Top Words:
Highest Prob: intellig, offici, depart, administr, cia, inform, agenc
FREX: cia, intellig, rice, agenc, depart, document, spi
Lift: cia, spi, rice, intellig, leak, fbi, agenc
Score: cia, intellig, rice, depart, spi, agenc, fbi
Topic 16 Top Words:
Highest Prob: state, new, florida, obama, carolina, virginia, north
FREX: florida, carolina, virginia, colorado, pennsylvania, ohio, kerri
Lift: colorado, nevada, carolina, virginia, florida, missouri, iowa
Score: colorado, virginia, carolina, florida, nevada, ohio, pennsylvania
Topic 17 Top Words:
Highest Prob: hillari, clinton, primari, deleg, edward, obama, will
FREX: hillari, deleg, clinton, edward, primari, superdeleg, nomin
Lift: deleg, superdeleg, hillari, clinton, edward, super, primari
Score: deleg, hillari, clinton, superdeleg, edward, primari, obama
Topic 18 Top Words:
Highest Prob: court, law, rule, constitut, judg, justic, suprem
FREX: court, judg, constitut, rule, detaine, suprem, law
Lift: detaine, court, judici, suprem, constitut, judg, rule
Score: detaine, court, suprem, justic, law, constitut, lawyer
Topic 19 Top Words:
Highest Prob: peopl, crime, govern, will, prosecut, polit, crimin
FREX: crime, prosecut, crimin, murder, prison, detent, arrest
Lift: detent, prosecut, jail, crime, murder, crimin, violent
Score: detent, prosecut, crime, murder, prison, arrest, crimin
Topic 20 Top Words:
Highest Prob: energi, oil, price, drill, gas, product, will
FREX: oil, drill, energi, price, gas, product, coal
Lift: drill, oil, coal, energi, gas, price, product
Score: drill, oil, energi, price, gas, coal, product
Topic 21 Top Words:
Highest Prob: iran, nuclear, weapon, iranian, will, program, threat
FREX: iran, nuclear, iranian, weapon, saudi, enrich, korea
Lift: enrich, tehran, iran, iranian, korea, weapon, nuclear
Score: enrich, iran, iranian, nuclear, weapon, missil, saudi
Topic 22 Top Words:
Highest Prob: poll, lead, obama, margin, show, error, number
FREX: poll, error, margin, lead, rasmussen, ahead, compar
Lift: error, poll, rasmussen, margin, pollster, edg, narrow
Score: error, poll, margin, obama, rasmussen, lead, pollster
Topic 23 Top Words:
Highest Prob: immigr, citi, illeg, gun, home, polic, offic
FREX: immigr, gun, illeg, citi, alien, san, polic
Lift: francisco, immigr, san, alien, gun, illeg, licens
Score: francisco, immigr, illeg, polic, citi, san, gun
Topic 24 Top Words:
Highest Prob: franken, count, ballot, coleman, minnesota, vote, campaign
FREX: franken, coleman, count, minnesota, ballot, recount, challeng
Lift: franken, minnesota, coleman, recount, ballot, norm, count
Score: franken, coleman, ballot, minnesota, recount, count, norm
Topic 25 Top Words:
Highest Prob: obama, voter, point, among, percent, may, even
FREX: among, percent, voter, favor, advantag, point, independ
Lift: gallup, undecid, gap, percentag, advantag, newsweek, demograph
Score: gallup, obama, voter, percent, undecid, among, mccain
Topic 26 Top Words:
Highest Prob: israel, isra, hama, palestinian, arab, will, east
FREX: hama, isra, israel, palestinian, gaza, arab, east
Lift: gaza, hama, palestinian, isra, israel, arab, carter
Score: gaza, israel, hama, isra, palestinian, arab, jew
Topic 27 Top Words:
Highest Prob: like, one, think, just, make, can, get
FREX: linktocommentspostcount, postcounttb, realli, guy, thing, think, mayb
Lift: gender, digbyi, digbi, linktocommentspostcount, postcounttb, dday, crap
Score: gender, linktocommentspostcount, postcounttb, guy, think, digbi, like
Topic 28 Top Words:
Highest Prob: romney, huckabe, eastern, mccain, mitt, updat, fred
FREX: romney, huckabe, mitt, giuliani, eastern, fred, rudi
Lift: giuliani, romney, mitt, huckabe, rudi, fred, eastern
Score: giuliani, romney, huckabe, mitt, eastern, rudi, fred
Topic 29 Top Words:
Highest Prob: campaign, convent, million, money, obama, will, rais
FREX: convent, fundrais, donor, donat, million, kennedi, rais
Lift: inaugur, donor, fundrais, donat, convent, rnc, dnc
Score: inaugur, fundrais, donor, convent, obama, donat, money
Topic 30 Top Words:
Highest Prob: gop, race, dem, candid, new, seat, senat
FREX: dem, smith, gop, incumb, district, race, seat
Lift: incumb, smith, dem, jeff, district, gordon, martin
Score: incumb, dem, gop, race, seat, smith, rep
Topic 31 Top Words:
Highest Prob: obama, ayer, communiti, barack, group, chicago, organ
FREX: ayer, communiti, chicago, jewish, radic, organ, william
Lift: jewish, ayer, laski, chicago, communiti, weather, radic
Score: jewish, ayer, obama, chicago, barack, communiti, radic
Topic 32 Top Words:
Highest Prob: attack, terrorist, kill, terror, bomb, qaeda, muslim
FREX: muslim, bin, laden, terrorist, qaeda, osama, islam
Lift: laden, osama, bin, suicid, muslim, moham, qaeda
Score: laden, terrorist, qaeda, bin, osama, bomb, islam
Topic 33 Top Words:
Highest Prob: democrat, republican, parti, elect, polit, candid, gop
FREX: democrat, parti, republican, elect, gop, nomine, candid
Lift: lean, parti, democrat, republican, toss, nomine, partisan
Score: lean, democrat, republican, parti, gop, elect, candid
Topic 34 Top Words:
Highest Prob: compani, firm, lobbyist, busi, former, campaign, contract
FREX: lobbyist, firm, contract, lobbi, compani, employe, hire
Lift: lobbi, contract, lobbyist, client, firm, davi, boston
Score: lobbi, lobbyist, compani, contract, employe, firm, client
Topic 35 Top Words:
Highest Prob: conserv, liber, polit, progress, will, movement, reagan
FREX: conserv, liber, reagan, progress, movement, ideolog, principl
Lift: louisiana, conservat, conserv, reagan, elit, liber, ronald
Score: louisiana, conserv, liber, progress, reagan, movement, conservat
Topic 36 Top Words:
Highest Prob: loan, mortgag, fanni, home, johnson, freddi, crisi
FREX: fanni, freddi, loan, mae, mortgag, johnson, mac
Lift: mae, fanni, freddi, mac, subprim, borrow, lender
Score: mae, mortgag, fanni, loan, freddi, lender, mac
Topic 37 Top Words:
Highest Prob: iraq, iraqi, troop, forc, will, surg, secur
FREX: iraqi, petraeus, maliki, surg, baghdad, iraq, troop
Lift: maliki, petraeus, iraqi, baghdad, sunni, shiit, surg
Score: maliki, iraqi, iraq, troop, petraeus, baghdad, surg
Topic 38 Top Words:
Highest Prob: senat, bill, vote, legisl, lieberman, will, pass
FREX: legisl, lieberman, bill, reid, senat, pass, harri
Lift: mitch, reid, legisl, lieberman, fisa, immun, veto
Score: mitch, senat, lieberman, bill, vote, legisl, reid
Topic 39 Top Words:
Highest Prob: global, warm, chang, climat, year, water, will
FREX: warm, global, climat, emiss, gore, scientist, offshor
Lift: offshor, warm, emiss, climat, global, environment, ice
Score: offshor, global, warm, climat, emiss, scientist, ice
Topic 40 Top Words:
Highest Prob: palin, sarah, governor, alaska, run, experi, ticket
FREX: palin, sarah, alaska, governor, mate, ticket, experi
Lift: palin, sarah, alaska, mate, ticket, todd, governor
Score: palin, sarah, alaska, governor, gov, ticket, mate
Topic 41 Top Words:
Highest Prob: market, financi, bank, govern, billion, taxpay, street
FREX: market, bank, treasuri, paulson, stock, asset, financi
Lift: paulson, treasuri, asset, stock, bail, investor, bank
Score: paulson, market, treasuri, bank, billion, financi, taxpay
Topic 42 Top Words:
Highest Prob: economi, econom, will, crisi, american, job, year
FREX: economi, econom, crisi, recess, job, wall, street
Lift: phil, economi, recess, economist, econom, unemploy, crisi
Score: phil, economi, econom, crisi, recess, unemploy, economist
Topic 43 Top Words:
Highest Prob: said, report, yesterday, sen, percent, today, accord
FREX: percent, steven, yesterday, sen, accord, thinkfast, releas
Lift: rak, thinkfast, steven, thinkprogress, dil, hurrican, ted
Score: rak, thinkfast, sen, steven, percent, investig, report
Topic 44 Top Words:
Highest Prob: debat, obama, biden, joe, ralli, event, will
FREX: biden, ralli, joe, debat, event, plumber, flag
Lift: ralli, biden, joe, plumber, pin, debat, flag
Score: ralli, biden, joe, obama, debat, plumber, barack
Topic 45 Top Words:
Highest Prob: mccain, john, sen, issu, said, raz, today
FREX: mccain, john, raz, sen, graham, maverick, issu
Lift: raz, mccain, graham, pow, john, maverick, digg
Score: mccain, raz, john, sen, maverick, graham, pow
Topic 46 Top Words:
Highest Prob: voter, vote, elect, acorn, fraud, state, registr
FREX: acorn, fraud, registr, voter, counti, regist, elect
Lift: registr, acorn, fraud, counti, regist, intimid, voter
Score: registr, acorn, voter, fraud, vote, ballot, counti
Topic 47 Top Words:
Highest Prob: said, say, think, know, peopl, ask, question
FREX: think, interview, ask, know, thing, say, said
Lift: regret, gonna, transcript, anybodi, repli, interview, somebodi
Score: regret, said, think, say, know, interview, ask
Topic 48 Top Words:
Highest Prob: war, iraq, militari, american, afghanistan, troop, veteran
FREX: war, saddam, iraq, veteran, militari, afghanistan, invas
Lift: saddam, hussein, war, veteran, invas, vietnam, antiwar
Score: saddam, iraq, war, militari, afghanistan, troop, veteran
Topic 49 Top Words:
Highest Prob: armi, pentagon, dead, build, confirm, sadr, british
FREX: pentagon, armi, sadr, british, marin, dead, navi
Lift: sadr, pentagon, marin, british, basra, armi, flight
Score: sadr, armi, pentagon, basra, british, militia, marin
Topic 50 Top Words:
Highest Prob: media, news, report, press, fox, coverag, stori
FREX: fox, media, news, coverag, msnbc, nbc, network
Lift: sorri, olbermann, fox, outlet, nbc, media, coverag
Score: sorri, media, fox, news, coverag, msnbc, press
Topic 51 Top Words:
Highest Prob: school, univers, rove, research, studi, student, educ
FREX: rove, univers, school, karl, professor, student, research
Lift: stem, rove, karl, professor, univers, academ, student
Score: stem, rove, school, univers, research, student, karl
Topic 52 Top Words:
Highest Prob: obama, campaign, attack, call, say, barack, camp
FREX: campaign, camp, late, attack, advis, negat, clark
Lift: surrog, spokesperson, clark, camp, campaign, negat, unclear
Score: surrog, obama, campaign, attack, camp, barack, clark
Topic 53 Top Words:
Highest Prob: russia, russian, georgia, pakistan, govern, afghanistan, taliban
FREX: russian, russia, taliban, pakistan, georgia, nato, pakistani
Lift: taliban, afghan, russian, pakistani, nato, tribal, russia
Score: taliban, russian, russia, pakistan, nato, pakistani, georgia
Topic 54 Top Words:
Highest Prob: tortur, interrog, use, prison, techniqu, guantanamo, said
FREX: interrog, tortur, techniqu, guantanamo, prison, legal, abu
Lift: techniqu, interrog, tortur, guantanamo, abu, method, prison
Score: techniqu, tortur, interrog, guantanamo, prison, abu, legal
Topic 55 Top Words:
Highest Prob: withdraw, timet, obama, troop, iraq, polici, foreign
FREX: timet, withdraw, troop, almaliki, date, prime, mission
Lift: timet, withdraw, almaliki, date, mission, prime, troop
Score: timet, withdraw, almaliki, troop, iraq, obama, iraqi
NOTE: With the six values of K below and only one computer core dedicated to the task, the following takes about 20 minutes to run.
system.time(
  poliblog5k.searchK <- searchK(documents = poliblog5k.docs,
                                vocab = poliblog5k.voc,
                                K = c(10, 20, 30, 40, 50, 60), # specify K to try
                                N = 500, # matches 10% default
                                proportion = 0.5, # default
                                heldout.seed = 1234, # optional
                                M = 10, # default
                                cores = 1, # default
                                prevalence = ~ rating + s(day),
                                max.em.its = 75,
                                data = poliblog5k.meta,
                                init.type = "Spectral",
                                verbose = TRUE)
)
plot(poliblog5k.searchK)
Looking at the resulting diagnostic plots, it’s hard to argue there’s a single “true” \(K\) in there.
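One way to make that impression concrete is to ask which K each diagnostic would pick on its own. The sketch below is a minimal illustration, not part of the stm package: `best_k_by_metric` is a hypothetical helper, the `toy` data frame contains made-up numbers purely to show the expected shape, and it assumes the `results` element of a searchK object has columns named `K`, `heldout`, `residual`, and `semcoh`. It also assumes the common reading that higher held-out likelihood, lower residuals, and higher semantic coherence are better.

```r
# Hypothetical helper: report the K favored by each diagnostic separately.
best_k_by_metric <- function(results) {
  # Some stm versions store each column as a list, so flatten defensively.
  flat <- as.data.frame(lapply(results, unlist))
  c(heldout  = flat$K[which.max(flat$heldout)],   # higher is better
    residual = flat$K[which.min(flat$residual)],  # lower is better
    semcoh   = flat$K[which.max(flat$semcoh)])    # higher (less negative) is better
}

# Made-up values, only to illustrate the input shape (NOT real results).
toy <- data.frame(K        = c(10, 20, 30),
                  heldout  = c(-7.9, -7.7, -7.6),
                  residual = c(3.1, 2.4, 2.6),
                  semcoh   = c(-60, -75, -90))

best <- best_k_by_metric(toy)
print(best)

# On the object fitted above (not run here):
# best_k_by_metric(poliblog5k.searchK$results)
```

If the three entries disagree, as they do for the toy values, no single K dominates on every criterion, and the choice comes down to which trade-off matters for the analysis.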