Revised October 2021.
The code for the example (the first four code chunks) is taken from the quanteda documentation: https://tutorials.quanteda.io/machine-learning/wordfish/
Wordfish is built into quanteda, so it's easy to run. We'll also compare the Wordfish output to that of a two-topic topic model, so we'll go ahead and load the stm package as well.
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(stm)
stm v1.3.6 successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
We'll use the example provided in the quanteda documentation, based on the speeches of 14 members of the Irish parliament on the 2010 budget.
toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)
tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(tmod_wf)
Call:
textmodel_wordfish.dfm(x = dfmat_irish, dir = c(6, 5))
Estimated Document Positions:
theta se
Lenihan, Brian (FF) 1.79395 0.02007
Bruton, Richard (FG) -0.62137 0.02824
Burton, Joan (LAB) -1.13501 0.01568
Morgan, Arthur (SF) -0.07840 0.02896
Cowen, Brian (FF) 1.77839 0.02330
Kenny, Enda (FG) -0.75343 0.02635
ODonnell, Kieran (FG) -0.47615 0.04309
Gilmore, Eamon (LAB) -0.58474 0.02992
Higgins, Michael (LAB) -1.00390 0.03964
Quinn, Ruairi (LAB) -0.92648 0.04183
Gormley, John (Green) 1.18354 0.07224
Ryan, Eamon (Green) 0.14816 0.06322
Cuffe, Ciaran (Green) 0.71537 0.07291
OCaolain, Caoimhghin (SF) -0.03993 0.03877
Estimated Feature Scores:
when i presented the supplementary
beta -0.1593 0.3179 0.3604 0.1934 1.077
psi 1.6241 2.7239 -1.7958 5.3308 -1.134
budget to this house last april
beta 0.03546 0.3078 0.2474 0.1399 0.2420 -0.1563
psi 2.70992 4.5190 3.4603 1.0396 0.9853 -0.5725
said we could work our way
beta -0.8339 0.4158 -0.6138 0.5223 0.6894 0.2751
psi -0.4515 3.5124 1.0857 1.1151 2.5277 1.4190
through period of severe economic distress
beta 0.6116 0.4986 0.2778 1.229 0.4238 1.799
psi 1.1604 -0.1779 4.4656 -2.013 1.5714 -4.456
today can report that notwithstanding
beta 0.09153 0.3041 0.6199 0.0152 1.799
psi 0.83874 1.5644 -0.2467 3.8379 -4.456
difficulties past
beta 1.175 0.4747
psi -1.357 0.9321
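For reference, the theta, beta, and psi values in this summary are parameters of the Wordfish model (Slapin and Proksch), in which the count of word j in document i is Poisson with log rate alpha_i + psi_j + beta_j * theta_i. The following is a minimal base-R sketch that simulates counts from that model. All parameter values and names are made up for illustration; they are not estimates from the Irish budget corpus.

```r
# Simulate word counts from the Wordfish data-generating model.
# All values below are hypothetical, chosen only for illustration.
set.seed(42)

theta <- c(doc1 = -1, doc2 = -0.5, doc3 = 0.5, doc4 = 1)  # document positions
alpha <- c(0, 0.5, 0.2, 0.7)                               # document fixed effects (length)
psi   <- c(the = 4, budget = 2, widows = 0)                # word fixed effects (frequency)
beta  <- c(the = 0, budget = 0.5, widows = -1.5)           # word discrimination

# log lambda_ij = alpha_i + psi_j + beta_j * theta_i
log_lambda <- outer(alpha, psi, "+") + outer(theta, beta)
dimnames(log_lambda) <- list(names(theta), names(psi))

# Draw Poisson counts with those rates (documents in rows, words in columns)
counts <- matrix(rpois(length(log_lambda), exp(log_lambda)),
                 nrow = nrow(log_lambda), dimnames = dimnames(log_lambda))
counts
```

In this toy setup, a word with beta = 0 (here "the") is used at the same rate everywhere once document length is accounted for, while a word with a large negative beta (here "widows") becomes more common as theta decreases.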
There are some nice plotting functions that make visualizing the estimated "ideal points" and confidence intervals easy:
textplot_scale1d(tmod_wf)
textplot_scale1d(tmod_wf, groups = docvars(dfmat_irish, "party"))
These plots show a pattern we would substantively expect: the members of the governing coalition, Fianna Fáil (FF) and the Greens, are at one end and the opposition parties, Fine Gael (FG), Labour (LAB), and Sinn Féin (SF), are at the other, mostly grouped cleanly by party.
We should also look at the content of this dimension estimated by Wordfish. The generic plotting device for this has a nice word highlighting feature:
textplot_scale1d(tmod_wf, margin="features", highlighted = c("government", "global", "children", "bank", "economy", "the", "citizenship", "productivity", "deficit"))
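That plot ignores the obvious relationship between beta and term frequency (psi): the most extreme betas belong to rare words, which obscures the content of the dimension. To a rough approximation, this can be corrected with part of the "Fightin' Words" logic (Monroe, Colaresi, and Quinn), weighting each beta by the square root of the word's baseline frequency, exp(psi):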
# Rescale each beta by the square root of the word's baseline rate, exp(psi)
zeta_wf <- tmod_wf$beta * sqrt(exp(tmod_wf$psi))
names(zeta_wf) <- colnames(dfmat_irish)
sort(zeta_wf, decreasing = TRUE)[1:30]
to in the will
2.947986 2.942568 2.779370 2.761293
of our and we
2.590582 2.439671 2.416178 2.407680
new million have by
1.649988 1.558124 1.521765 1.516920
€ 2010 for be
1.504021 1.491775 1.469583 1.430539
on this investment also
1.402273 1.395907 1.353887 1.332236
a measures over scheme
1.305618 1.289669 1.267752 1.263855
jobs i public spending
1.243072 1.240858 1.213490 1.209096
more tax
1.208429 1.168634
sort(zeta_wf, decreasing = FALSE)[1:30]
he minister fianna taoiseach
-1.6512857 -1.5335016 -1.3928624 -1.3738525
fáil bank one hear
-1.3532490 -1.3088524 -1.1902332 -1.0635363
could was his anglo
-1.0562421 -1.0333503 -0.9879102 -0.9828783
widows brian lenihan let
-0.9209047 -0.8991126 -0.8835787 -0.8820783
say got they deputy
-0.8509755 -0.8325549 -0.7988309 -0.7763807
because taxpayer too mothers
-0.7733182 -0.7683363 -0.7681856 -0.7621122
people alternative never citizenship
-0.7551422 -0.7438396 -0.7268125 -0.7188290
minister's shops
-0.7185174 -0.7181756
plot(tmod_wf$psi,zeta_wf, col=rgb(0,0,0,.5), pch=19, cex=.5)
text(tmod_wf$psi,zeta_wf, names(zeta_wf), pos=4, cex=.6)
This gives a better picture of how much each word actually contributes to the estimated positions.
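To see why the rescaling matters, here is a small base-R sketch with two hypothetical words: a rare one with a large beta and a common one with a small beta. The numbers are invented for illustration and only roughly mimic the ranges in the summary output. The intuition, stated loosely, is that the sampling noise in a Poisson log rate shrinks as the expected count grows, so beta * sqrt(exp(psi)) behaves something like a z-score.

```r
# Hypothetical beta and psi values for two words (not taken from the fitted model)
beta <- c(rare_word = 1.8, common_word = 0.4)
psi  <- c(rare_word = -4.5, common_word = 4.5)

# Same rescaling as zeta_wf: beta times the square root of the baseline rate
zeta <- beta * sqrt(exp(psi))

# Compare the raw and rescaled orderings
rbind(beta = beta, zeta = zeta)
names(which.max(beta))  # word with the largest raw beta
names(which.max(zeta))  # word with the largest rescaled score
```

The rare word dominates on raw beta, but the common word dominates after rescaling, which is why the sorted zeta lists above are full of frequent words rather than one-off terms.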
First, note that the most "governmenty" words are function/stop words, suggesting the dimension is partially based on document length. Naturally the government talks more, as it is introducing the budget under debate. A quick check is the correlation between log document length and the estimated positions:
cor(log(rowSums(dfmat_irish)),tmod_wf$theta)
[1] 0.1503318
That correlation is positive but weak, so length can be at most part of the story. Second, though, it's not every stop word. The government uses "we", "our", "will", "have". The opposition uses "he", "his", "they", "not", "no".
Beyond these, the government talks of its "schemes" and "investments" and "measures" and "public spending" and the growth of "jobs". The opposition puts titles and names to the "he" and "his" ("taoiseach", i.e., the prime minister, "minister", "deputy", etc.), references higher abstractions like "election" and "citizenship", and points to people hurting from government policy, e.g., "mothers" and "widows".
So, in this case, Wordfish is capturing something resembling government / opposition contrasts. But it's not clear that this is based on things we care about, that this is meaningful for the parties in the "middle", or that this is meaningful for intraparty positions. And it's hopefully clear that this is not an ideological scaling.
This is the kind of setting where Wordfish gives its most plausible results: a corpus focused on one specific issue. In a broader corpus, topical content is likely to define the estimated dimension. A better approach in that instance is something like Wordshoal (Lauderdale and Herzog), which scales within individual debates, and so roughly within topics, and then combines those debate-level dimensions.
We've seen some indicators that a two-topic model can do a similar job. (STM provides a warning message to that effect when you estimate a two-topic model.) Let's try STM.
dfmat_irish_stm <- quanteda::convert(dfmat_irish, to = "stm")
names(dfmat_irish_stm)
[1] "documents" "vocab" "meta"
Noting that it is cuckoo-bananas to run a topic model on 14 "documents" ...
irish_stmfit <- stm(documents = dfmat_irish_stm$documents,
                    vocab = dfmat_irish_stm$vocab,
                    K = 2,
                    max.em.its = 75,
                    data = dfmat_irish_stm$meta,
                    init.type = "Spectral")
K=2 is equivalent to a unidimensional scaling model which you may prefer.
Beginning Spectral Initialization
Calculating the gram matrix...
Finding anchor words...
..
Recovering initialization...
...................................................
Initialization complete.
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 1 (approx. per word bound = -6.311)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 2 (approx. per word bound = -6.245, relative change = 1.036e-02)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 3 (approx. per word bound = -6.235, relative change = 1.568e-03)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 4 (approx. per word bound = -6.235, relative change = 9.442e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 5 (approx. per word bound = -6.235, relative change = 4.480e-05)
Topic 1: the, of, to, and, in
Topic 2: the, to, of, and, in
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 6 (approx. per word bound = -6.234, relative change = 3.591e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 7 (approx. per word bound = -6.234, relative change = 2.953e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 8 (approx. per word bound = -6.234, relative change = 2.371e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 9 (approx. per word bound = -6.234, relative change = 1.891e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 10 (approx. per word bound = -6.234, relative change = 1.544e-05)
Topic 1: the, of, to, and, in
Topic 2: the, to, of, and, in
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 11 (approx. per word bound = -6.234, relative change = 1.332e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 12 (approx. per word bound = -6.234, relative change = 1.170e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Completing Iteration 13 (approx. per word bound = -6.234, relative change = 1.081e-05)
..............
Completed E-Step (0 seconds).
Completed M-Step.
Model Converged
Let's look at those topics.
labelTopics(irish_stmfit)
Topic 1 Top Words:
Highest Prob: the, of, to, and, in, a, is
FREX: deputies, fianna, fáil, bankers, taoiseach, he, border
Lift: 1.3, 204, 500,000, access, ad, advocated, afraid
Score: fianna, deputies, 1.2, fáil, bankers, recipients, border
Topic 2 Top Words:
Highest Prob: the, to, of, and, in, a, we
FREX: invest, innovation, investment, level, reductions, car, enterprising
Lift: 125,000, 130, 2013, 2016, 25,000, 65, achieve
Score: innovation, 28, enterprising, older, stabilise, invest, environmental
At its extremes, FREX does seem to pick out concepts similar to those at the extremes of our zeta measure above. FREX shows topic 1 to be the opposition end -- references to the taoiseach and his party with some ideological content ("bankers", "border") -- and topic 2 to be the government end -- references to investment and innovation.
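FREX behaves like zeta here because it also balances how frequent a word is against how distinctive it is. The following base-R sketch computes a FREX-style score on a toy two-topic vocabulary: the harmonic mean of each word's within-topic rank on frequency and on exclusivity, with equal weights. The probabilities and the helper name ecdf_score are hypothetical, and this is an illustration of the idea rather than stm's exact implementation.

```r
# Toy topic-word probabilities (hypothetical values; each row sums to 1)
phi <- rbind(
  topic1 = c(the = 0.40, minister = 0.20, taoiseach = 0.15,
             bank = 0.13, invest = 0.02, jobs = 0.10),
  topic2 = c(the = 0.40, minister = 0.03, taoiseach = 0.02,
             bank = 0.10, invest = 0.25, jobs = 0.20)
)

# Exclusivity: share of a word's total probability that belongs to each topic
excl <- sweep(phi, 2, colSums(phi), "/")

# Within-topic empirical CDF position of each word (rank / vocabulary size)
ecdf_score <- function(x) ecdf(x)(x)

w <- 0.5  # equal weight on frequency and exclusivity (an assumption)
frex <- t(sapply(rownames(phi), function(k) {
  1 / (w / ecdf_score(phi[k, ]) + (1 - w) / ecdf_score(excl[k, ]))
}))
colnames(frex) <- colnames(phi)

round(frex, 3)
names(which.max(frex["topic1", ]))  # top FREX word for topic 1
names(which.max(frex["topic2", ]))  # top FREX word for topic 2
```

In this toy example "the" has the highest probability in both topics but is not exclusive to either, so it is not the top FREX word in either topic, much as the zeta rescaling and the highest-probability lists above tell different stories.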
Note, however, that the equivalent of "positions" -- the thetas indicating topic proportion -- are mostly shoved to the extremes, suggesting that this specific two-topic model is mainly acting as a government/opposition classifier.
# Use a data frame (not cbind) so the positions stay numeric, and label rows by speaker
compare.df <- data.frame(name = docnames(dfmat_irish),
                         wordfish = round(tmod_wf$theta, 3),
                         stm = round(irish_stmfit$theta[, 2], 3))
compare.df
name wordfish stm
1 Lenihan, Brian (FF) 1.794 1.000
2 Bruton, Richard (FG) -0.621 0.775
3 Burton, Joan (LAB) -1.135 0.000
4 Morgan, Arthur (SF) -0.078 0.000
5 Cowen, Brian (FF) 1.778 1.000
6 Kenny, Enda (FG) -0.753 0.000
7 ODonnell, Kieran (FG) -0.476 0.001
8 Gilmore, Eamon (LAB) -0.585 0.001
9 Higgins, Michael (LAB) -1.004 0.471
10 Quinn, Ruairi (LAB) -0.926 0.489
11 Gormley, John (Green) 1.184 0.998
12 Ryan, Eamon (Green) 0.148 0.997
13 Cuffe, Ciaran (Green) 0.715 0.998
14 OCaolain, Caoimhghin (SF) -0.040 0.000
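One way to quantify how the two sets of positions relate is to correlate them and to count how many STM proportions sit near 0 or 1. The sketch below re-enters the values printed above, rounded to three decimals, so that it runs without the fitted models. The 0.01 and 0.99 cutoffs for "extreme" are an arbitrary choice made for illustration.

```r
# Positions copied from the comparison table above, rounded to three decimals
wordfish <- c(1.794, -0.621, -1.135, -0.078, 1.778, -0.753, -0.476,
              -0.585, -1.004, -0.926, 1.184, 0.148, 0.715, -0.040)
stm_theta <- c(1.000, 0.775, 0.000, 0.000, 1.000, 0.000, 0.001,
               0.001, 0.471, 0.489, 0.998, 0.997, 0.998, 0.000)

# Linear and rank agreement between the two scalings
cor(wordfish, stm_theta)
cor(wordfish, stm_theta, method = "spearman")

# Number of documents whose topic proportion is pushed to an extreme
# (cutoffs of 0.01 and 0.99 are an illustrative assumption)
n_extreme <- sum(stm_theta < 0.01 | stm_theta > 0.99)
n_extreme
```

With the actual model objects in memory, the same checks could be run directly on tmod_wf$theta and irish_stmfit$theta[, 2].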
This does hopefully make clear the mathematical similarities between unsupervised topic modeling and unsupervised scaling -- one can be interpreted as the other -- despite their ostensibly very different conceptual measurement objectives.