An Introduction to Scaling with Wordfish

Revised October 2021.

The code for the example (the first four code chunks) is taken from the quanteda documentation: https://tutorials.quanteda.io/machine-learning/wordfish/

Wordfish

Wordfish is available through the quanteda family of packages (the model itself lives in quanteda.textmodels), so it's easy to run. We'll also compare the Wordfish output to that of a two-topic topic model, so we'll go ahead and load the stm package as well.

library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(stm)

stm v1.3.6 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com

We'll use the example provided in the quanteda documentation, based on the speeches of 14 members of the Irish parliament on the 2010 budget.

toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)
tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(tmod_wf)


Call:
textmodel_wordfish.dfm(x = dfmat_irish, dir = c(6, 5))

Estimated Document Positions:
                             theta      se
Lenihan, Brian (FF)        1.79395 0.02007
Bruton, Richard (FG)      -0.62137 0.02824
Burton, Joan (LAB)        -1.13501 0.01568
Morgan, Arthur (SF)       -0.07840 0.02896
Cowen, Brian (FF)          1.77839 0.02330
Kenny, Enda (FG)          -0.75343 0.02635
ODonnell, Kieran (FG)     -0.47615 0.04309
Gilmore, Eamon (LAB)      -0.58474 0.02992
Higgins, Michael (LAB)    -1.00390 0.03964
Quinn, Ruairi (LAB)       -0.92648 0.04183
Gormley, John (Green)      1.18354 0.07224
Ryan, Eamon (Green)        0.14816 0.06322
Cuffe, Ciaran (Green)      0.71537 0.07291
OCaolain, Caoimhghin (SF) -0.03993 0.03877

Estimated Feature Scores:
        when      i presented    the supplementary
beta -0.1593 0.3179    0.3604 0.1934         1.077
psi   1.6241 2.7239   -1.7958 5.3308        -1.134
      budget     to   this  house   last   april
beta 0.03546 0.3078 0.2474 0.1399 0.2420 -0.1563
psi  2.70992 4.5190 3.4603 1.0396 0.9853 -0.5725
        said     we   could   work    our    way
beta -0.8339 0.4158 -0.6138 0.5223 0.6894 0.2751
psi  -0.4515 3.5124  1.0857 1.1151 2.5277 1.4190
     through  period     of severe economic distress
beta  0.6116  0.4986 0.2778  1.229   0.4238    1.799
psi   1.1604 -0.1779 4.4656 -2.013   1.5714   -4.456
       today    can  report   that notwithstanding
beta 0.09153 0.3041  0.6199 0.0152           1.799
psi  0.83874 1.5644 -0.2467 3.8379          -4.456
     difficulties   past
beta        1.175 0.4747
psi        -1.357 0.9321
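For readers who want to connect these estimates to the model, here is a minimal sketch of the data-generating process Wordfish assumes: word counts are Poisson, with log rate alpha_i + psi_j + beta_j * theta_i. The names theta, beta, and psi match the summary above; alpha (a document fixed effect) is part of the standard Wordfish model but is not printed in the summary. The sizes and the distributions used to draw the parameters are arbitrary assumptions, chosen only for illustration.

```r
# Simulate counts from the Wordfish model:
#   y_ij ~ Poisson(exp(alpha_i + psi_j + beta_j * theta_i))
# All parameter values here are made up for illustration.
set.seed(2010)
n_docs  <- 14
n_feats <- 500

theta <- as.numeric(scale(rnorm(n_docs)))    # document positions, mean 0 and sd 1
alpha <- rnorm(n_docs, mean = 0, sd = 0.5)   # document fixed effects
psi   <- rnorm(n_feats, mean = 0, sd = 1.5)  # word fixed effects (overall frequency)
beta  <- rnorm(n_feats, mean = 0, sd = 0.5)  # word weights on the latent dimension

# log_lambda[i, j] = alpha[i] + psi[j] + beta[j] * theta[i]
log_lambda <- outer(alpha, psi, "+") + outer(theta, beta)
counts <- matrix(rpois(n_docs * n_feats, exp(log_lambda)),
                 nrow = n_docs, ncol = n_feats)

dim(counts)
```

In this parameterization alpha captures how much each speaker says overall and psi how common each word is, so theta and beta describe only relative differences in word use.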

There are some nice plotting functions that make it easy to visualize the estimated "ideal points" and their confidence intervals:

textplot_scale1d(tmod_wf)

textplot_scale1d(tmod_wf, groups = docvars(dfmat_irish, "party"))

These plots show a pattern we would substantively expect. The members of the governing coalition, Fianna Fáil (FF) and the Greens, are at one end, and the opposition parties, Fine Gael (FG), Labour (LAB), and Sinn Féin (SF), are at the other, mostly grouped cleanly by party.

We should also look at the content of this dimension estimated by Wordfish. The generic plotting device for this has a nice word highlighting feature:

textplot_scale1d(tmod_wf, margin="features", highlighted = c("government", "global", "children", "bank", "economy", "the", "citizenship", "productivity", "deficit"))

That plot ignores the obvious relationship between beta and term frequency, which obscures the content. To a rough approximation, this can be corrected with part of the "Fightin' Words" logic:

# Weight each word's beta by the square root of its expected frequency
zeta_wf <- tmod_wf$beta * sqrt(exp(tmod_wf$psi))
names(zeta_wf) <- colnames(dfmat_irish)
sort(zeta_wf, decreasing = TRUE)[1:30]

        to         in        the       will 
  2.947986   2.942568   2.779370   2.761293 
        of        our        and         we 
  2.590582   2.439671   2.416178   2.407680 
       new    million       have         by 
  1.649988   1.558124   1.521765   1.516920 
         €       2010        for         be 
  1.504021   1.491775   1.469583   1.430539 
        on       this investment       also 
  1.402273   1.395907   1.353887   1.332236 
         a   measures       over     scheme 
  1.305618   1.289669   1.267752   1.263855 
      jobs          i     public   spending 
  1.243072   1.240858   1.213490   1.209096 
      more        tax 
  1.208429   1.168634

sort(zeta_wf, decreasing = FALSE)[1:30]

         he    minister      fianna   taoiseach 
 -1.6512857  -1.5335016  -1.3928624  -1.3738525 
       fáil        bank         one        hear 
 -1.3532490  -1.3088524  -1.1902332  -1.0635363 
      could         was         his       anglo 
 -1.0562421  -1.0333503  -0.9879102  -0.9828783 
     widows       brian     lenihan         let 
 -0.9209047  -0.8991126  -0.8835787  -0.8820783 
        say         got        they      deputy 
 -0.8509755  -0.8325549  -0.7988309  -0.7763807 
    because    taxpayer         too     mothers 
 -0.7733182  -0.7683363  -0.7681856  -0.7621122 
     people alternative       never citizenship 
 -0.7551422  -0.7438396  -0.7268125  -0.7188290 
 minister's       shops 
 -0.7185174  -0.7181756

plot(tmod_wf$psi,zeta_wf, col=rgb(0,0,0,.5), pch=19, cex=.5)
text(tmod_wf$psi,zeta_wf, names(zeta_wf), pos=4, cex=.6)

This better captures the impact each word has on the estimated dimension.

First, note that the most "governmenty" words are function/stop words, suggesting the dimension is partially based on document length. Naturally the government talks more, as it is introducing the budget under debate. The correlation between log document length and the estimated positions gives a rough check:

cor(log(rowSums(dfmat_irish)),tmod_wf$theta)

[1] 0.1503318

Second, though, it's not every stop word, and the correlation with document length is weak (about 0.15). The government uses "we", "our", "will", "have". The opposition uses "he", "his", "they", "not", "no".

Beyond these, the government talks of its "schemes" and "investments" and "measures" and "public spending" and the growth of "jobs". The opposition puts titles and names to the "he" and "his" ("taoiseach" (prime minister), "deputy minister", etc.), references higher abstractions like "election" and "citizenship", and points to people hurting from government policy, e.g., "mothers" and "widows".

So, in this case, Wordfish is capturing something resembling government / opposition contrasts. But it's not clear that this is based on things we care about, that this is meaningful for the parties in the "middle", or that this is meaningful for intraparty positions. And it's hopefully clear that this is not an ideological scaling.

This is the kind of example where Wordfish provides its most plausible results: a corpus focused on one specific issue. In a broader corpus, topical content is likely to define the estimated dimension. A better approach in that instance is something like WordShoal (Lauderdale and Herzog), which scales within topics and then combines those dimensions.

Scaling with a two-topic model

We've seen some indicators that a two-topic model can do a similar job. (STM provides a warning message to that effect when you estimate a two-topic model.) Let's try STM.

dfmat_irish_stm <- quanteda::convert(dfmat_irish, to = "stm")
names(dfmat_irish_stm)

[1] "documents" "vocab"     "meta"

Noting that it is cuckoo-bananas to run a topic model on 14 "documents" ...

irish_stmfit <- stm(documents = dfmat_irish_stm$documents, 
                     vocab = dfmat_irish_stm$vocab,
                     K = 2,
                     max.em.its = 75,
                     data = dfmat_irish_stm$meta,
                     init.type = "Spectral")

K=2 is equivalent to a unidimensional scaling model which you may prefer.

Beginning Spectral Initialization 
     Calculating the gram matrix...
     Finding anchor words...
    ..
     Recovering initialization...
    ...................................................
Initialization complete.
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 1 (approx. per word bound = -6.311) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 2 (approx. per word bound = -6.245, relative change = 1.036e-02) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 3 (approx. per word bound = -6.235, relative change = 1.568e-03) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 4 (approx. per word bound = -6.235, relative change = 9.442e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 5 (approx. per word bound = -6.235, relative change = 4.480e-05) 
Topic 1: the, of, to, and, in 
 Topic 2: the, to, of, and, in 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 6 (approx. per word bound = -6.234, relative change = 3.591e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 7 (approx. per word bound = -6.234, relative change = 2.953e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 8 (approx. per word bound = -6.234, relative change = 2.371e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 9 (approx. per word bound = -6.234, relative change = 1.891e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 10 (approx. per word bound = -6.234, relative change = 1.544e-05) 
Topic 1: the, of, to, and, in 
 Topic 2: the, to, of, and, in 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 11 (approx. per word bound = -6.234, relative change = 1.332e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 12 (approx. per word bound = -6.234, relative change = 1.170e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Completing Iteration 13 (approx. per word bound = -6.234, relative change = 1.081e-05) 
..............
Completed E-Step (0 seconds). 
Completed M-Step. 
Model Converged

Let's look at those topics.

labelTopics(irish_stmfit)

Topic 1 Top Words:
     Highest Prob: the, of, to, and, in, a, is 
     FREX: deputies, fianna, fáil, bankers, taoiseach, he, border 
     Lift: 1.3, 204, 500,000, access, ad, advocated, afraid 
     Score: fianna, deputies, 1.2, fáil, bankers, recipients, border 
Topic 2 Top Words:
     Highest Prob: the, to, of, and, in, a, we 
     FREX: invest, innovation, investment, level, reductions, car, enterprising 
     Lift: 125,000, 130, 2013, 2016, 25,000, 65, achieve 
     Score: innovation, 28, enterprising, older, stabilise, invest, environmental

FREX does in fact seem to pick out concepts similar to those at the extremes of our zeta measure above. FREX shows topic 1 to be the opposition end (references to the taoiseach and his party, with some ideological content such as "bankers" and "border") and topic 2 to be the government end (references to investment and innovation).

Note, however, that the equivalent of "positions" -- the thetas indicating topic proportion -- are mostly shoved to the extremes, suggesting that this specific two-topic model is mainly acting as a government/opposition classifier.

# Use a data frame (not cbind) so the estimates stay numeric, and label rows by speaker
compare.df <- data.frame(name = docnames(dfmat_irish),
                         wordfish = round(tmod_wf$theta, 3),
                         stm = round(irish_stmfit$theta[, 2], 3))
compare.df

                        name wordfish   stm
1        Lenihan, Brian (FF)    1.794 1.000
2       Bruton, Richard (FG)   -0.621 0.775
3         Burton, Joan (LAB)   -1.135 0.000
4        Morgan, Arthur (SF)   -0.078 0.000
5          Cowen, Brian (FF)    1.778 1.000
6           Kenny, Enda (FG)   -0.753 0.000
7      ODonnell, Kieran (FG)   -0.476 0.001
8       Gilmore, Eamon (LAB)   -0.585 0.001
9     Higgins, Michael (LAB)   -1.004 0.471
10       Quinn, Ruairi (LAB)   -0.926 0.489
11     Gormley, John (Green)    1.184 0.998
12       Ryan, Eamon (Green)    0.148 0.997
13     Cuffe, Ciaran (Green)    0.715 0.998
14 OCaolain, Caoimhghin (SF)   -0.040 0.000
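To put a number on how closely the two sets of positions agree, one can compare them directly. The sketch below hard-codes the values from the table above, rounded to three decimals, so that it runs without refitting either model; the 0.05 and 0.95 cutoffs for calling a topic proportion "extreme" are an arbitrary assumption.

```r
# Values copied from the comparison table above, rounded to three decimals.
wordfish <- c(1.794, -0.621, -1.135, -0.078, 1.778, -0.753, -0.476,
              -0.585, -1.004, -0.926, 1.184, 0.148, 0.715, -0.040)
stm_topic2 <- c(1.000, 0.775, 0.000, 0.000, 1.000, 0.000, 0.001,
                0.001, 0.471, 0.489, 0.998, 0.997, 0.998, 0.000)

# Linear agreement between the two scalings.
pearson <- cor(wordfish, stm_topic2)

# Number of documents whose topic-2 proportion is not near 0 or 1
# (cutoffs are an assumption made for this illustration).
n_interior <- sum(stm_topic2 > 0.05 & stm_topic2 < 0.95)

c(pearson = pearson, n_interior = n_interior)
```

Only 3 of the 14 speeches fall strictly between the cutoffs, which is what "shoved to the extremes" looks like numerically; the Wordfish positions, by contrast, are spread along a continuum.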

This does hopefully make clear the mathematical similarities between unsupervised topic modeling and unsupervised scaling -- one can be interpreted as the other -- despite the ostensibly very different conceptual measurement objectives.

Prepared for Text as Data, Penn State

Burt L. Monroe
