Structural topic modeling

How to use the stm package (workflow figure from the stm vignette)

The topic modeling part

Find topics in your data!

install.packages(c("stm", "SnowballC")) # probably new
install.packages(c("dplyr", "tidyr")) # if you don't have them already

First, getting my data ready. Your data should be a data frame with one document per row and a column called “documents” that holds all of the text. Any other variables can be added as additional columns.

df <- read.table("/Users/TARDIS/Documents/STUDIES/context_word_seg/utt_orth_phon_KEY.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE, quote="", comment.char ="") # this gets used for word-lists contexts, it will get over-written for other contexts
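library(dplyr); library(tidyr)
# split the utterance ID into a two-letter child code and a two-digit age in weeks
data <- df %>%
  select(-phon) %>%
  extract(col=utt, into=c("child", "age_weeks"), regex="^([[:alpha:]]{2})([[:digit:]]{2})")
# number consecutive blocks of 30 utterances; each block becomes one document
data$temp <- gl(n=ceiling(nrow(data)/30), k=30)[1:nrow(data)]
# paste the utterances in each block (within child and age) into a single text
data <- group_by(data, child, age_weeks, temp) %>%
  summarize(documents=paste(orth, collapse=" ")) %>%
  select(-temp)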
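
To see what the chunking step does, here is a minimal sketch on made-up data. The toy utterances and the chunk size of 3 (instead of 30) are assumptions for illustration, and it uses only base R rather than dplyr.

```r
# Toy stand-in for the corpus: 7 utterances (hypothetical data)
toy <- data.frame(orth = c("hello baby", "look at that", "come here",
                           "good girl", "want a bath", "tickle tickle",
                           "all clean"),
                  stringsAsFactors = FALSE)

chunk_size <- 3  # the real code above uses 30

# gl() builds a factor with each level repeated chunk_size times;
# indexing by 1:nrow(toy) trims the unused tail of the last chunk
toy$temp <- gl(n = ceiling(nrow(toy) / chunk_size), k = chunk_size)[1:nrow(toy)]

# Paste the utterances within each chunk into one document
documents <- tapply(toy$orth, toy$temp, paste, collapse = " ")
docs <- data.frame(documents = as.vector(documents), stringsAsFactors = FALSE)
print(docs)
```

The last chunk simply ends up shorter than the others when the number of utterances is not a multiple of the chunk size.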

Now pass it to stm for preprocessing.

library(stm)
## stm v1.1.3 (2016-01-14) successfully loaded. See ?stm for help.
processed <- textProcessor(data$documents, metadata = data)
## Building corpus... 
## Converting to Lower Case... 
## Removing stopwords... 
## Removing numbers... 
## Removing punctuation... 
## Stemming... 
## Creating Output...
out <- prepDocuments(processed$documents, processed$vocab, processed$meta) # removes infrequent terms depending on user-set parameter lower.thresh (the minimum number of documents a word needs to appear in in order for the word to be kept within the vocabulary)
## Removing 504 of 1222 terms (504 of 9742 tokens) due to frequency 
## Your corpus now has 460 documents, 718 terms and 9238 tokens.

Take a look at those messages. I left everything at default here, but you may or may not want to, depending on your research question.
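
As a rough illustration of what lower.thresh does, the sketch below counts how many documents each term appears in and drops the terms that appear in lower.thresh documents or fewer. This is not the stm implementation: the toy documents and the helper name drop_rare_terms are made up, and the rule is my reading of the prepDocuments documentation (default lower.thresh = 1).

```r
# Hypothetical helper, not part of stm: drop terms that appear in
# lower.thresh documents or fewer
drop_rare_terms <- function(docs, lower.thresh = 1) {
  tokens <- strsplit(docs, " ", fixed = TRUE)
  # Document frequency: count each term at most once per document
  doc_freq <- table(unlist(lapply(tokens, unique)))
  keep <- names(doc_freq)[doc_freq > lower.thresh]
  lapply(tokens, function(tok) tok[tok %in% keep])
}

toy_docs <- c("come here darling", "come look here", "look at the bath")
kept <- drop_rare_terms(toy_docs, lower.thresh = 1)
print(kept)
```

With the default threshold, every word that occurs in only one document disappears, which matches the message above where the number of removed terms equals the number of removed tokens.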

Now let’s get some topics!! (This might take a looooong time to run.)

fit0 <- stm(out$documents, # the documents
            out$vocab, # the words
            K = 10, # 10 topics
            max.em.its = 75, # set to run for a maximum of 75 EM iterations
            data = out$meta, # all the variables (we're not actually including any predictors in this model, though)
            init.type = "Spectral")  
## Beginning Initialization.
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      .......
## Initialization complete.
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -5.240) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -4.973, relative change = 5.097e-02) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -4.893, relative change = 1.603e-02) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -4.865, relative change = 5.813e-03) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -4.851, relative change = 2.764e-03) 
## Topic 1: come, darl, yes, ssh, bath 
##  Topic 2: look, want, hey, like, can 
##  Topic 3: dear, hmm, come, want, smile 
##  Topic 4: come, now, tell, got, yes 
##  Topic 5: hello, hmm, boo, gillian, hey 
##  Topic 6: yes, come, hey, can, tell 
##  Topic 7: yes, got, tell, come, can 
##  Topic 8: hey, mummi, yes, dear, hannah 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, hey, come, got 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -4.844, relative change = 1.453e-03) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -4.840, relative change = 8.955e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -4.837, relative change = 5.594e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 9 (approx. per word bound = -4.835, relative change = 4.705e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 10 (approx. per word bound = -4.833, relative change = 4.139e-04) 
## Topic 1: come, darl, yes, bath, ssh 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, good, want, smile 
##  Topic 4: come, now, got, alright, know 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, come, well, tell, can 
##  Topic 7: yes, got, tell, come, want 
##  Topic 8: hey, mummi, yes, smile, hannah 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, clever, hey, come 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 11 (approx. per word bound = -4.832, relative change = 3.135e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 12 (approx. per word bound = -4.830, relative change = 2.621e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 13 (approx. per word bound = -4.829, relative change = 2.242e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 14 (approx. per word bound = -4.828, relative change = 2.136e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 15 (approx. per word bound = -4.827, relative change = 2.183e-04) 
## Topic 1: darl, come, yes, bath, alright 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, good, got, smile 
##  Topic 4: come, now, alright, know, got 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, well, come, tell, can 
##  Topic 7: yes, got, tell, come, want 
##  Topic 8: hey, mummi, smile, yes, hannah 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, clever, hey, boy 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 16 (approx. per word bound = -4.826, relative change = 2.327e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 17 (approx. per word bound = -4.825, relative change = 2.282e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 18 (approx. per word bound = -4.824, relative change = 1.872e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 19 (approx. per word bound = -4.823, relative change = 1.281e-04) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 20 (approx. per word bound = -4.823, relative change = 9.551e-05) 
## Topic 1: darl, come, yes, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, good, got, girl 
##  Topic 4: come, now, know, alright, got 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, well, come, tell, can 
##  Topic 7: yes, got, tell, come, two 
##  Topic 8: hey, mummi, smile, hannah, yes 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, clever, hey, come 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 21 (approx. per word bound = -4.823, relative change = 8.411e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 22 (approx. per word bound = -4.822, relative change = 9.121e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 23 (approx. per word bound = -4.822, relative change = 9.178e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 24 (approx. per word bound = -4.821, relative change = 8.362e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 25 (approx. per word bound = -4.821, relative change = 6.609e-05) 
## Topic 1: darl, come, yes, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, good, got, girl 
##  Topic 4: come, now, know, alright, get 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, come, well, tell, can 
##  Topic 7: yes, got, tell, two, come 
##  Topic 8: hey, mummi, smile, hannah, yes 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, clever, hey, well 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 26 (approx. per word bound = -4.821, relative change = 6.486e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 27 (approx. per word bound = -4.820, relative change = 7.197e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 28 (approx. per word bound = -4.820, relative change = 8.080e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 29 (approx. per word bound = -4.819, relative change = 8.420e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 30 (approx. per word bound = -4.819, relative change = 8.447e-05) 
## Topic 1: darl, come, yes, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, good, got, girl 
##  Topic 4: come, now, know, alright, get 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, come, tell, well, can 
##  Topic 7: yes, tell, got, two, come 
##  Topic 8: hey, smile, mummi, hannah, look 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, clever, well, hey 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 31 (approx. per word bound = -4.819, relative change = 6.516e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 32 (approx. per word bound = -4.818, relative change = 6.256e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 33 (approx. per word bound = -4.818, relative change = 7.905e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 34 (approx. per word bound = -4.818, relative change = 6.690e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 35 (approx. per word bound = -4.818, relative change = 4.168e-05) 
## Topic 1: darl, come, yes, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, got, good, girl 
##  Topic 4: come, now, know, alright, want 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, come, tell, well, can 
##  Topic 7: yes, tell, got, two, one 
##  Topic 8: hey, smile, mummi, hannah, look 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, well, clever, hey 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 36 (approx. per word bound = -4.817, relative change = 3.416e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 37 (approx. per word bound = -4.817, relative change = 3.527e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 38 (approx. per word bound = -4.817, relative change = 3.889e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 39 (approx. per word bound = -4.817, relative change = 4.008e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 40 (approx. per word bound = -4.817, relative change = 3.828e-05) 
## Topic 1: darl, yes, come, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, got, good, girl 
##  Topic 4: come, now, know, alright, want 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, tell, come, can, well 
##  Topic 7: yes, got, tell, two, one 
##  Topic 8: hey, smile, mummi, look, hannah 
##  Topic 9: mummi, yes, girl, hello, hannah 
##  Topic 10: tickl, yes, well, clever, christoph 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 41 (approx. per word bound = -4.817, relative change = 3.309e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 42 (approx. per word bound = -4.816, relative change = 3.105e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 43 (approx. per word bound = -4.816, relative change = 2.992e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 44 (approx. per word bound = -4.816, relative change = 2.875e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 45 (approx. per word bound = -4.816, relative change = 2.589e-05) 
## Topic 1: darl, yes, come, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, got, good, girl 
##  Topic 4: come, now, know, alright, want 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, tell, come, can, stori 
##  Topic 7: yes, got, tell, two, one 
##  Topic 8: hey, smile, mummi, look, hannah 
##  Topic 9: mummi, yes, girl, hello, alright 
##  Topic 10: tickl, well, yes, clever, christoph 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 46 (approx. per word bound = -4.816, relative change = 1.884e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 47 (approx. per word bound = -4.816, relative change = 1.983e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 48 (approx. per word bound = -4.816, relative change = 1.887e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 49 (approx. per word bound = -4.816, relative change = 1.402e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 50 (approx. per word bound = -4.816, relative change = 1.453e-05) 
## Topic 1: darl, yes, come, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, got, good, girl 
##  Topic 4: come, now, know, alright, want 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, tell, come, can, stori 
##  Topic 7: yes, got, tell, two, one 
##  Topic 8: hey, smile, mummi, look, hannah 
##  Topic 9: mummi, yes, girl, hello, alright 
##  Topic 10: tickl, well, yes, clever, christoph 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 51 (approx. per word bound = -4.815, relative change = 1.471e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 52 (approx. per word bound = -4.815, relative change = 1.221e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 53 (approx. per word bound = -4.815, relative change = 1.442e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 54 (approx. per word bound = -4.815, relative change = 1.921e-05) 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 55 (approx. per word bound = -4.815, relative change = 1.180e-05) 
## Topic 1: darl, yes, come, bath, clean 
##  Topic 2: look, want, can, like, nice 
##  Topic 3: dear, come, got, good, girl 
##  Topic 4: come, now, alright, know, want 
##  Topic 5: hello, hmm, boo, gillian, matter 
##  Topic 6: yes, tell, come, can, stori 
##  Topic 7: yes, got, two, tell, can 
##  Topic 8: hey, smile, mummi, look, hannah 
##  Topic 9: mummi, yes, girl, hello, alright 
##  Topic 10: tickl, well, yes, clever, christoph 
## ...................................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged
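
The "relative change" values in the log can be reproduced by hand from the printed bounds. The sketch below assumes the change is computed as (new - old) / |old| and that the stopping tolerance corresponds to stm's emtol argument (1e-5 by default, as far as I know). Because the log prints the bounds to three decimals, the result only approximately matches the logged 5.097e-02.

```r
# Bounds printed for iterations 1 and 2 in the log above
old_bound <- -5.240
new_bound <- -4.973

# Assumed formula: improvement relative to the size of the old bound
rel_change <- (new_bound - old_bound) / abs(old_bound)
print(rel_change)  # close to the logged value, up to rounding of the bounds

emtol <- 1e-5  # assumed to match stm's default tolerance
converged <- rel_change < emtol
print(converged)
```

This is also why the run stops well before max.em.its = 75: the last logged change (1.180e-05) is just above the tolerance, and the next iteration falls below it.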

Note: “The default is init.type = "LDA" but in practice researchers on personal computers with vocabularies less than 10,000 can utilize the spectral initialization successfully.” Spectral initialization is deterministic and is the approach the stm vignette recommends, so use it when your vocabulary is small enough. If you have a very large dataset, read up on how to correctly use the LDA option in the stm vignette.

Yay! Let’s look:

labelTopics(fit0)
## Topic 1 Top Words:
##       Highest Prob: darl, yes, come, bath, clean, nice, put 
##       FREX: clean, bath, wee, pet, darl, dirti, dri 
##       Lift: rid, took, vesti, armi, bobo, petal, soapi 
##       Score: bobo, juli, darl, clean, pet, bath, wee 
## Topic 2 Top Words:
##       Highest Prob: look, want, can, like, nice, littl, make 
##       FREX: lewi, make, bit, foot, rattl, play, push 
##       Lift: bite, cupboard, cupsi, lew, lewi, newspap, pour 
##       Score: juli, newspap, lewi, foot, charl, boy, joseph 
## Topic 3 Top Words:
##       Highest Prob: dear, come, got, good, girl, windi, littl 
##       FREX: dear, windi, sleepi, bless, whoop, head, downstair 
##       Lift: unhappi, milki, gracious, hurri, downstair, sock, dear 
##       Score: juli, unhappi, dear, windi, sleepi, head, fatti 
## Topic 4 Top Words:
##       Highest Prob: come, now, alright, know, want, get, got 
##       FREX: keith, shut, readi, bib, attent, bottl, first 
##       Lift: cake, dessert, dream, five, self, allow, attent 
##       Score: self, keith, crocodil, alright, attent, first, juic 
## Topic 5 Top Words:
##       Highest Prob: hello, hmm, boo, gillian, matter, want, cheeki 
##       FREX: boo, gillian, cheeki, hmm, hello, ticki, monkey 
##       Lift: pinch, poke, cheeki, gillian, boo, mayb, moan 
##       Score: cheeki, juli, hello, boo, gillian, hmm, ticki 
## Topic 6 Top Words:
##       Highest Prob: yes, tell, come, can, stori, mum, good 
##       FREX: quack, stori, thumb, tell, finger, suck, scalp 
##       Lift: bodi, cmon, concentr, pretend, scalp, silent, sober 
##       Score: concentr, juli, quack, tell, scalp, stori, clap 
## Topic 7 Top Words:
##       Highest Prob: yes, got, two, tell, can, one, bad 
##       FREX: cold, two, pretti, nose, walk, happi, bad 
##       Lift: color, nuddi, cardigan, cold, mother, punch, tale 
##       Score: cold, juli, pram, yes, two, nose, pretti 
## Topic 8 Top Words:
##       Highest Prob: hey, smile, mummi, look, hannah, hold, yes 
##       FREX: shh, hey, smile, lambchop, chou, hold, hannah 
##       Lift: contempl, fine, juli, wiggler, shh, botti, friend 
##       Score: juli, wiggler, hey, hannah, lambchop, shh, smile 
## Topic 9 Top Words:
##       Highest Prob: mummi, yes, girl, hello, alright, hannah, big 
##       FREX: girli, mummi, sweetheart, alright, girl, iron, stretch 
##       Lift: globe, pick, prove, rubber, scream, splish, wobbl 
##       Score: juli, splish, hannah, hello, mummi, girl, alright 
## Topic 10 Top Words:
##       Highest Prob: tickl, well, yes, clever, christoph, kick, boy 
##       FREX: tickl, christoph, well, bash, clever, parrot, boy 
##       Lift: ahhphroooowp, tickl, bash, plenti, kitten, music, chris 
##       Score: ahhphroooowp, juli, tickl, well, christoph, clever, boy

For more information on FREX and high probability rankings, see Roberts et al. (2013, 2015, 2014); Lucas et al. (2015). For more information on score, see the lda R package. For more information on lift, see Taddy (2013).
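
To make the difference between the highest-probability and lift rankings concrete, here is a small sketch on a made-up topic-word matrix. It assumes lift is a word's probability within a topic divided by its overall empirical frequency in the corpus (the idea in Taddy 2013). The matrix beta, the counts, and the word list are invented, and stm's own computation may differ in detail.

```r
words <- c("yes", "come", "bath", "tickl")

# Hypothetical topic-word probabilities (rows = topics, each row sums to 1)
beta <- rbind(topic1 = c(0.45, 0.30, 0.20, 0.05),
              topic2 = c(0.55, 0.25, 0.02, 0.18))
colnames(beta) <- words

# Hypothetical corpus-wide word counts
word_counts <- c(yes = 500, come = 300, bath = 80, tickl = 120)
emp_freq <- word_counts / sum(word_counts)

# Lift: topic probability relative to overall frequency
lift <- sweep(beta, 2, emp_freq, "/")

# Top two words per topic by probability and by lift
top_prob <- apply(beta, 1, function(p) words[order(p, decreasing = TRUE)[1:2]])
top_lift <- apply(lift, 1, function(l) words[order(l, decreasing = TRUE)[1:2]])
print(top_prob)
print(top_lift)
```

In this toy example both topics share the same highest-probability words, while lift brings out the words that set each topic apart. That is the same pattern as in the output above, where "yes" and "come" dominate the Highest Prob rows but not the Lift rows.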

plot(fit0, type = "labels") # dispatches to the STM plot method; plot.STM() may not be callable directly in newer stm versions