Topic Modeling Example with Support Conversations
Big ups to Kara Woo for pointing me to this blog post tutorial. I followed it very closely; I'm mostly just adding more words around it.
What even is Latent Dirichlet Allocation?
I'm going to try to explain what's going on, but (a) I'm going super high level and aiming for the big idea, and (b) I'm mostly basing this on the Wikipedia entry, so you may just want to read that.
Our Model is: every document is a mixture of multiple topics. (The sum of the weights of the topics = 1.) Within each topic, each word has a certain probability of appearing next. Some words are going to appear in all topics ("the"), but we think of a topic as being defined by which words are most likely to appear in it. We pick the number of topics.
We only see which words appear in each document - the topics, the probability of each topic, and the probability of each word in a topic are all unknown and we estimate them.
"Latent" because we can't directly see or estimate any of these.
We can describe Our Model as the interaction of a bunch of different probability distributions. We tell the computer the shape of Our Model, and what data we saw, and then have it try lots of things until it finds a good fit that agrees with both of those.
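To make the generative story concrete, here's a minimal sketch in numpy - the toy vocabulary, the two topics, and every number in it are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["refund", "login", "error", "thanks", "password"]  # toy vocabulary
n_topics = 2

# each topic is a probability distribution over the vocabulary (each row sums to 1)
topic_word_probs = np.array([
    [0.50, 0.05, 0.05, 0.35, 0.05],   # a "billing-ish" topic
    [0.05, 0.35, 0.25, 0.05, 0.30],   # an "account-ish" topic
])

# a document's topic mixture is one draw from a Dirichlet (the weights sum to 1)
topic_mixture = rng.dirichlet([1.0, 1.0])

# to generate each word: pick a topic from the mixture, then a word from that topic
doc = []
for _ in range(8):
    topic = rng.choice(n_topics, p=topic_mixture)
    doc.append(rng.choice(vocab, p=topic_word_probs[topic]))

print(topic_mixture, doc)

The fitting step runs this story in reverse: given only the documents, find topic-word probabilities and per-document mixtures that make the words we actually saw likely.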
The Beta distribution is what you assume you have when you know that something has to be between 0 and 1, but you don't know much else about it. The Beta is super flexible.
Turns out, the Dirichlet distribution is the multi-dimensional version of this, so it's a logical fit for both the distribution of words in a topic and the distribution of topics in a document.
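A quick way to see the connection (a tiny sketch; the alpha values here are arbitrary): a 2-dimensional Dirichlet draw is a pair of numbers that sums to 1, and its first entry behaves exactly like a Beta draw.

import numpy as np

rng = np.random.default_rng(1)
print(rng.dirichlet([2, 5]))        # two numbers summing to 1; the first is Beta(2, 5) distributed
print(rng.dirichlet([1, 1, 1, 1]))  # four "topic weights" that sum to 1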
This is a pretty common/well-understood/standard model, so all the hard parts of describing the shape and telling the computer to look for fits are already done in sklearn for Python (and in many other languages; I've definitely done this in R before).
Getting the Computer to Allocate some Latent Dirichlets
High Level:
1. we turn each document into a vector of words
2. we drop super common words (since we know "the" won't tell us anything, just drop it)
3. we transform it to use term frequency-inverse document frequency as the vector weights
4. we choose a number of topics
5. we hand that info over to sklearn to make estimates
6. we get back: a matrix with (number of topics) rows by (number of words) columns; entry i, j is the probability that word j comes up in topic i
7. which we can use to: describe the topics and make sure they make sense to us, and see how each document breaks down as a mixture of topics
Let's step through doing all this in Python.
import pandas as pd
import os, os.path, codecs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import numpy as np
ENGLISH_STOP_WORDS is a known list of super common English words that are probably useless.
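If you want to peek at it (it's a frozenset of a few hundred words):

print(len(ENGLISH_STOP_WORDS))
print("the" in ENGLISH_STOP_WORDS)   # True
print(sorted(ENGLISH_STOP_WORDS)[:10])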
I have a "data" dataframe that has an "id" column and a "clean" column with my text in it.
With the help of these libraries, we do steps 1-3:
tfidf = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, lowercase=True, strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)
A = tfidf.fit_transform(data['clean'])
A has a row for each conversation I'm looking at, and a column for each word. Entry i, j is (roughly) the number of times word j appears in conversation i, down-weighted by how many conversations word j shows up in - that's the "inverse document frequency" part - and then each row gets normalized.
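A couple of quick sanity checks on A (get_feature_names_out is the spelling in recent sklearn versions; older ones call it get_feature_names):

print(A.shape)                             # (number of conversations, number of words kept)
print(tfidf.get_feature_names_out()[:10])  # the first few kept words, in column order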
model = decomposition.NMF(init="nndsvd", n_components=9, max_iter=200)
W = model.fit_transform(A)
H = model.components_
fit_transform is where we actually tell it to use the data in A to choose good parameters for our model.
H is that topics-by-words matrix from step 6. We can look at the largest values of any row in H to see which words are most important to the topic represented by that row. (W is the companion conversations-by-topics matrix; we'll poke at it at the end.)
In Python:
num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[tfidf.vocabulary_[term]] = term
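(In recent sklearn versions you can skip that loop and ask the vectorizer for the same column-ordered list directly:

terms = tfidf.get_feature_names_out()

either way, terms[j] is the word for column j.)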
Then we look at what appears in H:
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index, :])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic %d: %s" % (topic_index, ", ".join(term_ranking)))