Text Analysis Using Software
Text Analysis in R
- Tidytext package: Provides functionality for preprocessing, sentiment analysis, and topic modeling. Integrates with other Tidyverse packages for streamlined data analysis.
- Quanteda package: Provides functionality for preprocessing, Wordscores, and latent semantic scaling.
Text Analysis in Python
- NLTK library: Refers to the ‘Natural Language Toolkit.’ Provides functionality for preprocessing, word frequency, and chunking/chinking.
- spaCy library: Provides functionality for preprocessing and dependency parsing.
- Gensim: Provides functionality for topic modeling and implements Word2Vec/Doc2Vec models.
- Polyglot library: A multilingual library that supports over 130 languages. Provides functionality for tokenization, language detection, word embeddings, and sentiment analysis.
- Hugging Face Transformers: An open-source library that provides access to thousands of pre-trained models and datasets for advanced natural language processing (NLP) tasks like text classification, translation, and text generation/summarization.
- Hugging Face also offers a free Large Language Models (LLM) Course.
For additional software, platforms, and programming tools see: Text Analysis: A Guide to Text Mining Tools and Methods
Text mining (or text-as-data) uses computational, mathematical, or statistical methods to draw insights from a body of texts (referred to as a corpus). It is generally used when the body of texts is too large for manual analysis. In the digital humanities, it is often followed by manual analysis and interpretation of the results (a ‘human in the loop’ approach).
A corpus refers to the collection of written texts that a researcher uses to train algorithms and conduct computational or statistical analyses.
- Text corpora are generally large, unstructured, and electronically stored/processed.
- Text data can be sourced from library databases, open sources, interview/survey responses, social media, and web scraping.
- Note: Texts must be machine-readable in order to undergo text mining. You can use ABBYY FineReader (available in The Scholars’ Commons) to perform OCR (Optical Character Recognition) on images of text, including scanned pages, to output text that is machine readable.
- Tesseract is free, open-source software for performing OCR on documents. It requires use of the command line.
The command line uses manually-typed commands to interact with the computer’s operating system directly—rather than relying on a graphical user interface. MALLET and Tesseract require use of the command line.
Preprocessing prepares your corpus for analysis by reducing noise within the text data. R packages and Python libraries provide functionality for preprocessing. If you are using standalone software such as Voyant, you can preprocess your corpus using OpenRefine.
Preprocessing can include the following steps:
- Text cleaning converts all text to lowercase and removes URLs, special characters, punctuation marks, and numbers.
- Tokenization converts your raw text into a series of ‘tokens,’ which are generally individual words. For example, the sentence, “I would like a coffee,” would be tokenized into [“I”, “would”, “like”, “a”, “coffee”].
- Researchers may also be interested in grouping—or chunking—words into meaningful phrases. For example, “the quick brown fox” could be chunked as a noun phrase. Chinking removes unwanted elements, such as specific parts of speech. For more information, see Chunking and Chinking in NLP.
- Stop word removal deletes words like ‘the,’ ‘a,’ and ‘is,’ which carry little semantic value and can slow down your analysis. Creating a list of stopwords is often an important part of curating your data.
- For lists of stop words in multiple languages, see: stopwords-iso.
- Finally, researchers might choose to stem or lemmatize their text. Stemming reduces words to their root form. Lemmatization converts words to their dictionary form (or ‘lemma’). Lemmatization is more accurate, but it is also more computationally intensive.
- Note: If you are grouping your text data into meaningful phrases (or ‘chunks’), then you should not stem or lemmatize.
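The preprocessing steps above can be sketched in plain Python (in practice you would use NLTK or spaCy; the stop word list and the crude suffix-stripping ‘stemmer’ below are toy stand-ins):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "i", "would", "like"}  # toy list

def preprocess(text):
    # Text cleaning: lowercase, then strip URLs, punctuation, and numbers
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenization: split the cleaned text into word tokens
    tokens = text.split()
    # Stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a few common suffixes (a stand-in for a real stemmer)
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("I would like a coffee!"))  # → ['coffee']
```

Note how aggressive the pipeline is: every token in the example sentence except ‘coffee’ is a stop word, which illustrates why curating your stop word list matters.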
Term frequency examines the importance of specific terms based on how frequently they appear within your text corpus. See The President’s Words for an example of term frequency analysis.
- Researchers can compute either absolute or relative frequency. While absolute frequency is the raw count of a term, relative frequency is that count expressed as a proportion of the total number of terms.
- If your text corpus is composed of documents that vary considerably in length (interview transcripts, political speeches, etc.), then you can compute counts relative to all terms within a document to control for length.
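Both frequency measures can be computed in a few lines of Python (the tokenized document below is a toy example):

```python
from collections import Counter

def term_frequencies(tokens):
    """Return (absolute, relative) frequency for each term in a tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    # Relative frequency divides by document length, controlling for length
    return {term: (n, n / total) for term, n in counts.items()}

doc = ["tea", "coffee", "tea", "water"]
freqs = term_frequencies(doc)
print(freqs["tea"])  # → (2, 0.5)
```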
Sentiment analysis is used to classify and interpret emotional valence within text data. It generally relies on a pre-defined lexicon to assign a polarity score ranging from -1 (highly negative) to 1 (highly positive) to each word within a corpus. 0 indicates neutrality.
- To study sentiment across a text, most tools simply sum the polarity scores of all words and calculate the average. See Introduction to tidytext for a use case tracking sentiment in Jane Austen’s novels.
- The Python library VADER uses a more complex, rule-based metric to identify sentiment in social media data.
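The lexicon-based averaging approach can be sketched as follows (the tiny lexicon here is a made-up stand-in; real analyses use curated lexicons such as AFINN, Bing, or VADER’s):

```python
# Toy polarity lexicon; real analyses use curated lexicons with thousands of entries
LEXICON = {"love": 0.8, "great": 0.6, "bad": -0.6, "terrible": -0.9}

def document_sentiment(tokens):
    """Average the polarity scores of all lexicon words in a document.

    Returns 0.0 (neutral) when no token appears in the lexicon.
    """
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(round(document_sentiment(["i", "love", "this", "great", "film"]), 2))  # → 0.7
```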
NLP uses both supervised and unsupervised learning methods to analyze text corpora. While supervised methods rely on pre-defined labels to perform tasks like sentiment analysis and text classification, unsupervised methods find hidden patterns within unstructured data.
Topic modeling is an algorithmic method that identifies a series of topics—or frequently co-occurring words—across a large set of texts.
- Topic modeling uses unsupervised methods like Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM).
- Use Tidytext and STM for topic modeling in R, and Gensim for topic modeling in Python.
- Note: Voyant Tools also performs LDA using the jsLDA implementation.
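A full LDA run is best left to a library (in Python, Gensim’s LdaModel), but the intuition that topics are sets of frequently co-occurring words can be illustrated by counting which word pairs appear together in the same document (this toy count is not itself a topic model):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how often each word pair appears in the same (tokenized) document.

    A toy illustration of the co-occurrence signal that topic models exploit.
    """
    pairs = Counter()
    for doc in docs:
        # Sort the unique words so each pair has one canonical ordering
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [
    ["coffee", "tea", "cup"],
    ["coffee", "cup", "espresso"],
    ["senate", "vote", "bill"],
]
print(cooccurrence_counts(docs).most_common(1))  # → [(('coffee', 'cup'), 2)]
```

Words that repeatedly co-occur (‘coffee’ and ‘cup’) would tend to land in the same LDA topic, while words from unrelated documents (‘coffee’ and ‘senate’) would not.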
Semantics refers to meaning and how it is represented by and extracted from natural language.
Word embeddings represent each word in a corpus as a vector, which is referred to as an ‘embedding.’ This is a means of reducing dimensionality within text data and identifying semantic similarity.
- The popular Word2Vec algorithm uses a word’s context (i.e. the words that appear before and after it) to create the vector representation. Words like ‘coffee,’ ‘tea,’ and ‘water’ should have similar vector representations—indicating semantic similarity.
- The Doc2Vec algorithm generates a unique vector for an entire document (in addition to its word vectors). This allows for comparisons between documents.
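Semantic similarity between embeddings is typically measured with cosine similarity, sketched here with invented three-dimensional vectors (real Word2Vec embeddings usually have 100 or more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings invented for illustration
embeddings = {
    "coffee": [0.9, 0.8, 0.1],
    "tea":    [0.8, 0.9, 0.2],
    "senate": [0.1, 0.2, 0.9],
}

# Semantically related words should point in similar directions
print(cosine_similarity(embeddings["coffee"], embeddings["tea"]) >
      cosine_similarity(embeddings["coffee"], embeddings["senate"]))  # → True
```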
Latent Semantic Scaling (LSS) is a semi-supervised method for analyzing texts across domains and languages. It is a word embedding-based approach that uses a small set of sentiment seed words to calculate polarity scores for all words in a text based on their semantic proximity to the seed words. From there, researchers can calculate average scores per document.
- Seed words can be selected from secondary literature, meaning that LSS is a flexible (and computationally-efficient) method. See Watanabe (2021).
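The seed-word idea can be sketched with toy two-dimensional embeddings (real LSS learns embeddings from the corpus itself; the vectors and seed words below are invented for illustration):

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented embeddings; a real analysis derives these from the corpus
emb = {
    "good":  [0.9, 0.1],
    "bad":   [0.1, 0.9],
    "solid": [0.8, 0.2],
    "weak":  [0.2, 0.8],
}
pos_seeds, neg_seeds = ["good"], ["bad"]

def lss_polarity(word):
    """Score a word by its semantic proximity to positive vs. negative seeds."""
    pos = sum(cos(emb[word], emb[s]) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cos(emb[word], emb[s]) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

print(lss_polarity("solid") > 0 > lss_polarity("weak"))  # → True
```

Because only the handful of seed words needs to be labeled, the method extends a small amount of supervision to the entire vocabulary.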
Hugging Face Transformers models perform advanced NLP tasks, such as text clustering, classification, and summarization/generation:
- Text classification is a supervised learning method, which uses pre-defined labels/categories to organize texts based on their content. This can be useful for organizing survey responses.
- Text clustering is an unsupervised learning method, which automatically groups texts based on their content.
- Note: This often follows Doc2Vec in a text clustering pipeline. Researchers use Doc2Vec to convert their documents into vectors and then apply a clustering algorithm, such as k-means. From there, they can use keywords to interpret the results.
- Text summarization creates a brief synopsis that preserves the key arguments and themes of a longer document. It relies on two techniques: extractive and abstractive summarization. See Text Summarization in NLP.
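The Doc2Vec-plus-k-means pipeline described above can be sketched with a minimal k-means over toy two-dimensional ‘document vectors’ (real Doc2Vec vectors have far more dimensions; the centroid seeding here is simplified to keep the example deterministic):

```python
import math

def kmeans(vectors, k=2, iters=10):
    """A minimal k-means, deterministically seeding centroids with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(dim) / len(cl) for dim in zip(*cl)]
    # Report the final cluster label for each vector
    return [min(range(k), key=lambda i: math.dist(v, centroids[i])) for v in vectors]

# Hypothetical document vectors standing in for Doc2Vec output
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(kmeans(doc_vectors))  # → [0, 0, 1, 1]
```

The two ‘nearby’ documents land in one cluster and the two others in a second; in a real pipeline, the researcher would then inspect each cluster’s keywords to interpret it.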
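A minimal extractive summarizer can be sketched by scoring each sentence on the frequency of its words across the whole document and keeping the top-scoring sentence (abstractive summarization, which generates new wording, requires a pre-trained model):

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    """Return the n sentences whose words are most frequent across the document.

    A common frequency-based heuristic for extractive summarization.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    # Score each sentence by the document-wide frequency of its words
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
    )
    return " ".join(scored[:n])

text = ("The corpus is large. Text mining draws insights from the corpus. "
        "Researchers interpret the results.")
print(extractive_summary(text))  # → Text mining draws insights from the corpus.
```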