Text Analysis

The Basics

Text analysis can get difficult quickly. The intent of this tutorial is to take programs that seem intimidating and make them approachable, so that by the end you can see how to apply these steps to your own research. We will use a web-based tool, Voyant, to produce the output for the first part of this tutorial: word frequencies, word distribution graphs, word clouds, and keywords in context. Once this output is discussed, we will move on to topic modeling, which is the more difficult part of text analysis, but which Voyant can also do. Topic modeling groups words that consistently appear together into topics and gives the distribution of those topics across the entire corpus of text.

Topic modeling can be hard to interpret, and since it is a probabilistic model, the results will not be the same each time the same body of text is run through it. This tutorial will take you through the basics, but its main objective is to dive into the math and the behind-the-scenes workings to better understand topic modeling as a whole. Once these topics have been discussed and shown in Voyant, we will give instructions for some more complicated tools.

Voyant Tools

Voyant Tools is a web-based tool: all it needs is for you to paste in a body of text or upload a directory of files. Voyant breaks text up into segments, mainly for the distribution graph, so if you want to control the interval, you may want to put the chapters (or books) in separate text files. Since there is nothing to download, we will go straight to the output. Voyant has a simple interface, but it can get less simple if you are not sure what you are doing, so here is a brief description. Voyant has five boxes on its page. Each box can be changed to a different visualization, and each box lets you export that visualization in multiple formats. Each visualization is interactive inside Voyant, but once it is exported, it loses that capability. The examples on this page are basic PNG (picture) exports, but Voyant offers web-based exports too. These examples use the text of Moby Dick. Each section of Voyant has an export button, and the button with the four blue squares changes the visualization in that section.

Distribution graph of the 5 most used words in Moby Dick

This is the distribution graph of five words in Moby Dick, the text used for examples throughout this tutorial. To read the graph, look at each segment: the line shows the frequency of a word within that segment. This is where it may become important to control where a segment begins and ends. If you do not care about the segments themselves, Voyant will decide them for you. Although the key showing which color is which word does not export with the graph, Voyant will tell you which word corresponds to each line. Mapping word frequencies across the segments in this way can open up some interesting findings.
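To make the idea of per-segment frequencies concrete, here is a minimal Python sketch of the kind of calculation behind a distribution graph. It assumes the text is cut into segments of roughly equal length; the function names, the simple tokenizer, and the sample sentence are all made up for illustration and are not how Voyant itself is implemented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def segment_frequencies(text, words, n_segments=10):
    """Return {word: [relative frequency of the word in each segment]}.

    The tokens are cut into n_segments pieces of roughly equal size.
    """
    tokens = tokenize(text)
    n = len(tokens)
    # Segment boundaries, spread evenly over the token list.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    result = {word: [] for word in words}
    for i in range(n_segments):
        segment = tokens[bounds[i]:bounds[i + 1]]
        counts = Counter(segment)
        for word in words:
            # Guard against empty segments in very short texts.
            freq = counts[word] / len(segment) if segment else 0.0
            result[word].append(freq)
    return result


# Made-up sample sentence, not a quotation.
sample = "the whale swam and the ship sailed the sea while the whale dove"
freqs = segment_frequencies(sample, ["whale", "the"], n_segments=2)
print(freqs)
```

Each list holds one value per segment, which is what a line on the graph traces from left to right. If you split your text into chapter files yourself, each file would play the role of one segment.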

Word clouds are used in many places besides text analysis. They can add a nice visualization to your research, but they do not add much quantitative information: we can see which words are used the most, and that is about all a word cloud tells us. Some websites give you more control over customizing word clouds, so if you want the words in a particular shape, color scheme, or design, search for a word cloud generator. Voyant only lets you change the number of words that go into the word cloud.

Keywords in context shows you, for each time a word is found in the text, the 5 words before it and the 5 words after it. The export is a table with one row for each occurrence of the word. Voyant lets you change the number of words before and after; 5 is the default. This is useful for analyzing patterns in what comes before and after certain words of interest, and it is a gateway into looking at topics in the text.
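The keywords in context table can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a simple word tokenizer; the function name and the window parameter are hypothetical and do not come from Voyant.

```python
import re


def keyword_in_context(text, keyword, window=5):
    """Return one (left context, keyword, right context) row per occurrence."""
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


sample = ("Call me Ishmael. Some years ago, never mind how long precisely, "
          "I thought I would sail about a little.")
rows = keyword_in_context(sample, "sail", window=3)
for left, word, right in rows:
    print(f"{left} | {word} | {right}")
```

Changing window changes how many words are kept on each side, which mirrors the setting described above.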

The links of topics in Moby Dick

The links graph shows the most frequent words and the words they are most likely to appear near. It shows much of what the keywords in context table does, just in a visual way.
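One way to picture what a links graph is built from is to count which words appear within a few words of a word of interest. The Python sketch below does that under stated assumptions: the window size, the function name, and the sample sentence are invented for illustration, and it does not remove common words such as "the".

```python
import re
from collections import Counter


def collocates(text, keyword, window=5, top_n=5):
    """Count the words found within `window` words of each keyword occurrence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts.most_common(top_n)


# Made-up sample sentence, not a quotation.
sample = "ahab saw the whale and the whale saw ahab before the whale dove"
nearby = collocates(sample, "whale", window=1, top_n=1)
print(nearby)
```

In a links graph, each of these pairs would be drawn as a line between the word of interest and a nearby word.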

This is the bubble line of the words ahab, head, and sea in Moby Dick

The bubble lines graph shows where in the text the words are being used, similar to the trends (word distribution) graph. Which one you use will ultimately depend on your visual preference and which you find easier to interpret. Keep to one or two words per line, because the more words you add, the harder it is to see everything. The dropdown box lets you pick the words for the bubble lines. The currently displayed words are highlighted at the top of that section, which is also where you can choose to remove a word from the bubble line.

Topic Modeling

LDA (latent Dirichlet allocation) is the version of topic modeling used by the tools mentioned later. LDA is a probabilistic model that uses the occurrence of words and their frequency within each document to form topics. At its core, LDA relies on conditional probability: what is the chance of x happening given that y has already happened? Fitting the model also involves randomness. It starts by randomly assigning words to topics, so each run can begin from a different starting point, and it then repeatedly reassigns words based on which words tend to occur together. Because of this, each time the same body of text is run through these programs, the output can differ. The output is a table of the words contained in each topic along with a corresponding probability. Another way to think about a topic's weight is as a proportion: this topic is x% of the entire body of text.
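To show where the randomness and the conditional probability come in, here is a small Python sketch of LDA fitted with collapsed Gibbs sampling, which is one common way to fit the model. It is a teaching sketch, not the implementation used by any particular tool: the function name, the alpha and beta settings, the number of iterations, and the four tiny documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=None):
    """Tiny collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns one {word: probability}
    dictionary per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]   # tokens in doc d given topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                 # tokens given topic k overall
    assignments = []

    # Random start: every token is given a topic at random.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of its current topic.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability (up to a constant) of each topic
                # for this token, given every other assignment.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(n_topics)
    ]


# Made-up miniature corpus for illustration.
docs = [
    "whale sea ship whale harpoon".split(),
    "sea ship captain whale".split(),
    "church sermon pulpit chapel".split(),
    "sermon chapel church prayer".split(),
]
topics_a = lda_gibbs(docs, n_topics=2, seed=1)
topics_b = lda_gibbs(docs, n_topics=2, seed=1)
for topic in topics_a:
    print(sorted(topic, key=topic.get, reverse=True)[:3])
```

With the same seed the two runs give identical topics. Leaving seed as None lets each run start from a different random assignment, which is why the output can change from run to run.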

Prerequisites for Topic Modeling

You will need to be familiar with the command line on your computer. It is slightly different for Mac users versus PC users. The Stanford tool really only uses it once, but MALLET is all command line. The things to learn are how to change into and out of directories, how particular commands work, and how to move files around. Some of these things can be done by dragging a file where it needs to go, but it’s good to know how to do them from the command prompt.

Here are quick videos on how to use the command line on each system: PC and Mac

MALLET

MALLET is a topic modeling tool that uses the command prompt (Terminal for Mac users) to run LDA models on your bodies of text. The creators of MALLET have written a thorough tutorial on how to use it, covering download, commands, and data formats.

Tutorial, Download MALLET, and lastly Download Java Development Kit.

Here are some examples of topic modeling and how the output was used: Martha Ballard’s Diary and The Machinery of Suspense.

Text Analysis in R

If you need to download R and understand some of its syntax, I highly recommend Joey Stanley’s tutorial on downloading R and its basics. The DigiLab has attached an R script that will help you compute word frequencies and a word cloud that corresponds to those frequencies. See the attachment below.

Basics of R Tutorial

The R script has a description of each line and what it is doing; just follow the order of the script. For the script, the data must be in a particular format. A template is linked below. The id column is just the number of the document. The metadata column holds any information you collect for each document (e.g. the year, or anything else you could later sort topics by), and there can be more than one metadata column. Lastly, the text column is the full body of text that corresponds with the metadata. Topic modeling will be able to handle each document and its metadata.
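To make the layout concrete, here is a small sketch of data in the id, metadata, and text arrangement. It is written in Python purely for illustration and is separate from the attached R script. The CSV format, the "year" column name, and the placeholder rows are assumptions, so check the linked template for the exact column names it expects.

```python
import csv
import io

# Hypothetical rows in the id / metadata / text layout; "year" stands in
# for whatever metadata column or columns you collect.
raw = """id,year,text
1,1851,first document text goes here
2,1852,second document text goes here
"""

documents = list(csv.DictReader(io.StringIO(raw)))

# Group document ids by a metadata column, for example to compare years later.
ids_by_year = {}
for row in documents:
    ids_by_year.setdefault(row["year"], []).append(row["id"])

print(ids_by_year)
```

Each row is one document, so adding another metadata column simply means adding another header and one more value per row.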