Text Analysis

The Basics

This topic can get difficult quickly, and the intent of this tutorial is to take what seem like scary programs and make them approachable. Hopefully by the end you will start to see the possibilities, so that you can take these steps and apply them to your own research. We will be using Voyant, a web-based tool, to produce the output for this tutorial: word frequencies, word distribution graphs, word clouds, and keywords in context. Once this output is discussed, we will move on to topic modeling, which is the more difficult part of text analysis, but which Voyant can also handle. Topic modeling groups words that consistently appear together into topics and gives the distribution of those topics across the entire corpus of text.

Topic modeling can be hard to interpret, and since it is a probabilistic model, the results will not be the same each time the same body of text is run through it. This tutorial will take you through the basics, but its main objective is to look at the math and the behind-the-scenes workings so that you better understand topic modeling overall. Once these topics are discussed and shown in Voyant, we will discuss some more complicated tools.

Voyant Tools

Voyant Tools is a web-based tool that just needs you to paste your text or upload a directory of files. Voyant likes to break text up into sections, mainly for the distribution graph, so if you want to control the intervals, you may want to put the chapters (or books) in different text files. Since there is nothing to download, we will go straight to the output.

Voyant has a simple interface that can get less simple if you are not sure what you are doing. As a brief description of the tool, Voyant has five different boxes on its page. Each box can be changed to a different visualization, and each box allows you to export that visualization in multiple formats. Each visualization is interactive inside Voyant, but once it is exported, it loses that capability. The examples on this page are basic PNG (picture) exports, but Voyant offers web-based exports too. These examples use the text of Moby Dick. Each section of Voyant has an export button, and the button with the four blue squares changes the visualization in that section.

Distribution graph of the 5 most used words in Moby Dick

This is the distribution graph of five different words in Moby Dick, the text that will be used for examples throughout this tutorial. To read this graph, look at each segment: the line shows the frequency of each word within that segment. This is where it may become important to control where a segment begins and ends. If you do not care about the segments themselves, Voyant will decide them for you. Although the key showing which color is which word doesn’t export with the graph, Voyant will tell you which word corresponds to each line. Mapping the word frequencies across the segments may open up some interesting findings.
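To make the idea of per-segment frequencies concrete, here is a minimal Python sketch. It is not how Voyant computes its graph; the function name `segment_frequencies`, the equal-sized segments, and the simple tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

def segment_frequencies(text, words, num_segments=10):
    """Count each word of interest inside roughly equal-sized segments of the text."""
    # Very simple tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Ceiling division so that every token lands in some segment.
    size = max(1, -(-len(tokens) // num_segments))
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    table = []
    for segment in segments:
        counts = Counter(segment)
        table.append({word: counts[word] for word in words})
    return table

sample = ("The whale swam. Ahab watched the sea. "
          "The whale dived and the sea closed over the whale.")
table = segment_frequencies(sample, ["whale", "sea"], num_segments=2)
for number, row in enumerate(table, start=1):
    print(number, row)
```

Each row is one segment, so plotting the counts for a word across the rows gives a line like the ones in the distribution graph. Splitting your text into one file per chapter is the equivalent of choosing the segment boundaries yourself.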

Word clouds are used in many places besides text analysis. They can offer a nice visualization for your research, but beyond that they do not add much quantitative information. Yes, we can see which words are the most used, but that is the only use for word clouds. Some websites allow more control over customizing word clouds, so if you want the words in a particular shape, color scheme, or design, search for a word cloud generator. Voyant only allows you to change the number of words that go into the word cloud.

Keywords in context shows you each time a word appears in the text, with five words preceding and following it. The export of this would be a table of each time the word is in the text. Voyant allows you to change the number of words that occur before and after a keyword; five is the default. It can be good to analyze patterns you see that come before and after certain words of interest. This is a gateway into looking at topics in the text.
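A keywords in context table is simple enough to sketch by hand. The Python below is an illustration only, not Voyant's code; the function name `keyword_in_context`, the `window` parameter, and the tokenizer are assumptions.

```python
import re

def keyword_in_context(text, keyword, window=5):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    # Keep the original capitalization for display, but match case-insensitively.
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

sample = ("Call me Ishmael. Some years ago, never mind how long precisely, "
          "I thought I would sail about a little and see the watery part of the world.")
rows = keyword_in_context(sample, "sail", window=3)
for left, term, right in rows:
    print(f"{left} | {term} | {right}")
```

Changing `window` plays the same role as changing the number of context words in Voyant.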

The links of topics in Moby Dick

The links graph shows the most frequent words and the words they are most likely to appear near. It shows much of what the keywords in context table does, just in a visual way.
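The idea of "words that show up near a word" can be sketched as a count of neighbors inside a small window. This Python is an illustration under assumptions (the function name `collocates`, the window size, and the tokenizer are made up here), not the method Voyant uses to build its links.

```python
import re
from collections import Counter

def collocates(text, keyword, window=5, top=5):
    """Count the words that appear within `window` tokens of each occurrence of keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            # Neighbors on both sides, excluding the keyword occurrence itself.
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts.most_common(top)

sample = "Ahab chased the white whale. The white whale fled. Ahab cursed the whale."
near_whale = collocates(sample, "whale", window=1)
print(near_whale)
```

In practice you would remove very common words such as "the" first, which is why stopword lists matter for this kind of view.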

This is the bubble lines graph of the words ahab, head, and sea in Moby Dick

The bubble lines graph shows where in the text important words are being used, much like the trends, or word distribution, graph. Which one you choose will ultimately depend on which visual you prefer and which is easier to interpret for your data. Simply choose the words to place on the bubble lines from the dropdown box. It will start with the most common words in the text, but you can find others by beginning to type them, and you can remove any of the currently displayed words by clicking where they appear at the top of the bubble lines graph section. As a word of caution, only one or two words should be included on each line, because the more words you add, the harder it is to see everything. If you want to compare multiple words, Voyant has an option to place them all on different lines.

Topic Modeling

LDA (Latent Dirichlet Allocation) is the version of topic modeling used by the tools mentioned later. LDA is a probabilistic model that uses the occurrence of words and their frequency within a certain range to form topics. LDA is built on conditional probability: the chance that X will happen given that Y has already happened. The model starts from a random point: when it runs through each file it randomly picks words to start building the topics, so each time it could begin with different words. Because of this, each time the same bodies of text are run through these programs, the output is different. Once it finds certain words, it starts to look at surrounding words. The output is a table of the words contained in each topic and a corresponding probability. Another way to think about the probability is as a proportion: this topic is x% of the entire body of text.
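To see why a random start leads to different output, here is a toy LDA written in plain Python using collapsed Gibbs sampling. It is a teaching sketch under assumptions, not the implementation inside Mallet or Voyant: the function name `toy_lda`, the parameter values for `alpha` and `beta`, and the tiny made-up documents are all chosen for illustration.

```python
import random

def toy_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=None):
    """Toy collapsed Gibbs sampler for LDA over a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]            # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                          # total words assigned to topic k
    assignments = []
    # Random start: every word gets a randomly chosen topic.
    for d, doc in enumerate(docs):
        chosen = []
        for word in doc:
            k = rng.randrange(num_topics)
            chosen.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(chosen)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the word out of its current topic.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta) / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    # Share of each document that belongs to each topic (each row sums to 1).
    proportions = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # The three most frequent words in each topic.
    top_words = [sorted(vocab, key=lambda w: -topic_word[k][w])[:3] for k in range(num_topics)]
    return proportions, top_words

docs = [
    "whale sea ship whale harpoon sea".split(),
    "ship sea whale harpoon whale".split(),
    "garden flower rain garden seed".split(),
    "flower seed rain garden flower".split(),
]
for seed in (1, 2):
    proportions, top_words = toy_lda(docs, seed=seed)
    print(seed, top_words, [round(row[0], 2) for row in proportions])
```

Running with different seeds can change which topic is numbered first and the exact proportions, which is the behavior described above. Fixing the seed makes a run repeatable.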

Prerequisites for Topic Modeling

You will need to be familiar with the command line on your computer. It is slightly different for Mac users versus PC users. Since Mallet relies heavily on the command line, before you start you will need to know how to change in and out of directories, how particular commands work, and how to move things around. Yes, some of these things can be done by dragging a file where it needs to go, but it’s good to know how to do them from the command prompt.

Here are some quick videos on how to work the command line for each kind of user: PC and Mac.

Mallet

Mallet is a topic modeling tool that uses the command prompt (Terminal for Mac users) to run LDA models on your bodies of text. The creators of Mallet have written a thorough tutorial on how to use it. It covers downloading, commands, and data formats.

Tutorial, Download Mallet, and lastly Download Java Development Kit.

Here are some examples of topic modeling and how the output was used: Martha Ballard’s Diary and The Machinery of Suspense.

Text Analysis in R

If you need to download R and understand some of its syntax, I highly recommend Joey Stanley’s tutorial on downloading R and its basics. The DigiLab has attached an R script that will help you compute word frequencies and a word cloud that corresponds with those frequencies. See the attachment below.

Basics of R Tutorial

The R script has descriptions of each line and what it is doing. Just follow the order of the script. For the script, the data must be in a particular format. A template is linked below. The id column is just the number of the document. The metadata column is any information that you collect for each document (e.g. year, or anything that you could later sort topics by), and it can be more than one column. Lastly, the text column is the full body of text that corresponds with the metadata. Topic modeling will be able to handle each document and its metadata.
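To show what that id, metadata, and text layout looks like, here is a small Python sketch that reads such a table and counts word frequencies. It is not the DigiLab R script and does not reproduce the linked template: the column name "year", the two sample rows, and the function name `word_frequencies` are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical rows in the id / metadata / text layout; "year" stands in for a metadata column.
template = io.StringIO(
    "id,year,text\n"
    '1,1851,"Call me Ishmael. Some years ago I went to sea."\n'
    '2,1851,"The sea was calm and the ship sailed on."\n'
)

def word_frequencies(rows, text_column="text"):
    """Count words across the text column of every row."""
    counts = Counter()
    for row in rows:
        counts.update(re.findall(r"[a-z']+", row[text_column].lower()))
    return counts

# DictReader keys each row by the header line, so columns can be picked by name.
rows = list(csv.DictReader(template))
frequencies = word_frequencies(rows)
print(frequencies.most_common(3))
```

With a real file you would open it with `open("yourfile.csv", newline="")` in place of the in-memory sample, where the file name is a placeholder. Keeping the metadata in its own columns is what lets you sort or group results by it later.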