Posted on Oct 24, 2016
— Joey Stanley, PhD student in Linguistics and DH Grad Assistant
I recently presented on how to build a digital humanities project from scratch, going from primary sources to visualizations of data. I also showcased a piece of software called JMP (“jump”) that I like to use for quick-and-dirty visualizations. Before we get to the fancy software though, we have to step back and consider where our information is coming from.
Primary sources vary considerably. In my own research I’ve drawn from audio recordings, census records, digitized books, ethnographic observations, and online material, but you might also use newspapers, images, maps, or archeological records in your own research. It all depends on your field, your project, and the questions you would like to get answered. Step one is to find the primary sources that are best for this project.
Since we’re talking about digital humanities, it’s important to figure out how to best use computers to draw information from these sources. Simply digitizing your sources is not enough: scans offer no benefit over the original records other than portability. You might want to type everything up, which does allow records to be easily searched (using control+F). But even once your sources are scanned or typed up, there’s still a wall between them and the answer to your research question.
The key part is that you need to find the structure in the source, and to do this you have to think of it as a potential spreadsheet. These potentially massive tables are organized into rows and columns, with one “observation” for each row, and one “property” per column. This allows you to look at the information for each observation by scanning the cells in each row, while also letting you look at the various properties of all the observations by looking down each column.
For example, in Scott Nesbit’s historical preservation course in spring of this year here at UGA, they looked at runaway slave ads from the early 19th century in Georgia. Interested in the people mentioned in the ads, they decided to set up their spreadsheet with one person per row with the information about each person (name, description, escape location, ad location, etc.) in columns. In my own research on vowel sounds in the Pacific Northwest, I might set up my spreadsheet with each vowel sound in a recording as its own row, with various acoustical measurements and information about neighboring sounds and words as columns.
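To make the one-observation-per-row idea concrete, here is a minimal sketch in Python of what a vowel spreadsheet along those lines might look like. The column names and every value are made up for illustration; they are not real measurements or the layout of any actual project.

```python
import csv
import io

# Hypothetical table: one vowel token per row, one property per column.
# All values below are invented for illustration.
rows = [
    {"word": "bag", "vowel": "ae", "f1_hz": 690, "f2_hz": 1750, "next_sound": "g"},
    {"word": "bat", "vowel": "ae", "f1_hz": 720, "f2_hz": 1680, "next_sound": "t"},
    {"word": "beet", "vowel": "iy", "f1_hz": 300, "f2_hz": 2300, "next_sound": "t"},
]

# Reading across a row gives everything known about one observation.
first_observation = rows[0]

# Reading down a column gives one property for every observation.
f1_column = [row["f1_hz"] for row in rows]

# The same table as CSV text, which any spreadsheet program can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Whatever tool you end up using, the shape is the same: the header row names the properties, and every later row is one observation.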
While you’re setting up your spreadsheet, you have to keep in mind the different data types, because the way we think about the price of a house is different than how we think about its color. The two main data types are quantitative variables and categorical variables. A house worth $300,000 is twice the value of one worth $150,000, but a house painted white is not inherently better than one painted blue. Here, the value of the home is a quantitative variable because it’s numeric, and the difference in values is meaningful. The color is a categorical variable because the values are just labels: they have no inherent order, and there’s no meaningful distance between them.
Sometimes with categorical data, you’ll have way too many different observed values. If you surveyed a thousand houses, for example, you might find white, eggshell, baby powder, white smoke, cream, ivory, and ghost white. An architect or designer might be interested in these nuanced differences, but if you want to see if white houses sell faster, it might make sense to collapse all these down to just “white.” The key here is to have a good reason for doing so, something that makes sense in your project. Convenience is not a good enough reason, and neither is forcing the statistical model to work. Whatever choices you make to smooth out the distinctions in your data must be documented and reported in your final write-up.
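One way to keep that kind of collapsing honest is to write the mapping down explicitly, so the same list can go straight into your write-up. Below is a minimal sketch in Python; the survey values and the names `SHADES_OF_WHITE` and `collapse_color` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical survey results, recorded as specifically as possible.
recorded_colors = [
    "white", "eggshell", "ivory", "blue",
    "cream", "ghost white", "blue", "baby powder",
]

# The documented decision: every shade listed here counts as "white".
SHADES_OF_WHITE = {
    "white", "eggshell", "baby powder", "white smoke",
    "cream", "ivory", "ghost white",
}


def collapse_color(color):
    """Return the broader category for a recorded color."""
    return "white" if color in SHADES_OF_WHITE else color


# Build a new column instead of overwriting the original one.
collapsed_colors = [collapse_color(color) for color in recorded_colors]

counts_before = Counter(recorded_colors)
counts_after = Counter(collapsed_colors)
```

Because the collapsed values live in a new list, the original detail is still there if a later question turns out to need it.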
Once a spreadsheet has been set up and filled with all the information you’d like to study, it’s time to start visualizing and analyzing the data. This is why data types are important: the kinds of things you can do with your data depend on the data types you have in your spreadsheet. For example, if you want to do a chi-squared test, you’re going to need some categorical variables, but if you want to do a regression analysis, you’ll need quantitative variables. The same goes for visualizations: a scatterplot or box-and-whisker plot needs quantitative variables, while a bar plot can be used with categorical variables. I’ve helped students in the past who wanted to do some sexy visualization or fancy statistical method they saw in a paper one time, but once I looked at their spreadsheet I had to break it to them that they just didn’t have the right kind of data to do it.
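To see why a chi-squared test wants categorical variables, it helps to look at what it actually works with: counts of how often each combination of categories occurs. The sketch below computes the chi-squared statistic by hand in Python for two categorical columns. The house data is invented for illustration, and the sketch stops at the statistic itself; turning it into a p-value requires the chi-squared distribution, which a statistics package would supply.

```python
from collections import Counter

# Hypothetical data: (color, selling speed) for 100 houses. Both are categorical.
houses = (
    [("white", "fast")] * 30
    + [("white", "slow")] * 20
    + [("blue", "fast")] * 10
    + [("blue", "slow")] * 40
)

# Observed counts for each combination, plus the totals for each variable.
observed = Counter(houses)
color_totals = Counter(color for color, _ in houses)
speed_totals = Counter(speed for _, speed in houses)
n = len(houses)

# Compare each observed count with the count expected if color and
# selling speed were unrelated, and add up the scaled squared differences.
chi_squared = 0.0
for color in color_totals:
    for speed in speed_totals:
        expected = color_totals[color] * speed_totals[speed] / n
        chi_squared += (observed[(color, speed)] - expected) ** 2 / expected
```

Nothing in this calculation uses the size of a number attached to a house, only which category it falls into. A column of sale prices would have to be turned into categories before it could go into a table like this.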
Something to keep in mind: it’s always easier to go from more specific data to more general. Just like with the shades of white houses, you can usually collapse a categorical data type into fewer groups. You can even turn numbers into categories, such as height measurements going from feet and inches to simply “tall”, “average”, and “short.” However, you can’t go the other way, from more general to more specific, without going through your primary data again. If all you wrote down was that the houses were “white,” and later you found out that cream-colored houses are important, you’re going to have to go past the houses again and re-record their color. So in my opinion, always be as specific as possible when collecting your data, because it’s a heck of a lot easier to generalize your data than it is to go back and regather new information.
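Here is a minimal sketch in Python of turning numbers into categories, using the height example. The measurements and the cutoffs (`short_below`, `tall_from`) are arbitrary assumptions made for illustration; in a real project they would need a justification of their own.

```python
# Hypothetical heights recorded as (feet, inches).
heights = [(5, 1), (5, 4), (5, 7), (5, 10), (6, 2)]

# Convert to a single number so the values can be compared.
heights_in_inches = [feet * 12 + inches for feet, inches in heights]


def height_category(inches, short_below=64, tall_from=70):
    """Bin a height in inches into a label; the cutoffs are assumptions."""
    if inches < short_below:
        return "short"
    if inches >= tall_from:
        return "tall"
    return "average"


categories = [height_category(inches) for inches in heights_in_inches]
```

The conversion only runs one way. Both 5'10" and 6'2" come out as “tall,” so if the labels were all you had written down, there would be no way to get the original measurements back.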
The critical part of this is that in the end, after you’ve collected your data and you’ve got a spreadsheet set up and you’ve explored what’s going on using visualizations, you’ve got to be the human and the researcher and explain what’s going on. Computers are great and they can help us out a lot, but they ultimately can’t do the most important step: interpreting and explaining your data. That’s where you need to put on your thinking cap and apply all this knowledge you took out student loans to get.
Using computers in humanities research is fantastic. They can save you an incredible amount of time, and they allow you to do things you never thought possible, opening your eyes to new research questions. Great care has to be put into a digital humanities project every step of the way, but the payout is well worth the effort.
View the slides from the presentation.