What is Network Analysis
Networks are composed of two elements: nodes and edges. Nodes represent the objects of interest, while edges represent the connections between them. Nodes can belong to the same category or to different categories within a network graph. A network can show relationships between people, for example, or it can connect pieces of literature to their authors, to their influences, and/or to one another through their content. Accordingly, the edge connecting two nodes represents how that specific connection is made. Networks can be thought of as a map of relationships. The purpose is to study the path, a collection of edges, that connects two distant nodes, and to discover other paths that were not so obvious before. Caleb Crumley has composed a tutorial for the DigiLab that breaks down some theorems to show the capabilities of Network Analysis and how it applies to Digital Humanities.
Centrality can be really important to look at in network analysis, especially for humanists. There are different dimensions of centrality, but the one talked about the most is betweenness centrality. This term measures how often a node sits on the shortest paths between other nodes. For example, let us say we have a network of people and we are recording their interactions by the type of relationship they have with each other. Imagine a bit of information that we follow through the network as it travels from one person to another by the shortest route. Whose hands does it pass through most often? That is what betweenness centrality measures. The bit of information itself might not be important, but it makes a useful analogy. You might be thinking, isn’t this the same as counting all of the edges that are connected to that person? That count is called degree centrality. It has something to do with it, but each edge might lead to different groups. Specifically, if one person, or node, has an edge to a cluster of nodes, then that connection counts for more than an edge that only connects to one other person. Knowing this, it can be good to have the nodes sized by betweenness centrality. This option will be discussed as it becomes available.
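To make the difference between degree and betweenness centrality concrete, here is a minimal sketch in plain Python. It is not part of the DigiLab tutorial: the names and friendships are made up, the graph is treated as undirected for simplicity, and the betweenness function follows Brandes' well-known algorithm for unweighted graphs.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of (a, b) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree_centrality(adj):
    """Count the edges touching each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def betweenness_centrality(adj):
    """Count how many shortest paths between other pairs pass through each node."""
    score = {node: 0.0 for node in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, crediting nodes along the paths.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

# Hypothetical network: two groups of three friends, joined only through Dan.
friendships = [
    ("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),
    ("Cat", "Dan"), ("Dan", "Eve"),
    ("Eve", "Fay"), ("Eve", "Gus"), ("Fay", "Gus"),
]
adj = build_adjacency(friendships)
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)

for person in sorted(adj):
    print(person, degree[person], betweenness[person])
```

In this made-up network Dan has only two connections, the same as Ann, yet every shortest path between the two groups runs through him, so his betweenness is the highest while Ann's is zero.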
How should your data be structured?
The DigiLab has created a template for Network Analysis, and this page will discuss how to use that template according to each program’s needs. Download the template here: Nodes and Edges. Both sheets are needed for your records regardless, but some programs will not need as much information. The spreadsheets are labeled accordingly, meaning the nodes csv will have all of the information needed for the nodes. If your network is about the connections and relationships between different people, then the nodes csv will contain whatever information about these people you would like to collect. The minimum required columns are the ID and the Label variables (or columns); the rest are up to you, but it is recommended to keep as much as possible.
The edges csv only needs to contain the relationship between each pair of nodes. We do that with a source and a target. These two have a directional relationship. Imagine 101 is the source and 201 is the target. The relationship can be expressed as 101 is a friend of 201, but 201 is not necessarily a friend of 101. It is normal to have an edge going the other direction between the same nodes. If that relationship is also friend, then it is safe to say 101 and 201 are friends. In most cases the edge from 201 to 101 is some other relationship defined in your research. Each row of the edges sheet is a record of one interaction. For each interaction there needs to be a source, target, and weight. The weight is a numerical value that represents a type of relationship. The number itself is not necessarily important, as long as you are consistent with the ID numbers and their weights. For example, the number 1 can be assigned to friend and 2 to spouse. Just because spouse has a higher number does not mean that a spouse is a stronger relationship than a friend. A way to think about it is that each relationship is assigned its own ID, which we will call its weight.
The size of a weight does not matter in this type of network, so the weights can be numbered arbitrarily. The main point is that the value of the weight says nothing about the strength of the relationship. Target, Source, and Weight are the three necessities for this csv. The only extra variables that could be added are the names for each ID and the names of the weights. The provided spreadsheets should have examples of the typical data collected for the nodes and of what is needed for the edges. There is wiggle room with the nodes sheet but not so much with the edges. The best way to approach this page is to skim through and see which program might best suit your needs and skills. Then read which columns you need and how picky that particular program is BEFORE you start collecting data.
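As a rough illustration of the structure described above, the sketch below reads a tiny nodes sheet and edges sheet and checks that every edge has a source, target, and weight, and that every source and target ID also appears in the nodes sheet. The sample rows and the exact header spellings (ID, Label, Source, Target, Weight) are assumptions made for this example; the actual template may spell its headers differently.

```python
import csv
import io

# Hypothetical sample data standing in for the nodes and edges csv files.
NODES_CSV = """ID,Label
101,Ada
201,Ben
301,Cy
"""

EDGES_CSV = """Source,Target,Weight
101,201,1
201,101,1
101,301,2
"""

def read_rows(text):
    """Parse csv text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))

def check_edges(nodes, edges):
    """Return a list of human-readable problems found in the edges rows."""
    known_ids = {row["ID"] for row in nodes}
    problems = []
    for line_no, row in enumerate(edges, start=2):  # row 1 is the header
        for column in ("Source", "Target", "Weight"):
            if not row.get(column):
                problems.append(f"row {line_no}: missing {column}")
        for column in ("Source", "Target"):
            value = row.get(column)
            if value and value not in known_ids:
                problems.append(
                    f"row {line_no}: {column} {value} is not in the nodes sheet"
                )
    return problems

nodes = read_rows(NODES_CSV)
edges = read_rows(EDGES_CSV)
problems = check_edges(nodes, edges)
print(problems or "edges sheet looks consistent")
```

Running a check like this before uploading anywhere can catch a mistyped ID while it is still easy to fix.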
Let’s Start Off Simple
The programs listed in this section are internet tools that are good for quick graph building. Each has pros and cons, which will be discussed.
Palladio is a tool created by the Humanities and Design team at Stanford University. Once you click start, the page will ask you to upload or paste your data. For Palladio, the only data needed is the source and target data from the edges sheet. Palladio is not picky with your input, but it needs to be able to identify the source and target from it, so it might be easier to copy just those two columns and paste them into the box instead of everything else from the edges table. The network is easiest to read when names appear instead of IDs. That is your preference, but because Palladio only needs those two columns to create the graph, it labels the nodes with whatever data was uploaded. So if you want the IDs on the network, upload the source and target columns that hold the ID numbers; to have the names displayed, upload the columns that hold the names instead. Once the data is uploaded, Palladio will ask you to confirm that you are aware of certain occurrences in your data, such as special characters, which are indicated by red dots. Just click through those, and then click the graph tab at the top of the page. Assign the source (main topic) and target (how the main source is related), which should be labeled source and target accordingly. You should see the network graph appear. The only other useful option within the scope of Palladio is to size the nodes according to number.
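If your edges sheet records IDs but you would rather see names on the graph, a short script can build the two-column source and target sheet for you. This is only a sketch: the IDs, names, and header spellings are made up for illustration, and the text it produces is meant to be pasted or saved by hand.

```python
import csv
import io

# Hypothetical lookup from the nodes sheet: ID -> Label.
labels = {"101": "Ada", "201": "Ben", "301": "Cy"}

# Hypothetical (source, target) pairs taken from the edges sheet.
edge_ids = [("101", "201"), ("201", "101"), ("101", "301")]

def two_column_sheet(edges, labels):
    """Build csv text with only Source and Target, using names instead of IDs."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for source, target in edges:
        writer.writerow([labels[source], labels[target]])
    return out.getvalue()

sheet = two_column_sheet(edge_ids, labels)
print(sheet)
```

To keep the IDs on the graph instead, write the source and target values directly rather than looking them up in the labels dictionary.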
Connect The Dots provides more useful data about the nodes, whereas Palladio can provide better visuals. Connect The Dots needs the exact same data that Palladio needs, but it is not as flexible with the upload. Connect The Dots wants ONLY the source and target columns in their own sheet. So if your original spreadsheet has more than the source and target columns, you will need to copy and paste just those two columns into a separate file to upload. When it comes to graph labels, the same rule applies to this tool as it did with Palladio: if you want the names to be displayed instead of the ID numbers, upload the names instead. The output of the two tools is slightly different too. If you play around with the Connect The Dots network graph, you will see some additional features when you run your cursor over a specific node. Palladio does not compute the degree and centrality of each node, but its graph is more interactive. Palladio will let you size by the number, which is nice, but this does not carry as much meaning as the sizing in the programs further on. So each program is good, but they excel in different areas. If you decide to go with either of these programs, keeping up with the IDs is not necessary. It will be necessary later on, but for now the source and target input is enough to create the graph. IDs are usually recorded because they are less likely to cause errors during data collection. These programs are sensitive to case and spacing, meaning “DigiLab” ≠ “Digilab” ≠ “Digi Lab.” Whichever you choose, be aware of the places for error and be diligent to prevent them.
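Because a name typed two different ways becomes two different nodes, it can be worth scanning your labels for near-duplicates before uploading. The sketch below is one possible check, not a feature of either tool: it groups labels that become identical once capitalization and spaces are ignored. The list of names is made up for the example.

```python
from collections import defaultdict

def find_inconsistent_labels(labels):
    """Group labels that differ only by capitalization or spacing."""
    groups = defaultdict(set)
    for label in labels:
        # Remove all whitespace and ignore case to build a comparison key.
        key = "".join(label.split()).casefold()
        groups[key].add(label)
    return sorted(
        sorted(variants) for variants in groups.values() if len(variants) > 1
    )

# Hypothetical labels collected from the source and target columns.
names = ["DigiLab", "Digilab", "Digi Lab", "Palladio"]
suspects = find_inconsistent_labels(names)
print(suspects)
```

Each group in the result is a set of spellings that probably refer to the same node and should be made identical in the spreadsheet.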
Both of these export a typical picture file in svg format. This will do the job depending on how you plan to embed the network, but the programs discussed later on allow for more export options. Palladio itself can do more than just networks, just as Data Basic can do more than just Connect The Dots. These programs are great and very useful for quick network building. There is hardly any customization with these, since it is hard to make a program that lets you customize as you please and still keeps the interface very simple. So as we move on to other programs, keep in mind that they will not be as easy. This is not to scare you off, but just to point out that the more power is in the hands of the user, the more difficult certain functions can get.
Let’s Increase the Difficulty
The programs discussed in this section will require a few more steps, but the customization control is worth it. These programs are the reason the csv is structured in a particular way. They also require more data. The spreadsheets are linked here again for convenience: Nodes and Edges. These programs will allow you to customize the network and to do some deeper analysis.
Cytoscape is open source software that is predominantly used for bioinformatics. On its website you can also download plugins through the app store. We will be discussing the upload, customization, and export of the network. If you desire further knowledge, Cytoscape’s tutorial and manual are a recommended starting point. First you will need to download it. Once the program is opened, it will have a pretty blank interface, but you should at least see this on the left side:
The easiest way to upload your file will be to just drag it here. Once it is uploaded, this window should appear:
You will need to assign each column a particular “meaning,” displayed below. A meaning is what Cytoscape attributes to our columns. In the picture below, this drop-down menu will appear for each column.
The edges spreadsheet for Cytoscape is very similar to what is needed for Palladio and Connect the Dots. It is necessary to discuss the slight differences now, so that it makes sense how to match the “meanings” for each column once the file is uploaded into Cytoscape. The edges sheet is structured particularly for this program. This is the only program where the interaction column, or relationship, needs to be defined in words, not as a quantitative value. The other programs mentioned on this page do not need the interaction column. The interaction column holds the word associated with a particular weight. For example, if you give a weight of 1 to represent a friend, 1 would be in the weight column while friend would be in the interaction column. After you drag your data into Cytoscape, it will ask you to verify that it has labeled your columns correctly, as illustrated above. That is, Cytoscape reads the column headers in your csv and tries to match them to the “meanings” it needs. It knows that it can be wrong sometimes depending on the column header, so it gives you the opportunity to confirm and/or correct its assumptions. Based on the edges sheet that the DigiLab has provided, the table below will tell you how to match them up. It is recommended that you at least structure your edges sheet similarly to our template because Cytoscape only needs a max of 6 “meanings.” The left column lists your spreadsheet columns and the right lists the corresponding “meaning.” If you have more columns in this file, Cytoscape will display all of them, but there can only be one Source, Target, and Interaction. The rest will have to be some attribute or not imported. Cytoscape specifically does not require IDs, so the Source and Target columns will be classified as the “Not Imported” meaning. To find out which meaning icon is which, just hover the cursor over the icon and the name should pop up. This is also illustrated in the picture above.
|Headers in the edges template||→||Corresponding Cytoscape “meaning”|
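If your edges sheet only has numeric weights, the interaction words Cytoscape expects can be filled in from a small lookup. The sketch below assumes a hypothetical mapping (1 for friend, 2 for spouse, as in the example earlier on this page) and made-up header names; adjust both to match your own sheet.

```python
import csv
import io

# Hypothetical mapping from weight to the word for that relationship.
INTERACTIONS = {"1": "friend", "2": "spouse"}

# Hypothetical edges sheet that only records the numeric weight.
EDGES_CSV = """Source,Target,Weight
101,201,1
201,101,1
101,301,2
"""

def add_interaction_column(edges_text, interactions):
    """Return csv text with an Interaction column derived from Weight."""
    reader = csv.DictReader(io.StringIO(edges_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames + ["Interaction"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # A weight missing from the mapping raises KeyError, which flags
        # an inconsistent sheet early instead of exporting a blank cell.
        row["Interaction"] = interactions[row["Weight"]]
        writer.writerow(row)
    return out.getvalue()

cytoscape_edges = add_interaction_column(EDGES_CSV, INTERACTIONS)
print(cytoscape_edges)
```

The resulting text can be saved as a csv and dragged into Cytoscape, where the new column would be assigned the Interaction meaning.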
Once all of the meanings are assigned, Cytoscape will build the network. It will be a generic-looking network to start with. The next step will be to customize it, and the first part of that is to analyze the network. At the top of Cytoscape, go to Tools → NetworkAnalyzer → Network Analysis → Analyze Network → treat network as directed. This may open up a separate window or a different section of your current Cytoscape window. It can mostly be ignored, for we are only concerned with customizing, and most of that data would be useful for bioinformatics networks. Now that we have the network data, we can use it to customize our network before it is exported. On the control panel, the steps for sizing the nodes by centrality and then coloring the edges are displayed below:
- To customize your network, the style tab must be selected.
- Since we are trying to size the nodes, make sure you are on the nodes tab at the bottom. Here is where you can also color the nodes and customize other features as well.
- Next, this drop-down menu is for certain styles. Play around and figure out which one suits you the best.
- Click on the arrow next to size and match up what is in the picture. The column must be set to betweenness centrality, and the mapping needs to be discrete mapping to assign each centrality its own size. You could also choose continuous as your mapping, and Cytoscape will proportion the sizes of the nodes as best it can. Sometimes the leaf nodes (the ones usually with 0 centrality) might be too small. In that case discrete is the way to go so you can control the difference in sizes.
- If you decided to choose discrete mapping, then here is where you will insert the sizes. 10 is a good starting spot for the 0s. From there, try to scale according to the difference in centrality. Keep in mind that if the difference between two centralities is small, it might not be necessary to give them different sizes. This is completely up to you; just play around with them and see what looks best for your graph.
- Make sure you are now on the edges tab, because now you will be coloring the edges according to the different interactions.
- This step is pretty similar to the nodes sizing. Continuous will not work in this case, so discrete will be the only option.
- Here will be where you select the different colors for each interaction.
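If you go with discrete mapping and have many distinct centrality values, a quick calculation can suggest sizes to type in. The sketch below is only one way to do it and is not something Cytoscape provides: it starts the lowest value at 10, scales linearly up to an assumed maximum of 60, and lets values that are within a chosen tolerance of each other share a size. The centrality numbers are made up.

```python
def discrete_sizes(centralities, min_size=10, max_size=60, tolerance=0.0):
    """Suggest a node size for each distinct centrality value."""
    values = sorted(set(centralities))
    # Values within `tolerance` of the first value in a group share that group.
    groups = []
    for value in values:
        if groups and value - groups[-1][0] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    low, high = groups[0][0], groups[-1][0]
    span = high - low
    sizes = {}
    for group in groups:
        # Scale linearly between min_size and max_size.
        fraction = (group[0] - low) / span if span else 0.0
        size = round(min_size + fraction * (max_size - min_size))
        for value in group:
            sizes[value] = size
    return sizes

# Hypothetical betweenness values read from the analysis results.
sizes = discrete_sizes([0, 0, 4, 8, 8.5], tolerance=1)
for value, size in sizes.items():
    print(value, "->", size)
```

Here 8 and 8.5 are close enough to share a size, which matches the advice above about not treating small differences as separate centralities.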
Now that your network is scaled and colored, it is easier to see just by looking at the network who is connected to more clusters than leaf nodes (betweenness centrality) and the different relationships between the nodes. Remember that edges are one-directional, so just because there is a friend edge between two nodes, that does not mean that both consider each other friends. There could be two separate edges connecting the two nodes. They could be the same color or different; either could be expected. Depending on your network, Cytoscape might have clustered it together or not. Once you are finished customizing the design, all that is needed to spread the network out is to locate the layout button on the top menu. From there, Cytoscape offers some different layouts, but the best for these types of projects will be Apply preferred layout in the layout menu.
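Since each edge runs in one direction, it can help to list which relationships are actually returned before reading too much into the picture. The following sketch, using made-up edges, finds pairs of nodes that are joined by the same interaction in both directions.

```python
def mutual_pairs(edges):
    """Find node pairs joined by the same interaction in both directions."""
    seen = set(edges)
    pairs = set()
    for source, target, interaction in seen:
        if source != target and (target, source, interaction) in seen:
            # Store each pair once, with the smaller ID first.
            pairs.add((min(source, target), max(source, target), interaction))
    return sorted(pairs)

# Hypothetical (source, target, interaction) rows from the edges sheet.
edge_rows = [
    ("101", "201", "friend"),
    ("201", "101", "friend"),
    ("101", "301", "friend"),
    ("301", "101", "spouse"),
]
mutual = mutual_pairs(edge_rows)
print(mutual)
```

In this example only 101 and 201 consider each other friends; 101 and 301 are connected in both directions, but by different relationships.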
Exporting from Cytoscape can go two different ways. It can export a normal picture file: make sure the entire network is viewable in the window, then click File → Export as Image. It will then ask you to set the zoom, which is basically how big you want the picture to be. Cytoscape can also export the network as a webpage with File → Export as a webpage; make sure you have the simple viewer for current layout selected before you click OK on the export popup window. This provides a zip file that will need to be extracted. Once it is extracted, it is a basic html layout that is easily embeddable into your webpage. What is good about this is that it is still interactive after export. The only point of caution is zooming. IT IS RELATIVE TO YOUR MOUSE. So be very careful, because it is easy to lose the network. Our example is embedded below; if you want, zoom in and out and see what happens. Maybe the previous statement will make more sense now that you have had a chance to see it live.