Relational visualization on heroic stories


As video game enthusiasts, we have done several projects that relate directly or indirectly to video games, especially RPGs, since RPGs tend to have more interesting plots than other kinds of video games. For the final project, building on that interest, we wanted to dig a little deeper into this topic: since most RPGs tell an epic story through the medium of video games, we set out to figure out how epic stories change over time. We build relational graphs of epic stories with attributes such as gender and camp (that is, the “role” a character plays in the plot, either protagonist or antagonist), then explore how these factors change over time and whether any generalizations can be made from the visualizations. The result of our exploration consists of two parts:

  1. A website which contains interactive visualizations and some historic background
  2. A datasheet that applies graph theory to our data (shared using Google Drive)

Data Collection and analysis

Raw Data Collection:

The very first step of our final project was to collect raw data for later processing. Fortunately, all of the books we wanted to measure have a modern English translation, so we could use the text analysis software we were already familiar with. We organized our data into five different time periods, which are: pre-1850, 1900-1950, 1950-2000, 2000-present, and then chose books published in each period. We decided on three books per period; the website lists the names of all the books we used. After finding the raw texts on the internet, we needed a further mechanism to make them useful, and here comes our magic tool: Jigsaw. We used Jigsaw to analyze the frequency of connections in each text and exported the results to CSV spreadsheets, which look like this:

Each line holds the names of two characters, and the line is repeated n times in the sheet, where n is the number of connections between them as found by Jigsaw. We decided that two characters are connected if they coexist in a small portion of text. Although this is not very accurate, it is a good approximation for general relational analysis.
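To make the "coexist in a small portion of text" rule concrete, here is a minimal Python sketch. It is not how Jigsaw works internally (we do not know that); it assumes fixed, non-overlapping windows of words and plain substring matching of names, and the function names, window size and sample text are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, names, window=50):
    """Count how often each pair of names appears in the same window of words.

    Assumption: windows are fixed-size and non-overlapping; Jigsaw may differ.
    """
    words = text.split()
    counts = Counter()
    for start in range(0, len(words), window):
        chunk = " ".join(words[start:start + window])
        # Names found in this window, sorted so each pair has a stable order.
        present = sorted(name for name in names if name in chunk)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def edge_lines(counts):
    """Expand the counts into 'A,B' lines, each repeated once per connection."""
    lines = []
    for (a, b), n in sorted(counts.items()):
        lines.extend([f"{a},{b}"] * n)
    return lines


# Hypothetical sample text, not taken from any of our books.
sample = "Dorothy met the Scarecrow . Dorothy and Toto walked on . The Scarecrow slept ."
counts = cooccurrence_counts(sample, ["Dorothy", "Scarecrow", "Toto"], window=5)
for line in edge_lines(counts):
    print(line)
```

A smaller window makes the connections stricter, while a larger one links characters that merely appear in the same scene, which is the inaccuracy mentioned above.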

Once the relational CSV file was done, we started recording the characters’ attributes. The attributes include the gender and the goodness (whether or not the character plays against the good will in the story) of each character, and we put them into a second CSV sheet.








Data Processing Using Gephi
With the powerful platform Gephi, we polished our data further. Since the connection data and the attribute data were in separate files, we used Gephi to perform a natural join that merges the two. As a result, each character in the Gephi data carries both its attributes and its connections. We followed the common schema of relations as edges and characters as nodes, and used Gephi’s built-in methods to analyze influence with graph theory, including the degree, the centrality and the modularity of each character.

After processing the data with respect to each of these variables, we found that modularity is relatively low in our data set, which means that the characters are all connected together in some way. There are a few exceptions, such as Paradise Lost, whose modularity value is somewhat higher than that of the other stories. That is predictable, since in the actual story the good and the bad characters are largely separated.

Based on our knowledge of graph theory, we decided to rank the character nodes by degree. We did not choose modularity as the size factor of the nodes because it is not an unbiased variable and it is not a direct way to measure influence. Therefore, we chose degree and average weighted degree as the main factors for ranking nodes. For grouping the nodes in the final visualization, we decided to use centrality. In the visualization setup, we used the “Yifan Hu Proportional” layout with an optimal distance of 1000. The following picture shows the general appearance of our visualization and shows that grouping by centrality is a good idea.

(The Wonderful Wizard of Oz Character visualization)
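For readers who want to see what the join and the degree ranking amount to outside of Gephi, the following is a rough Python sketch. It is only an illustration under our own assumptions: the sample rows, the attribute values and the function name `build_nodes` are hypothetical, and Gephi’s own implementation may differ.

```python
from collections import defaultdict

# Hypothetical sample rows: one line per connection, repeated per occurrence.
edge_rows = [
    ("Dorothy", "Toto"),
    ("Dorothy", "Toto"),
    ("Dorothy", "Scarecrow"),
    ("Scarecrow", "Tin_Woodman"),
]

# Hypothetical attribute sheet keyed by character name (Tin_Woodman left out on purpose).
attributes = {
    "Dorothy": {"gender": "female", "camp": "good"},
    "Toto": {"gender": "male", "camp": "good"},
    "Scarecrow": {"gender": "male", "camp": "good"},
}


def build_nodes(edge_rows, attributes):
    """Join attributes with connections and compute degree and weighted degree."""
    neighbours = defaultdict(set)
    weighted = defaultdict(int)
    for a, b in edge_rows:
        neighbours[a].add(b)
        neighbours[b].add(a)
        weighted[a] += 1  # every repeated line adds to the weighted degree
        weighted[b] += 1
    nodes = {}
    for name in neighbours:
        if name in attributes:  # a natural join keeps only names found in both sheets
            nodes[name] = dict(
                attributes[name],
                degree=len(neighbours[name]),
                weighted_degree=weighted[name],
            )
    return nodes


nodes = build_nodes(edge_rows, attributes)
ranking = sorted(nodes, key=lambda n: nodes[n]["weighted_degree"], reverse=True)
print(ranking)
```

Note that a character missing from the attribute sheet silently drops out of the joined data, so it is worth checking both sheets for spelling differences before joining.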



Data Visualization and Its Presentation

We use the previous visualization, shown in more detail, to illustrate our visualizations.

We can see that Dorothy, who is the viewpoint character of the story, shows the most connections and the highest centrality in the whole visualization. We can also see that the green girl, located in the top left corner of the left figure, has a low degree of connection and a low centrality, which means this character does not have much influence over the story as a whole. This screenshot was taken after the user clicked on the node that represents Dorothy; as a result, detailed information, including the statistics of the character, is displayed on the left of the screen. Of course, the user can click any of the nodes and get the same kind of information for other characters. This might overload the user with information, but we decided to keep it so that people can reach their own answers to their questions, or ours, while exploring our website, as a good reader-driven visualization should.

To group all our visualizations in a meaningful way, we chose the TimelineJS platform to gather them on a uniform timeline, and put the five periods into it. For each time period, we first give the user a historical background in quotes and provide links to each visualization we made. We also highlighted one visualization for each period so that the user can interact with it directly.

(Screen shot of our website)



Influence Dataset with Graph Theory Analysis

We developed a method for analyzing different types of characters in epic stories, and I will introduce it at the beginning of this section. We used the same dataset to perform this analysis.

Step1: Influence data with respect to each book

Based on our research objective, we decided to get the influence data for each book first. Using a Python script, we traverse the relational and character sheets and assign the key factor, degree, to each character. The following is a screenshot of part of our script. The influence data is calculated by the script and stored in a separate file.
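Since the screenshot may not be readable here, below is a minimal sketch of what such a script could look like. It is a reconstruction for illustration, not our original script: it assumes that each line of the relational sheet is "A,B", that raw influence is the number of lines that mention a character (the weighted degree), and the function names and sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def raw_influence(relational_csv):
    """Count, for every character, the connection lines that mention it."""
    counts = Counter()
    for row in csv.reader(io.StringIO(relational_csv)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[0].strip()] += 1
        counts[row[1].strip()] += 1
    return counts


def to_csv(counts):
    """Serialize the counts as an 'id,InfluRaw' sheet, sorted by name."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "InfluRaw"])
    for name, n in sorted(counts.items()):
        writer.writerow([name, n])
    return out.getvalue()


# Hypothetical relational sheet; in practice this text would be read from a file.
sheet = "Dorothy,Toto\nDorothy,Toto\nDorothy,Scarecrow\n"
influence = raw_influence(sheet)
print(to_csv(influence))
```

The sketch works on strings to stay self-contained; replacing `io.StringIO(...)` with an opened file gives the file-based version.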


We know that different books have different sizes; generally, a foot-thick book provides a larger number of connections between characters than an inch-thick book. To make use of the data and compare characters from different books with each other, we standardized the data.

Step2: Standardize Data

In statistics, there is a concept called standardization, which is a proper solution to the book size problem. By applying the factor (X-Average)/StdDev, where X is a character’s raw influence and Average and StdDev are the mean and standard deviation of the raw influence within the same book, the influence data of each book is standardized. The following screenshot shows a piece of the data we collected and processed, in which id is the name of the character, InfluRaw is the unstandardized data, and Influence is the standardized data, which represents the character’s number of connections relative to the other characters.
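As a small worked example of (X-Average)/StdDev, here is a Python sketch. The numbers are made up, the function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is our assumption.

```python
from statistics import mean, pstdev


def standardize(raw):
    """Map each raw influence X to (X - Average) / StdDev within one book."""
    values = list(raw.values())
    avg = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # Every character has the same raw influence, so nobody stands out.
        return {name: 0.0 for name in raw}
    return {name: (x - avg) / sd for name, x in raw.items()}


# Hypothetical InfluRaw values for one book.
book = {"Dorothy": 6, "Toto": 4, "Scarecrow": 2}
scores = standardize(book)
for name, score in scores.items():
    print(name, round(score, 3))
```

After standardization the scores of each book average to zero, so a positive value means "more connected than the typical character of that book", no matter how thick the book is.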


Step3: Combine Data Altogether

After standardizing the data, we combined it. We merged the numerical data collected before with the categorical raw data from the different time periods. Since both have a common name associated with them, the combination is easy with Gephi’s natural join utility. As we can see on the right, we connected the attributes of the characters with the numerical influence data we made previously.

Step4: Compare Data from Different Eras

Using another Python script, we got the overall count for each category and the average influence value per era. By grouping them into a comprehensive spreadsheet, we could perform one final analysis.
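The per-era summary can be sketched in a few lines of Python. Again, this is an illustration rather than our original script: the row layout, the field names and the sample values are hypothetical.

```python
from collections import defaultdict


def summarize(rows, category):
    """Return {(era, value): (count, average influence)} for one attribute."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["era"], row[category])].append(row["influence"])
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in buckets.items()}


# Hypothetical combined rows: one per character, with standardized influence.
rows = [
    {"era": "1900-1950", "gender": "female", "influence": 1.2},
    {"era": "1900-1950", "gender": "male", "influence": -0.4},
    {"era": "1900-1950", "gender": "male", "influence": 0.0},
    {"era": "1950-2000", "gender": "female", "influence": -0.5},
]
summary = summarize(rows, "gender")
for (era, value), (count, average) in sorted(summary.items()):
    print(era, value, count, round(average, 2))
```

Passing a different attribute name (for example a hypothetical "camp" field) reuses the same function for the goodness analysis.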

(Raw data and plots)


In the previous plots, the top two are based on the raw data, whereas the bottom ones are based on the standardized data. As we can clearly see, simply using the raw data generates a very different result, which also shows the effect of book size. The standardized plots, as mentioned, represent the average influence of each type of character. We are not analyzing specific characters but rather their types, which include their gender and goodness.

As we can see in the gender analysis, women’s influence (the gray line) increased significantly in the 1900-1950 era, which may correspond to the wars of that time, during which people may have found female figures more attractive. Interestingly, although the famous feminist movement happened in the 1950-2000 era, women’s influence decreased dramatically while men’s influence (the orange line) increased dramatically, and for the first time men’s influence exceeded women’s. This might be because more female figures exist in the stories, which lowers the average influence. Unfortunately, we could not see any trend in the camp data, which is clearly one failure of our project, and I’ll talk about it in the next section.


Success and Failure


Success:

  1. The gender trend plot seems to be a meaningful one. We could expand on that topic and do more research to reveal more facts.
  2. The relational visualization is clear and has potential for future use.
  3. The website, thanks to the TimelineJS platform, is visually rich.
  4. We applied graph theory and generated good results in our visualization.

Failure:

  1. The dataset is big, and we even planned to do five books per period, which left us unable to dig deeper into the data (even though we used three books per period).
  2. The “goodness” analysis does not show a clear trend. One problem is that dividing characters into either black or white is pretty difficult, as a character might show two conflicting characteristics in the book. Maybe we need to come up with a better division method. Or maybe not: the problem could also be that we only used three books per period; although that is a good amount of work for us, it is not that representative overall. Or the problem might be our period division; maybe a 50-year division does not correspond to the development of epic stories.

Character Relational Analysis of the World of Warcraft Novels (Collaboration with Jiayu Huang)

For this assignment, we decided to come back to RPGs. For network visualizations, one plausible way of constructing the networks is from the relationships between characters. Therefore, we chose the World of Warcraft (WoW) novels as the source for our visualizations, since they offer more detail than our memory of the game’s plot. We finally chose one of the famous series of official novels for the game, namely the War of the Ancients Trilogy, as the source of our analysis.


Data Construction

Before we could do anything analytical, we needed to construct the base data for our visualization. Since we decided to analyse the relationships between the characters of the novels, the very first datasheet is simple: if two characters have a connection in the novels, we have a line with both names separated by a comma. If the characters have multiple connections, we repeat the line multiple times. In this assignment, since our raw data comes from raw text, we used Jigsaw for help. The method is simple: import the texts and a custom entity list of character names, and let Jigsaw analyse their connections in the text. Then, with a list view, we could examine who connects with whom and how often, and store the information in a spreadsheet in which the first two columns are the names in the connection and the third column is the frequency of that connection (Figure 1). Then, I wrote a simple script to repeat each line n times, where n comes from the third column of the previous spreadsheet (Figure 2). Also, with the help of WoWWiki, we made a simple identity list of the characters, which contains each character’s race, gender (if applicable), and affiliation in the novels.

Figure 1

Figure 2
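The step from the frequency spreadsheet (Figure 1) to the repeated lines (Figure 2) can be sketched as follows. This is a reconstruction for illustration, not the original script: it assumes a three-column sheet without a header row, and the frequencies in the sample are made up.

```python
import csv
import io


def expand_frequencies(frequency_csv):
    """Turn 'A,B,n' rows into n repeated 'A,B' lines."""
    lines = []
    for row in csv.reader(io.StringIO(frequency_csv)):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        a, b, n = row[0].strip(), row[1].strip(), int(row[2])
        lines.extend([f"{a},{b}"] * n)
    return lines


# Hypothetical frequencies; the names are characters from the trilogy.
table = "Malfurion,Tyrande,3\nMalfurion,Illidan,2\n"
expanded = expand_frequencies(table)
print("\n".join(expanded))
```

The repeated-line form is redundant, but it lets the number of lines act as the edge weight once the sheet is imported.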

Analysis in Gephi

First complaint: it is beyond my imagination that Gephi does not support spaces within input text. Therefore, before we could do any analysis, we had to replace all spaces with “_”, which can be done in multiple ways. Anyway, after importing all the necessary data, with some modifications, the initial graph looks like the following:




















This initial visualization is useful, but it hides too much information. Although we could draw some conclusions from it, it does not show any result clearly, so we processed the visualization further. One important step is to connect the identity list with the connection list; otherwise we would have nothing meaningful beyond a beautiful network graph, and no way to compare with our previous RPG analysis. With a natural join, we could easily combine the two sets of data and reveal the relationships among race, gender and affiliation.

Gender analysis


The gender problem, specifically the domination of male characters, also exists here. In the right figure, the male nodes are colored green, while the female nodes are colored red. Nodes of unknown gender (either a non-human character or a character with almost no gender information) are colored blue. As we can see, green dominates the screen, which reveals the fact that only a few female characters are mentioned in the text. Furthermore, among the females, only two red nodes have significant connections with other nodes in the graph, and even for the one female character that is among the central nodes, the connections are comparatively weaker than those of the green ones. If we try to group the data (left figure), magic happens: we have our predictable giant green dot, and a small red triangle lies lonely in the 2 o’clock direction. Oh, and the dust-like unknown genders lie in the 4 o’clock direction, FYI.

From the previous visuals, we can see that one general problem with RPGs, male domination, still exists in the WoW series.


Affiliation Analysis

Then, we chose to analyse the affiliation of each character. A node can be red, which means the character is for the common good of the people; black, which means that evil lives deep in their heart; or green, which means the character has barely picked a side between the good and the bad. Contrary to our previous analysis, in which villains tended to play an influential role in those games, we can see that the green nodes barely connect to the black ones, and that only a few connections exist among the villains, while strong connections are present between the good and the bad. Thus, we can draw some conclusions about the plot. The neutral characters act more like background characters: since they have little to do with the villains, it seems they are not really involved in the conflict between the reds and the blacks. Instead, they might be someone’s common teacher, or the loved ones of side characters; they could even be poor victims of the villains who never got a chance to pick a side. Given the strong relationships between the reds and the blacks, we can also conclude that the conflict between them is a big one, and it is, because they are at war (as the title suggests). Therefore, we can conclude that the bad people in the story are very clichéd characters who are totally evil, or BLACK, and it is important to think of them as having some bright points.


Race Analysis

There are plenty of races involved in the story. In the figure, the relationship strength reveals the node’s level or status in the story, and the size of the node reveals the node’s level of loneliness. As we can see, the most mentioned race is the night elf, and the most influential one (the one connected to the most races) is the red wyrm. Dragons are barely mentioned in the WoW series except for the King/Queen of each species, which explains why those nodes are small even though they connect to almost all species. As background, the red wyrm mentioned in the text is one of the main characters.








Gephi analysis with graph theory

Gephi is far more than a data visualizer. It can generate statistical information such as distribution, shape, and density. If we want to know how well the characters are connected with each other, we can let Gephi generate some numbers for analysis. The right figure shows the statistical data we got from Gephi, such as the average degree, the network diameter, the graph density, and the average path length. Of course, the average path length is much smaller than the commonly cited real-world value of 6, since the story is told from one particular character’s viewpoint, which makes all connections closer than they would otherwise be.
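To show what these four numbers mean, here is a small Python sketch that computes them for an undirected, unweighted graph. It is not Gephi’s implementation, and Gephi’s results may differ (for example for directed or disconnected graphs); the function name and the tiny example graph are hypothetical, and the sketch assumes at least two connected nodes.

```python
from collections import deque


def graph_stats(edges):
    """Average degree, density, diameter and average path length of a graph."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2  # each edge is seen twice
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        # Breadth-first search gives shortest path lengths from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        for target, d in dist.items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return {
        "average_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
        "diameter": diameter,
        "average_path_length": total / pairs,
    }


# Hypothetical chain of four characters: A - B - C - D.
stats = graph_stats([("A", "B"), ("B", "C"), ("C", "D")])
print(stats)
```

In a story told around one character, most nodes link to that hub directly, so almost every pair is at most two steps apart, which is why the average path length comes out so short.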








Tools comparison:


Gephi, like the two previous tools, Palladio and Google Fusion Table, is a network visualization tool, so it makes sense to compare them. The left graph is the result of the same data set visualized by Palladio, and the right one is that of Google Fusion Table. First thing to mention: since huge updates have been made to Palladio, it seems much more responsive than before. (I noticed!) In comparison, Google Fusion Table and Palladio are more like subsets of Gephi, in that most features the former two support are also supported by Gephi, while Gephi has some features that are missing in Palladio or Fusion Table. For example, it is difficult to do data management in Fusion Table or Palladio, especially natural joins, while in Gephi data management is a piece of cake. Also, the visualizing tools in Palladio and Fusion Table only have the Force Atlas layout, while Gephi offers multiple layouts. Most importantly, they cannot generate numbers based on graph theory, while Gephi can do that easily. Their only advantage is that both Fusion Table and Palladio can link the nodes to actual maps.



Visualizing Networks: Wars in Old Days (Collaborated with Huang, Jiayu)

We have now learned how to visualize networks using two different tools: Google Fusion Table and Palladio. Both are very good tools. They both provide useful ways to do the visualization, despite the fact that they perform and feel different from each other.

Instead of visualizing plain text, both Palladio and Google Fusion Table use CSV spreadsheets as raw data input. In some sense, such raw data is more solid, in that the spreadsheets contain only facts; unlike essays, blog posts, etc., it would and should provide a more predictable result instead of the possible surprises that text analysis may produce. Interestingly, Voyant and Jigsaw are local software while Palladio and Fusion Table are browser-based applications, so it seems that network visualizing software needs networks :D.

Raw data collection:

Setting aside the fact that people may feel differently about having to build their own CSV file to work with these tools, they are very interesting tools for network visualization, especially since they have maps built in. That is why we chose to step away from our earlier RPG analysis, since RPG stories have only limited locations associated with them. What we finally decided to visualize is the warfare that happened between 1900 and 1950, a period in which the two most famous wars lie: WWI and WWII. Our data includes: start time, end time, name, the countries involved, and the result (indicating who won the war and who lost it).


Palladio, a Beautiful Platform with Potential:

First things first: unlike what my partner values (he being a Google fanboy), I think the beauty of a tool should be valued more than it currently is, especially for a humanities tool; after all, I cannot believe that a tool would generate a stunning visualization without being beautiful itself. With that in mind, I really love Palladio; although we chose to make only one visualization with it because of performance issues, I still love it. Its modern and simple design really caught my heart when I first opened it. However, since it is really slow when processing more than 300 lines of data (which shouldn’t count as large data), it is a pain to use. In such a digitalized world, with such a large flow of information every day, that makes it not very useful and puts such a wonderfully designed tool in a very embarrassing position. But since it is a fairly new tool, and hopefully still developing, it has great potential, as it is off to a very good start.

Putting all the judgements aside, let’s look at the actual visualization:

(Screenshot of the Palladio timeline of countries involved in a war)

The previous figure visualizes the number of countries involved in a war over time. It is called a timeline in Palladio. From the figure, we can see that 1916 to 1921 is the time when the most countries were involved in a war. Interestingly, WWI happened at around the same time. However, during WWII, which lasted approximately from 1939 to 1945, far fewer countries were involved, even fewer than the number of countries involved in a war before WWI. Still, people remember WWII more than other wars, not only because it is the most recent war that involved lots of countries around the world, but also because it did more damage to humanity. With the massive use of tanks, machine guns, aircraft and heavy artillery, it was easier to kill people than ever before, not to mention the existence of the nuclear bomb at the end of the war. That so few countries were involved could suggest that warfare in the WWII era moved rapidly with so many new weapons, and that war fronts changed quickly.

Google Fusion Table, A Productive Tool Made by Google:

There is less to say about Fusion Table. That does not mean it is not useful; to be honest, it is more useful than Palladio. But that’s only because it is more mature, including its look: it makes me feel like I am using super professional software back in 2008. It doesn’t look bad, just not very attractive after seeing the beauty of Palladio, and, as the fanboy says, it is super productive.

The very first visualization plots the duration of each war against the time when the war started. The upper part (with the white background) is the raw plot over time, while the lower part of the figure (with the blue background) shows the difference in value from the lowest point on the plot versus time. As we can see, there are times when wars happen in rapid succession, and times when only a few wars happen. Also, as mentioned, wars in older times tend to last longer than newer wars, given all the new killing machines introduced to the world. This is especially visible when comparing WWI and WWII: although WWII made humanity suffer more, individual wars were shorter than those of the WWI era.

(Screenshot of the Google Fusion Table plot of war duration versus start time)

The second visualization shows the countries that were defeated (Figure b) and those that won (Figure a). From it, we can see that most of the countries involved are in Europe, suggesting how bellicose Europeans were at that time. Also, there are lots of points in South America, which could possibly be revolutions or colonial wars, and, more interestingly, most countries lost those wars (2:7, victory:defeat).

Figure a: Countries that won the war

Figure b: Countries that lost the war

The last two visualizations show the relationships between the winning countries (labeled in blue) and the defeated countries (labeled in yellow). The more a country was involved in wars, the bigger its dot is. What we can see is that China and the US have large dots on the screen. For China, it is not surprising, since it had lots of domestic and international conflicts at that time; but for the US, I am very surprised that it was involved in so many wars, which I previously thought impossible since it is far away from the “main world”, Europe. The most interesting fact, though, is that many large dots on the graph belong to countries involved in a civil war rather than the two famous World Wars. The reason might be that civil wars tend to affect a vast area of land and tend to last longer, as it is difficult to kill all the rebels. Since the scale of each war (especially how many people died) is not part of our raw data, this might be misleading, in the sense that a large dot does not necessarily mean more suffering. Still, it does show something; in the future, we should think more about how to make a visualization as little misleading as possible.

As we can see, America was involved in wars a lot, as it is a very big dot on the screen.

(Screenshots of the network of winning and defeated countries)