Visualizing Relationships in Heroic Stories

Project Introduction:

Zhengri Fan and I have always been interested in RPG games. We love to play them because we like the stories behind them. Instead of doing research on games as we did before, this time Zhengri and I decided to go directly to the “base”: hero stories. We focus on the relationships between characters in heroic stories from different time periods. From those relationships, we can estimate each character’s influence on the entire story. We then categorize characters by gender and camp (protagonist or antagonist) to get an overview of how much influence each type of character has on a heroic story, and we explore whether the influence of different types of characters changes over time. Our results consist of two parts:

  • An interactive website with character visualization and history background: Epic Story Visualization
  • An influence dataset collected and processed with graph theory (in shared Google Drive files; access is available from Jiayu Huang, Zhengri Fan or Katherine Faull)


Dataset Preparation:


The first step of this project is to select raw texts of epic stories for further research. Because we want modern English text for better text-analysis quality, and time periods that actually influenced literature a lot, we chose pre-1850 (God, Royal and Traditional), 1850-1900 (1st Globalization), 1900-1950 (World War), 1950-2000 (Technology and Cold War) and 2000+ (Future) as our 5 major time slots for selecting books. For each time slot, we chose 3 books that can be seen as epic stories. The full list of source data can be accessed on our visualization website. We got the raw texts from open-source libraries and websites; the text files can be found in the Google Drive raw dataset.

Data Preprocessing:

In order to visualize relationships, we need to process the raw text to extract the relationships between different characters. So there are three very important terms here: raw_text, relationships and characters. We have the raw text now, so next we need the characters’ names to help Jigsaw do its dark magic trick of finding relationships for us. For each book we selected, we created a character database with the name, gender and camp (whether he/she is good or bad) of each main character in the story. Then we split the raw text into several small pieces and input them into Jigsaw. If two characters appear in the same small piece, we claim that they are related. Because we don’t have a highly accurate way to make a computer understand text, this is a good approximation for getting relationship data. Finally, the relationship dataset looks like the picture on the left: each line of the csv file counts one undirected relationship.
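The "two characters in the same piece are related" rule can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not the Jigsaw workflow itself: the chunk size, the function names and the Source/Target/Weight column layout are assumptions made for this example, and repeated co-occurrences are summed into a weight instead of being written as one csv line each.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations


def split_into_chunks(text, words_per_chunk=200):
    """Split raw text into consecutive pieces of at most `words_per_chunk` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def count_relationships(text, characters, words_per_chunk=200):
    """Count, for each pair of characters, the chunks in which both appear."""
    counts = Counter()
    for chunk in split_into_chunks(text, words_per_chunk):
        present = sorted(
            name for name in characters
            if re.search(r"\b" + re.escape(name) + r"\b", chunk)
        )
        # Every unordered pair found in the same chunk counts as one relation.
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


def to_edge_csv(counts):
    """Render the pair counts as an edge list (assumed columns: Source, Target, Weight)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(counts.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


# Tiny made-up text, only to show the mechanics.
sample_text = "Dorothy met the Scarecrow. Later Dorothy and Toto walked on."
sample_counts = count_relationships(
    sample_text, ["Dorothy", "Scarecrow", "Toto"], words_per_chunk=4
)
print(to_edge_csv(sample_counts))
```

A smaller chunk size makes the rule stricter (characters must be mentioned closer together), while a larger one links more characters, so the choice directly shapes the resulting graph.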




Data Processing w/ Gephi:

This time, the process is much the same as in our previous assignment: in Gephi, we do a natural join between our CharacterDB and the relationship csv file for each raw text. Gephi puts the character database information into the Nodes table and the relationship data into the Edges table. We create the data following a commonsense schema: relation-edges and character-nodes. In order to present the character relationships together with their influence, we use Gephi’s built-in graph theory analysis tools to get each character’s degree (the number of connections between this node and other nodes), centrality (whether this node is the center of the graph; it can be calculated from connected components and degree) and modularity (how strongly the graph can be divided into different parts). After applying these tests to our data, we find that the modularity of character relationships in heroic stories tends to be really low. That means all of the characters tend to be in one big group, i.e. they are all connected in some way. Not all of the stories are like this, though. Paradise Lost’s modularity value is a little bit higher because the two camps of characters are separated in a pretty discrete way.

After some research on graph theory, we chose to use degree to rank the size of the character nodes. Modularity is a good concept, but it only represents grouping and is not a good way to measure influence. Centrality is also good for determining the “core” of a relational web. However, it is not an unbiased measure. It deals well with a centralized character, for example Alice in Alice’s Adventures in Wonderland, but there is a lot of bias when we try to use centrality for characters that are not that central to the relationships. In heroic stories, degree and average weighted degree give us a good approximation of the level of influence. Because of this concern, I size the nodes according to their degree: a character with a higher degree can be seen as a more important character.

However, I would also like to show grouping and centralization. For visualization purposes, I set the layout to “Yifan Hu Proportional” with an optimal distance of 1000. This layout shows grouping (see Paradise Lost) and centralization very well. Here is a sample screenshot of a character relation visualization:

Visualization Utilization/Presenting

In this visualization of “The Wonderful Wizard of Oz”, Dorothy is the central character, with high degree and high centrality. greengirl is a good example of a character with lower influence, in terms of either degree or centrality. The viewer can use the controls at the lower right to zoom in and out of the graph. Also, clicking on a node brings up detailed information about it, including all the collected graph theory data and its connected nodes. We decided to keep all of the graph theory analysis data in the visualization because we want it to be useful not only for our current question about influence but also for potential future questions. As we promised in our previous presentation, the product is a fully reader-driven experience: we present the data, but how to use this relation visualization is up to our users. Of course, we are users ourselves.

In order to present our visualizations in a natural and fluent way, we use TimelineJS to host them. For each time period, the website has an introduction to the history of that period, several visualizations of books from that period, and links to our visualization website. In the background, there is a picture representing the specific era. Here is a screenshot of our user interface. When designing the interface, we tried different fonts and designs, and this is the final output that we are really proud to present:


Influence Dataset with graph theory analysis:

Everything before this section describes a tool we created for researching different types of characters in epic stories. In this section I would like to talk briefly about how we used this tool to do actual research. This research is based on the data we collected above and on graph theory.

Step 1: Influence Data for different Books

Because we want information about influence in different time periods, we first decided to get the influence data for each book. We export the edge list (relationships) and node list (characters) from each Gephi visualization dataset. Then, using a Python script, we traverse the relationship list and assign the important factor, degree, to each character. On the left is the sample source code. The influence data is calculated and stored in another csv file. However, the influence data we collected here is just:

Number of connections to a specific node
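As a rough illustration of this step (not the original script, which was shown as a screenshot), the degree of each character can be read off an exported edge list like this. The Source/Target column names and the sample rows are assumptions for the example, and repeated edges between the same two characters are counted once.

```python
import csv
import io
from collections import defaultdict


def degrees_from_edges(edge_rows):
    """Return {character: number of distinct characters it is connected to}."""
    neighbours = defaultdict(set)
    for source, target in edge_rows:
        if source == target:
            continue  # ignore self-loops
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {name: len(others) for name, others in neighbours.items()}


def read_edges(csv_text):
    """Read (Source, Target) pairs from an edge list with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Source"], row["Target"]) for row in reader]


# Made-up edge list; the third row repeats the second in the other direction.
edges_csv = """Source,Target
Dorothy,Toto
Dorothy,Scarecrow
Scarecrow,Dorothy
Dorothy,Tin_Woodman
"""
influence_raw = degrees_from_edges(read_edges(edges_csv))
for name, degree in sorted(influence_raw.items(), key=lambda item: -item[1]):
    print(name, degree)
```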

However, if we want to compare these data with each other, a very serious problem occurs: how can we control the bias in the data when a book’s text is very long? I ran into this problem when I tried to use the raw influence data for the research. I call it the data bias effect, and I will show a screenshot later to show what it looks like. But for now, I would like to talk about the solution.

Step 2: Standardize Data to utilize data for comparison

There is a concept in statistics called standardization. With it, data can be processed to eliminate the data bias effect while still preserving the comparative relationships within the data. The formula is (X - Average) / StandardDeviation. Therefore, I standardize the influence data for each book. The screenshot on the right is a sample data analysis chart for a sample book. Id is the name of the character, InluRaw is the unstandardized influence value (the degree), and Influence is the standardized version:

Number of connections to a specific node compared to other nodes
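A minimal sketch of the (X - Average) / StandardDeviation step, applied to the degrees of one book. The sample degrees are made up, and using the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def standardize(raw):
    """Map each raw value X to (X - average) / standard deviation."""
    values = list(raw.values())
    average = mean(values)
    deviation = pstdev(values)  # population standard deviation (an assumption)
    if deviation == 0:
        # All characters have the same degree, so nobody stands out.
        return {name: 0.0 for name in raw}
    return {name: (value - average) / deviation for name, value in raw.items()}


# Hypothetical raw degrees for one book.
influ_raw = {"Dorothy": 3, "Toto": 1, "Scarecrow": 1, "Tin_Woodman": 1}
influence = standardize(influ_raw)
for name, score in influence.items():
    print(f"{name}: raw={influ_raw[name]} standardized={score:.3f}")
```

After this step the values of every book have average 0 and standard deviation 1, so a long book with many connections no longer dominates a short one.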

Step 3: Combine Data from different books into different eras

After we got the data, we combined the numerical data with the categorical variables. First, we combined the influence datasets (numerical data) by era. Then we combined the corresponding character databases (categorical data). The data can be combined safely because they share the same schema. We then used Gephi to do a natural join between the character database and the influence datasets on Id. This gave us 5 files (one for each time period). The data in each file is stored according to the schema on the left, with the categorical and numerical data connected.
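The join itself was done in Gephi, but the same operation is easy to picture in Python. The sketch below is only an illustration: the Id, InluRaw and Influence names come from the chart described earlier, Gender and Camp from the character database, and all row values are made up.

```python
def natural_join(characters, influence_rows, key="Id"):
    """Join two lists of dict rows on their shared key column."""
    by_key = {row[key]: row for row in characters}
    joined = []
    for row in influence_rows:
        match = by_key.get(row[key])
        if match is not None:  # rows without a partner are dropped
            joined.append({**match, **row})
    return joined


# Hypothetical categorical data (character database).
character_db = [
    {"Id": "Dorothy", "Gender": "female", "Camp": "good"},
    {"Id": "Wicked_Witch", "Gender": "female", "Camp": "bad"},
]
# Hypothetical numerical data (influence dataset).
influence_data = [
    {"Id": "Dorothy", "InluRaw": 3, "Influence": 1.73},
    {"Id": "Wicked_Witch", "InluRaw": 1, "Influence": -0.58},
    {"Id": "Unknown_Extra", "InluRaw": 1, "Influence": -0.58},
]
era_rows = natural_join(character_db, influence_data)
for row in era_rows:
    print(row)
```

Characters that appear in the influence data but not in the character database (such as the made-up "Unknown_Extra" row) drop out of the result, which is why the names have to match exactly in both files.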

Step 4: Compare Data across different eras

Then we use another Python script to compute, for each categorical variable, the average influence of each group of characters (values from different books can simply be added together because the influence is standardized data). The code can be found in RawConprehensive/Influence Per Era. Then I created a comprehensive chart as the final analysis of the question I asked.
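As a hedged sketch of what such a per-era script might look like (the actual code lives in the folder named above and is not reproduced here), the average standardized influence per category value can be computed as follows. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_influence(rows, category):
    """Average the standardized influence of the rows in each category value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[category]] += row["Influence"]
        counts[row[category]] += 1
    return {value: totals[value] / counts[value] for value in totals}


# Hypothetical joined rows for one era.
era_rows = [
    {"Id": "A", "Gender": "female", "Camp": "good", "Influence": 1.5},
    {"Id": "B", "Gender": "male", "Camp": "good", "Influence": 0.5},
    {"Id": "C", "Gender": "male", "Camp": "bad", "Influence": -0.5},
    {"Id": "D", "Gender": "male", "Camp": "bad", "Influence": -1.5},
]
print(average_influence(era_rows, "Gender"))
print(average_influence(era_rows, "Camp"))
```

Because the result is an average per character, a group with few but prominent members can score higher than a large group, which matters when reading the gender charts.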


The four graphs are the influence analysis on Gender and Camp. The upper two use raw data and the lower two use standardized data. The significant peaks in the raw data analysis are a good example of the data bias effect: because of the significant length of The Lord of the Rings, the raw data is not trustworthy in terms of data stability. With the standardized data, we are not researching the total effect of all characters of a certain type on the story, but the average influence of one instance of that type. For example, we are not trying to explore the combined influence of Ginny + Hermione + … on Harry Potter, but the average influence of a female character. We are trying to learn about a single character’s influence on the story.

On the first graph we can see that women’s influence on the story rises significantly in the 1900-1950 era. This rise means that a woman in a story might be more influential than a man in that story. Our previous research on war suggested that war always comes with love. It might be the same effect here: in a period of war, people want to get comfort from female characters. Then, although there were women’s rights movements from the 1950s to the 2000s, men’s position became more important in epic stories, and it is the first time men’s influence in the story is higher than women’s. One possible explanation is that the data is an average value: there were not many women characters before, so each one of them was very important to the story. With more concern for women’s rights, we have more women in the story, so on average each of them is less important. These two are just my own explanations of my observations. At least we can see a very clear pattern on the graph: when men get more influence, women get less.

On the other hand, if we look at the Camp dataset, we see a very blurry pattern, with no significant peaks and not much correlation. I will talk about this in my next section.

Success and Failure on Our Project:


  • We successfully found that women’s influence in the books is correlated with time, and that it is correlated with men’s influence
  • We successfully produced a relation graph for future research/use
  • Our design is awesome
  • Our data analysis part is very strict and convincing (in graph theory)
  • Our dataset is too big, so we were unable to do very deep analysis (for example, analysis of story plot). We found out that this was a problem too late
  • Our camp analysis failed. The reason is that we cannot easily divide the characters in a book into simply good, bad and neutral. We need a better categorizing factor. This might be an issue from our assumption-making period

Relational Graph Analysis on Characters in World of Warcraft (Collaboration w/ Zhengri Fan)

Data Preparation (Zhengri)

Network visualisations start with questions (Lima), so we started this project from the topic of our previous visualisation on RPG games. Our question is “Is our previous observation true in actual cases?” To answer it, we took the official novel (official story text) of World of Warcraft, the War of the Ancients Trilogy. We created two datasets from the text: 1. the relationships between characters and 2. the characters’ identities. The first step was to find the characters in the text, so we used the tool Jigsaw to extract person names. Then we used algorithms to build a relationship table. For the character identity table, we built a data schema of Gender, Name, Affiliation and Race. We did this because our question is whether our previous predictions hold in an actual case, and those predictions were mainly about the gender and affiliation of characters in RPG games. I won’t talk about the data preparation in detail because it is mainly my colleague’s work. If you would like to explore more about it, please visit his blog post. The data preparation should count as very important in our project because it is the foundation for everything else.

 Data Analysis w/ Gephi (Jiayu)

I got the list of character relationships from Albert (my teammate), then I started to use Gephi to visualize the graph data. Though Gephi has been updated since I first used it last year, it still could not handle spaces within names. Therefore, before importing the name data into Gephi to create the relational graph, I eliminated the spaces using the formula =SUBSTITUTE(row, " ", "_") in Excel (I mention this because it might be really helpful for future Gephi users). After I imported my data into Gephi, the output looked like the graph on the left. (Well, it is not exactly the same, but the “DEGREE OF MEANINGLESS” matches.) It looks pretty but it reveals nothing. Though we can make the strong relationships look more significant and the influential nodes’ color darker, it tells us nothing.

The next step after creating the relations is to give each node an identity. Therefore, we combine the information in our identity list with our relational graph, so that each node in our dataset carries its gender, affiliation and race. We chose those attributes for our node identity schema because we wanted to continue our previous project on RPG games, which revealed facts about gender and affiliation. Gephi’s data management feature works very well because it can do a natural join on two datasets (it combines the relations with the nodes’ attributes even if they are two separate datasets; the key (id) we use for that is the name). So our further analysis of this WOW character data covers 1. gender, 2. affiliation and 3. race.

Gender Analysis on WOW Data:


On the right is a character relation graph with partition coloring based on gender. Green stands for male and red stands for female. The blue ones (yes, they do exist) are those of unknown gender (animals or just unknown type). The pattern is, well, very straightforward and predictable: males dominate the RPG game and its story. Though there are still some red points with strong connections, that doesn’t change the fact that the story doesn’t really need a female figure, even though this is World of Warcraft, the most famous and legendary RPG game. This pattern is even more shocking if we group the data (the graph on the right). The big green dot is of course the male character group, and the poor unknown group is the small dot in the 4 o’clock direction (if you cannot notice it at first glance, haha, you may read the post on the high-definition webpage to find it). The exploration of WOW confirms our previous observation very well.



Affiliation Analysis on Gephi:

Another observation from our previous data analysis of RPG games is that the bad guys, the villains, play the more important roles in RPG games. However, the graph tends to tell a different story for World of Warcraft. On the left is a visualization of the graph grouped by affiliation. Red is the good guys, black is the villains, and green is the characters on the neutral side. The connections between the red points completely contradict our previous prediction about the influence of affiliation in RPG games.

When we explore the most significant points (the big green (neutral side) ones), we find something interesting that may explain the misprediction. The neutral characters, even though they act neutral, don’t have many connections with the villains. The connections between villains and heroes are always very strong (the widest red lines). And there are only a few connections between different villains.

So we can find a story behind these affiliation relationships. Because the neutral characters have no connections with the villains, we can say that they are more like “background NPCs” than core characters. They are not actually involved in the conflict, and they are mentioned because the protagonists meet them. Because the connections between villains are really weak, we can conclude that they are truly THE VILLAINS: they are strong enough to fight the heroes without much cooperation. The strong connections between heroes and villains probably reflect the massive conflicts of the main storyline. So we can conclude that the villains in our story are not as unimportant as they look at first glance in the graph. They are just depicted as lonely villains, and their strong connections with the heroes show their importance. However, I would say that the villain figure in World of Warcraft is a cliché. It is just a very traditional Byronic hero.

Race Analysis on Gephi:

The third analysis we built in Gephi is the partition based on race. I would like to support my previous analysis of affiliation with my race analysis: the strength of a node’s relationships reveals its level/status in the story, and the size of the node reveals its level of loneliness. This graph shows that in the WOW novel, the race with the most characters is night elf (the big purple node on the graph), but it is not the most important one. In fact, the red_wyrm (red dragon) has the most influence in the story. There is only one red dragon in WOW’s world, so the size of its node is really small. However, its edges are giant and reach across the whole graph. In the actual story, the red dragon is truly the most influential character, by the way. So it supports my previous assumption about how to read the graph.

Graph Theory Analysis on Gephi:

Gephi is not only for infographic analysis; it can also do some very interesting data analysis on the graph we created. When we talk about graph theory here, we mean using it to answer statistical questions about the data. We would like to know about the distribution, the shape and the density, i.e. how the characters are connected: are they connected tightly or not? On the left is the graph theory analysis from Gephi. Average degree is the average influence per character, and graph density tells us how tightly the characters are connected. Of these numbers, the most interesting is the average path length. From the six degrees of separation theory, we would predict that in a real-world social network the average path length is about 6. However, in WOW the average path length is 1.32. That means you, as a nobody in that world, may be connected to our villain within 2 steps. Relationships in an RPG are really tight. In other words, social status and social barriers are very thin in the world of an RPG game.
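To make these two statistics concrete, here is a small Python sketch that computes graph density and average path length for an undirected graph using breadth-first search. It is meant as an illustration of the definitions, not as a reproduction of Gephi’s implementation, and the toy graph is made up.

```python
from collections import deque


def build_adjacency(edges):
    """Turn undirected (a, b) pairs into {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if a != b:  # ignore self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def graph_density(adjacency):
    """Existing edges divided by the number of possible edges."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edge_count = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    return edge_count / (n * (n - 1) / 2)


def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered pairs of nodes."""
    total, pairs = 0, 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in distance:
                    distance[nbr] = distance[node] + 1
                    queue.append(nbr)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0


# Toy chain A - B - C - D, only to show the mechanics.
toy = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
print(graph_density(toy), average_path_length(toy))
```

In a graph where almost every character is directly linked to almost every other one, the density approaches 1 and the average path length approaches 1, which is the kind of situation a value like 1.32 points to.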


Comparison with Our Previous Tools: Google Fusion Table and Palladio (Zhengri):


In this section, I would like to state the conclusion first: Gephi is much more sophisticated than Google Fusion Table and Palladio, because the features of those two are only a subset of Gephi’s. (Palladio has been updated from 1.01 to 1.13 and its performance is very good now.) Google Fusion Table and Palladio are easier to use than Gephi, but everything they can do can also be done in Gephi, while they cannot do everything Gephi can. The first thing they cannot do is the data management provided by Gephi. Google Fusion Table and Palladio can hardly perform database operations like a theta join or natural join on the dataset, so the data schema cannot be attached to the relation data. Second, the graphic models of Google Fusion Table and Palladio are insufficient: the only visualization model they offer is force atlas. Finally, it is hard to use them for deep data analysis based on graph theory. Comparing Gephi with them is just like comparing Photoshop with Windows Paint. Though they look similar, they live in totally different categories (professional productivity tool vs. temporary tool for fun). They can do simple visualisations but they are not able to do deep analysis. I should admit that at least they look very nice.

Assignment 3: Analysis of Modern War (Co-work w/ Zhengri Fan)

This week, we were introduced to 2 new tools for visualisation: Palladio and Google Fusion Table. Both of them are very good at network visualisation and raw data analysis. Using these tools is a very different experience from using Voyant and Jigsaw. Unlike in text analysis, cleaning the data is not the most important part when using a raw data analysis tool, but building the data structure really means a lot. Table-based data visualisation gives different results from text-based visualisation. The data is clearer and cleaner but, on the other hand, the results are more predictable.

Raw Data Preparation & Data Structure Design:

Palladio and Google Fusion Table are very good at network visualisation, but in my opinion, as two data tools that work specifically on well-organised tabular data, I would categorise them as comprehensive raw table data analysis tools rather than networking tools. So we decided to research three interesting factors while comparing the two tools: time, space and relationship. We found a database on Wikipedia listing the wars that happened in the world from 1900 to 1950, the fast-changing period that forged today’s world. From that database, we collected start time, end time, war name, victor and loser as our database schema. Time, space and relationship are all together in the database, ready to be visualised.

We chose not to stick with our RPG research because we wanted to test all the features of those 2 platforms. Most RPG games are developed and sold in Japan and North America, so it is really hard to do a spatial visualisation with them. Also, the relational visualisation was already quite successful in our previous project.

Palladio: Beautiful yet low Performance

To speak frankly, I love the design of the Palladio website. It is incredibly beautiful among the visualisation tools. Because I have experienced Jigsaw, Gephi and the old version of Voyant, the modern, simple design of Palladio is really catchy. However, the actual experience of using Palladio can be described as suffering. As a data analysis/network visualisation tool, Palladio can hardly process more than 300 lines of csv data. Our database has around 300 lines of data, and the network visualisation between Victory and Defeat can take over 1 minute even on a quad-core i7 processor with 16GB of memory. So Palladio falls into a really embarrassing situation: it can produce very beautiful visualisations with a small amount of data, but a network visualisation with a small amount of data is usually not very meaningful. However, one of its features still caught me. It is called timeline, and it can create a timeline from the data schema I provide. Here is a visualisation of Countries Involved in War vs. Time.

In this visualisation, WW1, from around the 1910s to the 1920s, involved the most countries (the colour of the bars is kind of meaningless). As time went on, many countries left WW1. The change is more like a linear recession: the number of countries involved does not fall rapidly but decreases step by step. However, if we take a look at WW2, from 1938 to 1945, things are different: countries became involved a lot more rapidly. It is also interesting that the number of countries involved in war before WW1 is higher than in WW2, even though we all know WW2 did a great deal of damage. So we can conclude that modern wars involve countries rapidly, deal more damage and end quicker. Countries get involved quickly and end (die) fast. For Palladio, I tried to use it as a data visualisation tool for networking, but its performance stopped me.

Google Fusion Table: Powerful Google Tool:

Cool Feature @ Katie : Automatic Geocode

Fusion Table might be the coolest tool I’ve ever tried. I made three visualisations with it. It is high-performance, easy to use and has a lot of fancy features. Before introducing my visualisations, I would like to introduce a cool feature that might be helpful. @Katie, as promised, this is for you. The feature is called automatic geocode. If you enter addresses in the table, you can change the column’s datatype to Location, and Google Fusion Table will automatically use the Google Maps API to find the latitude and longitude of the place each address refers to. It marks addresses it could not geocode as ambiguous and converts most addresses into points on the map. It helps a lot with data collection and processing. 🙂

Visualisation 1: War Duration Time Vs. Time Histogram

For this visualisation, I looked at war duration (the y axis) against war start time (the x axis). The upper part of the visualisation is the actual dataset, while the lower part is the standardised graph of the relationship (it only shows the change in the data and does not reflect the actual data size). The first interesting fact is that a lot of wars in the world end within one year. The points “on the ground” represent those wars. The graph also supports the point I made before: wars in older times (before the 1940s) tend to take longer (over 20 years), while modern wars with higher damage start fast and end fast. Rapid changes in technology and society also rapidly change the form of war. WW2 is not a very big “tooth” in this graph, but it damaged the world the most.
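The points behind such a chart are simple to derive from the start time and end time columns. The sketch below shows one possible way to do it in Python; the rows are hypothetical placeholders rather than entries from the Wikipedia table, and the dictionary keys are assumed names for the schema fields.

```python
def war_durations(wars):
    """Return (start year, duration in years) points, sorted by start year."""
    return sorted((war["start"], war["end"] - war["start"]) for war in wars)


def short_wars(wars, max_years=1):
    """Names of wars that lasted at most `max_years` years."""
    return [w["name"] for w in wars if w["end"] - w["start"] <= max_years]


# Hypothetical rows following the start/end/name/victor/loser schema.
wars = [
    {"name": "War A", "start": 1904, "end": 1905, "victor": "X", "loser": "Y"},
    {"name": "War B", "start": 1910, "end": 1931, "victor": "Y", "loser": "Z"},
    {"name": "War C", "start": 1939, "end": 1945, "victor": "X", "loser": "Z"},
]
for start, duration in war_durations(wars):
    print(start, duration)
print(short_wars(wars))
```

With year-level dates, a war that starts and ends in the same year gets a duration of 0, which is what puts a point "on the ground" in the chart.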

Visualisation 2: Defeat Country Map Visualisation vs. Victory Country Map Visualisation:

The first one shows the victorious countries and the second one shows the defeated countries. (It is in Chinese because my operating system is in Chinese, haha.) On the maps, we can find a lot of points in Europe for both victory and defeat. That implies Europeans in the 20th century really loved war. Then, on the defeat map, we can see a lot of points in South America. This might reflect revolutions or decolonisation in South America between 1900 and 1950. It is not a very “surprising fact” but still kind of interesting.

Visualisation 3: Relationship Map between Victory Group and Defeat Group:

In these 2 visualisations, I label victorious countries blue and defeated countries yellow. The size of a node reflects the involvement of that country in war, in either time or space. In order to make the visualisation clearer, I cleaned the data a little bit (wiping out small wars and combining countries that appear under different names). The United States is significantly large in this graph, which explains a little about its dominant position today. China was also involved in a lot of conflicts and wars during the 20th century. But one very interesting fact really caught me: the biggest nodes in the graph are not the WW players but the players in civil wars and regional conflicts. The reason might be that regional conflicts tend to last a lot longer and cost a lot less. It also gives us another insight into war: compared to the world wars, maybe long-lasting civil wars and regional conflicts make people suffer the most. However, this “conclusion” only considers the scale of time, without any analysis of the scale of the wars. It also raises another question about our visualisation work: how can we know that our visualisation is not misleading? I would say this one is a little bit misleading, but it also reveals something. Either way, it reminds us to consider the problem of misleading visualisations when we are doing this kind of work.