Relational Visualization of Heroic Stories


As video game enthusiasts, we have done several projects directly or indirectly related to video games, especially RPGs, since RPGs tend to have more interesting plots than other kinds of games. For the final project, building on this interest, we wanted to dig a little deeper into the topic: since most RPGs tell an epic story through the medium of gaming, we set out to examine how epic stories change over time. By building relational graphs of epic stories with attributes such as gender and camp (that is, the "role" a character plays in the plot, either protagonist or antagonist), we explore how these factors change over time and whether any generalizations can be drawn from the visualizations. The result of our exploration consists of two parts:

  1. A website containing interactive visualizations and some historical background
  2. A datasheet that applies graph theory to our data (shared via Google Drive)

Data Collection and Analysis

Raw Data Collection:

The very first step of our final project was to collect raw data for later processing. Fortunately, all of the books we wanted to measure have a modern English translation, so we could use text-analysis software we are familiar with. We organized the books into five time periods: pre-1850, 1850-1900, 1900-1950, 1950-2000, and 2000-present, and then chose books published in each period. We decided on three books per period; the website lists all of the titles we used. After finding the raw texts on the internet, we processed them into a useful form with our main tool, Jigsaw. We used Jigsaw to analyze the frequency of connections in each text and exported the results to CSV spreadsheets that look like this:

The sheet lists each character's name on a line, repeated n times, where n is the number of connections Jigsaw found for that character. We decided that a pair of characters is connected if they co-occur within a small portion of text. Although this is not perfectly accurate, it is a good approximation for general relational analysis.
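The co-occurrence rule above can be sketched in a few lines of Python. The window size, sample text, and character names here are invented for illustration, not taken from our corpus:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, names, window=40):
    """Split the text into consecutive windows of `window` words and
    count, for every pair of names, how many windows contain both."""
    words = text.lower().split()
    edges = Counter()
    for start in range(0, len(words), window):
        chunk = set(words[start:start + window])
        present = [n for n in names if n.lower() in chunk]
        for pair in combinations(sorted(present), 2):
            edges[pair] += 1
    return edges

# Invented sample text and character names, purely for illustration
sample = ("Dorothy met the Scarecrow on the road " * 5
          + "The Witch watched from afar " * 5)
print(cooccurrence_edges(sample, ["Dorothy", "Scarecrow", "Witch"], window=20))
```

The pair counts from such a sketch are what the repeated-name CSV rows encode.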

Once the relational CSV file was done, we built the characters' attributes: gender and "goodness" (whether the character works against the good will of the story or not), and saved them as a second CSV sheet.

Data Processing Using Gephi
With the Gephi platform, we further polished our data. Since the connection data and the attribute data live in separate files, we used Gephi to perform a natural join and merge them, so that each character carries both its attributes and its connections. Following the common schema of relations as edges and characters as nodes, we used Gephi's built-in graph-theory methods to analyze influence, including each character's degree, centrality, and modularity. After processing these variables, we found that modularity is relatively low across our data set, meaning the characters are generally all connected together in some way; a notable exception is Paradise Lost, whose modularity is somewhat higher than that of the other stories. That is predictable, since in the actual story the good and the evil sides are fairly separate. Based on our knowledge of graph theory, we decided to rank the character nodes by degree. We did not choose modularity as the size factor for the nodes because it is not an unbiased variable, and it is not a direct measure of influence. We therefore chose degree and average weighted degree as the main factors for ranking the nodes, and used centrality to group them in the final visualization. For the layout we used "Yifan Hu Proportional" with an optimal distance of 1000. The following picture shows the general appearance of our visualization, and suggests that grouping by centrality works well.

(The Wonderful Wizard of Oz Character visualization)
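Gephi computes these statistics for us, but as a rough pure-Python illustration of the degree and weighted-degree ranking, over an invented toy edge list:

```python
from collections import defaultdict

def degrees(edges):
    """edges: iterable of (source, target, weight) tuples.
    Returns {node: (degree, weighted_degree)}."""
    deg = defaultdict(int)
    wdeg = defaultdict(float)
    for a, b, w in edges:
        for node in (a, b):
            deg[node] += 1
            wdeg[node] += w
    return {n: (deg[n], wdeg[n]) for n in deg}

# Invented toy edge list (not our real exported data)
toy = [("Dorothy", "Toto", 12), ("Dorothy", "Scarecrow", 8),
       ("Scarecrow", "Tin Man", 5)]
ranked = sorted(degrees(toy).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # the node with the highest (degree, weighted degree)
```

Ranking by the pair (degree, weighted degree) breaks ties the same way Gephi's node-size ranking can be configured to.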



Data Visualization and Its Presentation

We use the previous visualization, now shown in detail, to illustrate our visualizations.

We can see that Dorothy, the viewpoint character of the story, shows the most connections and the highest centrality in the whole visualization. We can also see that the green girl, located in the top-left corner of the left figure, is a character with a low degree and low centrality, meaning she does not have much influence over the story. This screenshot was taken when the user clicked the node representing Dorothy; as a result, detailed information, including the character's statistics, is displayed on the left of the screen. The user can click any node and get the same kind of information for other characters. This might be considered information overload, but we decided to keep it so that people can reach their own answers to their (or our) questions while exploring the website, in the spirit of a good reader-driven visualization. To group all our visualizations in a meaningful way, we chose the TimelineJS platform to gather them on a uniform timeline and placed the five periods into it. For each period we first give the user historical background in quotes and provide links to each visualization we made. We also highlighted one visualization per period so the user can interact with it directly.

(Screen shot of our website)



Influence Dataset with Graph Theory Analysis

We developed a method for analyzing different types of characters in epic stories; this section introduces that method. We used the same dataset to perform this analysis.

Step 1: Influence data with respect to each book

Based on our research objective, we first computed the influence data for each book. Using a Python script, we traverse the relational and character sheets and assign the key factor, degree, to each character. The following is a screenshot of part of the script. The influence data is calculated by the script and stored in a separate file.
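Since the screenshot does not reproduce here, the following is a minimal sketch of what such a counting script might look like, assuming the relational sheet repeats a character's name once per connection; the column name and sample rows are invented:

```python
import csv
import io
from collections import Counter

def raw_influence(csv_text, column="character"):
    """Count how many times each name appears in the relational sheet;
    one row per connection, so the count is the raw degree."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in rows)

# Invented miniature sheet in the repeated-name format described above
sheet = "character\nDorothy\nDorothy\nToto\nDorothy\nScarecrow\n"
print(raw_influence(sheet))  # Dorothy appears three times -> raw degree 3
```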


We know that different books have different sizes; generally, a foot-thick book provides far more connections between characters than an inch-thick one. To make the data usable and to compare characters across books, we standardized it before performing comparisons.

Step 2: Standardize Data

In statistics, standardization is a proper solution to the book-size problem. Applying the factor (X - Average) / StdDev, the influence data of each book is standardized. The following screenshot shows a piece of the data we collected and processed: id is the character's name, InfluRaw is the unstandardized value, and the Influence column is the standardized value, representing the character's number of connections relative to the other characters.
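The standardization step is a plain z-score; a minimal sketch, using Python's statistics module with the population standard deviation and invented raw counts, is:

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: (x - mean) / standard deviation, per book."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # invented raw connection counts
print(standardize(raw))
```

After this transform, each book's influence values have mean 0 and standard deviation 1, so characters from thick and thin books can be compared directly.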


Step 3: Combine Data Together

After standardizing the data, we combined everything: we merged the numerical influence data with the categorical raw data from the different time periods. Since both files share a common name field, this is an easy combination using Gephi's natural-join utility. As shown on the right, each character's attributes are now connected to the numerical influence data we computed earlier.

Step 4: Compare data from different eras

Using another Python script, we computed the overall count of each categorical value and the average influence value per era. Grouping these into one comprehensive spreadsheet let us perform the final analysis.
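The per-era roll-up in this step amounts to a grouped average; a sketch of how such a script might work, with invented records and category names, is:

```python
from collections import defaultdict

def average_by(records, group_keys):
    """Group records by the given keys and average their influence."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = tuple(r[k] for k in group_keys)
        sums[key] += r["influence"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Invented records; the real input is the standardized per-book data
records = [
    {"era": "1900-1950", "gender": "F", "influence": 1.2},
    {"era": "1900-1950", "gender": "F", "influence": 0.8},
    {"era": "1900-1950", "gender": "M", "influence": -0.5},
]
print(average_by(records, ["era", "gender"]))
```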

(Raw data and plots)


In the previous plots, the top two are based on the raw data, whereas the bottom ones are based on the standardized data. As we can clearly see, using the raw data alone produces a very different result, which also shows the influence of book size. The standardized plots, as mentioned, represent the average influence of each type of character: we are not analyzing specific characters but rather their types, namely their gender and goodness.

For the gender analysis, women's influence (the gray line) increased significantly in the 1900-1950 era, which may correspond to the wars of that time, when people found women figures more attractive as characters. Interestingly, although the famous feminist movement happened in the 1950-2000 era, women's influence dropped dramatically while men's influence (the orange line) rose dramatically, the first time men's influence overtook women's. It might be because more women figures appear in the stories, lowering the average influence per figure. Unfortunately, we could not see any clear trend in the camp data, which is clearly one failure of our project; I discuss it in the next section.


Success and Failure


  Successes:

  1. The gender trend plot seems meaningful; we could expand the topic and do more research to reveal further facts.
  2. A clear relational visualization, with potential for future use.
  3. The website, thanks to the TimelineJS platform, is visually rich.
  4. We applied graph theory and got good results in our visualization.

  Failures:

  1. The dataset is big; we even planned to do five books per period, which left us unable to dig deeper into the data (even with three books per period).
  2. The "goodness" analysis does not show a clear trend. One problem is that dividing a character into either black or white is difficult, since a character may display two conflicting characteristics in the same book; maybe we need a better division method. It could also be that using only three books per period, although a good amount of work for us, is not representative overall. Or the problem may lie in our period division: perhaps a 50-year split does not correspond to the development of epic stories.

Final Project Reflection

For the final project, Bobby and I chose to analyze the transcripts of Presidential and Vice Presidential debates from 1960 to 2012. While viewers may be familiar with the names of Presidential candidates, it can be difficult to keep up with all of their political stances, regardless of which political party one associates with most. Though debates can take place over the course of multiple hours, it remains difficult for the average viewer to fully grasp a candidate's stance on a number of issues. Topics may range from domestic affairs, such as education and healthcare, to foreign-policy matters such as terrorism, cyber threats, and drugs, and candidates are rarely allotted the time necessary to express their opinions eloquently. Rather than assess a candidate's competency on these topics, debates serve as a platform to evaluate how candidates perform under the pressure of the national spotlight. As an alternative to watching long debates or reading lengthy transcripts, data-visualization platforms give the reader the opportunity to quickly survey the topics addressed by individual winning and losing candidates and examine the vocabulary they used. After conducting structural and vocabulary analysis of the most recent Presidential debate, Mitt Romney vs. Barack Obama, it was fascinating to learn how the two stressed different ideas in later debates compared to earlier ones. It was at this point that we decided this was the avenue we wished to explore further.

Prior to continuing, it is important to acknowledge the deficiencies of these data-visualization platforms. Visualizations fail to display a candidate's demeanor when speaking, or how they present themselves to the audience. Illustrating the importance of image, after the 1960 John F. Kennedy vs. Richard Nixon debate, Nixon, the ultimate loser, stated, "I should have remembered that a picture is worth a thousand words." Nixon's comment reveals his regret at not taking the stage in a presentable manner. In addition to failing to account for facial expressions and body language, visualization platforms also do not give the reader context about the state of the country at the time of the debate. As such, Bobby and I sought to compensate for these faults by providing the reader with analysis across three different platforms, Jigsaw, Gephi, and Voyant, along with context on the time period of each debate, collectively providing different analytic perspectives. The majority of our analysis came from Voyant and Gephi outputs. The interactive visualizations allowed us to organize the vocabulary used by the winners and losers of each election. Readers would only be hurting themselves if they attempted to analyze the visualizations before contextualizing the debate. To stress the importance of contextualizing debates, one need only refer to the most recent Democratic primary debate between Hillary Clinton, Bernie Sanders, and Martin O'Malley. Prior to the November 14, 2015 debate, terrorist attacks were carried out in Paris, France. As a result, the debate centered on terrorism, gun control, and a number of foreign-policy issues on which Hillary Clinton was indisputably the most knowledgeable. Had the reader attempted to compare this debate to a past or future primary debate, the emphasis on these issues would stand out considerably.

On our website, we created four tabs, including an explanation of our iterative research process and the construction of our visualizations, our home visualization page, and a works-cited section. We believed it would be most informative to create a combination of interactive and static visualizations. Within the iterative-research-process section, it was important to explain how the direction of our research drastically changed from our original plan. The structure of the visualization consists of a timeline that the reader may explore at his or her own pace, with information including notable events that provide context to each debate. In addition, below the descriptions of the debates are links to the Gephi visualizations, permitting the reader to refer back to the Voyant visualizations or the background information, or to view all three simultaneously. The transcripts were parsed and separated by candidate, omitting any language that may indicate who is speaking and any comments relayed by the moderator. Afterwards, the transcripts were analyzed in Jigsaw with the intent to discover interesting trends in sentiment and entity analysis. Next, the transcripts were uploaded into Voyant to analyze sentence structure and word frequency. Lastly, the transcripts were input into Gephi to construct a network visualization of the vocabulary used, separating vocabulary spoken by winning and losing candidates. The combination of Jigsaw, Voyant, and Gephi provided different forms of analysis, which in totality revealed not only the issues winning and losing candidates tended to focus on, but also the specific vocabulary words they used.
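The transcript-splitting step can be sketched roughly as follows; the speaker-tag format, moderator name, and demo lines are invented examples of the common transcript style, not our actual source files:

```python
import re
from collections import defaultdict

# Hypothetical "NAME: utterance" speaker tag at the start of a line
SPEAKER = re.compile(r"^([A-Z'\-]+):\s*(.*)$")

def split_by_speaker(transcript, moderators=("MODERATOR",)):
    """Group each speaker's lines into one text, dropping moderators."""
    speeches = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        m = SPEAKER.match(line.strip())
        if m:
            current, text = m.group(1), m.group(2)
        else:
            text = line.strip()  # continuation of the current speaker
        if current and current not in moderators and text:
            speeches[current].append(text)
    return {name: " ".join(parts) for name, parts in speeches.items()}

demo = ("MODERATOR: First question.\nOBAMA: Thank you.\n"
        "ROMNEY: Good evening.\nAnd thank you all.")
print(split_by_speaker(demo))
```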

One of the biggest dilemmas we faced was deciding whether stop words should be applied in the Voyant visualizations. On the one hand, it can be argued that using stop words tampers with the debate transcripts. Conversely, it can be argued that the stop-word feature removes common words that do not hold much overall significance. In addition to stop words, as stated on the website, Gephi could not produce useful visualizations when we entered all of the winning or losing candidates' data at once, so we separated them by individual candidate. It was also important to acknowledge that visualizations cannot always explain the reasons behind a candidate's victory. Rather than word choice or attention to certain topics, some candidates, especially those who lost by a marginal amount, may credit a large part of their loss to their demeanor and actions on stage. When this was the case, it was important to conduct outside research to discover other areas where candidates may have hurt themselves.

The combination of a Humanities and a Computer Science student working together on this project ultimately worked out extremely well. Following its completion, I have come to understand that in order to fully maximize the opportunities for discovery that data visualization presents, a team of scholars from different disciplines should be assembled, allowing different perspectives to be voiced. For example, while a Humanities student may not be as adept with particular software, computer shortcuts, and so on, they can contribute a comparative analysis of texts across different disciplines and visualizations, and frame their argument persuasively. On the other hand, the ability of Computer Science students to manipulate and organize information visually in a multitude of ways will always be a strong asset to text analysis in any research area. Had a Political Science student joined the team, their expertise might have taken the project in new directions.

Assignment #6 – JZ

Our research topic is "How President Obama's speeches changed over the period from 2008 to 2012", and we would like to do linguistic analysis of his speeches along several dimensions: location, time, audience, and topic.

Our dataset is built from Barack Obama's speeches from 2008 to 2012. We chose this time period because many events happened during these five years, including the economic crisis, the presidential election, the violence in Libya, and others. Meanwhile, since we wanted to use Gephi to visualize all the words Obama used in his speeches, five years' worth of words seemed the most suitable amount. We first copied and pasted each speech into an individual txt file to create the corpus, and saved the files separately by year. We also classified each txt file into groups by location to build the map, and by audience for the word-usage analysis.

Then we browsed each speech and recorded its location, topic, audience, and year. The locations and years were easy to find, but the topics and audiences were not: we found the topics broad and the audiences hard to identify while classifying the speeches. To avoid diluting the information across too many categories, we grouped topics into Economic, Social, Security, Political, and Military, and audiences into Public, Student, Military, and Politician.

The first analysis we decided to do was the word usage of the speeches with Voyant. We put each year's corpus into Voyant and made five word clouds to find the key words of the five years. Looking at the word clouds, we found clues that Obama's speeches are closely related to the social issues of each year. For example, in the word clouds of 2008 and 2009, we can see words like economy, work, and crisis, related to the economic crisis. In 2011, Obama's speeches are more about security and rights because of the violence in Libya. In 2012, the presidential election plays an important role, so we can see words like president, Romney, and governor.

We also made word clouds of the speeches classified by audience. What we found interesting is that Obama's speeches directed at students are mostly about the economy: the words "economy" and "financial" appear many times.


Therefore, we analyzed the relationship between topic and audience using Google Fusion Tables. In our visualization, blue dots represent audiences, while yellow dots represent topics. If we focus on the blue dot labeled Student and the yellow dots connected to it, the line connecting Student and Economic is much stronger than the other lines. This further supports that when Obama speaks to students, he discusses economic issues more.


Meanwhile, we counted the number of distinct words Obama used for different audience groups, and found that he used more vocabulary when talking to politicians and students than when talking to minorities and the military. However, this conclusion may be biased: there are more speeches directed at the first two groups, which offers a larger base of vocabulary usage.
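Counting distinct words per audience group, and normalizing by speech length to address the bias just mentioned, could be sketched as follows; the snippets are invented placeholders, not actual transcript text:

```python
def vocab_stats(speeches_by_group):
    """Distinct-word count per group, plus distinct words per 1000
    tokens as a rough correction for unequal amounts of text."""
    stats = {}
    for group, texts in speeches_by_group.items():
        tokens = " ".join(texts).lower().split()
        types = set(tokens)
        stats[group] = (len(types), 1000 * len(types) / len(tokens))
    return stats

# Invented placeholder snippets, not actual transcript text
demo = {
    "Student": ["the economy the economy jobs", "financial aid and jobs"],
    "Military": ["honor duty"],
}
print(vocab_stats(demo))
```

A length-normalized measure like this (or a proper type-token statistic) would let groups with very different numbers of speeches be compared more fairly.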

We also used Google Fusion Tables to map the locations where Obama gave speeches from 2008 to 2012. We found that Obama never gave speeches in Africa or in some East Asian countries, including China. We found this interesting because it was a little surprising to us that Obama avoided speaking in these countries at a time when global networking is much stronger.

Then we moved our attention to the speeches made inside the U.S. We collected all the words spoken in each state and used Voyant on each state's corpus to find the most frequent word Obama used there. We believed the most frequent words would best represent the relationship between topics and locations. Inspired by the visualization in the Du Bois show, we created a similar map. To avoid overlapping words, we placed the states with more speeches, like D.C., first, and disregarded words like "people" that are the most frequent in many speeches. To better understand the relationship between location and topic, we used Google Fusion Tables to create a chart in which blue dots represent locations and yellow dots represent topics. Looking at the two visualizations together, we found they are consistent with each other. For example, the key word in New York is "Romney", and many of the locations connected to the Politic dot are cities in New York; that is because many speeches on the presidential election took place in New York in 2012. Although it shows only a weak relationship between topics and locations, we find the map extremely interesting and attractive, because the words on the map tell some beautiful stories. For example, through the word "Father" located in Indiana, we felt President Obama's happiness at being a father and speaking as one.
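The per-state keyword step amounts to a stop-word-filtered frequency count; a sketch, with an invented stop list and miniature per-state corpora, is:

```python
from collections import Counter

# Invented stop list; the real analysis dropped very common words
# such as "people" that top nearly every speech
STOP = {"the", "and", "people", "of", "to", "a", "we"}

def top_word(text):
    """Most frequent non-stop word in a state's corpus."""
    counts = Counter(w for w in text.lower().split() if w not in STOP)
    return counts.most_common(1)[0][0]

# Invented miniature per-state corpora
by_state = {
    "New York": "romney romney the election people",
    "Indiana": "father the father we people",
}
print({state: top_word(text) for state, text in by_state.items()})
```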

The last visualization we did consists of all the individual words. We used Gephi to make a round pattern: the closer a word is to the center, the more frequently Obama used it. We also used different colors to indicate words used in different years, and the thickness of each ring represents how many words that year contributed. For example, the thinnest ring belongs to 2008, because we found only 4,000 words in 2008; compared to the more than 60,000 words of 2011, the 2008 ring is hard to see. We also noticed an interesting detail: several words were used only once during the five years, and they sit outside the main round pattern.

We created the website, including the visualizations we made, to show the audience a more organized and clear account of our research on the code-switching strategies Obama used. Through this project, we think our visualizations go some way toward answering our research questions about Obama's speeches: we better understand President Obama's vocabulary use and topic choices based on location and audience.