Final Project Reflection

For the final project, Bobby and I chose to analyze the transcripts of Presidential and Vice Presidential debates from 1958 to 2012. While viewers may be familiar with the names of Presidential candidates, it can be difficult to keep up with the political stances of all of them, regardless of which political party one identifies with most. Though debates can run for multiple hours, it remains difficult for the average viewer to fully grasp a candidate’s stance on a number of issues. Topics may range from domestic affairs (education and healthcare) to foreign policy matters (terrorism, cyber threats, and drugs), and candidates are rarely allotted the time necessary to express their opinions eloquently. Rather than assessing a candidate’s competence on the aforementioned topics, debates serve as a platform to evaluate how candidates perform under the pressure of the national spotlight. As an alternative to watching long debates or reading lengthy transcripts, data visualization platforms give the reader the opportunity to quickly see which topics individual winning and losing candidates addressed, and to examine the vocabulary they used. After conducting structural and vocabulary analysis of the most recent Presidential debates, Mitt Romney vs. Barack Obama, it was fascinating to learn how the two stressed different ideas in later debates in comparison to earlier ones. It was at this point that we decided this was the avenue we wished to explore further.

Prior to continuing, it is important to acknowledge the deficiencies within these data visualization platforms. Visualizations fail to display a candidate’s demeanor when speaking, or how they present themselves to the audience. Illustrating the importance of image, in the 1960 John F. Kennedy vs. Richard Nixon debate, Nixon, the eventual loser, stated, “I should have remembered that a picture is worth a thousand words.” Nixon’s comments reveal his regret at not taking the stage in a presentable manner. In addition to failing to account for facial expressions and body language, visualization platforms do not provide the reader with context about the state of the country at the time of the debate. As such, Bobby and I sought to compensate for these faults by providing the reader with analysis across three different visualization platforms (Jigsaw, Gephi, and Voyant) and with context about the time period of each debate, which together offer different analytic perspectives. The majority of our analysis came from Voyant and Gephi outputs. The interactive visualizations allowed us to organize the vocabulary used by the winners and losers of each election. Readers would only be hurting themselves if they attempted to analyze the visualizations before contextualizing the debate. To see the importance of contextualizing debates, one need only refer to the most recent Democratic Primary Debate between Hillary Clinton, Bernie Sanders, and Martin O’Malley. Prior to the November 14, 2015 debate, terrorist attacks were carried out in Paris, France. As a result, the debate centered on terrorism, gun control, and a number of foreign policy issues on which Hillary Clinton was indisputably the most knowledgeable. Had the reader attempted to compare this debate to a primary debate in the past or future, the emphasis on these issues would stand out considerably.

On our website, we created four tabs, including an explanation of our iterative research process and the construction of our visualizations, our home visualization page, and a works cited section. We believed it would be most informative to create a combination of interactive and static visualizations. Within the iterative research process section, it was important to explain how the direction of our research changed drastically from our original plan. The structure of the visualization consists of a timeline that the reader may explore at his or her own pace, with information including notable events that provide context to each debate. In addition, below the descriptions of the debates are links to the Gephi visualizations, permitting the reader to refer back to the Voyant visualizations or the background information, or to view all three simultaneously. The transcripts were parsed and separated by candidate, omitting any language that indicates who is speaking and any comments made by the moderator. Afterwards, the transcripts were analyzed in Jigsaw with the intent of discovering trends in sentiment and entity analysis. Next, the transcripts were uploaded into Voyant to analyze sentence structure and word frequency. Lastly, the transcripts were loaded into Gephi to construct a network visualization of the vocabulary used, separating vocabulary spoken by winning and losing candidates. The combination of Jigsaw, Voyant, and Gephi provided different forms of analysis, which together revealed not only the issues winning and losing candidates tended to focus on, but the specific vocabulary they used as well.
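The step of separating a transcript by candidate could be sketched roughly as follows in Python. This is only an illustration, not the exact process we followed: it assumes a hypothetical transcript format in which each turn begins with the speaker's name in capital letters followed by a colon, and the function name `split_by_speaker` and the `moderators` parameter are made up for the example.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) format: a new turn starts with "SPEAKER: text".
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'\-]+):\s*(.*)$")


def split_by_speaker(transcript, moderators=()):
    """Return {speaker: all of that speaker's words}, skipping moderators."""
    turns = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            # A new turn: remember the speaker and drop the label itself.
            current = match.group(1).strip()
            text = match.group(2)
        else:
            # A continuation line belongs to whoever spoke last.
            text = line
        if current is None or current in moderators:
            continue
        if text:
            turns[current].append(text)
    return {speaker: " ".join(parts) for speaker, parts in turns.items()}
```

Calling `split_by_speaker(text, moderators=("LEHRER",))` on a transcript in that format would return one block of text per candidate, with the moderator's turns and the speaker labels left out, ready to be saved as separate files for Jigsaw, Voyant, or Gephi.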

One of the biggest dilemmas we faced was deciding whether stop words should be applied in the Voyant visualizations. On the one hand, it can be argued that by filtering out stop words, the debate transcripts have been tampered with. On the other hand, it can be argued that applying the stop word feature removes common words that do not hold much overall significance. In addition to stop words, as stated on the website, Gephi could not produce useful visualizations when we tried to enter all of the winning or losing candidates’ data at once, so we decided to separate the data by individual candidate. It was also important to acknowledge that visualizations cannot always explain the reasons behind a candidate’s victory. Rather than word choice or the attention dedicated to certain topics, some candidates, especially those who lost by a narrow margin, may attribute a large part of their loss to their demeanor and actions on stage. When this was the case, it was important to conduct outside research to discover other areas where candidates may have hurt themselves.
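To make the stop word trade-off concrete, here is a small Python sketch that counts word frequencies with and without a stop word filter. It is not how Voyant works internally; the tiny `STOP_WORDS` list and the function name `word_frequencies` are assumptions made only for this illustration.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stop word list for illustration only.
STOP_WORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "that",
    "is", "we", "i", "it", "for", "on", "our",
})


def word_frequencies(text, stop_words=frozenset()):
    """Count lowercase words, ignoring any word found in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


sentence = "We need jobs and we need the economy to grow."
raw_counts = word_frequencies(sentence)
filtered_counts = word_frequencies(sentence, STOP_WORDS)
```

In the unfiltered counts, "we" ties with "need" as the most frequent word, while the filtered counts keep only the content words. This is the sense in which the filter both alters the transcript and makes the significant vocabulary easier to see.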

The combination of a Humanities student and a Computer Science student working together on this project ultimately worked out extremely well. Following the completion of this project, I have come to understand that in order to maximize the opportunities for discovery that data visualization presents, a team of scholars from different disciplines should be assembled, allowing different perspectives to be voiced. For example, while a Humanities student may not be as adept with particular software, computer shortcuts, etc., they are able to contribute a comparative analysis of texts and visualizations across different disciplines, and to frame their argument persuasively. On the other hand, the ability of Computer Science students to manipulate and organize information visually in a multitude of ways will always serve as a strong asset to any form of text analysis in any research area. Had a political science student joined the team, their expertise might have taken the project in new directions.

Assignment 5

According to Lima, network visualization must start with a question. For this assignment, I returned to studying Elizabeth Warren. Specifically, I sought to compare the text analysis data I gathered in Assignment 2 to her Facebook page likes. Through a comparative analysis, I hoped to discover a relationship between the speeches she delivered to her audience and the types of organizations that were drawn to like her Facebook page. Google Fusion Tables and Gephi were the two platforms I used to analyze her Facebook likes. Despite several attempts, Palladio would not upload my dataset. However, I imagine the product would be similar to the first visualization Gephi produces when I upload my dataset. Gephi is clearly superior because of the many ways it allows you to visualize the graph (layout, size of nodes, centrality, degree, color, etc.).

Google Fusion Tables is very useful in that it is excellent at summarizing large amounts of data and organizing it into rows similar to Excel, or into a unique feature, “cards.” I found the cards to be the most useful aspect of Fusion Tables when analyzing the Senator’s Facebook likes. In fact, they reminded me of a baseball card, a summary of a player’s statistics, team, batting average, etc. Below is a screenshot of the cards feature:


Visual 1

As visualized, the card includes everything you need to know about each node in the Gephi visualization that pertains to this dataset. It is important to know the name of the node and the industry or organization it belongs to, as this allows the audience to see what types of audiences the Senator attracts. In addition, one can see how many likes each page that has liked Senator Warren’s Facebook page has, allowing the reader to see how popular that page is. The cards feature is a great introduction to what Gephi will visualize.

Each node represents a Facebook page that has liked Senator Warren’s Facebook page. As of today, the Senator has 1,649,577 likes. The beginning stages of my data visualization:

Visual 2

I selected the Fruchterman-Reingold layout, as it is the most appealing to the human eye and is appropriate for this dataset. Lima states, “the aim is not to merely create an algorithm capable of sustaining copious amounts of nodes and links. But also to select the most appropriate scheme based on well-founded design principles and appropriate interactive methods” (Lima 95). In keeping with his comments, to analyze particular nodes and communities, and to make the visualization more informative in general, I ran a series of tests, including ranking nodes by degree, color, modularity, average path length, and size, adding labels, and filtering out nodes that the layout’s repulsion strength pushed to the edges as outliers. These outliers have the highest eccentricity. Together, these configurations resulted in the graph below:


Visual 3


Charts: Communities Size Distribution, Closeness Centrality Distribution, Eccentricity Distribution, Betweenness Centrality Distribution

The larger the nodes, the more connected they are to other nodes. In other words, the largest nodes have been liked most by other Facebook pages that have also liked Elizabeth Warren. The five largest nodes, to no surprise, are Harvard Law School, where the Senator served as a professor, Harvard University, Harvard T.H. Chan School of Public Health, the EPA, and the Department of State. There are three large communities, colored red, purple, and green, and one small community, colored teal. The communities represent Facebook pages that have liked each other the most. Colored communities allow the human eye to easily interpret which groups are most connected. Based on the visualization, the following statistics were obtained: 37.63% of the Facebook pages that have liked Senator Warren’s page are government organizations; 12.9% are education-related organizations; 11.83% are non-profits; and 7.53% are universities. When I filtered out the nodes that are least connected to other nodes (TV Networks 1.08% and Books 1.8%), the visualization became more cohesive. By clicking on the average degree, one can see which node has the most likes. The final product, according to Lima, “should always be a useful depiction able to fulfill its most fundamental promise of communicating relevant information” (95).
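The two measurements discussed above, the share of pages in each category and the number of likes a page receives from other pages in the network, can be sketched in a few lines of Python. Gephi computes these for you; the sketch below is only meant to show the arithmetic, and the function names and the toy data are invented for the example rather than taken from my dataset.

```python
from collections import Counter


def category_shares(categories):
    """Percentage of nodes in each category, rounded to two decimals."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 2) for cat, n in counts.items()}


def in_degrees(edges):
    """Count how many pages like each page; edges are (liker, liked) pairs."""
    return Counter(liked for _liker, liked in edges)


# Toy data, invented for illustration only.
toy_categories = ["Government", "Government", "Education", "Non-profit"]
toy_edges = [("Page A", "Page B"), ("Page C", "Page B"), ("Page B", "Page A")]

shares = category_shares(toy_categories)
likes_received = in_degrees(toy_edges)
```

In the toy example, "Page B" receives two likes and would therefore be drawn as the largest node when nodes are sized by in-degree.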

I’ve learned that Gephi is a great platform for creating network visualizations with two or three categories. Beyond that, it gets very tricky and frustrating. In the future, if possible, it could be interesting to compare the Senator’s top donors, the industries they belong to, and the number of likes those respective industries have on her Facebook page.

Assignment 3

While Voyant and Jigsaw serve as excellent methods of visualizing large amounts of transcribed text, the two platforms are unable to organize large amounts of statistical data. For this, one may choose to use Google Fusion Tables or Palladio to depict datasets with multiple variables.


The following dataset was collected from NASA’s Socioeconomic Data and Applications Center. It depicts population exposure estimates in proximity to nuclear power plants and provides approximations of the total, urban, and rural populations in proximity to these plants. The importance of this dataset lies in the fact that nuclear plants serve as a source of sustainable energy while also posing considerable harm to society and the environment, such as nuclear waste and spills and the development of nuclear weapons.


The visualization below was constructed in Google Fusion Tables. Though the visualization is static, I prefer it to Palladio’s version, as it is more aesthetically pleasing. In Johanna’s terms, this visualization would serve as a representation rather than a knowledge generator. Each dot on a country indicates the presence of a nuclear plant.


Screen Shot 2015-10-06 at 7.44.30 PM

According to the graph, aside from a single plant in South Africa, no other African nations are home to nuclear plants. As a result, they lack any legitimate sources of accessible and renewable energy. In contrast, Westernized countries are home to many nuclear plants, illustrating their development and their efforts to become more environmentally friendly. As is evident below, developed countries seem to have the resources to sustain multiple reactors within their borders.

Screen Shot 2015-10-06 at 5.11.29 PM


Though I enjoyed the graphical visualization, one of my favorite features of Google Fusion Tables was the table element. It organized the information very simply, and rather than having to meticulously search through the Excel document, I could easily scroll or search to find the part of the dataset I was curious about.


Screen Shot 2015-10-07 at 9.55.11 AM


The visualizations constructed below in Palladio illustrate the number of nuclear plants present in each continent. They serve as representations as well.



Screen Shot EUROPE


Europe & Africa:

Screen Shot AFRICA


North America:

Screen Shot North AMERICA


Because North America has a number of nuclear plants, a significant number of people are estimated to be recipients of renewable energy and to be exposed to the potential harms of nuclear plants. Below is a depiction of the number of people exposed to nuclear plants, corresponding to the visualization above.


Screen Shot 2015-10-07 at 12.09.55 AM


Though Google Fusion Tables and Palladio are both extremely useful platforms for organizing statistical data, both serve primarily as representations rather than knowledge generators. It is not the labels or numbers that are of great importance in the visualization, but the lines and connecting nodes. The lines and nodes can be argued to be knowledge generators, as they create a sense of understanding of the visualization and show the importance of using both platforms. Overall, I still find illustrating data via Google Fusion Tables and Palladio to be helpful, as the resulting visualizations are very simple yet informative.