Blog Post 6: Viewing The Political World Through The Spoken Word

When beginning this project, we were tasked with finding a research topic that could accommodate both my background in Computer Science and natural language processing and Adem's work analyzing political speeches, including how a speech's topics are shaped by outside events and how they affect a speaker's reception. It only felt natural that we would analyze political language in some form. After exploring different avenues of political language, we settled on examining the language used by debaters contending for the Presidency of the United States. We made this decision both because the debate transcriptions are readily available and because of our genuine interest in the sorts of findings such a data-set might yield.

Due to the size and nature of the data-set we had accumulated, deciding which avenue of analysis to pursue initially proved difficult. To get a better sense of which aspects of the transcriptions were worth delving into, we began with an initial exploration of our data using the Jigsaw platform, running both entity analysis and sentiment analysis on the transcriptions of candidates who won and lost their respective elections. This proved very useful in our brainstorming phase and helped us shape our project's main research question.

After completing our initial research, we eventually settled on our primary research question: during United States presidential debates, what do winning and losing candidates tend to focus on, and how does their individual vocabulary affect the outcome of the election? While their talking points may be tied to events of the time, is there a clear connection between language use and the election's outcome? We decided on this question after conducting sentence-structure analysis in the Jigsaw platform and noticing the obvious topics (in this case, entities) regularly covered by the candidates. From there we decided to zoom in on individual elections to see how specific events surrounding each election affected the topics discussed, and how this related to the election's outcome.

After we had decided on a specific research topic, we began exploring vocabulary usage and sentence analysis using the Voyant platform, which gave us immediately tangible results from the text transcriptions. One Voyant feature we found quite helpful was its word-cloud creation tool. While word-clouds are often questioned in Digital Humanities for their validity as research instruments, they are used by almost every news outlet that visualizes political speech; in our preliminary survey of scholarly visualizations of political debates, word-clouds were by far the most common form. It's easy to understand why: they are simple to create and give a brief snapshot of a speech. But in terms of lasting conclusions and genuine knowledge generation, these visuals don't offer much; they often merely scratch the surface of a text. We nonetheless decided to use individual word-clouds of each candidate's dialogue for every debate. While these weren't the end-all of our visual exploration of the debate transcripts, they served as a quality static jumping-off point: viewers could get a quick sense of the topics particular candidates dwelt on, then delve deeper into the research themselves, following the martini-glass structure of project presentation described by Edward Segel and Jeffrey Heer.
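Under the hood, a word-cloud is just a rendering of term frequencies. A minimal sketch of the counting step behind our per-candidate clouds (the stopword list here is a hypothetical toy list, not the one Voyant uses):

```python
import re
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is", "we"}

def term_frequencies(transcript, top_n=50):
    """Count the most frequent non-stopword terms in a candidate's dialogue."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

print(term_frequencies("We must secure the future, and the future belongs to all of us."))
```

A word-cloud tool then simply scales each term's font size by its count, which is why such visuals are quick to make but thin on insight.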

In order to create a visualization that is not only pleasing and approachable but also generates knowledge, much in the way Tanya Clement describes in her work, we knew we needed something the user could interact with. To do this, we first took our transcriptions into the Gephi platform to create a network visualization of the vocabulary used by the winners and losers of the presidential debates in our transcripts. We chose a network design because of my previous experience creating language-based networks earlier in the course. In creating the network visualizations, though, we had to decide whether to make multiple networks, one for each election, or one large all-encompassing visualization of overall vocabulary usage split between winners and losers. Given the importance of when these debates took place, scrapping the temporal component of our data entirely would mean losing a large amount of information. But how could we display this time information effectively? Gephi has a timeline tool that lets the user mark nodes and edges with time intervals for a dynamic display. The problem, beyond the finicky nature of the Gephi platform itself, is that our data-set is so large that Gephi would struggle to render it, and even then it would be difficult for a viewer to interpret. So instead we decided to make one overall network of all debates, plus a visualization for each individual election, ordering them all chronologically for the reader to discover in or out of order.
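Gephi can import a network from a simple edge-list CSV with `Source`, `Target`, and `Weight` columns. A sketch of how transcript vocabulary might be turned into such a file (the speaker-to-words mapping shown is illustrative, not our actual pipeline):

```python
import csv
from collections import Counter

def write_gephi_edges(speaker_words, path):
    """speaker_words: {speaker: [word, word, ...]} -> Gephi-importable edge list.

    Each row links a speaker node to a word node, weighted by how often
    the speaker used that word.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for speaker, words in speaker_words.items():
            for word, count in Counter(words).items():
                writer.writerow([speaker, word, count])
```

From there, Gephi's import spigot builds the bipartite speaker-word graph that the layout and coloring steps operate on.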
Using the TimelineJS interface, we were able not only to post links to all of our interactive visualizations created with the Gexf-JS web viewer, but also to add context for each election in the form of events surrounding each time period. This let us frame each visual in a way that would allow the user to draw more educated and informed conclusions from our data.

After constructing our website containing all of our visuals, with links to both the Voyant-created ones and the interactive network ones, we got a better idea of just how our research question might be answered. Looking over each set of visuals, it became abundantly clear that while some terms might be more associated with election winners than others, such as talk of the future and community, the vocabulary that led to success was largely a function of the time in which the debate took place. Whether the world was in the middle of a bout of political or economic turmoil, or the nation was in a period of prosperity, the winning set of terminology varied. This makes sense upon further reflection: what the American people want or need to hear at one moment may be very different from what they require at another. This project has given me an interesting look into the United States political system and how we, as individuals, view those we put in positions of power. Our collective consciousness has a way of fixating on topics such as liberty and security after we've been badly beaten, and on social issues when we are given the time to look inward at our own national needs. But no matter what, what we say and what we want to hear can directly shape how we view the world. That's why we must acknowledge our needs consciously, and choose our words carefully.

Assignment 5 or “The Importance of Being Gephi”

As with Assignment 3, I used the CHILDES corpus of child speech to create the data-sets behind the visualizations displayed below. As before, the data I pooled from had already been separated into categories based on the syntactical level of the speaker; the specific pool I used was from children at the lowest level of capability (and thus the lowest ages as well). Gathering the extra information required for my visualizations meant parsing through the original data and organizing it into sets based on vocabulary use, age, and gender. Given the loosely organized structure of the CHILDES corpus (its structural guidelines are only loosely followed, with few contributors actually making use of all the data tags supplied), getting this information together didn't prove easy, but it ended up being quite fruitful.
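The parsing step above can be sketched in simplified form. CHILDES transcripts use the CHAT format, where child utterances appear on lines prefixed with `*CHI:` and speaker metadata lives in `@ID:` header lines; the snippet below handles only the utterance lines and glosses over the many CHAT annotation codes a real parser must deal with:

```python
import re

def parse_chat_utterances(lines):
    """Pull child utterances (*CHI: ...) from a CHAT-style transcript.

    Skips header lines (@...) and dependent tiers (%mor, %gra, ...),
    and strips trailing utterance terminators like '.', '?', '!'.
    """
    utterances = []
    for line in lines:
        m = re.match(r"\*CHI:\s*(.+)", line)
        if m:
            utterances.append(m.group(1).strip(" .?!"))
    return utterances

sample = [
    "@ID:\teng|sample|CHI|2;0.|female|||Target_Child||",
    "*CHI:\twant more juice .",
    "%mor:\tv|want qn|more n|juice .",
]
print(parse_chat_utterances(sample))
```

Splitting the returned utterances into words, then grouping them by the age and gender recorded in the `@ID:` headers, yields the vocab/age/gender sets used below.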

Age Vocab Visualization

This first visualization was created with the Palladio platform and resulted in an immediately interesting (if complicated) depiction of age versus vocabulary use. As can be observed above, the largest number of utterances came from those who were 1 or 2 years old (represented by the larger nodes on either side of the central structure). Though a good number of words connect these two age sets, as well as the other ages at the top of the central section, the truly interesting thing about this data depiction is just how many words are not connected by common usage among age groups (shown by the large clusters of nodes on the periphery of the visualization). But because Palladio struggles with this much data, inspecting it closely proves difficult, if not impossible. This is still a much more interesting visualization than the ones I was able to create using Google Fusion Tables, which were structurally unclear and muddied, but this picture still leaves something wanting.

Gender Vocab Visualization

This second visualization, depicting the correlation between spoken vocabulary and gender, is immediately more interesting than the previous one because the relationship between each subsection is clearer. The structure of the visual is as intriguing as it is logical. The three large nodes with the majority of connections represent the Male, Female, and Unspecified genders; each is connected to the others while also having its own collection of unique words not shared with them. What is interesting here is just how many more unique words the Female group has than the others. But even though much clearer conclusions can be drawn from this visual, it still lacks interaction and close inspection.

This is where Gephi comes in.


Age Vocab Visualization (Gephi)

The above picture is a static image of a visualization of the same age versus spoken-word data used in the first Palladio visual. While it is readily apparent from this still alone how much cleaner the Gephi visual is than its Palladio counterpart, I suggest you follow this link (Age vs Word Visualization) to an interactive version of this visual to see the true power of a Gephi visualization.

This interactive visualization was created by first making the graph of connections in Gephi, then loading it into the Gexf-JS Web Viewer, a program created specifically for viewing interactive Gephi visualizations online. I originally tried the Sigma JS exporter instead, but that platform was unable to handle the size of my data set. Gexf-JS gives many of the same interactive benefits as Sigma JS while being a bit cleaner in its search interface, responsiveness, and drawing capabilities.

This interactivity lets a user see not only the overall connections and structure of a network, but also the individual connections between items within it. Nodes are colored based on the connection grouping they are most affiliated with: for instance, the word "write" was most used by one-year-olds and is colored blue, while "next" is green because it is mostly connected to the two-year-old age group. From there, the visual is structured by the Fruchterman-Reingold layout, meaning nodes with more connections are centrally located while those along the perimeter are the least connected. As can be seen with the age visualization, this leads to some very interesting layouts that can tell a user far more about a data set than Palladio could, especially for a large data set as hard to read statically as this one. Once you play around with the visual for a little while, you can draw interesting conclusions about word use and overall connectivity that you might not be able to make otherwise. And this is all for a visualization that is inherently muddied in structure; when you get something more structured, things get even more interesting.
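The coloring rule described above — each word node takes the color of the group it is most connected to — amounts to a simple weighted vote over a node's edges. A sketch of that logic, with made-up edge weights for illustration:

```python
from collections import defaultdict

def dominant_group(edges):
    """edges: iterable of (word, group, weight) triples.

    Returns the group each word is most strongly connected to,
    i.e. the grouping that drives the node's color in the visual.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for word, group, weight in edges:
        totals[word][group] += weight
    return {word: max(groups, key=groups.get) for word, groups in totals.items()}

# Hypothetical counts: "write" mostly spoken by one-year-olds,
# "next" mostly by two-year-olds.
edges = [("write", "age 1", 9), ("write", "age 2", 3), ("next", "age 2", 7)]
print(dominant_group(edges))
```

Gephi's modularity-based coloring is more sophisticated than this majority rule, but the intuition, color follows strongest affiliation, is the same.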
Gender Vocab Visualization (Gephi)

This picture shows a static version of the interactive visualization that can be found at the following link (Gender vs Word Visualization). In the visualization, red is associated with Female, blue with Male, and purple with Unspecified genders. While this may look somewhat similar to the visualization made with Palladio, its ability to display the centrality of nodes, and even the overlapping connections of each gender group, makes those features much more readily apparent. We see a lot of overlap between Unspecified and Female (which may suggest that the unspecified data was spoken by female children), as well as strong central connections between all groups. This layout makes exploration much easier and allows for much clearer conclusions.

Overall, I would have to say that my visualizations with Gephi (especially after being exported to a web interface allowing for interactivity) are far superior to those created using Palladio. While Palladio did a good job of showing the degree of node connections within the sets being studied, organizing the data so it centered around the subsections being measured, its structures didn't allow for much exploration, or even insight beyond the top layer of general connectivity. The Gephi visuals are not only easier on the eyes, but also give the degree of each individual word once selected, as well as all connectivity and the relative orderings between nodes. So while measures like eigenvector centrality may have been comparable across the two platforms, what each platform was able to say was drastically different.

This whole process, from data collection, to database construction, to platform testing, has shown just how important iterative design can be. While Gephi is definitely my platform of choice for my data, if I hadn’t done the work prior of testing the data on other platforms, not only would I have had nothing to compare it to, but I also wouldn’t have had as strong a grasp on my data itself. I believe Lima would have supported this process and even actively encouraged it. After all, it helped produce a visualization which not only looks good in its own right, showing some general conclusions about my data sets, but also allows for the user to go out and discover their own conclusions. Because, at the end of the day, visualizations mean nothing without context. So giving the viewer the keys to their own conclusions opens up a new realm of discoveries, even beyond what the original creator could have imagined.



Assignment 3: Networks and the Inter-Connectivity of Child Vocabulary

After my initial findings from Assignment 2, I was curious to see what other conclusions could be drawn from my corpus of children's speech. The most interesting finding from that assignment was the ability to plot the relative relation of individual words by both frequency and spatial relation. It became apparent from these findings, somewhat intuitively, that mapping the spoken utterances to the actual individuals who spoke them would be the next step in visualizing the data set. Instead of merely mapping the words against each other, determining the relation between words spoken and characteristics of the speaker has the potential to lead to some interesting, and hopefully enlightening, conclusions.

Determining which aspects of the speakers to map to word-usage (and how exactly to do it) was initially a challenge, especially in converting the data into CSV format. I contemplated whether individual word frequencies would be a useful metric for analysis, or whether dividing my word data into sub-categories for various aspects of speech would prove more fruitful. As for speaker characteristics, I decided that two of the most general (but also most insightful) factors would be individual age and gender. After parsing back through my original data set to map this gender and age data, I realized that individual word categories might not be as informative as mapping all word-utterances against speaker characteristics instead. While breaking the words up by part of speech or noun type might have been interesting, the connection between overall word-usage seemed likely to make for a stronger visualization as a whole.
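Palladio ingests flat tables, so the conversion step above boils down to emitting one row per (speaker attributes, word) pair. A sketch of that flattening, with hypothetical field names:

```python
import csv
import io

def palladio_rows(records):
    """records: [{'age': ..., 'gender': ..., 'words': [...]}, ...]

    Returns CSV text with one flat Age/Gender/Word row per word uttered,
    the shape Palladio expects for building its network views.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["Age", "Gender", "Word"])
    for rec in records:
        for word in rec["words"]:
            writer.writerow([rec["age"], rec["gender"], word])
    return out.getvalue()

print(palladio_rows([{"age": 2, "gender": "female", "words": ["dog", "ball"]}]))
```

Palladio can then be pointed at either the Age or Gender column as the node dimension against Word, which is exactly how the two visualizations below differ.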

Age Vocab Visualization

This first visualization maps vocabulary usage to the age of the individual speaker. The highlighted nodes represent different ages, while the remaining nodes represent the actual words uttered by individuals of the ages that connect to them. This visualization is very interesting in mapping the intersecting nature of vocabulary and word-usage among different age groups. We see a large concentration of words branching off the two lower-most age nodes (representing the ages of 1 and 2), but also a large amount of intersection between the two. As well, as age goes up, the interconnectedness of vocabulary only grows, with higher age groups clustered together above the lower ones. If this weren't so hard for Palladio to render on its own, I'd be very interested in increasing the data size, with a larger vocabulary and more age groups, to see just how extensive this age-related connectivity really is.

Gender Vocab Visualization

My second visualization maps vocabulary usage to the recorded genders of the individual speakers. I find this visualization particularly interesting in how clearly it conveys the differentiation between the vocabularies of the various genders. While one might intuitively assume that essentially all, or at least a majority, of the vocabulary would be spread evenly between speakers of each gender, we can see that this doesn't appear to be the case. The three recorded gender subsections (male, female, unknown) show a good deal of intersection between them, but an even greater amount of separation in unique vocabulary usage. From the network, we can analyze the varying ways in which individuals of different genders form vocabularies, and where they overlap.

Both of these visualizations, though capable of spawning analysis and conclusions, are more representations than they are knowledge generators. This is largely because, despite the various lines denoting connections between nodes, the actual spatial relation between nodes doesn't carry meaning in itself; it is the connections themselves that have meaning. Because of this, we are able to look upon these visualizations and see a particular mapping of information, but we aren't able to use the mappings themselves to discover some vastly different body of information. The current arrangement of nodes and connections was produced automatically by the Palladio system to better display the central nodes and more clearly represent the connections between each branching path. Nodes on opposite ends of the mapping are no more unrelated than two unconnected nodes sitting side by side. To view networks we must think not in terms of place, but in terms of connection.