Curriculum Visualization in Gephi

For this assignment, I once again chose to visualize Bucknell’s database of course information for the Fall of 2015 so that I could most effectively compare Gephi to Google Fusion Tables. I modified my original dataset to work in Gephi by creating a CSV file containing nodes and a corresponding edges file to draw directed edges between the nodes. Courses and College Core Curriculum (CCC) requirements are represented by nodes, and edges are drawn from a course node to a requirement node. This involved algorithmically generating the edge list to link nodes, which Google Fusion Tables did for me automatically. Although there was additional overhead in developing input data suitable for Gephi, it came with the added benefit of directed edges and the ability to specify a weight for each edge (which I chose to be the number of sections of a course that fill the requirement). I chose to run the Fruchterman Reingold algorithm on my data because Force Atlas created a large clump in the center due to the CCC nodes being heavily linked to other nodes.
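The edge-list step described above can be sketched roughly as follows. This is a minimal illustration and not the script actually used: the course and requirement names are invented placeholders, the `Kind` column is a hypothetical attribute for telling the two node types apart, and the input layout (one row per course section and requirement) is an assumption. The `Id`, `Label`, `Source`, `Target`, `Type`, and `Weight` headers are the column names Gephi's spreadsheet importer recognizes.

```python
import csv
import io
from collections import Counter

# Hypothetical input: one row per (course, section, requirement filled).
# All names are placeholders, not values from the Bucknell data.
# A requirement of None marks a course that fills no CCC requirement.
sections = [
    ("DEPT 101", "01", "REQ-A"),
    ("DEPT 101", "02", "REQ-A"),
    ("DEPT 101", "01", "REQ-B"),
    ("DEPT 205", "01", "REQ-A"),
    ("DEPT 310", "01", None),
]

def build_gephi_tables(rows):
    """Return (nodes, edges) rows for a Gephi nodes file and edges file."""
    # Edge weight = number of sections of a course that fill the requirement.
    weights = Counter((course, req) for course, _sec, req in rows if req is not None)
    courses = sorted({course for course, _sec, _req in rows})
    reqs = sorted({req for _course, req in weights})
    nodes = [(c, c, "course") for c in courses]
    nodes += [(r, r, "requirement") for r in reqs]
    edges = [(c, r, "Directed", w) for (c, r), w in sorted(weights.items())]
    return nodes, edges

def to_csv(header, rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

nodes, edges = build_gephi_tables(sections)
nodes_csv = to_csv(["Id", "Label", "Kind"], nodes)
edges_csv = to_csv(["Source", "Target", "Type", "Weight"], edges)
print(edges_csv)
```

Courses that fill no requirement still get a node but no edge, which is consistent with such courses ending up isolated in the layout.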

After partitioning the nodes by department, running the Modularity analysis on my graph, and reducing the edge size, I was able to create a very attractive visualization. The Fruchterman Reingold algorithm placed courses that do not meet any CCC requirements around the outer edge and beautifully interwove the remaining data in the center of the graph. The center consists of clusters of high-degree CCC nodes with large numbers of course nodes directed toward them. Due to the structure of my input data, the CCC nodes with the highest degree are also the nodes with the largest betweenness, since every edge comes from a course node and ends at a CCC node, which results in the maximum path length being one. Eigenvector centrality yields a similar result, as my input data is not complex enough to yield insight beyond in-degree and out-degree. In future iterations, I plan on modifying the input data to make better use of these metrics.
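A small sketch can make the point about path length concrete. It assumes the edges are treated as directed and uses invented placeholder names and weights rather than the real course data. When every edge runs from a course to a requirement, no node is both a source and a target, so no node can sit in the middle of a longer path, and ranking the requirement nodes comes down to counting incoming edges.

```python
from collections import defaultdict

# Hypothetical weighted edges (course -> requirement, number of sections).
edges = [
    ("DEPT 101", "REQ-A", 2),
    ("DEPT 101", "REQ-B", 1),
    ("DEPT 205", "REQ-A", 1),
]

in_degree = defaultdict(int)           # number of courses pointing at a node
weighted_in_degree = defaultdict(int)  # total sections pointing at a node
sources, targets = set(), set()

for source, target, weight in edges:
    in_degree[target] += 1
    weighted_in_degree[target] += weight
    sources.add(source)
    targets.add(target)

# A node can only lie between two others on a directed path if it has
# both an incoming and an outgoing edge.
possible_intermediaries = sources & targets

# Rank requirement nodes by in-degree, breaking ties by name.
ranking = sorted(in_degree, key=lambda node: (-in_degree[node], node))

print(ranking)
print(dict(weighted_in_degree))
print(possible_intermediaries)
```

With an empty set of possible intermediaries, every directed path has length one, which is why the path-based metrics add little beyond degree for this input.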

After turning on labels for nodes, the aesthetic appeal is reduced dramatically because, unlike Google Fusion Tables, Gephi does not selectively label nodes based on their size. Google Fusion Tables is able to dynamically resize labels and toggle their visibility based on your zoom level. Overall, although Gephi was able to create a more aesthetically pleasing visualization, I found that Google Fusion Tables made it easier to explore the data and its connections, especially with its filtering abilities. However, I believe I can create a more functional Gephi visualization with some modification to the input data so that I can make better use of the software’s advanced analysis tools. I would like to find ways to resize nodes and distinguish CCC nodes from course nodes, which I was able to implement in Google Fusion Tables. One downside I experienced with Google Fusion Tables was that it would automatically hide nodes it deemed to be insignificant as a way to provide a more organized view of the data. Gephi offers the flexibility to keep all nodes and reorganize them as needed.

I believe that this visualization meets some of Lima’s requirements for networks. This is a new, unique visualization, since we typically see course data in a table view, so it creates the potential to generate new insights into the Bucknell curriculum. The graph clarifies our understanding of relationships between nodes by drawing the relevant edges, color coding nodes by department, and using network algorithms such as Fruchterman Reingold to create organized clusters of nodes. This graph makes it easy for people to see the outliers around the outer edge as well as the high-degree nodes in the center. Both types of nodes can have high levels of significance, so it helps that Gephi keeps all nodes in the visualization, even if they may not seem important. I did not find that this visualization greatly expanded my knowledge of the data, since it mostly provides similar information to what was already discovered in previous network graphs of the data. On the other hand, Gephi does a fantastic job of creating aesthetically pleasing visualizations that look like art. I’m excited to expand on my work and create more complex graphs that unlock additional insight into my data.

Assignment 5 or “The Importance of Being Gephi”

As with Assignment 3, I used the CHILDES corpus of child speech to create the datasets behind the visualizations displayed below. As before, the data I drew from had already been separated into categories based on the syntactic level of the speaker. The specific pool I used came from children at the lowest level of capability (and thus the youngest ages as well). Getting the extra information required for my visualizations consisted of parsing the original data and organizing it into sets based on vocabulary use, age, and gender. Because the CHILDES corpus is loosely organized (it has only loose structural guidelines, and few contributors actually use all of the data tags supplied), getting this information together wasn’t easy, but it ended up being quite fruitful.

Age Vocab Visualization

This first visualization was created using the Palladio platform and resulted in an immediately interesting (if complicated) visualization of age versus vocabulary use. As can be observed above, the largest number of utterances came from those who were 1 or 2 years old (as represented by the larger nodes on either side of the central structure). Though a good number of words connect these two age sets to each other, as well as to the other ages at the top of the central section, the truly interesting thing about this depiction is just how many words are not connected by common usage among age groups (as shown by the large clusters of nodes on the periphery of the visualization). But because of the sheer amount of data, and because Palladio has trouble handling this much of it, trying to inspect it closely proves difficult, if not impossible. This is still a much more interesting visualization than the ones that I was able to create using Google Fusion Tables, which were structurally unclear and muddied, but the picture nonetheless leaves something wanting.

Gender Vocab Visualization

This second visualization, depicting the correlation between spoken vocabulary and gender, is immediately more interesting than the previous one because the relationship between each subsection is so much clearer. The structure of the visual is as intriguing as it is logical. The three large nodes with a majority of connections represent the Male, Female, and Unspecified genders. Each is connected to the others and also has its own collection of unique words not shared with the others. What is interesting about this is just how many more unique words the Female gender has than the others. But even though there are much clearer conclusions to be drawn from this visual, it’s still lacking in the way of interaction and close inspection.

This is where Gephi comes in.


age visual gephi

The above picture is a static image of a visualization of the same age versus spoken word data used in the first Palladio visual. While it is readily apparent from this still alone how much cleaner the Gephi visual is than its Palladio counterpart, I suggest that you follow this link (Age vs Word Visualization), which will take you to an interactive version of this visual, to see the true power of a Gephi visualization.

This interactive visualization was created by first making the graph of connections in Gephi and then loading it into the Gexf-JS Web Viewer, a program created specifically for viewing interactive Gephi visualizations online. I originally tried using the Sigma JS exporter instead, but the platform was unable to handle the size of my data set. Gexf-JS gives many of the same interactive benefits as Sigma JS, while also being a bit cleaner in its search interface, its responsiveness, and its drawing capabilities.

This interactivity gives a user the ability to see not only the overall connections and structure of a network, but also the individual connections between items in the network itself. Nodes are colored based on the connection grouping that they are most affiliated with. For instance, the word “write” was most used by those of 1 year of age and is colored blue, while “next” is green because it’s mostly connected to the 2-year-old age group. From there, the visual is structured by the Fruchterman Reingold layout, meaning that nodes with more connections are centrally located and those along the perimeter are the least connected. As can be seen with the age visualization, this leads to some very interesting layouts which can tell a user a lot more about a set of data than Palladio was able to, especially in the case of a large data set that is as hard to read statically as this one. Once you play around with the visual for a little bit, you can draw more interesting conclusions about word use and overall connectivity than you might be able to otherwise. Moreover, this is all for a visualization which is inherently muddied in structure. When you get something more structured, things get even more interesting.
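One simple rule that would reproduce the coloring described above is to give each word the color of the group it is used with most often. The sketch below is only an assumption about that rule, not a description of how Gephi or the Gexf-JS viewer assigns colors internally. The counts are invented for illustration, and `dominant_group_colors` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (word, age group, utterance count) rows; counts are invented.
usage = [
    ("write", "1 year", 5),
    ("write", "2 years", 2),
    ("next", "1 year", 1),
    ("next", "2 years", 4),
]

# Colors follow the example in the text: blue for 1 year, green for 2 years.
palette = {"1 year": "blue", "2 years": "green"}

def dominant_group_colors(rows, colors):
    """Map each word to the color of the group it is most connected to."""
    totals = defaultdict(lambda: defaultdict(int))
    for word, group, count in rows:
        totals[word][group] += count
    result = {}
    for word, by_group in totals.items():
        # Highest count wins; ties are broken alphabetically by group name.
        best = min(by_group, key=lambda g: (-by_group[g], g))
        result[word] = colors[best]
    return result

node_colors = dominant_group_colors(usage, palette)
print(node_colors)
```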
gender visual gephi

The above picture shows a static version of the interactive visualization, which can be found at the following link (Gender vs Word Visualization). In the visualization, red is associated with Female, blue with Male, and purple with Unspecified genders. While this may be somewhat similar to the visualization made via Palladio, its ability to display the centrality of nodes, and even the overlapping connections of each gender group, makes the structure much more readily apparent. We see a lot of overlap between Unspecified and Female (which may suggest that this data was spoken by female children) as well as strong central connections between all groups. This layout makes exploration much easier and allows for much clearer conclusions.

Overall, I would have to say that my visualizations with Gephi (especially after being exported to a web interface allowing for interactivity) are far superior to those created using Palladio. While Palladio did a good job of showing the degree of node connections within the sets being studied, organizing the data so that it centered around the subsections being measured, these structures didn’t allow for much exploration or even insight beyond the top layer of general connectivity. The Gephi visuals are not only easier on the eyes, but also give the degree of each individual word once selected, as well as all connectivity and the relative orderings between each node. So while the general eigenvector results may have been similar, what each platform was able to say was drastically different.

This whole process, from data collection to database construction to platform testing, has shown just how important iterative design can be. While Gephi is definitely my platform of choice for my data, if I hadn’t done the prior work of testing the data on other platforms, not only would I have had nothing to compare it to, but I also wouldn’t have had as strong a grasp on my data itself. I believe Lima would have supported this process and even actively encouraged it. After all, it helped produce a visualization which not only looks good in its own right, showing some general conclusions about my data sets, but also allows users to go out and discover their own conclusions. At the end of the day, visualizations mean nothing without context. Giving viewers the keys to their own conclusions opens up a new realm of discoveries, even beyond what the original creator could have imagined.