Curriculum Visualization in Gephi

For this assignment, I once again chose to visualize Bucknell’s database of course information for the Fall of 2015, so that I can most effectively compare Gephi to Google Fusion Tables. I modified my original dataset to work in Gephi by creating a CSV file containing nodes and a corresponding edges file to draw directed edges between the nodes. Courses and College Core Curriculum (CCC) requirements are represented by nodes, and edges are drawn from a course node to a requirement node. This involved algorithmically generating the edge list to link nodes, which Google Fusion Tables did for me automatically. Although there was additional overhead in developing input data suitable for Gephi, it came with the added benefit of directed edges and the ability to specify a weight for each edge (which I chose to be the number of sections of a course that fill the requirement). I chose to run the Fruchterman Reingold algorithm on my data because Force Atlas created a large clump in the center due to the CCC nodes being heavily linked to other nodes.
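A minimal sketch of how the node and edge files described above could be generated is shown here. It is not necessarily the script used for this post. It assumes the input is a list of (course, requirement) pairs with one pair per section, and the names `COURSE_A`, `REQ_X`, and the `Kind` column are hypothetical placeholders. The `Id`, `Label`, `Source`, `Target`, `Type`, and `Weight` headers follow the column names Gephi's spreadsheet import looks for.

```python
import csv
import io
from collections import Counter


def build_gephi_tables(rows):
    """Turn (course, requirement) pairs into Gephi node and edge rows.

    Each pair stands for one section of a course; an empty requirement
    means the course fills no requirement but is still kept as a node.
    """
    weights = Counter()
    nodes = {}
    for course, requirement in rows:
        nodes.setdefault(course, "course")
        if requirement:
            nodes.setdefault(requirement, "requirement")
            # Edge weight = number of sections that fill the requirement.
            weights[(course, requirement)] += 1
    node_rows = [(node_id, node_id, kind) for node_id, kind in nodes.items()]
    edge_rows = [(c, r, "Directed", w) for (c, r), w in sorted(weights.items())]
    return node_rows, edge_rows


def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, not taken from the Bucknell course database.
sample = [
    ("COURSE_A", "REQ_X"),
    ("COURSE_A", "REQ_X"),
    ("COURSE_A", "REQ_Y"),
    ("COURSE_B", ""),
]
node_rows, edge_rows = build_gephi_tables(sample)
nodes_csv = to_csv(["Id", "Label", "Kind"], node_rows)
edges_csv = to_csv(["Source", "Target", "Type", "Weight"], edge_rows)
```

Keeping `COURSE_B` as a node even though it has no edges is what lets courses without any requirement show up in the layout.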

After partitioning the nodes by department, running the Modularity analysis on my graph, and reducing the edge size, I was able to create a very attractive visualization. The Fruchterman Reingold algorithm placed courses that do not meet any CCC requirements around the outer edge and beautifully interwove the remaining data in the center of the graph. The center consists of clusters of high-degree CCC nodes with large numbers of course nodes directed toward them. Due to the structure of my input data, the CCC nodes with the highest degree are also the nodes with the largest betweenness, since every edge comes from a course node and ends at a CCC node, which results in the maximum path length being one. Eigenvector centrality yields a similar result, as my input data is not complex enough to yield insight beyond in-degree and out-degree. In future iterations, I plan on modifying the input data so that these metrics become more useful.
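The structural point about path length can be checked with a short sketch. It assumes the edge list is a list of (source, target) pairs, and the node names are hypothetical. A directed path of two or more edges needs a middle node with both an incoming and an outgoing edge, so if no node has both, no path is longer than one edge.

```python
from collections import Counter


def degree_profile(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg = Counter()
    out_deg = Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg


def has_path_longer_than_one(edges):
    """True if some node has both incoming and outgoing edges."""
    in_deg, out_deg = degree_profile(edges)
    all_nodes = set(in_deg) | set(out_deg)
    return any(in_deg[n] > 0 and out_deg[n] > 0 for n in all_nodes)


# Hypothetical course -> requirement edges.
course_edges = [
    ("COURSE_A", "REQ_X"),
    ("COURSE_B", "REQ_X"),
    ("COURSE_B", "REQ_Y"),
]
in_deg, out_deg = degree_profile(course_edges)
longer_paths = has_path_longer_than_one(course_edges)
```

With data shaped like this, the only thing that varies between nodes is how many edges they send or receive, which is why the more advanced centrality measures add little beyond degree.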

After turning on labels for nodes, the aesthetic appeal is reduced dramatically because, unlike Google Fusion Tables, Gephi does not selectively label nodes based on their size. Google Fusion Tables is able to dynamically resize labels and toggle their visibility based on your zoom level. Overall, although Gephi was able to create a more aesthetically pleasing visualization, I found that Google Fusion Tables made it easier to explore the data and its connections, especially with its filtering abilities. However, I believe I can create a more functional Gephi visualization with some modification to the input data so that I can make better use of the software’s advanced analysis tools. I would like to find ways to resize nodes and distinguish CCC nodes from course nodes, which I was able to implement in Google Fusion Tables. One downside I experienced with Google Fusion Tables was that it would automatically hide nodes it deemed to be insignificant, as a way to provide a more organized view of the data. Gephi offers the flexibility to keep all nodes and reorganize them as needed.

I believe that this visualization meets some of Lima’s requirements for networks. This is a new, unique visualization since we typically see course data in a table view, so it creates the potential to generate new insights into the Bucknell curriculum. The graph clarifies our understanding of relationships between nodes by drawing the relevant edges, color coding nodes by department, and using network algorithms such as Fruchterman Reingold to create organized clusters of nodes. This graph makes it easy for people to see the outliers around the outer edge as well as the high-degree nodes in the center. Both types of nodes can have high levels of significance, so it helps that Gephi keeps all nodes in the visualization, even if they may not seem important. I did not find that this visualization greatly expanded my knowledge of the data, since it mostly provides similar information to what was already discovered in previous network graphs of the data. On the other hand, Gephi does a fantastic job of creating aesthetically pleasing visualizations that look like art. I’m excited to expand on my work and create more complex graphs that unlock additional insight into my data.

Curriculum Visualization with Palladio and Google Fusion Tables

For this assignment, I chose to visualize Bucknell’s database of course information for the Fall of 2015. Specifically, I wanted to see the relationships between CCC (College Core Curriculum) requirements and courses across different departments in the University.

I created my dataset by generating a CSV file of course data scraped from Bucknell’s online course database. Each row consists of a course number and a CCC requirement filled by that course. This setup means that there are some duplicate entries in the table for classes that fill multiple requirements or have multiple sections. I used this structure because Palladio and Google Fusion Tables are capable of sizing the nodes by frequency, that is, by the number of sections or requirements filled. Unfortunately, this also means that there is a loss of information, as there is no visual representation of the courses that do not fill any requirements. Additionally, by choosing to scale the sizes of course nodes by the number of times they appear in the CSV, we lose the ability to scale nodes by the number of students enrolled in each course, which may also be of statistical significance. As this is a large dataset, I will be applying filters by department to show subsets of the data when displaying screenshots of my visualization.
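To make the sizing rule concrete, here is a minimal sketch of counting how often each value appears in a two-column file like the one described above. The column names `course` and `requirement` and the sample rows are hypothetical, and the way Palladio and Google Fusion Tables compute node sizes internally may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical rows: duplicates stand for extra sections or requirements.
SAMPLE_CSV = """course,requirement
COURSE_A,REQ_X
COURSE_A,REQ_X
COURSE_A,REQ_Y
COURSE_B,REQ_X
"""


def node_sizes(csv_text):
    """Count how many rows mention each course and each requirement."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sizes = Counter()
    for row in reader:
        sizes[row["course"]] += 1
        sizes[row["requirement"]] += 1
    return sizes


sizes = node_sizes(SAMPLE_CSV)
```

A course that fills no requirement has no row at all in this layout, so it never receives a size and cannot be drawn, which is the loss of information mentioned above.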

In Palladio’s table view, we can see how the software combines the multiple requirements for courses into a single table row, but it doesn’t keep track of the number of sections or links between courses. In fact, this table essentially reproduces the data that is already available on Bucknell’s online database, so there is no knowledge generation or anything interesting or thought provoking about this view. Similarly, the gallery view does not provide any additional insight into the data, as it does not intuitively visualize the data I have provided.

The graph view, on the other hand, does an excellent job of relating the different courses and their associated CCC requirements. Edges are drawn between courses and CCC requirements, and nodes are sized according to their frequency of appearance in the dataset. Because the view is interactive, you can click and drag nodes to look at them individually and see in detail how the edges are connected. Palladio also allows you to highlight one set of nodes, which I used on CCC requirements, as they tend to be difficult to find amongst all of the courses.

Although Palladio is able to produce visually pleasing graphs, I found that Google Fusion Tables was able to produce even more beautiful graphs with some of the fine-tuning that this robust tool offers. I was able to color nodes by their category, which is especially useful for the larger dataset, which has 23 nodes representing the CCC requirements that need to be easily differentiated from the hundreds of course nodes. Additionally, Fusion Tables highlights the edges of the node that is currently being moused over, allowing the viewer to easily see which nodes are connected to that node by edges. Lastly, Google Fusion Tables is better at scaling the size of nodes to make it easier to see which nodes are larger than others.

For this assignment, I have included network graphs for the Economics and Computer Science departments at Bucknell. From the complexity of the Economics graph, with its many nodes and edges, it is immediately evident that this department is much more diversely distributed between curricular requirements as compared to Computer Science. Different people can look at these graphs and come up with different conclusions about the data. For example, from my perspective as a student, it makes sense that introductory courses have more links to CCC requirements because these courses are intended for the general student population, particularly those looking to fill their requirements with courses from other disciplines. That said, this graph may be more useful to individuals more familiar with the intricacies of the curricular structure at Bucknell. They could use the full dataset of courses and CCC requirements to develop models and theories on top of this visualization to identify issues with the curriculum. For this reason, it is important to consider the intended audience of what is being created, since data that might be intuitively understood by one person might not be as straightforward to another.

I believe that the above graphs meet most, but not all, of Lima’s functions for network visualizations. This system of relations has never been documented before, since we mostly see this type of course information in table form. The system clarifies our perspective of the information, since the graph representation makes the data easier for humans to understand by drawing edges between related nodes, sizing nodes by frequency, and color coding nodes by type. These visualizations allow individuals to look at the curriculum, either by specific subgroups or in its entirety, to find patterns in its structure such as potential gaps or overlaps in the University curriculum. The graphs, however, aren’t very good at showing multidimensional aspects of the data, due to the simplicity of the input data, since the input CSV file only had two columns. This issue partially stems from the limitations of Palladio, as the software has trouble supporting large datasets.

Assignment 2

I decided to use Donald Trump as the foundation of my corpus for this assignment. I did so by scraping text from the most popular news articles about Trump from the last 90 days (based on Google’s search rankings). This resulted in a diverse corpus of 16 articles covering recent events surrounding Donald Trump from major news outlets such as The New York Times, Politico, and Business Insider, along with others. My decision to use Trump as a source of data was mostly based on the fact that he is a high-profile businessman and politician, with connections to many different individuals, organizations, and places (which will allow Jigsaw to identify these entities and connections). Additionally, the variety of news sources will allow for a comparison in wording across articles in Voyant as well as sentiment analysis in Jigsaw.

One program I used to analyze my corpus is Voyant Tools, which offers multiple visualizations for texts. It allows for a comprehensive view of the text, views of individual texts within the corpus, and even searches for specific keywords. One view I found useful for my corpus was the summary view, which allows you to get a macroscopic view of the texts. In particular, one statistic from the summary view that is interesting is vocabulary density, which is a way of quantifying the diversity of words in a text. In the case of my corpus, which consists entirely of news articles, the vocabulary density could speak to the article’s writing quality. For this corpus, the articles with the highest vocabulary density were written by NBC News and Politico, while the ones with the lowest density were written by Gawker and Business Insider. More specialized news sources such as Politico tend to have a higher vocabulary density, since they use more technical terms, while mainstream media outlets such as Gawker tend to have lower vocabulary density.
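As a rough illustration of vocabulary density, the sketch below computes the ratio of unique words to total words (the type-token ratio). This assumes the common definition of the measure; Voyant's own tokenization and handling of punctuation may differ, so the numbers would not necessarily match its summary view.

```python
import re


def vocabulary_density(text):
    """Ratio of unique word forms to total words (type-token ratio)."""
    # Simplified tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up sentence: 5 words, 4 of them unique.
density = vocabulary_density("The cat saw the dog")
```

One caveat with this measure is that longer texts tend to repeat words more, so density is easiest to compare between articles of similar length.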

Another view that proved to be insightful for this corpus was the trends view. This view allows you to see the relative frequencies of keywords you provide across each text. The trend view showed some intriguing relationships between words. One example is that almost all of the articles describe Donald Trump as either offensive or controversial, but do not use both. This view can help reveal the type of political biases the authors may have based on how they describe the same events or individuals.
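The idea behind this view can be sketched as follows, assuming relative frequency means a keyword's count divided by the total number of words in each text. The sample sentences are made up, and Voyant's exact scaling and tokenization may differ.

```python
import re


def relative_frequency(text, keyword):
    """Share of the words in `text` that equal `keyword` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


def trend(documents, keywords):
    """Map each keyword to its relative frequency in every document."""
    return {
        kw: [relative_frequency(doc, kw) for doc in documents]
        for kw in keywords
    }


# Made-up example texts, not drawn from the corpus.
docs = [
    "a controversial speech and a controversial rally",
    "an offensive remark",
]
trends = trend(docs, ["controversial", "offensive"])
```

Plotting each keyword's list of values against the documents gives one line per keyword, which is the kind of comparison the trends view makes easy.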

One visualization that could provide new insight is the Bubblelines view, which displays the use of certain words across each text by plotting bubbles on a line which relate to where the word was used in the text. For my corpus, however, I found the trends view to be more intuitive and insightful, since it provides the same data without showing where it appears in each text. In this case, it was easier to compare across a large number of texts using the trends view.

The other program I used to analyze this corpus is called Jigsaw. Jigsaw uses advanced algorithms to identify entities throughout the texts, such as people, locations, and organizations. It then uses the identifications it makes to find links between the entities, allowing the user to investigate their corpus using the multiple visualization tools that Jigsaw offers. The view in Jigsaw that I found to be central to my analysis was the Document View, which provides an overview of the texts in your corpus. It does so by displaying all of your documents along with a highly accurate, algorithmically generated summary sentence. This, along with the other elements in the Document View, makes it a useful point of reference while analyzing the other views.

Furthermore, the List View in Jigsaw provided many useful insights into my corpus. This view displays all the connections made between entities by drawing lines that link a selected entity to other entities. For example, it was fascinating to see that Jigsaw accurately linked Jorge Ramos (a Mexican-American journalist) to organizations such as American Latinos, Univision, Azteca, and Telemundo, even though Ramos was only mentioned in two articles.

Another tool I found useful in Jigsaw was the Document Grid view, which I used to display the sentiment analysis data from each article. This view made it easy to find the articles that berated Trump and his actions, the ones that were neutral, and the ones that supported him. This view also offers other sorting mechanisms, such as sorting by the number of entities in the document or by subjectivity (which I did not find to be as accurate as the sentiment analysis).

Both Jigsaw and Voyant provide useful insight into textual data, but in different ways. Jigsaw, with its entity identification and advanced textual analysis, is useful for investigative inquiries. Jigsaw’s ability to show multiple views at once allows you to start at one entity, or group of entities, and find how it is linked to another entity. This has made Jigsaw very helpful for law enforcement, who can analyze hundreds of documents at a time, using Jigsaw’s many tools simultaneously, to find the information they need in an investigation.

Voyant, on the other hand, provides textual analysis from a high-level view of the documents. It does so by mostly comparing words that appear within the document instead of attempting to analyze the text. While Voyant is capable of providing interesting results, it is up to the user to provide the right search parameters to find interesting patterns in the corpus.

What I learned from this experience is that it is impossible to gain a holistic view of a corpus using just one tool. On the one hand, we gain great insight by using tools such as Voyant and Jigsaw to see statistics and connections that would have been very difficult to compute by hand. On the other hand, it is important to realize that all the analyses we look at are abstractions and simplifications of the real information. Behind each of these visualizations is an algorithm and various design decisions. The visualizations we see come with the biases of the designer or developer, and so we might unknowingly be looking at the data the way they intended us to see it, not the way it was intended to be seen. Therefore, while digital tools can provide great insight into text, it is important to take the analysis with a grain of salt, and even verify a result using multiple tools.