Bucknell Curriculum Visualization

When high school students begin the college search, they are repeatedly bombarded with the same information about class size, department strength, learning goals, etc. from every university they encounter.  Each institution, in the interest of attracting students to apply, wants to put its best foot forward.  Understanding this motive behind the information Bucknell (as well as other colleges) makes publicly available on its website invites further scrutiny: does the information change once students commit to Bucknell?

The Bucknell mission statement, learning goals, college core curriculum goals, and department summaries are available to anyone on the university website.  All of this information essentially communicates the same thing: enabled by a Bucknell education, students grow into more mindful, critically thinking, capable, creative, and culturally aware contributing members of the global community.  Does the information available only to people with a Bucknell login, such as course descriptions and the specific classes that fill particular CCC requirements, carry the same content and cadence?  Is the public face of Bucknell, constructed through its publicly accessible website information, representative of a Bucknell student’s educational reality?

My personal stake in this research has to do with the difficulty I had selecting a major.  Every adviser tells incoming freshmen to take their time exploring and to start by filling general education requirements before settling into a major.  I was told I had plenty of time to decide, but when the time came to declare a major I didn’t feel as though twelve credits’ worth of experience was enough to go on.  Coming from a fairly generic high school, I had no idea what it would mean to be an anthropologist, economist, creative writer, or comparative humanist because I had no experience and knew of no one who had experience in these fields.  If the publicly accessible department descriptions are not truly representative of the field, more pressure falls on course selection as the way for students to gain insight into a branch of knowledge.  But how can students be expected to choose courses they will enjoy and gain meaningful experiences from if the selection process is a gamble?

I began with a specific interest in the materials studied in the three comparative humanities core courses.  Visualizing genre and author/artist gender and ethnicity drew attention to the gaps in the courses’ coverage; specifically a lack of women and non-western authors.  (Visualizations below created in Palladio: on the left a graph view dividing the course materials based on gender, on the right a map view plotting the materials’ location of publication.)

palladio graph author sex     palladio map sized

From there, I became interested in broadening the scope of the visualization to the university as a whole.  Since I do not have access to all the syllabi in every department, I had to shift the focus of the visualization to a different, but related, set of data: course descriptions and requirements as seen in the online course catalog.  This data is especially intriguing because, although it is easily accessible for all Bucknell students making choices about which classes to take, its presentation (a glorified spreadsheet) is indigestible and makes comparison difficult.  My goal was to find a way to view all, or as much as possible, of the data at once in order to access a macro-perspective.  Initially I planned to use Stefanie Posavec’s “Writing Without Words” (below left) as a guide for the tree-like structure I wanted to create.  As “Writing Without Words” reveals Kerouac’s structural style in On the Road, I thought a similar design could reveal the structure of Bucknell’s course offerings.  After some experimentation, I realized my data appeared confusing and sloppy in such a format.  Instead, I borrowed the circular structure of Boris Müller’s “Poetry on the Road” (below right) to give shape to my data.

writing without words                   poetry viz

The “Poetry on the Road” model enabled me to more closely follow Tufte’s principles of display architecture, which include: “(1) documenting the sources and characteristics of the data,” which the visualization accomplishes through its shape, designed to reflect the relationships between departments via CCC requirements; “(2) insistently enforcing appropriate comparisons,” made possible through the various options for node sizing; “(3) demonstrating mechanisms of cause and effect,” by the simple organization of data into the democratic, circular structure in which the viewer’s eye is not drawn to a particular area for any reason other than the concentration of edges; “(4) expressing those mechanisms quantitatively,” as I did by sizing and connecting each node based on quantitative data from the course catalog; “(5) recognizing the inherently multivariate nature of analytic problems,” shown through the combination of variables such as node color, size, and location, and different CCC requirements; “and (6) inspecting and evaluating alternative explanations,” as we explore in Nadeem’s interactive network visualization for each department (Tufte 53).


Inspired by “Poetry on the Road,” I organized all of Bucknell’s academic departments into rings based on the size of each College/School (above).  The outer two rings, with nodes colored purple, represent the College of Arts and Sciences.  Since the College is so big, I split it further into an Arts and Humanities ring and a Science (hard and social) ring in order to make the visualization easier on the eyes.  The center ring, with red nodes, represents the College of Engineering.  The inner ring, with blue/green nodes, represents the School of Management.  In this particular visualization I chose to size nodes based on the number of unique courses offered in each department for the Fall 2015 semester.  For example, the music department has the highest number of unique courses (73), so it is represented by the largest node, and astronomy is one of the departments tied for the lowest number of unique courses (1), so it is represented by the smallest node.  I initially intended to make node size a variable for comparison by creating alternative visualizations with nodes sized by the total number of courses offered or by the number of possible ways to fill CCC requirements in a particular department, but altering node size did not fit seamlessly into the narrative of the project as a whole.

circle.unique.allpub  circle.unique.CCQR.DUSCpub

Since my intention was to create a means to view as much of the course catalog information at once as possible, I first tried to map the edges for all the CCC requirements at once (above left).  Although it made for a decent website header image, the colorful quagmire is too cluttered to be analytically useful.  Even including as few as two CCC requirements on the same image does more harm in the clarity department than it does good for comparison purposes (above right, Quantitative Reasoning and Diversity in the US requirements pictured).

ARHC with nodes  ARHC

Although visualizing one CCC requirement at a time on top of the department nodes is simple enough to convey the data clearly, I decided to simplify even further by removing the nodes (Arts and Humanities requirement pictured above).  It became necessary to include a template of the nodes without any CCC requirements under the narrative tab in order for the visualization to make sense, but the visualization is still legible because the division of the different rings is intuitive enough to grasp without looking directly at the location of the nodes.  The image is also more visually impactful with just the edges.

macro–>  relationship –> micro

When it came time to combine the static and interactive aspects into a single visualization with a reasonably linear narrative, we decided to use the macro > relationship > micro view structure.  Starting with a macro view, a visualization will “facilitate the understanding of the network’s topology, the structure of the group as a whole, but not necessarily of its constituent parts” through a holistic view of the visualization, enabling users to see its overall pattern (Lima 91).  Our macro view is located in the narrative (above left).  It offers both an overview of Bucknell’s academic structure through the listing of learning goals and college core curriculum design taken directly from Bucknell’s website, and a color-coded comparison of Bucknell’s learning goals to its CCC design.  This choice contextualizes the visualization for viewers who may not be familiar with Bucknell’s academic mission.  From the narrative tab, the viewer is prompted to select the college core curriculum tab to access the relationship view (above center), which “is concerned with an effective analysis of the types of relationships among the mapped entities” (Lima 92).  The edges of our static relationship view offer a perspective on the relationships between different departments through CCC requirements.  Finally, the user can click on a node to explore a single department in more depth in the micro view (above right).  Although the micro view offers the narrowest perspective, it offers comprehensive, explicit, and “detailed information, facts, and characteristics on a single-node entity,” which helps to “clarify the reasons behind the overall connectivity pattern” (Lima 92).



Curriculum visualization (Nadeem Nasimi’s) http://nadeem.io/270/

Final project reflection


Our team’s research question is whether Obama actually uses code switching in his speeches when talking to audiences of different classes, races, and ethnic groups. I can’t help but feel obliged to share this YouTube video with my fellow readers. Although it is an exaggerated version of how code switching is used, it is still an excellent example of how it can be adopted in real life.


President Obama has drawn the attention of the public and the media ever since he was elected president of the United States in 2008. As the first African-American president of the United States, he became an embodiment of black culture. Code switching originally refers to frequent, rapid switching between two or more distinct languages (Wikipedia). In our project, however, we adopt a broader definition: it also covers the subtle, reflexive changes in the way people express themselves when they encounter different situations. The project first performs a general linguistic analysis and then attempts to find traces and evidence of cases in which code switching was used in his speeches.


Our project assumes the audience has no background in sociology or linguistics. All necessary terminology and abstract ideas will be explained in a way that everyone can understand. All visualizations are digital, and we post our work on a website that is accessible to anyone, anywhere in the world. The whole website is designed in a storytelling fashion, so that the audience follows exactly the steps we took to reach our conclusion. We believe this is a more persuasive way to help people understand the ideas behind our work, and also a more interesting way to express them.


Most visualizations combine interactive and static views. Most visualizations in Voyant, Gephi, and Google Fusion Tables have interactive features that allow the audience to explore on their own. We chose to first post static snapshots of the visualizations from Gephi and Voyant to give the audience a general understanding. The audience can then explore further by clicking the links behind the snapshots.

All the data we used, mostly speeches by President Obama, comes from this website. We first processed all the speeches to get metadata. Our metadata consists of the location, audience, topic, and time of each speech. We think this helps us analyze the speeches along different dimensions and so perform a more comprehensive analysis.


The first analysis we performed was word frequency analysis, done in Voyant. We first grouped the data by time, audience, and topic. I removed some words from the word clouds in order to give more representative results. Words such as ‘i’, ‘they’ and ‘god’ appear in almost all of his speeches and carry no special meaning in different scenarios. An example of a visualization from Voyant looks like this:


Voyant Visualization of 2012

This is the word cloud for all speeches in 2012. One of the most prominent words in it is “romney”. This makes sense, since 2012 was a presidential election year and Romney was Obama’s strongest opponent at the time. On the right side of the visualization, we can also find the word “tax”. This is also representative, since Obama was proposing several tax reforms, such as higher taxes on high-income taxpayers and lower taxes for startups and small businesses.


This is another visualization from Voyant. This word cloud contains all words under the category ‘Military’, which covers speeches that President Obama gave to military personnel. Unsurprisingly, the most prominent words are ‘iraq’ and ‘security’. In general, Voyant alone cannot give us firm conclusions. This is due to the nature of the corpus and of word clouds: they display words only by frequency, and the importance of a word does not necessarily correlate with how many times it appears in the corpus. Words like ‘I’, mentioned above, do not help us grasp the essence of the speeches. Also, most words in the word clouds are nouns, and it is hard to infer his attitudes from nouns. Verbs and adjectives would be more useful here, but Voyant is not good at selecting words by their grammatical function. However, it is still helpful to some degree. Both of these visualizations show that he used different sets of vocabulary in different situations, which suggests that he adapts his vocabulary to the scenario.


Voyant Visualization for military personnel
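The frequency counting that underlies a word cloud can be sketched in a few lines of Python. This is only a minimal stand-in, not how Voyant works internally: the tokenizer, the `word_frequencies` function, and the sample sentence are assumptions made for illustration, and the stop list contains just the three words mentioned above.

```python
import re
from collections import Counter

# Hypothetical stop list: only the words the text reports removing from the clouds.
STOP_WORDS = {"i", "they", "god"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens, skipping the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Invented sample sentence, not taken from any speech.
sample = "I believe they know security matters. Security in Iraq matters to me."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

A count like this shows the limitation described above: it ranks words by how often they occur and says nothing about their grammatical function or importance.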


The next series of visualizations analyzes the relationship between topic and location. Although, once again, it does not provide direct proof of code switching, it shows that locations are sometimes specifically selected by President Obama and his team for certain topics. This is one of the visualizations:


Keywords classified by state

This visualization displays the keywords of his speeches grouped by state, and it gives us some interesting results. For example, in states like Mississippi, Alabama, Georgia, and South Carolina, which have relatively higher percentages of African American residents than other states, the keywords are words such as ‘Hope’, ‘Change’ and ‘Affect’, which are all positive and share one similar idea. Considering these locations, I do not think this is just a coincidence. I think President Obama and his team are aware of the high percentage of African American residents, and that he knows these are exactly the words that can excite African American audiences and win their support. From this example, we can see that code switching depends on both location and topic. Different locations have different population concentrations, which can be defined by race, ethnicity, class, and so on. Different topics, most of the time, target a specific group of the population. Combining location and topic, we would expect different styles of speech aimed at specific groups of people.


The last visualization is done by Gephi:


This visualization consists of all the words from speeches over five years (2008-2012). It looks like the annual rings of a tree. The center, where most nodes are condensed, holds the words used most frequently across all five years. This concentration suggests that there is a core vocabulary that President Obama uses in most speeches. In the outer area, there are rings of different colors that overlap with each other. These are words that appear mostly in a certain year and are not distributed evenly across the five years. We know from the earlier Voyant visualizations that President Obama focused on different topics each year, so these words most likely address those issues in particular. We take this as direct evidence of code switching. The unique words used only in a specific location, at a specific time, and for a specific audience best exemplify how President Obama adopts code switching. We will definitely investigate further and test different visualizations in Gephi if we get the chance.

In general, I think it is now fair to say that President Obama adopts code switching. There are several reasons why people code switch, whether intentionally or not. One of them is trying to fit in. We can see this demonstrated in the location vs. topic visualizations: President Obama tries to fit in with African American communities by using different vocabulary and selecting the topics that resonate best with local audiences. Code switching can help President Obama and his team convey their thoughts to diverse audiences and attract voters from different backgrounds. Our project demonstrates this idea through multiple cases and examples, and we hope our audience will also come to see that code switching is broadly used by President Obama in his public speeches.


Visualizing relationships in heroic stories

Project Introduction:

Zhengri Fan and I have always been interested in playing RPGs. We love them because we like the stories behind them. Instead of doing research on games as we did before, this time Zhengri and I decided to go directly to the “base”: hero stories. We focus on the relationships between characters in heroic stories from different time periods. From these relationships, we can estimate each character’s influence on the entire story. We then categorize characters by gender and camp (protagonist or antagonist) to get an overview of the influence that each type of character has on the heroic story, and we explore whether the influence of different types of characters changes over time. Our results consist of two parts:

  • An interactive website with character visualization and history background: Epic Story Visualization
  • An influence dataset collected and processed with graph theory (in shared Google Drive files, available from Jiayu Huang, Zhengri Fan, or Katherine Faull)


Dataset Preparation:


The first step of this project was to select the raw texts of epic stories for further research. Because we wanted modern English texts, for better text analysis quality, and time periods that actually influenced literature substantially, we chose pre-1850 (God, Royal and Traditional), 1850-1900 (1st Globalization), 1900-1950 (World War), 1950-2000 (Technology and Cold War), and 2000+ (Future) as our 5 major time slots. For each time slot, we chose 3 books that can be seen as epic stories. The full list of source data can be accessed on our visualization website. We got the raw texts from open source libraries and websites; the text files can be found in the Google Drive raw dataset.

Data Preprocessing:

In order to visualize relationships, we need to process the raw text to extract the relationships between different characters. So there are three very important terms here: raw_text, relationships, and characters. We already have the raw text, so next we need the characters’ names to help Jigsaw do the dark magic of finding relationships for us. For each book we selected, we created a character database with the name, gender, and camp (whether he/she is good or bad) of each main character in the story. Then we split the raw text into several small pieces and fed them into Jigsaw. If two characters appear in the same piece, we treat them as related. Because we don’t have a highly accurate way to make a computer understand text, this is a good approximation for getting relationship data. The final relationship dataset looks like the picture on the left: each line of the csv file counts as one undirected relationship.
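The co-occurrence rule described above can be sketched in Python. This is a hedged stand-in for what we used Jigsaw for, not Jigsaw’s own method: the function names, the fixed chunk size in words, and the plain substring matching of names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def split_into_chunks(text, chunk_size=50):
    """Split text into consecutive pieces of chunk_size words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def cooccurrence_edges(text, characters, chunk_size=50):
    """Count, for each unordered pair of characters, the pieces in which both appear."""
    edges = Counter()
    for chunk in split_into_chunks(text, chunk_size):
        lowered = chunk.lower()
        # Crude matching: a name counts as present if it occurs as a substring.
        present = sorted(name for name in characters if name.lower() in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1  # undirected: each pair is stored once, in sorted order
    return edges


# Invented sample text; chunk_size=7 splits it into two pieces.
sample_text = "Dorothy met the Scarecrow on the road. Later Dorothy and Toto walked on alone."
sample_edges = cooccurrence_edges(sample_text, ["Dorothy", "Scarecrow", "Toto"], chunk_size=7)
for (a, b), weight in sorted(sample_edges.items()):
    print(f"{a},{b},{weight}")
```

The chunk size matters: larger pieces link more characters together, while smaller pieces miss relationships that span a scene.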




Data Processing w/ Gephi:

This time, the process is much the same as in our previous assignment: in Gephi, we do a natural join between our CharacterDB and the relationship csv file for each raw text. Gephi puts the character database information into the nodes table and the relationship data into the edges table. We structure the data following a commonsense schema: relationships are edges and characters are nodes. In order to present the character relationships together with their influence, we use Gephi’s built-in graph theory analysis tools to get each character’s degree (the number of connections between this node and other nodes), centrality (whether this node is the center of the graph; it can be calculated from connected components and degree), and modularity (how strongly the graph can be divided into different parts).  After applying these tests to our data, we find that the modularity of character relationships in heroic stories tends to be really low. That means all of the characters tend to be in one big group, i.e. they are all connected in some way. Not all of the stories are like this, though. Paradise Lost’s modularity value is a little higher because its two camps of characters are separated quite discretely. After some research on graph theory, we chose to use degree to rank the size of the character nodes. Modularity is a good concept, but it only represents grouping and is not a good way to measure influence. Centrality is also good for determining the “core” of a relational web. However, it is not an unbiased measure. It works well for a centralized character, for example Alice in Alice’s Adventures in Wonderland, but there is a lot of bias when we try to use centrality for characters that are not so central to the relationships. In heroic stories, degree and average weighted degree give us a good approximation of the level of influence. For this reason, I size nodes according to their degree.
A character with a higher degree can be seen as a more important character. However, I would also like to show grouping and centralization. For the visualization, I set the layout to “Yifan Hu Proportional” with an optimal distance of 1000. This layout shows grouping (see Paradise Lost) and centralization very well. Here is a sample screenshot of a character relation visualization:

Visualization Utilization/Presenting

In this visualization of “The Wonderful Wizard of Oz”, Dorothy is the central character, with high degree and high centrality. greengirl is a good example of a character with lower influence in terms of both degree and centrality. The viewer is able to use the controls in the lower right to zoom in and out of the graph. Also, clicking on a node brings up detailed information about it, including all the collected graph theory data and the connected nodes. We decided to keep all of the graph theory analysis data in the visualization because we would like it to serve not only our current question about influence but also potential future questions. As we promised in our previous presentation, the product is a fully reader-driven experience: we present the data, but how to use this relationship visualization is up to the user. Of course, we are users ourselves.  In order to present our visualizations in a natural and fluent way, we use TimelineJS to host them. For each time period, the website has an introduction to the history of that period and several visualizations from it, followed by links to our visualization website. In the background, there is a picture representing the specific era. Here is a screenshot of our user interface. When designing the interface, we tried different fonts and designs, and this is the final output that we are really proud to present:


Influence Dataset with graph theory analysis:

Up to this point, we have described the tool we created to do our research on different types of characters in epic stories. In this section I would like to talk briefly about how we use this tool to do actual research. This research is based on the data we collected above and on graph theory.

Step 1: Influence Data for different Books

Because we want information about influence in different time periods, we decided to first compute the influence data for each book. We exported the edge list (relationships) and node list (characters) from each Gephi visualization dataset. Then, using a Python script, we traversed the relationship list and assigned the key factor, degree, to each character. On the left is the sample source code. The influence data is calculated and stored in another csv file. However, the influence data we collected here is just:

Number of connections to a specific node
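The traversal of the relationship list can be sketched as follows. This is not our original script (which appears only as a screenshot) but a minimal illustration under stated assumptions: the `Source` and `Target` column names follow the usual layout of a Gephi edge export, and the function name and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def degree_from_edges(edge_rows):
    """Return {character: number of relationship rows touching that character}."""
    degree = defaultdict(int)
    for source, target in edge_rows:
        degree[source] += 1
        degree[target] += 1  # undirected: both endpoints gain one connection
    return dict(degree)


# Invented edge list; the Source,Target header is an assumed export layout.
edges_csv = io.StringIO(
    "Source,Target\nDorothy,Toto\nDorothy,Scarecrow\nScarecrow,Toto\nDorothy,Lion\n"
)
rows = [(row["Source"], row["Target"]) for row in csv.DictReader(edges_csv)]
influence_raw = degree_from_edges(rows)
print(influence_raw)
```

With a real export, `io.StringIO(...)` would be replaced by an opened csv file, and the resulting dictionary written out as the raw influence file.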

However, if we want to compare these data with each other, a very serious problem occurs: how can we control the bias in the data when one book’s text is much longer than the others? I ran into this problem when I tried to use the raw influence data for the research. I call it the data bias effect, and I will show a screenshot later to illustrate it. For now, I would like to talk about the solution.

Step 2: Standardize Data to utilize data for comparison

There is a concept in statistics called standardization. With it, data can be processed to eliminate the data bias effect while preserving the comparative relationships within the data. The formula is (X - Average) / StandardDeviation. Therefore, I standardized the influence data for each book. The screenshot on the right is a sample data analysis chart for a sample book. Id is the name of the character, InluRaw is the unstandardized influence value (the degree), and Influence is the standardized version:

Number of connections to a specific node compared to other nodes
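As a small worked example of the formula (X - Average) / StandardDeviation, the sketch below standardizes a handful of invented degree values. The function name and the sample numbers are hypothetical, and the use of the population standard deviation is an assumption, since the text above does not say which variant was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Map {name: raw value} to {name: (value - average) / standard deviation}."""
    data = list(values.values())
    avg = mean(data)
    sd = pstdev(data)  # population SD (assumption; a sample SD would rescale the results)
    if sd == 0:
        return {name: 0.0 for name in values}  # all values equal: nothing to compare
    return {name: (value - avg) / sd for name, value in values.items()}


# Invented raw degrees for one book.
raw = {"Dorothy": 3, "Toto": 2, "Scarecrow": 2, "Lion": 1}
standardized = standardize(raw)
for name, score in standardized.items():
    print(f"{name}: {score:.3f}")
```

After standardization each book has an average of 0, so a long book with many connections no longer dominates a short one when the values are compared.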

Step 3: Combine Data from different books to different eras

After we got the data, we combined the numerical data with the categorical variables. First, we combined the influence datasets (numerical data) by era. Then we combined the corresponding character databases (categorical data). The data can be combined safely because the files share the same schema. We then used Gephi to do a natural join of the character database and the influence datasets on Id. This gave us 5 files (one for each time period). The data in them is stored according to the schema on the left, with categorical and numerical data joined.

Step 4: Compare Data according to different era

Then we used another Python script to compute, for each categorical variable, the average influence of each category (the values can simply be added across books because the influence data is standardized). The code can be found in RawConprehensive/Influence Per Era. Then I created a comprehensive chart as the final analysis of the question I asked.
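The per-category averaging in this step can be illustrated with a short sketch. It is not the script stored in the folder named above: the function name, the `Gender` and `Camp` column names, and the sample rows are assumptions, while `Id` and `Influence` follow the schema described in Step 2.

```python
from collections import defaultdict


def average_influence(rows, category):
    """Average the standardized influence of characters grouped by one categorical column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[category]] += row["Influence"]
        counts[row[category]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Invented joined rows for one era; Gender and Camp are assumed column names.
era_rows = [
    {"Id": "A", "Gender": "female", "Camp": "protagonist", "Influence": 1.5},
    {"Id": "B", "Gender": "male", "Camp": "protagonist", "Influence": 0.5},
    {"Id": "C", "Gender": "male", "Camp": "antagonist", "Influence": -0.5},
    {"Id": "D", "Gender": "male", "Camp": "antagonist", "Influence": -1.5},
]
by_gender = average_influence(era_rows, "Gender")
by_camp = average_influence(era_rows, "Camp")
print(by_gender)
print(by_camp)
```

Because the result is an average per character rather than a total, a category with only one or two characters can score high, which is worth keeping in mind when reading the charts.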


The four graphs show the influence analysis by Gender and Camp. The upper two use raw data and the lower two use standardized data. The pronounced peaks in the raw data analysis are a good example of the data bias effect. Because of the considerable length of The Lord of the Rings, the raw data cannot be trusted to be stable. With the standardized data, we are not measuring the combined effect of all characters of a certain type on the story, but the average influence of one instance of that type. For example, we are not trying to measure the summed influence of Ginny + Hermione + … as the female influence on Harry Potter, but their average influence. We want to know the influence of a single character in the story. In the first graph, we can see that women’s influence on the story rises significantly in the 1900-1950 era. This rise means that a woman in a story from that era might be more influential than a man in the same story. Our previous research on war suggested that war always comes with love. The same effect might be at work here: in wartime, people want comfort from female characters. Then, although there were women’s rights movements from the 1950s to the 2000s, men’s position becomes more important in epic stories, and for the first time men’s influence in the story is higher than women’s. One possible explanation is that the data is an average: there were not many female characters before, so each of them was extremely important to the story. With more concern for women’s rights, there are more women in the stories, so on average each one is less important. These two explanations are only my own interpretation of the data. At the least, we get a very clear pattern in the graph: when men gain influence, women lose it. On the other hand, if we look at the Camp dataset, we see a very blurry pattern, with no significant peaks and not much correlation. I will talk about this in the next section.

Successes and Failures of Our Project:


  • We successfully found that women’s influence in the books correlates with time, and that it is correlated with men’s influence
  • We successfully built a relationship graph for future research/use
  • Our design is awesome
  • Our data analysis is rigorous and convincing (in graph theory terms).
  • Our dataset is too big, so we were unable to do very deep analysis (for example, analysis of story plot). We recognized this problem too late
  • Our camp analysis failed. One reason is that we cannot easily divide the characters in a book into simply good, bad, and neutral; we need a better categorizing factor. This may be a flaw in the assumptions we made at the start.