Assignment 2

I decided to use Donald Trump as the foundation of my corpus for this assignment. I did so by scraping text from the most popular news articles about Trump published in the last 90 days (based on Google’s search rankings). This resulted in a diverse corpus of 16 articles covering recent events surrounding Donald Trump from major news outlets such as The New York Times, Politico, and Business Insider, among others. My decision to use Trump as a source of data was mostly based on the fact that he is a high-profile businessman and politician with connections to many different individuals, organizations, and places (which will allow Jigsaw to identify these entities and connections). Additionally, the variety of news sources will allow for a comparison of wording across articles in Voyant as well as sentiment analysis in Jigsaw.

Summary View, Voyant

One program I used to analyze my corpus is Voyant Tools, which offers multiple visualizations for texts. It allows for a comprehensive view of the corpus, views of individual texts within the corpus, and even searches for specific keywords. One view I found useful for my corpus was the summary view, which gives you a macroscopic view of the texts. In particular, one interesting statistic from the summary view is vocabulary density, which is a way of quantifying the diversity of words in a text. In the case of my corpus, which consists entirely of news articles, the vocabulary density could speak to an article’s writing quality. For this corpus, the articles with the highest vocabulary density were written by NBC News and Politico, while the ones with the lowest density were written by Gawker and Business Insider. More specialized news sources such as Politico tend to have a higher vocabulary density, since they use more technical terms, while mainstream media outlets such as Gawker tend to have lower vocabulary density.
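To make the vocabulary density statistic concrete, here is a minimal Python sketch. It assumes density is the ratio of unique words to total words (a type-token ratio) and uses a simple regular-expression tokenizer. Voyant’s own tokenization may differ, so the numbers will not necessarily match what the summary view reports. The sample sentence is made up for illustration.

```python
import re


def vocabulary_density(text):
    """Return unique word types divided by total word tokens.

    Assumption: density is a plain type-token ratio over lowercase
    alphabetic tokens. Voyant's tokenizer may count words differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample text, not taken from the corpus.
sample = "the cat sat on the mat and the dog sat too"
density = vocabulary_density(sample)
print(round(density, 3))
```

Because longer texts naturally repeat more words, this ratio tends to fall as a text grows, so it is most meaningful when comparing articles of similar length.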

Trends View, Voyant

Another view that proved to be insightful for this corpus was the trends view. This view allows you to see the relative frequencies of keywords you provide across each text. The trends view showed some intriguing relationships between words. One example is that almost all of the articles describe Donald Trump as either offensive or controversial, but not both. This view can help reveal the kinds of political bias the authors may have, based on how they describe the same events or individuals.
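The kind of comparison the trends view makes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Voyant’s implementation: relative frequency is taken to be a keyword’s count divided by the total number of tokens in that document, and the two short "articles" are invented placeholders rather than texts from the corpus.

```python
import re
from collections import Counter


def relative_frequencies(documents, keywords):
    """Map each document name to {keyword: count / total tokens}."""
    result = {}
    for name, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty texts
        result[name] = {kw: counts[kw.lower()] / total for kw in keywords}
    return result


# Invented placeholder texts, not real articles from the corpus.
docs = {
    "article_a": "a controversial speech and a controversial reply",
    "article_b": "an offensive remark drew criticism",
}
trend = relative_frequencies(docs, ["offensive", "controversial"])
for name, freqs in trend.items():
    print(name, freqs)
```

Dividing by document length matters here because raw counts would make long articles look as if they favored every keyword.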


Bubblelines view, Voyant

One visualization that could provide new insight is the Bubblelines view, which displays the use of certain words across each text by plotting bubbles on a line at the positions where the word was used in the text. For my corpus, however, I found the trends view to be more intuitive and insightful, since it provides the same frequency data without showing where each word appears in each text. In this case, it was easier to compare across a large number of texts using the trends view.


The other program I used to analyze this corpus is called Jigsaw. Jigsaw uses advanced algorithms to identify entities throughout the texts, such as people, locations, and organizations. It then uses the identifications it makes to find links between the entities, allowing the user to investigate their corpus using the multiple visualization tools that Jigsaw offers. The view in Jigsaw that I found to be central to my analysis was the Document View, which provides an overview of the texts in your corpus. It does so by displaying all of your documents along with a highly accurate, algorithmically generated summary sentence. This, along with the other elements in the Document View, makes it a useful point of reference while analyzing the other views.

Jigsaw list view

Furthermore, the List View in Jigsaw provided many useful insights into my corpus. This view displays all the connections made between entities by drawing lines that link a selected entity to other entities. For example, it was fascinating to see that Jigsaw accurately linked Jorge Ramos (a Mexican-American journalist) to organizations such as American Latinos, Univision, Azteca, and Telemundo, even though Ramos was only mentioned in two articles.
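One way to picture how such links can arise is through co-occurrence: two entities are treated as connected when they appear in the same document. The Python sketch below works on that assumption. It is not Jigsaw’s actual code, and the document-to-entity assignments are toy data invented for illustration, not the real output for this corpus.

```python
from collections import Counter
from itertools import combinations

# Toy data (invented): entities assumed to be already identified per document.
doc_entities = {
    "doc1": {"Jorge Ramos", "Univision", "Donald Trump"},
    "doc2": {"Jorge Ramos", "Donald Trump"},
    "doc3": {"Donald Trump", "Politico"},
}


def cooccurrence_counts(entities_by_doc):
    """Count, for every pair of entities, how many documents contain both."""
    counts = Counter()
    for entities in entities_by_doc.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts


def linked_to(entity, counts):
    """Return {other entity: shared document count} for one selected entity."""
    links = {}
    for (a, b), n in counts.items():
        if a == entity:
            links[b] = n
        elif b == entity:
            links[a] = n
    return links


links = linked_to("Jorge Ramos", cooccurrence_counts(doc_entities))
print(links)
```

Under this model an entity mentioned in only two documents can still pick up several links, as long as those documents mention many other entities.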

Document Grid view, Jigsaw

Another tool I found useful in Jigsaw was the Document Grid view, which I used to display the sentiment analysis data from each article. This view made it easy to find the articles that berated Trump and his actions, the ones that were neutral, and the ones that supported him. This view also offers other sorting mechanisms, such as sorting by the number of entities in the document or by subjectivity (which I did not find to be as accurate as the sentiment analysis).

Both Jigsaw and Voyant provide useful insight into textual data, but in different ways. Jigsaw, with its entity identification and advanced textual analysis, is useful for investigative inquiries. Jigsaw’s ability to show multiple views at once allows you to start at one entity, or group of entities, and find how it is linked to another entity. This has made Jigsaw very helpful for law enforcement, who can analyze hundreds of documents at a time, using several of Jigsaw’s tools simultaneously, to find the information they need in an investigation.

Voyant, on the other hand, provides textual analysis from a high-level view of the documents. It does so mostly by comparing the words that appear within the documents instead of attempting to analyze the text. While Voyant is capable of providing interesting results, it is up to the user to provide the right search parameters to find interesting patterns in the corpus.

What I learned from this experience is that it is impossible to gain a holistic view of a corpus using just one tool. On the one hand, we gain great insight by using tools such as Voyant and Jigsaw to see statistics and connections that would have been very difficult to compute by hand. On the other hand, it is important to realize that all the analyses we look at are abstractions and simplifications of the real information. Behind each of these visualizations is an algorithm and various design decisions. The visualizations we see come with the biases of the designer or developer, and so we might unknowingly be looking at the data the way they intended us to see it, not the way it was intended to be seen. Therefore, while digital tools can provide great insight into text, it is important to take the analysis with a grain of salt, and even verify a result using multiple tools.

Assignment 2: Delving into the words of a child

My corpus consists of data collected and stored as part of the Child Language Data Exchange System (CHILDES) database, a part of the TalkBank system of collected speech transcriptions. The database has been maintained by Professor Brian MacWhinney at Carnegie Mellon University since the 1990s, and has become one of the largest, if not the largest, single collections of spoken child utterances available. The data within the system dates as far back as the 1960s and is continually updated with additional transcriptions from more recent studies. This corpus was then analyzed using CHILDES’s open-source analysis software, CLAN, in order to divide the large pool of data into smaller subsets organized by Roger Brown’s Stages for Syntactical and Morphological Development. This model divides a child’s syntactical speech progression into five stages, ranging from the most basic child speech to more syntactically advanced sentence structures. One interesting thing to note here is that the data itself is not organized by the individual speaker’s age at all, but only by the stages I have just outlined. That being said, there are some general age mappings for Brown’s stages that happen to appear in the data. For instance, simpler sentences are more likely spoken by younger children, while more complex sentences are more likely spoken by older ones. These divisions have proven very interesting in the visualization analysis of my corpus.

Child Utterance Relations

The above images display a scatter plot of my corpus’ word data created using the Voyant platform. The visualization takes the 1000 most frequently used words from my corpus and breaks them into clusters by relative usage (displayed via different colors). These nodes are then placed on a plot in relation to each other based on their relative use and connections within the text. This visualization is novel in its ability to display the interconnected nature of early speech sentence structure. All spoken utterances are cleanly related to their counterparts, branching off into three separate offshoots from the main base of language. What surprises me the most about this visualization is how clean this relation is, and how geometric it is as well.


Word Cluster

family relation

The above visualization was created using the Jigsaw platform. The main takeaway that this visualization presents immediately is the direct relation between words of urgency and utterances of “mommy” or “daddy”. While this may appear obvious from a distance, it is very interesting to see that these rather dissimilar words, which share only the trait of urgency, all have a higher frequency of relations to utterances for parents. The sheer number of utterances of “mommy” compared to other types of names or actions is also quite interesting to behold.

The obvious difference between the Voyant and Jigsaw platforms is in the way each handles the data that it processes. Voyant is more interested in word frequencies and the relative originality of individual terms, while Jigsaw is more focused on putting context into the words that it is given by dividing them into entities for analysis. Because of the nature of this context-based approach, Jigsaw isn’t very useful for large-scale text files that haven’t been properly parsed yet. Jigsaw is quite good at reading books or formal reports, for instance, because of the way in which subjects are formatted and displayed within those texts. But my corpus contains a very large number of spoken utterances by children, which aren’t always as syntactically literate as these. Because of this, I was tasked with defining my own set of entities based on the components of my text that I wished to explore. Some examples of entities that I used are different pronoun forms, family members, and words of urgency. From there, I was able to make connections between these various groups using Jigsaw’s extensive document analysis and clustering tools. Voyant doesn’t allow you this much specific control. But where Jigsaw succeeds in entity and core analysis, Voyant makes up ground in large-scale text analysis. Using Voyant I was able to draw overarching analytical conclusions about word usage which aren’t as clear when using Jigsaw. Both platforms are quite extensive in their offerings, as long as the data you are working with is tailored to what each platform provides.
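The idea of defining custom entity groups and then looking for connections between them can be sketched outside of Jigsaw as well. The Python example below is only an illustration: the category names follow the ones mentioned above, but the word lists and the sample utterances are hypothetical and are not taken from the CHILDES data or from the entity files I actually used.

```python
import re
from collections import Counter

# Hypothetical word lists for each custom entity category.
ENTITY_LISTS = {
    "family": {"mommy", "daddy"},
    "urgency": {"now", "hurry", "need"},
    "pronoun": {"i", "me", "you"},
}


def tag_utterance(utterance):
    """Return the set of categories whose words appear in the utterance."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    return {cat for cat, words in ENTITY_LISTS.items() if tokens & words}


def category_pairs(utterances):
    """Count how many utterances contain each pair of categories together."""
    pairs = Counter()
    for utterance in utterances:
        cats = sorted(tag_utterance(utterance))
        for i, first in enumerate(cats):
            for second in cats[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


# Invented sample utterances, not real transcript lines.
utterances = ["mommy hurry", "daddy I need juice now", "look a dog"]
pairs = category_pairs(utterances)
print(pairs)
```

Counting pairs per utterance rather than per file keeps the connections local, which suits a corpus made up of many short spoken lines.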

The creation of this corpus, as well as the process of analyzing it with these two similar yet disparate platforms, has yielded an interesting insight into what Clement was trying to get at in her piece on Analysis and Visualizations. On one hand, the images in front of me display concrete information that was gathered from valid sources for analysis. Yet all of this visualization is taking place in a completely virtual environment. None of it is physical, unless it were to be printed out or written down manually. This incongruity is interesting in that it reminds the researcher of the constraints of a virtual analysis process, while also showing that without the humanistic element of the analysis, no real conclusions could be drawn. We are simultaneously working as humanists and computer scientists in these moments, and are capable of making connections that neither could make alone.