2016 Candidate Word Clouds – The GDELT Project

As part of our ongoing experiments in how to represent the firehose of data that GDELT monitors each day, we've been creating word clouds of the language appearing around each mention of the 2016 presidential candidates since August 8th of last year (UPDATE: now through January 1, 2015). Two word clouds are generated each day: one reflecting online news coverage from the previous 24 hours, and one reflecting television news coverage as monitored by the Internet Archive's Television News Archive two days prior (due to the 48-hour rolling embargo enforced by the Archive).

For each recognized candidate from both parties, we compile a word cloud visualization in the shape of their party's symbol using Andreas Mueller's superb Python word_cloud library, whose support for custom raster masks lets us create visually captivating displays.

For television, we take all words appearing within 10 seconds of every mention of the candidate's name in the closed captioning stream of all American domestic stations monitored by the Archive during that day (as measured in the UTC time zone). For web, we take all words appearing in the same paragraph as any mention of the candidate's name. The size of each word in the word cloud is based on how often it appears in context with mentions of the candidate that day. The position and coloration of each word are unrelated to its mentions and are determined by the word_cloud library's layout algorithm to best fit all of the words within the visual.
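To make the two context rules concrete, here is a minimal Python sketch. It is not our production pipeline: the tokenizer, the case-insensitive substring match on the name, and the list of (seconds, word) caption pairs are all assumptions made for illustration, as are the function names.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def web_context_counts(paragraphs, name):
    """Count every word in each paragraph that mentions `name`."""
    counts = Counter()
    needle = name.lower()
    for paragraph in paragraphs:
        text = paragraph.lower()
        if needle in text:
            counts.update(WORD_RE.findall(text))
    return counts


def tv_context_counts(captions, name_token, window=10.0):
    """Count caption words within `window` seconds of any mention.

    `captions` is a hypothetical list of (seconds, word) pairs. A word is
    counted once even if it falls inside the window of several mentions.
    """
    target = name_token.lower()
    mention_times = [t for t, w in captions if w.lower() == target]
    counts = Counter()
    for t, w in captions:
        if any(abs(t - m) <= window for m in mention_times):
            counts[w.lower()] += 1
    return counts
```

Either function returns per-word counts of the kind that could then drive word sizes in a cloud.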

At the bottom of this post you can see how American television coverage of Hillary Clinton stacked up against that of Donald Trump two days ago on January 22nd, 2016. Compare those with web news coverage of Clinton and Trump. The two are often subtly different, reflecting different news cycles and priorities.

These word clouds are produced every morning by 2AM EST and posted using the following URL format:

The date is in YYYYMMDD format and the candidate name is written in uppercase as FIRSTNAME_LASTNAME. This is followed by either "television" or "web", denoting which source the word cloud reflects. Each morning the web word cloud is available for the previous day and the television word cloud is available for two days prior. So, on January 24th, the most recent web word cloud is for January 23rd, while the most recent television word cloud is for January 22nd. Note that the UTC time zone is used internally, so late-breaking news that happens just before midnight in the US may be reflected in the following day's word cloud.
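The date arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only produces the date and candidate components described here rather than a full URL.

```python
from datetime import date, timedelta


def latest_cloud_dates(today):
    """Return the newest available dates (YYYYMMDD) for each source."""
    return {
        # Web clouds cover the previous day.
        "web": (today - timedelta(days=1)).strftime("%Y%m%d"),
        # Television clouds lag two days because of the Archive's embargo.
        "television": (today - timedelta(days=2)).strftime("%Y%m%d"),
    }


def candidate_key(full_name):
    """Turn 'Donald Trump' into 'DONALD_TRUMP'."""
    return "_".join(full_name.upper().split())
```

For example, latest_cloud_dates(date(2016, 1, 24)) gives 20160123 for web and 20160122 for television. Since days are measured in UTC, a caller would likely want to pass datetime.now(timezone.utc).date() rather than the local date.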

You can access these word clouds daily from August 8, 2015 (20150808) to present. Feel free to republish these word clouds in your own publications/news coverage/websites, but please credit them as "Image by GDELT Project (http://gdeltproject.org) using data from the Internet Archive's Television News Archive."

For those wishing to perform sentiment or content analysis on the word histograms underlying the word clouds, we are also making those available for the television word clouds (coming shortly for the web word clouds). To access the word histogram for Donald Trump's television word cloud for January 22, 2016, use the following URL:

http://data.gdeltproject.org/blog/uscampaign2016/wordclouds/20160122.DONALD_TRUMP.TXT
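If you want to fetch histograms for other days or candidates, a small helper along these lines may be convenient. It assumes that every histogram follows the same pattern as the single example URL above (date, then uppercase candidate name, then .TXT); that generalization is inferred from the example, and the function name is hypothetical.

```python
from datetime import date

# Base path taken from the example URL above.
BASE = "http://data.gdeltproject.org/blog/uscampaign2016/wordclouds"


def histogram_url(day, full_name):
    """Build a television histogram URL, assuming the example's pattern holds."""
    key = "_".join(full_name.upper().split())
    return f"{BASE}/{day:%Y%m%d}.{key}.TXT"
```

Calling histogram_url(date(2016, 1, 22), "Donald Trump") reproduces the example URL.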

The word histogram is sorted in descending order of the number of times each word appeared, and each word is repeated in the file as many times as it appeared in the closed captioning streams that day. In other words, if the word "trump" appeared 100 times, "republican" appeared 40 times and "candidate" appeared 10 times, the text file would contain the word "trump" repeated 100 times, followed by "republican" repeated 40 times and finally "candidate" repeated 10 times. Words are all lowercase and separated by spaces, with no punctuation. This format was chosen for its compatibility with a wide variety of content analysis packages.
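Because the file is just space-separated words, turning it back into counts takes very little code. Here is a minimal Python sketch; the function names are hypothetical, and the sample string simply mirrors the example above rather than real data.

```python
from collections import Counter


def parse_histogram(text):
    """Collapse the repeated-word file format into per-word counts."""
    return Counter(text.split())


def to_histogram_text(counts):
    """Write counts back out in the repeated-word format, most frequent first."""
    return " ".join(
        word for word, n in counts.most_common() for _ in range(n)
    )


# Sample mirroring the example in the text above.
sample = " ".join(["trump"] * 100 + ["republican"] * 40 + ["candidate"] * 10)
counts = parse_histogram(sample)
```

Here counts["trump"] is 100, and the resulting mapping of words to counts can be handed to most content analysis tools that accept word frequencies.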

You can also access the latest set of television and online news word clouds in a handy summary page each day.

Please get in touch with any questions or comments! Based on the feedback from these word clouds we are planning to expand these visualizations to global heads of state and other prominent leaders to provide an ever-growing collection of visualizations and analytic interfaces to the GDELT firehose. Remember that you can also access the live candidate mentions timeline and its mobile version. Enjoy!