Campaign 2020: Getting Started With GDELT For Tracking The US Presidential Race

As the United States' 2020 presidential race begins in earnest, here are just a few of the ways GDELT can be used to track how the media is covering the candidates and the race as a whole.

Television Coverage

Television news coverage of the candidates and the broader race is most easily explored through GDELT Summary's Television Explorer, which provides a user-friendly interface to the underlying Television 2.0 API. The Television Explorer allows you to keyword search the raw closed captioning of the monitored stations, getting back a timeline of how much the keyword was mentioned, a comparison of how much each station covered it, a word cloud of common co-occurring words and a list of top clips matching the keyword.

The Television Explorer searches an archive of more than 5.7 billion words of closed captioning from 163 distinct stations spanning July 2009 to present, from major national stations like CNN, MSNBC, Fox News and Bloomberg to local affiliates of CBS, NBC and ABC to international stations like Al Jazeera, BBC News, Deutsche Welle and Russia Today (not all stations are monitored for the entire 2009-present period: see the stations list in the Television Explorer for the monitoring range of each station). Note that there is a rolling 24-hour embargo, so the Television Explorer can only search material that aired at least 24 hours ago.

Example Analyses.

  • Candidate Attention. Plot how much attention a given candidate is getting across the major stations (typically CNN, MSNBC and Fox News are the most commonly used since they are general-purpose stations that go back to 2009). Comparing how much coverage a candidate is receiving on CNN vs MSNBC vs Fox News often offers a useful barometer of how partisan the issue is. Note that you should ALWAYS use normalized volume (the default), which reports the number of 15s blocks matching your keyword as a percentage of all 15s blocks monitored from the given station in the given time interval. Stations devote different amounts of time to advertising and non-news content, so normalization ensures that results are comparable across stations. Live Demo (Trump).
  • Issue Attention. Plot how much attention a given issue is receiving and which stations are focusing on it the most. Live Demo (Mueller).
  • Candidate And Issue Wordclouds. Use the Top Clips snippets to compute the words most commonly occurring within 15s of each mention of the candidate's name. Generating a time series of daily, weekly or monthly wordclouds makes it possible to see how co-occurring words are changing over time. Comparing which words appear more or less on each station can also offer powerful insights, looking at the language favored by CNN vs Fox News, for instance. Example.
  • Candidate Dashboard. One of the most powerful ways to leverage the Television Explorer is simply to run a query for each candidate and graph them all together on a single timeline, ranking them by how much media attention each is getting. Comparing candidate mentions on the same set of stations over the same time period is the only case in which it is safe to use non-normalized raw result counts. Here you can retrieve the raw number of mentions of each candidate and then rank each candidate by their share of all mentions of all the candidates (or of all GOP or DEM candidates). Reporting each candidate as a percentage of candidate mentions, rather than of television mentions, can help normalize for the decreased political coverage during holidays. For examples of what these candidate trackers can look like, see The Atlantic's 2016 Candidate Tracker and the interactive 2016 Candidate Tracker. Alternatively, trackers can look at how much coverage candidates got around key events like their kickoff announcements.
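The share-of-mentions ranking described under Candidate Dashboard can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `candidate_shares`, the candidate labels and the counts are all hypothetical, and it assumes you have already obtained raw mention counts for each candidate from the same stations over the same time period.

```python
def candidate_shares(raw_counts):
    """Rank candidates by their percentage of all candidate mentions.

    raw_counts maps a candidate name to a raw (non-normalized) mention
    count. All counts are assumed to come from the same set of stations
    over the same time period.
    """
    total = sum(raw_counts.values())
    if total == 0:
        # No mentions at all: every candidate has a 0% share.
        return [(name, 0.0) for name in sorted(raw_counts)]
    shares = {name: 100.0 * count / total for name, count in raw_counts.items()}
    # Highest share first.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts, for illustration only.
example_counts = {"Candidate A": 600, "Candidate B": 300, "Candidate C": 100}
ranking = candidate_shares(example_counts)
for name, share in ranking:
    print(f"{name}: {share:.1f}% of candidate mentions")
```

Because each value is a share of candidate mentions rather than of all airtime, a holiday week with little political coverage lowers every raw count but leaves the relative ranking intact.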

Data Access.

The easiest way to search television is to use the Television Explorer interactively, either using the results as-is or downloading them as CSV files to plot in Excel, import into statistical software, etc. Results can also be downloaded programmatically on a regular basis:

  • Manually Downloading Results. Often you want to download the results of an ad-hoc search to CSV or JSON for further analysis. When viewing the results of any search in the Television Explorer, you can click on the export button at the top right of any graph to download the results as CSV or JSON for import into Excel or any other statistical package. For the Top Clips list, use the CSV and JSON links at the top of the list. For example, when creating your search in the TV Explorer interface you can select whether you want daily, weekly, monthly or yearly aggregation and then download the resulting timeline as a CSV file for graphing in Excel, either on its own or plotted against other data, like Google Trends. A simple way to graph multiple datasets alongside each other is to report them as Z-Scores, which in Excel is available as the STANDARDIZE() function.
  • Programmatic Scripting. For some applications, like prediction models or summaries, you may need to download the results on a regular basis, such as each morning or once a week, using a Python script or similar mechanism. To download any of the graphs on a regular basis, click on the Export button at its top right and choose "View URL." This will open a new browser tab displaying the underlying Television 2.0 API URL that generated the graph. Just copy-paste this URL and add "&format=csv" or "&format=json" to the end to get a URL that your Python or other script can download each morning, each week, etc. to get the latest results, updated to 24 hours prior. Similarly, for the Top Clips pane you can click on the "CSV" or "JSON" links at the top right of the clip gallery to get the URLs to download the latest listing on a regular basis.
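As a rough sketch of the Programmatic Scripting workflow, the Python below appends the format parameter to a URL copied from "View URL" and downloads the result using only the standard library. The helper names `with_format` and `fetch_results` are hypothetical, and the URL shown is a placeholder rather than a real API address: substitute the one the Television Explorer shows you.

```python
import urllib.request


def with_format(view_url, fmt="csv"):
    """Append format=csv or format=json to a URL copied from "View URL"."""
    if fmt not in ("csv", "json"):
        raise ValueError("fmt must be 'csv' or 'json'")
    # Use '&' when the URL already carries query parameters, '?' otherwise.
    separator = "&" if "?" in view_url else "?"
    return f"{view_url}{separator}format={fmt}"


def fetch_results(view_url, fmt="csv", timeout=60):
    """Download the latest results and return them as text."""
    with urllib.request.urlopen(with_format(view_url, fmt), timeout=timeout) as response:
        # utf-8-sig decodes UTF-8 with or without a leading byte order mark.
        return response.read().decode("utf-8-sig")


# Placeholder only: paste the URL shown by "View URL" here.
VIEW_URL = "https://example.org/placeholder?query=candidate"
print(with_format(VIEW_URL, "csv"))
# csv_text = fetch_results(VIEW_URL)  # uncomment once VIEW_URL is a real URL
```

A scheduler such as cron can then run the script each morning or each week and save the returned text to a dated file.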

Comparing Queries.

Television Explorer makes it easy to compare up to four queries at once with a built-in Comparison Visualization mode. Specify up to four queries to view them combined on a single graph, or specify two queries to view one as a percent of the other (for example, to see what percent of coverage of Hillary Clinton mentioned her emails, use "clinton (email OR emails OR server)" as the first query and "clinton" as the second query).

The Comparison Visualization mode runs interactively inside your browser, running each query and combining the results dynamically. This makes it trivial to test out different ideas to look for interesting results. The final results can be downloaded as a CSV file for further analysis. Note that since the Comparison Visualization runs in the browser, there is no URL that can be scripted via Python, etc. Instead, to run a comparison programmatically on a regular basis, you would fetch the individual queries yourself and combine their results using your own code.
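One way to combine two separately fetched queries in your own code is sketched below in Python, computing the first query as a percent of the second for each date. The function name `percent_of`, the dictionary layout (date string mapped to a value) and the numbers are all hypothetical, and the sketch assumes both timelines were fetched with the same stations, date range and aggregation.

```python
def percent_of(numerator, denominator):
    """Express one timeline as a percentage of another, date by date.

    Both arguments map a date string to a value. Dates missing from
    either timeline, or with a zero denominator, are skipped.
    """
    result = {}
    for date in sorted(numerator.keys() & denominator.keys()):
        base = denominator[date]
        if base > 0:
            result[date] = 100.0 * numerator[date] / base
    return result


# Hypothetical values, for illustration only.
email_mentions = {"2016-10-01": 5, "2016-10-02": 10}
all_mentions = {"2016-10-01": 50, "2016-10-02": 40, "2016-10-03": 0}

combined = percent_of(email_mentions, all_mentions)
for date, pct in combined.items():
    print(f"{date}: {pct:.1f}%")
```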

Online News Coverage

GDELT Summary can also be used to report online media coverage of each candidate or issue very similarly to the Television Explorer (it is a wrapper around the DOC 2.0 API). At this time online coverage is searchable only back to January 1, 2017, but the full historical backfile will be available shortly, allowing you to compare online coverage over the same time period as television coverage.

  • Manually Downloading Results. Use GDELT Summary's Online News Summary. You can download the results in CSV and JSON formats just like with television news.
  • Programmatic Scripting. Programmatically downloading the results of the Online News Summary works the same way as for the Television Explorer.

Emotion Mining

Using GDELT Summary's fulltext search to identify online news coverage mentioning a given candidate, it is possible to explore the deeper emotional currents surrounding coverage of that candidate by taking the list of matching URLs and cross-referencing them against the GDELT Global Knowledge Graph (GKG) 2.0. Within the GKG record for each article, extract the Global Content Analysis Measures (GCAM) field, which records thousands of complex emotions and topics. GCAM records emotions at an article level, rather than entity level, but nevertheless allows some exceptionally powerful analyses of the emotional undercurrents of the 2020 race.
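Once you have the GKG records for the matching URLs, the GCAM field can be unpacked with a small parser. The Python sketch below assumes the GCAM field is a comma-delimited list of colon-delimited key/value pairs (for example "wc:125,c2.21:4,v10.1:3.5"). Verify this layout against the GKG documentation before relying on it. The function name `parse_gcam` and the sample values are hypothetical.

```python
def parse_gcam(field):
    """Parse a GCAM field into a dict mapping dimension key to numeric value.

    Assumes comma-delimited blocks of colon-delimited key/value pairs.
    Malformed or non-numeric blocks are skipped.
    """
    scores = {}
    for block in field.split(","):
        key, separator, value = block.partition(":")
        if not separator:
            continue  # no colon in this block: skip it
        try:
            scores[key.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric: skip it
    return scores


# Hypothetical sample field, for illustration only.
sample = "wc:125,c2.21:4,v10.1:3.5"
parsed = parse_gcam(sample)
print(parsed)
```

With one parsed dictionary per article, a given dimension can be averaged across all articles mentioning a candidate and tracked over time.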

Frontpage Coverage

Every hour on the hour since March 2018, GDELT has crawled the homepages of around 50,000 news outlets worldwide and recorded all of their links and link text in the order they appear on the page. This makes it possible to see which stories are making it to the world's homepages and how often a candidate is frontpage material.

Stealth Editing And Rewriting

GDELT recrawls each online news article it monitors after 24 hours and again after one week, comparing it against the article's contents when it was first seen and recording any changes. Article deletions, redirects, and title and body changes are all logged. This is a highly experimental dataset, and additional filtering and manual review are recommended to confirm results, but it can offer a powerful tool for identifying changing narratives around candidates.

Visual Analysis

Finally, a highly experimental approach to understanding the candidates can come from analyzing their visual depictions in online news coverage. Each day up to 750,000 randomly selected images from all of the worldwide online news coverage monitored by GDELT are run through the deep learning algorithms of Google's Cloud Vision API. Among the available fields is Google's Web Entities field, which processes the caption of the image to identify captions that mention a particular public figure (Google does not perform facial recognition, so it will only flag that a public figure was mentioned in the caption; it cannot identify whether they actually appeared in the image itself). This database, totaling nearly half a billion images and stretching back to December 2015, opens a huge number of possibilities for understanding the visual narratives of the race.