GDELT GKG GeoJSON Files Available

For those who want to create their own interactive GDELT maps in CartoDB like the three we've unveiled thus far (the realtime news map, the realtime happiest and saddest news animation, and the food/water security, transportation, and health/disease map), or who want to import GDELT into spatial analytics toolkits, we've made the underlying GDELT GKG GeoJSON files available for download.  The files below are updated every 15 minutes and reflect a rolling one-hour or 24-hour window, as described below.  There are three files: the Hourly URL GeoJSON, the Minimized Non-US Hourly URL GeoJSON, and the 24 Hour Animation GeoJSON.  Details about each follow.

If you have a CartoDB "John Snow" or above account, make sure to select "Sync every hour" when you import these files and they will live-update for you every hour, yielding the same constantly-updating maps that we've been creating!  If you are importing into your own system, just refetch each of these URLs once every hour to get the latest data.  For those with sufficient disk space, you could download the Hourly URL GeoJSON file every hour and append it to a master table to build a longitudinal time series map table.  In CartoDB you might do this by keeping a synced table and, once an hour, making a CartoDB SQL API call to copy its rows into the master table (though you will run out of disk space quite quickly unless you have an enterprise account).
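For those importing into their own system, the hourly refetch-and-append idea might look roughly like the sketch below.  The URL is a placeholder (substitute the real download link from this page), and the de-duplication key of article URL plus location name and the added `fetched_at` property are assumptions made for illustration, not part of the files themselves.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder only: substitute the real Hourly URL GeoJSON download link.
HOURLY_URL = "https://example.com/gdelt-hourly.geojson"


def fetch_geojson(url):
    """Download a GeoJSON document and parse it into a dict."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def append_to_master(master, collection, fetched_at):
    """Append features that are not yet in `master`; return how many were added.

    Rows are keyed on (url, name), an assumed key chosen so that a row seen
    in two overlapping rolling windows is stored only once.
    """
    seen = {(f["properties"].get("url"), f["properties"].get("name")) for f in master}
    added = 0
    for feature in collection.get("features", []):
        props = feature.get("properties", {})
        key = (props.get("url"), props.get("name"))
        if key in seen:
            continue
        seen.add(key)
        # Stamp each row with the fetch time (hypothetical extra property).
        stamped = {**props, "fetched_at": fetched_at.isoformat()}
        master.append({**feature, "properties": stamped})
        added += 1
    return added
```

Run once an hour (from cron, for example), this would be called as `append_to_master(master, fetch_geojson(HOURLY_URL), datetime.now(timezone.utc))`, with `master` loaded from and saved back to whatever storage you use.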

 

Hourly URL GeoJSON

[Download GeoJSON]
This GeoJSON file contains all coverage that GDELT has monitored in the last hour.  It updates every 15 minutes and reflects all coverage seen in the preceding 60 minutes on a rolling basis.  Each row represents one location found in a given article.  If an article mentions multiple locations it will be listed multiple times, once for each location.  If a location is mentioned in multiple articles, it will be listed multiple times, once for each article.  The goal of this GeoJSON file is to enable the creation of clickable map layers that allow interactive spatial browsing of GDELT's geographic information.  NOTE that when loaded into a database platform like CartoDB it will frequently consume between 100 and 300MB, so it requires a large amount of database space.

The fields contained in the file are as follows:

  • mentionednames: A semicolon-delimited list of all of the person and organization names associated with this location in a given news article. Note that this field errs on the side of inclusivity, so you will see a high false positive rate in this field, especially from translated articles.
  • mentionedthemes: A semicolon-delimited list of all of the themes associated with this location in a given news article.
  • name: The human-readable name of the location. This can be used to filter to only those locations from a particular country or first order administrative division.  Note that spellings can vary based on transliterations, and occasionally odd characters may appear in a name when its Romanized form is not fully ASCII-compliant.
  • url: The URL of the article mentioning this location. At this time only online news coverage is included in the CartoDB feed.
  • urldomain: The domain of the article mentioning this location. This allows you to filter for specific news outlets, such as comparing the New York Times versus the BBC.
  • urllangcode: This is the three-letter ISO 639-2 code of the language the article was written in. This allows you to filter coverage by language.
  • urlnumamounts: This records how many precise numeric amounts were mentioned in the article. When it processes an article, GDELT looks for any mentions of precise numeric amounts, such as “22 injured” or “6 train cars derailed”.  The number of these precise amounts that it found in the article is recorded in this column.  This can be used to filter coverage for those with greater detail about a particular issue.  For example, you might make a map of Natural Disaster coverage, but only display coverage that has multiple precise numeric amounts, suggesting coverage that is highly detailed about the disaster.
  • urlsocialimage: This records the social sharing image provided by the article. Many news articles include an editorially-assigned image to be displayed when the article is shared via social networks or other channels.  Note that even news outlets that provide social sharing images may only include them for a portion of their coverage, so you should exercise caution when using this field to restrict your map to only coverage containing a social sharing image.
  • urltone: This offers an extremely coarse assessment of the overall positive/negative “tone” of the article.  It is not nearly as sophisticated as the broader array of emotions available through GCAM, but offers a decent rough approximation of the overall emotional emphasis of the article.  However, keep in mind that there are considerable unknowns about comparing the tone of articles across languages at this time, so be conservative with direct comparison of the tone of articles written in different languages.
  • urlwordcnt: This records the size of the article mentioning this location in words.  For non-English coverage, this records the size of its translation in English words.  This can be useful to filter for longer (likely more detail-rich or summary pieces) or shorter (likely breaking updates) coverage.
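As an illustration of how these fields can be combined, here is a minimal sketch of the natural disaster example from the urlnumamounts description.  It assumes the file parses as a standard GeoJSON FeatureCollection with the fields above stored under each feature's `properties`; the theme string you pass in (for instance `"NATURAL_DISASTER"`) is likewise an assumption and should be checked against the values that actually appear in mentionedthemes.

```python
def split_list(value):
    """Split a semicolon-delimited field into a list of non-empty items."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]


def detailed_theme_features(collection, theme, min_amounts=2):
    """Keep features tagged with `theme` whose article mentions at least
    `min_amounts` precise numeric amounts."""
    kept = []
    for feature in collection.get("features", []):
        props = feature.get("properties", {})
        themes = split_list(props.get("mentionedthemes"))
        # Tolerate the count arriving as a number, a numeric string, or missing.
        amounts = float(props.get("urlnumamounts") or 0)
        if theme in themes and amounts >= min_amounts:
            kept.append(feature)
    return kept
```

The same pattern extends to the other columns, for example comparing `urldomain` values or restricting to one `urllangcode` before mapping.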

 

Minimized Non-US Hourly URL GeoJSON

[Download GeoJSON]
This file is identical to the Hourly URL GeoJSON except that it excludes all locations in the United States and is capped at 40MB or 40,000 entries, whichever limit is reached first.  It is designed to offer a minimized version of the file, suitable for importing into free CartoDB accounts and other mapping or analytic platforms that have relatively small limits on data size.
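If you would rather apply similar limits yourself to the full Hourly file, the sketch below shows one way to do it.  It is not how GDELT produces the minimized file: the test for United States locations (a `name` ending in "United States"), the reading of 40MB as 40 MiB, and the approximate size accounting are all assumptions made for illustration.

```python
import json

MAX_ENTRIES = 40_000
MAX_BYTES = 40 * 1024 * 1024  # assumes "40MB" means 40 MiB of serialized JSON


def minimize(collection, max_entries=MAX_ENTRIES, max_bytes=MAX_BYTES):
    """Drop assumed-US features and stop once either cap would be exceeded."""
    kept, size = [], 0
    for feature in collection.get("features", []):
        name = feature.get("properties", {}).get("name") or ""
        if name.endswith("United States"):  # assumed naming convention
            continue
        # Approximate size: per-feature JSON only, ignoring the outer wrapper.
        encoded = len(json.dumps(feature).encode("utf-8"))
        if len(kept) >= max_entries or size + encoded > max_bytes:
            break
        kept.append(feature)
        size += encoded
    return {"type": "FeatureCollection", "features": kept}
```

Lowering `max_entries` or `max_bytes` lets you fit whatever quota your own platform imposes.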

 

24 Hour Animation GeoJSON

[Download GeoJSON]
This GeoJSON file contains a clustered version of all coverage that GDELT has monitored in the last 24 hours, collapsed by location+timestamp instead of by article.  It updates every 15 minutes and reflects all coverage seen in the preceding 24 hours on a rolling basis.  Unlike the Hourly GeoJSON files, in this file each row represents a single location in a single 15 minute interval instead of a single mention of a location in a single article.  Here, all mentions of a given location within a 15 minute interval are collapsed into a single row.  For the names and themes fields, all names and themes associated with any mention of the location in the given time interval are concatenated together, as is the case with the domains field.  The other fields are averages or sums over all mentions.  The name field is excluded from this file to conserve disk space.  The goal of this GeoJSON file is to enable the creation of animation map layers that show change over time, rather than supporting interactive clickable maps.  NOTE that when loaded into a database platform like CartoDB it will frequently consume between 250 and 450MB, so it requires a large amount of database space.

The fields contained in the file are as follows:

  • allmentionednames: This is a semicolon-delimited concatenated list of all of the names associated with this location across all of the coverage mentioning that location during that given 15 minute interval. To conserve disk space, if a name is mentioned multiple times in an article it will only be listed here once.
  • allmentionedthemes: This is a semicolon-delimited concatenated list of all of the themes associated with this location across all of the coverage mentioning that location during that given 15 minute interval. To conserve disk space, if a theme is mentioned multiple times in an article it will only be listed here once.
  • allurldomains: This is a semicolon-delimited concatenated list of all of the domains from which articles were monitored mentioning this location during that given 15 minute interval. To conserve disk space, if multiple articles were found from the same outlet, the domain will still only be listed here once.
  • avgurltone: This is the average tone of all coverage mentioning this location during that given 15 minute interval.
  • sumtotalmentions: This is the total number of monitored articles mentioning this location during that given 15 minute interval.
  • sumtotalwords: This is the sum total number of words of all monitored articles mentioning this location during that given 15 minute interval.
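To make the collapsing concrete, here is a rough sketch that builds rows of this shape from per-article mentions.  It is an illustration, not GDELT's actual pipeline: the input keys `coords` and `seen_at` are hypothetical (the Hourly file's documented fields do not include a timestamp), and de-duplicating names and themes across the whole interval is a simplifying assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def split_list(value):
    """Split a semicolon-delimited field into a list of non-empty items."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]


def interval_start(ts):
    """Round a datetime down to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def collapse(mentions):
    """Collapse mentions into one row per (location, 15-minute interval).

    Each mention is a dict with the hypothetical keys `coords` (lon, lat) and
    `seen_at` (datetime), plus the Hourly fields mentionednames,
    mentionedthemes, urldomain, urltone, and urlwordcnt.
    """
    groups = defaultdict(list)
    for m in mentions:
        groups[(tuple(m["coords"]), interval_start(m["seen_at"]))].append(m)

    rows = []
    for (coords, start), items in sorted(groups.items(), key=lambda kv: kv[0]):
        names, themes, domains = set(), set(), set()
        for m in items:
            names.update(split_list(m.get("mentionednames")))
            themes.update(split_list(m.get("mentionedthemes")))
            domains.add(m["urldomain"])
        rows.append({
            "coords": coords,
            "interval_start": start.isoformat(),
            "allmentionednames": ";".join(sorted(names)),
            "allmentionedthemes": ";".join(sorted(themes)),
            "allurldomains": ";".join(sorted(domains)),
            "avgurltone": sum(m["urltone"] for m in items) / len(items),
            "sumtotalmentions": len(items),
            "sumtotalwords": sum(m["urlwordcnt"] for m in items),
        })
    return rows
```

Sorting the rows by location and interval start keeps the output deterministic, which is convenient when feeding it to an animation layer that steps through time.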