Announcing Our First API: GKG GeoJSON!

Today we're incredibly excited to announce the official debut of our new GDELT API suite, with our very first API endpoint being a tool to generate GeoJSON files from the GDELT Global Knowledge Graph (GKG) 2.0!  Using this API, you can now create live maps, updated every hour, of any of GDELT's thousands of themes, of a particular person (such as a head of state) or organization, of a particular news outlet, of a particular language, or any combination thereof.  The sky is the limit!

For those of you who aren't familiar with the GDELT Global Knowledge Graph, it processes all of the worldwide news coverage that GDELT monitors every 15 minutes and compiles a list of all of the people, organizations, locations, themes, counts, and emotions, across 65 languages, into a powerful realtime metadata index over global society.  Of course, the problem with so much power is that it can be incredibly intimidating to try and actually use this massive firehose, so we've released our first API to make it possible to create quick maps of the world's news with just a few mouse clicks!

For those of you using CartoDB to map GDELT, this new API now gives you a URL to copy-paste into CartoDB's import dashboard to instantly create a new map from GDELT, updated to the last 15 minutes and covering up to the last 24 hours!  If you have CartoDB's "sync tables" feature enabled on your account (John Snow or greater accounts and all education/research accounts), you can simply click the "sync every hour" button in CartoDB when you import the table to instantly create a map that live updates every hour on the hour without you having to do anything!  We're incredibly excited by the awesome power of our debut API offering to enable a whole new way of creating live rich interactive animated maps of global society as seen through the world's news media, all with just a few mouse clicks – no programming needed!

The API allows you to filter the GKG by keyword or keyphrase (matched against the themes and names fields), by source website domain, by language, or by any combination thereof!  You can make a map of Arabic coverage of food security, a map comparing the BBC to the New York Times, or a map of today's coverage from a specific news outlet or on a specific topic!

To make your searches as relevant as possible, we do an incredible amount of processing behind the scenes.  Using the proximity information contained in the GKG 2.0 files, we assign every mention of a recognized theme, person, or organization to the location mentioned in the article closest to it, arbitrating in the case of multiple mentions in close proximity to multiple locations, and performing windowing and falloff filtering.  What does this mean to you?  In a nutshell, it means that when you search for the GDELT Theme "FOOD_SECURITY", the locations that are returned are those that were mentioned in closest proximity and context with the topic, meaning your map should have as few false positives as possible.  You will still find a certain number of false positives and this approach will eliminate some valid matches, but in the general case should ensure that you get highly relevant results from your searches!
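To give a feel for the idea, here is a deliberately simplified Python sketch of proximity-based assignment.  It is not GDELT's actual implementation: the function name, the character-offset inputs, and the fixed 200-character window are all assumptions made for illustration, and the real arbitration and falloff filtering are more involved.

```python
from typing import Dict, List, Tuple


def assign_to_nearest_location(
    mentions: List[Tuple[str, int]],
    locations: List[Tuple[str, int]],
    window: int = 200,  # hypothetical cutoff, in characters
) -> Dict[str, List[str]]:
    """Attach each (label, offset) mention to the closest location mention.

    Mentions farther than `window` characters from every location are
    dropped, which trades some valid matches for fewer false positives.
    """
    assigned: Dict[str, List[str]] = {name: [] for name, _ in locations}
    for label, offset in mentions:
        best_name = None
        best_dist = None
        for name, loc_offset in locations:
            dist = abs(offset - loc_offset)
            if dist <= window and (best_dist is None or dist < best_dist):
                best_name, best_dist = name, dist
        # Record each label at most once per location.
        if best_name is not None and label not in assigned[best_name]:
            assigned[best_name].append(label)
    return assigned
```

In this toy version a theme mentioned 50 characters from "Cairo" and 750 characters from "Paris" is mapped only to Cairo, and a theme far from both is not mapped at all.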

Some Quick Examples To Get You Started

If you're eager to start and don't care about the technical details, here are some simple queries for creating your first maps!  If you're using CartoDB, go to your CartoDB "Datasets" dashboard on cartodb.com, click the big green "New Dataset" button in the upper right, paste in one of the URLs below, click "Submit", check the option to sync every hour (if your account has "sync tables" enabled), and then click the big green "Connect Dataset" button.  Within a few seconds or tens of seconds (depending on the query), you'll have a live-updating table ready for mapping that automatically refreshes every hour on the hour from now until the end of time!

Hopefully these queries have gotten you off to a quick start!  The rest of this blog post outlines all of the technical detail and the full capabilities of the API.

 

The Technical Details: How to Use the GKG GeoJSON API

NOTE: This section is for the technical folks that want to dig deeply into the API and understand how to use all of its features.

To use the new GKG GeoJSON API, you simply load the URL "https://api.gdeltproject.org/api/v1/gkg_geojson" into a tool like CartoDB, appending the parameters you want from the list below, and a few seconds later it will return a GeoJSON file containing the requested results.  There is no authentication or fancy footwork needed!  For example, to search for coverage of the GDELT Theme "FOOD_SECURITY" over the past hour, just use the URL "https://api.gdeltproject.org/api/v1/gkg_geojson?QUERY=FOOD_SECURITY".  To search for only Arabic-language FOOD_SECURITY coverage, use the URL "https://api.gdeltproject.org/api/v1/gkg_geojson?QUERY=lang:Arabic,FOOD_SECURITY".  It's that easy!
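If you would rather assemble these URLs in code than by hand, a minimal Python sketch might look like the following.  The helper name search_url is hypothetical, and leaving ":" and "," unescaped is a choice made here so that the output matches the example URLs above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gdeltproject.org/api/v1/gkg_geojson"


def search_url(query: str) -> str:
    """Return the API URL for a QUERY string.

    ':' and ',' are kept literal so operators such as 'lang:' and the
    comma-separated term list stay readable; spaces become %20.
    """
    return BASE_URL + "?" + urlencode({"QUERY": query}, safe=":,", quote_via=quote)


if __name__ == "__main__":
    print(search_url("FOOD_SECURITY"))
    print(search_url("lang:Arabic,FOOD_SECURITY"))
```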

You can paste the URL into your browser to see what the GeoJSON stream looks like and, when you're happy with it, paste it into CartoDB using the "New Dataset From URL" option!  If you want to download the GeoJSON into your own application, just fetch the URL above using any standard download tool (it is accessed via a standard HTTP GET).
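For those pulling the feed into their own application, here is a small Python sketch using only the standard library.  It assumes the response is a GeoJSON FeatureCollection whose features are Point geometries with the requested fields under "properties"; the function names are hypothetical, and the exact layout of the feed should be checked against a real response.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def fetch_geojson(url: str, timeout: float = 60.0) -> Dict[str, Any]:
    """Download and decode a GeoJSON document with a plain HTTP GET."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def feature_rows(geojson: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten Point features into one dict per record.

    GeoJSON stores coordinates as [longitude, latitude]. Features without
    coordinates are skipped.
    """
    rows = []
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        coords = geometry.get("coordinates")
        if not coords:
            continue
        row = {"lon": coords[0], "lat": coords[1]}
        row.update(feature.get("properties") or {})
        rows.append(row)
    return rows


# Example (performs a network request):
# data = fetch_geojson(
#     "https://api.gdeltproject.org/api/v1/gkg_geojson?QUERY=FOOD_SECURITY")
# for row in feature_rows(data):
#     print(row)
```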

There are two primary output modes.  The "article" mode operates at the article level, with each record representing a location mentioned in a specific article.  This is most useful for creating clickable map layers where you want a user to be able to click on a location and get back a link to the article mentioning that location.  Conversely, the "location+time" mode is optimized for creating animation layers.  It collapses all coverage in a given 15-minute interval by location, with each record representing a specific location in a specific 15-minute time period.  Thus, if 50 articles all mentioned Paris, France in a given 15-minute interval, there will be a single record in the GeoJSON for "Paris, France" with that timestamp and details about all of the coverage that mentioned Paris during that time interval.  This can be used for clickable maps with creative SQL, but is primarily aimed at producing compact GeoJSON files optimized for animation, where the goal is to show change over time rather than to create a clickable interactive map layer.
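The difference between the two modes can be illustrated with a short Python sketch that collapses article-level records into location+time records.  This is only an illustration of the grouping idea, not the API's own code: the input record layout (a "seen" datetime per article) and the function names are assumptions, while the output keys borrow the field names described later in this post.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, List


def floor_to_15_minutes(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def collapse_by_location_time(articles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group article records by (location name, 15-minute interval)."""
    groups = defaultdict(list)
    for article in articles:
        key = (article["name"], floor_to_15_minutes(article["seen"]))
        groups[key].append(article["url"])
    return [
        {
            "name": name,
            "urlpubtimedate": bucket,
            "sumtotalmentions": len(urls),  # articles in this interval
            "allurls": "\t".join(urls),  # tab-delimited URL list
        }
        for (name, bucket), urls in groups.items()
    ]
```

Fifty article records for "Paris, France" that fall in the same 15-minute interval would come out of this function as a single record with a sumtotalmentions of 50.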

The available parameters are listed below.  Note that they must be specified in all capital letters.

  • QUERY.  This is the actual query to use to filter the GKG.  If you leave this blank, all records are returned, which can rapidly generate extremely large files!  Individual query terms are separated by commas, and words separated by spaces are treated as phrase matches (they should NOT be surrounded by quote marks).  Other than the special operators listed below, all keywords/keyphrases are searched case insensitively against the list of themes and names for each article.  NOTE that this is NOT fulltext search: keyword/keyphrase searches only examine GDELT Themes and person and organization names.
    • "domain:"  This query operator allows you to search for coverage from a specific domain and can be combined with other operators.  Searching for "domain:bbc.co.uk" returns all coverage from the BBC's website.  Searching for "domain:bbc.co.uk,TERROR" returns all BBC coverage discussing terrorism.  Matches are case insensitive, but require an exact match.  Thus, "domain:postimees.ee" will NOT return coverage from "setomaa.postimees.ee" (see the "domainlike" operator below for this type of search).  Only ONE "domain" operator is permitted per query.
    • "domainlike:"  This query operator allows you to search for a phrase WITHIN the domain name of an article and does NOT require an exact match.  Thus, "domainlike:postimees.ee" WILL return coverage from both "postimees.ee" and "setomaa.postimees.ee".  Only ONE "domainlike" operator is permitted per query.
    • "lang:"  This query operator allows you to search for coverage written in a specific language.  You can specify the language either by its English name (like "lang:Persian") or by its three-character ISO 639-2 language code.  If you aren't getting the results you expect, try specifying the ISO code of the language (like "lang:fas").  Thus, searching for "lang:Arabic" or "lang:nep" will return only coverage written in Arabic or Nepali, respectively.  Combining this with other operators allows you to perform searches like "lang:Arabic,FOOD_SECURITY" to find only Arabic-language coverage of food security issues.  Only ONE "lang" operator is permitted per query.
    • "geoname:"  This query operator allows you to filter the returned locations to a specific country or first order administrative division (ADM1).  For example, searching for "geoname:Syria" returns only locations in Syria, while "geoname:Texas, United States" returns only locations in Texas in the United States.  This is useful for creating minimized country-specific feeds in cases where disk space is of absolute essence, or where you are importing into other analytic tools and wish to restrict to a specific geographic locale.  Only ONE "geoname" operator is permitted per query.
    • "excgeoname:"  This is the opposite of "geoname" above, and allows you to filter the returned locations to EXCLUDE a specific country or ADM1.  This is often used to exclude matches from the United States to minimize the resulting file size when generating internationally-focused maps.  Only ONE "excgeoname" operator is permitted per query.
    • All other words/phrases are searched case insensitively against the THEMES and PERSONNAMES+ORGNAMES fields of the GKG.
  • TIMESPAN.  This specifies how far back in the GKG to search, in minutes, from a minimum of 15 minutes to a maximum of 24 hours (1440 minutes).  You can specify any value in that range, but it will be rounded up/down to the nearest 15 minute resolution.  If not specified, this defaults to 1 hour (60 minutes).  This allows you to create a feed that only returns the most recent hour or most recent 6 hours or the entire latest 24 hours, etc.
  • OUTPUTTYPE.  The API supports two output formats (see earlier).  A value of "1" here generates "article" output, while a value of "2" generates "location+time" output.
  • OUTPUTFIELDS.  The specific list of fields to include in each record.
    • name.  This includes the human-readable name of the location.  In cases where a location has multiple common names/transliterations, this reflects the specific name used to refer to the location in the given article.  In the case of location+time output type, if there are multiple names used for the location, one of them is selected at random.  The value appears as "name" in the GeoJSON.
    • geores.  This includes an integer representing the geographic resolution of the location.  The values are 1=country, 2=US State / non-US ADM1, 3=city/landmark.  This can be used in your map to remove non-city-level matches or perform other steps on the data.  The value appears as "geores" in the GeoJSON.
    • url.  This is the fully-qualified URL of the article.  For article output it will appear as "url" in the GeoJSON.  For location+time output, this will be a tab-delimited list of all of the URLs of articles for that location+time and will appear as "allurls" in the GeoJSON.
    • oneurl.  Only available for location+time output.  In some cases you don't want all articles mentioning a location to be returned – you only want one example article.  It will appear as "oneurl" in the GeoJSON.
    • domain.  Returns the domain name of the article. For article output it will appear as "domain" in the GeoJSON.  For location+time output, this will be a semicolon-delimited list of all of the domains of articles for that location+time and will appear as "allurldomains" in the GeoJSON.
    • sharingimage.  Returns the social sharing image of the article.  For article output it will appear as "urlsocialimage" in the GeoJSON.  For location+time output, it will display the first valid social sharing image from any of the articles mentioning the location and will appear as "oneurlsocialimage" in the GeoJSON.
    • lang.  Returns the three-character ISO 639-2 code of the language the article is written in. For article output it will appear as "urllangcode" in the GeoJSON.  For location+time output, this will be a semicolon-delimited list of all of the languages of articles for that location+time and will appear as "allurllangcodes" in the GeoJSON.
    • themes.  Returns the list of themes appearing in closest proximity to the mention of this location in the article.  In the case that a theme is mentioned multiple times in context with a location mention, it will only appear once in this list.  For article output it will appear as "mentionedthemes" in the GeoJSON.  For location+time output, this will be a semicolon-delimited list of all of the themes appearing in proximity to all mentions of that location for that location+time and will appear as "allmentionedthemes" in the GeoJSON.  Each theme appears only once in this list regardless of the number of times it was mentioned within or across articles.
    • names.  Returns the list of names appearing in closest proximity to the mention of this location in the article.  In the case that a name is mentioned multiple times in context with a location mention, it will only appear once in this list.  For article output it will appear as "mentionednames" in the GeoJSON.  For location+time output, this will be a semicolon-delimited list of all of the names appearing in proximity to all mentions of that location for that location+time and will appear as "allmentionednames" in the GeoJSON.  Each name appears only once in this list regardless of the number of times it was mentioned within or across articles.
    • tone.  Returns the average "tone" of the entire article mentioning this location (tone is assessed at the article level, so it does not distinguish the case where one location is mentioned favorably and another unfavorably in a given article).  By default this is the standard GDELT Tone score, but it can be overridden below to instead reflect one of the thousands of rich emotional dimensions assessed by GDELT GCAM.  For article output it will appear as "urltone" in the GeoJSON.  For location+time output, this will be a numeric average (mean) of the tone values of all articles mentioning that location for that location+time and will appear as "avgurltone" in the GeoJSON.
    • wordcount.  Returns the total length in words of the article. For article output it will appear as "urlwordcnt" in the GeoJSON.  For location+time output, this will be a summed total of the length of all articles for that location+time and will appear as "sumtotalwords" in the GeoJSON.
    • numcounts.  Returns the total number of numeric counts of things mentioned in the article.  This is useful for flagging articles that contain substantial precise numeric detail versus those that are more vague.  For article output it will appear as "urlnumamounts" in the GeoJSON.  For location+time output, this will be a summed total of the number of count mentions across all articles for that location+time and will appear as "sumurlnumamounts" in the GeoJSON.
    • urlpubtimedate.  All records include a field called "urlpubtimedate" in the GeoJSON that records the JSON timestamp (at 15 minute resolution) of when the article was seen and processed by GDELT.  This field is always present and cannot be disabled.
    • sumtotalmentions.  This field is only enabled for location+time format and cannot be disabled for that format.  For each location+time it records the number of articles mentioning that location in that given 15 minute interval.
  • GCAMVAR.  If you have enabled "tone" as one of OUTPUTFIELDS, by default it will return the standard GDELT Tone score, which is a rough approximation of the "positive"/"negative" sentiment of a document that is reasonably robust across the languages that GDELT monitors.  However, to explore more complex emotional landscapes through a geographic lens, you can choose from the entire library of thousands of emotions assessed by the 24 tonal packages used by GDELT GCAM.  Just make sure to include "tone" in the OUTPUTFIELDS list, find an emotional dimension of interest in the GCAM Master Codebook and specify its Variable name here.  For example, to use LIWC's "Anxiety" score (variable "c5.33"), specify "GCAMVAR=c5.33".  To use RID's "EMOTIONS/AGGRESSION" score (variable "c8.2"), specify "GCAMVAR=c8.2".  Since GCAM scores are recorded natively in the GKG as word counts, they are automatically divided by the total number of words in the document to generate a density score suitable for mapping.
  • MAXROWS.  The maximum number of GeoJSON records to return.  This is useful to set a safety limit in case of large result sets and to cap files to stay under the filesize limits of your CartoDB account.  If OUTPUTTYPE is set to "1" (article resolution), this is the number of articles that are returned.  If OUTPUTTYPE is set to "2" (location+time resolution), this is the number of location+time entries to return.  The minimum value is 1 and the maximum is 250,000.  The default value is 30,000.  Be careful when raising this substantially, since large values can yield very large files and you could accidentally generate a CartoDB table too large for your account.
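To show how the QUERY, TIMESPAN, OUTPUTTYPE and MAXROWS parameters fit together, here is a minimal Python sketch of a URL builder.  The helper names are hypothetical.  Taking each operator as its own keyword argument is one way to respect the one-operator-per-query rule, and the rounding and clamping simply mirror the limits described above on the client side (the API applies its own rounding regardless).

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gdeltproject.org/api/v1/gkg_geojson"


def build_query(
    terms: Sequence[str] = (),
    *,
    domain: Optional[str] = None,
    domainlike: Optional[str] = None,
    lang: Optional[str] = None,
    geoname: Optional[str] = None,
    excgeoname: Optional[str] = None,
) -> str:
    """Join operators and keyword terms with commas.

    Each operator is a separate argument, so it can appear at most once.
    """
    operators = {
        "domain": domain,
        "domainlike": domainlike,
        "lang": lang,
        "geoname": geoname,
        "excgeoname": excgeoname,
    }
    parts = [f"{op}:{value}" for op, value in operators.items() if value]
    parts.extend(terms)
    return ",".join(parts)


def normalize_timespan(minutes: int) -> int:
    """Round to the nearest 15 minutes, then clamp to 15..1440."""
    rounded = (minutes + 7) // 15 * 15
    return max(15, min(1440, rounded))


def build_url(query: str, timespan: int = 60, outputtype: int = 1,
              maxrows: int = 30000) -> str:
    """Assemble a request URL; parameter names must be upper case."""
    if outputtype not in (1, 2):
        raise ValueError("OUTPUTTYPE must be 1 (article) or 2 (location+time)")
    params = {
        "QUERY": query,
        "TIMESPAN": normalize_timespan(timespan),
        "OUTPUTTYPE": outputtype,
        "MAXROWS": max(1, min(250000, maxrows)),
    }
    return BASE_URL + "?" + urlencode(params, safe=":,", quote_via=quote)
```

For example, build_url(build_query(["FOOD_SECURITY"], lang="Arabic"), timespan=1440, outputtype=2) describes a location+time feed of Arabic-language food security coverage over the last 24 hours.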