GDELT DOC 2.0 API Debuts!

We are incredibly excited to announce today the debut of the new GDELT 2.0 DOC API, which is our full text search API. A year and a half after the unveiling of our first full text search API on Christmas Day 2015, our new 2.0 API builds upon all of the lessons we've learned from that first API and all of the requests we've heard from all of you.

Perhaps the two biggest changes are that the API now searches a rolling window of the last 3 months of coverage, rather than just the last 24 hours as in the original API, and that it now includes all of the images processed by the Visual Global Knowledge Graph (VGKG). For the first time you can both perform near-term longitudinal analyses and search for images based on the objects and activities they depict! In a nod to the intense demand we've heard from all of you for more seamless integration with web-first workflows and visualizations, the new API also supports JSON and JSONP output formats!

Search A Three Month Rolling Window

One of the most-heard requests from all of you was for our search API to break the 24 hour barrier and enable you to search over much larger time periods. Thus, the new API now searches a rolling window of the last 3 months (quarter year) of all global online news coverage monitored by GDELT! You can narrow your search to any time range within the last 3 months, meaning you can still search just the last 24 hours if you want. For those more interested in longitudinal trends, we are very excited to see what you are able to do with this new historical search capability!
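
As a rough illustration of narrowing that window, the sketch below formats values for the TIMESPAN parameter documented later in this post (a bare number for minutes, or a number followed by "h", "d", "w" or "m"). The helper name `timespan_value` is hypothetical and not part of the API; it only builds the string and enforces the 15 minute minimum described below.

```python
def timespan_value(amount, unit):
    """Format a TIMESPAN value such as "24h" or "2w" (hypothetical helper)."""
    # Suffixes follow the TIMESPAN description: minutes take no suffix.
    suffixes = {"minutes": "", "hours": "h", "days": "d", "weeks": "w", "months": "m"}
    if unit not in suffixes:
        raise ValueError("unknown unit: " + unit)
    if amount <= 0 or (unit == "minutes" and amount < 15):
        raise ValueError("timespan must be at least 15 minutes")
    return str(int(amount)) + suffixes[unit]


print(timespan_value(24, "hours"))    # 24h
print(timespan_value(30, "minutes"))  # 30
```

The helper does not check the 3 month upper bound, so a value such as 6 months would still need to be rejected by the caller or by the API itself.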

Search The World's News Imagery

As with our new GEO 2.0 API, the DOC API offers seamless searching of the GDELT Visual Global Knowledge Graph (VGKG) deep learning global news imagery cataloging. Now you can search for all news images depicting fire or flooding or containing the Red Cross logo or mentioning Donald Trump in the caption and more! To our knowledge this new API represents the first global-scale deep learning-powered image search engine ever created, allowing you to explore the ever-more-critical visual narratives of the world's news coverage.

Search Across 65 Languages

One of the most powerful aspects of the DOC 2.0 API is that you can search across all 65 machine translated languages supported by GDELT using English keywords/phrases as your search terms. GDELT's Translingual infrastructure machine translates 100% of all monitored coverage in 65 languages comprising 98.4% of GDELT's daily non-English monitoring volume. To our knowledge this is one of the largest initiatives in the world to mass machine translate global news coverage in realtime. In short, GDELT monitors news coverage from across the world, machine translates all of the coverage it sees in 65 of those languages into English and then allows you to search those machine translations. This allows you to "look across languages" and find all global coverage of your topic regardless of the language it was published in – an absolutely critical element in allowing you to peer deeply into local narratives and perspectives.

Instant Embeddable Visualizations And JSON

Creating powerful interactive browser-based visualizations takes a lot of effort and so we've done all of the hard work for you and created advanced visualizations for each of the API's output modes that are custom designed to be dropped into your own web pages via iframe embedding. By just inserting an iframe into your page and setting its URL to this API you are able to instantly embed a live-updating advanced visualization that reflects global coverage from across the world in 65 languages using some of the most advanced machine learning and deep learning algorithms in the world. For those who want to create their own interactive visualizations and use the API just as a data source, we now support JSON and JSONP output formats, which also makes it trivial to import the API's data into most modern statistical and data mining toolkits for further analysis. We also set the CORS ACAO header to the wildcard "*" and add additional headers to make embedding as seamless as possible.
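
To make the iframe approach concrete, here is a minimal sketch that generates an embeddable `<iframe>` tag for a given query and mode. The endpoint address, the lowercase parameter names and the helper `iframe_snippet` are all assumptions for illustration, since this post does not spell out the base URL.

```python
from html import escape
from urllib.parse import urlencode

# Assumed endpoint; adjust if the API lives at a different address.
BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc"


def iframe_snippet(query, mode, width="100%", height=500):
    """Return an <iframe> tag pointing at the API's HTML output (hypothetical helper)."""
    src = BASE_URL + "?" + urlencode({"query": query, "mode": mode, "format": "html"})
    # escape() turns the "&" separators into "&amp;" so the attribute is valid HTML.
    return '<iframe src="{}" width="{}" height="{}" frameborder="0"></iframe>'.format(
        escape(src, quote=True), escape(str(width), quote=True), int(height)
    )


print(iframe_snippet("(clinton OR sanders OR trump)", "timelinevol"))
```

Pasting the printed tag into a page should be all that is needed for the HTML output modes, since the visualization itself is rendered by the API.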

QUICK START EXAMPLES

Here are some really simple examples to get you started using the API!
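
As one possible starting point, the sketch below assembles a few request URLs from the QUERY, MODE and FORMAT parameters documented in the next section. The endpoint address and the helper name `build_doc_url` are assumptions, as is the use of lowercase parameter names and values; the query strings themselves are taken from the operator examples in this post.

```python
from urllib.parse import urlencode

# Assumed endpoint; this post does not spell out the base URL.
BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_url(query, mode="artlist", fmt="html", **extra):
    """Assemble a DOC API URL (hypothetical helper, not part of the API)."""
    params = {"query": query, "mode": mode, "format": fmt}
    params.update(extra)
    # urlencode percent-encodes quotes, colons and spaces in the query value.
    return BASE_URL + "?" + urlencode(params)


examples = {
    "phrase": build_doc_url('"donald trump"'),
    "or_block": build_doc_url("(clinton OR sanders OR trump)", mode="timelinevol"),
    "exclude_language": build_doc_url("theme:TERROR -sourcelang:spanish", fmt="json"),
    "image_tag": build_doc_url('imagetag:"safesearchviolence"', mode="imagecollage"),
}

for name, url in examples.items():
    print(name, url)
```

All query operators travel inside the single `query` value, separated by spaces, which is why the helper takes the whole query as one string rather than one keyword argument per operator.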

FULL DOCUMENTATION

The GDELT DOC 2.0 API is accessed via a simple URL with the following parameters. Under each parameter is the list of operators that can be used as the value of that parameter.

  • QUERY. This contains your search query and supports keyword and keyphrase searches, OR statements and a variety of advanced operators. NOTE – all of the operators below must be used as part of the value of the QUERY field, separated by spaces, and cannot be used as URL parameters on their own.
    • "". Anything found inside of quote marks is treated as an exact phrase search. Thus, you can search for "Donald Trump" to find all matches of his name.
      • "donald trump"
    • (a OR b). You can specify a list of keywords to be boolean OR'd together by enclosing them in parentheses and placing the capitalized word "OR" between each keyword or phrase. Boolean OR blocks cannot be nested at this time. For example, to search for mentions of Clinton, Sanders or Trump, you would use "(clinton OR sanders OR trump)".
      • (clinton OR sanders OR trump)
    • -. You can place a minus sign in front of any operator, word or phrase to exclude it. For example "-sourcelang:spanish" would exclude Spanish language results from your search.
      • -sourcelang:spanish
    • Domain. Returns all coverage from the specified domain. Follow by a colon and the domain name of interest. Search for "domain:cnn.com" to return all coverage from CNN.
      • domain:cnn.com
    • DomainIs. This is identical to the main "Domain" operator above, but requires an exact match, allowing searching for common short domains like "un.org". For example, when searching for "domain:un.org" many other domains that end in "un.org" are returned like "catholicsun.org". Using this option you can restrict to a precise match, allowing you to return only articles from the "un.org" domain.
      • domainis:un.org
    • ImageFaceTone. Searches the average "tone" of human facial emotions in each image. Only human faces that appear large enough in the image to accurately gauge their facial emotion are considered, so large crowd photos where it is difficult to see the emotion of people's faces may not be scored accurately. The tone score of an average photograph typically ranges from +2 to -2. To search for photos where visible people appear to be sad, search "imagefacetone<-1.5". Only available in any of the "image" modes.
      • imagefacetone<-1.5
    • ImageNumFaces. This searches the total number of foreground human faces in the image. Typically only unobstructed human faces facing toward the camera and in the foreground of the image are counted – large crowd scenes will not be counted properly. Use this to identify images depicting a certain number of people in the foreground of the photo. You can search for "<" less than, ">" more than or "=" – searching "imagenumfaces=3" will identify images with three human faces, while "imagenumfaces>5" will return images with more than 5 human faces. Only available in any of the "image" modes.
      • imagenumfaces>3
    • ImageOCRMeta. This searches a combination of the results of OCR performed on the image in 80+ languages (to extract any text found in the image, including background text like storefronts and signage), all metadata embedded in the image file itself (EXIF, etc) and the textual caption provided for the image. To search for images of a specific event, such as "mobile congress" you would use this field, since that information would most likely either be found in signage in the background of the image, provided in the EXIF metadata in the image or listed in the caption under the image. The search parameter for this field must always be enclosed in quote marks, even when searching for a single word like "imageocrmeta:"zika"". Only available in any of the "image" modes.
      • imageocrmeta:"zika"
    • ImageTag. Every image processed by GDELT is assigned one or more topical tags from a universe of more than 10,000 objects and activities recognized by Google's algorithms. This is the primary and most accurate way of searching global news imagery monitored by GDELT, as these tags represent the ground truth of what is actually depicted in the image itself, whereas other fields like "imageocrmeta" and "imagewebtag" reflect metadata and caption information provided by others about the image. Always remember that these tags are assigned 100% by computer and thus you will always find some error in the results. You can find a list of all tags appearing in at least 100 images over the past year (Image Tag Lookup) – in addition the two special tags "safesearchviolence" and "safesearchmedical" can also be used. Searching for "imagetag:"safesearchviolence"" will return violent images, for example. Values must be enclosed in quote marks. Only available in any of the "image" modes.
      • imagetag:"safesearchviolence"
    • ImageWebCount. Every image processed by GDELT is run through the equivalent of a reverse Google Images search that searches the web to see if the image has ever appeared anywhere else on the web that Google has seen. Up to the first 200 web pages where the image has been seen are returned. This operator allows you to screen for popular versus novel images – searching for "imagewebcount<10" will search for relatively novel images while "imagewebcount>100" will return images that appear widely online. Note that this records only the number of pages that Google has seen the image on, not the number of sites, meaning that if, for example, CNN uses a single image widely in its reporting of a breaking news event and publishes many articles on the event with the same image, this count will be high for that image, even though it is a novel image. Only available in any of the "image" modes.
      • imagewebcount<10
    • ImageWebTag. Every image processed by GDELT is run through the equivalent of a reverse Google Images search that searches the web to see if the image has ever appeared anywhere else on the web that Google has seen. The system then takes every one of those appearances from across the web and looks at all of the textual captions appearing beside the image and compiles a list of the major topics used to describe the image across the web. This offers tremendous descriptive advantage in that you are essentially "crowdsourcing" the key topics of the image by looking at how it has been described across the web. Values must be enclosed in quote marks. Only available in any of the "image" modes. You can access a list of all tags appearing in at least 100 images (Image WebTag Lookup).
      • imagewebtag:"drone"
    • SourceCountry. Searches for articles published in outlets located in a particular country. This allows you to narrow your scope to the press of a single country. For countries with spaces in their names, type the full name without the spaces (like "sourcecountry:unitedarabemirates" or "sourcecountry:saudiarabia"). You can also use their 2-character FIPS country code (Country Lookup).
      • sourcecountry:france
    • SourceLang. Searches for articles originally published in the given language. Query keywords are still matched against the English translations of all coverage (see SEARCHLANG to search native text instead), but this operator limits your search to articles published in a particular language. Using this operator by itself returns coverage in that language across all topics. Search for "sourcelang:spanish" to return only Spanish language coverage. You can also specify the language's three-character code. All 65 machine translated languages are supported (Languages Lookup).
      • sourcelang:spanish
    • Theme. Searches for any of the GDELT Global Knowledge Graph (GKG) Themes. GKG Themes offer a more powerful way of searching for complex topics, since they can include hundreds or even thousands of different phrases or names under a single heading. To search for coverage of terrorism, use "theme:terror". You can find a list of all themes that have appeared in at least 100 articles over the past two years (GKG Theme Lookup).
      • theme:TERROR
    • Tone. Allows you to filter for only articles above or below a particular tone score (i.e. more positive or more negative than a certain threshold). To use, specify either a greater than or less than sign and a positive or negative number (either an integer or floating point number). To find fairly positive articles, search for "tone>5" or to search for fairly negative articles, search for "tone<-5".
      • tone<-5
    • ToneAbs. The same as "Tone" but ignores the positive/negative sign and lets you simply search for high emotion or low emotion articles, regardless of whether they were happy or sad in tone. Thus, search for "toneabs<1" for fairly neutral articles or search for "toneabs>10" for fairly emotional articles.
      • toneabs>10
  • MODE. This specifies the specific output you would like from the API, ranging from timelines to word clouds to article lists.
    • ArtList. This is the most basic output mode and generates a simple list of news articles that matched the query. In HTML mode each article is displayed in a table row with its social sharing image (if available) on the left, followed by the article title, source country, language and publication date. RSS output format is only available in this mode.
    • ArtGallery. This displays the same information as the "ArtList" mode, but does so using a "high design" visual layout suitable for creating magazine-style collages of matching coverage. Only articles containing a social sharing image are included.
    • ImageCollage. This displays all matching images that have been processed by the GDELT Visual Global Knowledge Graph (VGKG), which runs each image through Google's Cloud Vision API deep learning image cataloging. If your query does not contain any image-related search terms, this mode will return a list of all VGKG-processed images that were contained in the body of matching articles, while if your search included image terms, only matching images will be shown. Thus, this mode is most relevant when used with the various image-related query terms. Each image is provided with a link to the article containing it. Note that the document extraction system used by GDELT may on occasion make mistakes and associate an image with a news article in which it appeared only as an inset or unrelated footer, though this is usually rare. This mode is most useful for understanding the visual portrayal of your search.
    • ImageCollageInfo. This yields the same output as the ImageCollage option, but adds four additional pieces of information to each image: 1) the number of times (up to 200) it has been seen before on the open web (via a reverse Google Images search), 2) a list of up to 6 of those web pages elsewhere on the web where the image was found in the past, 3) the date the photograph was captured according to the image's internal metadata (EXIF/etc), and 4) a warning if the image's embedded date metadata suggests the photograph was taken more than 72 hours prior to it appearing in the given article. Using this information you can rapidly triage which of the returned images are heavily-used images and which are novel images that have never been found anywhere on the web before by Google's crawlers. (You can also use the "imagewebcount" query term above to restrict your search to just images which have appeared a certain number of times.) Only a relatively small percent of news images contain an embedded capture datestamp that documents the date and time the image was taken or created, and it is not always accurate. Where available, though, it can offer a powerful indicator that a given image may be older than it appears, and for applications that rely on filtering for only novel images (such as crisis mapping image cataloging), it can be used as a signal to perform further verification on an image.
    • ImageGallery. This displays most of the same information as the "ImageCollageInfo" mode (though it does not include the embedded date warning), but does so using a "high design" visual layout suitable for creating magazine-style collages of matching coverage.
    • ImageCollageShare. Instead of returning VGKG-processed images, this mode returns a list of the social sharing images found in the matching news articles. Social sharing images are those specified by an article to be shown as its image when shared via social media sites like Facebook and Twitter. Not all articles include social sharing images and the images may sometimes only be the logo of the news outlet or not representative of the article contents, but in general they offer a reasonable visual summary of the core focus of the article and especially how it will appear when shared across social media platforms.
    • TimelineVol. This is the most basic timeline mode and returns the volume of news coverage that matched your query by day/hour/15 minutes over the search period. Since the total number of news articles published globally varies so much through the course of a day and through the weekend and holiday periods, the API does not return a raw count of matched articles, but instead divides the number of matching articles by the total number of all articles monitored by GDELT in each time step. Thus, the timeline reports volume as a percentage of all global coverage monitored by GDELT. For time spans of less than 72 hours, the timeline uses a time step of 15 minutes to provide maximum temporal resolution, while for time spans from 72 hours to one week it uses an hourly resolution and for time spans of greater than a week it uses a daily resolution. In HTML mode the timeline is displayed as an interactive browser-based visualization.
    • TimelineVolInfo. This is identical to the main TimelineVol mode, but for each time step it displays the top 10 most relevant articles that were published during that time interval. Thus, if you see a sudden spike in coverage of your topic, you can instantly see what was driving that coverage. In HTML mode a popup is displayed over the timeline as you mouse over it and you can click on any of the articles to view them, while in JSON and CSV mode the article list is output as part of the file.
    • TimelineTone. Similar to the main TimelineVol mode, but instead of coverage volume it displays the average "tone" of all matching coverage, from extremely negative to extremely positive.
    • TimelineLang. Similar to the TimelineVol mode, but instead of showing total coverage volume, it breaks coverage volume down by language so you can see which languages are focusing the most on a topic. Note that the GDELT APIs currently only search the 65 machine translated languages supported by GDELT, so stories trending in unsupported languages will not be displayed in this graph, but will likely be captured by GDELT as they are cross-covered in other languages. With the launch of GDELT3 later this summer, the resolution and utility of this graph will increase dramatically.
    • TimelineSourceCountry. Similar to the TimelineVol mode, but instead of showing total coverage volume, it breaks coverage volume down by source country so you can see which countries are focusing the most on a topic. Note that GDELT attempts to monitor as much media as possible in each country, but smaller countries with less developed media systems will necessarily be less represented than larger countries with massive local press output. With the launch of GDELT3 later this summer, the resolution and utility of this graph will increase dramatically.
    • ToneChart. This is an extremely powerful visualization that creates an emotional histogram showing the tonal distribution of coverage of your query. All coverage matching your query over the search time period is tallied up and binned by tone, from -100 (extremely negative) to +100 (extremely positive), though typically the actual range will be from -20 to 20 or less. Articles in the -1 to +1 bin tend to be more neutral or factually-focused, while those on either extreme tend to be emotionally-laden diatribes. Typically most sentiment dashboards display a single number representing the average of all coverage matching the query, à la "The average tone of Donald Trump coverage in the last week is -7". Such displays are not very informative since it's unclear what precisely "-7" means in terms of tone: whether most coverage clustered around -7, or whether there was a lot of extremely negative and extremely positive coverage that averaged out to -7 with no actual coverage around that tonal range. By displaying tone as a histogram you are able to see the full distributional curve, including whether most coverage clusters around a particular range, whether it has an exponential or bell curve, etc. In HTML mode you can mouse over each bar to see a popup with the top 10 most relevant articles in that tone range and click on any of the headlines to view them.
    • WordCloudEnglish. This mode takes a selection of the top most relevant articles relating to your query, breaks their English translations/text into discrete words and displays a word cloud of up to the top 200 most frequent words that appeared in your coverage (common stop words are automatically removed). This is a powerful way of understanding the topics and words dominating the relevant coverage and suggesting additional contextual search terms to narrow or evolve your search. Note that if there are too few matching articles for your query, the word cloud may be blank.
    • WordCloudNative. This is identical to the WordCloudEnglish mode, but instead of making a word cloud of the English translations/text of each article, it splits their original native text into words and uses those for the word cloud. Languages which do not use spaces to separate words, such as Chinese, Japanese, Thai and Vietnamese, are segmented into words using machine learning algorithms via GDELT Translingual. Note that this mode excludes all English language coverage from consideration, meaning that if your search returns only English language coverage, this word cloud will not generate any results.
    • WordCloudTheme. This is identical to the WordCloudEnglish mode, but instead of the article text words, this constructs a histogram of the GDELT GKG Themes assigned to each article, which offers a higher-level semantic view of the major topics dominating the coverage.
    • WordCloudImageTags. This is identical to the WordCloudEnglish mode, but instead of the article text words, this mode takes all of the VGKG-processed images found in the matching articles (or which matched any image query operators) and constructs a histogram of the top topics assigned by Google's deep learning neural network algorithms as part of the Google Cloud Vision API.
    • WordCloudImageWebTags. This is identical to the WordCloudImageTags mode, but instead of using the tags assigned by Google's deep learning algorithms, it uses the Google knowledge graph topical taxonomy tags assigned by the Google Cloud Vision API's Web Annotations engine. This engine performs a reverse Google Images search on each image to locate all instances where it has been seen on the open web, examines the captions of all of those instances of the image and compiles a list of topical tags that capture the contents of those captions. In this way this field offers a far more powerful and higher resolution understanding of the primary topics and activities depicted in the image, including context that is not visible in the image, but relies on the captions assigned by others, whereas the WordCloudImageTags field displays the output of deep learning algorithms considering the visual contents of the image.
  • FORMAT. This controls what file format the results are displayed in. Not all formats are available for all modes.  To assist with website embedding, the CORS ACAO header for all output of the API is set to the wildcard "*", permitting universal embedding.
    • HTML. This is the default mode and returns a browser-based visualization or display. Some displays, such as word clouds, are static images, some, like the timeline modes, result in interactive clickable visualizations, and some result in simple HTML lists of images or articles. The specific output varies by mode, but all are intended to be displayed directly in the browser in a user-friendly intuitive display and are designed to be easily embedded in any page via an iframe.
    • CSV. This returns the requested data in comma-delimited (CSV) format. The specific set of columns varies based on the requested output mode. Note that since some modes return multilingual content, the CSV is encoded as UTF8 and includes the UTF8 BOM to work around Microsoft Excel limitations handling UTF8 CSV files.
    • RSS. This output format is only available in ArtList mode and returns the list of matching article URLs and titles in RSS 2.0 format. This makes it possible to display the results using any standard RSS reader. It also makes it seamless for web archives to create tailored archival feeds to preserve news coverage on certain topics or meeting certain criteria.
    • RSSArchive. This special format is also only available in ArtList mode and extends the standard RSS output by including both the main article URL and its alternative mobile or AMP version, if available, as a separate item. If no mobile versions of the search result articles are available, the output of this format will be identical to the standard RSS output format. For any article in the search results that had an alternative mobile or AMP edition, a second item will appear in the RSS feed for the mobile/AMP version. If both an AMP version and a mobile version of the page are available, only the AMP version will be returned. This format is intended for use by web archives to create tailored feeds that preserve both the desktop and mobile versions of matching coverage, given that mobile versions are often different from their desktop counterparts. By consuming this feed as a data source, web archives can automatically ensure they are capturing both desktop and mobile experiences of matching content.
    • JSON. This returns the requested data in UTF8 encoded JSON. The specific fields vary by output mode.
    • JSONP. This mode is identical to "JSON" mode, but accepts an additional parameter in the API URL "callback=XYZ" (if not present defaults to "callback") and wraps the JSON in that callback to return JSONP compliant JavaScript code.
  • SEARCHLANG. By default all query keywords/keyphrases are considered to be in English and searches are conducted only of the original English text and English machine translations of all articles. You can alternatively provide your search terms in any of the 65 machine translated languages supported by GDELT via their three-character language code (Languages Lookup). If this parameter is provided, all keywords/keyphrases will be considered to be in this language and the API will search only the original native text of all articles published in that language. Only one language is permitted per search. This is different from the "sourcelang" query parameter in that sourcelang narrows the search to articles published in a particular language, but still expects all keywords/keyphrases to be in English and only searches the English translation/original text of each article. Note that this parameter does not affect the "imagewebtag", "imagetag" and "theme" query options, which all search in English.
  • TIMESPAN. By default the DOC API searches the last 3 months of coverage monitored by GDELT. You can narrow this range by using this option to specify the number of months, weeks, days, hours or minutes (minimum of 15 minutes). The API then only searches documents published within the specified timespan backwards from the present time. If you would instead like to specify the precise start/end time of the search instead of an offset from the present time, you should use the STARTDATETIME/ENDDATETIME parameters.
    • Minutes. Specify a number by itself to provide the timespan in minutes.
    • Hours. Specify a number followed by "h" or "hours" to provide the timespan in hours.
    • Days. Specify a number followed by "d" or "days" to provide the timespan in days.
    • Weeks. Specify a number followed by "w" or "weeks" to provide the timespan in weeks.
    • Months. Specify a number followed by "m" or "months" to provide the timespan in months.
  • STARTDATETIME/ENDDATETIME. These parameters allow you to specify the precise start and end date/times to search, instead of using an offset like with TIMESPAN.
    • STARTDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to begin the search – only articles published after this date/time stamp will be considered. It must be within the last 3 months. If you do not specify an ENDDATETIME, the API will search from STARTDATETIME through the present date/time.
    • ENDDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to end the search – only articles published before this date/time stamp will be considered. It must be within the last 3 months. If you do not specify a STARTDATETIME, the API will search from 3 months ago through the specified ENDDATETIME.
  • MAXRECORDS. This option only applies to the ArtList and various ImageCollage modes and is ignored in all other modes. To conserve system resources, in these modes the API returns only up to 75 results by default, but this can be increased to as many as 250 results by using this URL parameter.
  • TIMELINESMOOTH. This option is only available in the various Timeline modes and performs moving window smoothing over the specified number of time steps, up to a maximum of 5. Due to GDELT's high temporal resolution, timeline displays can sometimes capture too much of the chaotic noisy information environment that is the global news landscape, resulting in jagged displays. Use this option to enable moving average smoothing over up to 5 time steps.
  • TRANS. Only available in ArtList mode with HTML output, this embeds a machine translation widget in the results page to seamlessly machine translate all of the article titles into your requested language. Currently only the Google Translate Widget is supported. This means that if your primary language is French, all article titles in your search results across all 65 core languages that GDELT supports will be transparently translated in your browser instantly by Google Translate into French.
    • GoogTrans. Set to "googtrans" to embed the Google Translate Widget, which is the only translation widget presently supported.
  • SORT. By default results are sorted by relevance to your query. Sometimes you may wish to sort by date or tone instead.
    • DateDesc. Sorts results by publication date, displaying the most recent articles first.
    • DateAsc. Sorts results by publication date, displaying the oldest articles first.
    • ToneDesc. Sorts results by tone, displaying the most positive articles first.
    • ToneAsc. Sorts results by tone, displaying the most negative articles first.
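
For the date-based parameters above, the following sketch shows one way to produce STARTDATETIME/ENDDATETIME values in the YYYYMMDDHHMMSS format and to keep MAXRECORDS within its documented ceiling. The helper names are hypothetical, the use of UTC is an assumption (this post does not state the timezone), and 90 days is only a rough stand-in for the 3 month window.

```python
from datetime import datetime, timedelta, timezone

STAMP = "%Y%m%d%H%M%S"  # YYYYMMDDHHMMSS, as described for STARTDATETIME/ENDDATETIME


def time_window_params(start, end, now=None, max_days=90):
    """Return STARTDATETIME/ENDDATETIME values for a window (hypothetical helper).

    max_days=90 approximates "3 months"; the exact cutoff applied by the API
    is not stated in this post. All datetimes are assumed to be UTC-aware.
    """
    now = now or datetime.now(timezone.utc)
    oldest = now - timedelta(days=max_days)
    if start >= end:
        raise ValueError("start must be earlier than end")
    if start < oldest or end > now:
        raise ValueError("window must fall within the last 3 months")
    return {"startdatetime": start.strftime(STAMP), "enddatetime": end.strftime(STAMP)}


def clamp_max_records(n):
    """Keep MAXRECORDS within the documented ceiling of 250."""
    return max(1, min(int(n), 250))


# A fixed "now" keeps the example reproducible.
now = datetime(2017, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(time_window_params(now - timedelta(days=7), now, now=now))
# {'startdatetime': '20170525120000', 'enddatetime': '20170601120000'}
print(clamp_max_records(1000))  # 250
```

Validating the window on the client side is optional, but it avoids sending a request whose dates fall outside the range the API says it can search.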