Announcing The GDELT Context 2.0 API

Kalev Leetaru

5 years ago

We're incredibly excited to announce today the release of the GDELT Context 2.0 API! This newest addition to the stable of GDELT APIs joins the DOC 2.0, GEO 2.0, TV 2.0 and TVAI 2.0 APIs!

The Context 2.0 API is functionally very similar to the DOC 2.0 API, with two exceptions: instead of searching documents, it searches individual sentences, and instead of returning only the URL of matching results, it returns a brief snippet of text showing the context of the match. The snippet is typically the sentence that matched the keyword, plus a portion of the sentence before and the sentence after. This makes it possible to understand whether an article mentioning "Covid-19" and "pandemic" together, for example, is a casual reference to the outbreak or a clinical update on the pandemic's spread.

Instead of searching entire documents, this API requires that all search terms appear in the same sentence together. Thus, while the DOC 2.0 API can return articles that mention "Covid-19" in a sentence at the beginning of an article and "pandemic" at the end of the article, the Context 2.0 API requires that all keywords be contained in the same sentence together to ensure maximal relevancy.
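The difference between document-level and sentence-level matching can be illustrated with a small sketch. This is not how the API is implemented; the sentence splitter is deliberately naive, and the function names (`document_match`, `sentence_match`) are hypothetical names chosen for this illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def contains_all(text, terms):
    # Case-insensitive substring check for every term.
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

def document_match(text, terms):
    # DOC-style matching: the terms may appear anywhere in the article.
    return contains_all(text, terms)

def sentence_match(text, terms):
    # Context-style matching: all terms must share a single sentence.
    return any(contains_all(s, terms) for s in split_sentences(text))

article = (
    "Officials discussed Covid-19 testing on Monday. "
    "Schools reopened last week. "
    "Economists say the pandemic reshaped travel."
)
terms = ["Covid-19", "pandemic"]
print(document_match(article, terms))  # True
print(sentence_match(article, terms))  # False
```

The article matches at the document level because both terms occur somewhere in it, but not at the sentence level because no single sentence contains both.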

The field of information science incorporates many areas of research related to information relevancy. We envision powerful new kinds of relevancy rankings that can use this additional contextual information together with machine learning approaches like neural or classical language understanding models to find the most relevant articles in response to a user's query. Using this additional contextual information, we expect relevancy filters that can actually model the response snippets and use them to semantically answer a user's question and route them to the most comprehensive and detailed articles supporting those answers, as well as identify contested narratives in which there are fundamentally opposing answers captured in the global news media. Eventually we hope to be able to incorporate these kinds of advanced relevancy models into the DOC 2.0 API in place of the current date and textual relevancy scores.

A maximum of one matching sentence per article will be returned. This means that if a given article contains multiple sentences that match the query, only a single representative sentence will be returned. The specific sentence selected from each article will be ranked by semantic relevance in textual ranking mode or selected at random in date descending mode and may change from query to query. This filtering process is performed after the query has executed. For example, if a request for 75 articles produces 75 matching sentences, 15 of which come from articles already in the results, only 60 results will actually be returned. Thus, most queries will receive fewer than the requested number of articles.
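The effect of this post-query filtering on result counts can be sketched as follows. This is an illustration only: it keeps the first sentence seen per article, whereas the API selects by relevance or at random, and the `url` and `sentence` keys are hypothetical field names.

```python
def one_sentence_per_article(matches):
    """Keep only the first matching sentence seen for each article URL."""
    seen = set()
    kept = []
    for match in matches:
        if match["url"] not in seen:
            seen.add(match["url"])
            kept.append(match)
    return kept

# 75 sentence-level matches: 60 distinct articles, plus 15 extra sentences
# from articles that are already in the results.
matches = [{"url": f"https://example.com/a{i}", "sentence": f"s{i}"} for i in range(60)]
matches += [{"url": f"https://example.com/a{i}", "sentence": f"extra{i}"} for i in range(15)]

filtered = one_sentence_per_article(matches)
print(len(matches), len(filtered))  # 75 60
```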

As we scale up the new Context 2.0 API, this inaugural release is limited to searching only the past 72 hours and may exclude some articles that are difficult to segment into sentences or which utilize particularly complex grammars. Thus, it represents a subset of coverage searched by the DOC 2.0 API for now. As the API evolves, these limitations will ease.

QUICK START EXAMPLES

Up to 75 matching articles mentioning "ultraviolet (covid OR virus OR coronavirus)" in a single sentence, returned in JSON format.
- https://api.gdeltproject.org/api/v2/context/context?format=html&timespan=24H&query=ultraviolet%20(covid%20OR%20virus%20OR%20coronavirus)&mode=artlist&maxrecords=75&format=json
Up to 75 matching articles mentioning "ultraviolet (covid OR virus OR coronavirus)" in a single sentence with a quote mark, returned in JSON format.
- https://api.gdeltproject.org/api/v2/context/context?format=html&timespan=24H&query=ultraviolet%20(covid%20OR%20virus%20OR%20coronavirus)&mode=artlist&maxrecords=75&format=json&isquote=1
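For programmatic use, URLs like the ones above could be assembled with Python's standard library. The following is a minimal sketch, not an official client: the function name `build_context_url` is hypothetical, the parameter names are copied from the example URLs, and the sketch sends a single `format` parameter.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://api.gdeltproject.org/api/v2/context/context"

def build_context_url(query, timespan="24H", maxrecords=75, fmt="json", isquote=False):
    """Assemble a Context 2.0 API URL from its documented parameters."""
    params = {
        "query": query,
        "mode": "artlist",      # the only mode available at this time
        "format": fmt,
        "timespan": timespan,
        "maxrecords": maxrecords,
    }
    if isquote:
        params["isquote"] = 1
    # quote_via=quote encodes spaces as %20; safe="()" leaves the
    # parentheses of OR blocks unescaped, as in the examples above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="()")

url = build_context_url("ultraviolet (covid OR virus OR coronavirus)")
print(url)
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, and the body parsed according to the requested format.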

FULL DOCUMENTATION

QUERY. This contains your search query and supports keyword and keyphrase searches, OR statements and a variety of advanced operators. NOTE – all of the operators below must be used as part of the value of the QUERY field, separated by spaces, and cannot be used as URL parameters on their own.
  - "". Anything found inside of quote marks is treated as an exact phrase search. Thus, you can search for "Donald Trump" to find all matches of his name.
    - "donald trump"
  - (a OR b). You can specify a list of keywords to be boolean OR'd together by enclosing them in parentheses and placing the capitalized word "OR" between each keyword or phrase. Boolean OR blocks cannot be nested at this time. For example, to search for mentions of Clinton, Sanders or Trump, you would use "(clinton OR sanders OR trump)".
    - (clinton OR sanders OR trump)
  - -. You can place a minus sign in front of any operator, word or phrase to exclude it. For example "-trump" would exclude the word "Trump" from your search.
    - -trump
  - Domain. Returns all coverage from the specified domain. Follow by a colon and the domain name of interest. Search for "domain:cnn.com" to return all coverage from CNN.
    - domain:cnn.com
  - DomainIs. This is identical to the main "Domain" operator above, but requires an exact match, allowing searching for common short domains like "un.org". For example, when searching for "domain:un.org" many other domains that end in "un.org" are returned like "catholicsun.org". Using this option you can restrict to a precise match, allowing you to return only articles from the "un.org" domain.
    - domainis:un.org
  - Near. Allows you to specify a set of keywords that must appear within a given number of words of each other. To use this operator, you specify the word "near", followed by the maximum distance all of the words can appear apart in a given sentence and still be considered a match, a colon, and then the list of words in quote marks. Phrase matching is not supported at this time, so the list of words is treated as a list of individual words that must all appear together within the given proximity. Note that if the words appear in a sentence in a different order than specified in the "near" operator, each ordering difference increments the word distance counted by the "near" operator. (Thus, near10:"donald trump" will return sentences where "trump" appears within 10 words after "donald", but will also return sentences in which "donald" appears within 9 words after "trump".) The distance measure is not precise and can count punctuation and other tokens as "words" as well. It is also important to remember that proximity in a sentence does not necessarily imply that two words are semantically connected to each other. Remember that all words have to appear in the same sentence together.
    - near5:"trump putin"
  - Repeat. Allows you to specify that a given word must appear at least a certain number of times in a sentence to be considered a match. To use this operator, you specify the word "repeat", followed by the number of times the word should appear, followed by a colon and the word itself in quote marks. Only a single word is permitted with this operator; it does not support phrase searches at this time. By limiting results to articles that mention a word multiple times, you can filter to just those articles more likely to actually be about your keyword, rather than merely casually mentioning it. Note that the "repeat" operator only requires that a sentence mention the keyword AT LEAST the requested number of times – a sentence will match even if it mentions the keyword many more times than the requested number.
    - repeat2:"trump"
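Because every operator above lives inside the QUERY value, it can be convenient to compose that value from small helpers. The sketch below is one possible approach; all of the helper names (`phrase`, `any_of`, `exclude`, `domain`, `near`, `repeat`, `combine`) are hypothetical and are not part of the API.

```python
def phrase(text):
    # Exact phrase search: wrap the text in quote marks.
    return f'"{text}"'

def any_of(*terms):
    # Boolean OR block; the documentation says these cannot be nested.
    return "(" + " OR ".join(terms) + ")"

def exclude(term):
    # A leading minus sign excludes a word, phrase or operator.
    return "-" + term

def domain(name, exact=False):
    # "domainis:" requires an exact match; "domain:" also matches longer
    # domains that end in the same string.
    return ("domainis:" if exact else "domain:") + name

def near(distance, *words):
    # Individual words only; phrase matching is not supported.
    joined = " ".join(words)
    return f'near{distance}:"{joined}"'

def repeat(count, word):
    # A single word only; phrases are rejected.
    if " " in word:
        raise ValueError("repeat supports a single word only")
    return f'repeat{count}:"{word}"'

def combine(*parts):
    # Operators are separated by spaces inside the QUERY value.
    return " ".join(parts)

query = combine(phrase("donald trump"), any_of("clinton", "sanders"),
                exclude("putin"), domain("cnn.com"))
print(query)  # "donald trump" (clinton OR sanders) -putin domain:cnn.com
```

The combined string is the value of the QUERY parameter and still needs to be URL-encoded before it is placed in the request URL.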
ISQUOTE. If set to 1, indicates that only sentences that contain a quote mark should be returned, returning quoted phrases. Note that since the Context API returns only single sentences, this may truncate quotations and thus the Global Quotation Graph is a more authoritative dataset, but this allows seamless quotation search within the Context API.
MODE. This specifies the specific output you would like from the API.
- ArtList. There is only one mode at this time and you must specify "artlist".
FORMAT. This controls what file format the results are displayed in. Not all formats are available for all modes. To assist with website embedding, the CORS ACAO header for all output of the API is set to the wildcard "*", permitting universal embedding.
- CSV. This returns the requested data in comma-delimited (CSV) format. The specific set of columns varies based on the requested output mode. Note that since some modes return multilingual content, the CSV is encoded as UTF8 and includes the UTF8 BOM to work around Microsoft Excel limitations handling UTF8 CSV files.
- RSS. This output format is only available in ArtList mode and returns the list of matching article URLs and titles in RSS 2.0 format. This makes it possible to display the results using any standard RSS reader. It also makes it seamless for web archives to create tailored archival feeds to preserve news coverage on certain topics or meeting certain criteria.
- JSON. This returns the requested data in UTF8 encoded JSON. The specific fields vary by output mode.
- JSONP. This mode is identical to "JSON" mode, but accepts an additional parameter in the API URL "callback=XYZ" (if not present defaults to "callback") and wraps the JSON in that callback to return JSONP compliant JavaScript code.
- JSONFeed. This output format is only available in ArtList mode and returns the list of matching article URLs and titles in JSONFeed 1.0 format.
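Since the CSV output begins with a UTF8 BOM, code that parses it should strip the BOM so that it does not end up in the first column name. A minimal sketch using Python's `utf-8-sig` codec follows; the sample payload and its column names (`URL`, `Title`) are placeholders, not the API's actual columns.

```python
import csv
import io

def parse_csv_response(raw_bytes):
    """Decode a BOM-prefixed UTF8 CSV payload into a list of row dicts."""
    # "utf-8-sig" decodes UTF8 and drops a leading BOM if one is present.
    text = raw_bytes.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical payload for illustration; real column names may differ.
sample = "\ufeffURL,Title\r\nhttps://example.com/a,Café reopens\r\n".encode("utf-8")
rows = parse_csv_response(sample)
print(rows[0]["URL"])    # https://example.com/a
print(rows[0]["Title"])  # Café reopens
```

Decoding with plain `utf-8` instead would leave the BOM attached to the first header, so that lookups by the first column's name would fail.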
TIMESPAN. By default the Context API searches the last 24 hours of coverage monitored by GDELT. You can change this range by using this option to specify the number of months, weeks, days, hours or minutes (minimum of 15 minutes). The API then only searches documents published within the specified timespan backwards from the present time. If you would instead like to specify the precise start/end time of the search rather than an offset from the present time, you should use the STARTDATETIME/ENDDATETIME parameters. At present you can search only up to 72 hours ago.
- Minutes. Specify a number followed by "min" to provide the timespan in minutes.
- Hours. Specify a number followed by "h" or "hours" to provide the timespan in hours.
- Days. Specify a number followed by "d" or "days" to provide the timespan in days.
- Weeks. Specify a number followed by "w" or "weeks" to provide the timespan in weeks.
- Months. Specify a number followed by "m" or "months" to provide the timespan in months.
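A client may want to check a TIMESPAN value before sending it. The sketch below converts a value to minutes and tests it against the 15-minute minimum and the current 72-hour limit. It makes several assumptions: units are treated as case-insensitive (the quick start examples use "24H"), a month is taken as 30 days, and the function names are hypothetical. How the API itself handles out-of-range values is not stated here.

```python
import re

# Minutes per unit; "m" means months, while "min" means minutes.
UNIT_MINUTES = {
    "min": 1,
    "h": 60, "hours": 60,
    "d": 1440, "days": 1440,
    "w": 10080, "weeks": 10080,
    "m": 43200, "months": 43200,  # assumes a 30-day month
}

def timespan_minutes(value):
    """Convert a TIMESPAN value such as '24H' or '15min' to minutes."""
    match = re.fullmatch(r"(\d+)([a-z]+)", value.strip().lower())
    if not match or match.group(2) not in UNIT_MINUTES:
        raise ValueError(f"unrecognized timespan: {value!r}")
    return int(match.group(1)) * UNIT_MINUTES[match.group(2)]

def is_searchable(value):
    # Minimum of 15 minutes; at present no more than 72 hours back.
    return 15 <= timespan_minutes(value) <= 72 * 60

print(timespan_minutes("24H"))  # 1440
print(is_searchable("3d"))      # True
print(is_searchable("1w"))      # False
```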
STARTDATETIME/ENDDATETIME. These parameters allow you to specify the precise start and end date/times to search, instead of using an offset like with TIMESPAN.
- STARTDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to begin the search – only articles published after this date/time stamp will be considered. It must be within the last 72 hours. If you do not specify an ENDDATETIME, the API will search from STARTDATETIME through the present date/time.
- ENDDATETIME. Specify the precise date/time in YYYYMMDDHHMMSS format to end the search – only articles published before this date/time stamp will be considered. It must be within the last 72 hours. If you do not specify a STARTDATETIME, the API will search from 72 hours ago through the specified ENDDATETIME.
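The YYYYMMDDHHMMSS stamps can be produced with `strftime`. The following sketch builds both parameters and checks the 72-hour constraint on the client side. It assumes timestamps are in UTC and that the parameter names are lowercase like the other parameters in the example URLs; neither is stated above, and `window_params` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%d%H%M%S"  # YYYYMMDDHHMMSS

def to_stamp(moment):
    return moment.strftime(STAMP_FORMAT)

def window_params(start, end, now):
    """Return start/end parameters, requiring both to fall in the last 72 hours."""
    earliest = now - timedelta(hours=72)
    if not (earliest <= start <= end <= now):
        raise ValueError("window must fall within the last 72 hours")
    return {"startdatetime": to_stamp(start), "enddatetime": to_stamp(end)}

# A fixed "now" keeps the example reproducible; real code would use
# datetime.now(timezone.utc).
now = datetime(2020, 6, 10, 12, 0, 0, tzinfo=timezone.utc)
params = window_params(now - timedelta(hours=48), now - timedelta(hours=24), now)
print(params)
# {'startdatetime': '20200608120000', 'enddatetime': '20200609120000'}
```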
MAXRECORDS. To conserve system resources, the API only returns up to 75 results by default, but this can be increased up to 200 results if desired by using this URL parameter.
SEARCHLANG. Search in the given language. At this time only space-segmented languages are supported. Specify the language code (including underscore) from the supported languages list.
SORT. By default results are sorted by relevance to your query. Sometimes you may wish to sort by date or tone instead.
- DateDesc. Sorts results by publication date, displaying the most recent articles first.
- DateAsc. Sorts results by publication date, displaying the oldest articles first.