The GDELT Project

Announcing The Global Relationship Graph

One of the great "grand challenges" of the field of natural language understanding lies in teaching machines to truly understand what they read, enabling them to actually reason about the facts and relationships presented in arbitrary text. While no such system exists today, there are myriad approaches designed to tackle specific dimensions of this problem, from Q&A to summary distillation to fact and relation extraction. Systems range from neural approaches, including those that treat extraction as a summarization problem, to traditional grammar-based systems that look for specific linguistic patterns indicative of particular kinds of statements and relations. Indeed, GDELT makes use of a number of such approaches for event and entity extraction and disambiguation.

The Covid-19 era has reinforced the need for truly generalized fact and relation extraction: algorithms that can process global news coverage in realtime and not just codify what is said, but do so in a way that enables at-scale analysis of emerging narratives and, especially, the autonomous detection of contested narratives such as competing directives around mask wearing. The same capability would support tasks like compiling how governments around the world are handling school reopenings. Doing so requires reaching beyond GDELT's existing fixed-taxonomy algorithms towards generalized free-form relationship extraction.

Towards this end we are enormously excited to announce the Global Relationship Graph (GRG), a new initiative focusing on extracting and codifying the basic world knowledge reported in the world's news media each day. Moving beyond the Event database's fixed taxonomy and the Global Knowledge Graph's predefined metadata categories, the Global Relationship Graph intends to explore how a range of NLU approaches can be used to codify at global scale the information captured in free form text. We will be exploring many different algorithms and approaches and welcome your ideas and feedback on algorithms and tools you recommend.

For this initial prototype release of the GRG, we are debuting two small Covid-19-related pilot datasets to gather feedback on the kinds of data that would be of greatest utility for free form knowledge extraction. Both were compiled from the random sample of English language articles that GDELT annotated through Google's Cloud Natural Language API each day during June 2020 and that included the words "covid" or "coronavirus" or "virus" somewhere in the article text. The resulting dependency graph is analyzed to extract two datasets: Verb-Centered Variable Ngrams and Dependency Tree Ngrams.

Only a very small random sample of GDELT's daily monitoring volume is annotated through the Cloud Natural Language API each day. Of the 11 languages currently annotated, only English language articles published in June 2020 are considered for this pilot, totaling 14,424 articles. The final dataset includes 1,993,358 Verb-Centered Variable Ngrams and 2,545,320 Dependency Tree Ngrams. The goal here is to explore what kinds of representations would be of greatest use for generalized relation extraction and thus we welcome your ideas and feedback.
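As a rough illustration of the keyword selection described above, the sketch below keeps only articles whose text mentions one of the three terms. The function name `mentions_covid` and the sample texts are hypothetical, and case-insensitive substring matching is an assumption: the pilot's exact matching rules are not specified here.

```python
import re

# Assumption: case-insensitive substring matching on the three terms.
# Under substring matching "virus" already covers "coronavirus"; all three
# terms are listed so the pattern still works if word boundaries are added.
COVID_TERMS = re.compile(r"covid|coronavirus|virus", re.IGNORECASE)


def mentions_covid(article_text: str) -> bool:
    """Return True if any of the three terms appears in the text."""
    return COVID_TERMS.search(article_text) is not None


# Hypothetical article texts, for illustration only.
articles = [
    "New Coronavirus cases were reported on Monday.",
    "The city council approved a new transit budget.",
    "Officials updated COVID guidance on mask wearing.",
]
selected = [text for text in articles if mentions_covid(text)]
print(len(selected))
```

A stricter variant could wrap each term in `\b` word boundaries, at the cost of missing compounds such as "covid19".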

You can download the two datasets in newline-delimited UTF8 JSON:
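For readers new to the format, newline-delimited JSON stores one complete JSON object per line, so a file can be streamed record by record without loading it all into memory. The sketch below is a minimal reader; the helper name `iter_ndjson` and the `url` and `text` fields in the sample are hypothetical and are not the actual GRG schema.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        yield json.loads(line)


# Hypothetical records: "url" and "text" are invented field names,
# not the actual GRG schema.
sample = io.StringIO(
    '{"url": "https://example.com/a", "text": "masks are required"}\n'
    "\n"
    '{"url": "https://example.com/b", "text": "schools will reopen"}\n'
)
records = list(iter_ndjson(sample))
print(len(records), records[0]["url"])
```

For a downloaded file, the same function can be given a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the file was saved.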

You can also access both tables in BigQuery:

We're enormously excited about this new initiative. We will be exploring a number of different representations and look forward to engaging with all of you on new algorithms and approaches!