One of the great "grand challenges" of the field of natural language understanding lies in teaching machines to truly understand what they read, enabling them to actually reason about the facts and relationships presented in arbitrary text. While no such system exists today, there are myriad approaches designed to tackle specific dimensions of this problem, from Q&A to summary distillation to fact and relation extraction. Systems range from neural approaches, including treating it as a summarization problem, to historical grammar-based systems that look for specific linguistic patterns indicative of particular kinds of statements and relations. Indeed, GDELT makes use of a number of such approaches for event and entity extraction and disambiguation.
The Covid-19 era has reinforced the need for truly generalized fact and relation extraction – algorithms that can process global news coverage in realtime and not just codify what is said, but do so in a way that enables at-scale analysis of emerging narratives and especially the autonomous detection of contested narratives, such as competing directives around mask wearing or differences in how governments around the world are handling school reopenings. Doing so requires reaching beyond GDELT's existing fixed-taxonomy algorithms towards generalized free-form relationship extraction.
Towards this end we are enormously excited to announce the Global Relationship Graph (GRG), a new initiative focusing on extracting and codifying the basic world knowledge reported in the world's news media each day. Moving beyond the Event database's fixed taxonomy and the Global Knowledge Graph's predefined metadata categories, the Global Relationship Graph intends to explore how a range of NLU approaches can be used to codify at global scale the information captured in free-form text. We will be exploring many different algorithms and approaches and welcome your ideas and feedback on algorithms and tools you recommend.
For this initial prototype release of the GRG, we are debuting two small Covid-19-related pilot datasets to gather feedback on the kinds of data that would be of greatest utility for free-form knowledge extraction. Both were compiled from the random sample of English language articles that GDELT annotated through Google's Cloud Natural Language API each day during June 2020, restricted to articles that included the word "covid," "coronavirus," or "virus" somewhere in the article text. Each article's resulting dependency graph is analyzed to extract the following two datasets:
- Verb-Centered Variable Ngrams. Here each verb phrase in an article is identified and up to 5 tokens before ("pre" snippet) and after it ("post" snippet) are compiled into a clause. Clauses are permitted to extend over a comma, but all other punctuation ends the clause. The result is a list of verb phrases and their immediate context for each article, creating essentially a 10-gram centered on each verb phrase. Unlike traditional ngrams, these 10-grams are only generated around verbs, meaning they are not a sliding window but rather a fixed context around each verb phrase, capturing the statements of action related in the article. Each unique "pre-verb-post" clause found in articles published during a given minute becomes a record and contains a list of all of the articles published that minute that contained that clause. The goal of this dataset is to enhance the traditional ngram modality by making it more semantically meaningful by centering it only on verb phrases.
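The windowing logic above can be sketched in plain Python. This is a hypothetical reimplementation for illustration only – GDELT's pipeline uses the Cloud Natural Language API's tokenizer and part-of-speech tags, and the function name, the token representation, and the exact clause-boundary rules here are assumptions:

```python
# Stop-punctuation set: clauses may cross commas but end at other punctuation.
PUNCT_STOP = {".", "!", "?", ";", ":", "(", ")", '"'}

def verb_ngrams(tokens, window=5):
    """Sketch of Verb-Centered Variable Ngram extraction.

    tokens: list of (text, pos) pairs for one sentence, as a stand-in for
    the Cloud Natural Language API's token/POS annotations.
    Returns a (pre, verb, post) clause for each verb, with up to `window`
    context tokens on each side, truncated at any punctuation other than
    a comma.
    """
    results = []
    for i, (tok, pos) in enumerate(tokens):
        if pos != "VERB":
            continue
        # Walk left for the "pre" snippet.
        pre, j = [], i - 1
        while j >= 0 and len(pre) < window:
            t, p = tokens[j]
            if p == "PUNCT" and t != ",":
                break
            pre.insert(0, t)
            j -= 1
        # Walk right for the "post" snippet.
        post, j = [], i + 1
        while j < len(tokens) and len(post) < window:
            t, p = tokens[j]
            if p == "PUNCT" and t != ",":
                break
            post.append(t)
            j += 1
        results.append((" ".join(pre), tok, " ".join(post)))
    return results

sent = [("Officials", "NOUN"), ("said", "VERB"), ("masks", "NOUN"),
        ("reduce", "VERB"), ("transmission", "NOUN"), (".", "PUNCT")]
clauses = verb_ngrams(sent)
```

Note how the two verbs each anchor their own clause rather than a sliding window: "said" and "reduce" both yield records, with overlapping but verb-centered context.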
- Dependency Tree Ngrams. Here each verb phrase in an article is identified and the dependency graph is walked backwards ("pre" snippet) and forwards ("post" snippet) by token to compile the pre and post snippets. Only the immediately next token in the graph is examined for the pre/post snippets, meaning adjectives, adverbs, etc. are typically dropped. This results in ungrammatical snippets that remove critical detail, but offer a more concise representation of the graph's parse of the sentence. Note, however, that errors in the dependency parse, as well as its highly condensed form, mean it can sometimes actually change the meaning of a clause. Thus, this dataset is provided as a first exploration of what a dependency-based clause representation might look like. Possible expansions include semantically-aware extraction of clauses based on their dependency labels, which would enable richer representation of information.
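One way to read "walking the graph by token" is: from each verb, repeatedly hop to the nearest dependent on each side, so words attached elsewhere in the tree (adjectives, adverbs, determiners) fall away. The sketch below is one possible interpretation, not GDELT's actual implementation; the token representation and the nearest-dependent rule are assumptions:

```python
def dep_ngram(tokens, verb_idx, window=5):
    """Sketch of a Dependency Tree Ngram around one verb.

    tokens: list of (text, head_index) pairs for one sentence, where
    head_index points at each token's governor in the dependency parse.
    From the verb, walk left by repeatedly taking the closest left-side
    dependent, and right by taking the closest right-side dependent, up
    to `window` hops each way.
    """
    n = len(tokens)

    def step(idx, direction):
        # Nearest dependent of `idx` on the given side, or None.
        deps = [j for j in range(n) if tokens[j][1] == idx and j != idx]
        side = [j for j in deps if (j < idx if direction < 0 else j > idx)]
        if not side:
            return None
        return max(side) if direction < 0 else min(side)

    pre, idx = [], verb_idx
    while len(pre) < window:
        idx = step(idx, -1)
        if idx is None:
            break
        pre.insert(0, tokens[idx][0])

    post, idx = [], verb_idx
    while len(post) < window:
        idx = step(idx, +1)
        if idx is None:
            break
        post.append(tokens[idx][0])

    return " ".join(pre), tokens[verb_idx][0], " ".join(post)

# "The new policy reduces viral transmission" – heads point at governors:
# The->policy, new->policy, policy->reduces, viral->transmission,
# transmission->reduces.
tokens = [("The", 2), ("new", 2), ("policy", 3), ("reduces", 3),
          ("viral", 5), ("transmission", 3)]
clause = dep_ngram(tokens, 3)
```

On this toy parse the walk yields ("new policy", "reduces", "transmission"): "viral" hangs off "transmission" on the wrong side of the walk and is dropped, which illustrates both the concision and the detail loss described above.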
Only a very small random sample of GDELT's daily monitoring volume is annotated through the Cloud Natural Language API each day, and of the 11 languages currently annotated, only English language articles published in June 2020 are being considered for this pilot, totaling 14,424 articles. The final dataset includes 1,993,358 Verb-Centered Variable Ngrams and 2,545,320 Dependency Tree Ngrams. The goal here is to explore what kinds of representations would be of greatest use for generalized relation extraction and thus we welcome your ideas and feedback.
You can download the two datasets in newline-delimited UTF-8 JSON:
You can also access both tables in BigQuery:
We're enormously excited about this new initiative and will be exploring a number of different representations. We look forward to engaging with all of you on new algorithms and approaches!