Announcing The Global Entity Graph (GEG) And A New 11 Billion Entity Dataset

Today we are tremendously excited to unveil the GDELT Global Entity Graph (GEG) and its inaugural release, GEG-GCNLAPI: a dataset of 11.3 billion metadata entity annotations drawn from a random sample of more than 103 million worldwide English-language online news articles published from July 17, 2016 through the present, along with coverage in Chinese (Simplified), Chinese (Traditional), French, German, Italian, Japanese, Korean, Portuguese (Brazilian & Continental), Russian and Spanish as of January 17, 2020. All of it is updated every minute and annotated through Google’s Cloud Natural Language API.

One of GDELT’s focuses is exploring how machines can “read” realtime information from across the world, autonomously identify events, relationships, facts, topics, narratives, emotions and other information, and use all of these signals to reason about the world: flagging disease outbreaks at their earliest glimmers, warning of rising social tension before it crosses into physical unrest, mapping everything from violence to wildlife crime in realtime, charting shifting narratives around societal issues like vaccination or refugees and even helping forecast the macro-scale future of society itself.

Today GDELT primarily synthesizes the textual world through its Event Database (EVENT) and its Global Knowledge Graph (GKG) metadata index. The GKG identifies mentions of persons, organizations, locations and proper names across more than 65 languages.

Yet, as powerful as it is, the GKG is still based on classical grammatical, statistical and machine learning algorithms.

This raises the question: given the enormous advances in deep learning over the last few years, what would it look like to apply state-of-the-art neural NLP algorithms to “read” a sample of 100 million global online news articles spanning three years, produce a list of the entities identified in each along with the “importance” of each entity to that article and, where possible, disambiguate and connect those entities to their Wikipedia entries?

The GDELT Global Entity Graph (GEG) represents a powerful new NLP testbed where we will be exploring a range of advanced NLP tools and APIs to prototype how the latest advances in machine understanding can help us autonomously reason about the real world in realtime, from identifying emerging violence to mapping wildlife crime to piercing the veil of digital falsehoods (“fake news”) and foreign influence by identifying conflicting narratives and inorganic shaping of the information commons.

This inaugural GEG release explores what the world’s news media looks like through the eyes of one of the most advanced production-grade deep learning NLP systems today: Google’s Cloud Natural Language API.

Each of the 103 million news articles randomly sampled from the past three years was processed through the Natural Language API’s entity analysis feature, which “inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.), and returns information about those entities.”
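For those who want to experiment with the same analysis, a minimal sketch of calling the API’s entity analysis from Python looks like this (the sample text is a stand-in for an article body):

```python
from google.cloud import language_v1

# A stand-in for the body of a news article.
article_text = "The Federal Reserve raised interest rates on Wednesday."

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=article_text,
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(document=document)

for entity in response.entities:
    # For entities the API can disambiguate, metadata carries a
    # Knowledge Graph mid and a Wikipedia URL.
    print(entity.name, entity.type_.name, round(entity.salience, 3),
          entity.metadata.get("mid", ""), entity.metadata.get("wikipedia_url", ""))
```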

Most powerfully, the Natural Language API is able to disambiguate and resolve alternative and popular names for known entities into the same reference. For example, mentions of "U.S. Federal Reserve," "Federal Reserve," "Federal Reserve Board," "New York Fed," "Atlanta Fed," "St. Louis Fed" and even just "The Fed" and "Fed" all resolve to the unique entity ID "/m/02xmb" and provide a URL to its Wikipedia entry.

This means that an article that spells out the full formal name “U.S. Federal Reserve” and one that merely references “The Fed” can both be understood as referring to the same entity.
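A small sketch of how this resolution can be exploited downstream, grouping surface forms by their shared mid (the records here are hypothetical parsed rows, not actual dataset output):

```python
from collections import defaultdict

# Hypothetical parsed rows, each carrying the entity fields described above.
records = [
    {"entities": [{"name": "U.S. Federal Reserve", "mid": "/m/02xmb"}]},
    {"entities": [{"name": "The Fed", "mid": "/m/02xmb"}]},
]

surface_forms = defaultdict(set)
for record in records:
    for entity in record["entities"]:
        mid = entity.get("mid")
        if mid:  # entities the API could not disambiguate have no mid
            surface_forms[mid].add(entity["name"])

print(surface_forms["/m/02xmb"])  # {'U.S. Federal Reserve', 'The Fed'}
```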

Each entity is also assigned a “type” indicating what kind of entity it is: a person (including well-known figures like heads of state), an organization, a location, an event, a date and so on.

From a semantic relevance standpoint, each entity is also given a “salience” score in the range of 0.0 to 1.0 that “provides information about the importance or centrality of that entity to the entire document text. Scores closer to 0 are less salient, while scores closer to 1.0 are highly salient.”

Salience scores make it possible to distinguish realtime breaking news announcing the first glimmers of a disease outbreak from a historical retrospective that merely casually mentions the disease in passing – a critical distinction that can be hard to make based purely on keyword searches.
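As a rough sketch of such a filter, assuming rows parsed into dicts whose entities carry mid and avgSalience fields (the field names and the 0.15 threshold are illustrative assumptions):

```python
def is_central(record, mid, threshold=0.15):
    """True if the entity with this mid is central to the article.

    The avgSalience field name and the threshold are illustrative and
    should be tuned against real data.
    """
    return any(
        e.get("mid") == mid and e.get("avgSalience", 0.0) >= threshold
        for e in record.get("entities", [])
    )

# Hypothetical rows: one article centrally about a disease, one that
# mentions it only in passing ("/m/0xyz" is a placeholder mid).
records = [
    {"url": "a", "entities": [{"mid": "/m/0xyz", "avgSalience": 0.62}]},
    {"url": "b", "entities": [{"mid": "/m/0xyz", "avgSalience": 0.01}]},
]
print([r["url"] for r in records if is_central(r, "/m/0xyz")])  # ['a']
```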

Sentiment is also computed for each article, yielding three values: polarity, magnitude and score. Polarity is an earlier field that has since been deprecated, while score was added in late 2016 and is not present in the earliest records in this dataset. See the Natural Language API documentation for more detail on interpreting these values.
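A small sketch of handling this evolution when reading records (field names are assumed from the description above):

```python
def article_sentiment(record):
    """Return (score, magnitude) for a parsed GEG row.

    Prefers the newer score field and falls back to the deprecated
    polarity for early records that lack it.
    """
    score = record.get("score")
    if score is None:
        score = record.get("polarity")  # deprecated predecessor of score
    return score, record.get("magnitude")

print(article_sentiment({"polarity": 0.3, "magnitude": 4.2}))  # (0.3, 4.2)
```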

For each article, the end result is a list of entities identified in that document, disambiguated and resolved to their common form for well-known entities, their types and their overall semantic importance to the document.

The GEG-GCNLAPI dataset was created by processing a random sample of around 100,000 English-language online news articles from around the world each day over the past three years through the Cloud Natural Language API. Since it is a random sample, it represents only a small fraction of the news coverage monitored by GDELT each day and thus covers only a portion of what the GKG does. As of January 17, 2020, it also processes an additional set of articles in Chinese (Simplified), Chinese (Traditional), French, German, Italian, Japanese, Korean, Portuguese (Brazilian & Continental), Russian and Spanish each day.

The dataset is structured as a set of UTF8 newline-delimited JSON files at 15-minute resolution from July 17, 2016 through October 4, 2019 in the format “http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/YYYYMMDDHHMMSS.geg-gcnlapi.json.gz”. Due to the dataset’s experimental nature, there are some gaps with no records for brief spans of time over the three years.
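A minimal sketch of enumerating the file URLs for a time range (some URLs will be missing due to the gaps just noted):

```python
from datetime import datetime, timedelta

URL = "http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/{:%Y%m%d%H%M%S}.geg-gcnlapi.json.gz"

def file_urls(start, end, step=timedelta(minutes=15)):
    """Yield the URL of every 15-minute GEG-GCNLAPI file in [start, end)."""
    t = start
    while t < end:
        yield URL.format(t)
        t += step

for u in file_urls(datetime(2019, 10, 4, 0, 0), datetime(2019, 10, 4, 1, 0)):
    print(u)
```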

Within each file, each row represents a given article as a standalone JSON object, recording the date/time the article was seen (rounded to the nearest 15 minutes), the URL of the article and a JSON array containing the list of entities identified by the API.
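As a rough sketch of streaming one of these files (the top-level and per-entity field names used here are illustrative assumptions based on the description above; verify them against the dataset documentation):

```python
import gzip
import json

# Stream one 15-minute file; each line is a standalone JSON object.
with gzip.open("20191004000000.geg-gcnlapi.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Assumed top-level fields: the article date/time, its URL and
        # the sentiment values described above.
        print(record.get("date"), record.get("url"))
        for entity in record.get("entities", []):
            # Assumed per-entity fields: name, type, mid, avgSalience.
            print("  ", entity.get("name"), entity.get("type"),
                  entity.get("mid"), entity.get("avgSalience"))
```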

Note that the URLs found within a given 15-minute file have historically aligned with those found in the GKG for the same period (though representing just a small subsample of them). In the future, however, the two will become increasingly decoupled as GDELT 3.0 launches, meaning that analyses looking across both GKG and GEG annotations will need to use a rolling window to match them, as sketched below.
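A minimal sketch of such a rolling-window match, assuming both datasets have been reduced to (timestamp, url) pairs (the 30-minute window is an arbitrary illustration):

```python
from datetime import datetime, timedelta

def match_rows(geg_rows, gkg_rows, window=timedelta(minutes=30)):
    """Yield (url, geg_ts, gkg_ts) wherever the same URL appears in both
    datasets within the rolling window. Inputs are iterables of
    (timestamp, url) tuples."""
    gkg_by_url = {}
    for ts, url in gkg_rows:
        gkg_by_url.setdefault(url, []).append(ts)
    for ts, url in geg_rows:
        for gkg_ts in gkg_by_url.get(url, []):
            if abs(ts - gkg_ts) <= window:
                yield url, ts, gkg_ts

geg = [(datetime(2019, 10, 4, 0, 0), "http://example.com/article")]
gkg = [(datetime(2019, 10, 4, 0, 15), "http://example.com/article")]
print(list(match_rows(geg, gkg)))
```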

You can download all of the GEG-GCNLAPI files from:

The dataset is also available in Google’s BigQuery:
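As a hedged sketch of querying it from Python (the table name below is assumed from GDELT’s usual gdelt-bq naming conventions and should be verified in the BigQuery console):

```python
from google.cloud import bigquery

client = bigquery.Client()

# The table name is assumed from GDELT's usual gdelt-bq naming
# conventions; verify the exact name in the BigQuery console.
query = """
    SELECT url, date
    FROM `gdelt-bq.gdeltv2.geg_gcnlapi`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.date)
```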

We are incredibly excited to see what you’re able to do with this extraordinary new dataset!