Announcing The Global Entity Graph (GEG) And A New 11 Billion Entity Dataset

Today we are tremendously excited to unveil the GDELT Global Entity Graph (GEG) and its inaugural release, GEG-GCNLAPI: a dataset of 11.3 billion metadata entity annotations drawn from a random sample of more than 103 million worldwide English-language online news articles published from July 17, 2016 through the present, along with coverage in Chinese (Simplified), Chinese (Traditional), French, German, Italian, Japanese, Korean, Portuguese (Brazilian & Continental), Russian and Spanish as of January 17, 2020. All of it is updated every minute and annotated through Google’s Cloud Natural Language API.

One of GDELT’s focuses is exploring how machines can “read” realtime information from across the world, autonomously identify events, relationships, facts, topics, narratives, emotions and other information, and use all of these signals to reason about the world: flagging disease outbreaks at their earliest glimmers, warning of rising social tension before it crosses into physical unrest, mapping everything from violence to wildlife crime in realtime, charting shifting narratives around societal issues like vaccination or refugees and even helping forecast the macro-scale future of society itself.

Today GDELT primarily synthesizes the textual world through its Event Database (EVENT) and its Global Knowledge Graph (GKG) metadata index. The GKG identifies mentions of persons, organizations, locations and proper names across more than 65 languages.

Yet, as powerful as it is, the GKG is still based on classical grammatical, statistical and machine learning algorithms.

This raises the question: given the enormous advances in deep learning over the last few years, what would it look like to apply state-of-the-art neural NLP algorithms to “read” a sample of 100 million global online news articles spanning three years, produce a list of the entities identified in each along with the “importance” of each entity to that article and, where possible, disambiguate and connect those entities to their Wikipedia entries?

The GDELT Global Entity Graph (GEG) represents a powerful new NLP testbed where we will be exploring a range of advanced NLP tools and APIs to prototype how the latest advances in machine understanding can help us autonomously reason about the real world in realtime, from identifying emerging violence to mapping wildlife crime to piercing the veil of digital falsehoods (“fake news”) and foreign influence by identifying conflicting narratives and inorganic shaping of the information commons.

This inaugural GEG release explores what the world’s news media looks like through the eyes of one of the most advanced production-grade deep learning NLP systems today: Google’s Cloud Natural Language API.

Each of the 103 million news articles randomly sampled from the past three years was processed through the Natural Language API’s entity analysis feature, which “inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.), and returns information about those entities.”
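For those who want to experiment with the same analysis, a minimal sketch of calling the API’s entity analysis from Python looks like this (the sample text is a stand-in for an article body):

```python
from google.cloud import language_v1

# A stand-in for the body of a news article.
article_text = "The Federal Reserve raised interest rates on Wednesday."

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=article_text,
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(document=document)

for entity in response.entities:
    # For entities the API can disambiguate, metadata carries a
    # Knowledge Graph mid and a Wikipedia URL.
    print(entity.name, entity.type_.name, round(entity.salience, 3),
          entity.metadata.get("mid", ""), entity.metadata.get("wikipedia_url", ""))
```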

Most powerfully, the Natural Language API is able to disambiguate and resolve alternative and popular names for known entities into the same reference. For example, mentions of "U.S. Federal Reserve," "Federal Reserve," "Federal Reserve Board," "New York Fed," "Atlanta Fed," "St. Louis Fed" and even just "The Fed" and "Fed" all resolve to the unique entity ID "/m/02xmb" and provide a URL to its Wikipedia entry.

This means that an article that spells out the full formal name “U.S. Federal Reserve” and one that merely references “The Fed” can both be understood as referring to the same entity.
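A small sketch of how this resolution can be exploited downstream, grouping surface forms by their shared mid (the records here are hypothetical parsed rows, not actual dataset output):

```python
from collections import defaultdict

# Hypothetical parsed rows, each carrying the entity fields described above.
records = [
    {"entities": [{"name": "U.S. Federal Reserve", "mid": "/m/02xmb"}]},
    {"entities": [{"name": "The Fed", "mid": "/m/02xmb"}]},
]

surface_forms = defaultdict(set)
for record in records:
    for entity in record["entities"]:
        mid = entity.get("mid")
        if mid:  # entities the API could not disambiguate have no mid
            surface_forms[mid].add(entity["name"])

print(surface_forms["/m/02xmb"])  # {'U.S. Federal Reserve', 'The Fed'}
```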

Each entity is also assigned a “type” indicating what kind of entity it is: a person (including well-known figures like heads of state), an organization, a location, an event, a date and so on.

From a semantic relevance standpoint, each entity is also given a “salience” score in the range of 0.0 to 1.0 that “provides information about the importance or centrality of that entity to the entire document text. Scores closer to 0 are less salient, while scores closer to 1.0 are highly salient.”

Salience scores make it possible to distinguish realtime breaking news announcing the first glimmers of a disease outbreak from a historical retrospective that merely casually mentions the disease in passing – a critical distinction that can be hard to make based purely on keyword searches.
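As a rough sketch of such a filter, assuming rows parsed into dicts whose entities carry mid and avgSalience fields (the field names and the 0.15 threshold are illustrative assumptions):

```python
def is_central(record, mid, threshold=0.15):
    """True if the entity with this mid is central to the article.

    The avgSalience field name and the threshold are illustrative and
    should be tuned against real data.
    """
    return any(
        e.get("mid") == mid and e.get("avgSalience", 0.0) >= threshold
        for e in record.get("entities", [])
    )

# Hypothetical rows: one article centrally about a disease, one that
# mentions it only in passing ("/m/0xyz" is a placeholder mid).
records = [
    {"url": "a", "entities": [{"mid": "/m/0xyz", "avgSalience": 0.62}]},
    {"url": "b", "entities": [{"mid": "/m/0xyz", "avgSalience": 0.01}]},
]
print([r["url"] for r in records if is_central(r, "/m/0xyz")])  # ['a']
```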

Sentiment is also computed for each article, yielding three values: polarity, magnitude and score. Polarity is an earlier field that has since been deprecated, while score was added in late 2016 and is not present in the earliest records in this dataset. See the Natural Language API documentation for more detail on interpreting these values.
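A small sketch of handling this evolution when reading records (field names are assumed from the description above):

```python
def article_sentiment(record):
    """Return (score, magnitude) for a parsed GEG row.

    Prefers the newer score field and falls back to the deprecated
    polarity for early records that lack it.
    """
    score = record.get("score")
    if score is None:
        score = record.get("polarity")  # deprecated predecessor of score
    return score, record.get("magnitude")

print(article_sentiment({"polarity": 0.3, "magnitude": 4.2}))  # (0.3, 4.2)
```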

For each article, the end result is a list of entities identified in that document, disambiguated and resolved to their common form for well-known entities, their types and their overall semantic importance to the document.

The GEG-GCNLAPI dataset was created by processing a random sample of around 100,000 English-language online news articles from around the world each day over the past three years through the Cloud Natural Language API. Since it is a random sample, it represents only a small fraction of the news coverage monitored by GDELT each day and thus covers only a portion of what the GKG does. As of January 17, 2020, it also processes an additional set of articles in Chinese (Simplified), Chinese (Traditional), French, German, Italian, Japanese, Korean, Portuguese (Brazilian & Continental), Russian and Spanish each day.

The dataset is structured as a set of UTF8 newline-delimited JSON files at 15-minute resolution from July 17, 2016 through October 4, 2019 in the format “http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/YYYYMMDDHHMMSS.geg-gcnlapi.json.gz”. Due to the dataset’s experimental nature, there are some gaps with no records for brief spans of time over the three years.
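A minimal sketch of enumerating the file URLs for a time range (some URLs will be missing due to the gaps just noted):

```python
from datetime import datetime, timedelta

URL = "http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/{:%Y%m%d%H%M%S}.geg-gcnlapi.json.gz"

def file_urls(start, end, step=timedelta(minutes=15)):
    """Yield the URL of every 15-minute GEG-GCNLAPI file in [start, end)."""
    t = start
    while t < end:
        yield URL.format(t)
        t += step

for u in file_urls(datetime(2019, 10, 4, 0, 0), datetime(2019, 10, 4, 1, 0)):
    print(u)
```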

Within each file, each row represents a given article as a standalone JSON object, recording the date/time the article was seen (rounded to the nearest 15 minutes), the URL of the article and a JSON array containing the list of entities identified by the API.
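As a rough sketch of streaming one of these files (the top-level and per-entity field names used here are illustrative assumptions based on the description above; verify them against the dataset documentation):

```python
import gzip
import json

# Stream one 15-minute file; each line is a standalone JSON object.
with gzip.open("20191004000000.geg-gcnlapi.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Assumed top-level fields: the article date/time, its URL and
        # the sentiment values described above.
        print(record.get("date"), record.get("url"))
        for entity in record.get("entities", []):
            # Assumed per-entity fields: name, type, mid, avgSalience.
            print("  ", entity.get("name"), entity.get("type"),
                  entity.get("mid"), entity.get("avgSalience"))
```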

Note that the URLs found within a given 15-minute file have historically aligned with those found in the GKG for the same period (though representing just a small subsample of them). In the future, however, the two will become increasingly decoupled as GDELT 3.0 launches, meaning that analyses looking across both GKG and GEG annotations will need to use a rolling window to match them, as sketched below.
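A minimal sketch of such a rolling-window match, assuming both datasets have been reduced to (timestamp, url) pairs (the 30-minute window is an arbitrary illustration):

```python
from datetime import datetime, timedelta

def match_rows(geg_rows, gkg_rows, window=timedelta(minutes=30)):
    """Yield (url, geg_ts, gkg_ts) wherever the same URL appears in both
    datasets within the rolling window. Inputs are iterables of
    (timestamp, url) tuples."""
    gkg_by_url = {}
    for ts, url in gkg_rows:
        gkg_by_url.setdefault(url, []).append(ts)
    for ts, url in geg_rows:
        for gkg_ts in gkg_by_url.get(url, []):
            if abs(ts - gkg_ts) <= window:
                yield url, ts, gkg_ts

geg = [(datetime(2019, 10, 4, 0, 0), "http://example.com/article")]
gkg = [(datetime(2019, 10, 4, 0, 15), "http://example.com/article")]
print(list(match_rows(geg, gkg)))
```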

You can download all of the GEG-GCNLAPI files from:

The dataset is also available in Google’s BigQuery:
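As a hedged sketch of querying it from Python (the table name below is assumed from GDELT’s usual gdelt-bq naming conventions and should be verified in the BigQuery console):

```python
from google.cloud import bigquery

client = bigquery.Client()

# The table name is assumed from GDELT's usual gdelt-bq naming
# conventions; verify the exact name in the BigQuery console.
query = """
    SELECT url, date
    FROM `gdelt-bq.gdeltv2.geg_gcnlapi`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.date)
```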

We are incredibly excited to see what you’re able to do with this extraordinary new dataset!