Announcing The Global Entity Graph’s (GEG) G1 Baseline Dataset: 8 Billion Entities From 2019

Kalev Leetaru

6 years ago

Today we are excited to announce the GDELT Global Entity Graph’s (GEG) second release: the G1 Baseline Dataset, a dataset comprised of more than 8 billion entity annotations computed from 83 million English-language worldwide online news articles monitored by GDELT since January 1, 2019 and updated every 15 minutes through present. Each article is annotated through a grammar-based extractor applied to a classical HMM POS tagger, with both common and proper maximal entities, contextualized counts and date references extracted. The end result is a reasonably accurate classical annotation dataset that covers the majority of English-language coverage monitored by GDELT each day and offers a baseline dataset to compare against the rest of the current and future GEG datasets.

The GEG G1 Baseline Dataset comprises the entirely of all worldwide English-language online news coverage monitored by GDELT since January 1, 2019. The only exclusion at present is an initial language model filter that is applied to each article to determine whether it will yield reasonable-quality Part of Speech tagging results from the Hidden Markov model tagger used here. This means that a small percentage of English language coverage is not represented in this dataset and the GDELT Global Knowledge Graph (GKG) still represents the canonical exhaustive annotation catalog.

Each article is tagged and processed through a specially crafted extraction grammar to identify the four classes of entities it currently recognizes. At this time common and proper entities are extracted as-is, though in future a postprocessing language model may be applied to refine the results and remove common error cases.

Common and proper entity names are extracted in maximal form, rather than subdivided into discrete component entities. Thus, “Kenyan Director of Health Operations Jane Doe” would be treated as a single entity rather than as three three separate entities “Kenyan”, “Director of Health Operations” and “Jane Doe” in order to preserve the semantic connection among the three. This is a notable difference between this dataset and other entity extractors and must be accounted for in analyses. In this case we believe preserving these linkages is more important, since entity subdivision can be performed in a postprocessing step in which the semantic relationships among the subentities can be properly preserved.

For all four entities, a “salience” score is calculated that represents its overall “importance” to the document. At present this score is based on the word position within the document of all mentions of each entity under the journalistic inverse pyramid rule that words mentioned more often and earlier in an article are more important than those mentioned sparingly at the end. Thus, an entity mentioned once in the lead of an article might be considered more salient than one mentioned three times in the last sentence. In future this may be updated to incorporate additional signals. All scores are normalized to 0.0 (little salience) to 1.0 (high salience).

For numeric entities, the text immediately following the number is analyzed to determine whether it provides information that would help understand what the number represents and if so, up to four words of this is represented in the “context” field. Thus, an article mentioning “There were 200 dead and 30 injured” will yield “200” with context of “dead” and “30” with context of “injured” helping to lend understandability to numeric references, rather than just extracting “200” and “30” without any context as to what those counts represent.

The goal of the GEG G1 Baseline Dataset is to provide a maximal-coverage classical baseline that offers reasonable entity extraction quality across as much of GDELT’s monitored coverage as possible and with sufficiently high quality and coverage that it can form the basis of research as-is and act as a baseline against which newer approaches, especially neural systems, can be compared.

The dataset is structured as a set of UTF8 newline-delimited JSON files at 15-minute resolution from January 1, 2019 through present in the format "http://data.gdeltproject.org/gdeltv3/geg_g1/YYYYMMDDHHMMSS.geg-g1.json.gz". Due to the dataset's experimental nature, there may be some brief gaps.

Within each file, each row represents a given article, with the date/time the article was seen (precise to the second GDELT saw the article), the URL of the article and a JSON array containing the list of entities identified by the API. The JSON structure of each row is as follows:

date. The timestamp the article was seen, recording the specific time, down to the second, that GDELT saw the article.
url. The URL of the article.
entities. A JSON array containing the list of distinct entities identified by the extractor. Multiple references to the same entity in the document are grouped together. Thus, if "White House" is mentioned 10 times in the article it will appear once here. Note that if an entity appears multiple times in a common context and once in a proper context, the proper context will be downgraded to common status.

name. The entity as extracted. For common and proper names this will be the entity name in full. For numbers it will be the resolved number (ie "one hundred and two million" will be represented as "102000000"). For dates it will be the resolved date in "YYYYMMDD" format (ie "January 2, 2019" will become "20190102"). Partial dates will have 0's in the missing date components (ie "January 2" without a year will become "00000102" and "2011" without a month or day will become "20110000").
type. The entity "type" (at this time one of "COMMON", "PROPER", "DATE" or "NUMBER". Additional types may be added over time.
numMentions. The number of times this entity was identified in the article. Thus, an article that mentions "Ebola" 9 times in the article will have a 9 in this field.
avgSalience. The average "salience" score for this entity, reflecting its "importance" to the article as a whole.

Note that the URLs found within a given 15-minute file historically have aligned with those found in the GKG for the same period, but in future this will be increasingly decoupled as GDELT 3.0 launches, meaning that analyses looking across both GKG and GEG annotations will need to use a rolling window to match them.

You can download all of the GEG-G1 files from:

http://data.gdeltproject.org/gdeltv3/geg_g1/MASTERFILELIST.TXT (master list of all files)

You can check the LASTUPDATE.TXT file every 5 minutes to see when the most recent file is ready:

http://data.gdeltproject.org/gdeltv3/geg_g1/LASTUPDATE.TXT

The dataset is also available in Google's BigQuery:

gdelt-bq:gdeltv2.geg_g1

We are incredibly excited to see what you're able to do with this extraordinary new dataset!