
Announcing The Global Embedded Metadata Graph

We are enormously excited today to announce the debut of the Global Embedded Metadata Graph (GEMG), which records the hidden semantic metadata in news articles that underlies the modern web, spanning July 2018 to present. From Open Graph objects in HTML <META> tags to rich Schema.org annotations expressed in JSON-LD blocks, the GEMG is designed to capture the descriptive structured metadata provided with many news articles. This ranges from basic fields like extended title, author and publication date, to author-selected descriptive keywords and outlet-provided summary sentences, along with links to author homepages and social media accounts, category labels and myriad other fields. This highly structured and precisely defined metadata is used by search engines, social media platforms and other services to make sense of online news content, but has traditionally been inaccessible to scholarly research. Basic questions like how many sites provide descriptive keywords or publish authorship information in machine-readable format are now immediately answerable: what percentage of sites include Open Graph tags, and how many provide more than basic Schema.org fields?

A key driving force behind the creation of the GEMG is the desire to make it easier to reason about the contents and provenance of news articles: to tie similar coverage together, to trace the spread and presented provenance of stories, to understand the broader context and authorship of an article, and to use other publisher-provided metadata to connect an article to relevant fact checks, lend critical context to its origins and diffusion trajectory, and so on.

Information science, metadata and semantic web scholars will also find this dataset invaluable for understanding adoption rates and implementation realities: how news outlets across the world are using these semantic annotations and how those uses are shaping the semantic web.

The structure of this inaugural release of the GEMG is straightforward. Each online news article GDELT monitors in all 152 languages is scanned for the presence of any <META> tags in the HTML source that include a "name", an "itemprop", an "http-equiv" or a "property" attribute along with a "content" attribute. These are compiled into an array of JSON objects. The <HTML> tag that opens the page is also scanned for a "lang" attribute and, if one is found, it is included in the metatags set with type "htmltag".
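
To make this step concrete, the following is a minimal sketch in Python using BeautifulSoup. The output key names ("type", "name", "content") are illustrative assumptions for this example, not the GEMG's actual field names.

import json
from bs4 import BeautifulSoup

META_ATTRS = ("name", "itemprop", "http-equiv", "property")

def extract_metatags(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    records = []

    # Every <META> tag with a "content" attribute plus one of the four
    # identifying attributes becomes one JSON object.
    for tag in soup.find_all("meta"):
        if "content" not in tag.attrs:
            continue
        for attr in META_ATTRS:
            if attr in tag.attrs:
                records.append({
                    "type": attr,                     # which identifying attribute matched
                    "name": tag.attrs[attr],          # e.g. "og:title"
                    "content": tag.attrs["content"],  # the tag's content value
                })

    # The opening <HTML> tag's "lang" attribute is recorded with type "htmltag".
    html_tag = soup.find("html")
    if html_tag is not None and html_tag.get("lang"):
        records.append({"type": "htmltag", "name": "lang", "content": html_tag["lang"]})

    return records

page = '<html lang="en"><head><meta property="og:title" content="Example"></head></html>'
print(json.dumps(extract_metatags(page), indent=2))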

The HTML source is also scanned for all <SCRIPT> blocks with MIME type "application/ld+json", which are compiled into an array of strings in the order they appear in the page. Each JSON-LD block is reproduced as a scalar string, which ensures that invalid JSON-LD blocks do not prevent parsing of the entire record. All scalar string field values that exceed 500 characters are truncated.
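
In the same spirit, here is a rough sketch of the JSON-LD scan. It reads the 500-character rule as applying to string values inside each parsed block, which is one interpretation of the rule above rather than a statement of the production pipeline's exact behavior.

import json
from bs4 import BeautifulSoup

MAX_SCALAR_LEN = 500

def truncate_scalars(node, limit=MAX_SCALAR_LEN):
    # Recursively truncate any string value longer than `limit` characters.
    if isinstance(node, str):
        return node[:limit]
    if isinstance(node, list):
        return [truncate_scalars(v, limit) for v in node]
    if isinstance(node, dict):
        return {k: truncate_scalars(v, limit) for k, v in node.items()}
    return node

def extract_jsonld(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    # Collect <SCRIPT type="application/ld+json"> blocks in document order.
    for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
        raw = script.string or script.get_text()
        try:
            blocks.append(json.dumps(truncate_scalars(json.loads(raw)), ensure_ascii=False))
        except json.JSONDecodeError:
            # Left as-is here; repair or discard of invalid blocks is the next step.
            blocks.append(raw)
    return blocks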

HTML character escapes are unescaped in both META tags and JSON-LD blocks, and invalid or non-compliant Unicode sequences in JSON-LD blocks are corrected. We also attempt to correct many common JSON-LD structural errors, ranging from misplaced commas to invalid carriage return placement, and rewrite the field with the corrected version (we only correct it into parseable JSON; we don't check for compliance with the Schema.org standard). This means field ordering is nondeterministic, but structural groupings like arrays and nesting are preserved. If we are unable to correct a JSON-LD block into parseable JSON, we discard it to ensure the dataset does not include unparseable data.
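
A rough sketch of that cleanup pass is below: unescape HTML entities, attempt a couple of common structural repairs (trailing commas, stray carriage returns), and keep a block only if the result parses as JSON. The repairs GDELT applies are broader than shown here; this only illustrates the "correct it or discard it" contract.

import html
import json
import re

def repair_jsonld(raw: str):
    # Return a parseable, re-serialized JSON string, or None if the block
    # cannot be repaired and should be discarded.
    text = html.unescape(raw)

    # Literal carriage returns and newlines are invalid inside JSON strings.
    text = text.replace("\r", " ").replace("\n", " ")

    # Drop trailing commas before a closing brace or bracket (a common authoring error).
    text = re.sub(r",\s*([}\]])", r"\1", text)

    try:
        # Repaired blocks are rewritten in re-serialized form; per the note above,
        # field ordering may change while arrays and nesting are preserved.
        return json.dumps(json.loads(text), ensure_ascii=False)
    except json.JSONDecodeError:
        return None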

The final dataset is in UTF-8 JSON-NL format, one row per article, with the following fields:

The entire dataset is available as UTF-8 JSON-NL files, one per minute in the format (date/time is in UTC):

The first available file is from July 16, 2018 ("20180716000100") and the dataset will be updated every minute moving forward (though not all minutes will have data, since GDELT 2.0 operates on a 15-minute heartbeat that yields data for several minutes after each 15-minute update). Over time we will extend this dataset backward historically, but for now it runs from that date forward.
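
Since the files are keyed by a minute-resolution UTC timestamp (as in "20180716000100"), a sketch like the following can enumerate candidate timestamps for a time range. The actual download URL prefix and file extension are not reproduced here, and minutes that fall between 15-minute heartbeats will simply have no file.

from datetime import datetime, timedelta

def minute_timestamps(start: datetime, end: datetime):
    # Yield one YYYYMMDDHHMMSS timestamp per minute between start and end,
    # both interpreted as UTC, matching the per-minute file naming above.
    current = start.replace(second=0, microsecond=0)
    while current <= end:
        yield current.strftime("%Y%m%d%H%M%S")
        current += timedelta(minutes=1)

for stamp in minute_timestamps(datetime(2018, 7, 16, 0, 0), datetime(2018, 7, 16, 0, 5)):
    print(stamp)  # prepend the file URL prefix and extension for your client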

The entire dataset is also available as a BigQuery table:

We're enormously excited about the fundamentally new kinds of explorations this dataset makes possible over the metadata of news and the journalistic semantic web.