Announcing The Global Embedded Metadata Graph

We are enormously excited today to announce the debut of the Global Embedded Metadata Graph (GEMG), which records the hidden semantic metadata in news articles that underlies the modern web, spanning July 2018 to present. From Open Graph objects in HTML <META> tags to rich Schema.org annotations expressed in JSON-LD blocks, the GEMG is designed to capture the rich descriptive structured metadata that is provided with many news articles. This ranges from basic fields like extended title, author and publication date to author-selected descriptive keywords and outlet-provided summary sentences, along with links to author homepages and social media accounts, category labels and myriad other fields. This highly structured and precisely defined metadata is used by search engines, social media platforms and other services to make sense of online news content, but has traditionally been inaccessible to scholarly research. Basic questions are now immediately answerable: How many sites provide descriptive keywords or publish authorship information in machine-readable format? What percentage of sites include Open Graph tags, and how many provide more than basic Schema.org fields?

A key driving force in the creation of the GEMG is the desire to make it easier to reason about the contents and provenance of news articles in order to tie similar coverage together, to trace the spread and presented provenance of stories, to understand the broader context and authorship of an article and to use other available publisher-provided metadata to understand an article in ways that could be used to connect it to relevant fact checks, lend critical context to its origins and diffusion trajectory and so on.

Information science, metadata and semantic web scholars will also find this dataset invaluable for understanding adoption rates and implementation realities of how news outlets across the world are utilizing these semantic annotations and the influence of those use cases on the semantic web.

The structure of this inaugural release of the GEMG is straightforward. Each online news article GDELT monitors, across all 152 languages, is scanned for any <META> tags in the HTML source that include a "name", "itemprop", "http-equiv" or "property" attribute along with a "content" attribute. These are compiled into an array of JSON objects. The <HTML> tag that opens the page is also scanned for a "lang" attribute, which, if found, is included in the metatags set with type "htmltag".
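
As a rough illustration of the scan described above, the following is a minimal sketch using Python's standard html.parser module. It is not GDELT's actual extraction code. The class and function names are hypothetical, as are two choices the description leaves open: the key recorded for the <HTML> "lang" entry (assumed here to be "lang") and which attribute wins when a tag carries more than one of the four key attributes (assumed here to be the first in the order listed).

```python
from html.parser import HTMLParser

# Attributes that identify a META tag of interest, in the order listed above.
KEY_ATTRS = ("name", "itemprop", "http-equiv", "property")


class MetaTagCollector(HTMLParser):
    """Collects META tags (and the HTML lang attribute) as key/type/value dicts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.metatags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and unescapes
        # HTML character references in attribute values.
        attr = {k: v for k, v in attrs if v is not None}
        if tag == "html" and "lang" in attr:
            # Assumption: the key for this special entry is "lang".
            self.metatags.append(
                {"key": "lang", "type": "htmltag", "value": attr["lang"]}
            )
        elif tag == "meta" and "content" in attr:
            for kind in KEY_ATTRS:
                if kind in attr:
                    # Assumption: the first matching attribute decides the type.
                    self.metatags.append(
                        {"key": attr[kind], "type": kind, "value": attr["content"]}
                    )
                    break


def extract_metatags(html):
    """Return the list of metatag objects found in an HTML string."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return parser.metatags
```

Tags that lack a "content" attribute, such as a bare charset declaration, are skipped, which matches the requirement that both a key attribute and "content" be present.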

The HTML source is also scanned for all blocks of MIME type "application/ld+json", which are compiled into an array of strings in the order they appear in the page. Each JSON-LD block is reproduced as a scalar string, as this ensures that invalid JSON-LD blocks do not prevent parsing of the entire record. All scalar string field values that exceed 500 characters are truncated.
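
The block collection and truncation steps might look roughly like the sketch below, again using the standard html.parser module. It is an illustration under stated assumptions, not the production pipeline: the names JsonLdCollector, extract_jsonld_blocks and truncate_value are hypothetical, trimming leading and trailing whitespace is an assumption, and the truncation marker follows the format given in the "jsonld" field description in this post.

```python
import re
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._chunks = None  # text chunks while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            mime = (dict(attrs).get("type") or "").strip().lower()
            if mime == "application/ld+json":
                self._chunks = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so accumulate them.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            raw = "".join(self._chunks)
            # Collapse carriage returns, tabs and whitespace runs to one space.
            # Assumption: leading and trailing whitespace is also trimmed.
            self.blocks.append(re.sub(r"\s+", " ", raw).strip())
            self._chunks = None


def extract_jsonld_blocks(html):
    """Return JSON-LD blocks as plain strings, in page order, without parsing them."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def truncate_value(value, limit=500):
    """Truncate a string value and append the original-length marker."""
    if len(value) <= limit:
        return value
    return f"{value[:limit]}…[TRUNCATED ORIGCHARLEN={len(value)}]"
```

Keeping each block as a string at this stage means a malformed block never interferes with the others on the same page.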

HTML character escapes are unescaped in both META tags and JSON-LD blocks, and invalid and non-compliant Unicode sequences in JSON-LD blocks are corrected. We also attempt to correct many common JSON-LD structural errors, ranging from misplaced commas to invalid carriage return placement, and rewrite the field with the corrected version (we only correct it into parseable JSON; we don't check for compliance with the Schema.org standard). Because blocks are rewritten in this way, field ordering is nondeterministic, but structural grouping like arrays and nesting is preserved. If we are unable to correct a JSON-LD block to make it parseable JSON, we discard it to ensure the dataset does not include unparseable data.

The final dataset is in UTF-8 JSON-NL format, one row per article, with the following fields:

  • date. The date and time GDELT saw the article.
  • url. The URL of the article.
  • title. The title of the article.
  • lang. The human-readable language name as returned by CLD2.
  • jsonld. An array of scalar strings, each of which contains a complete JSON-LD block from the article in the order it was seen in the page. Each block is stored as a scalar string rather than as an object because of the prevalence of invalid JSON seen on the open web. A certain percentage of JSON-LD blocks compiled each day violate JSON constraints and will yield an error when parsed by most JSON implementations. Storing as a string allows the GEMG record to be faithfully parsed in all cases and your application logic can decide what to do when parsing each individual JSON-LD block, such as using the relaxed parsing used by Google search when parsing such blocks (Google's Rich Results Test will successfully pass many JSON-LD blocks that standard JSON parsers will fail as fatally invalid). Only two changes are made to each block. The first is that all carriage returns and tabs are converted to single spaces and all runs of whitespace are collapsed to a single space. The second is that any field value that exceeds 500 characters in length is truncated to the first 500 characters and followed by "…[TRUNCATED ORIGCHARLEN=X]" where X is the total number of characters of the original field value.
  • metatags. An array of JSON objects containing the contents of all <META> tags found in the article that had both a "content" attribute and a "name", "itemprop", "http-equiv" or "property" attribute. Each relevant tag is recorded as an object with the following three fields. As with the "jsonld" field, all values exceeding 500 characters are truncated.
    • key. The contents of the "name", "itemprop", "http-equiv" or "property" attribute.
    • type. Whether the tag was expressed as a "name", "itemprop", "http-equiv" or "property" attribute. A special "htmltag" value is used to record the "lang" attribute found in the <HTML> tag.
    • value. The contents of the "content" attribute.
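
To make the record layout above concrete, here is a minimal sketch of how an application might read one line of a GEMG file. The helper names parse_gemg_line and metatags_by_prefix are hypothetical. Strict parsing with Python's json module is only one possible policy for the "jsonld" strings; a more tolerant parser could be swapped in for blocks that fail.

```python
import json


def parse_gemg_line(line):
    """Parse one JSON-NL line and try to parse each JSON-LD string in it.

    Returns (record, parsed_blocks, failed_blocks). Blocks that are not
    strictly valid JSON are returned untouched in failed_blocks so the
    caller can decide what to do with them.
    """
    record = json.loads(line)
    parsed, failed = [], []
    for block in record.get("jsonld") or []:
        try:
            parsed.append(json.loads(block))
        except json.JSONDecodeError:
            failed.append(block)
    return record, parsed, failed


def metatags_by_prefix(record, prefix="og:"):
    """Return {key: value} for metatags whose key starts with the given prefix."""
    return {
        tag["key"]: tag.get("value")
        for tag in record.get("metatags") or []
        if tag.get("key", "").startswith(prefix)
    }
```

Calling metatags_by_prefix(record) with the default prefix gives a quick test for whether an article carries any Open Graph tags, which is the kind of per-site adoption question raised at the top of this post.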

The entire dataset is available as UTF-8 JSON-NL files, one per minute in the format (date/time is in UTC):

  • http://data.gdeltproject.org/gdeltv3/gemg/YYYYMMDDHHMMSS.gemg.json.gz

The first available file is "20180716000100", from July 16, 2018, and new files are added every minute moving forward (though not every minute will have data, since GDELT 2.0 operates on a 15-minute heartbeat that yields data for several minutes after each 15-minute update). Over time we will extend this dataset further back historically, but for now it runs from that date forward.
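
A small sketch of fetching one minute's file, based on the URL pattern above, might look like the following. The function names are hypothetical. Two assumptions are built in: that file names always end in "00" seconds, as in the first available file, and that a minute with no data answers with HTTP 404.

```python
import gzip
import json
from datetime import datetime, timezone
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = "http://data.gdeltproject.org/gdeltv3/gemg/"


def gemg_url(ts):
    """Build the file URL for a datetime that is already expressed in UTC."""
    # Assumption: files are named on the minute, so seconds are zeroed.
    return f"{BASE_URL}{ts.replace(second=0):%Y%m%d%H%M%S}.gemg.json.gz"


def iter_records(gz_bytes):
    """Yield one dict per article from the gzipped UTF-8 JSON-NL payload."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # Split on "\n" only: JSON strings may legally contain other Unicode
    # line separators that str.splitlines() would also break on.
    for line in text.split("\n"):
        if line.strip():
            yield json.loads(line)


def fetch_minute(ts):
    """Download and parse one minute's file; return [] if it does not exist."""
    try:
        with urlopen(gemg_url(ts), timeout=30) as resp:
            return list(iter_records(resp.read()))
    except HTTPError as err:
        if err.code == 404:  # Assumption: missing minutes return 404.
            return []
        raise
```

For example, fetch_minute(datetime(2018, 7, 16, 0, 1, tzinfo=timezone.utc)) would request the first available file named above.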

The entire dataset is also available as a BigQuery table:

We're enormously excited about the fundamentally new kinds of explorations of news metadata and the journalistic semantic web that this dataset makes possible.