Identifying Invalid JSON-LD Through The Global Embedded Metadata Graph

How might we use the Global Embedded Metadata Graph to identify invalid JSON-LD code in news articles? The following query takes each JSON-LD block and attempts to parse it using BigQuery's built-in JSON parsing support, flagging blocks it fails to parse. Given that truncation can sometimes cause problems with parsing of a field we exclude blocks that were truncated. The end result returns around 3,600 blocks as of this moment:

SELECT url, rec, JSON_EXTRACT(rec, '$') recparsed FROM `gdelt-bq.gdeltv2.gemg`, unnest(jsonld) rec WHERE jsonld is not null and JSON_EXTRACT(rec, '$') is null and rec not like '%TRUNCATED ORIGCHARLEN%'

Most results are simple JSON errors like a missing comma between fields:

"UpDate": "2020-11-20T10:22:13"
"data": {

Or carriage returns between quote marks and field names and a comma before a quote mark instead of after it in separating it from the next field and spaces separating all of the date/time components:

"datePublished": "2020-12-03T01:37:18.000+0100, "
dateModified ": "
2020 - 12 - 03 T01: 37: 18.000 + 0100 ", "
mainEntityOfPage "

And similar kinds of basic JSON errors.

Google's Rich Results Test passes many of these blocks as valid, yet most JSON parsers will fail when attempting to parse them, so this is a quick way to identify such invalid blocks and better understand the most common kinds of errors.

The GDELT Project

Identifying Invalid JSON-LD Through The Global Embedded Metadata Graph

Archives