With the release of the Global Embedded Metadata Graph, we've been seeing firsthand just how poor JSON compliance is in the JSON-LD blocks of news websites around the world, from simple errors like misplaced commas or carriage returns splitting fields, to more complex errors like name overlap and name/value inversion. Our initial intent for the GEMG was to preserve JSON-LD blocks exactly as they appeared, to enable both semantic research into their contents and structural research into common forms of errors in their instantiation. Instead, the feedback we've heard is that most researchers interested in structural questions, like how often commas are misplaced or how often a particular JSON rule is violated, would be better served by a purpose-built dataset for that specific use case. The overwhelming majority of GEMG users, by contrast, appear interested only in the contents of the JSON-LD block, not its form: they want to understand patterns in what kinds of fields are specified, to use fields like keywords to better understand articles, or to reach beyond the 65 languages that GDELT live-translates through GDELT Translingual. We are also hearing of many problems with standards-strict JSON parsers failing when they encounter the non-standard JSON often present in JSON-LD blocks.
Toward this end, we're excited to announce today a filtering and JSON correction facility for the GEMG. It validates every single JSON-LD block it encounters (checking only that the block is parseable JSON, not that it conforms to the Schema.org standard) and corrects many kinds of common errors. Irrecoverable JSON-LD blocks that cannot be readily corrected are now excluded from the dataset to eliminate fatal JSON parsing errors.
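To give a rough sense of what a parse-or-discard check like this can look like, here is a minimal Python sketch. It is not the GEMG's actual implementation: the function name `try_parse_jsonld` and the two repairs it attempts (tolerating raw control characters inside strings and dropping trailing commas) are assumptions chosen purely for illustration, and it uses only the standard library.

```python
import json
import re

# Hypothetical helper (not part of the GEMG): parse a JSON-LD block,
# attempting two simple repairs before giving up.
def try_parse_jsonld(raw):
    """Return the parsed JSON value, or None if the block cannot be parsed."""
    # Attempt 1: strict, standards-compliant parse.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Attempt 2: strict=False lets raw control characters (such as a
    # carriage return or newline) appear inside string values.
    try:
        return json.loads(raw, strict=False)
    except json.JSONDecodeError:
        pass

    # Attempt 3: drop a comma that directly precedes a closing } or ].
    # Caveat: this regex is naive and would also alter a string value
    # that happens to contain ",}" or ",]".
    repaired = re.sub(r",\s*([}\]])", r"\1", raw)
    try:
        return json.loads(repaired, strict=False)
    except json.JSONDecodeError:
        # Treated as irrecoverable in this sketch.
        return None
```

In this sketch, a block for which the function returns `None` is the analogue of an irrecoverable block that would be left out of the dataset. Validation here means only that the text parses as JSON, not that it follows Schema.org.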
We also now automatically correct many kinds of Unicode errors, including improper and non-compliant encoding like UTF-32 sequences within UTF-8 blocks, and we automatically unescape HTML character escapes within both <META> tags and JSON-LD blocks to make analysis easier. Because we rewrite each JSON-LD block with these changes, the specific ordering of fields is now nondeterministic (if the original block had "url, title, keywords", the new version might have the ordering "title, keywords, url", etc.), but all nesting and structural positioning is preserved as-is.
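Since field order can now differ from the original page, code that works with these blocks is probably safest comparing parsed content rather than raw text. The Python sketch below illustrates that, along with one possible way to unescape HTML character escapes in string values. The helper names `unescape_values` and `canonical` are hypothetical, and unescaping after parsing (rather than on the raw text) is an assumption of this example, not a description of how the GEMG does it.

```python
import html
import json

# Hypothetical helper: unescape HTML character escapes in every string
# value of a parsed JSON structure. Keys are left untouched.
def unescape_values(node):
    if isinstance(node, str):
        return html.unescape(node)
    if isinstance(node, list):
        return [unescape_values(item) for item in node]
    if isinstance(node, dict):
        return {key: unescape_values(value) for key, value in node.items()}
    return node  # numbers, booleans and null pass through unchanged

# Hypothetical helper: serialize with sorted keys so that two blocks that
# differ only in field order produce the same string.
def canonical(raw):
    parsed = unescape_values(json.loads(raw))
    return json.dumps(parsed, sort_keys=True, ensure_ascii=False)

original = '{"url": "u", "title": "Q&amp;A", "keywords": ["a", "b"]}'
rewritten = '{"title": "Q&amp;A", "keywords": ["a", "b"], "url": "u"}'

# Field order differs, but the content is the same.
same_content = canonical(original) == canonical(rewritten)
```

Unescaping after parsing avoids one pitfall: an escape like `&quot;` expanded in the raw text would insert a bare double quote into a JSON string and break the parse. Note that `sort_keys=True` only reorders object fields. Array order and nesting are left as they are, which matches the statement above that structural positioning is preserved.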
The end result is that the GEMG dataset should no longer contain invalid JSON blocks or invalid or noncompliant Unicode escapes that cause parser errors, making it much easier to work with the data!