Automatic JSON-LD Correction Coming To The Global Embedded Metadata Graph

With the release of the Global Embedded Metadata Graph, we're seeing just how bad the state of JSON compliance is in the JSON-LD blocks of news websites around the world, from simple errors like misplaced commas or carriage returns splitting fields to more complex errors like name overlap and name/value inversion. Our initial intent for the GEMG was to preserve JSON-LD blocks exactly as they appeared, to enable both semantic research into their contents and structural research into common forms of errors in their instantiation. Instead, the feedback we've been hearing over the past week is that most researchers interested in structural questions, like how often commas are misplaced or how often a particular JSON rule is violated, would be better served by a purpose-built dataset for that specific use case. The overwhelming majority of GEMG users, by contrast, appear interested only in the contents of the JSON-LD block, not its form: they want to understand patterns in what kinds of fields are specified, to use fields like keywords to better understand articles, or to reach beyond the 65 languages that GDELT live-translates through GDELT Translingual. We are also hearing of many problems with standards-strict JSON parsers failing when they encounter the non-standard JSON often present in JSON-LD blocks.

Toward this end, we're happy to announce that we are working on a filtering and JSON correction facility for the GEMG that would validate each block (checking only that it is parseable JSON, not that it conforms to the JSON-LD standard) and correct many kinds of common errors. In particular, we are looking at parsing each JSON-LD block and then rewriting it, which would also address the majority of the Unicode issues we're seeing with JSON parsers. JSON-LD blocks that cannot be parsed will be excluded. We hope to have some announcements on this front in the coming weeks, so stay tuned.
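To make the parse-then-rewrite idea concrete, here is a minimal Python sketch of what such a step could look like. It is an illustration only, not the GEMG pipeline itself: the function name normalize_jsonld_block is hypothetical, and the only leniency shown is Python's own strict=False option, which tolerates raw control characters (such as carriage returns) inside string values. Correcting other errors, like misplaced commas, would need additional repair logic that is not shown here.

```python
import json
from typing import Optional


def _reject_constant(name: str) -> None:
    # Python's parser accepts NaN/Infinity by default, but they are not
    # valid JSON, so treat them as a parse failure in this sketch.
    raise ValueError(f"non-standard JSON constant: {name}")


def normalize_jsonld_block(raw: str) -> Optional[str]:
    """Parse one JSON-LD block and rewrite it as strict JSON.

    Returns the re-serialized block, or None if it cannot be parsed
    (mirroring the plan to exclude unparseable blocks).
    Hypothetical helper, assumed for illustration only.
    """
    try:
        # strict=False allows raw control characters (e.g. a carriage
        # return splitting a field) inside string values.
        parsed = json.loads(raw, strict=False, parse_constant=_reject_constant)
    except (ValueError, RecursionError):
        return None
    # Re-serializing escapes control characters and, with
    # ensure_ascii=True, writes non-ASCII text as \uXXXX escapes,
    # so strict parsers can read the result.
    return json.dumps(parsed, ensure_ascii=True, allow_nan=False,
                      separators=(",", ":"))


if __name__ == "__main__":
    block = '{"@type": "NewsArticle", "headline": "Line one\r\nline two"}'
    print(normalize_jsonld_block(block))
    print(normalize_jsonld_block('{"@type": "NewsArticle",}'))
```

In this sketch a block containing a raw carriage return is accepted and rewritten with the control characters escaped, while a block with a trailing comma is dropped. Escaping all non-ASCII characters on output is one possible way to sidestep Unicode handling differences between parsers; it is a design assumption here, not a description of what the GEMG will do.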

Note that there will be no changes to the GEMG file format. The only difference is that the contents of the "jsonld" field will now be validated, so you know that each one is a valid, parseable JSON block.