Improved Unicode Escaping In Visual Global Knowledge Graph EXIF Data

Historically, the Visual Global Knowledge Graph (VGKG) has embedded extracted EXIF metadata as standard UTF-8 encoded JSON. This has proved challenging for use cases that rely on more limited processing pipelines, which may not support the entire Unicode range or may not gracefully handle invalid character sequences, causing parsing of the entire record to fail. To address this, in the coming days we will begin JSON-escaping all non-ASCII characters, ensuring that the entire VGKG record is exclusively ASCII. All standards-compliant JSON parsers can decode JSON-escaped characters, so processing of the JSON block in VGKG records will be unaffected. Note that some images have corrupted encodings in their EXIF metadata due to incorrectly configured upstream workflows, meaning a small percentage of images will still yield gibberish in their EXIF data. At this time we extract EXIF metadata exactly as it appears in the image itself using ExifTool and do not perform any additional sanity checks on the results, in order to provide the most faithful rendition of each record. However, for images with valid Unicode EXIF metadata, this new encoding approach should resolve parsing issues, especially with less robust JSON parsers that struggle with invalid character sequences found in certain images.
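To illustrate what JSON escaping of non-ASCII characters looks like in practice, here is a minimal Python sketch. It does not show the actual VGKG pipeline: the `exif` dictionary, its field names, and its values are hypothetical, and Python's standard `json` module stands in for whatever encoder is used upstream. The point is only that the escaped form is pure ASCII and that a standards-compliant parser recovers the same data from it.

```python
import json

# Hypothetical EXIF fragment; field names and values are illustrative only
# and are not taken from a real VGKG record.
exif = {"Artist": "José Müller", "Caption": "東京 😀"}

# UTF-8 style output: non-ASCII characters are written literally.
utf8_json = json.dumps(exif, ensure_ascii=False)

# ASCII-only output: every non-ASCII character becomes a \uXXXX escape
# (characters outside the Basic Multilingual Plane become surrogate pairs).
ascii_json = json.dumps(exif, ensure_ascii=True)

print(utf8_json)
print(ascii_json)

# The escaped form contains only ASCII characters...
print(ascii_json.isascii())

# ...and decodes back to exactly the same data as the original.
print(json.loads(ascii_json) == exif)
```

Downstream code that already calls a JSON parser on the record should need no changes, since decoding the escapes is part of ordinary JSON parsing. Only code that inspects the raw JSON text without parsing it would see a difference.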

The GDELT Project
