UTF-8, UTF-16 & Beyond The Unicode BMP: Lessons From The Global Embedded Metadata Graph

Emoji are still fairly rare in the body text of news coverage, but with the release of the Global Embedded Metadata Graph (GEMG) we're starting to see an increase in emoji use. Emoji are among the characters that lie beyond the Basic Multilingual Plane (BMP), in the supplementary planes colloquially called the "astral" planes. Such characters may legally appear directly in UTF-8 encoded JSON, or they can be escaped "as a twelve-character sequence, encoding the UTF-16 surrogate pair." Unfortunately, some JSON implementations appear to struggle with UTF-8 sequences for characters above the BMP or fatally error on code points outside those assigned at the time they were compiled, while some well-known implementations fatally error when encountering legally escaped UTF-16 surrogate pair sequences. Other implementations impose non-conforming requirements, such as mandating the use of UTF-32 escape sequences in UTF-8, which are not part of the JSON specification.
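To make the "twelve-character sequence" concrete, here is a minimal Python sketch of how a code point above the BMP maps to an escaped UTF-16 surrogate pair. The function name surrogate_pair_escape and the sample emoji are our own illustrations, not anything taken from the GEMG itself, and Python's standard json module is used only as a convenient cross-check.

```python
import json


def surrogate_pair_escape(ch: str) -> str:
    """Return the JSON escape (two \\uXXXX units) for one character above the BMP."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is inside the BMP; no surrogate pair needed")
    offset = cp - 0x10000                # 20 bits remain after removing the BMP range
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return "\\u{:04x}\\u{:04x}".format(high, low)


emoji = "\U0001F600"                     # illustrative sample: U+1F600
escaped = surrogate_pair_escape(emoji)   # twelve characters: \ud83d\ude00

# Both spellings are legal JSON and decode to the same single character.
as_escape = json.dumps(emoji)                        # ASCII-only output with the escaped pair
as_raw_utf8 = json.dumps(emoji, ensure_ascii=False)  # the literal character, for UTF-8 output
```

A parser that rejects either spelling, or that demands a different one, is the kind of implementation described above.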

We're still exploring how best to handle these edge cases, especially as they affect certain major analytics toolkits and platforms. Unfortunately, the search engines and social media platforms that consume these metadata fields are extremely tolerant of deviations from the JSON standard, so web JSON is effectively compliance-optional. We are still examining how best to bridge that world with the standards-centric JSON support of analytic tooling.

In the meantime, if you encounter Unicode errors when loading the GEMG into your analytic environment, from invalid escape sequences to illegal surrogates, you should review the end-to-end Unicode support of your tooling, including every piece of the processing pipeline from ingest to output, to learn the specific requirements of each component. In some cases, users have found success in post-processing the GEMG files to, for example, translate all characters outside the BMP to UTF-32BE escape sequences, but this is heavily dependent on the target platform.
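As a rough illustration of that kind of post-processing, the Python sketch below counts the characters likely to cause trouble in a decoded string and rewrites those above the BMP using a pluggable formatter. Everything here is an assumption for illustration: the sample record and its "title" field are hypothetical, and the eight-hex-digit \UXXXXXXXX form is only one guess at what a "UTF-32 style" escape might look like. It is not valid JSON, so check what your target platform actually expects before adopting it.

```python
import json


def count_problem_chars(text: str) -> dict:
    """Count characters above the BMP and unpaired surrogates in a decoded string."""
    astral = sum(1 for ch in text if ord(ch) > 0xFFFF)
    lone = sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF)
    return {"astral": astral, "lone_surrogates": lone}


def rewrite_astral(text: str, formatter) -> str:
    """Replace each character above the BMP with formatter(ch); leave the rest alone."""
    return "".join(formatter(ch) if ord(ch) > 0xFFFF else ch for ch in text)


def utf32_style_escape(ch: str) -> str:
    # ASSUMPTION: an eight-hex-digit \UXXXXXXXX form. This is NOT a JSON escape;
    # swap in whatever spelling the target platform documents.
    return "\\U{:08X}".format(ord(ch))


# Hypothetical record; the field name "title" is illustrative only.
record = '{"title": "Launch day \\ud83d\\ude80"}'
decoded = json.loads(record)   # Python's json module joins the escaped pair into U+1F680

counts = count_problem_chars(decoded["title"])
rewritten = rewrite_astral(decoded["title"], utf32_style_escape)
```

Keeping the formatter as a parameter reflects the point above: the right replacement depends entirely on the target platform, so the scan and the rewrite stay the same while only the escape spelling changes.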