Experiments With Machine Translation: Grapheme Clusters

Modern English on the web typically has a one-to-one correspondence between visible characters and Unicode code points. This is not true for all languages. Many of the world's languages have visible characters that are not encoded in their entirety as a single Unicode "character" (code point). Instead, what appears to a reader as one logical character may actually be composed of multiple code points and combining marks that are rendered to the end user as a single combined unit.

For example, the Thai kam "กำ" is logically a single character (note how your cursor does not split the two code points), in which the vowel sign from the second code point appears partly above the first. Yet in Unicode it is represented by two separate code points in sequence: Ko Kai and Sara Am. An even more extreme example is the Devanagari kshi "क्षि" which, while appearing as a single character in your browser (note again how your cursor treats it as one character), is actually made up of four separate Unicode code points: Letter Ka, Sign Virama, Letter Ssa, and Vowel Sign I.
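The code-point makeup of both examples can be inspected directly with Python's standard unicodedata module (shown here as an illustrative sketch; the document itself does not specify a language):

```python
import unicodedata

for text in ("กำ", "क्षि"):
    # len() counts code points, not user-perceived characters
    print(text, "->", len(text), "code points")
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Running this shows two code points for the Thai example (THAI CHARACTER KO KAI, THAI CHARACTER SARA AM) and four for the Devanagari one (LETTER KA, SIGN VIRAMA, LETTER SSA, VOWEL SIGN I).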

Unicode handles such characters through the notion of a "grapheme cluster": a sequence of code points that combine to display as a single user-perceived "character". (Grapheme cluster boundaries are defined in Unicode Standard Annex #29, "Unicode Text Segmentation".)

While nearly all modern string processing and regular expression libraries support basic Unicode operations, a surprising number of them (including ones deployed in major analytic packages) do not support grapheme clusters. Such libraries see the two examples above as sequences of independent, unrelated code points, each standing entirely on its own. This can wreak havoc on downstream analytic tools, for which whole swaths of a language's characters effectively don't exist and in which "words" can begin with a virama or other combining mark.
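The failure mode is easy to reproduce with nothing more than naive per-code-point string operations (a Python sketch; any cluster-unaware library behaves analogously):

```python
kshi = "\u0915\u094D\u0937\u093F"  # क्षि: Ka, Virama, Ssa, Vowel Sign I

# Naive iteration "sees" four independent characters:
print(list(kshi))   # four separate code points

# Naive slicing produces fragments that are not valid units on their own:
print(kshi[:2])     # a bare Ka + dangling virama

# Naive reversal strands the combining marks on the wrong consonants,
# yielding a "word" that begins with a vowel sign:
print(kshi[::-1])
```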

Considerable preprocessing can be required to help such libraries treat these sequences as indivisible characters, such as inserting ZWSP (Zero Width Space) characters around each cluster and using downstream logic to decode them. The specific workaround can vary by library, so multiple approaches may be needed. And since no precomposed single code point exists for such characters, compositional normalization (such as NFC) offers no relief.
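One way to sketch the ZWSP workaround is below. The `clusters` function is a deliberately rough approximation, not a full UAX #29 implementation: it merges combining marks into the preceding cluster, special-cases Thai/Lao Sara Am (which are spacing letters by category yet extend a cluster), and lets a Devanagari virama join the following consonant. All function names here are illustrative, not from any particular library:

```python
import unicodedata

ZWSP = "\u200b"                     # Zero Width Space, used as a cluster delimiter
VIRAMA = "\u094D"                   # Devanagari sign virama
SPACING_AM = {"\u0E33", "\u0EB3"}   # Thai / Lao Sara Am

def clusters(text):
    """Very rough grapheme-cluster approximation -- NOT full UAX #29."""
    out = []
    for ch in text:
        extends = out and (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks
            or ch in SPACING_AM                             # spacing Sara Am
            or out[-1].endswith(VIRAMA)                     # conjunct after virama
        )
        if extends:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def delimit(text):
    """Mark cluster boundaries with ZWSP so cluster-unaware tools split there."""
    return ZWSP.join(clusters(text))

def undelimit(text):
    """Downstream decoding step: strip the ZWSP markers back out."""
    return text.replace(ZWSP, "")
```

With this sketch, both "กำ" and "क्षि" come back as single clusters, and `undelimit(delimit(text))` round-trips the original string.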

Even very large, widely used analytic packages can lack support for grapheme clusters and split them apart in analytically fatal ways, while others officially support them but fail in unusual edge cases. Before processing content in these languages, it is therefore important to evaluate the grapheme-cluster support of your underlying libraries and packages.
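One quick sanity probe (a Python example; the same idea applies to any regex engine) is to check whether the engine accepts the grapheme-cluster escape `\X` at all. Python's standard-library `re` module rejects it, while the third-party `regex` module accepts it:

```python
import re

def supports_grapheme_escape(compile_fn):
    """Return True if the regex engine accepts the \\X grapheme-cluster escape."""
    try:
        compile_fn(r"\X")
        return True
    except Exception:
        return False

print(supports_grapheme_escape(re.compile))  # stdlib re: False
```

Accepting `\X` is only a necessary condition, not a sufficient one: as noted above, an engine may advertise grapheme support yet still mishandle particular scripts, so spot-checking with real text in your target language remains essential.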