Experiments With Machine Translation: The Many Forms Of A Character

When working with the full range of the world's languages represented in Unicode, one comes across myriad examples of characters that look identical despite having different code points, and of a single letter that looks very different across languages or even within the same language in different contexts. There are even exact duplicates, in which the same glyph is encoded under multiple code points for legacy compatibility reasons.
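
As a small illustration of lookalike characters, the sketch below prints the code point and official Unicode name of three capital letters that render almost identically in most fonts. The choice of the Latin, Greek and Cyrillic "A" and the helper name describe are assumptions made for this example only.

```python
import unicodedata

# Three letters that usually render identically but are distinct code points.
# (Illustrative choice: Latin A, Greek Alpha, Cyrillic A.)
lookalikes = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


descriptions = [describe(ch) for ch in lookalikes]
for line in descriptions:
    print(line)
```

To a human reader the three letters are interchangeable, but a model that sees only code points treats them as three unrelated symbols.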

Such examples can easily lead analyses astray when live web examples and native human speakers are used to review content that subsequently becomes training data for machines, which see language only as a sequence of code points. To a human, different code points may appear identical, while the same letter may appear wildly different depending on the language or context. Unicode normalization can assist with this, mapping the Angstrom Sign (U+212B) to the Latin Capital Letter A With Ring Above (U+00C5), for example, though in some scenarios it can still be helpful to provide code point references during human analysis.
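
The Angstrom Sign case can be checked with Python's standard unicodedata module. This is a minimal sketch, not a full review workflow: the helper code_points is a hypothetical name introduced here to show one way of attaching code point references to text for human reviewers.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE


def code_points(text):
    """Render a string as a space-separated list of U+XXXX references."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The two characters look the same but compare as different strings.
print(angstrom == a_ring)

# NFC normalization maps the Angstrom Sign to its canonical equivalent.
normalized = unicodedata.normalize("NFC", angstrom)
print(code_points(angstrom), "->", code_points(normalized))

# NFD instead decomposes it into a base letter plus a combining ring.
decomposed = unicodedata.normalize("NFD", angstrom)
print(code_points(decomposed))
```

Normalizing text before it reaches both the human reviewer and the model means the two are looking at the same sequence of code points, and the code point listing makes any remaining differences visible.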