Translingual 2.0: Why Unicode Class Matching Isn't Sufficient For Language Detection

Many language detection engines, such as CLD2, use Unicode character classes as a shortcut for recognizing languages that are written in distinct scripts. For example, CLD2 notes that "For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result." The problem with this approach is that even a script most commonly associated with one large language can still be shared by several smaller languages.

For example, the Burmese script has its own Unicode character class (Unicode calls the script "Myanmar"), which CLD2 and most other language detectors use as a shortcut for identifying text as being in the Burmese language. However, the Burmese script is also used by the Jingpho, Karen, Mon, Rakhine and Shan languages. Thus, when CLD2 encounters a document written in Shan, it misclassifies it as Burmese with 100% confidence, because it conflates the Burmese script with the Burmese language. The same issue applies to Greek: the Greek alphabet is also used by Pontic Greek.
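To make the failure mode concrete, here is a minimal Python sketch of a script-only shortcut. It is not CLD2's implementation: the SCRIPT_TO_LANGUAGE table and both function names are hypothetical, and because the Python standard library does not expose the Unicode Script property, the script is approximated from the first word of each character's Unicode name. The sample string is simply a short run of Myanmar-block code points that includes Shan-specific characters (U+1075, U+1086, U+1088).

```python
import unicodedata
from collections import Counter

# Hypothetical one-to-one table, mimicking the "script defines the result" shortcut.
SCRIPT_TO_LANGUAGE = {"MYANMAR": "Burmese", "GREEK": "Greek", "THAI": "Thai"}


def dominant_script(text):
    """Approximate the dominant script from Unicode character names.

    Heuristic only: the first word of a character's name (e.g. "MYANMAR" in
    "MYANMAR LETTER SHAN KA") stands in for the real Script property.
    """
    counts = Counter()
    for ch in text:
        # Count letters (L*) and combining marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def naive_language(text):
    """Map the dominant script straight to a language, as the shortcut does."""
    return SCRIPT_TO_LANGUAGE.get(dominant_script(text))


# Myanmar-block code points, several of them Shan-specific (written as escapes).
shan_sample = "\u101c\u102d\u1075\u103a\u1088\u1010\u1086\u1038"

# The shortcut has no way to tell Shan from Burmese: both share the same script.
print(dominant_script(shan_sample), naive_language(shan_sample))
```

Because every character in the sample belongs to the same script, the lookup returns "Burmese" even though the text contains letters that Burmese does not use. Telling such languages apart requires evidence beyond the script itself, such as the specific characters or character sequences that appear.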

In short, once detection moves beyond the 100-200 most common languages recognized by today's language detectors, traditional shortcuts like script detection tend to break down.