Translingual 2.0: The Complexities Of Building Language Models Over Many Languages

At the core of many language tasks like language detection is the construction of language models over large numbers of languages with vast differences in the availability of monolingual content. Some languages may have petabytes of available content covering almost every imaginable topic with a high degree of "purity" (the absence of other languages mixed in). Other languages may have just megabytes of content covering only a narrow topic from a handful of authors, while still other languages may be limited to just a few tens of kilobytes of readily accessible content. The primary scripts of some languages have yet to be integrated into Unicode, while others may have multiple scripts, of which just one or two are available in Unicode.

Adding to this complexity, codeswitching and the heavy integration of loanwords or foreign content are common in many languages, making it more difficult to statistically isolate the language itself. Wikipedia is a common source of monolingual content, but many languages' Wikipedia editions are filled with content from outside the language, especially English. Some smaller Wikipedia editions have as little as 20-30% of their content in their designated language, with large quantities of generic title-only stub articles, as well as entire articles, and large passages of others, written in English or in geographically or culturally proximate languages that are mutually intelligible to, or widely understood by, readers of the given language. Some languages can be written in anywhere from half a dozen to a dozen or more scripts, with available content often unevenly distributed among them. In some cases, communities self-identify through the script they use to write a common language, making recognition of that language/script combination especially critical, but also making it harder to find sufficient example content for it.
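
One rough way to surface both the cross-script mixing and the uneven distribution of content across scripts is simply to tally which Unicode scripts a page's characters fall into. The sketch below is a minimal, assumed heuristic using only the Python standard library, deriving a coarse script label from each character's Unicode name; it is not any particular system's method, and it will not catch same-script contamination such as English text embedded in a Latin-script Wikipedia edition.

```python
import unicodedata
from collections import Counter

def script_profile(text):
    """Return the share of alphabetic characters in `text` belonging to each
    coarse Unicode script, using the leading word of the character's Unicode
    name (e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC") as a rough stand-in
    for a true Script property lookup."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        try:
            label = unicodedata.name(ch).split(" ")[0]
        except ValueError:
            continue  # unassigned or unnamed code point
        counts[label] += 1
    total = sum(counts.values()) or 1
    return {script: count / total for script, count in counts.items()}

# A nominally Serbian passage with embedded English: the profile comes out
# roughly half CYRILLIC and half LATIN, flagging heavy cross-script mixing.
print(script_profile("Београд је главни град Србије. See also the English article."))
```

A production pipeline would presumably rely on the actual Unicode Script property rather than name prefixes, but even this crude tally is enough to flag editions whose nominal script accounts for only a fraction of their text.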

When expanding beyond the world's most common languages, identifying high-purity, multiauthor, multitopic monolingual content becomes especially difficult or even impossible. No single source covers even a small fraction of the world's actively written languages today.

The end result is that very large-scale global language detection must be based on language models constructed across a rich diversity of sources with vast variations in size and representative scope. This means the underlying modeling approaches must be robust enough to model a language with petabytes of available content covering the entire topical range of its underlying societies alongside a language with just a few tens of kilobytes written by a single author on a single topic, without allowing one to dominate the other.
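
As a minimal sketch of one common way to achieve this balance (not necessarily the approach used here), the character n-gram model below normalizes raw counts into per-language probability distributions, so a language trained on gigabytes of text and one trained on kilobytes produce scores on the same scale. The `train` and `score` functions, the trigram order, and the smoothing floor are all illustrative assumptions.

```python
import math
from collections import Counter

def train(text, n=3):
    """Build a character n-gram profile normalized to probabilities,
    so that raw corpus size does not affect the scale of later scores."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def score(text, model, n=3, floor=1e-7):
    """Average per-n-gram log-probability of `text` under `model`;
    unseen n-grams fall back to a small floor probability (crude smoothing)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return float("-inf")
    return sum(math.log(model.get(g, floor)) for g in grams) / len(grams)

# Both profiles are probability distributions, so a language trained on a
# huge corpus and one trained on a tiny corpus yield directly comparable scores.
models = {
    "ces": train("příliš žluťoučký kůň úpěl ďábelské ódy " * 50),
    "deu": train("über die brücke lief ein fröhlicher junge " * 50),
}
print(max(models, key=lambda lang: score("žluťoučký kůň", models[lang])))  # -> ces
```

For the tiny corpora the smoothing and backoff choices matter far more than for the large ones, since most n-grams in new text will never have been seen in training.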

For highly dissimilar languages like Arabic and Russian (disjoint scripts) or Czech and German (a shared script but largely disjoint character usage), most modeling approaches are highly robust to these data issues. Conversely, most approaches struggle to distinguish highly similar languages like Malay and Indonesian, or the members of a mutually intelligible language family, even when the training data for each language is fairly comparable, let alone when it is widely divergent, requiring a portfolio of approaches.
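
To make that contrast concrete, the toy sketch below (with made-up snippets and an assumed trigram/cosine setup, not any production system's metric) compares character trigram profiles: across disjoint scripts the overlap is essentially zero, while Malay and Indonesian share a large portion of their trigrams, which is exactly where a single modeling approach starts to fail and a portfolio becomes necessary.

```python
import math
from collections import Counter

def profile(text, n=3):
    """Normalized character n-gram profile of `text`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}

def cosine(p, q):
    """Cosine similarity between two n-gram profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

# Illustrative snippets only: cross-script pairs barely overlap, while
# closely related languages share a large portion of their trigrams.
ara = profile("اللغة العربية لغة سامية مكتوبة بالخط العربي")
rus = profile("русский язык относится к восточнославянским языкам")
msa = profile("bahasa melayu ialah bahasa rasmi di malaysia")
ind = profile("bahasa indonesia adalah bahasa resmi di indonesia")

print(round(cosine(ara, rus), 2))  # near zero: trivially separable
print(round(cosine(msa, ind), 2))  # far higher: much harder to tell apart
```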