Translingual 2.0: A New Approach To Scriptio Continua Languages

Translingual 1.0 was based on the concept of language as space-delimited sequences of discrete "words." For scriptio continua languages like Burmese, Chinese, Japanese, Khmer, Lao, Thai, and others, Translingual 1.0 was forced to rely on third-party text segmentation packages to preprocess each input article, splitting it into "words" using statistical or machine learning language models. While some of these tools were reasonably efficient in both computation and memory, others had considerable resource requirements and became limiting factors in our translation pipelines, as they often used architectures that were difficult to optimize.
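As one concrete illustration of that preprocessing dependency, the sketch below delegates segmentation to the open-source jieba statistical segmenter for Chinese, so that downstream stages can treat the output as space-delimited "words." jieba stands in here for the whole class of third-party tools described above; it is not necessarily one we used.

```python
# Minimal sketch of the Translingual 1.0-era preprocessing step: hand a
# continuous Chinese article to a third-party segmenter (here jieba) and
# rejoin the pieces with spaces so later stages see discrete "words."
import jieba  # pip install jieba

def presegment(article: str) -> str:
    """Split continuous Chinese text into space-delimited tokens."""
    return " ".join(jieba.lcut(article))

print(presegment("我们监测全球新闻中的事件与情感"))
# e.g. -> 我们 监测 全球 新闻 中 的 事件 与 情感
```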

The quality of such segmentation tools varies dramatically, ranging from results comparable to human annotators to results little better than gibberish, and the tools are often highly domain-sensitive. Most importantly, many languages lack robust domain-independent segmentation tools, and the most robust tools tend to use machine learning models that cannot be readily updated over time as new words enter the lexicon (an especially acute problem in the Covid-19 era). Thus, the scriptio continua languages we could support under Translingual 1.0 were entirely dependent on the availability of robust segmentation support, and updating lexicons required elaborate pre- and post-processing workflows to work around the fact that most segmentation libraries do not support external offset marking and interaction.
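To make the lexicon-update problem concrete, the sketch below shows two workarounds, again using jieba purely as a stand-in: some segmenters expose a user dictionary, while tools without one force exactly the kind of protect-and-restore pre- and post-processing described above. Neither snippet reflects our actual workflow.

```python
import jieba  # illustrative stand-in for any third-party segmenter

# Case 1: the segmenter exposes a user dictionary, so a new coinage can
# be registered directly.
jieba.add_word("新冠病毒")  # "novel coronavirus," absent from older lexicons

# Case 2: no lexicon hook -- swap the new term for a placeholder before
# segmenting, then restore it afterwards.
def segment_with_protected_term(text: str, term: str) -> list[str]:
    placeholder = "\uE000"  # private-use codepoint, unlikely in news text
    tokens = jieba.lcut(text.replace(term, placeholder))
    return [term if tok == placeholder else tok for tok in tokens]
```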

We are excited to announce that Translingual 2.0 sees language entirely differently: as a richly diverse landscape of different organizations of meaning whose form varies by language and family. For languages that do not place traditional dividers between individual units of meaning, Translingual 2.0 understands text as sequences of subwords representing "minimal meaning units" (MMUs) that can range from a single ideogram or grapheme cluster to more complex groupings that convey some form of basic building block of meaning in the source language. Decomposing continuous character sequences into MMUs allows Translingual 2.0 to natively process any scriptio continua language without requiring it to be pre-segmented.
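The smallest MMUs mentioned above, grapheme clusters, are already a level above raw codepoints. As a toy illustration of just that atomic floor (real MMUs can span multiple clusters, and this is not Translingual 2.0's implementation), the third-party Python regex module can split Thai text into Unicode extended grapheme clusters with \X:

```python
import regex  # pip install regex; the stdlib "re" module lacks \X

text = "น้ำ"  # Thai for "water": 3 codepoints
clusters = regex.findall(r"\X", text)
print(len(text), len(clusters))  # 3 codepoints, typically 2 clusters
print(clusters)  # the tone mark attaches to its base consonant
```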

Unlike byte-pair encoding (BPE), whose artificial statistical decomposition yields character sequences that hold no meaning for speakers of the language, MMUs model the underlying morphology of a language, allowing them to more readily interact with overlays, such as an SME-provided rare Covid-19 term or contextual influencer, as well as with Translingual 2.0's high-order entity representation model. In other words, where BPE reduces text to an arbitrary vocabulary of codepoint sequences, MMUs represent language as atomic building blocks that model how meaning is expressed in that language.
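To make the contrast concrete, here is a from-scratch sketch of a single BPE merge step in the style of Sennrich et al. (2016), on a toy English corpus: the algorithm fuses whichever adjacent symbol pair is most frequent, with no regard for whether the result is a meaningful unit of the language.

```python
from collections import Counter

def most_frequent_pair(corpus: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across the corpus; return the top one."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(corpus: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# Start from raw characters; each merge is purely frequency-driven.
corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
pair = most_frequent_pair(corpus)  # ('w', 'e') in this toy corpus
corpus = merge_pair(corpus, pair)  # "lower" -> ['l', 'o', 'we', 'r']
```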

Using this new architecture, Translingual 2.0 no longer uses external word segmentation libraries: it accepts the original source content as its native input and, as it models the source text, "sees" it as a sequence of MMUs that, in turn, become successively higher-order semantic representations that come together to form the final translation graph.
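Purely as a sketch of that stage ordering, and with every name below invented for illustration (none of this reflects Translingual 2.0's actual internals), the flow might be pictured as:

```python
from dataclasses import dataclass

@dataclass
class Mmu:
    surface: str  # the character span this minimal meaning unit covers
    start: int    # offset into the original, unsegmented source text

class TranslationGraph:
    ...  # placeholder for the final translation graph

def segment_to_mmus(source: str) -> list[Mmu]:
    ...  # the model decomposes raw text directly; no external segmenter

def compose_semantics(mmus: list[Mmu]) -> TranslationGraph:
    ...  # MMUs combine into successively higher-order representations

def translate(source: str) -> TranslationGraph:
    """The original, unsegmented source content is the native input."""
    return compose_semantics(segment_to_mmus(source))
```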

We are tremendously excited about the possibilities this new architecture brings for our expansion to the world's diverse languages!