The GDELT Project

Translingual 2.0: A New Era In Globalized Language Detection: GDELT's First Inhouse Language Detector

We are immensely excited to announce the unveiling of the first major piece of Translingual 2.0 and the new GDELT Language Services (GLS) infrastructure: a massive new language detection infrastructure that can identify more than 440 language-script combinations from across the world, ranging from international lingua francas to nearly extinct local dialects.

Since GDELT Translingual's inception in 2014, we have relied upon Google's open source CLD2 language detection library for all of our language identification tasks. CLD2 supports 83 core languages; we run it in extended mode, which expands coverage to 161 languages (175 language-script combinations), plus an additional 65 script-to-language mappings, for a total of 240 language-script combinations.
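For readers unfamiliar with CLD2, the minimal sketch below shows roughly how its detection interface is commonly called from Python via the pycld2 binding. This is purely illustrative and is not our production pipeline; whether the extended (non-core) language tables are available depends on how the underlying CLD2 library was compiled.

```python
# Illustrative only: calling CLD2 through the pycld2 binding.
# Not GDELT's production pipeline; extended-language coverage depends on
# how the underlying CLD2 library was compiled.
import pycld2 as cld2

is_reliable, text_bytes_found, details = cld2.detect(
    "Это небольшой пример текста для определения языка."
)

# `details` holds up to three (language_name, language_code, percent, score) tuples.
print(is_reliable)   # True if CLD2 is confident in its top guess
print(details[0])    # e.g. ('RUSSIAN', 'ru', percent, score)
```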

While this served us well early on, as GDELT has continued to reach ever more locally across the globe, we have been increasingly limited by the selection of languages CLD2 supports, its reduced accuracy in extended mode, the difficulty of extending it to new languages and the limitations of its minified quadgram+octagram architecture, which was designed to optimize for minimal memory consumption over language coverage. CLD2 remains a superb tool for the overwhelming majority of traditional language detection applications and is highly accurate on the languages most users will ever encounter, but GDELT's unique global lens, its increasing push towards local news in long-tail languages and our need for larger language models to increase differentiation precision mean we needed a fundamentally new approach and architecture to language detection. Most importantly, as we increasingly work with local language communities and NGOs to acquire or build machine translation resources for these long-tail languages, we need language detection services that can be updated in lock-step. After all, to deploy a translation model, we first need a language detection model that can robustly identify content in that language so we can forward it for translation.

Today we are incredibly excited to unveil the first results of our massive new initiative around global languages: our first inhouse language detection infrastructure, supporting more than 440 language-script combinations and counting. The initial set of languages that have passed our final evaluation tests over the past few months can be seen below. We will be deploying this infrastructure internally in the coming weeks to evaluate it at production scale and gather final accuracy and differentiation performance statistics before migrating our language processing workflows to it as part of the transition to GDELT 3.0. During this process some of the languages below may change as we evaluate production performance over time, so consider this a working list of languages that have passed our initial tests.

Enormous effort has gone into compiling, researching and purifying training datasets for each of these languages; into endless experimentation to find the mix of sample data that yields the greatest differentiation from similar languages and the truest representation of each language; and into research, experimentation and fine-tuning across a vast landscape of modeling approaches to arrive at a final architecture that achieves the best accuracy and highest inference performance. The final inference process was designed to be extremely efficient: its runtime is linear in article length, it runs on ordinary CPUs, it can be implemented as a highly optimized pipeline in any programming language and it trades memory for computation to keep runtime low.
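While we are not detailing the new model architecture here, the toy sketch below illustrates the general shape of that tradeoff. It assumes a simple character n-gram log-probability scorer (an assumption for illustration, not our actual model): large precomputed per-language tables are held in memory, and scoring is a single pass over the article text, so runtime grows linearly with article length and requires nothing beyond an ordinary CPU.

```python
# A minimal sketch, NOT the production detector: it assumes a simple
# character n-gram log-probability model purely to illustrate the design
# described above -- large per-language tables held in memory (trading
# memory for computation) and a single linear pass over the article text.
from collections import defaultdict

def detect_language(text, ngram_logprobs, n=4, floor=-12.0):
    """Score `text` against per-language character n-gram log-probability
    tables and return the best-scoring language.

    ngram_logprobs: {language: {ngram_string: log_probability}}
    floor: penalty applied when an n-gram is unseen for a language.
    """
    scores = defaultdict(float)
    # One pass over the text: O(len(text)) n-gram lookups per language.
    for i in range(len(text) - n + 1):
        gram = text[i : i + n]
        for lang, table in ngram_logprobs.items():
            scores[lang] += table.get(gram, floor)
    return max(scores, key=scores.get) if scores else None

# Hypothetical toy tables; real tables would hold millions of entries per language.
toy_model = {
    "eng": {"the ": -2.0, "and ": -2.5, " of ": -2.7},
    "deu": {"der ": -2.1, "und ": -2.3, "sch ": -2.9},
}
print(detect_language("the cat and the dog", toy_model))  # -> "eng"
```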

We are tremendously excited about this first step towards Translingual 2.0 and will be sharing a steady stream of updates in the weeks to come!