The GDELT Project

GDELT Translingual: Translating the Planet

With the debut of GDELT 2.0 we are incredibly excited and proud to announce the public debut of a system we believe will fundamentally reshape how we understand the world around us: GDELT Translingual. Put simply, GDELT Translingual represents what we believe is the largest realtime streaming news machine translation deployment in the world: all global news that GDELT monitors in 65 languages, representing 98.4% of its daily non-English monitoring volume, is translated in realtime into English for processing through the entire GDELT Event and GKG/GCAM pipelines.

With the advent of GDELT Translingual, GDELT is able to take the next steps in its mission to provide a platform for computing on the entire world. Nearly all current global news monitoring platforms focus primarily or exclusively on Western English-language sources, missing the vast majority of world events, while the few platforms that incorporate any form of translation do so sparingly, either through the use of a small team of human translators, through on-demand translation of a small number of articles as they are read by human analysts, or through simplistic keyword tagging of a small number of subjects. In contrast, GDELT Translingual is designed to allow GDELT to monitor the entire planet at full volume, performing full translation of every single news report it monitors in realtime in 65 languages as it sees that article and processing it just the same as it would a native English report, creating the very first glimpses of a world without language barriers.

We are not aware of any other monitoring system in the world that comes even close in its attempt to translate the entire planet in realtime, and creating a platform this massive has required enormous fundamental development over the past few months. In particular, GDELT Translingual has two primary requirements that could not be met by current translation systems: the ability to dynamically modulate the quality of translations to use less computational time during periods of high intake volume to ensure that it is still able to translate all material within the 15 minute window, and the ability to see not just a single “best” translation, but to access the entire universe of all possible translations of each sentence to find the translation that will yield the highest quality Event and GKG matches (more on both in a moment).
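As a rough illustration of these two requirements, the Python sketch below shows one way a time budget could drive translation effort and one way a candidate could be chosen from a list of alternatives. It is not GDELT's actual implementation: the function names, the linear cost model (`cost_per_beam_unit`), and the idea of scoring candidates by counting known phrases are all assumptions made for the example.

```python
WINDOW_SECONDS = 15 * 60  # the 15 minute update interval described above


def pick_beam_size(pending_sentences, seconds_left,
                   cost_per_beam_unit=0.002, min_beam=1, max_beam=100):
    """Return the largest search effort that still clears the backlog in time.

    Assumption: translating one sentence costs about
    cost_per_beam_unit * beam_size seconds. A real system would measure this.
    """
    if pending_sentences <= 0:
        return max_beam
    budget_per_sentence = max(seconds_left, 0.0) / pending_sentences
    beam = int(budget_per_sentence / cost_per_beam_unit)
    return max(min_beam, min(max_beam, beam))


def best_for_downstream(candidates, known_phrases):
    """Pick the candidate translation that matches the most known phrases.

    candidates: list of (text, model_score) pairs for one sentence.
    known_phrases: lowercase phrases the downstream coders look for
    (a hypothetical stand-in for Event and GKG patterns).
    Ties are broken by the translation model's own score.
    """
    def matches(text):
        lowered = text.lower()
        return sum(1 for phrase in known_phrases if phrase in lowered)

    return max(candidates, key=lambda c: (matches(c[0]), c[1]))
```

With a light backlog `pick_beam_size` returns the maximum effort, and as the backlog grows it falls toward the minimum, trading quality for throughput. `best_for_downstream` can prefer a candidate with a lower model score when it contains more of the phrases the downstream systems recognize.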

What you see in the output of GDELT Translingual 1.0 is the result of a truly massive infrastructure with an incredible number of moving parts. As with all machine translation systems, you are bound to find numerous errors in its results, especially in languages with many layers of overlapping decisions, such as ambiguous word segmentation scenarios in languages like Chinese. If you have suggestions for additional language resources or toolkits, or are willing to share Moses translation models or other translation resources, we’d love to hear from you! Remember that what you see here represents just the very first version 1.0 incarnation of GDELT Translingual and we’ve already got a list a mile long of additional ideas and features to constantly improve its accuracy and capabilities from here – so stay tuned, we’re just getting started!

We’ve only just begun to explore the potential of GDELT Translingual and the kinds of previously unimaginable applications that become possible when you can “live translate the planet.”

WHY CREATE A TRANSLATION SYSTEM?

Why create a full-fledged translation infrastructure capable of translating the planet in realtime instead of simply translating GDELT’s Event and GKG systems into the other languages of interest? The most obvious reason is GDELT’s scale: there are more than 120,000 entries in the Global Knowledge Graph, the majority of which are multiword phrases, while GCAM has more than 1.2 million entries, including a large number of lengthy phrases. Translating this much material into 65 languages would simply be intractable. Further, due to the inherent ambiguities of language, a large fraction of these words and phrases could not simply be directly translated – they would require substantial contextual information to permit disambiguation, vastly increasing the amount of material to be translated and the linguistic resources required. In addition, GDELT is constantly growing, with new themes, emotions, and features added every few weeks, which would make it nearly impossible to constantly translate every single new addition into 65 languages on an ongoing basis. Instead, GDELT attempts to blend the best of both worlds, combining support for native script entries in both the Global Knowledge Graph and GCAM (making it possible to filter for a very specific phrasing in a given language) and translation to allow a new feature to instantly function across all 65 languages.

Of course, most of you are probably wondering – why make your own translation system when there are so many systems out there already? The short answer is that, as noted above, GDELT has two unique needs that it does not share with many other translation systems. Building an infrastructure that would allow it to live-translate the entirety of what it monitors in 65 languages in realtime, while coping with the enormous fluctuations in volume over the course of a given day, along with the ability to directly access the underlying universe of possible translations of each sentence to provide feedback, proved difficult to achieve with current systems. Very few systems supported the raw internal access and ability to perform time-constrained translations that GDELT needed. More critically, what you see here represents only the very beginning of GDELT Translingual – over the coming months we plan to massively expand its capabilities, using it as a base platform to break down the barrier of language when monitoring the planet, and this gives us the unbounded flexibility to reshape how translation is used at scale to understand the world in realtime.

 

SUPPORTED LANGUAGES

The following 65 languages are supported by GDELT Translingual (both in native script form and many common transliterations):

 

THE TRANSLATION PIPELINE

GDELT’s translation infrastructure proceeds in a tightly coupled pipeline, designed to maximize efficiency in order to provide streaming translation capable of sustained servicing of GDELT’s 15 minute update interval (including the ability to absorb massive unexpected surges of foreign language material during breaking news events):

 

DATA SOURCES

A translation system as massive as GDELT Translingual, spanning 65 languages, would not be possible without drawing from an enormous collection of linguistic resources and tools. In addition to the resources cited below, GDELT Translingual draws from myriad people and resources that provided translations, grammatical rules, advice and recommendations on language-specific nuances, and a wealth of other information and assistance that has made this platform possible.

GNS

Geography is a key focal point of GDELT and to ensure maximal coverage of even the smallest local locations and name variants, all available multilingual Romanization data was extracted from the United States National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) database. All “S” script entries were crosswalked to their Romanized versions and all entries in which the Romanized and native script names differed were also extracted. GNS only records the native script name of each toponym in the primary official language of that location, meaning that only Egyptian Arabic name variants for Cairo are included in GNS, for example. In practice, the majority of larger cities throughout the world (that are likely to be mentioned in other languages) have entries on Wikipedia with interlingual links connecting them to their English names (discussed in more detail in the Wikipedia section below), allowing mentions of them to be recognized across languages. GNS in this case is used to ensure that mentions of very small local landmarks and cities that are not frequently mentioned outside of local media in the primary domestic language are recognized (since they likely do not have an entry in Wikipedia and/or interlingual links). A small remote hilltop in an uninhabited area of Egypt is likely only mentioned in Egyptian Arabic domestic press (and recorded in GNS), while a major city like Cairo has its own entry in Wikipedia with interlingual links connecting it to its name and name variants in many languages.
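The crosswalk described above can be pictured with a minimal Python sketch. The row layout used here, `(feature_id, is_native_script, name)`, is a hypothetical simplification and not the actual GNS file format, and the sample rows are illustrative only.

```python
from collections import defaultdict


def crosswalk(rows):
    """Map each native-script name to the Romanized names of the same feature.

    rows: iterable of (feature_id, is_native_script, name) tuples.
    These field names are hypothetical, not the real GNS column layout.
    """
    romanized = defaultdict(list)
    native = defaultdict(list)
    for feature_id, is_native_script, name in rows:
        (native if is_native_script else romanized)[feature_id].append(name)

    lookup = {}
    for feature_id, native_names in native.items():
        for native_name in native_names:
            for roman_name in romanized.get(feature_id, []):
                # Keep only pairs where the two forms actually differ.
                if native_name != roman_name:
                    lookup.setdefault(native_name, []).append(roman_name)
    return lookup


# Illustrative rows only; feature 2 has no native-script entry.
sample_rows = [
    (1, False, "Cairo"),
    (1, True, "القاهرة"),
    (2, False, "Giza"),
]
name_lookup = crosswalk(sample_rows)
```

Grouping by feature identifier first keeps the pairing step simple: a native-script name is only ever linked to Romanized names recorded for the same place.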

Google Translate

We would like to thank the Google Translate for Research program for its long-standing grant provided to the GDELT Project. This grant was instrumental in the early prototyping of GDELT’s translation pipeline, in exploring the ability of the GDELT Event and GKG systems to cope with machine translated content, and in verifying and comparing the output of the GDELT Translingual system during its development.

Language Detection

The Google Chrome Compact Language Detector 2 (CLD2) is used for all language identification tasks, supplemented by a set of covering language models to increase accuracy in boundary cases and to dramatically accelerate base separation of English/Non-English.

Moses

The Moses statistical machine translation system is used to provide very high quality translations for a subset of languages for which translation models have been graciously shared with GDELT by their respective creators. We would like to thank Philipp Koehn and the team behind Moses, both for making Moses itself available to the open community, and for the use of several of their translation and language models, including Czech, French, German, Spanish, and Russian:

We'd like to thank Mark Fishel and the Institute of Computer Science at the University of Tartu, Estonia for use of their Estonian translation model:

Unicode Common Locale Data Repository (CLDR)

The Unicode CLDR “provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available.” CLDR data is incorporated directly and through enrichments integrated into the WordNet distributions compiled by:

Wikipedia

Wikipedia represents one of the world’s largest interlingual multidisciplinary knowledge resources. Each of its more than 34.8 million entries spanning over 270 languages includes a set of “interlingual links” that connect it to the equivalent entry in other languages, offering a rudimentary form of translation. Entries for major topics like a capital city or head of state can include interlingual links to the corresponding entries in upwards of 100 other languages, while more specialized entries may have only a handful of links to other languages or none at all. While more nuanced than a true bilingual translation dictionary, these interlingual links nonetheless provide an extremely powerful translation capacity. Most critically, they cover the world’s people, organizations, locations, and major events, connecting their names (which are rarely found in traditional translation guides) across languages. Even more importantly, Wikipedia is constantly updated in near-realtime, meaning that new emerging names are rapidly added to Wikipedia, such that within hours or days of a new figure emerging into the news, there is not only a basic entry for that person on Wikipedia, there are interlingual links translating that person’s name into many other languages. In total, there are currently 29.5 million interlingual links to/from the English Wikipedia’s 4.6 million pages and the other 270 Wikipedia languages. There are 519,169 links to/from Farsi alone and 731,774 to/from Russian, while even smaller languages like Estonian feature 92,081 links to/from English.

However, Wikipedia entries tend to use an entity’s full formal name, which is not always the name used in news coverage. To address this, Wikipedia also includes “redirects,” which are commonly used variants of a name that redirect to the relevant page, allowing searches for “Barack Hussein Obama” and “President Obama” to both redirect to the entry for “Barack Obama.” Redirects are used across all language editions of Wikipedia, meaning that in Arabic, for example, the various shortened names used to refer to major political leaders are all recorded in these redirect fields, connecting the alternate spellings and forms of a name back to the full formal version of that name, whose interlingual links can, in turn, be used to connect each of those language-specific name variants back to the English translation of that name. In total there are 25.7 million redirect links in the top 127 language Wikipedias.

GDELT Translingual combines Wikipedia’s redirects and interlingual links together to build a massive translation dictionary containing over 55 million entries and updated in near-realtime. This dictionary has a particular emphasis on proper names of people, organizations, and named events (such as the “Orange Revolution”).
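A minimal Python sketch of how redirects and interlingual links might be joined into one lookup table follows. The data structures and the function name `build_name_dictionary` are assumptions for illustration, the sample titles are made up for the example, and the real dictionary is built from full Wikipedia data rather than in-memory dicts.

```python
def build_name_dictionary(redirects, interlingual_links):
    """Join redirects and interlingual links into one translation table.

    redirects: {(lang, variant_title): canonical_title} within one language.
    interlingual_links: {(lang, canonical_title): english_title}.
    Returns {(lang, name): english_title} covering both the canonical
    titles and every variant that redirects to them.
    """
    dictionary = dict(interlingual_links)
    for (lang, variant), canonical in redirects.items():
        english = interlingual_links.get((lang, canonical))
        if english is not None:
            # A direct interlingual link, if present, takes precedence.
            dictionary.setdefault((lang, variant), english)
    return dictionary


# Illustrative data only.
links = {("ru", "Барак Обама"): "Barack Obama"}
variants = {
    ("ru", "Обама"): "Барак Обама",
    ("ru", "Неизвестный"): "Нет такой страницы",  # target has no English link
}
names = build_name_dictionary(variants, links)
```

Variants whose redirect target has no interlingual link to English are simply left out, which mirrors the point above that more specialized entries may have few links or none at all.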

Wiktionary

Wiktionary (http://en.wikipedia.org/wiki/Wiktionary) is a sister project of Wikipedia, aiming to “create a free content dictionary of all words in all languages” and today covers 159 languages. Like Wikipedia, Wiktionary features interlingual connectivity, drawing together various translations of a word across languages. Because Wiktionary includes many archaic and rarer alternative translations, some of which conflict with modern use, several filtering passes are used to attempt to remove entries which conflict with the other linguistic data sources herein.
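The filtering passes themselves are not detailed here, so the Python sketch below shows only one plausible form such a pass could take. It assumes, purely for illustration, that a translation "conflicts" when a more trusted source has an entry for the same word but does not list that translation. The function name and data layout are hypothetical.

```python
def filter_conflicts(wiktionary, trusted):
    """Drop Wiktionary translations that disagree with a trusted source.

    Both arguments map (lang, word) -> set of English translations.
    A Wiktionary translation is kept when the trusted source has no entry
    for that word, or when the trusted source lists the same translation.
    """
    kept = {}
    for key, translations in wiktionary.items():
        reference = trusted.get(key)
        if reference is None:
            survivors = set(translations)
        else:
            survivors = set(translations) & reference
        if survivors:
            kept[key] = survivors
    return kept


# Illustrative data: German "Gift" means "poison" in modern use.
wiktionary_entries = {
    ("de", "Gift"): {"poison", "gift"},
    ("de", "Haus"): {"house"},
}
trusted_entries = {("de", "Gift"): {"poison", "venom"}}
filtered = filter_conflicts(wiktionary_entries, trusted_entries)
```

Words that the trusted source does not cover pass through untouched, so a pass like this removes suspect senses without shrinking overall coverage.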

Wiktionary entries are incorporated through two processes. The first directly compiles all interlingual links by parsing Wiktionary itself, while the second incorporates the Wiktionary enrichments integrated into the WordNet distributions compiled by:

WordNet

The following 22 WordNets are used (some encompass multiple languages – see http://compling.hss.ntu.edu.sg/omw/ for more details on each):

Word Segmentation

Not all languages feature unambiguous word boundaries and thus the following algorithms are used to perform word segmentation for their respective languages.