The GDELT Project

Experiments With Machine Translation: Connecting Original And Translated Text Through Many-To-Many Translation Graphs

In traditional machine translation, an entire news article is handed to the translation engine as a text blob, and the engine returns a text blob containing the translation of the article into the target language. In statistical machine translation (SMT) there is typically a one-to-one correspondence of sentences between languages, but neural machine translation (NMT) systems can sometimes break this assumption through hallucination, merging of short sentences and artifacts, and unstable edge-case handling. This means it isn't possible to simply chop the two texts into sentences and assume a one-to-one correspondence between original and translated text.

Within sentences, the narrow rolling window of SMT systems and their explicit probabilistic selection process made it possible to connect source and translated texts at the subsentence level. It was even possible to capture and preserve the entire potential translation space as a massive probability tree, recording for each clause any high-likelihood alternative translations and tracing their impact on subsequent translation decisions. In contrast, production NMT systems operate more like opaque black boxes, offering far less visibility into their decision-making process, which makes it extremely difficult to create the same kinds of clausal-level mappings between source and translated text.

With Translingual 2.0 we are intently focused on preserving this clausal-level mapping and extending it down to the level of individual words and high-level semantic entities. Each article is represented as a many-to-many graph connecting source and translated text at the word level: each word, span of words or set of multiple spans in one text is connected to its counterpart in the other. This accommodates everything from sentential ordering changes and structural incompatibilities between languages to the many-to-many mappings often required to express the same concept in two dissimilar languages.
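To make the idea of a many-to-many translation graph concrete, here is a minimal Python sketch. It is an illustration only, not the actual Translingual 2.0 data model: the `Link` and `TranslationGraph` names, the token-index representation and the French example are all assumptions made for this example. Each link ties a group of source tokens (possibly discontiguous) to a group of target tokens, and tokens with no counterpart are simply left unlinked.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Link:
    """One edge: a group of source tokens tied to a group of target tokens.

    Hypothetical structure for illustration; indices may be discontiguous.
    """
    source: Tuple[int, ...]  # token indices in the original text
    target: Tuple[int, ...]  # token indices in the translated text


@dataclass
class TranslationGraph:
    source_tokens: List[str]
    target_tokens: List[str]
    links: List[Link] = field(default_factory=list)

    def source_for(self, target_indices: Iterable[int]) -> List[int]:
        """Return source token indices linked to any of the given target tokens."""
        wanted = set(target_indices)
        hits = set()
        for link in self.links:
            if wanted.intersection(link.target):
                hits.update(link.source)
        return sorted(hits)

    def target_for(self, source_indices: Iterable[int]) -> List[int]:
        """Return target token indices linked to any of the given source tokens."""
        wanted = set(source_indices)
        hits = set()
        for link in self.links:
            if wanted.intersection(link.source):
                hits.update(link.target)
        return sorted(hits)

    def render_target(self) -> str:
        """Flatten the translation to plain text, e.g. for full-text indexing."""
        return " ".join(self.target_tokens)


# "Il ne mange pas de pommes de terre" -> "He does not eat potatoes"
graph = TranslationGraph(
    source_tokens=["Il", "ne", "mange", "pas", "de", "pommes", "de", "terre"],
    target_tokens=["He", "does", "not", "eat", "potatoes"],
    links=[
        Link((0,), (0,)),        # Il -> He
        Link((1, 3), (1, 2)),    # ne ... pas -> does not (discontiguous, many-to-many)
        Link((2,), (3,)),        # mange -> eat
        Link((5, 6, 7), (4,)),   # pommes de terre -> potatoes (many-to-one)
    ],                           # the partitive "de" (index 4) has no counterpart
)
```

In this sketch, `graph.source_for([4])` walks from "potatoes" back to the three-token span "pommes de terre", and `graph.render_target()` yields the flat translated text that a standard monolingual pipeline would consume.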

This unique graph representation allows us to readily render the translation into distilled static text for full-text indexing and standard NLP processing pipelines, but most importantly it opens the door to a wealth of cross-language tasks in which we can operate on the two languages as one. As just one trivial example, a geocoder that functions only in English could identify and disambiguate a location reference in the translated text. The graph then maps that mention back to its span in the original text, where a source-language entity-level sentiment tool can compute the native sentiment of the location mention, combining the two monolingual tools into a single bilingual pipeline.
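The geocoder example can be sketched in a few lines of Python. Everything here is a toy stand-in chosen for illustration: the gazetteer, the Spanish sentiment lexicon, the alignment pairs, the window size and all function names are assumptions, not real GDELT components. The point is only the flow: geocode in English, follow the alignment back to the source tokens, then score sentiment in the source language.

```python
from typing import Dict, List, Tuple

# Toy article: Spanish source and its English translation (hypothetical data).
SOURCE = ["Las", "inundaciones", "terribles", "en", "Londres"]
TARGET = ["The", "terrible", "floods", "in", "London"]

# Word-level alignment as (source indices, target indices) pairs.
# Note the adjective/noun reordering between the two languages.
ALIGNMENT = [((0,), (0,)), ((1,), (2,)), ((2,), (1,)), ((3,), (3,)), ((4,), (4,))]

# Stand-ins for an English-only geocoder and a Spanish-only sentiment lexicon.
GAZETTEER: Dict[str, Tuple[float, float]] = {"London": (51.5074, -0.1278)}
LEXICON_ES: Dict[str, float] = {"terribles": -1.0, "inundaciones": -0.5}


def geocode_english(tokens: List[str]) -> List[Tuple[int, Tuple[float, float]]]:
    """English-only step: return (token index, coordinates) for known places."""
    return [(i, GAZETTEER[t]) for i, t in enumerate(tokens) if t in GAZETTEER]


def source_indices_for(target_index: int) -> List[int]:
    """Follow the alignment from one translated token back to the source tokens."""
    hits = set()
    for src, tgt in ALIGNMENT:
        if target_index in tgt:
            hits.update(src)
    return sorted(hits)


def source_sentiment(tokens: List[str], mention: List[int], window: int = 3) -> float:
    """Source-language step: mean lexicon score of words near the mention."""
    lo = max(0, min(mention) - window)
    hi = min(len(tokens), max(mention) + window + 1)
    scores = [LEXICON_ES[t] for t in tokens[lo:hi] if t in LEXICON_ES]
    return sum(scores) / len(scores) if scores else 0.0


def bilingual_pipeline() -> List[dict]:
    results = []
    for target_index, coords in geocode_english(TARGET):
        mention = source_indices_for(target_index)
        if not mention:
            continue  # place name has no aligned source tokens
        results.append({
            "place": TARGET[target_index],
            "coords": coords,
            "source_mention": " ".join(SOURCE[i] for i in mention),
            "sentiment": source_sentiment(SOURCE, mention),
        })
    return results


results = bilingual_pipeline()
```

Neither tool needs to know the other language exists: the alignment is the only bridge between them, which is what lets two monolingual components behave as one bilingual pipeline.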