A key challenge in machine translation is that accuracy tends to come at the cost of speed and hardware requirements – a tradeoff that has grown dramatically worse in the neural (NMT) era. Even relatively small and highly optimized neural translation models can easily saturate a V100 or even an A100 GPU at just a few hundred words per second, and they require careful memory management that typically limits each GPU or cluster of GPUs to a single language. Achieving even those few hundred words per second requires a substantial sacrifice in entity fidelity, dramatically reducing accuracy on even common noun phrases. Yet, even with these tradeoffs, hundreds of words per second per high-end GPU is simply not tractable given the sheer volume of material GDELT monitors each day in more than 152 languages.
In GDELT Translingual 1.0, we used the concept of dynamic model modulation to create a highly efficient throughput-optimized architecture. Each article was prefiltered through a low-cost minimized model and then successively passed through higher and higher resolution model layers to improve its translation quality, based on an assessment of the article's linguistic needs and the available hardware resources. During periods of relatively low resource contention, all articles might complete maximal-quality translation, while during peak resource needs (such as a high-intensity breaking story yielding a sudden unexpected surge of coverage in a single language), articles were triaged based on their "minimodel" translations. An article with simple grammatical and narrative structure and straightforward word use would be translated just enough to recover its core intent, while an article with a heavy density of emotional language, complex narration and reliance on turns of phrase and unusual word contexts would be passed all the way through the entire model stack.
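To make the tiered flow concrete, the sketch below shows one way such a cascade could be wired up. Everything here is illustrative: the tier names, quality values, lexical-diversity complexity score and load-based stopping rule are assumptions standing in for GDELT's actual (unpublished) models and triage heuristics.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    quality: float  # relative translation quality this tier can deliver (0..1)

# Hypothetical model stack, ordered cheapest-first; names and quality
# values are illustrative, not GDELT's actual models.
TIERS = [Tier("minimodel", 0.6), Tier("midmodel", 0.8), Tier("fullmodel", 0.95)]

def linguistic_complexity(text: str) -> float:
    """Stand-in complexity score in 0..1. The real assessment would weigh
    emotional-language density, narrative complexity, turns of phrase and
    unusual word contexts; here we just use lexical diversity."""
    words = text.lower().split()
    return len(set(words)) / max(len(words), 1)

def translate_tiered(article: str, load: float) -> str:
    """Cascade an article through successively higher-resolution tiers.

    `load` (0..1) is current resource contention. At low load the target
    quality saturates at 1.0, so every article runs the full stack; at
    high load the target collapses toward the article's own complexity,
    so simple articles stop at the minimodel while dense, idiom-heavy
    ones still traverse every layer."""
    target = min(1.0, linguistic_complexity(article) + (1.0 - load))
    translation = article
    for tier in TIERS:
        translation = f"<{tier.name}: {translation}>"  # placeholder for a real model pass
        if tier.quality >= target:
            break
    return translation
```

Under this toy rule, calling `translate_tiered(article, load=0.95)` during a coverage surge stops most articles at the cheapest tier, while the same call at `load=0.1` pushes everything through all three layers.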
GDELT Translingual 2.0's new architecture replaces this layered approach with a hybrid neural-statistical architecture that blends the much higher quality of neural translation with the orders-of-magnitude higher speed of statistical translation. Most importantly, we graft this onto an intermediary semantic graph representation that allows us to represent major known entities, topics and even ordinary noun phrases as high-order known semantic entities within the model, preserving their meaning through to the final translation. This avoids the proper-name limitations of neural models and allows us to minimize hallucination in lower-resource languages. The new architecture has also been designed for continual updating of the underlying models over time, with automatic detection of emerging words and phrases in need of translation and the ability to precisely measure the contextual impact of each new addition on overall model performance – characteristics lacking in our previous generation of architectures and not typically available in pure neural architectures, but whose need has been reinforced by the pandemic's flood of new words entering the news lexicon. Finally, by substituting large memory footprints and creative representational structures for raw matrix operation speed, we are able to run this new architecture entirely on CPUs, dramatically increasing scalability and reliability: the models can now run on any high-performance CPU VM globally rather than requiring a specific (and less widely available) GPU architecture.
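One way to picture the semantic-graph grafting is as entity "locking": known entities are swapped for opaque tokens before the fast statistical pass and restored from the graph afterward, so the translator can never garble them. The sketch below assumes this reading; the `ENTITY_TABLE` gazetteer, the graph IDs, and the `translate_statistical` stub are all hypothetical illustrations, not GDELT's actual data structures.

```python
# Hypothetical gazetteer mapping source-language surface forms to
# canonical entity IDs in the semantic graph (illustrative entries;
# the Russian strings are "World Health Organization" and
# "UN Security Council").
ENTITY_TABLE = {
    "Всемирная организация здравоохранения": "E001",
    "Совет Безопасности ООН": "E002",
}

# English renderings of those canonical graph entities.
ENTITY_RENDER = {
    "E001": "World Health Organization",
    "E002": "UN Security Council",
}

def translate_statistical(text: str) -> str:
    """Stand-in for the fast statistical translation core."""
    return f"<statistically translated: {text}>"

def translate_entity_preserving(text: str) -> str:
    # 1. Replace each known entity with an opaque placeholder so the
    #    statistical core treats it as a single unbreakable unit.
    slots = {}
    for i, (surface, entity_id) in enumerate(ENTITY_TABLE.items()):
        token = f"__ENT{i}__"
        if surface in text:
            text = text.replace(surface, token)
            slots[token] = entity_id
    # 2. Translate the surrounding prose at statistical-model speed.
    out = translate_statistical(text)
    # 3. Restore each entity from the semantic graph, guaranteeing the
    #    proper name survives the translation verbatim.
    for token, entity_id in slots.items():
        out = out.replace(token, ENTITY_RENDER[entity_id])
    return out
```

Because the entity never passes through the translation core at all, this pattern sidesteps both dropped proper names and hallucinated substitutes, which is exactly where pure neural systems struggle in lower-resource languages.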
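The continual-update loop described above might look like the following skeleton: flag out-of-vocabulary terms whose daily frequency surges, then score each candidate addition against a held-out corpus before committing it. Every name, threshold and interface here (`model`, `bleu`, the surge heuristic) is an assumption of ours for illustration; GDELT has not published its actual update mechanics.

```python
from collections import Counter

def detect_emerging_terms(todays_tokens, vocabulary, daily_baseline,
                          surge_factor=5.0, min_count=50):
    """Return out-of-vocabulary terms seen at least `min_count` times
    today and at `surge_factor` times their historical daily average.
    A deliberately simple surge heuristic for illustration."""
    today = Counter(t for t in todays_tokens if t not in vocabulary)
    return [term for term, n in today.items()
            if n >= min_count and n >= surge_factor * daily_baseline.get(term, 0.1)]

def score_addition(model, candidate_entry, heldout_corpus, references, bleu):
    """Measure the contextual impact of one new translation entry by
    scoring a held-out corpus with and without it and reporting the
    delta. `model` and `bleu` are hypothetical interfaces the caller
    supplies."""
    base = bleu(model.translate_all(heldout_corpus), references)
    model.add_entry(candidate_entry)
    trial = bleu(model.translate_all(heldout_corpus), references)
    model.remove_entry(candidate_entry)  # commit separately, only if delta > 0
    return trial - base
```

The point of the with/without delta is that it makes each vocabulary addition individually auditable, which is the "precisely measure the contextual impact" property that pure neural architectures, with their entangled weights, generally cannot offer.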