Experiments With Machine Translation: Neural Machine Translation, Fidelity, Entity Recognition And Grammatical Fluency

Neural machine translation (NMT) has emerged as the dominant approach to machine translation today, replacing statistical (SMT) approaches, which in turn had replaced rules-based (RBMT) systems. NMT has become the go-to translation architecture because of its ability to achieve seemingly human-like fluency on many texts and across highly dissimilar language pairs. Yet this high degree of apparent fluency masks deeper challenges that complicate its use in at-scale news translation.

Fidelity To Source Text

SMT systems yielded high fidelity to original source texts, since they translated in short phrases with minimal and highly localized clause restructuring. While the resulting translations may have exhibited poor grammatical structure, they did not reach beyond the confines of the source material. In contrast, NMT systems exhibit far more interpretive behavior akin to a human translator and can even hallucinate entirely unrelated passages.

Smaller NMT models tend to be more faithful to the original text, preserving more of the original argument structure, while larger models, especially state of the art models, tend to rewrite arguments and add rhetorical flourishes absent from the original text that can fundamentally alter its meaning. For example, one major commercial system rendered a Spanish sentence whose literal translation is "He didn't comment." as "He made it clear to everyone who would listen that he would never make a public statement that leaned either way."

Both statements convey that the individual did not provide comment, but the latter ascribes a set of actions and state of mind that are unsupported by the original text. In fact, we have observed that such rhetorical or hallucinated expansions are surprisingly common in state of the art commercial systems, meaning their translations frequently deviate from the source text in ways that imply events, actions, beliefs or intent that go beyond (often far beyond) the original text.

Human translators can also add such rhetorical flourishes, interpretations and "added intent" when translating works for a contemporary or unfamiliar audience to contextualize the ideas within, especially in the field of literature. However, when it comes to news coverage, the need to precisely recover the intent, mindset and nuance expressed in the source text requires a high degree of fidelity to the source text and minimal interpretation. Today's NMT models typically do not have an inference-time parameter to adjust the degree of rewriting they perform – this must be tuned during training through a combination of train/test data and model design/parameters. Simpler models, especially those tuned for mobile use, tend to exhibit less interpretation of source material, but the degree to which NMT systems imply large-scale meaning that is not present (as opposed to connotative and imperfect translations) is an unsolved challenge with significant implications for downstream analytic tasks.

Entity Recognition

At first glance, the results of any major commercial translation system will typically appear almost human-like, with a natural grammatical structure that avoids the stilted and abrupt transitions of SMT systems and that provides sentence-level arrangement of structure. Look again and that fluency can mask a major limitation: inaccurate translation of proper names and other entities.

  • Numeric Quantities. Accuracy can be especially confounding when it comes to faithfully translating numeric quantities. One major state of the art commercial system would until recently incorrectly translate "trillions" as "billions" in Spanish in a number of contexts, but would correctly translate it in others. "Six billion dollars" might become "six trillion dollars" while "six billion people" might be correctly translated and "six billion acres" incorrectly become "six trillion acres" – there appeared to be no pattern to which contexts it correctly and incorrectly translated. In fact, translating the numbers 1-99 and adding "hundred/thousand/million/billion/trillion" to the end resulted in hit-or-miss translations: many numbers translated correctly regardless of context, but a surprising number of numeric quantities would be translated incorrectly depending on the surrounding text – just changing the object of the quantity to different words would yield alternating accurate/inaccurate translations. The opaque nature of NMT models makes it more difficult to externally verify and diagnose these behaviors compared with the observable probability tables of SMT systems. That such errors manifest themselves even in the most advanced commercial systems offers testament to the difficulty of verifying the accuracy of NMT systems.
  • New Vocabulary. New words enter the vocabularies of the world's languages every day, often in the form of proper names, but also new adjectives, verbs and other parts of speech. Words can even change their part of speech, such as the English noun "impact" increasingly being employed as a verb. SMT systems offer a range of robust approaches to such Out Of Vocabulary (OOV) terms, while even state of the art commercial NMT systems struggle with this. A year and a half since the term "Covid-19" was formally announced, top commercial NMT systems still routinely struggle with translating it, frequently mistranslating it as everything from "coronary" and "crown" to myriad ever more bizarre translations, even when limited to just coverage from papers of record in each language that employ rigorous adherence to linguistic and grammatical norms. The difficulty in grafting emerging words into NMT models means even large commercial systems can rely on fragile corrected NMT translations that can yield even more unpredictable behavior.
  • Proper Names. The reliance of NMT systems on word decomposition and other strategies to manage their vocabularies can lead to bizarre mistranslations. From "Barbados Obama" to "Antigua Merkel" to "Gooplike" (Google) to "Deal" (Dell) to "Serious" (Syria), the limitless ways in which NMT systems mistranslate proper names are remarkable both for their variety and their egregiousness. Company names and acronyms pose special challenges to NMT systems, as common words are repurposed to new meanings. Top commercial NMT systems tend to have particular problems with Chinese company names, mistranslating even some of the largest firms in the country that have existed for years and are well-represented in available training material. Proper names that translate into less common English names tend to exhibit particularly poor translation performance. Even names that are common in a given language and have existed widely in that language for centuries or even millennia will often fail to be accurately translated if their English equivalent is uncommon. This makes for especially problematic translations of local news coverage referring to local people, places, organizations and events.
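
The numeric probing described under Numeric Quantities above can be approximated with a small harness. What follows is only a minimal sketch in Python, not the procedure actually used: the translate callable is a hypothetical stand-in for whatever system is under test, the Spanish magnitude phrases and objects are illustrative assumptions, and the check (does the expected English magnitude word appear in the output?) is deliberately crude and will flag valid outputs that spell the number out as digits.

```python
import re
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Assumed Spanish magnitude phrase -> English magnitude word expected in the output.
# Spanish uses the long scale, so "billones" corresponds to English "trillion".
MAGNITUDES = {
    "mil": "thousand",
    "millones de": "million",
    "mil millones de": "billion",
    "billones de": "trillion",
}
# Illustrative objects; varying the object is the point of the probe.
OBJECTS = ["dólares", "personas", "acres"]


def build_probes(numbers: Iterable[int] = range(2, 100)) -> Iterator[Tuple[str, str, str]]:
    """Yield (source_phrase, expected_english_magnitude, object) probes."""
    for n, (src_mag, expected), obj in product(numbers, MAGNITUDES.items(), OBJECTS):
        yield f"{n} {src_mag} {obj}", expected, obj


def audit(
    translate: Callable[[str], str],
    probes: Iterable[Tuple[str, str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group probes whose output lacks the expected magnitude word, keyed by object."""
    failures: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, expected, obj in probes:
        output = translate(source)
        words = re.findall(r"[a-z]+", output.lower())
        if expected not in words:
            failures[obj].append((source, output))
    return dict(failures)
```

Grouping failures by object makes context-dependent errors visible: if the same quantity fails only with certain objects, the problem lies in the surrounding text and not in the number itself.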


Separate from accuracy, it is worth noting that even a very small and highly optimized NMT model with highly tuned pipelining can fully saturate an A100 or V100 GPU at just a few hundred translated words per second or fewer, depending on the language, and requires complex memory management and adaptive batching to achieve maximal throughput. This means that even with extremely small and highly optimized models and access to very large numbers of GPU accelerators, it can be difficult to scale NMT systems to handle realtime content firehoses. GPU-equipped VMs also don't currently support live migration, and GPU memory pressure can result in unpredictable application failure patterns at scale, requiring careful pipelining and making it more difficult to provide fixed-time latency guarantees.
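
One common form of the adaptive batching mentioned above is to group sentences of similar length so that the padded size of each batch stays under a fixed token budget. The sketch below assumes this length-bucketing approach purely for illustration and does not describe any specific production pipeline; the function name and the max_padded_tokens parameter are hypothetical.

```python
from typing import List, Sequence


def batch_by_token_budget(
    sentences: Sequence[List[str]], max_padded_tokens: int
) -> List[List[int]]:
    """Group sentence indices into batches whose padded size fits the budget.

    Padded size is batch_size * longest_sentence_in_batch, a rough proxy for
    the GPU memory a batch consumes. Sentences are visited shortest first so
    that each batch holds similar lengths and wastes little padding.
    """
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0
    for i in order:
        length = len(sentences[i])
        new_longest = max(longest, length)
        # Close the current batch if adding this sentence would exceed the budget.
        if current and new_longest * (len(current) + 1) > max_padded_tokens:
            batches.append(current)
            current, new_longest = [], length
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

A sentence longer than the budget still gets a batch of its own, so a real pipeline would need a separate policy (splitting or rejecting) for such outliers.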

Importantly, unlike SMT systems, current NMT architectures do not widely support adaptive model modulation strategies to cope with time or resource pressures. SMT systems can dynamically monitor latency, throughput, memory pressure, CPU utilization and hardware availability to trade off translation accuracy for speed at the resolution of a single clause – a strategy we rely on heavily to maintain fixed-time pipeline latency during unexpected volume surges. In contrast, the primary strategy for time/space tradeoff in NMT systems is to maintain multiple models, yet even here the tradeoff is not as linear as with SMT systems, which can be statistically pruned in-place based on real-world observed model utilization – another major strategy we use.
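
The general shape of such a speed/accuracy tradeoff can be illustrated with a simple selector that picks the most accurate configuration able to clear the current backlog within a latency budget. This is a hedged sketch only: the Tier class, the tier names and the per-word costs are hypothetical, and it assumes that slower tiers are more accurate, which would have to be measured for any real system. The same logic could select among search settings in an SMT decoder or among multiple NMT models.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    seconds_per_word: float  # measured average cost (hypothetical values here)


def choose_tier(tiers: Sequence[Tier], backlog_words: int, budget_seconds: float) -> Tier:
    """Return the slowest (assumed most accurate) tier that fits the budget.

    Falls back to the fastest tier when no tier can clear the backlog in time.
    """
    by_quality = sorted(tiers, key=lambda t: t.seconds_per_word, reverse=True)
    for tier in by_quality:
        if tier.seconds_per_word * backlog_words <= budget_seconds:
            return tier
    return by_quality[-1]


# Example: three hypothetical configurations of one system.
TIERS = [Tier("full", 0.01), Tier("pruned", 0.004), Tier("minimal", 0.001)]
```

Calling choose_tier(TIERS, backlog_words, budget_seconds) as each batch arrives lets the pipeline step down to a cheaper tier during a volume surge and step back up once the backlog clears.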


While NMT systems have emerged as the go-to solution for machine translation today, they present a number of challenges for the news domain where source text fidelity and entity recovery are of particular importance.