The GDELT Project

More Experiments Using LLMs For Machine Translation: Stability Of ChatGPT Translations Of Chinese-Language News & Social Content

We've previously demonstrated the ability of LLMs to provide highly fluent translations that exceed the fluency and "naturalness" of classical NMT systems. One question we did not evaluate, however, is the stability of these translations. When run multiple times over the same text, NMT systems are deterministic, producing exactly the same output each time. LLM systems, by contrast, are stochastic, typically changing their output with every run. The question is whether an LLM translator yields substantially different translations across runs (suggesting caution in translation use) or whether the translations are semantically consistent even if the wording changes. For example, if multiple LLM runs produced translations like "the show was great", "the show was wonderful", "the show was delightful" and "i enjoyed the show", the wording would differ but the overall meaning would be the same for most use cases.

To test this, we evaluate five Chinese-language examples: two Weibo social media posts, two full-length news articles and one short story identified as particularly difficult for machine translation. The first four texts were translated fairly well. The fifth, however, calls for grave caution.

For the first four examples, while the specific wording changes with each run, the overall meaning remains consistent. In one example, two of the six runs had a translation note appended to the output text, despite ChatGPT being specifically instructed not to add one. While inconsequential for human consumption, this could create problems for downstream machine processing. In a few cases text was dropped: specifically, a photo caption credit disappeared from some translations.
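For machine consumers, one possible guard is to detect and separate such notes before the text moves downstream. The following is a minimal sketch, assuming the note arrives as a final paragraph that begins with a marker such as "Note:" or "Translation note:". The marker list and the function name `split_trailing_note` are hypothetical and are not drawn from the actual ChatGPT outputs.

```python
import re
from typing import Optional, Tuple

# Hypothetical markers: the notes ChatGPT actually appended may be worded differently.
NOTE_PATTERN = re.compile(
    r"^\s*[\(\[]?\s*(translation note|translator's note|note)\b\s*:",
    re.IGNORECASE,
)


def split_trailing_note(text: str) -> Tuple[str, Optional[str]]:
    """Split off a final paragraph that looks like an appended note.

    Returns (translation, note); note is None when nothing matched.
    """
    paragraphs = [p.strip() for p in text.strip().split("\n") if p.strip()]
    if len(paragraphs) > 1 and NOTE_PATTERN.match(paragraphs[-1]):
        return "\n".join(paragraphs[:-1]), paragraphs[-1]
    return "\n".join(paragraphs), None
```

A note embedded in the middle of the translation, or one phrased without a recognizable marker, would pass through this filter untouched, so it is a first line of defense at best.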

One major challenge is that transliterations of a number of proper names were highly unstable, complicating downstream analysis. Classical NMT systems like Google Translate and Bing Translator have deterministic transliterations: a given name will always be transliterated the same way across all input texts containing it. This makes it trivial to ensure those names are properly handled by downstream analytic tasks. For example, a specific transliteration of a location name can be built into geocoding gazetteers, while classifiers and entity extractors can be similarly updated. LLM transliteration, on the other hand, can vary dramatically across runs, vastly complicating this task. Worse, transliterations of the most common names, which are typically the ones companies focus on most when evaluating a system, are relatively stable, while the effectively unbounded long tail of less common names is where transliterations vary the most. This makes it far more difficult for companies to determine how best to adapt downstream processing tasks.
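A rough way to surface unstable transliterations is to look for name-like words that appear in some runs but not in others. The sketch below assumes English output and treats any capitalized word that is not sentence-initial as a candidate name; this is a crude stand-in for a real entity extractor, and the function name `unstable_names` is illustrative only.

```python
import re
from collections import Counter
from typing import Dict, List


def unstable_names(translations: List[str]) -> Dict[str, int]:
    """Map each candidate name to the number of runs containing it,
    keeping only names that are missing from at least one run."""
    counts: Counter = Counter()
    for text in translations:
        names = set()
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
            # Skip the first word, which is capitalized regardless.
            names.update(w for w in words[1:] if w[0].isupper())
        counts.update(names)
    total_runs = len(translations)
    return {name: n for name, n in counts.items() if n < total_runs}
```

For instance, given three invented runs (not actual ChatGPT output) in which a town name is spelled "Diaowo" twice and "Diaowuo" once, the function returns both spellings with their run counts, while words common to all runs are dropped. Names that happen to begin a sentence are missed by this heuristic.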

The fifth example (presented third in the list below) is the most troubling. In this case, 9 of 15 translations (60%) were heavily truncated: ChatGPT output just a portion of the original text without reporting any error or warning that the translation was anything other than successful. This passage was specifically chosen as an example that is purportedly difficult for machine translation to properly understand. However, both of the classical NMT systems tested (Google Translate and Bing Translator) successfully output the entire text without any truncation. At worst, an MT system should either output the entire text, even if the translation is uncertain or poor, or return an error stating that the translation is not possible or is incomplete. It is unclear why ChatGPT truncated the translation, and further work will be required to understand the underlying triggers and the conditions under which LLMs may truncate translations.
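Because no error is reported, truncation has to be caught by the caller. The following is a minimal sketch, assuming that at least one of the repeated runs is complete and that a markedly shorter output signals truncation. The 0.7 threshold and the function names are illustrative choices, not values derived from these experiments.

```python
from typing import List


def flag_truncated(translations: List[str], min_ratio: float = 0.7) -> List[bool]:
    """Flag runs much shorter than the longest run of the same prompt.

    Assumes a non-empty list in which the longest run is a complete translation.
    """
    longest = max(len(t) for t in translations)
    return [len(t) < min_ratio * longest for t in translations]


def truncation_rate(flags: List[bool]) -> float:
    """Fraction of runs flagged as truncated."""
    return sum(flags) / len(flags)
```

With 9 of 15 runs flagged, `truncation_rate` returns 0.6, matching the 60% figure above. If every run is truncated, comparing runs against each other fails silently, so a check against the source length or the number of dialogue turns would be a sensible complement.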


Let's start with one of the Weibo posts we looked at last time:

Translate the following sentence into English: "【#他撑伞蹲了很久让女孩知道世界很美好#】近日,湖南长沙,一女子因感情问题一时想不开,独自坐在楼顶哭泣。消防员赶赴,温柔蹲在女孩身旁安抚情绪,夜色沉沉,天空下起小雨,救援人员为其撑伞暖心守护,经过近一小时的劝导,将女孩平安带下15楼天台。ps:答应我们,好好爱自己,世间还有很多美好~#女孩感情受挫消防员撑伞守护# 中国消防的微博视频" and output only the final English.

Recall that this was the translation we got this past May:

We'll run that prompt through ChatGPT multiple times in a row (each time in a brand-new session) to test how it changes (note that the bracketed hashtag that appeared in the May 30 translation does not appear at the start in any of these translations run today):

What about the second Weibo post we tested last time?

Translate the following sentence into English: "一堂童画课,用色彩传递温暖!送困境儿童“美育盒子”,送自闭症儿童艺术疗愈课程。让我们与@延参法师 一堂童画课公益发起人,在这个充满爱与温暖的日子里,一起感受艺术的魅力,用童心绘制美好,为公益事业贡献一份力量!@一堂童画课 北京市行远公益基金会的微博视频​​​" and output only the final English.

On May 30 this produced:

This time it produced:

What about a more complex example from this Quora post about Chinese text that is difficult to translate?

Translate the following sentence into English: "阿呆给领导送红包时,两人的对话颇有意思。 领导:你这是什么意思? 阿呆:没什么意思,意思意思。 领导:你这就不够意思了。 阿呆:小意思,小意思。 领导:你这人真有意思。 阿呆:其实也没有别的意思。 领导:那我就不好意思了。 阿呆:是我不好意思。 领导:你肯定有什么意思。 阿呆:真的没有什么意思。 领导:既然没有什么意思,那你是什么意思? 阿呆:其实,我的意思就是想意思意思。 领导:你既然是想意思意思,那就是有什么意思。 阿呆哭了:我就是想意思意思。但是,真的没有什么别的意思。这么个小红包能有什么意思?也就是意思意思而已。 领导笑了:呵呵。我对你有点意思了。 阿呆心想:嘻嘻。我就是这个意思。" and output only the final English.

The original author translated it as:

One day, Adele wanted to bribed her boss with money. Their conversation was pretty interesting. Boss: What do you mean? Adele: No bad purposes. It's just my tiny present to you. Boss: Then you are so mean. It could do harm to my career. Adele: Just a tiny present, just a tiny present. Boss: Oh, you are very funny. Adele: Actually, I have no bad purposes. Boss: Well, I am a little embarrassed….I shouldn’t get you wrong… Adele: Oh,no, I am the one who should feel shamed. Boss: But to be honest,what do you want from me? Adele: I want nothing from you, my boss. Boss: Since you want nothing, then why do you bribe me? Adele: Actually I just want to show you my heart. Boss: Since you just want to show me your heart, then you must want something! Adele: Please don't get me wrong! I just want to show my heart, no other bad purposes! Money can represent nothing at all! Boss( smiled): Then I have feelings for you. Adele(thought): Haha, that’s what I want from you.

For comparison, Google Translate translates this as:

When Dui gave the leader a red envelope, the conversation between the two was quite interesting. Leader: What do you mean? Dumb: It’s not interesting, it’s meaning. Leader: You are not interesting enough. Dumb: Trifle, trifle. Leader: You are really interesting. Dumb: Actually, there is no other meaning. Leader: Then I am embarrassed. Dumb: I'm sorry. Leader: You must have something to say. Dumb: It's really meaningless. Leader: Since there is no meaning, what do you mean? Dumb: Actually, what I mean is to think about it. Leader: Since you want to make sense, then what is the meaning. Dumb cried: I just wanted to make fun of it. But, there really isn't much else to it. What is the meaning of such a small red envelope? That is to say. The leader smiled: Ha ha. I'm a little interested in you. Dumb thought to himself: Hee hee. I mean it.

And Bing Translator:

When Dun gave red envelopes to the leader, the conversation between the two was quite interesting. Leader: What do you mean by that? Dumb: It's not interesting, it means. Leader: You're not interesting enough. Dumb: Small meaning, small meaning. Leader: You're an interesting person. Dumb: Actually, it doesn't mean anything else. Leader: Then I'm embarrassed. Dumb: I'm sorry. Leader: You must have something to say. Dumb: It's really not interesting. Leader: Since it doesn't mean anything, what do you mean? Dumb: Actually, I mean what I mean. Leader: Since you mean something, what does it mean. Dumb cried: I just meant it. But, really, there is nothing else. What can such a small red envelope mean? That's what it means. The leader laughed: hehe. I'm kind of interesting to you. Dumb thought to himself: hee-hee. That's what I meant.

The ChatGPT results are presented below. The ChatGPT prompt was run 15 times. You'll notice that 9 of the 15 runs (60%) resulted in highly truncated outputs that exclude some or most of the original text. This is a critical limitation, as the service provides no error or warning that its translation is incomplete:

 

What about a full-length news article?

The following prompt was used:

Translate the following sentence into English: "在跟母亲断联12个小时后,8月2日中午,身在北京的张萌终于收到了位于涿州的母亲的“报平安”短信。母亲住在河北保定涿州市三步桥村附近,是借用别人手机发来的消息,她长舒了一口气,“悬着的心终于放下了”。 张萌母亲所在地附近的救援冲锋舟。受访者供图 7月29日以来,受台风“杜苏芮”影响,京津冀地区持续强降雨。据公开信息,7月29日8时至8月1日11时,涿州市出现明显降水天气过程。全市平均降水量355.1毫米,最大降水量为两河村435.7毫米,多个乡镇、街道降水量均超300毫米。截至8月1日上午10时,涿州市受灾人数133913人。 24岁的李志辉已经多次安慰女朋友,但女朋友的一句“家没了,那个房子是我爸妈一辈子的心血”,也瞬间让他陷入无助。目前,涿州境内北拒马河、小清河、白沟河等多条河流流量较大,小清河分洪区、兰沟洼蓄滞洪区已相继启动。涿州境内防汛形势严峻,多地遭受洪水灾害,多个村庄被洪水围困。 目前,李志辉女朋友的家人所在的涿州市刁窝镇东辛庄村仍在等待救援,包括附近的白塔村、小营村等同样或在等待救援或正在被救援。在社交平台上,也依然能看到大量求助救援的信息正在发出。告急的村庄。 “一共五位家人,其中一位老人已经80多岁,一个孩子只有五六岁,目前都在白塔村一处自建二层小楼上,水已经漫到腰部,有一米多深了,家人都在二楼等候救援。”肖俊介绍,因为到处被淹,也为了陪伴老人,他的家人们8月1日就搬去了刁窝镇白塔村,“结果今天水就涨起来了。”36岁的肖俊平时在北京工作,赶不回去的他觉得现在非常揪心,断断续续的信号也无法获知现场的情况,同时他也认为村子救援难度很大,人员分散也比较难找。“刁窝镇东辛庄村的水已经漫过了一楼。”李志辉告诉新黄河记者,他目前在沧州,女朋友是刁窝镇东辛庄村人,8月1日几位家人有的已经搬到附近的白塔村住,有的还在东辛庄村,跟张俊一样,李志辉女朋友一家也没想到,当地的水涨得如此快。" and output only the final English.

The prompt was run six times and the results are presented below. We also translated through Google Translate (the "GTRANS" version). Note that in two of the six cases, ChatGPT appended a translation note, despite being instructed not to include any text beyond the translation itself.
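The repeated-run procedure used throughout these tests could be scripted along the following lines. This is a sketch only: the prompt wording is taken from the prompts shown here, but `send_prompt` is a hypothetical callable standing in for whatever interface reaches the model, and each call is assumed to start a brand-new session, as in these experiments.

```python
from typing import Callable, List

# Prompt wording as used in the experiments in this post.
PROMPT_TEMPLATE = (
    'Translate the following sentence into English: "{text}" '
    "and output only the final English."
)


def build_prompt(text: str) -> str:
    """Wrap the source text in the translation prompt."""
    return PROMPT_TEMPLATE.format(text=text)


def run_repeatedly(text: str, send_prompt: Callable[[str], str], runs: int) -> List[str]:
    """Send the same prompt `runs` times and collect every output.

    `send_prompt` is a placeholder supplied by the caller; it is assumed
    to open a fresh session on each call.
    """
    prompt = build_prompt(text)
    return [send_prompt(prompt) for _ in range(runs)]
```

Collecting every output in a list, rather than keeping only the last, is what makes run-to-run comparison possible afterwards.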

 

Given the length of the text, we've taken the translations above and grouped them into chunks of sentences to make it easier to compare (this is important because in some runs ChatGPT grouped multiple sentences together). We've also included Google Translate's translation for each chunk as "GTRANS".
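Once chunks are aligned, the wording differences between two runs can be listed mechanically. The sketch below uses Python's standard `difflib` to compare two chunks word by word. It measures surface overlap only, not meaning, and the function names are illustrative.

```python
import difflib
from typing import List, Tuple


def word_diff(a: str, b: str) -> List[Tuple[str, str, str]]:
    """List the spans where two translations differ, as (operation, words_in_a, words_in_b)."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [
        (tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def overlap_ratio(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical wording."""
    return difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()
```

Applied to "the show was great" and "the show was wonderful", `word_diff` reports a single replacement of "great" with "wonderful". A low overlap ratio does not by itself indicate a change in meaning, since paraphrases such as "i enjoyed the show" share few words with the others, so a human or semantic check is still needed.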


What about a second news article, this one about a summer camp? The following prompt was used:

Translate the following sentence into English: "昆明一夏令营被举报:160多名孩子住工棚!有人10多天没洗澡。春城晚报-开屏新闻消息,正值暑假,不少家长都选择将孩子送到各类夏令营体验生活、学习技能。连日来,不断有读者向春城晚报-开屏新闻反映:在昆明万科城市运动公园里,有一家夏令营机构,从云南各地组织了160多名孩子来昆明参加夏令营。可孩子们来了后,所谓的“营地”在一块荒地里,孩子们住的是临时搭建的工棚,每晚被蚊虫叮咬。部分孩子参加夏令营10多天,还没洗过一次澡。而营地里所谓的“教官”,有的是暑假期间做兼职的大学生。 集装箱搭建成夏令营宿舍。 7月31日下午,记者来到昆明市万科城市运动公园,周围都是小区楼房,一点都看不出像夏令营集中训练的“营地”,通往“营地”的大门口上挂着一个广告牌写着:“2023城市猎人夏令营”,另外还挂着两块红字黄底的招牌,分别是怒江滇西明珠旅游产业投资开发有限公司、云南一四九校企教育服务有限公司。看来,这里举办的夏令营是这两家公司联合举办的。 按照爆料人指引,记者通过二层楼的过道时看到,工人正在施工搭建楼梯,再往里走,一股从厕所飘出来的臭味扑鼻而来。而这里,就是参加夏令营的孩子们就餐的地方。 据一名知情人士介绍:孩子们的“营地”就在对面的荒地上,平常训练、睡觉都在那里。 记者来到所谓“营地”上看到,主办方用集装箱搭建了多个工棚,四周用军绿色彩布蒙着,两排工棚中间的一块空地上铺着人工草皮。" and output only the final English.

Here are the complete translations:

And the passage-level comparisons: