More Experiments With LLM Translation: Translation Stability In GCP's PaLM Vs ChatGPT

Continuing our series evaluating LLM translation capabilities, we'll have PaLM and ChatGPT translate two paragraphs from a Chinese-language news article and test how stable the translations are across runs. Overall, ChatGPT's truncation problems have improved in the month since our last tests. Unfortunately, while PaLM yields strong results comparable to GPT-3.5 when it does respond, its guardrail false positives continue to render it less suitable for real-world applications: for the sample mainstream news passage below, PaLM refused to provide output for 6 of 11 attempts (55%).
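Reproducing this kind of stability test amounts to issuing the same translation request repeatedly and tallying the refusals. Below is a minimal sketch in Python, assuming the Vertex AI SDK and the text-bison@001 model; the post does not specify how its requests were issued, and the project ID, prompt wording, and generation parameters here are illustrative only:

import vertexai
from vertexai.language_models import TextGenerationModel

PROJECT_ID = "your-project-id"   # hypothetical placeholder
ATTEMPTS = 11
PASSAGE = "..."                  # the two Chinese paragraphs quoted below

vertexai.init(project=PROJECT_ID, location="us-central1")
model = TextGenerationModel.from_pretrained("text-bison@001")

translations = []
blocked = 0
for _ in range(ATTEMPTS):
    response = model.predict(
        "Translate the following news passage into English:\n\n" + PASSAGE,
        temperature=0.0,
        max_output_tokens=1024,
    )
    if response.is_blocked:      # guardrail fired: content comes back empty
        blocked += 1
    else:
        translations.append(response.text.strip())

print(f"Blocked {blocked} of {ATTEMPTS} attempts ({100 * blocked / ATTEMPTS:.0f}%).")
print(f"{len(set(translations))} distinct translations across {len(translations)} successful runs.")

Counting the distinct strings among the successful runs gives a crude but serviceable measure of how stable the translations are from run to run.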

Let's take two paragraphs from this news article, published this past May:

工作人員當時正在調配染髮劑,發現後上前阻止已經來不及,女客則是覺得有異樣,轉頭一看發現自己被剪掉一大磋頭髮,氣得當場報警處理,最終三方達成和解,母親賠償女客人11500人民幣,美容院則是未盡管理責任,必須免費提供8次共12000人民幣的美髮服務。 不少大陸網友也搖頭:「這種熊孩子不管以後還得了」、「熊孩子的罪,家長必須承擔」、「媽媽顧著滑手機完全不理女兒也是奇葩」、「一刀1萬,好貴的學費」、「其實美容院應該才是主責方」、「是我不會要錢,我會直接把熊孩子剃光頭」。

PaLM:

For the 6 of 11 cases where PaLM refused to provide output, which guardrail did we trigger and which content category did we violate? Unfortunately, as the output below shows, PaLM does not report categorical safety scores when it suffers a false positive guardrail error: it returns nothing more than "blocked": true, making it impossible to perform any kind of diagnostic:

{
  "predictions": [
    {
      "safetyAttributes": {
        "blocked": true
      },
      "content": ""
    }
  ],
  "metadata": {
    "tokenMetadata": {
      "inputTokenCount": {
        "totalBillableCharacters": 277,
        "totalTokens": 175
      },
      "outputTokenCount": {
        "totalBillableCharacters": 0,
        "totalTokens": 241
      }
    }
  }
}
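While the block itself cannot be diagnosed, scripts consuming the raw REST response can at least detect this failure mode mechanically. A minimal sketch, assuming the JSON above is what the :predict endpoint returned as a string (the function name is our own):

import json

def is_guardrail_block(raw_response: str) -> bool:
    """Return True when PaLM blocked the request and returned no content.

    As the example above shows, a blocked prediction carries only
    {"safetyAttributes": {"blocked": true}} and an empty "content"
    field, with no category scores to diagnose.
    """
    parsed = json.loads(raw_response)
    for prediction in parsed.get("predictions", []):
        attributes = prediction.get("safetyAttributes", {})
        if attributes.get("blocked") and not prediction.get("content"):
            return True
    return False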

ChatGPT: