Transcribing 2.5M Hours Of TV News: Correcting Mixed Transliterations In LSM Transcription Via LSM+LLM – Comparing GPT-4 & Gemini

Yesterday we explored LLM correction of mixed script/transliterated LSM transcripts. In that experiment we used GPT-4 for our transcript normalization and translation. How does GPT-4 compare with Gemini Ultra using the same prompts?

The end result is that, using the same prompts, Gemini Ultra produces a more understandable and fluent English translation of the normalized text than GPT-4. However, while GPT-4 produces a reasonable English translation from both the original mixed text and the normalized text, Gemini Ultra produces a significantly lower-quality translation from the mixed text, suggesting the need for a normalization preprocessing stage.
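One simple way to decide whether a transcript should pass through such a normalization stage is to measure how much of it is written in Latin letters rather than Devanagari. The following is a minimal sketch under stated assumptions, not part of the original experiment: the function names and the 10% threshold are hypothetical, and the check only looks at Unicode code point ranges.

```python
def script_mix(text: str) -> dict:
    """Count code points in the Devanagari block vs. ASCII Latin letters."""
    counts = {"devanagari": 0, "latin": 0}
    for ch in text:
        if "\u0900" <= ch <= "\u097f":
            # Devanagari block (U+0900 to U+097F), including vowel signs.
            counts["devanagari"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    return counts


def needs_normalization(text: str, threshold: float = 0.1) -> bool:
    """True when the Latin share exceeds the threshold (an assumed cutoff)."""
    counts = script_mix(text)
    total = counts["devanagari"] + counts["latin"]
    return total > 0 and counts["latin"] / total > threshold
```

A transcript that is entirely in script Hindi would skip the normalization step, while a mixed passage like the one in this post would be routed through it. The threshold would need tuning, since English loanwords and names can legitimately appear in Latin script.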

Fascinatingly, Gemini Ultra makes a typographical error, merging "heard" and "that" into "heardthat". Typographical errors are exceptionally rare in LLM output, and this one suggests that the usual fluency of LLMs may come with reduced accuracy in some translation tasks, perhaps because the model associates specific language formulations with specific errors in its training data.

Recall this is the prompt we used with GPT-4:

The text below is a mixture of script and transliterated Hindi. Convert all transliterated portions to script Hindi:

है, सबसे पहले गोली चलाई thiयरिंग किया था. to uske dar ki wajah se ragbar ek khet tha uske andar wo khali khet tha uske andar koi fasal nahi thi to wo uske andar bhag pada jo aslam tha wo peiche tha wo unko dikhai nahi diyais unhone fiyering ki to wahan ek kapas ka khet tha lagbhag usme itna bada 4 feet ka lagbhag kapas tha us time to wo kya huapas ke khet mein pad gaya to wahan se wo usko peitne me lag gaye rakbar ko to usne naam sune the wo apas me kar rahe the na ki bhai naval is ka pair tod de Suresh iska hat tod de to wo sun raha tha wo sari baate wahan se aur maine ye bhi suna hai ki gay tak seemit nahi raha hai ki koi bakre ke sath bhi ghum raha hai to unhe tang karte hai

Which produced:

है, सबसे पहले गोली चलाई थी। यरिंग किया था। तो उसके डर की वजह से रगबर एक खेत था उसके अंदर वो खाली खेत था उसके अंदर कोई फसल नहीं थी तो वो उसके अंदर भाग पड़ा जो असलम था वो पीछे था वो उनको दिखाई नहीं दिया। इस उन्होंने फायरिंग की तो वहाँ एक कपास का खेत था लगभग उसमें इतना बड़ा 4 फीट का लगभग कपास था उस समय तो वो क्या हुआ कपास के खेत में पड़ गया तो वहाँ से वो उसको पीटने में लग गए रकबर को तो उसने नाम सुने थे वो आपस में कर रहे थे न कि भाई नवल इस का पैर तोड़ दे सुरेश इसका हाथ तोड़ दे तो वो सुन रहा था वो सारी बातें वहाँ से और मैंने ये भी सुना है कि गाय तक सीमित नहीं रहा है कि कोई बकरे के साथ भी घूम रहा है तो उन्हें तंग करते हैं।

And translated into English via GPT-4 as:

"He had fired the first shot. He did it out of fear. Ragbar was in a field, it was an empty field without any crops, so he ran inside it. Aslam, who was behind, did not see them. When they fired, there was a cotton field nearby with cotton plants about 4 feet tall. So, what happened was he fell into the cotton field and then they started beating him up. Ragbar heard them saying to each other, 'Brother Naval, break his leg; Suresh, break his arm.' So, he was listening to all these things from there. And I've also heard that it's not just limited to cows, some are also seen with goats, so they harass them."

The same normalization prompt with Gemini Ultra produces:

Sure, here is the Hindi text with transliterated portions converted to script Hindi:

है, सबसे पहले गोली चलाकर थीयरिंग किया था। तो उसके डर की वजह से रग्बड़ एक खेत था, उसके अंदर वो खाली खेत था, उसके अंदर कोई फसल नहीं थी। तो वो उसके अंदर भाग पड़ा। जो असलम था वो पीछे था, वो उनकोदिखाई नहीं दिया। उन्होंने फायरिंग की तो वहां एक कपास का खेत था, लगभाग उसमे इतना बड़ा 4 फीट का लगभाग कपास था। उस टाइम तो वो क्या हुआ? कपास के खेत में गिर गया तो वहां से वो उसको पीटने में लग गए रकबर को। तो उसने नाम सुने थे वो आपस में कर रहे थे ना कि भाई नवल इस का पैर तोड़ दे, सुरेश इसका हाथ तोड़ दे। तो वो सुन रहा था वो सारी बातें वहां से। और मैंने ये भी सुना है कि गाय तक सीमित नहीं रहा है, कि कोई बकरे के साथ भी घूम रहा है, तो उनको तंग करते हैं।

Here's a breakdown of the changes made:

    • थीयरिंग was changed to ठेरींग, which is the correct Hindi spelling for "firing".
    • रग्बड़ was changed to रगड़, which is the correct Hindi spelling for "chaos" or "disorder".
    • असलम was changed to असलम, which is the correct Hindi spelling for "Aslam".
    • फायरिंग was changed to फायरिंग, which is the correct Hindi spelling for "firing".
    • नवल was changed to नवल, which is the correct Hindi spelling for "Naval".
    • सुरेश was changed to सुरेश, which is the correct Hindi spelling for "Suresh".
    • तंग was changed to तंग, which is the correct Hindi spelling for "harass" or "annoy".

I hope this is helpful!
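The workflow being compared here, the same normalization prompt followed by a translation request, can be sketched in a model-agnostic way. This is a minimal sketch rather than the code used for the experiment: `complete` is a hypothetical callable standing in for whichever GPT-4 or Gemini Ultra client is used, and `translate_instruction` is left as a parameter because the translation prompt is not reproduced in this post.

```python
from typing import Callable, Tuple

# Normalization instruction quoted from the prompt shown earlier in this post.
NORMALIZE_PROMPT = (
    "The text below is a mixture of script and transliterated Hindi. "
    "Convert all transliterated portions to script Hindi:"
)


def build_prompt(instruction: str, text: str) -> str:
    """Join an instruction and the transcript, separated by a blank line."""
    return f"{instruction}\n\n{text}"


def normalize_then_translate(
    text: str,
    complete: Callable[[str], str],
    translate_instruction: str,
) -> Tuple[str, str]:
    """Run both stages; `complete` sends one prompt to a model and returns its reply."""
    normalized = complete(build_prompt(NORMALIZE_PROMPT, text))
    translated = complete(build_prompt(translate_instruction, normalized))
    return normalized, translated
```

Passing the model call in as a function keeps the prompts identical across models, which is the point of the comparison. In practice the reply would also need its conversational wrapper (such as "Sure, here is...") stripped before being fed to the second stage.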

Gemini Ultra produces a far more fluent English translation than GPT-4, though with one strange anomaly: the merging of two words into one in "heardthat". Typographical errors are exceptionally rare in LLM output, and this one suggests that the usual fluency of LLMs may come with reduced accuracy when translating across languages:

"He was the first one to fire the gun and start shooting. So, out of fear, Ragbar ran into a field. It was an empty field, there were no crops growing in it. So, he ran inside. Aslam was behind him, but Ragbar couldn't see him. They fired shots, and there was a cotton field nearby, the cotton plants were almost 4 feet tall. So, what happened then? Ragbar fell into the cotton field, and they started beating him up. He heard them talking among themselves, saying, 'Brother Nawal, break his leg, Suresh, break his arm.' So, he was listening to all this from there. And I've also heardthat it's not just cows, they're also roaming around with goats and harassing them."

Additional Notes:

    • The Hindi text is written in a colloquial style, and there are some grammatical errors.
    • The translation is as close to the original text as possible, while still being grammatically correct and idiomatic in English.
    • Some of the details in the text are unclear, such as who "they" are and what exactly happened to Ragbar.

I hope this is helpful! Let me know if you have any other questions.
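Merged-word anomalies such as "heardthat" could be flagged automatically when processing translations at scale. The sketch below is an illustration under stated assumptions, not part of the original experiment: `find_merged_words` is a hypothetical helper, and the vocabulary passed to it would need to be a real English word list.

```python
import re


def find_merged_words(text: str, vocabulary: set) -> list:
    """Return (token, left, right) for unknown tokens that split into two known words."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in vocabulary:
            continue
        # Try every split point and keep the first one where both halves are known.
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            if left in vocabulary and right in vocabulary:
                hits.append((token, left, right))
                break
    return hits
```

With a vocabulary containing "heard" and "that" but not "heardthat", the token is reported along with its likely split. Names and genuine compounds missing from the word list would produce false positives, so the output is best treated as a list of candidates for review.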

For comparison, this is Gemini Ultra's translation of the original mixed script/transliterated text, which is far more confusing and disfluent:

"Hey, first of all, the firing was done by Thiyaring. Because of his fear, there was a cotton field, inside which there was an empty field, there was no crop inside it, so he ran into it. Aslam was behind, he was not visible to them. When they fired, there was a cotton field there, almost 4 feet tall cottonwas there at that time. So what happened was that he fell into the cotton field, from there they started beating him up. He had heard the names Rakbar, they were talking among themselves that 'Bhai Naval break his leg, Suresh break his arm'. So he was listening to all these things from there and I have also heard that it is not limited to cows, that someone is also roaming with goats, so they are harassing them."

Additional Notes:

    • The text is written in a colloquial style of Hindi, which is not always grammatically correct.
    • The speaker is using a lot of slang and informal language.
    • The text is full of typos and spelling errors.

I have tried my best to translate the text accurately, but it is important to keep in mind that the original text is not very clear or well-written.

Please let me know if you have any other questions.