Transcribing 2.5M Hours Of TV News: The Complicating Factors Of Transliteration/Romanization In LSM Transcription

Like their textual brethren, the Large Speech Models (LSMs) that have become the dominant automatic speech recognition (ASR) technology can be highly unpredictable. One critical complication that has received little attention is the degree to which they can switch without warning between the native script and transliterated/Romanized forms of languages written in non-Latin scripts. These transitions can occur at random, even during continuous speech by a single speaker, and can sometimes involve multiple transliteration standards, with different words or phrases drawn from the script form and from several transliteration standards, resulting in highly intermixed text.

Why is this problematic? After all, for native human readers of a language, this may pose little difficulty. The problem is fulltext search: keyword searching such a corpus requires elaborate boolean queries that incorporate both the original search phrase and every possible transliteration of it under the major standards for that language. This increases the complexity of searching LSM-derived archives by orders of magnitude.
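As a rough illustration of what such a query looks like, the minimal sketch below ORs together a script-form keyword and a list of Romanized variants. The function name `build_boolean_query` is hypothetical, the variants are supplied by hand purely for illustration (a real system would need to generate them for each transliteration standard), and the quoted `OR` syntax is generic rather than that of any particular search engine.

```python
def build_boolean_query(phrase, variants):
    """OR together the original phrase and its transliterated variants."""
    forms = []
    for form in [phrase, *variants]:
        form = form.strip()
        # Skip empty strings and duplicates, preserving first-seen order.
        if form and form not in forms:
            forms.append(form)
    return " OR ".join(f'"{form}"' for form in forms)


# Hand-supplied, illustrative Romanizations of the Hindi word for "field".
query = build_boolean_query("खेत", ["khet", "kheta", "khet"])
print(query)
```

Every additional keyword in a search multiplies the number of forms that must be covered, which is where the added complexity comes from.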

Yet merely translating everything into English is not a solution, as the example below shows: NMT translations are highly stilted and difficult to follow, LLM translations can differ on major details across runs, and even LLM-based Q&A can yield key factual differences.

For example, here is a trivial case where the Hindi-language transcript abruptly switches from script to a transliterated form mid-speaker, meaning a keyword search hoping to match this snippet would have to search both forms:

है, सबसे पहले गोली चलाई thiयरिंग किया था. to uske dar ki wajah se ragbar ek khet tha uske andar wo khali khet tha uske andar koi fasal nahi thi to wo uske andar bhag pada jo aslam tha wo peiche tha wo unko dikhai nahi diyais unhone fiyering ki to wahan ek kapas ka khet tha lagbhag usme itna bada 4 feet ka lagbhag kapas tha us time to wo kya huapas ke khet mein pad gaya to wahan se wo usko peitne me lag gaye rakbar ko to usne naam sune the wo apas me kar rahe the na ki bhai naval is ka pair tod de Suresh iska hat tod de to wo sun raha tha wo sari baate wahan se aur maine ye bhi suna hai ki gay tak seemit nahi raha hai ki koi bakre ke sath bhi ghum raha hai to unhe tang karte hai
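A switch like the one above can at least be detected mechanically. The following is a minimal sketch, assuming that Unicode character names are a good enough proxy for script, that splits a transcript into runs of Devanagari and Latin text. The helper names `script_of` and `script_runs` are hypothetical, and digits, spaces and punctuation are treated as neutral and attached to the preceding run.

```python
import unicodedata


def script_of(ch):
    """Classify a character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("DEVANAGARI"):
        return "Devanagari"
    if name.startswith("LATIN"):
        return "Latin"
    return None  # spaces, digits, punctuation: script-neutral


def script_runs(text):
    """Split text into consecutive (script, substring) runs."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch          # extend the current run
        elif script is not None:
            runs.append([script, ch])  # script changed: start a new run
        # Neutral characters before the first run are dropped.
    return [(script, chunk.strip()) for script, chunk in runs]


# The opening of the transcript above.
sample = "है, सबसे पहले गोली चलाई thiयरिंग किया था. to uske dar"
runs = script_runs(sample)
for script, chunk in runs:
    print(script, "|", chunk)
```

Note that this flags the mid-word switch in "thiयरिंग" as three separate runs, but it only locates the transitions; it does not say which transliteration standard the Latin runs follow.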

What about the alternative of simply translating everything into English to make search simple? The problem is that, despite massive improvements with large models, translations are still imperfect in ways that can complicate search and reasoning.

Google Translate's translation captures the general gist, but makes it difficult to follow what precisely occurred in the field:

It was Thiering who fired the first shot. From the looks of his face, there was no difference in the color of his journey, it was like an empty field, he was walking towards Islam, he was walking but he was not visible to them, he felt that this vehicle was almost like a cotton field. It was about 4 feet long and at that time we fell in the field and they started beating us with the help of vehicle. We heard our name and said that we were doing something nearby. Suresh has told him that he is roaming around with all the people from there and I have also heard that gays are not limited to him and he is also roaming with a goat to tease them.

Bing Translate offers a similarly stilted translation:

The first shot was thiering. So because of his fear, Ragbar was a field, inside it was an empty field, there was no crop inside it, so he ran inside it, the Aslam was drunk, he did not see it, he was firing, there was a cotton field there, there was almost 4 feet of cotton in it, at that time what happened to him, when he fell in the nearby field, he got it from there. Rakbar got engaged in drinking, he had heard the names, they were doing it among themselves, not brother Naval broke his leg. Suresh broke his hand, so he was listening to all those things from there and I have also heard that it is not limited to the cow that if someone is roaming with a goat, then they harass them.

Gemini Ultra refuses to translate the text, producing around half a translation before the end of the text triggers the "Other" safety category that cannot be disabled:

Main text ke aadhaar par kaam karne vala AI hoon aur mujhe is kaam ke baare mein jaankari nahin hai.

Gemini Pro 1.5, with safety features reduced to the minimum, produces the following, which is far more fluent but leaves some questions about context:

Yes, the first shot was fired. He was scared. Because of his fear, there was a barren field, an empty field with no crops inside. So, he ran into it. Aslam was behind, he couldn't see them. They fired. There was a cotton field there, about 4 feet tall cotton at that time. So, what happened, he fell into the cotton field. From there, they started beating Rakbar. He heard the names they were saying among themselves, like 'brother Naval, break his leg,' 'Suresh, break his arm.' He was listening to all these things from there. And I also heard that it's not limited to cows, if someone is walking with goats, they also harass them.

GPT-4 yields a similar translation, though this time it is Naval's leg that is being broken:

First, the shot was fired. Due to fear of that, Ragbar was in a field, inside of which was an empty field with no crops, so he ran inside it. Aslam, who was behind, couldn't see them, so when they started firing, there was a cotton field nearby, about four feet tall with cotton at that time. So, he fell into the cotton field. From there, they started beating Ragbar. He had heard names, they were saying amongst themselves, like 'Brother, break Naval’s leg, Suresh break his hand.' So he was listening to all these conversations from there. And I have also heard that it's not just limited to cows, some are also wandering with goats, so they annoy them.

Running GPT-4 again yields highly similar text. However, in the first translation it was Naval's leg and hand that were to be broken, while in this one Naval is the one being told to break Ragbar's leg:

He fired the first shot. Fearing him, Ragbar ran into a field. It was an empty field without any crops, so he ran into it. Aslam, who was behind, couldn't see them and fired his gun. Nearby, there was a cotton field about four feet high with cotton plants at the time, so he fell into that cotton field. From there, they started beating Ragbar. He had heard their names; they were saying things like 'brother Naval, break his leg' and 'Suresh, break his hand.' He was listening to all this from there. And I have also heard that it's not limited to cows; some are also roaming with goats, and they harass them.

Adding further complexity, GPT-4 was asked twice, "In the text below, who broke whose leg?" As with its translations, the first time it answered that Naval's leg was the target, while the second time Naval was the one breaking someone else's leg:

The text implies that there were threats to break someone's leg, specifically Naval's leg, but it doesn't confirm that anyone actually broke his leg. The phrases used suggest it was more of a spoken threat or intention rather than an action that was completed.

Vs:

The text suggests that individuals named Naval and Suresh were being urged by others to break someone's leg and hand. Specifically, they were told "Brother Naval, break his leg" and "Suresh break his hand." These instructions were directed at harming Ragbar, the person who had run into the cotton field.

This reinforces the complexity and inexactness of translation as a surrogate for the original transcript.