The Challenges Of Multilingualism In The Large Model Era: Using LSMs & LMMs To Transcribe An Amharic Broadcast

The Internet Archive's TV News Archive spans more than 2.5 million hours of global television news in 150 languages across 100 channels from 50 countries over the past quarter-century. Earlier this year we completed the machine transcription of this massive archive in what we believe to be one of the largest applications ever of large speech models to global television news coverage. Using this vast transcription archive coupled with the Visual Explorer, we have for the first time been able to peer across the world's rich diversity of languages and their use in television news reporting to understand multilingualism, code switching and the evolution of language over the past quarter-century. The large model era, with its Large Speech Models (LSMs) and Large Multimodal Models (LMMs), was supposed to usher in an unprecedented era of truly multilingual capabilities. Yet one of the major findings of our work to date is that, despite their vast internet-scale training datasets and powerful underlying algorithms, even the best of today's models still struggle with much of the world's linguistic landscape, meaning that as LLMs, LSMs and LMMs advance, much of the world is being increasingly left behind. To illustrate these challenges, we'll perform automatic speech recognition (ASR) on an example Amharic-language broadcast from Ethiopia through the eyes of GCP's Chirp and Chirp 2 models and OpenAI's Whisper Large-V1, Large-V2 and Large-V3 models. The findings offer a stark reminder of the need for the AI community to balance its relentless focus on improving reasoning capabilities and building ever-larger and faster models with the need to improve their representation of the full diversity of the world's languages.

Let's take this Amharic-language Ethiopian broadcast from 2011, which is representative of a wide swath of the Archive's older international coverage. The audio quality is characteristic of many international television news broadcasts of the era, with artifacts like clipping, outdoor capture with background noise such as chirping birds, and surface reflection in hard-walled untreated rooms, but it is fully intelligible and trivially understood by a human listener. LSM and LMM systems are typically able to hear through considerable background noise of this kind when transcribing languages heavily represented in their training datasets, like English and other European languages, but for many of the world's languages and dialects, ASR systems built on LSMs and LMMs tend to fail in highly unpredictable ways when they encounter artifacts in non-studio audio: LSMs simply fail to produce output for much of the video, while LMMs are more apt to hallucinate content to fill in the areas where they are unable to perceive recognizable speech.

The end result is that all five models (Chirp 1, Chirp 2, Whisper Large-V1, Whisper Large-V2, Whisper Large-V3) either leave the majority of the transcript empty or fill it with hallucinations and repeated text, yielding gibberish. Whisper's English translations suffer from the same issues.

Let's first look at GCP's Chirp models. As an LSM, Chirp 1 is designed to simply produce no output when it is unable to confidently transcribe the speech. In contrast, Chirp 2 is closer to a base LMM, which means it is prone to hallucinating text to fill in gaps in its recognized speech.
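For those who want to experiment, here is a minimal sketch of how a short clip can be submitted to Chirp through the Speech-to-Text V2 API's Python client. The project ID, region and file path are placeholders, language and model availability vary by region, and full-length broadcasts would go through the batch recognition endpoint rather than the synchronous call shown here.

```python
# Minimal sketch: transcribe a short clip with GCP's Chirp model via the
# Speech-to-Text V2 API. Project ID, region and file path are placeholders;
# full broadcasts would use the batch recognition endpoint instead.
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-project"   # placeholder GCP project
REGION = "us-central1"      # Chirp is served from regional endpoints

client = SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["am-ET"],   # request Amharic explicitly
    model="chirp",              # or "chirp_2" for the newer model
)

with open("broadcast_clip.wav", "rb") as f:
    audio_bytes = f.read()

response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    if result.alternatives:
        print(result.alternatives[0].transcript)
```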

We can readily see this with our Chirp 1 results:

While Chirp 1 typically produces precisely identical results across runs, here we see marked differences, such as "Afar misal akkaa iyaan hin hadhuu goree sagaa layunayitii Waayiti tanii tafar ummaatti baaxaay wadaraake iggiin duddu kudhaan cafar bar" becoming "Afar misal akkaa iyaan hin hadhuu goree sagaa Iofar ummataay baaxaay waduirraa kee gee imi" in the second run. Notably, both transcriptions contain little Amharic, being much closer to Oromo.

Switching to Chirp 2, we see that unlike Chirp 1, it at least produces Amharic output in the Ge'ez script. However, we can also see that the results are almost exclusively hallucinated and riddled with constant repetition throughout.

Now let's try OpenAI's Whisper. Our first runs with Large-V1 and Large-V2 autodetect the speech as Somali, but then strangely transcribe it as Arabic, producing transcripts that are majority hallucination and repetition. Our first Large-V3 run identifies the speech as Swahili and transcribes it as such, with heavy repetition and hallucination in the second half of the video.
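These runs used the open source whisper package with its default automatic language detection, roughly as in the minimal sketch below (the model size and file name are placeholders):

```python
# Minimal sketch: transcribe with the open source whisper package, letting
# the model auto-detect the spoken language. The file name is a placeholder.
import whisper

model = whisper.load_model("large-v3")      # or "large-v1" / "large-v2"
result = model.transcribe("broadcast.mp4")

print(result["language"])                   # the language Whisper detected
for segment in result["segments"]:
    print(f"[{segment['start']:7.1f}s] {segment['text']}")
```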

What if we override Whisper's automatic language detection and force it to assume the video contains Amharic speech? Even when instructed to treat the audio as Amharic, the three models instead produce output that roughly resembles Afar, Minang and Afar, respectively.
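Forcing the language simply means passing Whisper's language parameter rather than relying on detection, roughly as sketched below (the file name is a placeholder):

```python
# Minimal sketch: force Whisper to treat the audio as Amharic ("am") instead
# of auto-detecting the language. The file name is a placeholder.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("broadcast.mp4", language="am")
print(result["text"])
```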

While Chirp only performs transcription at this time, Whisper can optionally translate the speech directly into English. How does this perform? Unfortunately, the results below are abysmal, composed almost exclusively of hallucinations and repeated text.
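For reference, the translation runs used Whisper's built-in translate task rather than its default transcribe task, roughly as in this minimal sketch (the file name is a placeholder):

```python
# Minimal sketch: ask Whisper for an English translation of the audio rather
# than a same-language transcript. The file name is a placeholder.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("broadcast.mp4", task="translate")
print(result["text"])
```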