From LSMs To LMMs For ASR: Evaluating Gemini's Performance At Transcribing An Evening News Broadcast

As we continue to evaluate the rapid progress of large model ASR systems, from lightly to heavily generative LSMs to LMMs, we have found that highly generative LSM speech transcription models like OpenAI's Whisper and GCP's Chirp 2 yield extremely poor results on television news broadcasts, with consistent and widespread hallucination and dropped passages. This got us wondering how an integrated LMM like Gemini might perform at transcription, given all of the momentum around unified processing models. Unfortunately, Gemini 1.5 Pro exhibits the same challenges as Chirp 2, with the added downside that its results change dramatically from one run to the next. Running Gemini 1.5 Pro twice at the same time on the exact same audio file yields 441 differences between the two transcripts, ranging from capitalization and punctuation differences to entire passages hallucinated or missing, which is worse even than Chirp 2's performance. As with Chirp 2, Gemini frequently "corrects" what was actually said into words and sentences that are more statistically probable, including "correcting" a discussion of California rent control to add a mention of its "power" crisis that was never spoken but appears frequently in its training data.
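To make the idea of counting differences between two runs concrete, here is a minimal Python sketch. It is not the method used to arrive at the 441 figure above, which this post does not describe. It assumes a whitespace word split and treats each contiguous replaced, inserted, or deleted run of words as one difference, using the standard library's difflib. The function name count_differences is our own, and the two inputs are the Gemini 1.5 Pro transcripts of the rent control clip quoted later in this post.

```python
from difflib import SequenceMatcher

# The two Gemini 1.5 Pro transcripts of the rent control clip, as quoted in this post.
RUN_1 = ("All the homes are gone, and the rents too high. Power to damn hard, "
         "still California to stay. California's leaving. The dreams drifting away. "
         "Support rent control.")
RUN_2 = ("All the homes are gone. And the rent's too high. Power to damn hearts. "
         "Power to damn hearts. Still, California to stay. California to stay. "
         "Californians leaving. Californians leaving. The dreams driftin' away. "
         "Support rent control.")


def count_differences(a: str, b: str) -> int:
    """Count contiguous word-level edits (replace, insert, delete) between two texts.

    Assumption: words are split on whitespace, so capitalization and
    punctuation changes count as differences too.
    """
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


print("Differences between the two runs:", count_differences(RUN_1, RUN_2))
```

A different tokenization or a character-level diff would give a different count, so numbers produced this way are only comparable when the same procedure is applied to every pair of transcripts.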

For example, here is the same clip that we looked at yesterday. We ran Gemini 1.5 Pro from scratch twice, at the exact same time, with the exact same audio file and the exact same prompt: "Provide me a complete verbatim transcript of this audio file. Transcribe every word spoken." Even in this brief clip there are 30 changes between the two transcripts. The majority are minor, but several are substantial. One transcript misspells Kris Van Cleave's name while the other spells it correctly. Both transcripts misspell First Officer Emily Wiprud's name. Most importantly, Gemini hallucinates the phrases "Air Traffic Control" and "we just experienced a decompression incident" in place of "we just need to depressurize". Statistically, the phrases that Gemini outputs are more common than the phrases actually spoken, and Gemini "corrects" the spoken words by rewording its transcript to contain these more common phrases, which were never said in the audio. While these two hallucinations may seem minor, they are indicative of the broad landscape of hallucinations that Gemini adds to its transcripts.

Similarly, in one of its two transcripts of this clip, Gemini hallucinates Harris adding "it's all about freedom" to her remarks, which is not what she actually said:

The government, and Donald Trump certainly should not be telling a woman what to do with her body. It's all about freedom. The two candidates sparred over the economy and inflation.

Or this clip about rent control. The actual transcript:

All the homes are gone. And the rent's too high. I work too damn hard. I work too damn hard. Still can't afford to stay. Can't afford to stay. California's leaving. California's leaving. The dreams drifting away. Support rent control.

Gemini 1.5 Pro Transcript #1:

All the homes are gone, and the rents too high. Power to damn hard, still California to stay. California's leaving. The dreams drifting away. Support rent control.

Gemini 1.5 Pro Transcript #2:

All the homes are gone. And the rent's too high. Power to damn hearts. Power to damn hearts. Still, California to stay. California to stay. Californians leaving. Californians leaving. The dreams driftin' away. Support rent control.

Note how Gemini hallucinates the word "power" in both transcripts, a reflection of the high statistical correlation between discussions of "power" (both political and electrical) and California. A search for discussion of California's power crisis would therefore return this clip because of the hallucinated mention.
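The search problem can be illustrated with a short Python sketch that lists the words appearing in each Gemini transcript but nowhere in the actual transcript. This is only an illustration under simple assumptions: text is lowercased and split into runs of letters and apostrophes, and the helper names tokenize and novel_words are our own. The three transcripts are copied from the rent control example above.

```python
import re

# Actual transcript and the two Gemini 1.5 Pro transcripts, as quoted above.
REFERENCE = ("All the homes are gone. And the rent's too high. I work too damn hard. "
             "I work too damn hard. Still can't afford to stay. Can't afford to stay. "
             "California's leaving. California's leaving. The dreams drifting away. "
             "Support rent control.")
GEMINI_RUNS = [
    ("All the homes are gone, and the rents too high. Power to damn hard, "
     "still California to stay. California's leaving. The dreams drifting away. "
     "Support rent control."),
    ("All the homes are gone. And the rent's too high. Power to damn hearts. "
     "Power to damn hearts. Still, California to stay. California to stay. "
     "Californians leaving. Californians leaving. The dreams driftin' away. "
     "Support rent control."),
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def novel_words(reference: str, hypothesis: str) -> list[str]:
    """Return words in the hypothesis that never occur in the reference, in first-seen order."""
    reference_vocab = set(tokenize(reference))
    found: list[str] = []
    for word in tokenize(hypothesis):
        if word not in reference_vocab and word not in found:
            found.append(word)
    return found


for number, run in enumerate(GEMINI_RUNS, start=1):
    print(f"Gemini transcript #{number} adds:", novel_words(REFERENCE, run))
```

Both lists include "power", a word that appears nowhere in the actual transcript, which is exactly the term a keyword search for California's power crisis would match.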