Testing The New OpenAI Whisper ASR Large-V2 Model On A Russian TV News Broadcast

This past October we explored applying OpenAI's Whisper ASR to television news broadcast selections from more than 100 channels spanning 50 countries over the past 20 years, including deep dives into Russian, French and English broadcasts. With the release of Whisper's new Large-V2 model two weeks ago, how does it compare? Most importantly, does the new model address some of the issues we've observed with the V1 Large model?

To explore the Large-V2 model, we used the same V100-equipped VM from our earlier tests. First, we upgraded it to the latest version of Whisper:

python3 -m pip install --upgrade pip
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

And then ran the same RUSSIA1_20221013_143000_60_minut broadcast through the new model twice each in the transcribe and translate tasks:

time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v2 --task translate > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev2.translate.txt
time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v2 > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev2.txt
time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v2 --task translate > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev2.translate.run2.txt
time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v2 > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev2.run2.txt

We also reran it through the Large-V1 model once in both transcribe and translate tasks to verify whether any other Whisper changes may have impacted its inference speed or output:

time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v1 --task translate > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev1.translate.txt
time whisper ./RUSSIA1_20221013_143000_60_minut.mp4 --model large-v1 > WHISPER-V100-RUSSIA1_20221013_143000_60_minut.largev1.transcribe.txt

When we previously ran the Large-V1 model with the original Whisper release on this V100 VM, it took 51m for transcription and 38m for translation. This time the Large-V1 model took 42m for transcription and 33m for translation. It is unclear whether this speedup was due to run-to-run variance or to implementation improvements elsewhere in the Whisper codebase.

Interestingly, the Large-V2 model took a similar 43m for transcription and 32m for translation, essentially matching Large-V1. That is consistent with Large-V2 sharing the same architecture as Large-V1 (it differs in training, not structure), and suggests any speed gains come from changes to the Whisper codebase shared by both models rather than from the new model itself.
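As a rough back-of-the-envelope check, the Large-V1 rerun works out to roughly an 18% speedup for transcription and 13% for translation over our original October timings:

```python
# Wall-clock minutes from our V100 runs: (original October run, current rerun).
timings = {"transcribe": (51, 42), "translate": (38, 33)}

for task, (old, new) in timings.items():
    speedup = 100 * (old - new) / old
    print(f"{task}: {old}m -> {new}m ({speedup:.1f}% faster)")
```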

What does each of the runs look like? You can download their complete outputs below:

One of the greatest challenges with Whisper is its non-deterministic nature: each run yields different results, even for transcription, where there is a single "correct" answer. Below you can see "diff" comparisons of the two Large-V2 runs and of the first Large-V2 run versus the Large-V1 run. For each comparison, the number of changed words is listed, showing the magnitude of the non-determinism.
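Word-level change counts like these can be computed with Python's standard difflib. Here is a minimal sketch, using two of the divergent translated sentences from this broadcast as stand-ins for full transcripts:

```python
import difflib

def changed_words(a: str, b: str) -> int:
    """Count words that differ between two transcript strings."""
    sm = difflib.SequenceMatcher(a=a.split(), b=b.split())
    changed = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            # Count the larger side of each insert/delete/replace block.
            changed += max(i2 - i1, j2 - j1)
    return changed

run1 = "Turkey turned out to be the most reliable route today to deliver grain to Europe."
run2 = "Turkey turned out to be the most reliable route today for the delivery of weapons."
print(changed_words(run1, run2))  # the two runs differ in 5 of 15 words
```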

As one simple example of the existential nature of this non-determinism and its impact on usability, here is the same statement as translated by each of the three translation passes:

Turkey turned out to be the most reliable route today to deliver grain to Europe.

Turkey turned out to be the most reliable route today to deliver gas to Europe.

Turkey turned out to be the most reliable route today for the delivery of weapons.

All three describe Turkey being the most reliable route, but for what? Is it grain, gas or weapons, and is it to Europe specifically? This is one of the great challenges with Whisper as it currently stands: each time a video is translated, it can yield not just different results, but existentially different results that fundamentally change the meaning of a statement. OpenAI continues to work on this challenge, and it will be interesting to see how long it takes before there are major improvements in meaning stability across runs.

The correct answer is actually the second: it is "gas to Europe" (listen to the clip).

We'd be interested in hearing your thoughts on whether the new Large-V2 model yields better, worse or the same results for this broadcast!