Applying OpenAI's Whisper ASR To 101 Sample TV News Broadcasts Spanning 50 Countries

OpenAI's open-source Whisper ASR model has received immense interest for its multilingual transcription and translation capabilities spanning nearly 100 languages. The Internet Archive's Television News Archive offers an ideal testbed for exploring Whisper's real-world performance, with global holdings spanning more than 100 channels across 50 countries and territories on 5 continents, in at least 35 languages and dialects, over 20 years. What would it look like to process one sample broadcast from each of these channels through Whisper, requesting both transcription and translation from its Tiny, Small, Medium and Large model sizes? And what might this diversity of geographies, vintages, topics, speakers, languages and dialects teach us about Whisper's potential?

In collaboration with the Television News Archive, we selected one representative broadcast from each of the 100 channels available in the Visual Explorer. The majority of the Archive's international channels do not have web-playable video clips, meaning that you will only have the thumbnail gallery in the Visual Explorer to examine alongside the Whisper-generated transcript. However, for some international channels the Archive has over the years made one or two broadcasts playable as part of special collections, such as the 9/11 Archive; in those cases, that playable broadcast is the one we examined here. This means that for some channels the specific broadcast examined may be extremely short or less representative of the channel's overall coverage, but it offers the benefit of allowing the transcript to be compared with the broadcast's actual audio. For the remaining channels we often chose older broadcasts to test Whisper's ability to handle poorer-quality audio. Each broadcast below includes a notation beside it as to whether it has a playable video clip or not.

To run Whisper, we split the broadcasts across the same two T4 and V100 GPU GCE VMs we used for our previous deep dives. For technical specifics on the actual hardware and software environment used, see last week's initial deep dive. No specific affinity was enforced; shows were simply sharded randomly across the two VMs to maximize throughput.
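The per-broadcast workload described above can be sketched in a few lines. This is only a minimal illustration, not the actual pipeline used here: it assumes the openai-whisper command-line tool is installed, and "sample.mp4" and the output directory names are hypothetical placeholders. It simply enumerates the eight (model size × task) combinations each broadcast was run through and builds the corresponding CLI invocation for each:

```python
# Sketch: build the eight whisper CLI invocations (4 model sizes x 2 tasks)
# for a single broadcast file. Assumes the openai-whisper CLI is installed;
# "sample.mp4" is a hypothetical placeholder filename.
MODELS = ["tiny", "small", "medium", "large"]
TASKS = ["transcribe", "translate"]

def whisper_commands(video_path):
    """Return one whisper CLI command per (model, task) combination."""
    commands = []
    for model in MODELS:
        for task in TASKS:
            commands.append(
                ["whisper", video_path,
                 "--model", model,
                 "--task", task,
                 "--output_dir", f"out/{model}_{task}"])
    return commands

for cmd in whisper_commands("sample.mp4"):
    print(" ".join(cmd))
```

In a sharded setup like the one described above, a list of such commands could simply be split randomly between the two VMs, since each broadcast is independent of the others.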

Using a simple one-liner, we can count the total airtime represented in these files:

find . -maxdepth 1 -iname '*.mp4' -exec ffprobe -v quiet -of csv=p=0 -show_entries format=duration {} \; | paste -sd+ - | bc
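The same computation can be done in Python, which sidesteps shell quoting issues with unusual filenames and makes the two steps (probing each file, then summing) explicit. This is a hedged sketch, not part of the original workflow: it assumes ffprobe is on the PATH, and scans the current directory as the one-liner does.

```python
# Sum the durations (in seconds) that ffprobe reports for a set of videos.
# Assumes ffprobe is on the PATH; scans the current directory for *.mp4.
import pathlib
import subprocess

def total_seconds(duration_strings):
    """Sum ffprobe duration strings (e.g. '1622.4'), skipping blank entries."""
    return sum(float(s) for s in duration_strings if s.strip())

def probe_duration(path):
    """Ask ffprobe for a single file's duration, returned as a string."""
    return subprocess.run(
        ["ffprobe", "-v", "quiet", "-of", "csv=p=0",
         "-show_entries", "format=duration", str(path)],
        capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    durations = [probe_duration(p) for p in pathlib.Path(".").glob("*.mp4")]
    print(total_seconds(durations))
```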

In all, 101 broadcasts were examined, totaling 244,405 seconds (4,073 minutes / 67.9 hours) of audio. Despite using two GPU VMs in parallel, it still took just over three days of 24/7 processing across the two machines to analyze all 101 broadcasts through all four models in both transcription and translation modes.
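The unit conversions in the total above can be verified with a couple of lines of arithmetic, using the reported 244,405 seconds as the only input:

```python
# Convert the total audio duration reported above into minutes and hours.
total_sec = 244_405
minutes = total_sec / 60    # ~4,073 minutes
hours = total_sec / 3600    # ~67.9 hours
print(round(minutes), round(hours, 1))
```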

You can see the final results below.