A Deep Dive Exploration Applying OpenAI's Whisper ASR To A PBS NewsHour Broadcast

Yesterday we explored applying OpenAI's open source Whisper ASR to transcribe and translate a Russian television news broadcast, finding that the results were highly fluent, but also highly non-deterministic, with dropouts, repetition and hallucination. Given that OpenAI has emphasized Whisper's enhanced English transcription capabilities and offers a parallel set of dedicated English language models, how might the results compare?

To explore this question, in collaboration with the Internet Archive's Television News Archive, we analyzed the October 12, 2022 3:00PM PDT edition of PBS NewsHour from affiliate KQED. This broadcast is one hour long and includes a range of speakers, topics and proper names to test Whisper's transcription abilities.

Key findings are that Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable. It captures a number of details missed in the human captioning.

While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, yielding significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.

Computationally, Whisper runs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.

Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, requiring a mixture of fine-tuning for names which regularly appear on the channel and heuristic and manual post-correction for others.

A major challenge to realtime captioning is Whisper's substantial computational requirements, though that could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with post-hoc reassembly, utilizing much of the machinery Whisper already uses to slide its 30-second analytic window over full-length broadcasts and look across those split boundaries.

Technical Configuration

For testing, we used the same pair of GCE VMs used on the Russian broadcast:

  • T4 VM: N1 series (Intel Haswell) VM with 48 vCPUs and 312GB RAM with one T4 GPU ("NVIDIA Corporation TU104GL [Tesla T4] (rev a1)")
  • V100 VM: N1 series (Intel Skylake) VM with 4 vCPUs and 78GB RAM with one V100 GPU ("NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)")

See yesterday's post for details on the specific versions of the installed software and drivers and how we installed them.
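For those wishing to reproduce these runs, the sketch below shows Whisper's standard Python API for a single transcription pass. The file name and model choice are placeholders, and the exact invocation we used may differ:

    import whisper

    # Load one of the models benchmarked below (tiny, small, medium, large,
    # or the English-only tiny.en / small.en / medium.en variants).
    model = whisper.load_model("medium")

    # Transcribe the broadcast; Whisper internally slides a 30-second
    # window across the full-length recording.
    result = model.transcribe("newshour_20221012.mp3", task="transcribe", language="en")

    # result["text"] is the full transcript; result["segments"] carries
    # per-segment timecodes.
    print(result["text"])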

Runtime & Resource Consumption

As with yesterday's exploration, we ran each model+task combination twice on the T4 and V100 VMs and recorded their total inference time and host and GPU resource consumption, reported in the table below. For details on how we recorded these numbers, see yesterday's post. While the original content is already in English, we also ran a single Medium translation task to see whether its output or runtime differed substantially from transcription, given that yesterday's results showed translation tasks completing significantly faster than transcription tasks.
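The measurements themselves are straightforward to reproduce. The sketch below shows one way to capture wall-clock inference time and sample GPU memory and utilization via nvidia-smi; this polling approach is our illustration here, not necessarily the exact tooling used:

    import subprocess
    import time

    import whisper

    model = whisper.load_model("medium")

    # Time a full transcription pass over the broadcast.
    start = time.perf_counter()
    result = model.transcribe("newshour_20221012.mp3", language="en")
    elapsed = time.perf_counter() - start
    print(f"inference time: {elapsed:.1f}s")

    # Sample GPU memory and utilization (run this in a separate shell or
    # thread while transcription is underway to catch peak usage).
    print(subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader"], text=True))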

| Model: Task | Runtime (T4) | Runtime (T4, 2nd run) | Runtime (V100) | Runtime (V100, 2nd run) | Host Mem (T4) | Host Mem (V100) | GPU Mem (T4) | GPU Mem (V100) | GPU Use (T4) | GPU Use (V100) |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny: Transcribe | 2m42s | 2m37.445s | 2m26s | 2m27.346s | 11.0GB/2.6GB | 10.8GB/2.6GB | 2025MiB | 2327MiB | 62% | 40% |
| Small: Transcribe | 9m10.834s | 9m8.732s | 5m6.420s | 5m6.645s | 14.7GB/3.4GB | 13.6GB/3.4GB | 3535MiB | 3787MiB | 92% | 62% |
| Medium: Transcribe | 22m41.257s | 22m40.714s | 9m8.525s | 8m59.942s | 20.0GB/5.9GB | 20.0GB/5.9GB | 6189MiB | 6189MiB | 96% | 96% |
| Large: Transcribe | 43m53.155s | 43m53.568s | 15m57.197s | 15m51.501s | 30.6GB/9.6GB | 30.2GB/9.5GB | 11095MiB | 11783MiB | 97% | 94% |
| Tiny.en: Transcribe | 2m47.113s | 2m39.976s | 2m38.484s | 2m19.806s | 12.0GB/2.6GB | 10.9GB/2.6GB | 2251MiB | 2503MiB | 59% | 38% |
| Small.en: Transcribe | 9m27.976s | 8m43.466s | 5m22.444s | 4m57.191s | 14.6GB/3.4GB | 13.5GB/3.4GB | 3495MiB | 3747MiB | 91% | 63% |
| Medium.en: Transcribe | 24m57.375s | 23m0.521s | 10m25.208s | 9m10.111s | 21.0GB/5.9GB | 19.7GB/5.9GB | 6597MiB | 6637MiB | 96% | 83% |
| Medium: Translate | | | 9m12.089s | | | 19.7GB/5.9GB | | 6667MiB | | 84% |

Comparing the inference times above with yesterday's Russian results for the faster V100 GPU, the Tiny model ran 4.1x faster on this English broadcast than on yesterday's Russian broadcast, Small 2.2x faster, Medium 3.5x faster and Large 3.3x faster. Given that this broadcast was 1 hour long compared with yesterday's 2.5-hour broadcast, a ratio of 2.5x would indicate identical speed, meaning only the Small model yielded a slower transcription rate. Dividing the total seconds of broadcast airtime by the total seconds Whisper took on its initial V100 run to transcribe it, the table below reports the seconds of airtime transcribed per second of GPU time, showing that English transcription is faster for every model except Small, but only by a modest amount. The English-specific models are slower in all cases, meaning that from an inference time standpoint, using an English-only model will not increase the transcription rate.
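To make the arithmetic concrete, using roughly 3,600 seconds of airtime for this broadcast and 9,000 seconds for yesterday's, the Tiny model works out as:

    English: 3,600 s airtime / 146 s GPU time ≈ 24.7 s of airtime transcribed per second of GPU time
    Russian: 9,000 s airtime / 599 s GPU time ≈ 15.0 s of airtime transcribed per second of GPU time
    Speedup: 599 s / 146 s ≈ 4.1x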

Comparison of V100 inference times (initial run, in seconds) and the resulting transcription rates for the Russian and English broadcasts:

| Model: Task | Runtime: Russian (s) | Runtime: English (s) | Speedup (Russian / English) | Airtime per GPU second: Russian | Airtime per GPU second: English |
|---|---|---|---|---|---|
| Tiny: Transcribe | 599 | 146 | 4.10 | 15.03 | 24.66 |
| Small: Transcribe | 660 | 307 | 2.15 | 13.64 | 11.73 |
| Medium: Transcribe | 1912 | 540 | 3.54 | 4.71 | 6.67 |
| Large: Transcribe | 3118 | 951 | 3.28 | 2.89 | 3.79 |
| Tiny.en: Transcribe | | 158 | | | 22.78 |
| Small.en: Transcribe | | 322 | | | 11.18 |
| Medium.en: Transcribe | | 625 | | | 5.76 |

Key findings from the table above:

  • English Transcription Is Only Slightly Faster Than Russian. Whisper transcribed this English-language broadcast slightly faster than it did yesterday's Russian-language broadcast, accounting for the differences in their lengths, but only by a modest amount.
  • English-Optimized Models Are Slower. The three English-optimized models actually take longer to transcribe the broadcast than the multilingual models.
  • High Resource Consumption & Utilization. The models consume substantial host and GPU resources. As with the Russian broadcast, the models used only a single vCPU core at 100%, but the GPU was almost fully utilized, meaning only a single broadcast at a time can be transcribed. The Medium English model, however, offers some potential for running two broadcasts at once, since it leaves around 20% of GPU capacity unused and there is sufficient GPU memory to load two copies of the model.
  • Consistent Inference Times & Resource Consumption. As with yesterday's results, running Whisper repeatedly on the same model+task combination yields similar inference times and similar host and GPU resource consumption, offering predictability in resource scheduling.

 

Transcription Results

What do the actual results look like? To compare the models, we have provided their respective transcripts below, along with the original human-typed closed captioning as a reference point for evaluating how well these fully automated transcripts stack up against live human captioning.

How does each of the automated transcripts compare with the human captioning? Remember that the human captioning in this case is NOT a true "gold" dataset: it leaves out some spoken content, such as portions of the sponsorship statement, and may contain other errors, since it is typically typed live in realtime by a human transcriptionist. Thus, the degree to which the automated transcripts deviate from it is not a measure of their error rate, but rather only their deviation rate – they may be more complete or accurate in some cases.

Since broadcast news captioning typically lacks capitalization (it is almost always entirely upper or lowercase), both the human captioning and the ASR transcript are converted to lowercase, punctuation and timecode information are removed, and both are split into words. Thus, the two are compared as lists of lowercased words.
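A minimal sketch of that normalization and comparison, using Python's standard difflib as an illustrative word-level diff (the file names are placeholders and timecodes are assumed to have already been stripped):

    import difflib
    import re

    def normalize(text):
        # Lowercase, strip punctuation and split into a flat list of words.
        return re.sub(r"[^\w\s]", " ", text.lower()).split()

    human_words = normalize(open("human_captions.txt").read())
    asr_words = normalize(open("whisper_transcript.txt").read())

    # Word-level similarity: 1.0 would mean identical word sequences.
    ratio = difflib.SequenceMatcher(None, human_words, asr_words).ratio()
    print(f"word-level similarity: {ratio:.3f}")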

The human transcript includes typographical errors like "thosassaults" instead of "those assaults" and "por" instead of "power", along with repetition like "laura: russian fire also knocked — amna: russian fire also knocked". The human captioning includes some speaker identification information, with the person's name at the beginning of their speech, which is obviously lacking from the automated ASR.

In numerous cases the ASR is more complete. In one clip, the human captioning transcribed "martinez had stepped down from her post as council president. martinez and two other councilmembers talked about attracting latino voting power d made abusive or marks" whereas the ASR transcribed "Nuri Martinez had already stepped down from her post as council president. On the leaked recording, Martinez and two other council members talked about protecting Latino voting power and made abusive remarks about other groups." While the ASR misspelled her first name as "Nuri" instead of "Nury", it added several words missing from the human transcript, including "nuri", "already" and "on the leaked recording". It corrected "d" to "and" and "or marks" to "remarks". It also corrected "attracting" to "protecting". Later in the same clip it added "California" to "attorney general", which was missing from the human version. In another clip, it added "punitive damages" that was missing from the human captioning, while in another it significantly expanded the word "money" in the human version to "nearly $1 billion, $965 million, to be exact" in the ASR version.

In fact, many of the differences are either stylistic ("it is" vs "it's", "10" vs "ten", "e-mailing" vs "emailing") or cases where Whisper's ASR filled in words missing from the human captioning, meaning the ASR produced a more faithful and overall more accurate transcript than the human-typed captioning.

For each model size, we ran the model three times in sequence over the broadcast on the two GPU VMs in parallel, for a total of six runs. Below you will find each of the six transcripts, along with a pairwise comparison of them that measures how much change there was between each version.
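One illustrative way to quantify that run-to-run variation is a pairwise word-level similarity over the six transcripts, as in the sketch below (file names are placeholders; the measure we actually report may differ):

    import difflib
    import itertools
    import re

    def words(path):
        # Same lowercased, punctuation-stripped word list as above.
        return re.sub(r"[^\w\s]", " ", open(path).read().lower()).split()

    # Six runs of a single model size, three per GPU VM.
    runs = [f"medium_run{i}.txt" for i in range(1, 7)]

    for a, b in itertools.combinations(runs, 2):
        ratio = difflib.SequenceMatcher(None, words(a), words(b)).ratio()
        print(f"{a} vs {b}: {ratio:.3f}")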

Tiny: Transcribe

Tiny.en: Transcribe

Small: Transcribe

Small.en: Transcribe

Medium: Transcribe

Medium.en: Transcribe

Medium: Translate

Large: Transcribe

 

Conclusions

Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable. It captures a number of details missed in the human captioning.

While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, yielding significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.

Computationally, Whisper runs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.

Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, with a mixture of fine-tuning for names which regularly appear on the channel and heuristic and manual post-correction for others.
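As a simple illustration of that heuristic post-correction, a hand-maintained table of known misrecognitions could be applied after transcription. The entries below are hypothetical apart from the "Nuri"/"Nury" example seen earlier:

    import re

    # Hypothetical lookup of frequent misrecognitions of names that
    # regularly appear on the channel.
    NAME_FIXES = {
        r"\bNuri Martinez\b": "Nury Martinez",
    }

    def fix_names(transcript):
        # Apply each known correction across the ASR output.
        for pattern, replacement in NAME_FIXES.items():
            transcript = re.sub(pattern, replacement, transcript)
        return transcript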

A major challenge to realtime captioning is Whisper's substantial computational requirements, though that could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with post-hoc reassembly, utilizing much of the machinery Whisper already uses to slide its 30-second analytic window over full-length broadcasts and look across those split boundaries.
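A rough sketch of that chunk-and-parallelize idea is shown below. It ignores the boundary handling and reassembly logic described above, and the chunk length, worker count, file name and model choice are all placeholder assumptions:

    from concurrent.futures import ThreadPoolExecutor

    import whisper

    SAMPLE_RATE = 16000   # Whisper resamples all audio to 16 kHz
    CHUNK_SECONDS = 30    # arbitrary split length for this sketch

    model = whisper.load_model("medium")
    audio = whisper.load_audio("newshour_20221012.mp3")

    # Split the broadcast into fixed-length chunks of samples.
    step = CHUNK_SECONDS * SAMPLE_RATE
    chunks = [audio[i:i + step] for i in range(0, len(audio), step)]

    def transcribe_chunk(chunk):
        return model.transcribe(chunk, language="en")["text"]

    # Transcribe chunks in parallel (in practice each worker would need its
    # own GPU or model copy) and stitch the text back together in order.
    with ThreadPoolExecutor(max_workers=2) as pool:
        texts = list(pool.map(transcribe_chunk, chunks))

    print(" ".join(texts))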