Yesterday we explored applying OpenAI's open source Whisper ASR to transcribe and translate a Russian television news broadcast, finding that the results were highly fluent, but also highly non-deterministic, with dropouts, repetition and hallucination. Given that OpenAI has emphasized Whisper's enhanced English transcription capabilities and offers a parallel set of dedicated English language models, how might the results compare?
To explore this question, in collaboration with the Internet Archive's Television News Archive, we analyzed the October 12, 2022 3:00PM PDT edition of PBS NewsHour from affiliate KQED. This broadcast is one hour long and includes a range of speakers, topics and proper names to test Whisper's transcription abilities.
Key findings are that Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable, capturing a number of details missed in the human captioning.

While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, producing significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.

Computationally, Whisper performs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.

Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, requiring a mixture of fine-tuning for names that regularly appear on the channel and heuristic or manual post-correction for others.

A major challenge to realtime captioning is Whisper's substantial computational requirements, though these could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with reassembly afterwards, utilizing much of the machinery Whisper already uses internally to move its 30-second analytic window over full-length broadcasts and look across those split boundaries.
Technical Configuration
For testing, we used the same pair of GCE VMs used on the Russian broadcast:
- T4 VM: N1 series (Intel Haswell) VM with 48 vCPUs and 312GB RAM with one T4 GPU ("NVIDIA Corporation TU104GL [Tesla T4] (rev a1)")
- V100 VM: N1 series (Intel Skylake) VM with 4 vCPUs and 78GB RAM with one V100 GPU ("NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)")
See yesterday's post for details on the specific version numbers of the installed software and drivers and how we installed it.
Runtime & Resource Consumption
As with yesterday's exploration, we ran each model+task combination twice on the T4 and V100 VMs and recorded their total inference time and host and GPU resource consumption, reported in the table below. For details on how we recorded these numbers, see yesterday's post. While the original content is already in English, we also tested a single Medium translation task to see whether its results and runtime differed substantially from transcription, given that yesterday's results showed translation tasks completing significantly faster than transcription tasks. A minimal sketch of how a single run can be invoked and timed appears after the table.
| Model: Task | Runtime (T4) | Runtime (T4, 2nd run) | Runtime (V100) | Runtime (V100, 2nd run) | Host Mem (T4) | Host Mem (V100) | GPU Mem (T4) | GPU Mem (V100) | GPU Use (T4) | GPU Use (V100) |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny: Transcribe | 2m42s | 2m37.445s | 2m26s | 2m27.346s | 11.0GB/2.6GB | 10.8GB/2.6GB | 2025MiB | 2327MiB | 62% | 40% |
| Small: Transcribe | 9m10.834s | 9m8.732s | 5m6.420s | 5m6.645s | 14.7GB/3.4GB | 13.6GB/3.4GB | 3535MiB | 3787MiB | 92% | 62% |
| Medium: Transcribe | 22m41.257s | 22m40.714s | 9m8.525s | 8m59.942s | 20.0GB/5.9GB | 20.0GB/5.9GB | 6189MiB | 6189MiB | 96% | 96% |
| Large: Transcribe | 43m53.155s | 43m53.568s | 15m57.197s | 15m51.501s | 30.6GB/9.6GB | 30.2GB/9.5GB | 11095MiB | 11783MiB | 97% | 94% |
| Tiny.en: Transcribe | 2m47.113s | 2m39.976s | 2m38.484s | 2m19.806s | 12.0GB/2.6GB | 10.9GB/2.6GB | 2251MiB | 2503MiB | 59% | 38% |
| Small.en: Transcribe | 9m27.976s | 8m43.466s | 5m22.444s | 4m57.191s | 14.6GB/3.4GB | 13.5GB/3.4GB | 3495MiB | 3747MiB | 91% | 63% |
| Medium.en: Transcribe | 24m57.375s | 23m0.521s | 10m25.208s | 9m10.111s | 21.0GB/5.9GB | 19.7GB/5.9GB | 6597MiB | 6637MiB | 96% | 83% |
| Medium: Translate | – | – | 9m12.089s | – | – | 19.7GB/5.9GB | – | 6667MiB | – | 84% |
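Below is that sketch, using the openai-whisper Python package. The filename newshour.mp4 is illustrative, and this is not the exact benchmarking harness used to produce the table above:

```python
import time

import whisper  # pip install openai-whisper

# Load one of the models benchmarked above: "tiny", "small", "medium",
# "large", or an English-only variant like "medium.en".
model = whisper.load_model("medium")

start = time.time()
# task="transcribe" keeps the source language; task="translate" produces
# English output from non-English audio.
result = model.transcribe("newshour.mp4", task="transcribe")
elapsed = time.time() - start

print(f"Inference time: {elapsed:.1f}s")
print(result["text"][:200])  # first 200 characters of the transcript
```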
Comparing the inference times above with yesterday's Russian results for the faster V100 GPU, the Tiny model ran 4.1x faster on this English broadcast than on yesterday's Russian broadcast, Small 2.1x faster, Medium 3.5x faster and Large 3.3x faster. Given that this broadcast was 1 hour long compared with yesterday's 2.5 hour broadcast, a ratio of 2.5x would indicate identical per-second speed, meaning only the Small model yielded a slower transcription rate. Dividing the total seconds of broadcast airtime by the total seconds Whisper took on its initial V100 run to transcribe it, the table below shows the seconds of airtime transcribed per second of GPU time (the arithmetic is sketched after the table), confirming that English transcription is faster in all cases other than the Small model, but only by a modest amount. The English-specific models are slower in all cases, meaning that from an inference time standpoint, using an English-only model will not increase the transcription rate.
| Model: Task | Inference Time, Russian (s) | Inference Time, English (s) | Ratio (Russian / English) | Airtime per GPU Second, Russian | Airtime per GPU Second, English |
|---|---|---|---|---|---|
| Tiny: Transcribe | 599 | 146 | 4.10 | 15.03 | 24.66 |
| Small: Transcribe | 660 | 307 | 2.15 | 13.64 | 11.73 |
| Medium: Transcribe | 1912 | 540 | 3.54 | 4.71 | 6.67 |
| Large: Transcribe | 3118 | 951 | 3.28 | 2.89 | 3.79 |
| Tiny.en: Transcribe | – | 158 | – | – | 22.78 |
| Small.en: Transcribe | – | 322 | – | – | 11.18 |
| Medium.en: Transcribe | – | 625 | – | – | 5.76 |
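The arithmetic behind those rates, as a quick sketch:

```python
# Seconds of airtime transcribed per second of GPU time: broadcast length
# divided by the inference time of the initial V100 run from the table above.
english_airtime = 1 * 60 * 60      # this one-hour PBS NewsHour broadcast
english_inference = {              # initial V100 run, in seconds
    "tiny": 146, "small": 307, "medium": 540, "large": 951,
    "tiny.en": 158, "small.en": 322, "medium.en": 625,
}

for model, secs in english_inference.items():
    print(f"{model}: {english_airtime / secs:.2f}s of airtime per GPU second")

# e.g. tiny: 3600 / 146 = 24.66, matching the table; yesterday's Russian
# figures use the same formula with that 2.5-hour (9000s) broadcast.
```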
Key findings from the table above:
- English Transcription Is Only Slightly Faster Than Russian. After accounting for the difference in broadcast lengths, Whisper transcribed this English-language broadcast faster than it did yesterday's Russian-language broadcast, but only by a modest amount.
- English-Optimized Models Are Slower. The three English-optimized models actually take longer to transcribe the broadcast than the multilingual models.
- High Resource Consumption & Utilization. The models consume substantial host and GPU resources. As with Russian, the models used only a single vCPU core at 100%, but the GPU was almost fully utilized, meaning only a single broadcast at a time can be transcribed per GPU. That said, the Medium English model offers some potential for running two broadcasts at once, since it leaves roughly 20% of GPU capacity unused and there is sufficient GPU memory to load two copies of the model.
- Consistent Inference Times & Resource Consumption. As with yesterday's results, running Whisper repeatedly on the same model+task combination yields similar inference times and similar host and GPU resource consumption, offering predictability in resource scheduling.
Transcription Results
What do the actual results look like? To compare the results of each model, we have provided their respective transcripts below, along with the original human-typed closed captioning as a gold-standard reference for evaluating how well these fully automated transcripts measure up against live human captioning.
- Human Closed Captioning (original broadcast captioning)
- Tiny: Transcribe
- Tiny.en: Transcribe
- Small: Transcribe
- Small.en: Transcribe
- Medium: Transcribe
- Medium.en: Transcribe
- Medium: Translate
- Large: Transcribe
How do each of the automated transcripts compare with the human captioning? Remember that the human captioning in this case is NOT a true "gold" dataset, in that it leaves out some spoken content, such as portions of the sponsorship statement, and may have other errors, since it is typically typed live in realtime by a human transcriptionist. Thus, the degree to which the automated transcripts deviate from it is not a measure of their error rate, but rather only their deviation rate – they may be more complete or accurate in some cases.
Since broadcast news captioning typically lacks meaningful capitalization (it is almost always entirely upper- or lowercase), both the human captioning and the ASR transcripts were converted to lowercase, punctuation and timecode information was removed, and each was split into words. Thus, the two are compared as lists of lowercased words.
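As a concrete illustration, here is a minimal sketch of that comparison, assuming Python's difflib as the diff engine (the exact tool used to compute the change counts below is not specified here, and the filenames are hypothetical):

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and split into words (timecode lines
    are assumed to have been removed upstream)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def changed_words(a: str, b: str) -> tuple[int, int]:
    """Return (changed word count, total words) between two transcripts,
    counting any word outside a matching block as changed."""
    wa, wb = normalize(a), normalize(b)
    matched = sum(
        block.size
        for block in SequenceMatcher(a=wa, b=wb, autojunk=False).get_matching_blocks()
    )
    total = max(len(wa), len(wb))
    return total - matched, total


changed, total = changed_words(
    open("captions.txt").read(), open("whisper_medium.txt").read()
)
print(f"Changed Words: {changed} / {total} ({changed / total:.2%})")
```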
- Tiny: Transcribe vs Captioning: Changed Words: 1926 / 9741 (19.77%)
- Tiny.en: Transcribe vs Captioning: Changed Words: 1858 / 9776 (19.01%)
- Small: Transcribe vs Captioning: Changed Words: 1777 / 9660 (18.40%)
- Small.en: Transcribe vs Captioning: Changed Words: 1721 / 9712 (17.72%)
- Medium: Transcribe vs Captioning: Changed Words: 1823 / 9818 (18.57%)
- Medium.en: Transcribe vs Captioning: Changed Words: 1658 / 9592 (17.29%)
- Medium: Translate vs Captioning: Changed Words: 1848 / 9849 (18.76%)
- Large: Transcribe vs Captioning: Changed Words: 1678 / 9715 (17.27%)
The human transcript includes typographical errors like "thosassaults" instead of "those assaults" and "por" instead of "power", along with repetition like "laura: russian fire also knocked — amna: russian fire also knocked". The human captioning also includes some speaker identification information, with the person's name at the beginning of each speech turn, which the automated ASR output lacks.
In numerous cases the ASR is more complete. In one clip, the human captioning transcribed "martinez had stepped down from her post as council president. martinez and two other councilmembers talked about attracting latino voting power d made abusive or marks" whereas the ASR transcribed "Nuri Martinez had already stepped down from her post as council president. On the leaked recording, Martinez and two other council members talked about protecting Latino voting power and made abusive remarks about other groups." While the ASR misspelled her first name as "Nuri" instead of "Nury", it added several words missing from the human transcript, including "Nuri", "already" and "on the leaked recording". It corrected "d" to "and" and "or marks" to "remarks", and it also corrected "attracting" to "protecting". Later in the same clip it added "California" to "attorney general", which was missing from the human version. In another clip, it added "punitive damages" that was missing from the human captioning, while in another it significantly expanded the word "money" in the human version to "nearly $1 billion, $965 million, to be exact" in the ASR version.
In fact, many of the differences are either stylistic ("it is" vs "it's", "10" vs "ten", "e-mailing" vs "emailing") or cases where Whisper's ASR filled in words missing from the human captioning, meaning the ASR produced a more faithful and overall more accurate transcript than the human-typed captioning.
For each model size, we ran the model three times in sequence over the broadcast on each of the two GPU VMs in parallel, for a total of six runs per model. Below you will find each of the six transcripts, along with pairwise comparisons measuring how much the output changed between runs, as sketched immediately below.
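The pairwise change figures can be generated with a few lines on top of the changed_words helper sketched earlier; the run1.txt through run6.txt filenames are illustrative:

```python
from itertools import combinations

# Load the six runs (three per VM); filenames are hypothetical.
runs = {f"Run{i}": open(f"run{i}.txt").read() for i in range(1, 7)}

# Compare every unordered pair of runs (15 pairs for 6 runs).
for (name_a, text_a), (name_b, text_b) in combinations(runs.items(), 2):
    changed, total = changed_words(text_a, text_b)  # helper from earlier sketch
    print(f"{name_a} x {name_b}: Changed Words: {changed} / {total} "
          f"({changed / total:.2%})")
```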
Tiny: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run4 x Run5: Changed Words: 829 / 9606 (8.63%)
- Run3 x Run5: Changed Words: 791 / 9586 (8.25%)
- Run2 x Run5: Changed Words: 765 / 9586 (7.98%)
- Run5 x Run6: Changed Words: 751 / 9628 (7.80%)
- Run4 x Run6: Changed Words: 731 / 9606 (7.61%)
- Run1 x Run5: Changed Words: 726 / 9578 (7.58%)
- Run3 x Run6: Changed Words: 684 / 9586 (7.14%)
- Run1 x Run3: Changed Words: 669 / 9554 (7.00%)
- Run3 x Run4: Changed Words: 647 / 9561 (6.77%)
- Run2 x Run6: Changed Words: 639 / 9577 (6.67%)
- Run1 x Run4: Changed Words: 631 / 9555 (6.60%)
- Run2 x Run4: Changed Words: 629 / 9545 (6.59%)
- Run2 x Run3: Changed Words: 617 / 9532 (6.47%)
- Run1 x Run6: Changed Words: 615 / 9578 (6.42%)
- Run1 x Run2: Changed Words: 569 / 9525 (5.97%)
Tiny.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run2: Changed Words: 0 / 9326 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9326 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9326 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9445 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9445 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9445 (0.00%)
Small: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run2 x Run4: Changed Words: 163 / 9281 (1.76%)
- Run2 x Run5: Changed Words: 163 / 9281 (1.76%)
- Run2 x Run6: Changed Words: 163 / 9281 (1.76%)
- Run1 x Run4: Changed Words: 154 / 9273 (1.66%)
- Run1 x Run5: Changed Words: 154 / 9273 (1.66%)
- Run1 x Run6: Changed Words: 154 / 9273 (1.66%)
- Run3 x Run4: Changed Words: 153 / 9272 (1.65%)
- Run3 x Run5: Changed Words: 153 / 9272 (1.65%)
- Run3 x Run6: Changed Words: 153 / 9272 (1.65%)
- Run1 x Run2: Changed Words: 33 / 9240 (0.36%)
- Run2 x Run3: Changed Words: 25 / 9236 (0.27%)
- Run1 x Run3: Changed Words: 24 / 9231 (0.26%)
- Run4 x Run5: Changed Words: 0 / 9203 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9203 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9203 (0.00%)
Small.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run2: Changed Words: 0 / 9353 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9353 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9353 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9349 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9349 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9349 (0.00%)
Medium: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run2: Changed Words: 0 / 9488 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9488 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9488 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9515 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9515 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9515 (0.00%)
Medium.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 196 / 9346 (2.10%)
- Run1 x Run5: Changed Words: 196 / 9346 (2.10%)
- Run1 x Run6: Changed Words: 196 / 9346 (2.10%)
- Run3 x Run4: Changed Words: 190 / 9326 (2.04%)
- Run3 x Run5: Changed Words: 190 / 9326 (2.04%)
- Run3 x Run6: Changed Words: 190 / 9326 (2.04%)
- Run1 x Run2: Changed Words: 118 / 9339 (1.26%)
- Run2 x Run3: Changed Words: 107 / 9314 (1.15%)
- Run1 x Run3: Changed Words: 98 / 9353 (1.05%)
- Run2 x Run4: Changed Words: 78 / 9252 (0.84%)
- Run2 x Run5: Changed Words: 78 / 9252 (0.84%)
- Run2 x Run6: Changed Words: 78 / 9252 (0.84%)
- Run4 x Run5: Changed Words: 0 / 9228 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9228 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9228 (0.00%)
Medium: Translate
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run1 x Run2: Changed Words: 0 / 9557 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9557 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9557 (0.00%)
Large: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run2: Changed Words: 0 / 9322 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9322 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9322 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9376 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9376 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9376 (0.00%)
Conclusions
Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable. It captures a number of details missed in the human captioning.
While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, producing significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.
Computationally, Whisper performs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.
Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, requiring a mixture of fine-tuning for names that regularly appear on the channel and heuristic or manual post-correction for others.
A major challenge to realtime captioning is Whisper's substantial computational requirements, though these could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with reassembly afterwards, utilizing much of the machinery Whisper already uses internally to move its 30-second analytic window over full-length broadcasts and look across those split boundaries.
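As a purely speculative sketch of that pipeline, one might segment the audio track with ffmpeg, transcribe the chunks in parallel and reassemble the text in order. Chunk length, worker count and the cross-boundary reconciliation that Whisper performs internally across its 30-second windows are all left naive here, and the input filename is hypothetical:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import whisper


def split_audio(src: str, seconds: int = 30) -> list[str]:
    """Cut the audio track into fixed-length 16kHz mono WAV chunks via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-vn", "-ar", "16000", "-ac", "1",
         "-f", "segment", "-segment_time", str(seconds), "chunk%04d.wav"],
        check=True,
    )
    return sorted(str(p) for p in Path(".").glob("chunk*.wav"))


def transcribe_chunk(path: str) -> str:
    # Each worker loads its own model copy; in practice each worker would
    # need its own GPU (or a shared batching service) to see a speedup.
    model = whisper.load_model("medium")
    return model.transcribe(path)["text"]


if __name__ == "__main__":
    chunks = split_audio("newshour.mp4")
    with ProcessPoolExecutor(max_workers=2) as pool:
        pieces = list(pool.map(transcribe_chunk, chunks))  # preserves chunk order
    print(" ".join(pieces))
```

A production version would also need to overlap chunks or pass context between them, since naive splitting discards the cross-window context Whisper normally uses to keep sentences coherent across segment boundaries.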