Yesterday we explored applying OpenAI's open source Whisper ASR to transcribe and translate a Russian television news broadcast, finding that the results were highly fluent, but also highly non-deterministic, with dropouts, repetition and hallucination. Given that OpenAI has emphasized Whisper's enhanced English transcription capabilities and offers a parallel set of dedicated English language models, how might the results compare?
To explore this question, in collaboration with the Internet Archive's Television News Archive, we analyzed the October 12, 2022 3:00PM PDT edition of PBS NewsHour from affiliate KQED. This broadcast is one hour long and includes a range of speakers, topics and proper names to test Whisper's transcription abilities.
Key findings are that Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable, capturing a number of details missed in the human captioning.

While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, producing significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.

Computationally, Whisper performs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.

Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, requiring a mixture of fine-tuning for names that regularly appear on the channel and heuristic or manual post-correction for others.

A major challenge to realtime captioning is Whisper's substantial computational requirements, though these could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with reassembly afterwards, utilizing much of the machinery Whisper already uses internally to move its 30-second analytic window over full-length broadcasts and look across those split boundaries.
Technical Configuration
For testing, we used the same pair of GCE VMs used on the Russian broadcast:
- T4 VM: N1 series (Intel Haswell) VM with 48 vCPUs and 312GB RAM with one T4 GPU ("NVIDIA Corporation TU104GL [Tesla T4] (rev a1)")
- V100 VM: N1 series (Intel Skylake) VM with 4 vCPUs and 78GB RAM with one V100 GPU ("NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)")
See yesterday's post for details on the specific version numbers of the installed software and drivers and how we installed it.
Runtime & Resource Consumption
As with yesterday's exploration, we ran each model+task combination twice on the T4 and V100 VMs and recorded their total inference time and host and GPU resource consumption, reported in the table below. For details on how we recorded these numbers, see yesterday's post. While the original content is already in English, we also tested a single Medium translation task to see whether its results and runtime differed substantially from transcription, given that yesterday's results showed translation tasks completing significantly faster than transcription tasks. A minimal sketch of how a single run can be invoked and timed appears after the table.
| Model: Task | Runtime (T4) | Runtime (T4, 2nd run) | Runtime (V100) | Runtime (V100, 2nd run) | Host Mem (T4) | Host Mem (V100) | GPU Mem (T4) | GPU Mem (V100) | GPU Use (T4) | GPU Use (V100) |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny: Transcribe | 2m42s | 2m37.445s | 2m26s | 2m27.346s | 11.0GB/2.6GB | 10.8GB/2.6GB | 2025MiB | 2327MiB | 62% | 40% |
| Small: Transcribe | 9m10.834s | 9m8.732s | 5m6.420s | 5m6.645s | 14.7GB/3.4GB | 13.6GB/3.4GB | 3535MiB | 3787MiB | 92% | 62% |
| Medium: Transcribe | 22m41.257s | 22m40.714s | 9m8.525s | 8m59.942s | 20.0GB/5.9GB | 20.0GB/5.9GB | 6189MiB | 6189MiB | 96% | 96% |
| Large: Transcribe | 43m53.155s | 43m53.568s | 15m57.197s | 15m51.501s | 30.6GB/9.6GB | 30.2GB/9.5GB | 11095MiB | 11783MiB | 97% | 94% |
| Tiny.en: Transcribe | 2m47.113s | 2m39.976s | 2m38.484s | 2m19.806s | 12.0GB/2.6GB | 10.9GB/2.6GB | 2251MiB | 2503MiB | 59% | 38% |
| Small.en: Transcribe | 9m27.976s | 8m43.466s | 5m22.444s | 4m57.191s | 14.6GB/3.4GB | 13.5GB/3.4GB | 3495MiB | 3747MiB | 91% | 63% |
| Medium.en: Transcribe | 24m57.375s | 23m0.521s | 10m25.208s | 9m10.111s | 21.0GB/5.9GB | 19.7GB/5.9GB | 6597MiB | 6637MiB | 96% | 83% |
| Medium: Translate | – | – | 9m12.089s | – | – | 19.7GB/5.9GB | – | 6667MiB | – | 84% |
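Below is that sketch, using the openai-whisper Python package. The filename newshour.mp4 is illustrative, and this is not the exact benchmarking harness used to produce the table above:

```python
import time

import whisper  # pip install openai-whisper

# Load one of the models benchmarked above: "tiny", "small", "medium",
# "large", or an English-only variant like "medium.en".
model = whisper.load_model("medium")

start = time.time()
# task="transcribe" keeps the source language; task="translate" produces
# English output from non-English audio.
result = model.transcribe("newshour.mp4", task="transcribe")
elapsed = time.time() - start

print(f"Inference time: {elapsed:.1f}s")
print(result["text"][:200])  # first 200 characters of the transcript
```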
Comparing the inference times above with yesterday's Russian results for the faster V100 GPU, the Tiny model ran 4.1x faster on this English broadcast than on yesterday's Russian broadcast, Small 2.1x faster, Medium 3.5x faster and Large 3.3x faster. Given that this broadcast was 1 hour long compared with yesterday's 2.5 hour broadcast, a ratio of 2.5x would indicate identical per-second speed, meaning only the Small model yielded a slower transcription rate. Dividing the total seconds of broadcast airtime by the total seconds Whisper took on its initial V100 run to transcribe it, the table below shows the seconds of airtime transcribed per second of GPU time (the arithmetic is sketched after the table), confirming that English transcription is faster in all cases other than the Small model, but only by a modest amount. The English-specific models are slower in all cases, meaning that from an inference time standpoint, using an English-only model will not increase the transcription rate.
| Model: Task | Inference Time, Russian (s) | Inference Time, English (s) | Ratio (Russian / English) | Airtime per GPU Second, Russian | Airtime per GPU Second, English |
|---|---|---|---|---|---|
| Tiny: Transcribe | 599 | 146 | 4.10 | 15.03 | 24.66 |
| Small: Transcribe | 660 | 307 | 2.15 | 13.64 | 11.73 |
| Medium: Transcribe | 1912 | 540 | 3.54 | 4.71 | 6.67 |
| Large: Transcribe | 3118 | 951 | 3.28 | 2.89 | 3.79 |
| Tiny.en: Transcribe | – | 158 | – | – | 22.78 |
| Small.en: Transcribe | – | 322 | – | – | 11.18 |
| Medium.en: Transcribe | – | 625 | – | – | 5.76 |
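The arithmetic behind those rates, as a quick sketch:

```python
# Seconds of airtime transcribed per second of GPU time: broadcast length
# divided by the inference time of the initial V100 run from the table above.
english_airtime = 1 * 60 * 60      # this one-hour PBS NewsHour broadcast
english_inference = {              # initial V100 run, in seconds
    "tiny": 146, "small": 307, "medium": 540, "large": 951,
    "tiny.en": 158, "small.en": 322, "medium.en": 625,
}

for model, secs in english_inference.items():
    print(f"{model}: {english_airtime / secs:.2f}s of airtime per GPU second")

# e.g. tiny: 3600 / 146 = 24.66, matching the table; yesterday's Russian
# figures use the same formula with that 2.5-hour (9000s) broadcast.
```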
Key findings from the table above:
- English Transcription Is Only Slightly Faster Than Russian. After accounting for the difference in broadcast lengths, Whisper transcribed this English-language broadcast faster than it did yesterday's Russian-language broadcast, but only by a modest amount.
- English-Optimized Models Are Slower. The three English-optimized models actually take longer to transcribe the broadcast than the multilingual models.
- High Resource Consumption & Utilization. The models consume substantial host and GPU resources. As with Russian, the models used only a single vCPU core at 100%, but the GPU was almost fully utilized, meaning only a single broadcast at a time can be transcribed per GPU. That said, the Medium English model offers some potential for running two broadcasts at once, since it leaves roughly 20% of GPU capacity unused and there is sufficient GPU memory to load two copies of the model.
- Consistent Inference Times & Resource Consumption. As with yesterday's results, running Whisper repeatedly on the same model+task combination yields similar inference times and similar host and GPU resource consumption, offering predictability in resource scheduling.
Transcription Results
What do the actual results look like? To compare the results of each model, we have provided their respective transcripts below, along with the original human-typed closed captioning as a gold-standard reference for evaluating how well these fully automated transcripts measure up against live human captioning.
- Human Closed Captioning (original broadcast captioning)
- Tiny: Transcribe
- Tiny.en: Transcribe
- Small: Transcribe
- Small.en: Transcribe
- Medium: Transcribe
- Medium.en: Transcribe
- Medium: Translate
- Large: Transcribe
How do each of the automated transcripts compare with the human captioning? Remember that the human captioning in this case is NOT a true "gold" dataset, in that it leaves out some spoken content, such as portions of the sponsorship statement, and may have other errors, since it is typically typed live in realtime by a human transcriptionist. Thus, the degree to which the automated transcripts deviate from it is not a measure of their error rate, but rather only their deviation rate – they may be more complete or accurate in some cases.
Since broadcast news captioning typically lacks meaningful capitalization (it is almost always entirely upper- or lowercase), both the human captioning and the ASR transcripts were converted to lowercase, punctuation and timecode information was removed, and each was split into words. Thus, the two are compared as lists of lowercased words.
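As a concrete illustration, here is a minimal sketch of that comparison, assuming Python's difflib as the diff engine (the exact tool used to compute the change counts below is not specified here, and the filenames are hypothetical):

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and split into words (timecode lines
    are assumed to have been removed upstream)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def changed_words(a: str, b: str) -> tuple[int, int]:
    """Return (changed word count, total words) between two transcripts,
    counting any word outside a matching block as changed."""
    wa, wb = normalize(a), normalize(b)
    matched = sum(
        block.size
        for block in SequenceMatcher(a=wa, b=wb, autojunk=False).get_matching_blocks()
    )
    total = max(len(wa), len(wb))
    return total - matched, total


changed, total = changed_words(
    open("captions.txt").read(), open("whisper_medium.txt").read()
)
print(f"Changed Words: {changed} / {total} ({changed / total:.2%})")
```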
- Tiny: Transcribe vs Captioning: Changed Words: 1926 / 9741 (19.77%)
- Tiny.en: Transcribe vs Captioning: Changed Words: 1858 / 9776 (19.01%)
- Small: Transcribe vs Captioning: Changed Words: 1777 / 9660 (18.40%)
- Small.en: Transcribe vs Captioning: Changed Words: 1721 / 9712 (17.72%)
- Medium: Transcribe vs Captioning: Changed Words: 1823 / 9818 (18.57%)
- Medium.en: Transcribe vs Captioning: Changed Words: 1658 / 9592 (17.29%)
- Medium: Translate vs Captioning: Changed Words: 1848 / 9849 (18.76%)
- Large: Transcribe vs Captioning: Changed Words: 1678 / 9715 (17.27%)
The human transcript includes typographical errors like "thosassaults" instead of "those assaults" and "por" instead of "power", along with repetition like "laura: russian fire also knocked — amna: russian fire also knocked". The human captioning also includes some speaker identification information, with the person's name at the beginning of each speech turn, which the automated ASR output lacks.
In numerous cases the ASR is more complete. In one clip, the human captioning transcribed "martinez had stepped down from her post as council president. martinez and two other councilmembers talked about attracting latino voting power d made abusive or marks" whereas the ASR transcribed "Nuri Martinez had already stepped down from her post as council president. On the leaked recording, Martinez and two other council members talked about protecting Latino voting power and made abusive remarks about other groups." While the ASR misspelled her first name as "Nuri" instead of "Nury", it added several words missing from the human transcript, including "Nuri", "already" and "on the leaked recording". It corrected "d" to "and" and "or marks" to "remarks", and it also corrected "attracting" to "protecting". Later in the same clip it added "California" to "attorney general", which was missing from the human version. In another clip, it added "punitive damages" that was missing from the human captioning, while in another it significantly expanded the word "money" in the human version to "nearly $1 billion, $965 million, to be exact" in the ASR version.
In fact, many of the differences are either stylistic ("it is" vs "it's", "10" vs "ten", "e-mailing" vs "emailing") or cases where Whisper's ASR filled in words missing from the human captioning, meaning the ASR produced a more faithful and overall more accurate transcript than the human-typed captioning.
For each model size, we ran the model three times in sequence over the broadcast on each of the two GPU VMs in parallel, for a total of six runs per model. Below you will find each of the six transcripts, along with pairwise comparisons measuring how much the output changed between runs, as sketched immediately below.
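The pairwise change figures can be generated with a few lines on top of the changed_words helper sketched earlier; the run1.txt through run6.txt filenames are illustrative:

```python
from itertools import combinations

# Load the six runs (three per VM); filenames are hypothetical.
runs = {f"Run{i}": open(f"run{i}.txt").read() for i in range(1, 7)}

# Compare every unordered pair of runs (15 pairs for 6 runs).
for (name_a, text_a), (name_b, text_b) in combinations(runs.items(), 2):
    changed, total = changed_words(text_a, text_b)  # helper from earlier sketch
    print(f"{name_a} x {name_b}: Changed Words: {changed} / {total} "
          f"({changed / total:.2%})")
```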
Tiny: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run4 x Run5: Changed Words: 829 / 9606 (8.63%)
- Run3 x Run5: Changed Words: 791 / 9586 (8.25%)
- Run2 x Run5: Changed Words: 765 / 9586 (7.98%)
- Run5 x Run6: Changed Words: 751 / 9628 (7.80%)
- Run4 x Run6: Changed Words: 731 / 9606 (7.61%)
- Run1 x Run5: Changed Words: 726 / 9578 (7.58%)
- Run3 x Run6: Changed Words: 684 / 9586 (7.14%)
- Run1 x Run3: Changed Words: 669 / 9554 (7.00%)
- Run3 x Run4: Changed Words: 647 / 9561 (6.77%)
- Run2 x Run6: Changed Words: 639 / 9577 (6.67%)
- Run1 x Run4: Changed Words: 631 / 9555 (6.60%)
- Run2 x Run4: Changed Words: 629 / 9545 (6.59%)
- Run2 x Run3: Changed Words: 617 / 9532 (6.47%)
- Run1 x Run6: Changed Words: 615 / 9578 (6.42%)
- Run1 x Run2: Changed Words: 569 / 9525 (5.97%)
Tiny.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run2 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run4: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run5: Changed Words: 266 / 9459 (2.81%)
- Run3 x Run6: Changed Words: 266 / 9459 (2.81%)
- Run1 x Run2: Changed Words: 0 / 9326 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9326 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9326 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9445 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9445 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9445 (0.00%)
Small: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run2 x Run4: Changed Words: 163 / 9281 (1.76%)
- Run2 x Run5: Changed Words: 163 / 9281 (1.76%)
- Run2 x Run6: Changed Words: 163 / 9281 (1.76%)
- Run1 x Run4: Changed Words: 154 / 9273 (1.66%)
- Run1 x Run5: Changed Words: 154 / 9273 (1.66%)
- Run1 x Run6: Changed Words: 154 / 9273 (1.66%)
- Run3 x Run4: Changed Words: 153 / 9272 (1.65%)
- Run3 x Run5: Changed Words: 153 / 9272 (1.65%)
- Run3 x Run6: Changed Words: 153 / 9272 (1.65%)
- Run1 x Run2: Changed Words: 33 / 9240 (0.36%)
- Run2 x Run3: Changed Words: 25 / 9236 (0.27%)
- Run1 x Run3: Changed Words: 24 / 9231 (0.26%)
- Run4 x Run5: Changed Words: 0 / 9203 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9203 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9203 (0.00%)
Small.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run2 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run4: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run5: Changed Words: 61 / 9360 (0.65%)
- Run3 x Run6: Changed Words: 61 / 9360 (0.65%)
- Run1 x Run2: Changed Words: 0 / 9353 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9353 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9353 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9349 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9349 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9349 (0.00%)
Medium: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run2 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run4: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run5: Changed Words: 51 / 9516 (0.54%)
- Run3 x Run6: Changed Words: 51 / 9516 (0.54%)
- Run1 x Run2: Changed Words: 0 / 9488 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9488 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9488 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9515 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9515 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9515 (0.00%)
Medium.en: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 196 / 9346 (2.10%)
- Run1 x Run5: Changed Words: 196 / 9346 (2.10%)
- Run1 x Run6: Changed Words: 196 / 9346 (2.10%)
- Run3 x Run4: Changed Words: 190 / 9326 (2.04%)
- Run3 x Run5: Changed Words: 190 / 9326 (2.04%)
- Run3 x Run6: Changed Words: 190 / 9326 (2.04%)
- Run1 x Run2: Changed Words: 118 / 9339 (1.26%)
- Run2 x Run3: Changed Words: 107 / 9314 (1.15%)
- Run1 x Run3: Changed Words: 98 / 9353 (1.05%)
- Run2 x Run4: Changed Words: 78 / 9252 (0.84%)
- Run2 x Run5: Changed Words: 78 / 9252 (0.84%)
- Run2 x Run6: Changed Words: 78 / 9252 (0.84%)
- Run4 x Run5: Changed Words: 0 / 9228 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9228 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9228 (0.00%)
Medium: Translate
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run1 x Run2: Changed Words: 0 / 9557 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9557 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9557 (0.00%)
Large: Transcribe
- Run1 (Transcript)
- Run2 (Transcript)
- Run3 (Transcript)
- Run4 (Transcript)
- Run5 (Transcript)
- Run6 (Transcript)
- Run1 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run2 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run4: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run5: Changed Words: 290 / 9425 (3.08%)
- Run3 x Run6: Changed Words: 290 / 9425 (3.08%)
- Run1 x Run2: Changed Words: 0 / 9322 (0.00%)
- Run1 x Run3: Changed Words: 0 / 9322 (0.00%)
- Run2 x Run3: Changed Words: 0 / 9322 (0.00%)
- Run4 x Run5: Changed Words: 0 / 9376 (0.00%)
- Run4 x Run6: Changed Words: 0 / 9376 (0.00%)
- Run5 x Run6: Changed Words: 0 / 9376 (0.00%)
Conclusions
Whisper yields an English language transcript with extremely fluent capitalization, punctuation and grouping. The resulting automatic transcript is more complete and faithful than the human-generated captioning and significantly more readable and understandable. It captures a number of details missed in the human captioning.
While still suffering from non-deterministic output, Whisper performed far better on this English-language broadcast than it did on yesterday's Russian-language broadcast, producing significantly more consistent results, especially for the larger models, though there were still a few instances of drop-outs.
Computationally, Whisper performs only slightly faster on English-language content than on yesterday's Russian-language material, while its English-optimized models are actually slightly slower. Host and GPU resource consumption and utilization are similar between the English-optimized and general models and when running English and Russian language content.
Whisper produces a noticeably more fluent, complete and readable transcript than the current human closed captioning, suggesting a major inflection point has arrived for dramatically improving the accessibility of broadcast television news. Whisper's open nature means it can be readily fine-tuned as well. Its non-deterministic nature means post-correction of proper names may be more complex, requiring a mixture of fine-tuning for names that regularly appear on the channel and heuristic or manual post-correction for others.
A major challenge to realtime captioning is Whisper's substantial computational requirements, though these could likely be addressed by splitting the audio stream every few seconds and processing the pieces through Whisper in parallel, with reassembly afterwards, utilizing much of the machinery Whisper already uses internally to move its 30-second analytic window over full-length broadcasts and look across those split boundaries.
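As a purely speculative sketch of that pipeline, one might segment the audio track with ffmpeg, transcribe the chunks in parallel and reassemble the text in order. Chunk length, worker count and the cross-boundary reconciliation that Whisper performs internally across its 30-second windows are all left naive here, and the input filename is hypothetical:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import whisper


def split_audio(src: str, seconds: int = 30) -> list[str]:
    """Cut the audio track into fixed-length 16kHz mono WAV chunks via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-vn", "-ar", "16000", "-ac", "1",
         "-f", "segment", "-segment_time", str(seconds), "chunk%04d.wav"],
        check=True,
    )
    return sorted(str(p) for p in Path(".").glob("chunk*.wav"))


def transcribe_chunk(path: str) -> str:
    # Each worker loads its own model copy; in practice each worker would
    # need its own GPU (or a shared batching service) to see a speedup.
    model = whisper.load_model("medium")
    return model.transcribe(path)["text"]


if __name__ == "__main__":
    chunks = split_audio("newshour.mp4")
    with ProcessPoolExecutor(max_workers=2) as pool:
        pieces = list(pool.map(transcribe_chunk, chunks))  # preserves chunk order
    print(" ".join(pieces))
```

A production version would also need to overlap chunks or pass context between them, since naive splitting discards the cross-window context Whisper normally uses to keep sentences coherent across segment boundaries.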