Last month OpenAI released an open source ASR system called Whisper, trained on 680,000 hours of multilingual data. Earlier this week we demonstrated the eerily human-like fluency of Whisper's automatic English translation of a 30-second clip from a Russian television news broadcast. What would it look like to run Whisper over an entire Russian television news broadcast and compare the results of each of its model sizes?
To explore this question, in collaboration with the Internet Archive's Television News Archive, we chose the October 13, 2022 5:30-8:00PM MSK episode of 60 Minutes from the Russian television channel Russia 1. This broadcast is 2.5 hours long and features a wealth of different speakers, speaking both Russian and a variety of other languages overdubbed into Russian, along with a rich diversity of audio contexts, from music to combat footage to interviews, panels and overdubbing. This makes it an ideal testbed for examining Whisper's performance in terms of transcription and translation output, inference runtime and computational requirements.
Key findings are that the models require substantial computational resources and have lengthy inference times, in keeping with their advanced capabilities, and that the larger models produce extremely fluent translations previously unachievable by most commercial ASR offerings. A surprising finding that will require further research is that all four models are highly non-deterministic in their default configuration, producing different output each time they are run. They can also enter unpredictable error states that cause them to skip over portions of the broadcast, replace as much as half the broadcast with repeating text and introduce hallucinated text not found in the broadcast.
Technical Configuration
To create a representative hardware environment, we created two brand new VMs on Google Cloud Platform's (GCP) Compute Engine (GCE) to test two commonly available cloud-accessible GPU architectures: the T4 and V100. Both VMs had 250GB SSD boot disks using GCE's "Deep Learning on Linux: Debian 10 based Deep Learning VM with M98 (Base CUDA 11.3, Deep Learning VM Image with CUDA 11.3 preinstalled)" ("c0-deeplearning-common-cu113-v20220928-debian-10") disk image, with the native NVIDIA driver installed (version 510.47.03) and CUDA upgraded to 11.6. Final hardware specifications:
- T4 VM: N1 series (Intel Haswell) VM with 48 vCPUs and 312GB RAM with one T4 GPU ("NVIDIA Corporation TU104GL [Tesla T4] (rev a1)")
- V100 VM: N1 series (Intel Skylake) VM with 4 vCPUs and 78GB RAM with one V100 GPU ("NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)")
We created the T4 VM first and gave it a large number of cores (48 vCPUs) in case it needed additional CPU resources to keep the GPU fully utilized. After running Whisper through an initial set of benchmarks, however, we determined that across all model sizes and both GPUs, Whisper uses only a single CPU core after the first few seconds. We therefore resized the V100 VM down to 4 vCPUs and observed no speed decrease, and tested the T4 under different CPU sizes, again with no performance change. Thus, while the benchmarks here are based on these two differently-sized CPU configurations, Whisper can be safely run on a VM with just 4 vCPUs and possibly as few as 2. We used only a single GPU in each system.
Whisper was installed on both machines using the following install sequence, adapted from OpenAI's documentation and expanded to address installation needs specific to the GCE Debian deep learning VM image:
#reinstall NVidia driver if needed
/opt/deeplearning/install-driver.sh
apt-get -y update
apt-get -y upgrade
apt-get -y install parallel
apt-get -y install zip
apt -y install build-essential
apt-get -y install python-pip
pip install --upgrade setuptools
apt -y install python-dev
pip install setuptools-rust
apt-get -y install ffmpeg
pip install ffmpeg
pip install --upgrade pip
pip3 install --upgrade pip
pip install torch
pip3 install regex
pip3 install transformers
pip3 install git+https://github.com/openai/whisper.git
#verify GPU
lspci | grep -i nvidia
#verify GPU is being utilized
nvidia-smi -l 2
We extracted the broadcast audio into a single-channel FLAC file to mimic our typical processing pipeline:
time find *.mp4 | parallel --eta 'ffmpeg -nostdin -hide_banner -loglevel panic -i ./{} -filter_complex "[0:a]channelsplit=channel_layout=stereo:channels=FL[left]" -map "[left]" -f flac ./{.}.flac'
We then ran Whisper using its built-in CLI in both transcription and translation mode without adjusting any of its parameters, following OpenAI's documentation:
time whisper ./RUSSIA1_20221013_143000_60_minut.flac --model tiny > RUSSIA1_20221013_143000_60_minut.tiny.transcript.txt
time whisper ./RUSSIA1_20221013_143000_60_minut.flac --model tiny --task translate > RUSSIA1_20221013_143000_60_minut.tiny.translate.txt
Although all kernel-level user quotas were disabled, we took the additional step of running Whisper as root on both VMs to sidestep any additional factors that might have throttled performance. We timed each run using "time" to measure its total inference time from start to finish. We monitored each run and, once its initial burst of CPU activity completed and it was entirely GPU-dependent, measured its CPU and system memory utilization (both VIRT and RES) using "top" and its GPU utilization and memory consumption using "nvidia-smi -l 2", sampling every 2 seconds. Below is an example of the GPU statistics for the two GPUs during benchmark runs with the Large model:
#RUNNING "nvidia-smi -l 2" ON T4 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 | | N/A 77C P0 69W / 70W | 11133MiB / 15360MiB | 98% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 3374 C /usr/bin/python3 11131MiB | +-----------------------------------------------------------------------------+ #RUNNING "nvidia-smi -l 2" ON V100 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 | | N/A 61C P0 261W / 300W | 11373MiB / 16384MiB | 93% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 26479 C /usr/bin/python3 11371MiB | +-----------------------------------------------------------------------------+
Runtime & Resource Consumption
How long did each model and task take to complete and what was its resource consumption? We ran each model/task twice in a row to measure inference time variability and monitored each run using "top" until system memory consumption stopped increasing and GPU utilization appeared to stabilize (this typically occurs within the first 20 seconds). Given that GPU utilization varies second-by-second, we took the median value over a 10-second interval about a minute after start and found that this accurately captured the median utilization over the run. GPU memory and utilization statistics were collected via "nvidia-smi" and host memory via "top".
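We collected these measurements by hand, but the same sampling procedure is easy to script. The minimal sketch below (not the tool we actually used) polls nvidia-smi's query interface every 2 seconds over a roughly 10-second window and reports the median GPU utilization and memory use; the sample count and interval are our illustrative choices.

import statistics
import subprocess
import time

def sample_gpu_stats(samples=5, interval=2.0):
    """Poll nvidia-smi and return median GPU utilization (%) and memory used (MiB)."""
    utils, mems = [], []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        util, mem = out.strip().split(", ")
        utils.append(int(util))
        mems.append(int(mem))
        time.sleep(interval)
    return statistics.median(utils), statistics.median(mems)

if __name__ == "__main__":
    util, mem = sample_gpu_stats()
    print(f"median GPU utilization: {util}%  median GPU memory: {mem} MiB")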
The table below summarizes all of these statistics. Each row is a model+task configuration such as "Tiny: Transcribe" or "Tiny: Translate". We report runtimes via "time" for both the first and second run. We did not observe any significant change in host or GPU memory consumption or utilization, so those are only reported once. For host memory we report both VIRT and RES values. In all cases, CPU utilization was 100% of a single vCPU so is not included in the table below.
When evaluating runtime, remember that the source broadcast is 2.5 hours long.
| Model + Task | Runtime (T4) | Runtime (T4, 2nd Run) | Runtime (V100) | Runtime (V100, 2nd Run) | Host Mem (T4) (VIRT/RES) | Host Mem (V100) (VIRT/RES) | GPU Mem (T4) | GPU Mem (V100) | GPU Use (T4) | GPU Use (V100) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tiny: Transcribe | 7m28.443s | 10m11.949s | 9m59s | 16m9.023s | 12.9GB/2.9GB | 10.6GB/2.9GB | 2847MiB | 1985MiB | 56% | 40% |
| Tiny: Translate | 30m51s | 31m12.351s | 27m46s | 10m43.076s | 11.6GB/2.9GB | 10.7GB/2.9GB | 1627MiB | 2037MiB | 57% | 42% |
| Small: Transcribe | 31m32.665s | 31m12.159s | 10m59.553s | 11m35.576s | 14.4GB/3.5GB | 13.3GB/3.6GB | 3137MiB | 3389MiB | 88% | 63% |
| Small: Translate | 25m31.885s | 26m9.785s | 13m41.689s | 13m41.689s | 14.4GB/3.5GB | 13.3GB/3.6GB | 3131MiB | 3383MiB | 92% | 60% |
| Medium: Transcribe | 82m57s | 81m21.527s | 31m52.593s | 30m56.096s | 20.2GB/6.0GB | 19.5GB/6.0GB | 6061MiB | 6313MiB | 96% | 87% |
| Medium: Translate | 58m8.776s | 61m16.805s | 23m44s | 24m4.719s | 20.6GB/6.0GB | 19.5GB/6.0GB | 6073MiB | 6325MiB | 96% | 87% |
| Large: Transcribe | 148m18.460s | 148m45.548s | 51m58.015s | 51m34.380s | 30.9GB/9.7GB | 30.1GB/9.7GB | 11133MiB | 11385MiB | 98% | 90% |
| Large: Translate | 111m50.249s | 111m24.323s | 38m30.437s | 38m29.795s | 30.7GB/9.7GB | 30.1GB/9.7GB | 11121MiB | 11373MiB | 98% | 93% |
Key findings from the table above:
- Translation Is Faster Than Transcription. Surprisingly, translation is substantially faster than transcription for all models but Tiny, sometimes by as much as 2x.
- Lengthy Inference Time. The V100 ran roughly twice as fast as the T4, but even the V100 required significant runtimes to process this 2.5-hour broadcast across all models, in keeping with their size. Transcription time on the T4 using the Large model was almost exactly realtime (2.5 hours), while on the V100 it was 2.8x realtime. Translation was faster on both GPUs for both the Medium and Large models, roughly 1.3x realtime on the T4 and 3.8x realtime on the V100 for the Large model. If both transcription and translation are required, runtime is 0.57x realtime on the T4 and 1.6x realtime on the V100 for the Large model (see the sketch after this list for how these multiples are computed). Processing 24/7 audio content, such as live television news batched hourly, would require substantial GPU resources and incur substantial latency.
- Full GPU Utilization. Other than the Tiny model for both GPUs and the Small model for the V100, the GPU is nearly fully utilized, meaning each GPU must be dedicated to a single broadcast, rather than running multiple broadcasts in parallel. This dramatically increases the hardware required for 24/7 processing.
- High Memory Consumption. While Whisper uses only a single CPU core after its initial CPU burst, its system memory consumption reaches 10GB for the Large model. While negligible by cloud standards, this does preclude the use of the "n1-highcpu-4" VM type and requires at least an "n1-standard-4" (15GB RAM). On the GPU side, the T4 had 15360MiB of available memory, while the V100 had 16384MiB. That is sufficient for two Medium models or a single Large model, though GPU utilization is already at saturation, so the ability to fit two Medium models at once yields no benefit. This also suggests that GPUs with smaller amounts of memory will not be sufficient to run these models.
- Tiny Model Underutilizes GPU. Inference time varied dramatically between runs for the Tiny model on both GPUs, and the model utilized only about half of either GPU. Performing translation with the Tiny model was the same speed as, or slower than, the Small model; the GPUs were effectively starved.
- Consistent Inference Times. Overall, the inference times were highly similar when running the same model+task on the same system twice back-to-back. The only exception was the Tiny model.
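For reference, the realtime multiples quoted in the list above are simply the 150-minute broadcast length divided by each runtime. A minimal sketch of that arithmetic, with the Large-model runtimes hardcoded from the table above purely for illustration (the printed values will differ slightly from the rounded figures quoted in the text):

# Realtime multiple = broadcast length / inference time.
BROADCAST_MINUTES = 150.0  # the broadcast is 2.5 hours long

large_runtimes_minutes = {
    "T4 transcribe":   148 + 18.460 / 60,
    "T4 translate":    111 + 50.249 / 60,
    "V100 transcribe":  51 + 58.015 / 60,
    "V100 translate":   38 + 30.437 / 60,
}

for name, minutes in large_runtimes_minutes.items():
    print(f"{name}: {BROADCAST_MINUTES / minutes:.2f}x realtime")

# Transcription and translation back-to-back on the same GPU:
t4_both = large_runtimes_minutes["T4 transcribe"] + large_runtimes_minutes["T4 translate"]
v100_both = large_runtimes_minutes["V100 transcribe"] + large_runtimes_minutes["V100 translate"]
print(f"T4 both tasks: {BROADCAST_MINUTES / t4_both:.2f}x realtime")
print(f"V100 both tasks: {BROADCAST_MINUTES / v100_both:.2f}x realtime")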
To test this consistency more generally, we ran the Small translate task 8 times in sequence, the Medium transcribe 3 times in sequence and the Medium translate 3 times in sequence, all on the V100. The runtimes below are extremely consistent, suggesting strong inference time stability in the cloud environment:
#V100 SMALL:TRANSLATE
14m58.270s
15m11.418s
14m24.548s
13m31.535s
14m23.213s
12m35.954s
14m30.396s
14m13.713s

#V100 MEDIUM:TRANSCRIBE
34m37.440s
32m55.528s
35m15.047s

#V100 MEDIUM:TRANSLATE
25m17.214s
24m23.683s
24m38.222s
Transcription & Translation Results
What do the actual transcription and translation results look like? Our brief exploration earlier this week suggested that the larger models produce substantially more fluent results, but how does this trend hold up on a full-length broadcast? Below are the transcriptions and translations produced by each model, to make them easy to compare:
- Tiny: Transcription.
- Tiny: Translation.
- Small: Transcription.
- Small: Translation.
- Medium: Transcription.
- Medium: Translation.
- Large: Transcription.
- Large: Translation.
Russian speakers can compare the results above to the original broadcast in the Visual Explorer and we'd love to hear your evaluations.
Whisper's ability to translate speech from 99 languages into English has attracted considerable interest from the academic and journalism communities, so how do the translations of the four models compare? In this case we do not have a gold-standard translation against which to evaluate accuracy, but we can at least observe differences among them.
We've included the first 5 minutes of the translation from each of the four models to make it easy to see some of their key differences.
Tiny
The smallest of the four models, Tiny produces a translation that is difficult to read, filled with errors, oddities and hallucinations, such as "Samsung", "Guam LCD Yes BCB 71-600 composite" and "electric anti-attent wars athlete". It struggles with the material to the point that its translation is difficult to follow, and it misses a majority of the key statements from the broadcast, especially with regard to the European Union's response in the second half of the five-minute period.
[00:05.800 --> 00:11.360] I don't know what's remaining. [00:11.360 --> 00:14.040] Samsung. [00:35.320 --> 00:38.900] Hello, dear friends! [00:39.780 --> 00:42.420] Hello, my dear friends, puppets and sale routines. [00:42.660 --> 00:44.420] Hello, all, I'll take out my special programs and [00:44.620 --> 00:46.260] the 60-minute notes. [00:46.820 --> 00:48.300] Good to go. [00:48.500 --> 00:51.700] If it'slets from Europe theorem, [00:52.420 --> 00:55.180] I don't have the impression of such preferences [00:55.340 --> 00:57.380] about your owners' desires, [00:58.100 --> 01:01.420] instead of having differentPre我說fs. [01:01.420 --> 01:05.100] that all the Russian army will not be able to do so. [01:05.100 --> 01:08.060] It is also the same as the French president. [01:08.060 --> 01:12.060] Macron has said that he is responsible for the nuclear power of Ukraine, [01:12.060 --> 01:13.500] France does not stand there. [01:13.500 --> 01:16.300] He is the enemy of the world of war. [01:16.300 --> 01:18.540] Why the Macron is not the enemy of the world? [01:18.540 --> 01:22.860] All the Western press is already a plan for the case of the nuclear war. [01:22.860 --> 01:31.100] He is the one who is responsible for the nuclear power of the nuclear power of the nuclear power of the nuclear power of the country. [01:31.100 --> 01:33.860] Towards the orom Star of the United States, [01:33.860 --> 01:36.780] let the Ukrainian Party rest for theura of Syria, Pii, [01:36.780 --> 01:39.980] the primeplicative war, war, [01:39.980 --> 01:43.080] the electric anti-attent wars athlete, [01:43.080 --> 01:45.020] etcetera. [01:45.020 --> 01:48.800] All wars in the world are responsible until now, [01:48.800 --> 01:53.380] but the 1st hot conference in Berlin today has become a capital and cease division [01:53.380 --> 01:57.880] of those fake accounts from the battlefield, [01:57.880 --> 02:00.340] personally left out, [02:00.340 --> 02:01.620] the military personnel. [02:01.620 --> 02:03.880] He came up with some of the most challenging [02:03.880 --> 02:04.720] experiments. [02:04.720 --> 02:08.580] Guam LCD Yes BCB 71-600 composite [02:08.600 --> 02:09.820] designed in European based traffic [02:09.920 --> 02:12.260] it is at insecticide which is [02:12.260 --> 02:15.200] 160 kilograms of торlething [02:15.200 --> 02:16.620] the engine [02:16.620 --> 02:27.220] three thousand speaking [02:27.640 --> 02:31.220] including the car [02:31.440 --> 02:34.360] what is Nissan Yadif in the market [02:34.360 --> 02:38.360] It's not a means to kill them, or to kill them. [02:46.360 --> 02:49.360] Putin believes that he doesn't fly, but doesn't fly. [02:49.360 --> 02:52.360] And we, and the European Union, with the United States, [02:52.360 --> 02:54.360] we believe that any nuclear power in Ukraine, [02:54.360 --> 02:56.360] we believe that we don't have nuclear weapons, [02:56.360 --> 02:58.360] our troops will only be able to fly, [02:58.360 --> 03:00.360] so that all the Russian army will not fly. [03:00.360 --> 03:04.360] But he is contented, a little longer. [03:17.360 --> 03:21.360] President�� General Lajanov was attacked the General Gates [03:21.360 --> 03:22.420] of General otros territory since his death [03:22.420 --> 03:24.360] and owned by nervous people. [03:24.380 --> 03:27.380] But still the time spent by the foreign countries [03:27.380 --> 03:29.860] preparing for their return, [03:29.860 --> 03:34.180] This is a minute, it's decomposed, a strong one. 
[03:51.200 --> 03:52.560] It's designed for this outfit. [03:53.200 --> 03:57.520] In fact, this suit will cover this body with it!!! [04:57.520 --> 04:59.520] In fact, in the first year of the first year,
Small
These results are markedly better than the Tiny model's and far more understandable, but include oddities like "My mother Borrel", "the Genesek of the Allianz, Stoltenberg" and "There is also an opening of the US Newsweek". There is also a repetition sequence, with "Putin is saying that he is not bluffing" appearing three times in a row. This version does, however, capture the overall gist of the broadcast, including the European Union's response.
[00:00.000 --> 00:13.000] I know it was, I know it will be, this is with me. [00:30.000 --> 00:51.000] Hello, dear friends and gentlemen, hello in the live broadcast of the program 60 minutes about the most important, some nuclear evening today the light of European diplomacy [00:51.000 --> 01:05.000] My mother Borrel said that the nuclear attack against Ukraine will not cause the response of the nuclear attack on Russia, instead there will be so powerful military answers that the entire Russian army will be destroyed. [01:05.000 --> 01:16.000] About the same thing the French president said, Macron said that the answer to the nuclear attack on Ukraine will not be France, and personally it is against the world war. [01:16.000 --> 01:23.000] However, no one asks Macron, the entire Western press is already writing a plan for the case of the nuclear war. [01:23.000 --> 01:31.000] Reuters reports that NATO is preparing attention to a physical response to the nuclear attack from Russia. [01:31.000 --> 01:38.000] There is also an opening of the US Newsweek, which is closer to Ukraine to victory and closer to the nuclear war. [01:38.000 --> 01:47.000] The former director of the CERU and the exploration of the Ministry of Defense of NATO calls for the Pentagon to be ready to use all the military means against Russia. [01:47.000 --> 01:52.000] Well, the Genesek of the Allianz, Stoltenberg, has already announced the study of the Cold War. [01:52.000 --> 01:57.000] The study is based on the development of the scenario of the beginning of the nuclear war. [01:57.000 --> 02:10.000] It is not difficult to guess with whom. The essence of the nuclear maneuvers of NATO is the use of the US nuclear bomb B61, which is located in the European border. [02:10.000 --> 02:16.000] This is a second to 170 kilograms of trotilla equivalent. [02:16.000 --> 02:20.000] 13 times more powerful, the bomb is dropped to Russia. [02:20.000 --> 02:31.000] Before this, the United Nations and the US Army will study the rapid transportation of nuclear weapons from their undergrounds and their attachment to the fighter-bomber. [02:31.000 --> 02:38.000] It is remarkable that the launch of the nuclear strikes against Russia is working out, including the German Luftwaffe. [02:38.000 --> 02:39.000] Insane. [02:39.000 --> 02:45.000] Putin is saying that he is not bluffing. [02:45.000 --> 02:47.000] Putin is saying that he is not bluffing. [02:47.000 --> 02:49.000] Putin is saying that he is not bluffing. [02:49.000 --> 02:52.000] But we are not bluffing, the European Union, the United States and NATO. [02:52.000 --> 02:55.000] We will respond to any nuclear strike in Ukraine. [02:55.000 --> 02:57.000] We will not respond to nuclear weapons. [02:57.000 --> 03:18.000] The United States will respond so strongly that the entire Russian army will be destroyed. [03:18.000 --> 03:28.000] Genesek Nath Jens Stoltenberg called the nuclear threats of Russia to be excessive and irresponsible, but at the same time next week will be the study of the NATO nuclear forces. [03:28.000 --> 03:30.000] How does it even understand? [03:30.000 --> 03:33.000] Yes, this is the answer to the NATO threat to Putin. [03:33.000 --> 03:50.000] This is the answer to the NATO threat to Putin. [03:50.000 --> 04:04.000] At these studies, the use of nuclear weapons will really be processed in case of escalation. 
[04:04.000 --> 04:21.000] At the beginning of the launch of the nuclear missile from underground shafts and bunkers, the bombers will also be able to carry nuclear weapons on board, as well as aircraft carriers and refueling aircraft. [04:21.000 --> 04:35.000] We are talking about the American nuclear missiles and bombs from the US arsenal, and at the moment they are placed in Europe, Turkey, North Italy, Belgium and the Netherlands. [04:35.000 --> 04:51.000] This is where the 33rd airfield of the German Luftwaffe is located. It is counted as 1,900 soldiers at once. [04:51.000 --> 05:05.000] At the beginning of the launch of the nuclear missile from Angara, the fighters of the Tornado bombers will start. But since they were placed on the arms of the Bundeswehr in 1981, they are gradually replaced by the new American F-35.
Medium
These results are even more fluent and understandable, but have their own oddities. Here, the Small model's "My mother Borrel" becomes "Sveta, who is a European diplomat, Borel", while "There is also an opening of the US Newsweek" becomes "The American Newsweek is also opening up". The model also stumbles over several words.
[00:00.000 --> 00:03.000] With a triple loop [00:06.000 --> 00:08.000] I know it was [00:08.000 --> 00:10.000] I know it will be [00:10.000 --> 00:39.000] This is with me [00:39.000 --> 00:42.000] Hello, dear friends, ladies and gentlemen, comrades. [00:42.000 --> 00:43.000] Hello. [00:43.000 --> 00:47.000] On the air, a special issue of the program 60 minutes about the most important. [00:47.000 --> 00:49.000] Some nuclear evening today. [00:49.000 --> 00:57.000] Sveta, who is a European diplomat, Borel said that a nuclear attack against Ukraine will not cause a counter nuclear attack on Russia. [00:57.000 --> 01:04.000] Instead, there will be such a powerful military response that the entire Russian army will be destroyed. [01:04.000 --> 01:13.000] About the same thing, the French president said. Macron said that France will not respond to a nuclear strike on Ukraine. [01:13.000 --> 01:16.000] And personally, he is against the world war. [01:16.000 --> 01:18.000] However, no one asks Macron. [01:18.000 --> 01:22.000] The entire Western press is already writing a plan for a nuclear war. [01:22.000 --> 01:30.000] Reuters reports that NATO is preparing for a physical response to a nuclear strike from Russia. [01:30.000 --> 01:34.000] The American Newsweek is also opening up. [01:34.000 --> 01:38.000] The closer Ukraine is to victory, the closer the nuclear war is. [01:38.000 --> 01:47.000] Former director of the CIA and ex-head of the US Department of Defense, Panetta, calls on the Pentagon to be ready to use all military means against Russia. [01:47.000 --> 01:48.000] Well, the Alliance Agency. [01:48.000 --> 01:52.000] Stoltenberg today announced the study of Stoikiy Polgin. [01:52.000 --> 01:57.000] These studies on the development of the scenario of the beginning of a nuclear war. [01:57.000 --> 02:00.000] With whom it is not difficult to guess. [02:00.000 --> 02:10.000] The essence of NATO nuclear maneuvers is to work out the use of American nuclear bombs B61, located in European bases, on the border territory of Russia. [02:10.000 --> 02:16.000] This is up to 170 kilograms per second in the Trotilov equivalent. [02:16.000 --> 02:20.000] Thirteen times more powerful than nuclear bombs dropped on Hiroshima. [02:20.000 --> 02:31.000] Before that, the allies and the US Army will learn fast transportation of nuclear weapons from their territories and their attachment to the bomber fighters. [02:31.000 --> 02:38.000] It is noteworthy that the application of nuclear strikes on Russia is being worked out, including the German Luftwaffe. [02:38.000 --> 02:39.000] Madness. [02:39.000 --> 02:47.000] Putin is saying that he is not bluffing. [02:47.000 --> 02:50.000] But we are not bluffing either. [02:50.000 --> 02:52.000] The European Union, the United States and NATO. [02:52.000 --> 02:54.000] We will respond to any nuclear strike on Ukraine. [02:54.000 --> 02:56.000] We will respond with non-nuclear weapons. [02:56.000 --> 03:17.000] Our troops will respond so powerfully that the entire Russian army will be destroyed. [03:17.000 --> 03:23.000] NATO Secretary General Jens Stoltenberg called Russia's nuclear threat excessive and irresponsible. [03:23.000 --> 03:28.000] But at the same time, next week will be the exercise of the nuclear forces of NATO countries. [03:28.000 --> 03:30.000] How to understand this at all? [03:30.000 --> 03:33.000] Yes, this is the response of NATO to the threats of Putin. [03:33.000 --> 03:34.000] Putin's threats. 
[03:34.000 --> 03:58.000] In these exercises, it will indeed be worked out the use of nuclear weapons in the event of an escalation. [03:58.000 --> 04:08.000] For half a day, Germany and 13 other countries of the alliance will take part in the exercise of the preparation for the launch of nuclear missiles from underground mines and bunkers. [04:08.000 --> 04:21.000] In addition, bomber fighters will be involved, capable of carrying nuclear weapons on board, as well as reconnaissance planes and refueling planes. [04:21.000 --> 04:31.000] We are talking about American nuclear missiles and bombs from the US arsenal. At the moment, they are located in Europe, Turkey, Northern Italy, Belgium and the Netherlands. [04:31.000 --> 04:34.000] We do not forget that they are, of course, also located in Germany. [04:34.000 --> 04:39.000] To be precise, on the military airbase Bühel in the federal land of Rheinland-Pfalz. [04:39.000 --> 04:44.000] It is there that the 33rd German Luftwaffe aviation regiment is located. [04:44.000 --> 04:54.000] It has 1,900 soldiers at once. There are four special hangars on the Bühel airbase, from where the bombers of the Tornado will start. [04:54.000 --> 05:02.000] But since they were put into service in 1981, they are gradually replaced by the new American F-35.
Large
Finally, the largest of the four models offers the most fluent and understandable translation. However, in some cases its translation is slightly less clear than the Medium model's. For example, the Small model's "Well, the Genesek of the Allianz, Stoltenberg, has already announced the study of the Cold War" becomes "Stoltenberg today announced the study of Stoikiy Polgin" in the Medium model and then "Stoltenberg announced today about the training of a half-day stand" in the Large model, which is less clear. The Large model also includes the line "reconnaissance aircraft and refueling aircraft. But also surveillance and tank aircraft", which contains the unknown phrase "tank aircraft" and repeats "reconnaissance" as "surveillance", while the Medium model has only "reconnaissance planes and refueling planes" without the "But also surveillance and tank aircraft".
[00:00.000 --> 00:07.000] I know it was, I know it will be. It's with me. [00:30.000 --> 00:43.000] Hello, dear friends, ladies and gentlemen, comrades. Hello. [00:43.000 --> 00:47.000] On the air, a special edition of the 60 Minutes program with the most important thing. [00:47.000 --> 00:49.000] Some kind of nuclear evening today. [00:49.000 --> 00:57.000] Sveta, whose European diplomacy is Borrel, said that a nuclear attack against Ukraine will not cause a counter-nuclear attack on Russia. [00:57.000 --> 01:05.000] Instead, there will be such powerful military responses that the entire Russian army will be destroyed. [01:05.000 --> 01:08.000] The French president said about the same thing. [01:08.000 --> 01:16.000] Macron said that France will not be responsible for a nuclear strike on Ukraine, and personally he is against a world war. [01:16.000 --> 01:23.000] However, no one asks Macron. The entire Western press is already writing a plan for a nuclear war. [01:23.000 --> 01:31.000] Reuters reports that NATO is preparing for a physical response to a nuclear strike by Russia. [01:31.000 --> 01:34.000] The American Newsweek is even more frank. [01:34.000 --> 01:38.000] The closer Ukraine is to victory, the closer a nuclear war. [01:38.000 --> 01:47.000] The former director of the CIA and former US Defense Secretary Panetta calls on the Pentagon to be ready to use all military resources against Russia. [01:47.000 --> 01:53.000] Stoltenberg announced today about the training of a half-day stand. [01:53.000 --> 01:57.000] These are exercises to work out the scenario of the beginning of a nuclear war. [01:57.000 --> 02:00.000] It is not difficult to guess with whom. [02:00.000 --> 02:10.000] The essence of NATO's nuclear maneuvers is to work out the use of American nuclear bombs B61 placed in the storage facilities of European bases along the border with Russia. [02:10.000 --> 02:16.000] This is up to 170 kilograms per second in the Trotilovo equivalent. [02:16.000 --> 02:20.000] 13 times more powerful than the atomic bomb dropped on Hiroshima. [02:20.000 --> 02:31.000] Before that, the US allies and the US Army will learn to quickly transport nuclear weapons from their dungeons and attach them to bomber fighters. [02:31.000 --> 02:38.000] It is noteworthy that the implementation of nuclear strikes on Russia is being worked out by the German Luftwaffe. [02:38.000 --> 02:50.000] Putin assures that he does not bluff, but we do not bluff either. [02:50.000 --> 02:52.000] The European Union, the United States and NATO. [02:52.000 --> 02:54.000] We will respond to any nuclear strike on Ukraine. [02:54.000 --> 02:56.000] We will respond with non-nuclear weapons. [02:56.000 --> 03:18.000] Our troops will respond so powerfully that the entire Russian army will be annihilated. [03:18.000 --> 03:26.000] NATO General Secretary Jens Stoltenberg called Russia's nuclear threats excessive and irresponsible. [03:26.000 --> 03:28.000] But at the same time, next week there will be exercises of the nuclear forces of NATO countries. [03:28.000 --> 03:30.000] How to understand this at all? [03:30.000 --> 03:52.000] Yes, this is the answer of NATO to Putin's threats. [03:52.000 --> 04:00.000] In these exercises, the use of nuclear weapons in the event of an escalation will indeed be worked out. [04:00.000 --> 04:04.000] In the exercises, the German troops and 13 more countries of the Alliance will take part. 
[04:04.000 --> 04:08.000] Preparation for the launch of nuclear missiles from underground mines and bunkers will be worked out. [04:08.000 --> 04:18.000] In addition, bomber fighters will be involved, capable of carrying nuclear weapons on board, as well as reconnaissance aircraft and refueling aircraft. [04:18.000 --> 04:20.000] But also surveillance and tank aircraft. [04:20.000 --> 04:26.000] We are talking about American nuclear missiles and bombs from the US arsenal. [04:26.000 --> 04:32.000] At the moment, they are located in Europe, in Turkey, Northern Italy, Belgium, and the Netherlands. [04:32.000 --> 04:34.000] Let's not forget that they are also located in Germany. [04:34.000 --> 04:40.000] To be precise, at the Büchel military airbase in the Federal Land of Rheinland-Pfalz, [04:40.000 --> 04:44.000] the 33rd German Luftwaffe Aviation Regiment is deployed there. [04:44.000 --> 04:48.000] It has 1900 soldiers at once. [04:48.000 --> 04:54.000] At the Büchel airbase, there are four special hangars from which the Tornado bomber fighters start. [04:54.000 --> 05:02.000] But since they were put on the arms of the Bundeswehr in 1981, they are gradually replaced by the new American F-35.
The Tiny model produced results that were difficult to understand and less useful as a human-viewable translation of the broadcast. The Small model produced fluent, though stilted, results with one area of repetition, while the Medium and Large models produced noticeably more fluent and understandable results. Yet, while the Large model produced more understandable results than the Medium model for some lines, its translations of other lines were less readily understandable. Overall, the Small model produces serviceable results for at-scale translation, the Medium model offers a significant and meaningful improvement, and the Large model offers further refinement.
Transcription & Translation Results: The Challenges Of Non-Determinism
In the process of running the earlier timing benchmarks, where we ran each model+task multiple times to assess variability in inference time, we noticed something very different from traditional ASR and ASR+translation systems: the lengths of the output files differed each time we ran Whisper, in some cases by a substantial amount. In other words, when we ran Medium translate repeatedly on the same file on the same machine, we got different output each time, meaning the model is non-deterministic in its default configuration, in contrast to traditional ASR systems that produce the same result on the same file every time.
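One likely contributor, which we have not yet benchmarked in isolation, is Whisper's temperature fallback: by default the decoder starts greedily at temperature 0 but re-decodes a segment by sampling at progressively higher temperatures whenever it fails internal compression-ratio or log-probability checks, introducing randomness on precisely the hardest segments. The sketch below shows how the Python API can, in principle, be pinned to a single temperature of 0; the model size and file path are just illustrations from the pipeline above, and whether this fully eliminates the variation we observed remains to be tested.

import whisper

# A sketch, not the configuration used for the benchmarks above: pinning the
# decoder to a single temperature of 0 disables the stochastic fallback to
# higher temperatures, which should make decoding greedy and repeatable.
model = whisper.load_model("medium")  # "tiny", "small", "medium" or "large"

result = model.transcribe(
    "RUSSIA1_20221013_143000_60_minut.flac",  # path from the pipeline above
    task="translate",   # omit for transcription
    temperature=0.0,    # single value: greedy decoding, no fallback sampling
)

# Print segments in roughly the same "[start --> end] text" style as the CLI.
for seg in result["segments"]:
    print(f"[{seg['start']:07.3f} --> {seg['end']:07.3f}]{seg['text']}")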
You can compare the output of multiple runs below:
Tiny Transcribe
- Run 1 (First hour is "Редактор субтитров А.Кулакова" repeated, which translates to "Subtitle editor A.Kulakova")
- Run 2
- Run 3
- Run 4 (Opens with English "1995" and "Aim for the Tiger")
Tiny Translate
Small Transcribe
Small Translate
- Run 1
- Run 2 (Last hour is "We are all looking together" repeated)
- Run 3
- Run 4
- Run 5
- Run 6
- Run 7
- Run 8
- Run 9
- Run 10
- Run 11
- Run 12
Medium Transcribe
Medium Translate
Large Transcribe
Large Translate
Spot-checking the different runs above, differences are often readily apparent. To make it easier to rigorously assess how different the outputs are, we've produced pairwise comparisons of all of the outputs of each model+task combination using "diff". Whisper often changes how it groups words into sentences across runs, which also causes timestamps to vary; it also means that if Whisper changed how it grouped a single word early in the transcript, every subsequent line would differ. Thus, to compare each pair of outputs, we strip all timecode information, split on spaces to divide the text into words and remove all punctuation appearing at the end of words. This means that "diff" compares two word lists, allowing us to report word-level statistics.
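The sketch below illustrates this normalization and a word-level comparison in Python. The file names are hypothetical placeholders, and because our actual statistics were produced with the Unix "diff" utility rather than Python's difflib, the exact counts it reports may differ slightly from the figures listed below.

import re
import difflib

def normalize(path):
    """Strip timecodes, split on whitespace and drop trailing punctuation,
    returning the word list that the pairwise comparisons operate over."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # remove "[00:05.800 --> 00:11.360]"-style timecode prefixes
            line = re.sub(r"\[[0-9:.]+\s*-->\s*[0-9:.]+\]", " ", line)
            for w in line.split():
                words.append(w.rstrip(".,!?;:\"'"))
    return words

def changed_words(path_a, path_b):
    """Count words that do not align between the two normalized word lists."""
    a, b = normalize(path_a), normalize(path_b)
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    same = sum(block.size for block in sm.get_matching_blocks())
    total = max(len(a), len(b))
    return total - same, total

changed, total = changed_words("run1.translate.txt", "run2.translate.txt")
print(f"Changed Words: {changed} / {total} ({changed / total:.2%})")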
Comparison: Tiny Transcribe
- Run1 x Run3: Changed Words: 10866 / 19134 (56.79%)
- Run1 x Run2: Changed Words: 10875 / 19156 (56.77%)
- Run1 x Run4: Changed Words: 10604 / 18901 (56.10%)
- Run3 x Run4: Changed Words: 5943 / 19277 (30.83%)
- Run2 x Run3: Changed Words: 5833 / 19434 (30.01%)
- Run2 x Run4: Changed Words: 5633 / 19438 (28.98%)
Comparison: Tiny Translate
- Run2 x Run3: Changed Words: 25257 / 28731 (87.91%)
- Run2 x Run4: Changed Words: 25943 / 29606 (87.63%)
- Run3 x Run4: Changed Words: 24440 / 28891 (84.59%)
- Run1 x Run4: Changed Words: 24855 / 29621 (83.91%)
- Run1 x Run3: Changed Words: 23752 / 28520 (83.28%)
- Run1 x Run2: Changed Words: 23169 / 28091 (82.48%)
Comparison: Small Transcribe
- Run2 x Run3: Changed Words: 8839 / 18011 (49.08%)
- Run2 x Run4: Changed Words: 8839 / 18011 (49.08%)
- Run1 x Run3: Changed Words: 8775 / 18081 (48.53%)
- Run1 x Run4: Changed Words: 8775 / 18081 (48.53%)
- Run1 x Run2: Changed Words: 2480 / 18383 (13.49%)
- Run3 x Run4: Changed Words: 1 / 10956 (0.01%)
Comparison: Small Translate
- Run2 x Run10: Changed Words: 20547 / 24494 (83.89%)
- Run2 x Run4: Changed Words: 18634 / 23930 (77.87%)
- Run2 x Run5: Changed Words: 15504 / 24224 (64.00%)
- Run2 x Run8: Changed Words: 14642 / 23444 (62.46%)
- Run2 x Run9: Changed Words: 14201 / 23090 (61.50%)
- Run2 x Run6: Changed Words: 13923 / 23148 (60.15%)
- Run2 x Run11: Changed Words: 13988 / 23298 (60.04%)
- Run2 x Run3: Changed Words: 13738 / 22968 (59.81%)
- Run2 x Run7: Changed Words: 13614 / 22767 (59.80%)
- Run1 x Run2: Changed Words: 13341 / 22759 (58.62%)
- Run2 x Run12: Changed Words: 12965 / 22681 (57.16%)
- Run4 x Run9: Changed Words: 9767 / 23976 (40.74%)
- Run4 x Run5: Changed Words: 10042 / 24677 (40.69%)
- Run4 x Run12: Changed Words: 9730 / 24120 (40.34%)
- Run1 x Run4: Changed Words: 9714 / 24195 (40.15%)
- Run4 x Run8: Changed Words: 9758 / 24321 (40.12%)
- Run3 x Run4: Changed Words: 9598 / 23950 (40.08%)
- Run4 x Run7: Changed Words: 9570 / 23897 (40.05%)
- Run8 x Run10: Changed Words: 9782 / 24539 (39.86%)
- Run4 x Run6: Changed Words: 9579 / 24167 (39.64%)
- Run4 x Run11: Changed Words: 9488 / 24079 (39.40%)
- Run1 x Run10: Changed Words: 9577 / 24384 (39.28%)
- Run9 x Run10: Changed Words: 9421 / 24092 (39.10%)
- Run5 x Run10: Changed Words: 9580 / 24675 (38.82%)
- Run6 x Run10: Changed Words: 9358 / 24265 (38.57%)
- Run7 x Run10: Changed Words: 9192 / 24040 (38.24%)
- Run10 x Run11: Changed Words: 9252 / 24247 (38.16%)
- Run3 x Run10: Changed Words: 8887 / 23960 (37.09%)
- Run5 x Run9: Changed Words: 9220 / 24954 (36.95%)
- Run10 x Run12: Changed Words: 8672 / 23910 (36.27%)
- Run8 x Run11: Changed Words: 9001 / 24846 (36.23%)
- Run7 x Run8: Changed Words: 8870 / 24598 (36.06%)
- Run8 x Run9: Changed Words: 8812 / 24605 (35.81%)
- Run5 x Run12: Changed Words: 8861 / 24939 (35.53%)
- Run1 x Run5: Changed Words: 8891 / 25030 (35.52%)
- Run8 x Run12: Changed Words: 8720 / 24668 (35.35%)
- Run5 x Run7: Changed Words: 8753 / 24805 (35.29%)
- Run4 x Run10: Changed Words: 7930 / 22520 (35.21%)
- Run5 x Run11: Changed Words: 8801 / 25047 (35.14%)
- Run3 x Run5: Changed Words: 8745 / 24910 (35.11%)
- Run3 x Run8: Changed Words: 8645 / 24656 (35.06%)
- Run9 x Run11: Changed Words: 8524 / 24367 (34.98%)
- Run7 x Run9: Changed Words: 8438 / 24137 (34.96%)
- Run1 x Run8: Changed Words: 8583 / 24630 (34.85%)
- Run6 x Run9: Changed Words: 8388 / 24343 (34.46%)
- Run6 x Run8: Changed Words: 8488 / 24727 (34.33%)
- Run7 x Run11: Changed Words: 8313 / 24304 (34.20%)
- Run5 x Run6: Changed Words: 8480 / 24976 (33.95%)
- Run6 x Run7: Changed Words: 8207 / 24245 (33.85%)
- Run3 x Run9: Changed Words: 8147 / 24089 (33.82%)
- Run1 x Run7: Changed Words: 8146 / 24199 (33.66%)
- Run1 x Run9: Changed Words: 8110 / 24172 (33.55%)
- Run9 x Run12: Changed Words: 8069 / 24089 (33.50%)
- Run7 x Run12: Changed Words: 8040 / 24112 (33.34%)
- Run1 x Run11: Changed Words: 8107 / 24402 (33.22%)
- Run1 x Run3: Changed Words: 7791 / 24214 (32.18%)
- Run3 x Run11: Changed Words: 7766 / 24244 (32.03%)
- Run6 x Run11: Changed Words: 7772 / 24317 (31.96%)
- Run1 x Run6: Changed Words: 7731 / 24295 (31.82%)
- Run1 x Run12: Changed Words: 7664 / 24156 (31.73%)
- Run3 x Run7: Changed Words: 7550 / 23899 (31.59%)
- Run3 x Run12: Changed Words: 7597 / 24057 (31.58%)
- Run6 x Run12: Changed Words: 7624 / 24174 (31.54%)
- Run11 x Run12: Changed Words: 7535 / 24237 (31.09%)
- Run3 x Run6: Changed Words: 7472 / 24052 (31.07%)
- Run5 x Run8: Changed Words: 6215 / 24504 (25.36%)
Comparison: Medium Transcribe
- Run3 x Run4: Changed Words: 4470 / 18551 (24.10%)
- Run4 x Run6: Changed Words: 4338 / 18491 (23.46%)
- Run2 x Run4: Changed Words: 4108 / 18508 (22.20%)
- Run4 x Run7: Changed Words: 3996 / 18396 (21.72%)
- Run1 x Run4: Changed Words: 3908 / 18282 (21.38%)
- Run4 x Run5: Changed Words: 3895 / 18501 (21.05%)
- Run3 x Run7: Changed Words: 3725 / 18180 (20.49%)
- Run1 x Run6: Changed Words: 3260 / 18038 (18.07%)
- Run3 x Run6: Changed Words: 3138 / 18102 (17.34%)
- Run5 x Run7: Changed Words: 2825 / 18259 (15.47%)
- Run1 x Run5: Changed Words: 2715 / 18201 (14.92%)
- Run2 x Run7: Changed Words: 2560 / 17965 (14.25%)
- Run1 x Run2: Changed Words: 2504 / 17981 (13.93%)
- Run3 x Run5: Changed Words: 2440 / 18273 (13.35%)
- Run1 x Run7: Changed Words: 2318 / 17717 (13.08%)
- Run1 x Run3: Changed Words: 2299 / 17787 (12.93%)
- Run2 x Run3: Changed Words: 2290 / 18090 (12.66%)
- Run6 x Run7: Changed Words: 2182 / 17593 (12.40%)
- Run5 x Run6: Changed Words: 2163 / 18140 (11.92%)
- Run2 x Run6: Changed Words: 1866 / 17875 (10.44%)
- Run2 x Run5: Changed Words: 1769 / 18135 (9.75%)
Comparison: Medium Translate
- Run1 x Run6: Changed Words: 5510 / 22795 (24.17%)
- Run1 x Run5: Changed Words: 5464 / 22751 (24.02%)
- Run1 x Run7: Changed Words: 5390 / 23036 (23.40%)
- Run2 x Run5: Changed Words: 5164 / 22994 (22.46%)
- Run1 x Run4: Changed Words: 5088 / 22755 (22.36%)
- Run2 x Run7: Changed Words: 5119 / 23238 (22.03%)
- Run2 x Run6: Changed Words: 4931 / 22922 (21.51%)
- Run1 x Run3: Changed Words: 4660 / 22698 (20.53%)
- Run2 x Run4: Changed Words: 4463 / 22925 (19.47%)
- Run2 x Run3: Changed Words: 4191 / 22882 (18.32%)
- Run5 x Run7: Changed Words: 4042 / 22456 (18.00%)
- Run6 x Run7: Changed Words: 3974 / 22467 (17.69%)
- Run4 x Run7: Changed Words: 3765 / 22492 (16.74%)
- Run4 x Run6: Changed Words: 3310 / 22074 (15.00%)
- Run3 x Run6: Changed Words: 3291 / 22144 (14.86%)
- Run4 x Run5: Changed Words: 3221 / 22038 (14.62%)
- Run3 x Run5: Changed Words: 3223 / 22116 (14.57%)
- Run3 x Run7: Changed Words: 3186 / 22387 (14.23%)
- Run1 x Run2: Changed Words: 3003 / 22487 (13.35%)
- Run3 x Run4: Changed Words: 2678 / 22044 (12.15%)
- Run5 x Run6: Changed Words: 2122 / 21570 (9.84%)
Comparison: Large Transcribe
- Run1 x Run3: Changed Words: 683 / 17926 (3.81%)
- Run1 x Run4: Changed Words: 683 / 17926 (3.81%)
- Run2 x Run3: Changed Words: 683 / 17926 (3.81%)
- Run2 x Run4: Changed Words: 683 / 17926 (3.81%)
- Run1 x Run2: Changed Words: 0 / 17772 (0.00%)
- Run3 x Run4: Changed Words: 0 / 17769 (0.00%)
Comparison: Large Translate
- Run1 x Run3: Changed Words: 11916 / 21634 (55.08%)
- Run1 x Run4: Changed Words: 11887 / 21604 (55.02%)
- Run1 x Run2: Changed Words: 10784 / 21270 (50.70%)
- Run2 x Run4: Changed Words: 4791 / 22455 (21.34%)
- Run2 x Run3: Changed Words: 4745 / 22452 (21.13%)
- Run3 x Run4: Changed Words: 177 / 20773 (0.85%)
Conclusions
Whisper is a formidable new addition to the ASR landscape, and it demands equally formidable hardware and inference time. Using it to transcribe or translate 24/7 audio streams like broadcast television news requires substantial hardware and incurs extended latency. At the same time, Whisper's open nature offers unique possibilities around domain adaptation and the ability to extend its capabilities and accuracy over time.
The results of the larger models are extremely fluent, often achieving human-like fluency in their English translations. For both transcription and translation, Whisper achieves extremely fluent sentence grouping, punctuation and capitalization, suggesting utility even in transcription-only workflows where realtime human captioning already exists, as a mechanism for enhancing the readability of the human transcription.
One surprising finding that will require further investigation is that, when using Whisper's provided CLI in its default configuration, output across both tasks and all four models is highly non-deterministic, with unpredictable and seemingly random error states, repeating text, dropouts and hallucination. Even in cases where the differences in transcription or translation are minor and convey the same meaning, the non-deterministic output impedes replicability and makes it difficult to interpret narrative structures.
Dropouts, repeating text and hallucination require special attention and further work will be required to identify configurations that minimize or eliminate these behaviors.