Over the past few months we've transcribed more than 65 million minutes of global television news coverage spanning more than 100 channels from over 50 countries using Google's Speech-to-Text Chirp ASR model. How does transcribing speech at these scales compare with the open Whisper model by OpenAI? We'll set aside Whisper's substantial hallucination, dropout and skipover challenges and focus exclusively on the operational costs of running production speech transcription at scale.
GDELT has two primary transcription needs. The first is continuous 24/7 transcription of realtime video streams chunked into variable blocks of 30 minutes to multiple hours. Here latency matters and should be minimized where possible, but we have some flexibility: when latency and cost conflict, cost takes priority. Our second use case is batch transcription of a vast historical archive of millions of hours of speech, where latency is not a primary consideration and cost takes highest priority.
Here we'll focus on the cost of transcribing the continuous near-realtime speech content. The Visual Explorer currently transcribes 25 live channels 24/7, totaling 36,000 minutes of audio per day (25 channels * 24 hours per channel * 60 minutes per hour). Running OpenAI's Whisper locally on V100-equipped VMs, the total cost to transcribe these 25 channels for one year is $218,700. In contrast, GCP's STT V2 API costs $163,680 a year for the 25 channels using immediate processing (audio files are processed immediately upon submission) and just $38,880 a year using dynamic batching (the API backfills the audio files with a slight delay as it has spare capacity). Best of all, the STT V2 API is fully managed, meaning that unlike Whisper there is no need to run a cluster of GPU-equipped VMs, and scalability is nearly infinite. Processing a 150 minute Russian television news broadcast through Whisper takes around 50 minutes on a V100 GPU, while the STT V2 API takes just 4m14s in immediate mode and 4m55s with dynamic batching. With STT V2's batch pricing, transcribing an entire year of 24/7 audio (more than half a million minutes of airtime) costs just $1,577, making it financially tractable for the first time to transcribe even the largest audio archives.
OpenAI's Whisper
Whisper supports full GPU acceleration, so we spun up a brand new "deep learning" GCP Debian image on a quad-core high-memory N1 VM with 4 Skylake virtual cores, 26GB of RAM and a 250GB SSD root disk to support the IO needs of the large MPG and MP4 video files. For the GPU we used a V100 ("NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)"). While there are vastly faster GPUs available on GCP, many have availability restrictions due to high demand, and we wanted a reasonably cheap but performant GPU VM with minimal startup and operational overhead, of the kind a small organization could stand up to get started. We installed the Nvidia driver and the latest version of Whisper and its latest models. Although Whisper is GPU accelerated and uses only a single CPU core, it crashes with out-of-memory errors on anything smaller than a quad-core, 26GB RAM configuration.
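For anyone reproducing this environment, the setup reduces to a handful of commands. The sketch below assumes a fresh GCP Deep Learning VM image (the bundled driver installer path is our assumption and may differ on newer images) and the public openai-whisper package; the sample file name is purely illustrative.

# Minimal setup sketch on a fresh GCP Deep Learning VM image (verify paths against current images)
sudo /opt/deeplearning/install-driver.sh       # the image's bundled NVIDIA driver installer (assumed path)
sudo apt-get install -y ffmpeg                 # Whisper uses ffmpeg to decode MPG/MP4 containers
pip3 install -U openai-whisper                 # installs the whisper CLI and pulls in PyTorch
whisper --model large-v2 --language Russian --task transcribe ./sample.mp4   # first run downloads the model weights; sample.mp4 is a placeholder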
Let's try running an episode of the Russian-language 60 Minutes from last year through Whisper, which, despite its name, runs 150 minutes. We'll run it as both transcription and translation tasks using the original Large-V1 model, the newer Large-V2 model and the Medium model. The two Large models require too much GPU memory to run more than one instance at a time, but the Medium model is small enough that we can fit two on the GPU simultaneously, so we'll also test running two transcription tasks at the same time. Beneath each command we list the system RAM and CPU consumption (in all cases Whisper used a single core; it needs the four cores only for the RAM they bring). Below that is the output of "nvidia-smi -l 2" (2 second samples), and below that the wallclock times of two non-sequential runs (all commands were run in order once and then again). The only exception is the dual-Medium run, for which only a single experiment was conducted.
time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model large-v1 --task translate > RUSSIA1_20230821_143000_60_minut.whisper.largev1.translate.1.txt&
#9.1G RAM - 101% CPU
#| N/A 44C P0 121W / 300W | 11317MiB / 16384MiB | 55%
#34m50.516s
#34m48.254s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model large-v2 --task translate > RUSSIA1_20230821_143000_60_minut.whisper.largev2.translate.1.txt&
#9.1G RAM - 101% CPU
#43C P0 117W / 300W | 11113MiB / 16384MiB | 54%
#39m15.864s
#37m59.183s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model large-v1 --task transcribe > RUSSIA1_20230821_143000_60_minut.whisper.largev1.transcribe.1.txt&
#9.1G RAM - 101% CPU
#44C P0 131W / 300W | 11305MiB / 16384MiB | 53%
#45m36.144s
#44m5.799s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model large-v2 --task transcribe > RUSSIA1_20230821_143000_60_minut.whisper.largev2.transcribe.1.txt
#9.1G RAM - 101% CPU
#43C P0 136W / 300W | 11325MiB / 16384MiB | 53%
#46m46.218s
#50m0.032s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model medium --task translate > RUSSIA1_20230821_143000_60_minut.whisper.medium.translate.1.txt&
#5.4GB / 101% CPU
#42C P0 109W / 300W | 6181MiB / 16384MiB | 42%
#27m14.325s
#32m45.951s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model medium --task transcribe > RUSSIA1_20230821_143000_60_minut.whisper.medium.transcribe.1.txt&
#5.4GB / 101% CPU
#41C P0 91W / 300W | 6243MiB / 16384MiB | 41%
#27m1.389s
#23m54.207s

time whisper ./RUSSIA1_20230821_143000_60_minut.mp4 --language Russian --model medium --task transcribe > RUSSIA1_20230821_143000_60_minut.whisper.medium.transcribe.dual1.txt&
time whisper ./RUSSIA1_20230821_143000_60_minutcopy.mp4 --language Russian --model medium --task transcribe > RUSSIA1_20230821_143000_60_minutcopy.whisper.medium.transcribe.dual1.txt&
#5.4GB / 101% CPU
#5.4GB / 101% CPU
#48C P0 188W / 300W | 12484MiB / 16384MiB | 99%
#50m6.403s
#51m20.238s
The new Large-V2 model isn't substantially slower than the Large-V1 model. As we've consistently observed, translation is typically a faster task than transcription for Whisper. The Medium model is, as expected, much faster than the two Large models. All three models (Large-V1, Large-V2, Medium) utilize only around 40-55% of the GPU's capacity. Yet if we run two Medium models in parallel, we get 99% GPU utilization and two separate 100%-CPU threads, but wallclock time extends to just over 50 minutes, roughly double the single-model Medium runtime. In other words, despite the GPU appearing underutilized, we cannot increase total throughput by running multiple models: total throughput is fixed under this architecture. We also tried several other episodes of 60 Minutes from different dates, all of which yielded similar runtimes.
Thus, we can estimate around 50 minutes to transcribe a 150 minute video using the Large-V1 or Large-V2 models, a 3x speedup over realtime. This means each VM can theoretically transcribe 3 minutes of audio for every 1 minute of wallclock time, so in one day a single VM can transcribe 4,320 minutes of audio (1,440 * 3).
The estimated cost of the underlying VM is $2,026 per month using the GCP Pricing Calculator, which works out to $67.5 a day or around $2.81 an hour. To process only our realtime content, we need to transcribe 36,000 minutes of audio per day, and since each VM can handle 4,320 minutes, we need 8.3 VMs. Rounding up to 9 VMs gives us a bit of headroom for the data movement of copying the video files to the local VM disk for processing. However, without spare capacity we have no margin for error during VM failures (GPU VMs do not support live migration), so a production application would require additional spare GPU VMs to ensure the system does not fall behind during hardware failures. At 9 VMs the cost is $607.5 a day, $18,225 a month or $218,700 a year.
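For reference, here is the same fleet-sizing arithmetic in rerunnable form, should the channel count or VM pricing change (the $2,026/month figure is a Pricing Calculator estimate and will drift over time):

echo "scale=2; (25 * 24 * 60) / (1440 * 3)" | bc   # 36,000 audio minutes/day / 4,320 minutes/day per VM = 8.33 VMs, rounded up to 9
echo "scale=2; 2026 / 30" | bc                     # roughly $67.5/day per VM
echo "scale=2; 9 * 67.5 * 30 * 12" | bc            # roughly $218,700/year for the 9-VM fleet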
Google Speech-to-Text
Google's Speech-to-Text API is a fully managed GCP API: it is invoked through a simple RESTful JSON request that specifies the GCS path of the audio file, the GCS path where the results should be written, and the desired transcription parameters (such as whether to estimate punctuation). No VM is required: audio files are stored in GCS and the STT API reads them directly from GCS and writes its results directly back to GCS.
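To make the request shape concrete, here is a rough sketch of such an invocation against the V2 batchRecognize endpoint. The project, bucket, audio file and region are placeholders, and the exact field names and endpoint should be verified against the current STT V2 documentation rather than taken from this sketch:

cat > request.json <<'EOF'
{
  "config": {
    "autoDecodingConfig": {},
    "model": "chirp",
    "languageCodes": ["ru-RU"],
    "features": { "enableAutomaticPunctuation": true }
  },
  "files": [ { "uri": "gs://YOUR_BUCKET/audio/broadcast.wav" } ],
  "recognitionOutputConfig": {
    "gcsOutputConfig": { "uri": "gs://YOUR_BUCKET/transcripts/" }
  }
}
EOF
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://us-central1-speech.googleapis.com/v2/projects/YOUR_PROJECT/locations/us-central1/recognizers/_:batchRecognize"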
We use the STT V2 API to access the USM Chirp model, which is Google's LSM (Large Speech Model). STT V2 offers two pricing models: immediate and dynamic batch. Immediate recognition processes the submitted audio immediately, while dynamic batch backfills the recognition job whenever there is spare capacity in the STT V2 backend infrastructure and thus can take up to 24 hours to process an audio file, but offers a substantial discount.
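In our understanding of the V2 API, opting into dynamic batch recognition is a single additional field in the same batchRecognize body sketched above (again, verify the field name against current documentation); everything else, including the GCS output location, stays the same:

cat > request-batch.json <<'EOF'
{
  "config": { "autoDecodingConfig": {}, "model": "chirp", "languageCodes": ["ru-RU"] },
  "files": [ { "uri": "gs://YOUR_BUCKET/audio/broadcast.wav" } ],
  "recognitionOutputConfig": { "gcsOutputConfig": { "uri": "gs://YOUR_BUCKET/transcripts/" } },
  "processingStrategy": "DYNAMIC_BATCHING"
}
EOF

In both modes the call returns a long-running operation, and the transcript JSON lands in the specified GCS output path when the job completes.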
Processing 36,000 minutes of audio per day using immediate recognition is priced at $0.016 a minute, or $576 a day. That rate applies only to the first 500,000 minutes processed each month, whereas we process more than 1,080,000 minutes per month. Under current STT V2 pricing, the first 500,000 minutes each month are priced at $0.016 a minute, minutes from 500,000 to 1 million at $0.010 a minute, and minutes from 1 million to 2 million at $0.008 a minute. According to the GCP Pricing Calculator, that yields $13,640 a month, or $163,680 a year.
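The tiered arithmetic behind that monthly figure, using the per-minute rates quoted above (current as of this writing; GCP pricing changes over time):

echo "500000*0.016 + 500000*0.010 + 80000*0.008" | bc   # first 500K + next 500K + remaining 80K minutes = $13,640/month
echo "13640 * 12" | bc                                  # = $163,680/year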
However, it turns out that dynamic batch jobs almost never approach their worst-case potential of a 24 hour lag. In practice, over the past eight months of publicly using Chirp for our realtime transcription, we've seen typical latencies of just minutes to tens of minutes, even under massive submission load when we are submitting hundreds of thousands of minutes of historical backfile audio to the API alongside our realtime content. In our specific use case, while we prefer minimal latency, reducing cost takes precedence, and even if assorted broadcasts take an hour or two before they are queued by the API for processing, we can tolerate that latency in the Visual Explorer. Under dynamic batching, the price drops precipitously to just $0.003 a minute, meaning the bill for transcribing all 25 channels is just $3,240 a month, or $38,880 a year.
This means that processing a single channel of 24/7 video for an entire year (more than half a million minutes of audio) costs just $1,577 under dynamic batching (365 days * 24 hours a day * 60 minutes an hour * $0.003 a minute), making it financially tractable for the first time to transcribe even the largest audio archives.
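The same dynamic batch figures in rerunnable form, at the $0.003 per minute rate:

echo "25 * 24 * 60 * 30 * 0.003" | bc   # 25 channels over a 30-day month = $3,240/month
echo "3240 * 12" | bc                   # = $38,880/year
echo "365 * 24 * 60 * 0.003" | bc       # one channel for a full year = roughly $1,577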
While dynamic batching adds a delay of minutes to hours before the API actually processes a given audio file, immediate and batch processing take roughly the same amount of time once processing begins. Processing the same 150 minute broadcast above took 4m14s using immediate processing and 4m55s using dynamic batch. Contrast that with 50 minutes for Whisper. This means that even if a batched video waits 45 minutes in the queue, it will still complete in about the time it would have taken Whisper to process it, but at a fraction of the cost and with vastly greater scalability.
Because STT V2 is a managed API, there are no underlying VMs to maintain: no VMs that need constant refreshing and upgrades, with all the attendant upgrade failures, strange errors and security risks. Best of all, the STT V2 API has extremely high scalability, meaning projects can queue substantial volumes of audio at a fairly high submission rate without having to worry about anything: just submit and forget. In our own workflow, it means we can submit hundreds of thousands of minutes of audio in a fairly short period of time to queue a slice of our historical archive, all while our realtime processing continues unimpeded.