On Wednesday, we announced, in collaboration with the Internet Archive TV News Archive, the machine transcription of its complete 2.5 million hours of uncaptioned global television news using GCP's Universal Speech Model (Chirp) ASR, in what we believe to be one of the largest scholarly applications of multilingual speech recognition to global television news ever performed. Remarkably, it took just 7 days to process this massive archive and, had speed been a consideration, we could have finished in a single afternoon, demonstrating the nearly infinite scalability of modern cloud-based speech transcription systems.
How much time would it have taken us to process this massive archive using an open source ASR model like OpenAI's Whisper in the cloud on rented GPUs? Different kinds of audio, with different speaking styles, speaking rates, speaker mixes, language mixes, dialects, recording quality, presentation quality, background noise and myriad other factors, all affect the speed of modern LSM ASR models, so any given audio collection will exhibit different inference speeds. In our own work to date on the TV News Archive, we've found that Whisper on widely available V100 GPUs, running in production at scale over many files, achieves an average recognition speedup of 3x over realtime: it is able to recognize 3 minutes of audio for every 1 minute of runtime. Even that is constrained by the availability of GPUs and their associated host VMs. To run the complete 2.5 million hours through Whisper on V100s, at the inference speeds we have observed at scale on this specific collection in real-world use, would take 833,334 hours (34,722 days, or roughly 95 years) of GPU time, assuming 100% utilization and flawless execution. If we also count the executions in which Whisper halted early or entered an easily detectable error state (such as infinite repetition) in our at-scale tests to date, that figure grows by at least a quarter to a half or more, and that is before counting existential hallucination that requires a rerun, or the fact that it often takes multiple reruns with Whisper to achieve a fully intact and usable output.
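To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch using only the figures above; the 25-50% overhead factor is the rerun range just described, not a new measurement.

```python
# Back-of-the-envelope estimate of Whisper GPU time for the full archive,
# using only the figures above (3x realtime on V100s at production scale).

ARCHIVE_HOURS = 2_500_000   # total uncaptioned airtime to transcribe
WHISPER_SPEEDUP = 3         # observed average speedup over realtime on V100s

gpu_hours = ARCHIVE_HOURS / WHISPER_SPEEDUP   # ~833,334 GPU-hours
gpu_days = gpu_hours / 24                     # ~34,722 GPU-days
gpu_years = gpu_days / 365                    # ~95 GPU-years

# Detectable failures (early halts, infinite repetition, etc.) add at least
# 25-50% on top of the ideal figure, before hallucination-driven reruns.
low, high = gpu_hours * 1.25, gpu_hours * 1.50

print(f"Ideal: {gpu_hours:,.0f} GPU-hours = {gpu_days:,.0f} days = {gpu_years:.0f} years")
print(f"With rerun overhead: {low:,.0f} to {high:,.0f} GPU-hours")
```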
In contrast, we typically see a 38x – 150x+ speedup over realtime from Chirp, meaning it would take between 65,789 and just 16,666 processing hours to complete ASR of the entire collection. Critically, as a hosted API, those processing hours can be run effectively entirely concurrently, at what amounts to infinite scalability, meaning that, with the necessary quota allocation, it would theoretically be possible to complete ASR of the entire 2.5-million-hour collection in just a single afternoon. In our case, we used just a single small CPU-only VM to run the scripts that prepared and submitted each video to Chirp, peaking at just three such machines during our highest-intensity submission experiments. While we could have submitted the entire collection in a single day, we chose to process it in brief single-day batches over the last few months so that we could process a batch, then deep dive into the results to understand how the model was performing, identify any limitations or issues we needed to consider, and tweak submission parameters. In the end, we submitted full production bursts of content on just 7 total days, from just 1-3 small VMs, spread over the last few months: typically a single-day burst at the beginning of the month, followed by several weeks of analysis to evaluate the outputs. Thus, our experience is extremely atypical in how long we took to process the collection, since our goal was to deeply evaluate the model as we applied it to successively more complex content mixes: the 7 days is far longer than it should have taken, only because we purposely ran in small batches for testing purposes, and in real life, processing of the entire collection would have been achievable in less than a single afternoon.
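The same back-of-the-envelope calculation for Chirp, with a purely hypothetical concurrency level to illustrate how concurrent hosted batch submission collapses those processing hours into a handful of wall-clock hours:

```python
# The same estimate for Chirp at the 38x-150x speedups observed above, plus
# the wall-clock time if those hours run concurrently as hosted batch jobs.
# The concurrency figure is purely illustrative, not an actual quota we used.

ARCHIVE_HOURS = 2_500_000

chirp_hours_low_end = ARCHIVE_HOURS / 38     # ~65,789 processing hours
chirp_hours_high_end = ARCHIVE_HOURS / 150   # ~16,666 processing hours

CONCURRENT_FILES = 20_000                    # hypothetical concurrent batch capacity
wall_clock_hours = chirp_hours_high_end / CONCURRENT_FILES   # well under an afternoon

print(f"{chirp_hours_low_end:,.0f} to {chirp_hours_high_end:,.0f} processing hours")
print(f"~{wall_clock_hours:.1f} wall-clock hours at {CONCURRENT_FILES:,} concurrent files")
```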
Even though we used Dynamic Batch submission to maximally reduce processing costs, each video took just a few minutes to complete, even at our maximum submission rate of many hundreds of thousands of hours of speech per day, with zero scaling limits as the underlying GCP infrastructure transparently scaled up to support the fully managed hosted Chirp API.
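For illustration, here is a minimal sketch of what a single Chirp Dynamic Batch submission can look like through the Speech-to-Text v2 API; the project ID, bucket paths and language code are placeholders, and this is an assumption-laden illustration rather than our exact production pipeline.

```python
# Minimal sketch of one Chirp batch submission via the Speech-to-Text v2 API
# (google-cloud-speech). PROJECT_ID, AUDIO_URI and OUTPUT_URI are placeholders;
# this illustrates the batch-submission pattern, not our production scripts.

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-project"                         # placeholder
AUDIO_URI = "gs://my-bucket/audio/broadcast.flac" # placeholder: audio extracted from a broadcast
OUTPUT_URI = "gs://my-bucket/transcripts/"        # placeholder output prefix in GCS

# Chirp is served from specific regions, so point the client at one of them.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],                     # set per broadcast
    model="chirp",
)

request = cloud_speech.BatchRecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
    config=config,
    files=[cloud_speech.BatchRecognizeFileMetadata(uri=AUDIO_URI)],
    recognition_output_config=cloud_speech.RecognitionOutputConfig(
        gcs_output_config=cloud_speech.GcsOutputConfig(uri=OUTPUT_URI),
    ),
    # Dynamic batching trades latency for lower cost on non-urgent batch jobs.
    processing_strategy=cloud_speech.BatchRecognizeRequest.ProcessingStrategy.DYNAMIC_BATCHING,
)

# Returns a long-running operation; transcripts land under OUTPUT_URI when done,
# so the submitting VM only needs to fire off requests and check back later.
operation = client.batch_recognize(request=request)
print(operation.operation.name)
```

Because the heavy lifting happens entirely inside the hosted API, the submitting machine does nothing but issue requests like this one, which is why a single small CPU-only VM was sufficient for most of our bursts.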
You read that right: processing the TV News Archive's complete 100-channel 50-country quarter-century archive of more than 2.5 million hours of uncaptioned airtime spanning at least 35 languages and dialects took just 7 days of processing and, had speed been a consideration, we could have completed the entire archive in a single afternoon. That is the true scalability of today's hosted LSM-based ASR models.