AI In Production: How To Think About The Ongoing Cost Of Rerunning Models As They Improve

One of the few certainties in the digital world is that the breathtaking pace of AI advancement over the past few years has brought with it a steady stream of model updates. Nearly every major commercially deployed AI model API (both generative and traditional) undergoes regular updates and improvements. This means that the same input provided to the same model today may yield very different results than it did last week, the week before that, and so on (setting aside the nondeterministic nature of LLMs/LMMs). For example, there are already two and even three stable versions of some of GCP's GenAI models, not counting its "latest" models. Less obvious is the continuous improvement in more traditional AI models, from speech transcription to OCR to translation. Organizations with small corpuses can simply rerun them on a regular basis, but how might organizations with vast archives think about when and how to tractably rerun their content to take advantage of the latest models, given that it simply isn't cost-feasible to push entire massive content archives through every model update on a regular basis?

Using the Visual Explorer as an example, the complete archive contains more than 6 million hours of broadcasts spanning 100 channels from 50 countries on 5 continents in at least 35 languages and dialects. Two years ago we began transcribing Russian, Ukrainian and Belarusian broadcasts to help scholars and journalists understand the domestic portrayals of Russia's invasion of Ukraine. At the time, GCP's STT models were state of the art. Over the first year of transcribing those channels, we observed slow and steady quality improvements in keeping with the ASR technology of the time, but while each model update brought gains, they were never substantial enough to justify a complete reprocessing of historical content. We therefore always used the latest model for new content, but left historical content as-is.

However, with the release of GCP's large speech models, USM and Chirp, we observed a truly massive quality improvement over the previous generation of models, with Ukrainian gaining full automatic punctuation support and overall accuracy improving dramatically. The leap in accuracy was so large that we have now gone back and reprocessed our entire historical archive of those channels using Chirp and will be rolling the results out to the Visual Explorer soon.

The previous generations of GCP's STT ASR yielded acceptable to reasonable results on other channels, but the accuracy of the ASR technology of the time was never strong enough to warrant deploying it across our broader selection of channels. Critically, a number of channels contain languages not supported by any ASR system of that era, and many channels contain code switching, which has never been robustly supported by classical ASR systems. Extensive testing of Chirp across our entire archive demonstrated both strong support for all languages contained in our archive and superb code switching performance, making it finally possible to apply ASR to our entire archive via Chirp.
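
For those curious what this looks like in practice, the sketch below follows the documented pattern for invoking Chirp through GCP's Speech-to-Text V2 API: a regional endpoint, a recognizer path and model="chirp". It transcribes a single short clip synchronously; a real pipeline over hour-long broadcasts would use the batch recognition path against Cloud Storage instead, and the project ID, file path and language code here are purely illustrative rather than our internal code.

```python
# Minimal sketch: synchronous transcription of a short clip with Chirp via the
# Speech-to-Text V2 API. Project ID, file path and language code are illustrative.
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def transcribe_with_chirp(project_id: str, audio_path: str, language_code: str) -> str:
    # Chirp is served from regional endpoints such as us-central1.
    client = SpeechClient(
        client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
    )
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=[language_code],   # e.g. "uk-UA"
        model="chirp",
    )
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
        config=config,
        content=audio_bytes,
    )
    response = client.recognize(request=request)
    return " ".join(
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    )
```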

Originally we focused Chirp only on channels that did not already have transcription available from the video provider. After extensive testing, we discovered that a previous-generation non-GCP ASR product used by an external organization to generate transcriptions for select channels was yielding results poor enough in many cases to actually impact our analyses of those channels. Testing those channels through Chirp showed such a massive quality improvement that we decided to rerun those entire historical channel archives through Chirp as well.

The workflow we use internally is:

  • New broadcasts are always transcribed using the absolute latest model to incorporate the latest accuracy improvements. While this can create discontinuity issues as errors are corrected, it allows us to keep pace with the latest SOTA model capabilities. Additionally, Chirp specifically does not offer model versioning, so there is no technical way to pin to a specific model version.
  • Each month we rerun a curated selection of broadcasts capturing a range of ordinary and edge case content through the current model and diff the results against those of the previous few months (a simple sketch of this check appears after this list). If there is a major leap in performance overall or on specific edge cases, we flag it for human confirmation and review.
  • If we determine that the improvement is substantial overall, or that, while more modest, it would have an outsized impact on accuracy for specific high-profile scholarly or journalistic questions, we evaluate reprocessing our historical archive through the new model.
  • Using the diffs, we examine whether the model improvement is broad or narrowly confined to specific edge cases. Perhaps performance on a specific dialect is vastly improved: we might search our archives for typographical errors indicative of that dialect under the previous models and rerun just those broadcasts. Perhaps a major feature upgrade is released for a single language (Chirp added automatic punctuation for Ukrainian, which was missing from STT V1): we will scan for all broadcasts containing that language, and so on. This way we can be highly precise about restricting our reprocessing to the smallest possible slice of the full archive (see the second sketch after this list).
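
The first sketch below illustrates the monthly regression check described above, under the assumption that each month's benchmark transcripts are stored as one text file per broadcast: it computes a word-level divergence between the current and previous month's output for each curated broadcast and flags large shifts for human review. The directory layout, threshold and flagging mechanism are hypothetical, not our production code.

```python
# Hypothetical monthly regression check: compare this month's benchmark transcripts
# against last month's and flag broadcasts whose output shifted substantially.
import json
from pathlib import Path

def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words, single-row dynamic programming."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]

def divergence(old_text: str, new_text: str) -> float:
    """Fraction of the older transcript's words changed in the newer one."""
    old_words, new_words = old_text.split(), new_text.split()
    if not old_words:
        return 1.0 if new_words else 0.0
    return word_edit_distance(old_words, new_words) / len(old_words)

FLAG_THRESHOLD = 0.15  # illustrative: flag if more than ~15% of words changed

for new_path in sorted(Path("benchmarks/2024-06").glob("*.txt")):
    old_path = Path("benchmarks/2024-05") / new_path.name
    if not old_path.exists():
        continue
    score = divergence(old_path.read_text(), new_path.read_text())
    if score > FLAG_THRESHOLD:
        print(json.dumps({"broadcast": new_path.stem,
                          "divergence": round(score, 3),
                          "action": "flag_for_human_review"}))
```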

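The second sketch illustrates the targeted-selection step from the final bullet: given a narrowly scoped improvement (say, a feature added for a single language), scan a per-broadcast metadata inventory and queue only the affected slice of the archive for reprocessing. The inventory format, field names and file path are assumptions for illustration, not our actual schema.

```python
# Hypothetical targeted reprocessing selection: queue only broadcasts affected by a
# narrowly scoped model improvement (here, a Ukrainian-specific upgrade).
import json

AFFECTED_LANGUAGES = {"uk"}  # scope of the hypothetical model change

def broadcasts_to_reprocess(inventory_path: str):
    """Yield broadcast IDs whose language falls within the scope of the change."""
    with open(inventory_path) as f:
        for line in f:                # assumed layout: one JSON record per broadcast
            record = json.loads(line)
            if record.get("language") in AFFECTED_LANGUAGES:
                yield record["broadcast_id"]

if __name__ == "__main__":
    queue = list(broadcasts_to_reprocess("archive_inventory.jsonl"))
    print(f"Selected {len(queue)} broadcasts for reprocessing")
```
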
We've found this workflow to be highly effective at monitoring for the kinds of changes that directly impact our specific use case while minimizing costs by narrowly focusing reprocessing on only the smallest portion of content believed to be directly affected by each model update. For provenance and comparison purposes we permanently record each version in a meta log associated with each broadcast, which allows us to go back over time and assess improvement gains across specific metrics – more to come on that in the future.
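
As a rough illustration of what such a per-broadcast provenance record can look like (the field names and append-only JSON-lines format here are assumptions, not our actual meta log schema), each transcription run simply appends an entry noting which model produced the output and when, so that successive versions can be diffed later:

```python
# Hypothetical per-broadcast provenance log: append one record per transcription run
# so that earlier model outputs are never overwritten and can be compared over time.
import datetime
import json

def append_transcription_record(meta_log_path: str, model: str,
                                model_version: str, transcript_path: str) -> None:
    entry = {
        "run_date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,                  # e.g. "chirp"
        "model_version": model_version,  # e.g. "latest" when no version pinning is offered
        "transcript": transcript_path,   # where this run's output is stored
    }
    with open(meta_log_path, "a") as f:  # append-only meta log, one JSON record per line
        f.write(json.dumps(entry) + "\n")
```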