The GDELT Project

AI In Production: How To Think About The Ongoing Cost Of Rerunning Models As They Improve

One of the few certainties in the digital world is that the breathtaking pace of AI advancements over the past few years has brought with it a steady stream of model updates. Nearly every major commercially deployed AI model API (both generative and traditional) undergoes regular updates and improvements. This means that the same input provided to the same model today may yield very different results from last week, the week before that, or the week before that (setting aside the nondeterministic nature of LLMs/LMMs). For example, some of GCP's GenAI models already have two or even three stable versions, not counting their "latest" releases. Less obvious is the continuous improvement in more traditional AI models, from speech transcription to OCR to translation. Organizations with small corpora can simply rerun them on a regular basis, but how should organizations with vast archives think about when and how to tractably rerun their content to take advantage of the latest models, given that it simply isn't cost-feasible to push entire massive content archives back through every model update on a regular basis?
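As a rough illustration of why wholesale reruns are off the table at scale, consider a back-of-envelope sketch using an archive on the order of the one described below; the per-minute rate here is a hypothetical placeholder rather than an actual GCP price:

```python
# Back-of-envelope cost of rerunning a full broadcast archive through a hosted
# ASR API. Both the archive size and the per-minute rate are illustrative
# placeholders - substitute your own archive size and the current published
# pricing for whichever model you actually use.
ARCHIVE_HOURS = 6_000_000        # on the order of the Visual Explorer archive
PRICE_PER_MINUTE_USD = 0.016     # hypothetical per-minute ASR rate (assumption)

total_minutes = ARCHIVE_HOURS * 60
full_rerun_cost = total_minutes * PRICE_PER_MINUTE_USD
print(f"Full-archive rerun: ~${full_rerun_cost:,.0f}")    # ~$5,760,000 at this rate

# Narrowing a rerun to only the content actually affected by a model update
# (say, 5% of the archive) changes the calculus entirely.
affected_fraction = 0.05
print(f"Targeted 5% rerun:  ~${full_rerun_cost * affected_fraction:,.0f}")
```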

Using the Visual Explorer as an example, the complete archive contains more than 6 million hours of broadcasts spanning 100 channels from 50 countries on 5 continents in at least 35 languages and dialects. Two years ago we began transcribing Russian, Ukrainian and Belarusian broadcasts to help scholars and journalists understand the domestic portrayals of Russia's invasion of Ukraine. At the time, GCP's STT models were state of the art. Over the first year of transcribing those channels, we observed slow and steady quality improvements in keeping with the ASR technology of the time. While each model update brought improvements, they were never substantial enough to justify a complete reprocessing of historical content, so we would adopt the latest model for new content but leave historical content as-is.

However, with the release of GCP's large speech models (LSMs) USM and Chirp, we observed a truly massive quality improvement over the previous generation of models, with Ukrainian gaining full automatic punctuation support and overall accuracy improving dramatically. The leap in accuracy was so large that we have now gone back and reprocessed our entire historical archive of those channels using Chirp, with the results rolling out to the Visual Explorer soon.

The previous generations of GCP's STT ASR yielded acceptable to reasonable results on other channels, but never strong enough to warrant deployment across our broader selection of channels. Critically, a number of channels contain languages not supported by any ASR system of that era, and many channels feature code switching, which has never been robustly supported by classical ASR systems. Extensive testing of Chirp across our entire archive demonstrated both strong support for every language in our archive and superb code switching performance, finally making it possible to apply ASR to our entire archive via Chirp.
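For reference, a minimal sketch of what a single Chirp transcription request looks like through the Speech-to-Text v2 API is shown below; the project ID, region and file name are placeholders, and Chirp is served only from specific regional endpoints:

```python
# Minimal sketch: transcribe a short audio clip with Chirp via the
# Speech-to-Text v2 API. Project ID, region and file name are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-project"     # placeholder
REGION = "us-central1"        # a region where Chirp is available

# Chirp is served from regional endpoints, so point the client at one.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["uk-UA"],
    model="chirp",
)

with open("clip.wav", "rb") as f:
    audio_bytes = f.read()

request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)
```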

Originally we focused Chirp only on channels that did not already have transcription available from the video provider. After extensive testing, we discovered that a previous-generation non-GCP ASR product, used by an external organization to generate transcriptions for select channels, was yielding results poor enough in many cases that they were actually impacting our analyses of those channels. Testing those channels through Chirp showed such a dramatic quality improvement that we decided to rerun those entire historical channel archives through Chirp.

The workflow we use internally is:

- Track each new model release and run it over a small fixed benchmark sample of broadcasts representative of our channels, languages and use cases.
- Compare the new model's output against the current production model on the metrics that matter most to our analyses.
- If the improvement is marginal, adopt the new model for new content only; if it is transformative, identify the narrowest slice of historical content (specific channels, languages or time periods) directly affected and reprocess only that slice.

We've found this workflow to be highly effective in monitoring for the kinds of changes that directly impact our specific use cases while minimizing reprocessing costs by narrowly focusing reprocessing on only the smallest portion of content believed to be directly affected by a given model update. For provenance and comparison purposes, we also record each model version permanently in a meta log associated with each broadcast, allowing us to go back over time and assess improvement gains across specific metrics – more to come on that in the future.
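A purely illustrative sketch of what such a per-broadcast meta log might look like follows; the field names and file layout are hypothetical rather than our actual schema:

```python
# Illustrative sketch of an append-only per-broadcast provenance log: every
# time a broadcast is (re)processed, record which model and version produced
# the output so later comparisons can be made across model generations.
# Field names and file layout are hypothetical, not GDELT's actual schema.
import json
import time

def log_model_run(meta_log_path: str, broadcast_id: str, model: str,
                  model_version: str, metrics: dict) -> None:
    """Append one provenance record for a (re)processing pass."""
    record = {
        "broadcast_id": broadcast_id,
        "model": model,
        "model_version": model_version,
        "processed_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metrics": metrics,   # e.g. error rate on a fixed benchmark sample
    }
    with open(meta_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: note that a broadcast was reprocessed with a newer ASR model.
log_model_run(
    "metalogs/broadcast_12345.jsonl",   # hypothetical path
    broadcast_id="broadcast_12345",
    model="chirp",
    model_version="example-version-label",
    metrics={"benchmark_wer": 0.11},
)
```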