Almost from the first days of Russia's invasion of Ukraine, we have used GCP's Cloud STT ASR (today Chirp) and Google Translate to transcribe and translate Belarusian, Russian and Ukrainian television news channels so that journalists and scholars can study and report on the war. Today that early work has expanded to transcribing, in their original languages, the entirety of the Television News Archive's quarter-century uncaptioned global archive: almost 300 channels from 50 countries on 5 continents, in more than 150 languages and dialects, totaling more than 2.5M hours of speech. Yet despite transcribing those millions of hours into their original languages, today we translate just ten of those channels into English due to the immense cost and limitations of traditional machine translation.
While ASR costs have collapsed over the past five years with the rise of LSMs (large speech models) and LMMs (large multimodal models), classical NMT (Neural Machine Translation) costs have remained fairly constant. More problematically, television news across the world is highly multilingual, featuring rapid codeswitching, myriad excerpts of speech in other languages and stream-of-consciousness dialog filled with slang and contemporary phrasing that pushes NMT systems past their limits. In particular, most NMT systems support only a single language per document, meaning they are unable to translate much of our heavily multilingual archive.
Today we are excited to report the first results from a major new initiative: teaching Gemini to understand the concept of time in ASR broadcast transcripts and using it to translate the vast global archives of the TV News Archive at scale. In these first large-scale experiments, we translated nearly 1M broadcasts totaling 1.6 billion seconds of airtime and containing 2.7 billion words (14 billion characters) of speech for just $20,618 using Gemini 2.5 Flash Non-Thinking, placing planetary-scale translation within reach for the first time. In comparison, this same set of broadcasts would have cost more than $280K to translate using our current NMT translation workflow. Only the public enterprise Gemini API was used and no data was used to train or tune any model.
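To put those figures in perspective, the unit economics work out to roughly the following (a back-of-the-envelope calculation from the totals above; the $280K figure is our estimate for the NMT workflow, not a published rate card):

```python
# Back-of-the-envelope unit costs from the figures reported above.
chars = 14_000_000_000      # 14 billion characters translated
gemini_cost = 20_618        # total Gemini 2.5 Flash Non-Thinking spend (USD)
nmt_cost = 280_000          # estimated cost under the prior NMT workflow (USD)

print(f"Gemini: ${gemini_cost / chars * 1e6:.2f} per million characters")
# -> Gemini: $1.47 per million characters
print(f"NMT:    ${nmt_cost / chars * 1e6:.2f} per million characters")
# -> NMT:    $20.00 per million characters
print(f"Cost ratio: {nmt_cost / gemini_cost:.1f}x")
# -> Cost ratio: 13.6x
```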
This new Gemini translation workflow places the vision of translating the entire Television News Archive within reach for the first time in its quarter-century existence.
Accomplishing this feat required solving something genuinely difficult: teaching Gemini the concept of time in broadcast transcripts. LMMs like Gemini are so effective at multilingualism today that for most use cases they can be given input text in any language(s) and asked to reason about it and respond in any other language(s), no translation required. For translation itself, Gemini can be given a single input document containing many languages tightly interspersed and it will seamlessly translate the entire document into English without being told anything about the underlying languages.
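As a concrete illustration, here is a minimal sketch of that kind of call, assuming the public google-genai Python SDK; the model string and prompt wording below are illustrative, not our production prompt:

```python
# Minimal sketch: translating a mixed-language transcript via the public
# Gemini API. The prompt here is illustrative only, not our production prompt.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

mixed_transcript = "... ASR text freely mixing, say, Pashto, Dari and Urdu ..."

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=(
        "Translate the following broadcast transcript into a single "
        "coherent English text. It may interleave multiple languages; "
        "translate all of them:\n\n" + mixed_transcript
    ),
    # Disable thinking to approximate the "Non-Thinking" configuration.
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```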
Most importantly for us, Gemini is remarkably good at understanding the kinds of transcription errors introduced by ASR models, such as when Chirp mistranscribes a Pashto broadcast into Arabic, Farsi or Urdu using words that sound similar to the Pashto being spoken but have very different meanings. Much like an expert human translator working with ASR output, Gemini recognizes these "sounds like" transcriptions: it reads the Arabic words, understands them as the originally spoken Pashto words they sound like and correctly translates them into English. Even in codeswitching broadcasts in which a speaker rapidly alternates between several languages every few words, Gemini tracks seamlessly across languages to produce a single coherent English translation. None of this was possible before the arrival of LLMs and LMMs.
Given a blob of text in most major languages and the proper prompting, Gemini today can rapidly translate it into English with no dropouts (missing text) and no measurable hallucination (hallucination remains a serious issue in other LMM task domains, but within the narrow scope of translation, given our specific prompting and architecture, it is no longer present at measurable levels in our testing). Gemini 2.5 Flash Non-Thinking is so capable on its own that we have observed no significant differences between its translations and those of the most advanced model, Gemini 3 Pro High Thinking, at vastly lower cost.
The problem is that while LLMs like Gemini are superbly adept at translating blocks of text, the concept of time has long eluded them. Broadcast transcripts require embedded time information, since we have to know when each chunk of text in the translation was spoken in order to properly index it for search and display it onscreen as closed captioning. Yet add time information to the input text and suddenly everything breaks down. Given an SRT file as input and appropriate prompting (including many-shot prompting), in more than half of our test runs Gemini would get confused by the time information and translate the text as independent brief snippets (completely destroying their meaning and losing context), massively alter and hallucinate the timecode information on output, or both. In fact, across all Gemini 2.5 models (Flash Non-Thinking, Flash Thinking and Pro) we observed extensive, existential degradation of the translation from the introduction of time information.
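For reference, this is the shape of input that triggered those failures (an illustrative SRT fragment, not from an actual broadcast); each numbered cue invites the model to treat two halves of one sentence as unrelated snippets:

```
1
00:00:01,000 --> 00:00:03,240
The first half of a sentence that

2
00:00:03,240 --> 00:00:05,800
continues here, split mid-thought
```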
In nearly every configuration of prompting, examples and time encoding, we found that Gemini would extensively and consistently alter the time information on output: repeating timecodes, skipping around, replacing timecodes with gibberish, treating the timecodes as breaking the transcript into independent unrelated snippets, expanding or compressing the time gaps between them, and so on. We tried all of the usual industry-standard timecoded transcript formats like SRT and TTXT, along with XML and JSON formats, and countless varieties of multi-shot prompting, to no avail. We tried breaking transcripts into sentences and translating them by ID; timecodes as fractional seconds, as whole seconds, encoded as hex, base64 and other schemes, including complex textual coding schemes, all in combination with increasingly elaborate prompting instructing Gemini to preserve the time information as-is. We tried breaking the text into chunks and using various coding mechanisms for the chunk IDs: again, to no avail. We tried embedding time within the text as HTML, XML and myriad other encoding schemes, again to no avail. Fascinatingly, we found that unexpected languages, like Hungarian, yielded the highest failure and hallucination rates for time information.
After months of experimentation and large-scale testing, we ultimately identified a powerful time encoding schema: it embeds time information into the text at 2-second intervals and encodes it using dual sequential coding schemes that increment at similar but slightly different rates. This appears to allow Gemini to simultaneously understand that the text is anchored in a time field that must be passed through to the output, while not attempting to alter that time information. The end result is that Gemini sees the underlying transcript as a single unified block of text and translates it using document-wide context, while passing the timecode information through unmodified. Word reordering happens seamlessly, with words moving before or after time blocks automatically (in languages with sentence-final negation, for example, the "no" moves to an earlier timeslot within the sentence as needed).
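The following sketch conveys the general shape of the approach; the marker format and the specific counter increments are hypothetical placeholders for illustration, not our production encoding:

```python
# Illustrative sketch only: embed dual sequential markers into a word-timed
# transcript at 2-second intervals. The [a.b] marker format and the +3/+4
# increments are hypothetical placeholders, not the production encoding.
import re

MARKER = re.compile(r"\[(\d+)\.(\d+)\]")

def encode(words, interval=2.0):
    """words: list of (start_seconds, token) pairs from the ASR output.
    Returns one unified text block with a dual-counter marker embedded
    every `interval` seconds of airtime."""
    out, a, b = [], 0, 0
    boundary = interval
    for start, token in words:
        while start >= boundary:
            a += 3          # first counter
            b += 4          # second counter, advancing slightly faster
            out.append(f"[{a}.{b}]")
            boundary += interval
        out.append(token)
    return " ".join(out)

def decode(translated, interval=2.0):
    """Split the translated text back into (start_seconds, text) chunks
    by scanning for the pass-through markers."""
    chunks, t, last = [], 0.0, 0
    for m in MARKER.finditer(translated):
        text = translated[last:m.start()].strip()
        if text:
            chunks.append((t, text))
        t = (int(m.group(1)) // 3) * interval  # recover time from counter
        last = m.end()
    tail = translated[last:].strip()
    if tail:
        chunks.append((t, tail))
    return chunks
```

In this sketch, absolute time can be recovered from either counter, and any disagreement between the two counters on output is itself an early signal of corruption.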
Most importantly of all, failure states in this workflow occur in only around 10% of all translation attempts and are perfectly correlated with wholesale failure to pass through the timecode information. In other words, Gemini faithfully translates and timecodes the transcript in 90% of runs, and in the runs where it fails, it leaves massive holes in the timecode information, meaning we can simply examine Gemini's output for each broadcast for large-scale timecode errors as a nearly perfect signal of translation failure. In contrast, our previous workflow yielded gaps, timecode failures and other errors without any way of reliably detecting them.
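Under that failure mode, quality control reduces to a simple mechanical check. Building on the hypothetical encoding sketched above, a run is accepted only if every embedded marker survives, in order and unaltered, in the output:

```python
def markers_intact(source_text: str, translated_text: str) -> bool:
    """Nearly perfect failure signal: accept a run only if every embedded
    time marker survives, in order and unaltered, in Gemini's output."""
    return MARKER.findall(source_text) == MARKER.findall(translated_text)
```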