Using Google Translate's HTML Support To Create High-Resolution Translated Google Speech To Text SRT Captioning

To date we have transcribed more than 42,000 Russian and Ukrainian language television news broadcasts through Google's Speech-to-Text API, with the resulting 2.1GB of spoken word transcripts made available for researchers and journalists to explore how Russia's invasion of Ukraine has been communicated, especially to the Russian public. The Visual Explorer already displays these transcripts inline as part of its interface, and Chrome users can have them translated into English on the fly using Chrome's built-in Google Translate integration. What would it look like to translate these transcripts ourselves using Google Translate?

Google Translate's batch translation support makes it trivial to translate large text files from one language into another. Why then do television news broadcasts pose a unique challenge? Like most neural machine translation (NMT) systems, the Translate API is designed for ordinary textual documents. In broadcast news, each word in the source language is associated with a precise subsecond timestamp marking when it was spoken, and that timestamp must be carried through to the translated transcript. The added complication is that there is rarely a one-to-one correspondence between words in the source and target languages.

One simple approach is to split the transcript into sentences and translate a sentence at a time, collapsing timestamp resolution to the level of sentence start and stop points. Unfortunately, punctuation in transcribed speech is an artificial construct, added by machines or humans to make transcripts more readable, rather than a native part of the stream-of-consciousness nature of the spoken word. This means that sentences can be extremely long, sometimes spanning 30 seconds or more. In contrast, onscreen captioning lines are typically around 50 characters or fewer, while the Visual Explorer operates at a native 4 second resolution.

How then to preserve STT precision timestamps when using Google Translate?

Let's use this Russia 1 episode of 60 Minutes as an example:

Here is its underlying SRT file:

Let's look at lines 7 to 13 of the SRT:

...
7
00:00:51,000 --> 00:00:53,800
дипломатии борель, сказал, что ядерная атака 

8
00:00:53,800 --> 00:00:56,600
против Украины не вызовет ответного ядерного удара 

9
00:00:56,600 --> 00:01:00,700
НАТО по России вместо этого будут настолько мощные 

10
00:01:00,700 --> 00:01:03,800
военные ответы, что вся Российская армия будет 

11
00:01:03,800 --> 00:01:07,100
уничтожен примерно то же самое сказала французский 

12
00:01:07,100 --> 00:01:11,000
президент макрон заявил, что отвечать на ядерный 

13
00:01:11,000 --> 00:01:14,200
удар по Украине Франция не станет. А лично он 
...
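
For readers who want to follow along in code, an SRT excerpt like the one above can be read into (index, start, end, text) records with a few lines of Python. This is only a minimal sketch and not the parser used in the pipeline described here; the names Cue and parse_srt are hypothetical, and the sketch assumes well-formed blocks separated by blank lines.

```python
import re
from typing import NamedTuple


class Cue(NamedTuple):
    index: int   # SRT line number
    start: str   # start timestamp, e.g. 00:00:51,000
    end: str     # end timestamp
    text: str    # caption text


TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(srt_text: str) -> list[Cue]:
    """Split SRT text into cues; blocks are assumed to be separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue  # skip malformed blocks rather than guessing
        match = TIMING.match(lines[1])
        if not match:
            continue
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), " ".join(lines[2:])))
    return cues


sample = """7
00:00:51,000 --> 00:00:53,800
дипломатии борель, сказал, что ядерная атака

8
00:00:53,800 --> 00:00:56,600
против Украины не вызовет ответного ядерного удара
"""

for cue in parse_srt(sample):
    print(cue.index, cue.start, cue.end, cue.text)
```

Keeping the timestamps as strings is deliberate: they are never modified, only copied back into the translated SRT later.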

What if we simply pass each line on its own to Google Translate? We get the following:

7 
00:00:51,000 --> 00:00:53,800 
Borel diplomacy, said that a nuclear attack

8 
00:00:53,800 --> 00:00:56,600
against Ukraine will not cause a retaliatory nuclear strike

9 
00:00:56,600 --> 00:01:00,700 
NATO over Russia will instead be so powerful

10 
00:01:00,700 --> 00:01:03,800 
military responses that the entire Russian army would

11 
00:01:03,800 --> 00:01:07,100 
destroyed about the same said french

12 
00:01:07,100 --> 00:01:11,000 
President Macron said he would respond to nuclear

13 
00:01:11,000 --> 00:01:14,200 
France will not strike at Ukraine. And he personally
...

Clearly this translation leaves a lot to be desired, and it entirely changes the meaning of line 13. When each snippet is translated on its own, Google Translate does not have enough surrounding text to understand the context of the full sentence.

Contrast this with the results if we translate the entire sentence:

the luminaries of European, Borel diplomacy, said that a nuclear attack against Ukraine would not cause a
NATO nuclear retaliatory strike against Russia, instead there would be such powerful military responses
that the entire Russian the army will be destroyed about the same thing French President Macron said that
France would not respond to a nuclear strike on Ukraine.

This is much more faithful to the underlying transcript and correctly translates the meaning of line 13 of the SRT transcript. But to get this more accurate translation, we had to look at the entire sentence, which spans from 00:00:51 to 00:01:14, meaning we had to lump 23 seconds of speech spanning 7 SRT lines from the original Russian together. That will cause a lot of problems when we attempt to integrate this back into our Visual Explorer display or connect the transcript back to the onscreen visuals.
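
To make the mismatch concrete, the arithmetic can be sketched in Python. The helper name srt_to_ms is hypothetical, and the sketch uses the cue boundaries from the excerpt (the start of line 7 and the end of line 13), so it slightly overstates the sentence itself, which ends partway through line 13.

```python
import math


def srt_to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp such as 00:01:14,200 to integer milliseconds."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


# Span covered by SRT lines 7 through 13 in the excerpt.
sentence_ms = srt_to_ms("00:01:14,200") - srt_to_ms("00:00:51,000")

# The Visual Explorer's native resolution, per the text.
window_ms = 4000

print(sentence_ms / 1000)                   # 23.2 seconds of speech
print(math.ceil(sentence_ms / window_ms))   # 6 window lengths of 4 seconds
```

A single sentence-level timestamp would therefore stand in for roughly six of the Visual Explorer's 4 second units.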

What can we do instead?

It turns out that Google Translate supports HTML-formatted input, which we can creatively harness to embed our timecode information and pass it through the translation. Translate treats HTML tags in one of two ways. Inline tags like <span> are invisible to the translation process: they can be placed anywhere, even in the middle of a proper name, without affecting the result. Breaking tags like <p> act like punctuation and break the translation where they appear, so a <p> tag in the middle of a name splits it apart and causes it to be translated differently. At the same time, Translate has fixed limits on the length of an HTML block, meaning we need to split transcripts at regular intervals into <p> blocks.

After extensive experimentation, the configuration that works best is to collapse the SRT file into a single HTML document, wrapping each SRT line within a <span></span> (shortened to <s></s>, since HTML tags are charged as part of the input length even though they are not translated), with <p></p> blocks used to shard the text so that it remains under Translate's per-block limits. The actual <p> boundaries don't matter so long as they appear between sentences, so we simply split whenever an SRT line ends on a period.

This yields an output HTML document like:

<p> ... <s id=7> против Украины не вызовет ответного ядерного удара </s> <s id=8> НАТО по России вместо этого будут настолько мощные </s> <s id=9> военные ответы, что вся Российская армия будет </s> <s id=10> уничтожен примерно то же самое сказала французский </s> <s id=11> президент макрон заявил, что отвечать на ядерный </s> <s id=12> удар по Украине Франция не станет. А лично он </s> ... </p>
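
The construction described above (one <s> per SRT line, a new <p> whenever a line ends on a period) can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the actual script: build_html is a hypothetical name, the sample cues are invented placeholders, and the HTML escaping of caption text is an added precaution that the original description does not mention.

```python
from html import escape


def build_html(cues: list[tuple[int, str]]) -> str:
    """Wrap each (id, text) cue in <s id=N> and shard the result into <p> blocks."""
    paragraphs: list[list[str]] = []
    current: list[str] = []
    for cue_id, text in cues:
        text = text.strip()
        # Assumption: escape &, < and > so caption text cannot be mistaken for markup.
        current.append(f"<s id={cue_id}> {escape(text, quote=False)} </s>")
        # Close the <p> block only when the SRT line ends on a period,
        # so that block boundaries always fall between sentences.
        if text.endswith("."):
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)  # trailing cues that never reached a period
    return "\n".join("<p> " + " ".join(block) + " </p>" for block in paragraphs)


# Invented placeholder cues, not taken from the broadcast.
cues = [
    (1, "первая строка"),
    (2, "конец предложения."),
    (3, "новое предложение"),
]
print(build_html(cues))
```

Because the split rule depends on where periods happen to fall, a long run of lines without a sentence-final period would produce a very large <p> block. A fuller implementation would presumably need a fallback for that case, but the block limit itself is not stated here, so none is sketched.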

You can see the complete output:

Copy this file to GCS.

We then queue this to Google Translate via:

time curl -H "X-Goog-User-Project: [PROJECTID]" -X POST -H "Content-Type: application/json; charset=utf-8" -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://translate.googleapis.com/v3beta1/projects/[PROJECTID]/locations/us-central1:batchTranslateText" -d '{
  "sourceLanguageCode": "ru",
  "targetLanguageCodes": ["en"],
  "inputConfigs": { "mimeType": "text/html", "gcsSource": { "inputUri": "gs://[BUCKET]/input.html" } },
  "outputConfig": { "gcsDestination": { "outputUriPrefix": "gs://[BUCKET]/BATCHOUTPUT/" } }                   
}'

Note a couple of key points about the above. The HTML file we create must have an extension of ".html" or the request will fail. The output path "gs://[BUCKET]/BATCHOUTPUT/" must also be empty: if it contains any other files, the request will fail.
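
When the request is generated by a script rather than typed by hand, the same JSON body can be assembled and checked in Python before it is sent. The sketch below mirrors the fields and endpoint of the curl command above; the function name build_batch_request and the example project and bucket names are hypothetical, and it only enforces the ".html" requirement, since whether the output path is empty cannot be verified without querying GCS.

```python
import json


def build_batch_request(project_id: str, input_uri: str, output_prefix: str,
                        source_lang: str = "ru", target_lang: str = "en"):
    """Return (url, body) for the batch request shown in the curl command."""
    if not input_uri.endswith(".html"):
        # Per the note above, a file without the .html extension will fail.
        raise ValueError(f"input file must end in .html: {input_uri}")
    url = (f"https://translate.googleapis.com/v3beta1/projects/{project_id}"
           "/locations/us-central1:batchTranslateText")
    body = {
        "sourceLanguageCode": source_lang,
        "targetLanguageCodes": [target_lang],
        "inputConfigs": {
            "mimeType": "text/html",
            "gcsSource": {"inputUri": input_uri},
        },
        "outputConfig": {
            "gcsDestination": {"outputUriPrefix": output_prefix},
        },
    }
    return url, body


# Placeholder project and bucket names for illustration only.
url, body = build_batch_request(
    "example-project",
    "gs://example-bucket/input.html",
    "gs://example-bucket/BATCHOUTPUT/",
)
print(url)
print(json.dumps(body, indent=2))
```

The sketch only assembles and checks the request; submitting it is still done with the curl command shown earlier.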

The command above will output a JOBID, which you can then use to track the progress via:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://translation.googleapis.com/v3/projects/[JOBID]"

This will tell you the progress of the translation process.

The final result is a translated version of the HTML file, which can then be read in, reassembled with the line IDs and time codes of the original native SRT, and used to construct a final translated SRT file that is precisely aligned with the original SRT transcript!
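
The reassembly step can be sketched in Python as follows. This is a hedged illustration, not the Perl script itself: rebuild_srt is a hypothetical name, the ids are assumed to be the original SRT line numbers, the English text in the sample is illustrative rather than actual Translate output, and the pattern tolerates quoted id attributes on the assumption that the returned HTML may be normalized.

```python
import re
from html import unescape

# Matches <s id=8>...</s> and also <s id="8">...</s> (quoting is an assumption).
SPAN = re.compile(r'<s\s+id="?(\d+)"?\s*>(.*?)</s>', re.DOTALL)


def rebuild_srt(translated_html: str, timings: dict[int, tuple[str, str]]) -> str:
    """Pair each translated <s> span with the original cue's timestamps."""
    blocks = []
    for match in SPAN.finditer(translated_html):
        cue_id = int(match.group(1))
        if cue_id not in timings:
            continue  # ignore ids that have no counterpart in the source SRT
        start, end = timings[cue_id]
        # Undo HTML entities and collapse any whitespace introduced in transit.
        text = " ".join(unescape(match.group(2)).split())
        blocks.append(f"{cue_id}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Timestamps copied from the original SRT excerpt; text is illustrative.
timings = {
    8: ("00:00:53,800", "00:00:56,600"),
    9: ("00:00:56,600", "00:01:00,700"),
}
translated = (
    "<p> <s id=8> against Ukraine would not cause a </s> "
    '<s id="9"> NATO nuclear retaliatory strike against Russia </s> </p>'
)
print(rebuild_srt(translated, timings))
```

Since the timestamps are looked up by id rather than recomputed, the translated SRT inherits the timing of the original cue for cue, whatever word order the translation chooses inside each span.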

We've pulled this pipeline into a simple Perl script that automates the entire process:

Run it like:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1_20221013_143000_60_minut.transcript.srt
time ./demo_translatesrt.pl ./RUSSIA1_20221013_143000_60_minut.transcript.srt ru

It will output:

Note that this script is merely a technical demonstration of the underlying pipeline: it lacks any kind of error handling and may incorrectly parse or process some lines. Despite this, it offers a proof-of-concept demonstration of high-resolution translation of SRT files, constructed from STT or other transcription workflows, from one language to another!