Experiments With In-Browser Speech Synthesis For Reading Translated Transcripts

As we continue to explore fully automated transcription + translation pipelines for Russian and Ukrainian television news, a natural question is how best to present the translated transcripts to end users. At present, we display the automated native-language transcripts via three interface components: a right-hand running transcript for intra-broadcast search, a clip-length transcript beneath playable clips, and realtime captioning overlays synced to the speech. For translated broadcasts we will simply replace the native transcript in these three places with the translated transcript, but one open question is whether automated speech synthesis has reached a point where we could also overlay an automated English-language spoken version of the translated transcript for further ease of use.

Most modern browsers now support built-in native speech synthesis. Invoking this machinery requires just three lines of JavaScript:

// Grab the browser's built-in speech synthesis controller.
var synth = window.speechSynthesis;
// Wrap the translated transcript text in an utterance object.
var utterance = new SpeechSynthesisUtterance('against Ukraine would not cause a NATO nuclear retaliatory strike against Russia, instead there would be such powerful military responses that the entire Russian the army will be destroyed about the same thing French President Macron said that France would not respond to a nuclear strike on Ukraine. And personally, he is against the world war, however, no one asks the macron. The entire Western Press is already writing a plan in case of a nuclear war Reuters. Together. That NATO ')
// Queue the utterance for playback using the default system voice.
synth.speak(utterance);

You can see live results in this online demo that also allows you to select from among the voices available on your local computer.
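Selecting a specific local voice looks something like the sketch below. Note that the list of installed voices can load asynchronously, and the English-language filter is just an illustrative choice for this example, not part of our pipeline:

// The list of installed voices may be empty until the browser fires "voiceschanged".
window.speechSynthesis.addEventListener('voiceschanged', function () {
    var voices = window.speechSynthesis.getVoices();
    // Illustrative choice: pick the first English-language voice; a real interface
    // would present the full list to the user.
    var english = voices.find(function (v) { return v.lang.indexOf('en') === 0; });
    var utterance = new SpeechSynthesisUtterance('Testing a locally installed voice.');
    if (english) { utterance.voice = english; }
    window.speechSynthesis.speak(utterance);
}, { once: true });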

One major challenge is that live synthesized speech still sounds highly stilted and robotic compared with offline synthesized speech from state-of-the-art cloud-based models. The native browser-based synthesis also offers only a single rate control, making it difficult to precisely time speech to the start/end timestamps of the video or to recreate the prosody of the original speech. There is, however, support for Speech Synthesis Markup Language (SSML), which allows partial restoration of features like prosody. Unfortunately, most speech-to-text systems do not currently record prosody or other secondary information in their transcripts, but as such models evolve, there is the possibility of passing this information through.
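As a rough sketch of what that single rate control permits, one can at least stretch or compress an utterance to approximate a clip's duration by assuming a baseline speaking rate and scaling accordingly; the 150 words-per-minute figure below is an assumption for illustration, not a measured property of any particular voice:

// Rough sketch: scale the single "rate" control so the synthesized speech roughly
// fills a clip's duration. The 150 words-per-minute baseline is an assumed average
// speaking rate, not a measured value, and browsers clamp rate to roughly 0.1-10.
function speakToFit(text, durationSeconds) {
    var words = text.split(/\s+/).length;
    var naturalSeconds = (words / 150) * 60;
    var utterance = new SpeechSynthesisUtterance(text);
    utterance.rate = Math.min(Math.max(naturalSeconds / durationSeconds, 0.5), 2);
    window.speechSynthesis.speak(utterance);
}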

The greatest challenge, however, is that television news features myriad speakers, often talking in multi-speaker environments such as debates or dialogues. Having the same robotic voice represent every speaker, or even leveraging all of the English voices available on a user's machine to differentiate them, yields a result that simply does not work.
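For completeness, that multi-voice experiment amounts to something like the sketch below, which simply rotates through whatever English voices the machine reports and assigns one to each diarized speaker label; the speaker labels here are hypothetical placeholders standing in for the output of a diarization step:

// Assign each speaker label its own local English voice by rotating through the
// voices the browser reports (the list may be empty until "voiceschanged" fires).
var voices = window.speechSynthesis.getVoices().filter(function (v) {
    return v.lang.indexOf('en') === 0;
});
function speakSegment(speakerLabel, text) {
    // "SPEAKER_1", "SPEAKER_2", ... are hypothetical diarization labels.
    var index = parseInt(speakerLabel.replace(/\D/g, ''), 10) || 0;
    var utterance = new SpeechSynthesisUtterance(text);
    if (voices.length > 0) {
        utterance.voice = voices[index % voices.length];
    }
    window.speechSynthesis.speak(utterance);
}
speakSegment('SPEAKER_1', 'First speaker in the translated dialogue.');
speakSegment('SPEAKER_2', 'Second speaker in the translated dialogue.');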

Seeing well-known public figures with instantly recognizable voices represented by an artificial voice that sounds nothing like them yields an experience that is so disconcerting as to cause the viewer to stop the video.

In contrast, with cloud-based speech synthesis it is possible to carry through the unique vocal characteristics of a speaker's voice and transfer them to the translated transcript as a form of "deep fake," but this raises enormous ethical considerations that preclude such a use case.

Instead, after considerable experimentation with the various options above, we have determined that the best approach at this time is simply to display our translated transcripts in the same fashion we presently display the native-language transcripts and accept the added cognitive load of managing the native-language spoken dialogue on top of the translated captioning overlay. Given that anyone who has ever watched a subtitled film has experience with this metaphor and interface format, we believe this modality will work well.