Experiments With GCP's Chirp: An Example Of Multilingual Transcription

Television news across the world frequently contains excerpts of languages other than the dominant language of the broadcast, either through codeswitching or, most commonly, through a few seconds of the original audio played before overdubbing begins. For example, an interview on US television might feature President Macron speaking in French for a few seconds before the dubbed English translation begins, just to give a sample of the original speech. Historically, speech transcription systems often handled multilingual speech by attempting to locate the most similar-sounding word in the broadcast's primary language, meaning that a French language excerpt in a Russian broadcast would be transcribed as the most similar-sounding Russian words, yielding inaccurate results. Fascinatingly, GCP's new Chirp ASR model, despite not officially supporting multilingual speech, does appear to yield at least partial results.

Take this clip from Taiwanese television news channel TTV that features a brief excerpt of speech before the overdubbing begins. Chirp transcribes it as:

Aktualisht sot mund të themi që 80% mund të jen t shipt全球對許多意大利人來說國內旅遊三緣更讓遊客果足不錢 阿爾亞海距離80公里旅利人 anni fa, abbiamo sentito inizialmente parlare dell'Albania, delle belle coste, anche dei prezzi buoni, indbiamente.中海海式影響,強調果,海鮮,感奶油,結和土爾地Papa王餐亞利總理梅洛尼跟家人來度假當面投速

In this case, even though there is a single speaker, Chirp transcribes the first words as Albanian (translated as "Currently today we can say that 80% can be t shipt" via Google Translate), then switches to Mandarin ("For many Italians around the world, domestic tourism is one of the most difficult situations in the world, and tourists are short of money. 80km traveler"), then to Italian ("years ago, we initially heard about Albania, the beautiful coasts, also the good prices, of course.").
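To locate where a transcript like the one above changes language, one rough option is to split it into runs by Unicode script. The following is a minimal sketch and not part of Chirp or any GCP API: the helper names `script_of` and `split_by_script` are hypothetical, and the approach only separates scripts (Latin versus CJK here), so it cannot tell Albanian from Italian, which would need a real language identifier run on each segment.

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label, or None for neutral characters."""
    if not ch.isalpha():
        return None  # digits, punctuation and whitespace carry no script
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "CJK"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"


def split_by_script(text):
    """Split text into (script, segment) runs; neutral characters
    stay attached to the run that precedes them."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            # Same script, or a neutral character: extend the current run.
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        elif script is None:
            # Neutral characters at the very start of the text.
            runs.append((None, ch))
        elif runs and runs[-1][0] is None:
            # First scripted character after leading neutrals labels the run.
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            # Script changed: start a new run.
            runs.append((script, ch))
    return runs


# A short fragment shaped like the Chirp output quoted above.
sample = "mund të jen全球對許多 anni fa, abbiamo"
runs = split_by_script(sample)
for script, segment in runs:
    print(script, repr(segment))
```

Each Latin-script run could then be passed to a language identification step to separate the Albanian and Italian portions.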

Interestingly, the onscreen captioning is "今天8成都是外國觀光客 其他的遊客才是本國人 或科索沃來的阿爾巴尼亞人", which Google Translate translates to "Today, 80% of Chengdu is foreign tourists. Other tourists are locals or albanians from kosovo", while the chyron states "Tourists flock to Albania, the number of visits increased by 50% compared with last year."

Another example is the preceding broadcast, where the first part of this English clip is correctly transcribed as "it's my first time here and basically my group of friends and high likes to spend time together and we just thought why not", though the transcript then cuts off.

While imperfect and in this case conflating languages, Chirp offers a glimpse of the future of multilingual transcription, even though the model does not officially support multilingual speech.