Transcribing 2.5M Hours Of TV News: First Experiments With Applying Language Detection To Chirp's Multilingual Speech Transcription

Last September we examined how Google's new Universal Speech Model (USM), called Chirp, was the first automated speech transcription system we had ever tested to support truly multilingual speech transcription. Chirp is a large-speech-model transcription system that offers state-of-the-art transcription across more than a hundred languages and dialects. One of its most powerful aspects is that as it transcribes a video, if it encounters speech in languages other than the expected primary language, it will seamlessly and transparently transcribe each word in its correct language. Thus, a 60-second clip from Chinese television news in which Arabic, Chinese and English were all intermixed was seamlessly transcribed by Chirp, even though the presence of Arabic and English was entirely unexpected and Chirp had been told the video would be only in Chinese.

As we've begun running language detection tools on the Chirp transcripts, we're finding fascinating and often highly unexpected examples of multilingual speech across the world's television news. For example, in the same Chinese television clip in which we had previously manually identified Arabic, Chinese and English, our automated language detection scan turned up a brief clip of Indonesian later in the same broadcast that we had previously missed.
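To give a sense of how such a scan might flag mixed-language segments, here is a minimal illustrative sketch, not the actual tooling used in this work, which is unnamed. It uses a simple Unicode-script heuristic (sufficient for distinguishing Arabic, Chinese and Latin-script text like the clip described above); a production pipeline would use a proper statistical language detector, since script alone cannot separate, say, English from Indonesian.

```python
import unicodedata

def dominant_script(token: str) -> str:
    """Classify a token by the Unicode script of its first letter.

    A toy heuristic: real language detection would use a trained model,
    since many languages share the Latin script.
    """
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "CJK" in name:
            return "Chinese"
        if "ARABIC" in name:
            return "Arabic"
        if "LATIN" in name:
            return "Latin"
    return "Other"

def scripts_present(transcript: str) -> list[str]:
    """Return the sorted set of scripts found in a transcript snippet."""
    return sorted({dominant_script(t) for t in transcript.split()} - {"Other"})
```

Running `scripts_present("hello مرحبا 你好")` would flag all three scripts in one snippet, the kind of within-broadcast mixing described above.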

At the end of a Southern Sudan broadcast, the scan flagged the multilingual nature of the channel itself, with an Arabic-language broadcast followed by an English-language one.

These are just two brief examples we've already discovered as we ramp up our language detection analysis of this massive new archive of 2.5 million hours of transcribed global television news.