Google's Chirp & Truly Multilingual Global Speech Transcription: An Example Of Three Languages In 60 Seconds

Google's new Universal Speech Model, Chirp, is a large speech model that offers state-of-the-art transcription across more than a hundred languages and dialects. While the model does not officially support multilingual transcription, in practice it has proven surprisingly adept at handling codeswitching and multiple speakers in different languages. Below is a fascinating example of just how well Chirp handles multilingual audio. In a single segment of less than three minutes we have Chinese, then English, then Chinese, then Arabic, then Chinese again, back to back. You can see the raw Chirp transcript below and compare it with the actual clip. In this case, Chirp was told to expect only Chinese, since we were not anticipating any English or Arabic in this broadcast, yet it handled the unexpected languages and codeswitched seamlessly. These results demonstrate the immense power and potential of large speech model systems:

… I have been there for two times already when I was 15 years ago when I still um first mayor then of Tagaytay so now it's definitely it's very beautiful and very developed this will be a different concept now uh the… beauty and the combination of the beauty of the city, the natural beauty and the high technology combination of both, the lake, artificial intelligence, all the the new technologies now in Hango, before it was different, much more today,本菲的運動有400名多, four years ago, five years ago in in Indonesia. 今 年 科 威 特 男 子 双 向 飞 迪 相 目 共 有 三 名 运 动 员 参 赛 。 这 位 在 训 练 建 系 优 险 品 查 的 运 动 员 叫 阿 布 都 拉 拉 西 敌 , 是 克 菲 特 飞 碟 射 击 队 的 传 期 人 物 , 曾 获 得 过 多 个 世 界 比 赛 冠 军 , 两 界 亚 运 会 设 计 比 赛 的 金 牌 。 在 2010 年 的 广 洲 亚 运 会 上 , 他 第 一 次 实 现 了 自 己 的 亚 运 金 牌 梦 。 今 年 60 岁 的 他 将 继 续 出 展 , 和 儿 子 曼 苏 尔 拉 西 敌 一 起 代 表 科 微 特 参 加 韩 州 亚 运 阿 布 独 拉 的 儿 子 曼 苏 尔 是 2018 年 亚 加 亚 运 会 , 男 子 双 向 飞 蝶 个 人 相 目 的 冠 军 。 بانها تكون في الصين لاني في 2010 اخذت البطوله في الصين والان ان شاء الله راح ارجع الصين واخذها بالصين انا بطل اسياد اسيا السابق 2018 فا ودي احرص على لقبي في هذه البطوله. وكانت مع بحثنا في القريه الاولمبيه الاسيويه اللي في الصين كانت مشوقه وكانت تعطينا طابع حلو انه نكون نشارك ونشوف المرافق الصينيه بابداعاتهم الحلوه لقد تم تجهيز امتعتي بالكامل وانا جاهز با 近 日 国 家 计 算 机 病 毒 应 集 处 理 中 心 和 …

View Clip.
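
One rough way to see the language switches in a transcript like the one above is to group its letters by Unicode script. The sketch below is only an illustration: the helper names `script_of` and `script_runs` are hypothetical, the sample string is a few short fragments copied from the transcript, and script is only a proxy for language (Latin script does not imply English, for example).

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Return a coarse script label for a letter, or None for digits, spaces and punctuation."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "Han"
    if name.startswith("ARABIC"):
        return "Arabic"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"


def script_runs(text):
    """Collapse the text into consecutive (script, letter_count) runs."""
    labels = [label for label in map(script_of, text) if label is not None]
    return [(script, sum(1 for _ in group)) for script, group in groupby(labels)]


# Fragments copied from the raw transcript above (illustrative excerpt only).
sample = (
    "much more today,本菲的運動有400名多, four years ago, "
    "今 年 科 威 特 男 子 双 向 飞 "
    "بانها تكون في الصين"
)

for script, letters in script_runs(sample):
    print(script, letters)
```

On this excerpt the runs come out in the order Latin, Han, Latin, Han, Arabic, mirroring the back-and-forth visible in the transcript even though only Chinese was requested.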

 

For those interested in the actual Chirp config used to generate this transcript, we first use ffmpeg to convert the MP4 video file to a mono MP3 audio file, since the STT V2 API does not support MP4 files natively. To optimize the transcoding process, we use GCS streaming reads and writes, buffered with mbuffer to smooth bursty IO. This means the local VM or compute surface requires no local writable disk: the MP4 is streamed directly from GCS into ffmpeg in memory and the resulting MP3 is piped directly back to GCS, all without any data touching local disk:

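# Install ffmpeg (transcoding) and mbuffer (output buffering).
apt-get -y install ffmpeg
apt-get -y install mbuffer
# Stream the MP4 from GCS into ffmpeg, downmix to mono MP3, buffer, and stream back to GCS.
# timeout sends SIGKILL (-s 9) to the gsutil read if it runs longer than 60 minutes.
timeout -s 9 60m gsutil -q cat "gs://[YOURGCSPATH]/CCTV13_20230914_040000_30.mp4" | ffmpeg -nostdin -hide_banner -loglevel panic -f mp4 -i pipe: -ac 1 -f mp3 pipe: | mbuffer -q -m 15M | gsutil -q cp - "gs://[YOURGCSPATH]/CCTV13_20230914_040000_30.mp3"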

We then use the following request to GCP's STT V2 to apply Chirp to the MP3 audio file. Note that under language_codes, we only specified Chinese, since that was the only language we expected to find in this file. Yet, Chirp still seamlessly transcribed the English and Arabic it encountered in the file even though they were not listed:

{
  "config": {
    "auto_decoding_config": {},
    "language_codes": ["zh-Hans-CN"],
    "model": "chirp",
    "features": { "enableAutomaticPunctuation": true, "enableWordTimeOffsets": true, "maxAlternatives": 30 }
  },
  "files": [ {"uri": "gs://[YOURGCSPATH]/CCTV13_20230914_040000_30.mp3"}],
  "recognitionOutputConfig": { "gcsOutputConfig": { "uri": "gs://[YOURGCSPATH]/CCTV13_20230914_040000_30/asr/" } }
}

That's literally all it takes to apply Chirp to any video!
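
For readers who want to send the request above programmatically, here is a minimal Python sketch using only the standard library. It is not the exact tooling used here: the function names, the `my-project` project ID and the `us-central1` region are placeholders, and the regional `batchRecognize` URL pattern reflects our reading of the public STT V2 REST documentation, so verify it against the current docs. The access token is assumed to come from something like `gcloud auth print-access-token`.

```python
import json
import urllib.request


def build_batch_recognize_request(project_id, region, audio_uri, output_uri, language_codes):
    """Build the (assumed) STT V2 batchRecognize URL and the JSON body shown above."""
    url = (
        f"https://{region}-speech.googleapis.com/v2/projects/{project_id}"
        f"/locations/{region}/recognizers/_:batchRecognize"
    )
    body = {
        "config": {
            "auto_decoding_config": {},
            "language_codes": list(language_codes),
            "model": "chirp",
            "features": {
                "enableAutomaticPunctuation": True,
                "enableWordTimeOffsets": True,
                "maxAlternatives": 30,
            },
        },
        "files": [{"uri": audio_uri}],
        "recognitionOutputConfig": {"gcsOutputConfig": {"uri": output_uri}},
    }
    return url, body


def submit(url, body, access_token):
    """POST the request and return the parsed JSON response (a long-running operation)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder project and region; the GCS paths are the same placeholders used above.
url, body = build_batch_recognize_request(
    project_id="my-project",
    region="us-central1",
    audio_uri="gs://[YOURGCSPATH]/CCTV13_20230914_040000_30.mp3",
    output_uri="gs://[YOURGCSPATH]/CCTV13_20230914_040000_30/asr/",
    language_codes=["zh-Hans-CN"],
)
print(url)
print(json.dumps(body, indent=2))
```

Building the body as a Python dictionary and serializing it with `json.dumps` guarantees the payload is valid JSON. Calling `submit(url, body, token)` would send it; the sketch only prints the request so it can be inspected without credentials.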