Earlier this month, in collaboration with the Internet Archive's TV News Archive, we completed the machine transcription of its complete 23-year archive, spanning more than 2.5 million hours of global television news from 50 countries, using Google's Chirp ASR model. As we work to make this massive new transcript archive searchable, one of the first questions we must answer is the linguistic breakdown of the archive. After all, many words are shared by multiple languages with vastly different meanings in each, so the ability to limit searches by language is important. Even more importantly, many keyword search engines, such as Elasticsearch, are built around the concept of whitespace-delimited "words" and require specialized workflows for scriptio continua languages, so we need to understand the broad contours of the linguistic landscape of this quarter-century archive in order to plan and optimize how we organize it for search.
Chirp is an LSM (Large Speech Model), meaning it can seamlessly transcribe multilingual content. If a broadcast intermixes multiple languages in rapid succession, Chirp typically does an excellent job of tracking those changes and correctly transcribing each language. Unfortunately, due to the way LSMs work, while Chirp correctly transcribes each language, it does not annotate its transcription to record that a given chunk of audio spans multiple languages or which language each word represents. In other words, Chirp will correctly transcribe a mixed Arabic-Chinese-English segment, but nothing in the transcript will indicate that the segment contains multiple languages or which word is in which language: there will simply be a block of text in multiple languages. This means that to get a sense of the linguistic breakdown of the archive, we have to run third-party language detection algorithms over each transcript.
Given the scope and scale of the Television News Archive, there are likely to be a wide range of languages represented, including underrepresented languages that aren't robustly recognized by many language detection tools, so we need a detector that supports a large number of languages. Critically, not all language detectors are designed for robust detection over intermixed and code-switching speech in which multiple languages are tightly interwoven, sometimes changing every sentence or every few sentences. Many of the detectors we tested failed to properly segment the kind of rapid-fire language changes that often characterize multilingual television news, which frequently takes the form of rapidly spliced clips of different people speaking in different languages. Detectors would often miss even trivial segmentations, such as an isolated English sentence in the middle of a block of Chinese text, where character-set analysis alone would be sufficient to identify the language shift. Complicating matters, LSMs can introduce quirks into their transcriptions, such as Chirp's propensity to randomly space-segment scriptio continua languages: it will correctly transcribe a block of text, then suddenly inject spaces between every ideogram, then revert to correct transcription, then back to spacing. These idiosyncrasies seem to challenge many language detectors, causing them to label text with the wrong language or simply fail to assign a label at all.
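To illustrate just how trivial that baseline case is, the example of an isolated English sentence inside a block of Chinese text really can be caught with script analysis alone. Here is a minimal sketch in Python, using only the standard library (an illustration of the idea, not part of our production pipeline):

```python
import unicodedata

def script_of(ch):
    """Coarse script class for a character: 'latin', 'cjk', or 'other'.
    Non-alphabetic characters (spaces, digits, punctuation) return None
    so they inherit the script of the surrounding run."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "cjk"
    if "LATIN" in name:
        return "latin"
    return "other"

def script_runs(text):
    """Split text into maximal runs of a single script."""
    runs = []
    cur_script, cur = None, []
    for ch in text:
        s = script_of(ch)
        if s is None or s == cur_script or cur_script is None:
            cur.append(ch)
            if cur_script is None:
                cur_script = s
        else:
            runs.append((cur_script, "".join(cur)))
            cur_script, cur = s, [ch]
    if cur:
        runs.append((cur_script, "".join(cur)))
    return runs
```

Of course, a real detector must go far beyond script runs, since many languages share a script (English and Indonesian are both Latin-script, for example), which is exactly where CLD2's statistical scoring comes in.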
The end result is that of the various language detectors we evaluated, the decade-old CLD2 actually outperformed far more modern tools (even LLMs proved too unstable, with various hallucinations and errors, to use effectively), delivering extremely precise segmentation. Moreover, CLD2's simplicity means that for edge cases where the segmentation correctly isolated a chunk of text but was unable to assign a label, we can run CLD2 on just that isolated chunk and use its various internal scoring tables to arrive at the correct answer in the majority of cases.
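That fallback can be sketched as a second pass over CLD2-style segmentation output. The pycld2 bindings, called as detect(text, returnVectors=True), return per-chunk (offset, num_bytes, language, code) vectors, with chunks CLD2 declined to label marked "Unknown". In the sketch below the detector is injected as a callable so the code stays self-contained; note that the Python bindings do not expose CLD2's internal scoring tables, which our actual workflow reads, so this is a simplified illustration of the idea:

```python
def relabel_unknown_chunks(vectors, text_bytes, detect):
    """Second pass over CLD2-style segmentation vectors.

    vectors:    iterable of (offset, num_bytes, language) tuples, where
                language may be "Unknown" when CLD2 segmented the chunk
                but declined to label it in context.
    text_bytes: the UTF-8 transcript the offsets index into.
    detect:     a callable(bytes) -> language name, e.g. a wrapper
                around pycld2.detect returning its top-scoring language.
    """
    labeled = []
    for offset, length, lang in vectors:
        chunk = text_bytes[offset:offset + length]
        if lang == "Unknown":
            # Re-run the detector on the isolated chunk: with the
            # surrounding languages removed, a label often emerges.
            lang = detect(chunk)
        labeled.append((offset, length, lang, chunk))
    return labeled
```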
What does this all look like in real life? Let's take this Chinese-language broadcast we've used in our past experiments. Chirp was told that the broadcast was exclusively in Chinese, since that was our expectation. Yet if you skim through the broadcast, you will see that it seamlessly transcribed the brief excerpts of other languages that appear in several places scattered through the broadcast. Nothing in the Chirp transcript tells us that those excerpts are in other languages, so we must use CLD2 to annotate which word is in which language.
Here is the final CLD2 language breakdown of the broadcast, showing that it is 87.2% Chinese ("ChineseT" + "Chinese", to use CLD2's language codes), with 714 characters of Arabic, 503 characters of English and 142 of Indonesian. Excerpts in each of these languages appear in the broadcast. CLD2 is also correct that there are a handful of Marathi and Hindi characters in the transcription, but in this case those are Chirp errors; having CLD2 flag them can help us understand the contexts under which those errors occur:
{ "bytes" : 9683, "chars" : 5083, "lang" : "ChineseT", "percBytes" : 48.5704253611557, "percChars" : 47.4072001492259 },
{ "bytes" : 8303, "chars" : 4267, "lang" : "Chinese", "percBytes" : 41.6482744783307, "percChars" : 39.7966797239321 },
{ "bytes" : 1282, "chars" : 714, "lang" : "ARABIC", "percBytes" : 6.43057784911718, "percChars" : 6.65920537213207 },
{ "bytes" : 503, "chars" : 503, "lang" : "ENGLISH", "percBytes" : 2.52307383627608, "percChars" : 4.69128893863085 },
{ "bytes" : 142, "chars" : 142, "lang" : "INDONESIAN", "percBytes" : 0.712279293739968, "percChars" : 1.32437977989181 },
{ "bytes" : 13, "chars" : 9, "lang" : "MARATHI", "percBytes" : 0.0652086677367576, "percChars" : 0.0839395635142697 },
{ "bytes" : 10, "chars" : 4, "lang" : "HINDI", "percBytes" : 0.0501605136436597, "percChars" : 0.0373064726730088 }
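A breakdown like this is a straightforward aggregation over CLD2's per-chunk output. The following sketch shows how such a table can be computed from (language, text) pairs; note that our pipeline works from CLD2's byte-offset vectors, whereas this illustrative version recomputes byte counts as UTF-8 lengths:

```python
from collections import defaultdict

def language_breakdown(chunks):
    """Aggregate (language, text) chunks into per-language byte/char
    counts and percentages, sorted largest-first."""
    byte_counts = defaultdict(int)
    char_counts = defaultdict(int)
    for lang, txt in chunks:
        byte_counts[lang] += len(txt.encode("utf-8"))
        char_counts[lang] += len(txt)
    total_bytes = sum(byte_counts.values())
    total_chars = sum(char_counts.values())
    rows = []
    for lang in sorted(byte_counts, key=byte_counts.get, reverse=True):
        rows.append({
            "lang": lang,
            "bytes": byte_counts[lang],
            "chars": char_counts[lang],
            "percBytes": 100.0 * byte_counts[lang] / total_bytes,
            "percChars": 100.0 * char_counts[lang] / total_chars,
        })
    return rows
```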
For those interested in how CLD2 segments the transcript, you can see the chunks in order below, along with up to the first 50 characters of each chunk. Fascinatingly, this shows that there are actually two separate Arabic excerpts at two different points in the broadcast:
{ "lang" : "Chinese", "txt" : " 观眾朋友中看中和播出的新文30分 來看到 文台 9月14日向2023北 习 近近指久的優同話就有開..." },
{ "lang" : "MARATHI", "txt" : "लै 60826 ..." },
{ "lang" : "ChineseT", "txt" : "亿 千 畫 , 同 比 增 长 ,5.0%937亿 。 将 集 中 展 示 全 球 执 生 机 領 ..." },
{ "lang" : "ENGLISH", "txt" : "I have been there for two times already when I was..." },
{ "lang" : "ChineseT", "txt" : "本菲的運動有400名多, ..." },
{ "lang" : "ENGLISH", "txt" : "four years ago, five years ago in in Indonesia. ..." },
{ "lang" : "Chinese", "txt" : "今 年 科 威 特 男 子 双 向 飞 迪 相 目 共 有 三 名 运 动 员 参 赛 。 这 位 ..." },
{ "lang" : "ARABIC", "txt" : "بانها تكون في الصين لاني في 2010 اخذت البطوله في ا..." },
{ "lang" : "Chinese", "txt" : "近 日 国 家 计 算 机 病 毒 应 集 处 理 中 心 和 360 公 司 对 一 款 名 为 ..." },
{ "lang" : "INDONESIAN", "txt" : "tadi mencoba nyaman dan pada kecepatan tadi 350 ti..." },
{ "lang" : "Chinese", "txt" : "亚鐵是印度和南和民城全142 高 铁 动 车 组 是 由 中 方 企 业 采 用 十 速 350 公..." },
{ "lang" : "ARABIC", "txt" : "في البحر. جماعي. حتى المقابر ماشالنهمش. ومازت نقول..." },
{ "lang" : "ChineseT", "txt" : "現 在 内 的 西 部 大 部 分 地 区 , 国 民 代 表 大 会 与 国 民 军 结 盟 主 ..." },
{ "lang" : "HINDI", "txt" : "राय ..." },
{ "lang" : "Chinese", "txt" : "高 端 白 酒 新 选 择 , 来 自 刺 水 和 上 流 , 貴 州 金 沙 。 东 西 皮 菜 ..." }