A First Glimpse Of Using Language Detection To Identify Language Examples From The TV News Archive

A year ago we explored applying language detection to selections from the Chirp-generated ASR transcripts of the TV News Archive, finding that CLD2 produced the most robust results despite its age and simplicity. We are now scaling up CLD2 to the entire 3 million hours of machine transcriptions in the Archive, including having it segment multilingual transcripts into their component languages. We store the resulting per-broadcast language breakdown statistics in our Bigtable digital twin, where BigQuery can rapidly scan the entire corpus for lists of broadcasts containing excerpts in specific languages, including just how much of each language appears in each broadcast. Here we use this workflow to instantly identify two sample broadcasts from a given day that contained Vietnamese and Burmese speech, respectively, using a single SQL query. Note that although Burmese is a scriptio continua language, Chirp's ASR output attempts to segment it into discrete words, while the Vietnamese "wordcount" is actually a syllable count, so character counts are a more accurate measure than word counts for these languages. As we scale this workflow across the entire Archive, we are tremendously excited to see how this approach can help us understand multilingualism and the patterns of speech of each language represented in the Archive's quarter century of broadcasts.
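To make the difference between byte, character and word counts concrete, here is a minimal Python sketch. It is an illustration only, not the code used in our pipeline: the `text_stats` helper and the two sample strings are hypothetical, "words" here simply means whitespace-separated tokens, and whether the real character counts include spaces is not something this sketch assumes to know.

```python
def text_stats(text: str) -> dict:
    """Return UTF-8 byte, code point and whitespace-token counts for a string."""
    return {
        "bytes": len(text.encode("utf-8")),  # size of the UTF-8 encoding
        "chars": len(text),                  # Unicode code points (spaces included)
        "words": len(text.split()),          # whitespace-delimited tokens
    }


# "Viet Nam" written with its diacritic: one name, but two space-separated syllables.
VIETNAMESE = "Vi\u1ec7t Nam"

# "Myanmar" in Burmese script: six code points with no internal spaces.
BURMESE = "\u1019\u103c\u1014\u103a\u1019\u102c"

if __name__ == "__main__":
    for label, sample in (("Vietnamese", VIETNAMESE), ("Burmese", BURMESE)):
        print(label, text_stats(sample))
```

The Vietnamese sample yields two tokens for a single name, which is why a whitespace "wordcount" behaves like a syllable count, while every Burmese code point takes three bytes in UTF-8, which is why byte counts overstate Burmese relative to Latin-script text.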

Searching for Vietnamese speech on September 1, 2022, we find that the broadcast with the most Vietnamese speech is the following Al Jazeera broadcast, in which 6.7% of the transcribed speech (by character count) is Vietnamese, as in this clip.

Row	id	lang	langBytes	langChars	langWords	langCharPercent
1	ALJAZ_20220901_223000_101_East__Trafficked_From_Vietnam_To_Europe	VIETNAMESE	1900	1441	335	6.7041965199590585

Searching for Burmese speech on the same day, we find the following broadcast, in which 15.4% of the transcribed speech (by character count) is Burmese, as in this clip.

Row	id	lang	langBytes	langChars	langWords	langCharPercent
1	ALJAZ_20220901_060000_Exiled	BURMESE	10799	4095	727	15.44350580781415
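The selection logic behind the two results above can be sketched in Python as a filter by language and day followed by a sort on the character percentage. This is an assumption-laden illustration rather than the actual SQL query: the `LangRow` type and `top_broadcast` helper are hypothetical, the two rows are copied from the result tables above, and the sketch assumes the second underscore-separated field of the broadcast id is the air date in YYYYMMDD form.

```python
from typing import List, NamedTuple, Optional


class LangRow(NamedTuple):
    id: str
    lang: str
    lang_bytes: int
    lang_chars: int
    lang_words: int
    lang_char_percent: float


# The two rows shown in the result tables above.
ROWS = [
    LangRow("ALJAZ_20220901_223000_101_East__Trafficked_From_Vietnam_To_Europe",
            "VIETNAMESE", 1900, 1441, 335, 6.7041965199590585),
    LangRow("ALJAZ_20220901_060000_Exiled",
            "BURMESE", 10799, 4095, 727, 15.44350580781415),
]


def top_broadcast(rows: List[LangRow], lang: str, date: str) -> Optional[LangRow]:
    """Return the broadcast with the highest character share of `lang` on `date`."""
    # Assumption: ids look like STATION_YYYYMMDD_HHMMSS_ShowName.
    matches = [r for r in rows if r.lang == lang and r.id.split("_")[1] == date]
    return max(matches, key=lambda r: r.lang_char_percent, default=None)


def bytes_per_char(row: LangRow) -> float:
    """Average UTF-8 bytes per character for the detected language span."""
    return row.lang_bytes / row.lang_chars


if __name__ == "__main__":
    for lang in ("VIETNAMESE", "BURMESE"):
        row = top_broadcast(ROWS, lang, "20220901")
        if row is not None:
            print(lang, row.id, round(bytes_per_char(row), 2))
```

The bytes-per-character ratio is roughly twice as high for the Burmese row as for the Vietnamese row, which is one reason to compare languages by character share rather than by bytes.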

The GDELT Project
