Behind The Scenes: A First Glimpse At ASR Statistics From 2.5 Million Hours Of Global TV News Spanning 50 Countries & A Quarter Century

Last year we announced the successful completion of Large Speech Model (LSM)-powered ASR over the totality of the uncaptioned Television News Archive, applying GCP's state-of-the-art Chirp speech-to-text technology to more than 2.5 million hours of television news from 50 countries spanning portions of the past quarter-century. Today we are excited to report the first statistics from that immense initiative. In all, we ASR'd 8.8 billion seconds (146.9M minutes / 2.45M hours) of airtime, of which 6.856 billion seconds (77.8%) contained at least one recognizable word. The rest of the airtime consists of silence, music, sound effects, noise, ambient field sound and unintelligible or unrecognized human speech. The transcripts of the recognized speech contain 111.3GB of text, consisting of 78.1 billion characters representing 14.18 billion words (Chirp segments scriptio continua languages into discrete "words", so this word count correctly represents languages like Chinese).
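As a quick sanity check on how these headline figures relate to one another, the arithmetic can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline that produced the statistics: the constant and function names are hypothetical, the inputs are the rounded figures quoted above, and it assumes "GB" means 10^9 bytes.

```python
# Rounded figures as quoted in this post.
TOTAL_MINUTES = 146.9e6    # total airtime processed, in minutes
SPEECH_SECONDS = 6.856e9   # seconds containing at least one recognized word
TEXT_BYTES = 111.3e9       # assumption: decimal gigabytes (10**9 bytes)
CHARACTERS = 78.1e9        # characters in the transcripts
WORDS = 14.18e9            # words in the transcripts


def derive_stats():
    """Derive secondary ratios from the rounded headline figures."""
    total_seconds = TOTAL_MINUTES * 60
    return {
        "total_seconds": total_seconds,
        "total_hours": TOTAL_MINUTES / 60,
        # Share of airtime with at least one recognized word.
        "speech_share": SPEECH_SECONDS / total_seconds,
        # Average word length in characters, across all languages.
        "chars_per_word": CHARACTERS / WORDS,
        # Values above 1 reflect multi-byte encodings of non-Latin scripts
        # and any non-text overhead in the stored transcripts.
        "bytes_per_char": TEXT_BYTES / CHARACTERS,
        # Rough speaking rate over seconds that contained speech.
        "words_per_speech_minute": WORDS / (SPEECH_SECONDS / 60),
    }


if __name__ == "__main__":
    for name, value in derive_stats().items():
        print(f"{name}: {value:,.3f}")
```

Because the inputs are rounded, the derived values are approximate: roughly 5.5 characters per word, 1.4 bytes per character and about 124 words per minute of speech-bearing airtime. The last figure is only a rough indicator, since a second counts as speech if it contains even a single recognized word.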

The timeline below shows the total seconds of uncaptioned airtime per year across the TV News Archive. It begins with an initial burst in 2001 from the 9/11 Archive, followed by no uncaptioned collection until 2007, a ramp-up to a peak in 2012, a decline through 2021 and then a renewed increase. The year 2024 is incomplete at this time as we work to catch up on processing the last three months of the year.

With the launch of the new TV Explorer, we will be incorporating this enormous ASR archive to make the complete TV News Archive full-text searchable for the first time since its inception nearly a quarter-century ago!