Transcribing 2.5M Hours Of TV News: 4.25M Global Broadcasts Processed Using GCP’s Chirp LSM

Two weeks ago we unveiled the first glimpse of our massive collaboration with the Internet Archive's Television News Archive to transcribe its complete uncaptioned global archive: all 2.5 million hours of global television news have been transcribed using GCP's Chirp LSM model. As we begin examining this unprecedented global transcription archive, we've already discussed how it took just 7 days to transcribe the entire archive, a job that could have been completed in under 12 hours had we focused on speed. One interesting new statistic is that those 2.5 million hours of airtime are spread across 4.25 million broadcasts. As of a year and a half ago, the entire archive held 5.25 million broadcasts totaling 3.43 million hours of airtime, a portion of which already contained broadcaster-provided, human-transcribed closed captioning. While the Archive has continued to grow substantially in the year and a half since those statistics were compiled, this suggests that as much as 75% of the Archive's total holdings were uncaptioned and thus will soon be rendered searchable for the first time!
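For readers who want to check the arithmetic behind these figures, here is a minimal sketch in Python. It uses only the rounded numbers quoted above, and the variable names are our own for illustration. Because the archive-wide totals are a year and a half old while the transcription totals are current, the resulting shares are rough upper estimates rather than exact figures.

```python
# Rounded figures quoted in the post; all results are therefore approximate.
UNCAPTIONED_HOURS = 2.5e6         # hours transcribed with Chirp
UNCAPTIONED_BROADCASTS = 4.25e6   # broadcasts those hours are spread across
ARCHIVE_HOURS_OLD = 3.43e6        # whole archive, as of ~18 months earlier
ARCHIVE_BROADCASTS_OLD = 5.25e6   # whole archive, as of ~18 months earlier
TRANSCRIPTION_DAYS = 7            # wall-clock time the transcription took


def share(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Uncaptioned share of the archive, measured by airtime and by broadcast count.
# The denominators are older than the numerators, so these overstate the share.
hours_share = share(UNCAPTIONED_HOURS, ARCHIVE_HOURS_OLD)
broadcast_share = share(UNCAPTIONED_BROADCASTS, ARCHIVE_BROADCASTS_OLD)

# Average length of a transcribed broadcast, in minutes.
avg_minutes = 60.0 * UNCAPTIONED_HOURS / UNCAPTIONED_BROADCASTS

# Hours of airtime transcribed per wall-clock hour over the 7-day run.
throughput = UNCAPTIONED_HOURS / (TRANSCRIPTION_DAYS * 24)

print(f"Share of archive hours:      {hours_share:.1f}%")
print(f"Share of archive broadcasts: {broadcast_share:.1f}%")
print(f"Average broadcast length:    {avg_minutes:.1f} minutes")
print(f"Airtime per wall-clock hour: {throughput:,.0f} hours")
```

By these rounded numbers, the uncaptioned material works out to roughly 73% of the older airtime total and 81% of the older broadcast count, with an average broadcast of about 35 minutes and nearly 15,000 hours of airtime transcribed per wall-clock hour.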

We are currently examining the language breakdown of the transcriptions and will have some exciting findings to share soon!

The GDELT Project
