A First Glimpse At Combining ASR & OCR For Archive-Scale Corrupt & Blank Broadcast Detection

In any archive as large as the TV News Archive, there will invariably be a small percentage of broadcasts that capture moments where a channel was off-the-air or experiencing transmission issues, resulting in blank or corrupted broadcasts. Ideally, we'd love to remove these broadcasts since they are not useful to journalists and scholars, but they have historically been highly resistant to at-scale detection given the sheer variety of ways that video corruption can manifest itself. It turns out that combining ASR and OCR information together can provide an extremely robust signal for detecting wide swaths of these corrupt broadcasts.

Corrupt broadcasts take on every form imaginable, from the most recognizable "pixelated smearing" to imagery that looks almost like a vintage video game. These images are rarely static, with the montage at bottom showing the sequence of a typical corrupt broadcast and the diversity of imagery it contains. Worse, while the montage below looks fairly consistent across frames, in reality the video is actually showing a high degree of pixel change from frame to frame, making traditional velocity-based metrics less useful. The unique nature of corrupted video signals also means that most traditional image detection algorithms, from color histograms to interframe change to pixel stability and velocity all result in unacceptably high false positive or negative rates.

The only consistent pattern across all these corrupted videos is the lack of intelligible spoken audio or recognizable onscreen text. Applying high quality ASR via GCP's Chirp and OCR via GCP's Cloud Vision AI API, we can immediately identify broadcasts that lack any meaningful spoken or written words, making it trivial to identify the most obviously corrupted videos, such as the ones above.

We've identified more than 71,000 broadcasts to date using this methodology and we are now exploring applying various NLP and LLM approaches to examine the ASR and OCR data of corrupt broadcasts that have recognizable speech and onscreen text, but where it is too garbled to be understandable and we'll be sharing some preliminary findings from that work soon!