The GDELT Project

Visual Explorer: A Layered Workflow For Detecting Corrupt & Empty Broadcasts Using FFMPEG & Chirp ASR

Any archive of television news coverage spanning hundreds of channels from more than 50 countries over more than a quarter century will inevitably contain some percentage of blank and corrupted broadcasts, caused by everything from signal loss to recording glitches. As we prepare to relaunch the Visual Explorer, we have been working to detect these at scale across the archive. Complicating matters is the sheer variety of forms such broadcasts can take. Detecting off-the-air broadcasts means identifying blank screens, colorbars with tone, logos with music, ads, preview clips, subaudible tones and nearly every other imaginable permutation of video and audio signal, multiplied by the varied broadcast practices of more than 50 nations, which have changed repeatedly over the more than 25 years covered by the archive. Detecting corrupt broadcasts spans a visual and audible landscape as large as the broadcasting medium itself. Yet with a few careful observations we have been able to eliminate a vast swath of these problematic broadcasts.

The first stage of filtration occurs at initial ingest. We first scan each MPG or MP4 container using ffprobe/ffmpeg to assess the existence, duration and specifications of its audio and video channels. In all, we discarded 155,184 broadcasts through this filtering.
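A first-stage check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give the actual criteria, so the minimum duration, the rule that both a video and an audio stream must exist, and the function names `probe` and `stage_one_verdict` are all hypothetical. The ffprobe flags used (`-print_format json`, `-show_format`, `-show_streams`) are standard.

```python
import json
import subprocess

# Hypothetical threshold: the article does not state the real cutoff.
MIN_DURATION_SECONDS = 60.0


def probe(path):
    """Run ffprobe on a container and return its JSON description."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def stage_one_verdict(info, min_duration=MIN_DURATION_SECONDS):
    """Return (keep, reason) from ffprobe's JSON output.

    Assumed rules (not from the article): both stream types must
    exist and the container must report a plausible duration.
    """
    streams = info.get("streams", [])
    if not any(s.get("codec_type") == "video" for s in streams):
        return False, "no video stream"
    if not any(s.get("codec_type") == "audio" for s in streams):
        return False, "no audio stream"
    try:
        # ffprobe reports the duration as a string such as "1800.033000".
        duration = float(info.get("format", {}).get("duration", 0.0))
    except (TypeError, ValueError):
        return False, "unreadable duration"
    if duration < min_duration:
        return False, "too short"
    return True, "ok"
```

In use, `stage_one_verdict(probe("broadcast.mp4"))` would yield a keep/discard decision plus a reason that can be logged. Keeping the parsing separate from the subprocess call makes the rules easy to test without any video files on hand.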

The second stage involves a more detailed assessment of the audio channel to determine whether it should be passed onward to ASR. In all, we discarded 127,294 broadcasts at this stage. Note that only videos that successfully passed the first-stage filtering are examined here.
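One plausible way to perform such an audio assessment is to measure overall loudness with ffmpeg's `volumedetect` filter and skip ASR when the channel is effectively silent. The sketch below assumes that approach; the article does not say which measurements are used, and the silence floor and the names `measure_volume`, `parse_volumedetect` and `stage_two_verdict` are hypothetical.

```python
import re
import subprocess

# Hypothetical floor: peaks at or below this level are treated as silence.
SILENCE_FLOOR_DB = -60.0

# volumedetect logs lines such as "mean_volume: -27.3 dB" to stderr.
_LEVEL = re.compile(
    r"(mean_volume|max_volume):\s*(-?(?:\d+(?:\.\d+)?|inf))\s*dB"
)


def parse_volumedetect(log_text):
    """Extract (mean_db, max_db) from ffmpeg's log, or None if absent."""
    levels = {name: float(value) for name, value in _LEVEL.findall(log_text)}
    if "mean_volume" not in levels or "max_volume" not in levels:
        return None
    return levels["mean_volume"], levels["max_volume"]


def measure_volume(path):
    """Decode only the audio stream and return its measured levels."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-nostats", "-i", path,
         "-vn", "-af", "volumedetect", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    return parse_volumedetect(result.stderr)


def stage_two_verdict(levels, silence_floor=SILENCE_FLOOR_DB):
    """Return (send_to_asr, reason) under the assumed silence rule."""
    if levels is None:
        return False, "audio could not be decoded"
    _mean_db, max_db = levels
    if max_db <= silence_floor:
        return False, "audio is effectively silent"
    return True, "send to ASR"
```

A loudness check alone would not catch music-only or tone-only broadcasts, which is consistent with the article leaving those cases to the ASR stage.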

Finally, after performing ASR, if a broadcast contained audio but not even a single word of recoverable speech anywhere in it, we mark the broadcast for removal. Typical causes are instrumental music, massive audio channel corruption, excessive audio artifacts (clipping, dropout, or improper expander, gate, compressor, equalization or level settings, among others) and blank or colorbar broadcasts with tones outside the range of human hearing. This stage discarded a final 99,662 broadcasts. A much larger set of broadcasts contains just a few recognizable words, but at present we retain these in the Visual Explorer without ASR transcripts.
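The post-ASR decision described above reduces to counting recognized words. The following minimal sketch assumes the ASR output is available as a list of transcript strings; the cutoff for "just a few" words is not given in the article, so `FEW_WORDS_LIMIT` is a hypothetical value, as are the function names and verdict labels.

```python
import re

# Hypothetical cutoff for "just a few recognizable words".
FEW_WORDS_LIMIT = 5

# A "word" here is any run of Unicode letters; digits, punctuation and
# symbols such as music notes are ignored. For scripts written without
# spaces, a run of characters counts as one word, which is still enough
# to distinguish "no speech at all" from "some speech".
_WORD = re.compile(r"[^\W\d_]+")


def count_words(segments):
    """Count letter-based tokens across all transcript segments."""
    return sum(len(_WORD.findall(text)) for text in segments)


def stage_three_verdict(segments, few_words_limit=FEW_WORDS_LIMIT):
    """Classify a broadcast from its ASR transcript segments."""
    words = count_words(segments)
    if words == 0:
        return "discard"
    if words <= few_words_limit:
        return "keep_without_transcript"
    return "keep_with_transcript"
```

For example, an empty transcript or one containing only symbols is classified as `"discard"`, while a two-word transcript falls into the retained-without-transcript bucket.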

Note that at this time we retain broadcasts that have valid recoverable audio, but corrupt or blank video. This means that broadcasts where the audio is transcribable, but the video is lost will remain in the Visual Explorer, since it is still possible to understand what the speakers are saying.

With just this relatively simple workflow we have been able to remove the majority of the blank and corrupt broadcasts from the archive.