The GDELT Project

At-Scale OCR Of Television News: 18.8 Billion Seconds Of Global Television News OCR'd For $71K Vs $47M

We are tremendously excited to announce that, in collaboration with the Internet Archive's Television News Archive, we have completed full-screen OCR over the complete Archive, spanning more than 300 channels from 50 countries on 5 continents over the past quarter-century. The Archive totals 18.8 billion seconds of airtime (314M minutes / 5.24M hours) across 7.6 million broadcasts, making this what we believe to be one of the largest multilingual OCR initiatives ever applied to global television news.

This initiative was only possible because of a new OCR workflow based on GCP's Cloud Vision AI API that reduced the cost from $47.14M to just $71,236. Had we OCR'd the entire archive through GCP Cloud Video, it would have cost $47.14M ($0.15/min), though with per-frame rather than per-second results. OCR'ing the videos as 1fps still frames through Cloud Vision would have cost $11.32M ($1.50/1000 images for the first 5M images, then $0.60/1000 images). Instead, using our new montage workflow, we were able to OCR all 18.8 billion 1fps frames through just 123.73 million montage frames, reducing the final cost to just $71,236.

We are currently analyzing the massive resulting archive of OCR JSON annotation files, and over the coming weeks we will report our first statistics and initial analyses. It is a testament to the potential of modern computer vision AI systems that a creative workflow allowed us to OCR the entire 18.8-billion-second TV News Archive for the first time in its quarter-century history, and to do so for just $71,236.
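For readers who want to check the cost comparison, the arithmetic can be sketched in a few lines of Python. This is a rough back-of-envelope sketch, not our billing calculation: it uses only the rounded figures and list prices quoted above, the function and constant names are illustrative, and it assumes the two Cloud Vision price tiers apply once across the whole job. It takes the final $71,236 as reported rather than trying to re-derive it.

```python
# Back-of-envelope cost comparison from the rounded figures quoted in the post.
# All names here are illustrative; the prices are the list prices cited above.

TOTAL_SECONDS = 18.8e9            # total airtime; also the number of 1fps frames
MONTAGE_IMAGES = 123.73e6         # montage frames actually sent for OCR
REPORTED_VIDEO_COST = 47.14e6     # USD, as reported
REPORTED_MONTAGE_COST = 71_236    # USD, as reported

VIDEO_RATE_PER_MINUTE = 0.15      # USD per minute of video
# (images covered by the tier, USD per 1000 images); assumed to apply once overall
VISION_TIERS = [(5_000_000, 1.50), (float("inf"), 0.60)]


def video_cost(seconds, rate_per_minute=VIDEO_RATE_PER_MINUTE):
    """Cost of per-minute video OCR pricing for the given airtime."""
    return seconds / 60 * rate_per_minute


def tiered_image_cost(images, tiers=VISION_TIERS):
    """Cost of per-image OCR pricing, filling each price tier in order."""
    remaining = images
    total = 0.0
    for tier_size, price_per_1000 in tiers:
        in_tier = min(remaining, tier_size)
        total += in_tier / 1000 * price_per_1000
        remaining -= in_tier
        if remaining <= 0:
            break
    return total


def frames_per_montage(frames=TOTAL_SECONDS, montages=MONTAGE_IMAGES):
    """Average number of 1fps frames represented by each montage image."""
    return frames / montages


def savings_ratio(baseline=REPORTED_VIDEO_COST, actual=REPORTED_MONTAGE_COST):
    """How many times cheaper the reported cost is than the video baseline."""
    return baseline / actual


if __name__ == "__main__":
    print(f"Video pricing:       ${video_cost(TOTAL_SECONDS):,.0f}")
    print(f"Per-frame pricing:   ${tiered_image_cost(TOTAL_SECONDS):,.0f}")
    print(f"Frames per montage:  {frames_per_montage():.1f}")
    print(f"Savings vs. video:   {savings_ratio():.0f}x")
```

With these rounded inputs the sketch gives about $47.0M for video pricing and about $11.28M for per-frame pricing, slightly below the $47.14M and $11.32M figures above, which presumably reflect the unrounded airtime total. It also shows that each montage image stands in for roughly 152 frames on average, and that the reported final cost is roughly 660 times lower than the video baseline.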