The GDELT Project

At-Scale OCR Of Television News Experiments: Early Statistics From 2% Of Shows

While we have OCR'd almost the totality of the TV News Archive through GCP's Cloud Vision AI API, those OCR results take the form of massive montage-level JSON files, each containing character-level OCR information for as many as 200 seconds of airtime at a time. We have finally begun compiling all of that character-level data into per-frame OCR summaries to get a sense of just how much text is contained within the Archive's quarter century of global television news. As of this afternoon we have processed just over 2% of the entire Archive; here are some initial statistics from that small sample of 180K broadcasts. Of the 442M 1fps frames OCR'd, 374M (84%) contained at least some recognizable text. Rather than OCR'ing 442M individual images, our montage-based workflow meant we only had to OCR 2.9M distinct montage images, at an average of 148 frames per montage, accelerating the OCR process and reducing costs by 148x. In all, there were 407M text blocks containing 36.7B characters making up 6.8B words, for an average of 98 characters / 18 words per frame.
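To make the montage workflow concrete, the sketch below shows one way a batch of 1fps frames could be tiled into a single image, OCR'd with one Cloud Vision request, and the word-level bounding boxes mapped back to per-frame text. The frame dimensions, the simple vertical-stack layout, and the function names are illustrative assumptions for this sketch, not our production pipeline:

```python
# A minimal sketch of montage-based OCR, assuming 640x480 frames stacked
# vertically; the actual montage tiling may differ.
import io

from PIL import Image
from google.cloud import vision

FRAME_W, FRAME_H = 640, 480  # assumed frame dimensions

def ocr_montage(frame_paths):
    """OCR a batch of frames with a single API call; return per-frame text."""
    # Tile the frames into one tall montage image.
    montage = Image.new("RGB", (FRAME_W, FRAME_H * len(frame_paths)))
    for i, path in enumerate(frame_paths):
        montage.paste(Image.open(path).resize((FRAME_W, FRAME_H)), (0, i * FRAME_H))

    buf = io.BytesIO()
    montage.save(buf, format="PNG")

    # One Cloud Vision request covers every frame in the montage.
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=buf.getvalue()))

    # Walk the word-level results and bucket each word back to its source
    # frame using the y-coordinate of its bounding box within the montage.
    per_frame = {i: [] for i in range(len(frame_paths))}
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for para in block.paragraphs:
                for word in para.words:
                    top_y = min(v.y for v in word.bounding_box.vertices)
                    idx = min(top_y // FRAME_H, len(frame_paths) - 1)
                    per_frame[idx].append("".join(s.text for s in word.symbols))
    return {i: " ".join(words) for i, words in per_frame.items()}
```

The cost saving follows directly from this structure: one API request per montage instead of one per frame, or roughly a 148x reduction at the average montage size reported above.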

Reflecting the sheer magnitude of OCR at these scales, the API examined 143 trillion pixels in just this small sample, of which 10 trillion contained text (7.4% of onscreen space), yielding 42TB of JSON output.
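As a sanity check on those figures, the headline ratios can be recomputed from the raw totals with a few lines of arithmetic. Since the inputs below are the rounded values quoted above, the derived numbers land close to, but not exactly on, the published averages and percentages:

```python
# Back-of-the-envelope arithmetic from the rounded totals reported above.
frames           = 442e6    # 1fps frames OCR'd
frames_with_text = 374e6    # frames containing recognizable text
montages         = 2.9e6    # distinct montage images sent to the API
chars, words     = 36.7e9, 6.8e9
pixels_total     = 143e12
pixels_with_text = 10e12
json_bytes       = 42e12    # 42TB of JSON output

print(f"frames with text:   {frames_with_text / frames:.0%}")        # ~85% (quoted: 84%)
print(f"frames per montage: {frames / montages:.0f}")                # ~152 (quoted avg: 148)
print(f"chars per frame:    {chars / frames_with_text:.0f}")         # ~98
print(f"words per frame:    {words / frames_with_text:.0f}")         # ~18
print(f"text pixel share:   {pixels_with_text / pixels_total:.1%}")  # ~7.0% (quoted: 7.4%)
print(f"JSON per frame:     {json_bytes / frames / 1e3:.0f} KB")     # ~95 KB per OCR'd frame
```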