At-Scale OCR Of Television News Experiments: 19 Billion Seconds & 294 Billion Words In 1.8 Petabytes Of OCR JSON From 6 Quadrillion Pixels

We are tremendously excited to announce today that we have completed processing the GCP Cloud Vision API OCR results for the complete TV News Archive, which spans 300+ channels, 50 countries and a quarter century. Reflecting the sheer scale of this effort, Cloud Vision OCR'd more than 6 quadrillion pixels across 19.1 billion images, yielding 1.8 petabytes of JSON OCR annotations. In total, more than 19.1 billion seconds of global television news have been OCR'd through Cloud Vision, of which 16.2 billion (84.7%) contained at least some recognizable text, reflecting just how universal onscreen text is on television news. At the same time, of those 6 quadrillion OCR'd pixels, just 453 trillion (7.46%) contained text, showing that while text is ubiquitous, it occupies relatively little screen space. In all, 1.8TB of text was extracted from those 19.1 billion seconds of airtime, consisting of 1.5 trillion characters making up 294 billion words organized into 17.6 billion contiguous blocks. The average frame containing text held just 98 characters in 18 words, while the average word was 5.38 characters long. We believe this is one of the largest OCR efforts ever performed on a global television news archive, and we will be diving into this immense dataset in the coming weeks to begin exploring what it tells us about the landscape of onscreen text in television news across the world and over the past quarter century.
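The headline ratios can be roughly reproduced from the rounded totals quoted above. The following is a minimal Python sketch for that sanity check, not part of the actual processing pipeline. All variable names are illustrative, and it rests on two assumptions: that one frame was OCR'd per second of airtime (19.1 billion images for 19.1 billion seconds), and that the per-frame averages are taken over frames that contained text.

```python
# Rounded totals as quoted in the announcement.
seconds_total = 19.1e9        # seconds (and images) sent through OCR
seconds_with_text = 16.2e9    # seconds with at least some recognizable text
pixels_total = 6e15           # "more than 6 quadrillion" pixels OCR'd
pixels_with_text = 453e12     # pixels that contained text
words_total = 294e9           # words extracted
avg_word_length = 5.38        # average characters per word


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of airtime and of screen area that carried text.
text_second_share = pct(seconds_with_text, seconds_total)
text_pixel_share = pct(pixels_with_text, pixels_total)

# Assumption: averages are over text-bearing frames, one frame per second.
words_per_text_frame = words_total / seconds_with_text
chars_per_text_frame = words_per_text_frame * avg_word_length

# Average pixels per OCR'd image, again assuming one image per second.
pixels_per_frame = pixels_total / seconds_total

print(f"Seconds with text:       {text_second_share:.1f}%")
print(f"Pixels with text:        {text_pixel_share:.2f}%")
print(f"Words per text frame:    {words_per_text_frame:.1f}")
print(f"Chars per text frame:    {chars_per_text_frame:.1f}")
print(f"Pixels per OCR'd frame:  {pixels_per_frame:,.0f}")
```

Because the inputs are rounded, the results land near rather than exactly on the published figures. The pixel share in particular depends on how far above 6 quadrillion the true pixel total is.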