At-Scale OCR Of Television News Experiments: First Results & Broadcast-Level Statistics

To date, we have OCR'd more than 18.8 billion seconds of global television news spanning 300 channels from 50 countries over the past quarter century. Thus far we have simply archived all of this OCR JSON without analyzing it to generate even basic statistics, such as how much text appears in a typical broadcast or what percentage of the screen real estate contains text. This is largely due to the complex and novel workflow that has made it computationally tractable to OCR this much material this quickly, reducing our costs from $47M to just $71K. While the montage-based approach reduces cost by up to 200x, it requires a vastly more complex reassembly step to map all of the extracted text back into per-frame OCR results, adjusting pixel offsets and handling a range of edge cases. We have just run our first two hour-long OCR'd broadcasts through this analysis workflow, one from 2009 and one from today, to see just how much text can be found in an American television news broadcast.
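
Conceptually, that reassembly step is a coordinate translation: each word detected in a stitched montage image must be assigned back to the frame cell it came from and have its bounding box shifted into frame-local coordinates. The sketch below illustrates the idea under simplifying assumptions (a row-major grid of equally sized frame cells and a flattened annotation shape with "text" and "vertices" fields, which are illustrative rather than the actual pipeline's schema); the production workflow also has to handle words that straddle cell boundaries and other complexities.

    def split_montage_annotations(annotations, cols, frame_w, frame_h):
        # Map each montage-level word back to its source frame (row-major grid
        # of `cols` columns of frame_w x frame_h cells) and shift its bounding
        # box into frame-local pixel coordinates.
        # Returns {frame_index: [word, ...]}.
        frames = {}
        for word in annotations:
            xs = [v.get("x", 0) for v in word["vertices"]]
            ys = [v.get("y", 0) for v in word["vertices"]]
            # Use the bounding-box centroid to pick the montage cell.
            col = int((sum(xs) / len(xs)) // frame_w)
            row = int((sum(ys) / len(ys)) // frame_h)
            local = {
                "text": word["text"],
                "vertices": [{"x": v.get("x", 0) - col * frame_w,
                              "y": v.get("y", 0) - row * frame_h}
                             for v in word["vertices"]],
            }
            frames.setdefault(row * cols + col, []).append(local)
        return frames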

The hour-long 2009 SD resolution broadcast required more than 1 billion pixels to be OCR'd by the Cloud Vision API, while the HD broadcast required the OCR of more than 3 billion pixels, reflecting the sheer enormity of archive-scale OCR. The 2009 broadcast contained 50K onscreen words compared with just 8.5K spoken words, meaning there was 5.9x more onscreen than spoken text, while the 2025 broadcast contained 13.2x more onscreen words (130K) than spoken (9.8K). Of course, many of these words may be fixed station-specific text, such as the name of the channel or show, the current date and time, and other decorative text, but the totals still capture just how much onscreen text there really is. These numbers, if borne out by the full 18.8-billion-second archive, offer a first glimpse at the sheer scale of what current spoken-word-focused video search is missing.

Our first analyzed broadcast is a 2009 MSNBC broadcast of The Ed Show. In all, the raw OCR JSON of the hour-long broadcast totals 271MB, which we reduce to 15MB using a more optimized storage approach (such as concatenating bounding-box vertices into single strings rather than storing them as arrays of objects). You can see the full breakdown below, which shows that this hour-long 640×480 SD resolution broadcast, sampled at 1fps, resulted in more than 1 billion pixels that had to be OCR'd by the Cloud Vision API, of which 6.64% contained text. In all, roughly 239KB of text was extracted, totaling 50,663 words. In contrast, the closed captioning of this broadcast contains just 8,571 words, meaning there is 5.9 times more text onscreen than is spoken aloud.

totFrames: 3600
width: 640
height: 480
frameCntHasTxt: 3325
pixelsTot: 1021440000
pixelsTxt: 67788053
pixelsTxtPerc: 6.64
txtBytes: 239903
txtChars: 238652
txtParas: 3665
txtWords: 50663
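
Much of the 271MB-to-15MB reduction comes from flattening Cloud Vision's verbose per-vertex objects into compact strings. The sketch below shows that kind of compaction; it assumes the standard fullTextAnnotation hierarchy (pages, blocks, paragraphs, words, symbols), uses the illustrative output field names "text" and "box", and omits the other optimizations applied in practice.

    def compact_word(word):
        # Flatten the vertex objects ({"x": .., "y": ..}, ...) into a single
        # comma-delimited "x1,y1,x2,y2,..." string and join the per-symbol
        # characters into the word's text.
        verts = word.get("boundingBox", {}).get("vertices", [])
        box = ",".join(f'{v.get("x", 0)},{v.get("y", 0)}' for v in verts)
        text = "".join(s.get("text", "") for s in word.get("symbols", []))
        return {"text": text, "box": box}

    def compact_annotation(full_text_annotation):
        # Walk the page/block/paragraph/word hierarchy and emit compact records.
        out = []
        for page in full_text_annotation.get("pages", []):
            for block in page.get("blocks", []):
                for para in block.get("paragraphs", []):
                    for word in para.get("words", []):
                        out.append(compact_word(word))
        return out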

Our second broadcast is a CNN Newsroom broadcast from yesterday. The raw OCR JSON of the hour-long broadcast totals 731MB (almost as large as the 946MB MP4 video file itself), which we reduce to 41MB. This HD broadcast required more than 3 billion pixels to be analyzed by Cloud Vision, of which 14.13% contained text, yielding roughly 690KB of extracted text totaling 130,295 words. The closed captioning contains just 9,867 words, meaning this broadcast had 13.2 times more onscreen than spoken text.

totFrames: 3662
width: 1280
height: 720
frameCntHasTxt: 3330
pixelsTot: 3068928000
pixelsTxt: 433487069
pixelsTxtPerc: 14.13
txtBytes: 690336
txtChars: 688583
txtParas: 4280
txtWords: 130295
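
For reference, the broadcast-level fields above can be derived from the reassembled per-frame results roughly as follows. This is a simplified sketch, not the production code: bounding-box areas are summed naively (overlapping words may be double-counted), and the pixelsTot definition used here, counting only frames that contained text, is an assumption that happens to be consistent with the figures reported above.

    def broadcast_stats(frames, width, height):
        # `frames` is assumed to be a list of per-frame word lists, each word a
        # dict with frame-local "vertices" and its extracted "text".
        frame_cnt_has_txt = sum(1 for words in frames if words)
        pixels_txt = txt_words = txt_chars = 0
        for words in frames:
            for w in words:
                xs = [v["x"] for v in w["vertices"]]
                ys = [v["y"] for v in w["vertices"]]
                pixels_txt += (max(xs) - min(xs)) * (max(ys) - min(ys))
                txt_words += 1
                txt_chars += len(w["text"])
        # Assumption: pixelsTot counts only the frames that contained text; the
        # production definition may differ.
        pixels_tot = width * height * frame_cnt_has_txt
        return {
            "totFrames": len(frames),
            "frameCntHasTxt": frame_cnt_has_txt,
            "pixelsTot": pixels_tot,
            "pixelsTxt": pixels_txt,
            "pixelsTxtPerc": round(100.0 * pixels_txt / pixels_tot, 2) if pixels_tot else 0.0,
            "txtWords": txt_words,
            "txtChars": txt_chars,
        }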