
To date, we have OCR'd more than 18.8 billion seconds of global television news spanning 300 channels from 50 countries over the past quarter century. Thus far we have merely recorded all of this OCR JSON, but have yet to actually analyze it to generate basic statistics, such as how much text appears in a typical broadcast, what percentage of the screen real estate contains text, and so on. This is largely due to the complex and novel workflow that has made it computationally tractable to OCR this much material this quickly, reducing our costs from $47M to just $71K. While the montage-based approach reduces the cost by up to 200x, it also requires a vastly more complex workflow to reassemble all of the extracted text back into per-frame OCR results, adjusting pixel offsets and handling a range of other complexities. We have just run our first two hour-long OCR'd broadcasts through this analysis workflow, one from 2009 and one from today, to see just how much text can be found in an American television news broadcast.
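To give a flavor of what that reassembly involves, here is a minimal sketch of the coordinate remapping step, assuming frames are tiled left-to-right, top-to-bottom into a fixed grid; the tile size, grid width and function name are illustrative only, not our actual production code:

# Hypothetical sketch: map an OCR vertex reported in montage coordinates back
# to its source frame and frame-local pixel position. Assumes frames were tiled
# left-to-right, top-to-bottom; tile size and grid width are illustrative only.

TILE_W, TILE_H = 640, 480   # per-frame tile size used when building the montage
GRID_COLS = 6               # frames per montage row (assumed)

def montage_to_frame(vertex_x, vertex_y, first_frame_index=0):
    """Convert a montage-space vertex to (frame_index, x, y) in frame space."""
    col = vertex_x // TILE_W
    row = vertex_y // TILE_H
    frame_index = first_frame_index + row * GRID_COLS + col
    local_x = vertex_x - col * TILE_W
    local_y = vertex_y - row * TILE_H
    return frame_index, local_x, local_y

# A vertex reported at (1300, 500) in the montage falls in column 2, row 1,
# i.e. frame 8 of this montage, at local pixel (20, 20).
print(montage_to_frame(1300, 500))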
The hour-long 2009 SD resolution broadcast required more than 1 billion pixels to be OCR'd by the Cloud Vision API, while the HD broadcast required the OCR of more than 3 billion pixels, reflecting the sheer enormity of archive-scale OCR. The 2009 broadcast contained 50K onscreen words compared with just 8.5K spoken words, meaning there was 5.9x more onscreen than spoken text, while the 2025 broadcast contained 13.2x more onscreen words (130K) than spoken (9.8K). Of course, many of these words may be fixed station-specific text, such as the name of the channel or show, the current date and time and other decorative text, but these figures still capture just how much onscreen text there really is. These numbers, if borne out by the full 18.8-billion-second archive, offer a first glimpse of the sheer scale of what current spoken-word-focused video search is missing.
Our first analyzed broadcast is a 2009 MSNBC broadcast of The Ed Show. In all, the raw OCR JSON of the hour-long broadcast totals 271MB, which we reduce to 15MB by using a more optimized storage approach (such as concatenating vertices into single strings rather than storing them as arrays of objects). You can see the full breakdown below, which shows that this hour-long 640×480 SD resolution broadcast, sampled at 1fps, yielded more than 1 billion pixels that had to be OCR'd by the Cloud Vision API, of which 6.64% contained text. In all, 239KB of text was extracted, totaling 50,663 words. In contrast, the closed captioning of this broadcast contains just 8,571 words, meaning there is 5.9 times more text onscreen than is spoken aloud.
totFrames: 3600
width: 640
height: 480
frameCntHasTxt: 3325
pixelsTot: 1021440000
pixelsTxt: 67788053
pixelsTxtPerc: 6.64
txtBytes: 239903
txtChars: 238652
txtParas: 3665
txtWords: 50663
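Most of the storage reduction described above comes from flattening Cloud Vision's verbose bounding-box structures. The sketch below illustrates the general idea, assuming the standard fullTextAnnotation layout returned by the API; the compacted output field names ("t" and "b") are invented for this example and are not our actual storage format:

import json

def compact_vertices(bounding_poly):
    """Collapse a Cloud Vision boundingPoly into a single 'x1,y1,x2,y2,...' string."""
    return ",".join(
        f"{v.get('x', 0)},{v.get('y', 0)}" for v in bounding_poly.get("vertices", [])
    )

def compact_word(word):
    """Rewrite one word annotation with a compact bounding box and plain text."""
    text = "".join(s["text"] for s in word.get("symbols", []))
    return {"t": text, "b": compact_vertices(word.get("boundingBox", {}))}

# One word from a fullTextAnnotation response shrinks from a nested array of
# {x, y} objects to a short delimited string.
word = {
    "boundingBox": {"vertices": [{"x": 10, "y": 20}, {"x": 90, "y": 20},
                                 {"x": 90, "y": 44}, {"x": 10, "y": 44}]},
    "symbols": [{"text": "N"}, {"text": "e"}, {"text": "w"}, {"text": "s"}],
}
print(json.dumps(compact_word(word)))   # {"t": "News", "b": "10,20,90,20,90,44,10,44"}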
Our second broadcast is a CNN Newsroom broadcast from yesterday. The raw OCR JSON of the hour-long broadcast totals 731MB (almost as large as the 946MB MP4 video file itself), which we reduce to 41MB. This HD broadcast required more than 3 billion pixels to be analyzed by Cloud Vision, of which 14.13% contained text, resulting in 690KB of extracted text totaling 130,295 words. The closed captioning contains just 9,867 words, meaning this broadcast had 13.2 times more onscreen than spoken text.
totFrames: 3662
width: 1280
height: 720
frameCntHasTxt: 3330
pixelsTot: 3068928000
pixelsTxt: 433487069
pixelsTxtPerc: 14.13
txtBytes: 690336
txtChars: 688583
txtParas: 4280
txtWords: 130295
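The percentages and ratios quoted throughout this post fall directly out of these summary fields. As a quick sanity check, the figures from the two breakdowns above, combined with the closed-captioning word counts, reproduce them:

# Recompute the text-pixel percentages and onscreen-vs-spoken ratios from the
# summary figures above; closed-caption word counts come from the article text.

broadcasts = {
    "2009 MSNBC The Ed Show": {
        "pixels_tot": 1_021_440_000, "pixels_txt": 67_788_053,
        "ocr_words": 50_663, "caption_words": 8_571,
    },
    "2025 CNN Newsroom": {
        "pixels_tot": 3_068_928_000, "pixels_txt": 433_487_069,
        "ocr_words": 130_295, "caption_words": 9_867,
    },
}

for name, b in broadcasts.items():
    txt_pixel_pct = 100.0 * b["pixels_txt"] / b["pixels_tot"]
    onscreen_vs_spoken = b["ocr_words"] / b["caption_words"]
    print(f"{name}: {txt_pixel_pct:.2f}% of OCR'd pixels contained text, "
          f"{onscreen_vs_spoken:.1f}x more onscreen than spoken words")

Running this yields 6.64% and 5.9x for the 2009 broadcast and 14.13% and 13.2x for the 2025 broadcast, matching the figures above.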