At-Scale OCR Of Television News Experiments: What Have We Learned So Far?

Kalev Leetaru

8 months ago

In collaboration with the Internet Archive's TV News Archive, we are working to OCR the Archive's entire 7-million-hour, quarter-century collection, which spans 50 countries and 150 spoken languages, to enable keyword search of all onscreen text through the TV Explorer. To date we have finished machine-transcribing all 2.5 million hours of uncaptioned airtime using GCP's Chirp LSM model, vividly capturing for the first time the prevalence of code switching at these scales.

We are now augmenting that spoken-word archive by transcribing the Archive's onscreen text. Television news OCR is many orders of magnitude more complex than print OCR. Text is superimposed on highly visually complex scenes. A single image can contain multiple fonts, colors and backgrounds, along with both foreground and background text, any of which may be partially obscured. Text appears at multiple angles to the screen, from straight on to sharply oblique, and vertical and horizontal, left-to-right and right-to-left text can all be mixed in the same image. Multiple languages can appear in a single frame, and image resolution can be as low as 640×480 pixels, or even lower in some countries, meaning extractable text might be only a few tens of pixels high.

OCR'ing a single frame at a time using GCP's Cloud Vision API yields remarkable accuracy, but to achieve a 100-200x speedup and cost reduction we are using a montage-based workflow and accepting a small reduction in accuracy. Given the sheer scale of our OCR work (OCR'ing the complete Archive would cost more than $75M using a traditional workflow), we've received a lot of interest in the lessons we've learned to date, especially around open source and free models like Tesseract and EasyOCR; LMMs like Gemini, ChatGPT 4 and PaliGemma; dedicated CV models like GCP Cloud Vision API; topics like block reconstruction and stability; and the initial results from our montaging workflows. Below you can find a few relevant links summarizing our work to date:

For those interested in using LMMs like ChatGPT 4 and Gemini:

Open source and free OCR like EasyOCR, Tesseract and PaliGemma:

Others in the series:

Our montaging series:
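
As a rough illustration of why montaging cuts cost and time so sharply, here is a minimal sketch of the bookkeeping such a workflow would need. It assumes frames are tiled at native resolution into a fixed row-major grid, and that any OCR result comes back with pixel coordinates in the montage image. The `MontageLayout` class, the 10×10 grid and the one-frame-per-second sampling rate in the usage example are all hypothetical choices made for illustration, not the parameters of our production pipeline.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class MontageLayout:
    """Hypothetical fixed-grid montage: every tile is one video frame."""
    frame_w: int  # width of each frame tile, in pixels
    frame_h: int  # height of each frame tile, in pixels
    cols: int     # number of tiles across the montage
    rows: int     # number of tiles down the montage

    @property
    def frames_per_montage(self) -> int:
        return self.cols * self.rows

    def tile_origin(self, index: int) -> tuple[int, int]:
        """Top-left montage pixel of the tile holding frame `index` (row-major)."""
        if not 0 <= index < self.frames_per_montage:
            raise IndexError(f"frame index {index} outside montage")
        row, col = divmod(index, self.cols)
        return col * self.frame_w, row * self.frame_h

    def locate(self, x: int, y: int) -> tuple[int, int, int]:
        """Map a montage pixel back to (frame index, x in frame, y in frame)."""
        if not (0 <= x < self.cols * self.frame_w and 0 <= y < self.rows * self.frame_h):
            raise ValueError(f"pixel ({x}, {y}) outside montage")
        col, local_x = divmod(x, self.frame_w)
        row, local_y = divmod(y, self.frame_h)
        return row * self.cols + col, local_x, local_y


def ocr_requests_needed(total_frames: int, layout: MontageLayout) -> int:
    """Number of OCR calls when each call processes one full montage."""
    return ceil(total_frames / layout.frames_per_montage)


if __name__ == "__main__":
    # Assumed values: 640x480 frames, a 10x10 grid, one frame sampled per second.
    layout = MontageLayout(frame_w=640, frame_h=480, cols=10, rows=10)
    frames_in_hour = 3600
    print(ocr_requests_needed(frames_in_hour, layout))  # montage OCR calls per hour
    print(layout.locate(1300, 500))  # which frame a detected text pixel belongs to
```

Under these assumptions, the number of OCR calls falls by the number of frames per montage, so a grid of 100 to 200 tiles matches the order of magnitude of the speedup described above. The `locate` step matters because a text block detected in the montage must be assigned back to the frame, and therefore the airtime, it came from.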