Using Thumbnail Montages To Optimize AI-Based OCR Speed & Costs: Part 7 – What Have We Learned So Far?

What have we learned so far from our six-part series (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) on using image montages to dramatically improve the speed and reduce the cost of OCR for large video archives? Native video OCR such as the Cloud Video AI API offers the highest possible OCR accuracy because it can look across nearby frames to reduce the impact of jitter and smearing. Yet there are few good options for reducing its cost: SOTA video OCR models tend to be highly sensitive to changes in the source video and thus highly resistant to many kinds of optimizations, such as frame reduction and intelligent resampling. In contrast, sampling the video into a sequence of still frames enables highly accurate, massively multilingual OCR using SOTA still-frame models, yielding results largely indistinguishable from video OCR for most use cases but, with some creative optimizations, at vastly lower cost. Here's what we've learned to date:

Thus, the final recommendation is to take an input source video and sample it at a fixed 1fps or 0.5fps frame rate (1fps allows the results to be matched against most common analytic frameworks and methodologies, while 0.5fps cuts total costs in half). The sampled images should then be tiled into a vertical montage at their original native resolution. This simple workflow yields results that largely equal SOTA video OCR, other than in a rare set of edge cases involving severe motion blur and jitter, and even in those cases Cloud Vision's OCR is often able to recover much of the text.
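For concreteness, here is a minimal sketch of that workflow in Python, assuming ffmpeg, Pillow, and the google-cloud-vision client library are available. The input filename, the 1fps sampling rate, and the intermediate file names are illustrative choices, not anything prescribed above:

```python
# Minimal sketch of the recommended workflow: sample a video at a fixed
# frame rate, tile the frames into a vertical montage at native
# resolution, and OCR the montage with Cloud Vision.
import glob
import subprocess

from PIL import Image
from google.cloud import vision

VIDEO = "broadcast.mp4"   # hypothetical input video
FPS = 1                   # 1fps; use 0.5 to halve total OCR cost

# Step 1: sample the video into still frames at a fixed frame rate,
# preserving the original native resolution.
subprocess.run(
    ["ffmpeg", "-i", VIDEO, "-vf", f"fps={FPS}", "frame_%05d.png"],
    check=True,
)

# Step 2: tile the sampled frames into a single vertical montage.
frames = [Image.open(f) for f in sorted(glob.glob("frame_*.png"))]
width = max(f.width for f in frames)
montage = Image.new("RGB", (width, sum(f.height for f in frames)))
y = 0
for f in frames:
    montage.paste(f, (0, y))
    y += f.height
montage.save("montage.png")

# Step 3: OCR the montage with Cloud Vision; a single request now
# covers every sampled frame.
client = vision.ImageAnnotatorClient()
with open("montage.png", "rb") as f:
    response = client.document_text_detection(
        image=vision.Image(content=f.read())
    )
print(response.full_text_annotation.text)
```

Because the entire montage is submitted as one image, a single Vision API call covers every sampled frame rather than one call per frame, which is where the cost savings come from; very long videos may need the montage split into several chunks to stay within the API's image size limits.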