Using Thumbnail Montages To Optimize AI-Based OCR Speed & Costs: Part 7 – What Have We Learned So Far?

What have we learned so far from our six-part series (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) on using image montages to dramatically improve the speed and reduce the cost of OCR for large video archives? While native video OCR such as the Cloud Video AI API offers the highest possible OCR accuracy due to its ability to look across nearby frames to reduce the impact of jitter and smearing, there are few good options for reducing its cost: SOTA video OCR models tend to be highly sensitive to changes to the video source and thus highly resistant to many kinds of optimizations (such as frame reduction or intelligent resampling). In contrast, sampling the video into a sequence of still frames enables the use of SOTA still-frame OCR models, whose highly accurate, massively multilingual results are largely indistinguishable from video OCR for most use cases, but at vastly lower cost with some creative optimizations. Here's what we've learned to date:

  • Sampling Rate. The first step is to sample the video file into a sequence of still frames. Of the various sampling methods we tested (fixed framerates such as 1fps and 1/2fps, and variable-framerate algorithmically determined scene change detection), fixed 1fps and 1/2fps sampling yielded the best results. Sampling more sparsely than 1/2fps (one frame every two seconds) ran the risk of missed text in some cases (especially fast-moving text). Scene detection sampling (only outputting a frame when it differs by more than a certain percentage from the previous one) missed too much text, and the variable sizes of the skip gaps vastly complicate analytic understanding of the results. Thus, using ffmpeg to sample the video at a fixed 1fps or 1/2fps framerate appears to yield the best results (see the first sketch after this list).
  • Sample Resolution. Text-light channels that feature only a chyron and crawl can typically be downsampled to as little as 500×500 pixels, but anything below that yields measurable accuracy loss. Text-saturated channels begin to exhibit OCR accuracy loss after even modest downsampling. Thus, processing video frames at their original native resolution appears to yield the best results.
  • Montage Layout. Arranging the resulting frames into a vertical, horizontal, or grid layout appears to yield identical block-level recognition results with Cloud Vision's OCR. The major difference is that Cloud Vision's automatic reassembly of those blocks into a single cohesive image-level text block yields accurate results only for a single-column vertical layout. For all other layouts, text from unrelated adjacent frames often bleeds together. While this could be corrected using custom post-processing code, such a workflow would have to accommodate both space-segmented and scriptio continua languages. Using a simple single-column vertical layout for the image montage sidesteps all of this and allows relatively straightforward frame segmentation code (see the sketches following this list).
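
To make the sampling step concrete, here is a minimal sketch of driving ffmpeg's fps filter from Python; the function name, output directory, and filename pattern are illustrative rather than prescriptive:

```python
# Minimal sketch: sample a video into still frames at a fixed framerate
# using ffmpeg's fps filter. Paths and names are illustrative.
import subprocess
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, fps: str = "1") -> list[Path]:
    """Sample video_path at a fixed rate ("1" for 1fps, "1/2" for 1/2fps)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",            # fixed-rate sampling filter
         str(out / "frame_%06d.png")],   # sequentially numbered output frames
        check=True,
    )
    return sorted(out.glob("frame_*.png"))
```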

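The montage step itself is only a few lines with the Pillow imaging library; the minimal sketch below stacks the sampled frames into a single-column strip at their native resolution. Note that a long video would likely need to be split across multiple montages to stay within the OCR service's per-image size limits, a detail omitted here:

```python
# Hedged sketch: tile sampled frames into a single-column vertical montage
# at their original native resolution using Pillow. Names are illustrative.
from pathlib import Path
from PIL import Image

def build_vertical_montage(frame_paths: list[Path], out_path: str) -> None:
    frames = [Image.open(p) for p in frame_paths]
    width = max(f.width for f in frames)    # frames from one video share a width
    height = sum(f.height for f in frames)
    montage = Image.new("RGB", (width, height))
    y = 0
    for frame in frames:
        montage.paste(frame, (0, y))        # single column: x offset is always 0
        y += frame.height
    montage.save(out_path)
```
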
Thus, the final recommendation is to take an input source video and sample it at a fixed 1fps or 1/2fps framerate (1fps allows matching the results to most common analytic frameworks and methodologies, while 1/2fps cuts total costs in half). The sampled frames should then be tiled into a single-column vertical montage at their original native resolution. This simple workflow yields results that largely equal SOTA video OCR other than in a rare set of edge cases involving severe motion blur and jitter, and even in those cases Cloud Vision's OCR is often capable of recovering much of the text.
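
Putting the pieces together, the sketch below submits the montage to Cloud Vision's document text detection and maps each recognized block back to its source frame by its vertical offset. It assumes equal-height frames; the names are illustrative, error handling is omitted, and the word-joining logic would need to be adapted per script as noted above:

```python
# Hedged end-to-end sketch: OCR the montage with Cloud Vision and map each
# recognized block back to its source frame by vertical position. Assumes
# equal-height frames; names are illustrative and error handling is omitted.
from google.cloud import vision

def ocr_montage(montage_path: str, frame_height: int) -> dict[int, list[str]]:
    client = vision.ImageAnnotatorClient()
    with open(montage_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.document_text_detection(image=image)

    texts_by_frame: dict[int, list[str]] = {}
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            # The block's topmost y coordinate identifies its source frame.
            top_y = min(v.y for v in block.bounding_box.vertices)
            frame_index = top_y // frame_height
            words = [
                "".join(s.text for s in word.symbols)
                for para in block.paragraphs
                for word in para.words
            ]
            # Joining with spaces suits space-segmented languages; scriptio
            # continua scripts would be joined without separators instead.
            texts_by_frame.setdefault(frame_index, []).append(" ".join(words))
    return texts_by_frame
```

For example, a one-hour broadcast sampled at 1/2fps from a 1080p source would yield 1,800 frames, with frame_height=1080 used to recover each block's frame index and, in turn, its timestamp (frame i at 1/2fps corresponds to 2i seconds into the video).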