Earlier this week we explored the ability of GCP's Cloud Vision API to robustly OCR a grid of 224 video frames extracted at 1fps from a television news broadcast, speeding up the OCR process more than 200x compared with OCRing each frame individually. As we reviewed those results, we found a number of fascinating examples, including two of particular note: one demonstrates OCR's ability to read text that is hard even for a human to discern, while the second illustrates the inconsistencies that can result from OCR at this density.
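For readers who want to experiment with this batching approach themselves, a minimal sketch using Pillow appears below. The frame filenames, per-frame resolution, and 14x16 grid layout are illustrative assumptions rather than the exact parameters of our pipeline.

```python
# A minimal sketch of tiling extracted frames into a single grid image,
# assuming 224 JPEGs named frame-0001.jpg ... frame-0224.jpg at a uniform
# resolution. Grid dimensions and filenames here are assumptions for
# illustration, not the exact layout used in our pipeline.
from PIL import Image

COLS, ROWS = 14, 16          # 14 x 16 = 224 frames per grid image (assumed)
FRAME_W, FRAME_H = 480, 270  # per-frame resolution (assumed)

grid = Image.new("RGB", (COLS * FRAME_W, ROWS * FRAME_H))
for i in range(COLS * ROWS):
    frame = Image.open(f"frame-{i + 1:04d}.jpg")
    col, row = i % COLS, i // COLS
    grid.paste(frame, (col * FRAME_W, row * FRAME_H))

# One Cloud Vision request for this image replaces 224 per-frame requests.
grid.save("grid.jpg")
```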
The first example is this frame, in which the "crawl" text folds upwards during a video transition. Because the frame was captured mid-transition, the text, especially the second word, is difficult to fully discern. Yet, remarkably, Cloud Vision was able to accurately transcribe it as "UPDATE Facebook bug disables members' accounts".
The second example illustrates the occasional instability in how letters are grouped into words. In the immediately following sequence of frames, Cloud Vision alternates between transcribing "TSA chief likely" and "TSA chieflikely" (combining "chief" and "likely" into a single word). There is no obvious difference between the two frames below; even the moirĂ© effect that colorizes the text does not appear to differ substantially between them. Nor is this merely a spacing issue: a manual review of the API's raw JSON output shows that in those instances it transcribes the phrase as a single word with no space between the characters. OCRing each frame individually as its own image eliminates this oddity, but increases the cost of archive-scale OCR to intractable levels. This is simply one of the tradeoffs of such performance and cost optimizations: there will be some reduced accuracy in edge cases and occasional oddities, but overall performance will be on par with single-image OCR.
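For those curious how such merges surface in the raw response, the sketch below walks Cloud Vision's word hierarchy to flag suspiciously long "words". It uses the standard google-cloud-vision Python client; the input filename and the length threshold are placeholders, and this is not our actual review script.

```python
# A hedged sketch of inspecting Cloud Vision's word-level grouping in the
# raw response to spot merges like "chieflikely". The filename and the
# 12-character threshold are illustrative assumptions.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("grid.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)

# Walk the fullTextAnnotation hierarchy: pages > blocks > paragraphs > words.
# Each word is a list of symbols; a merged token shows up as one word whose
# symbols run together with no space break between "chief" and "likely".
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                text = "".join(s.text for s in word.symbols)
                if len(text) > 12:
                    print("possible merged word:", text)
```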