As we continue to scale up our work OCR'ing a quarter century of global television news broadcasts, one of the many technical challenges we've encountered lies in the divide between the image-processing stages of the pipeline and the AI API calling stage. Though it adds considerable technical complexity, we ultimately separated these two stages for bulk historical processing, while allowing them to converge for realtime production processing.
To OCR each television news broadcast, we must stream its full-resolution MPEG or MP4 video file from GCS to the local VM, run ffmpeg to burst it into 1fps frames, use ImageMagick's montage to assemble those still frames into montage grid sequences, write those montages back into GCS and finally call GCP's Cloud Vision API to OCR each montage frame. This typically takes several minutes per broadcast, and up to tens of minutes for longer broadcasts. It is also extremely unpredictable: two broadcasts with nearly identical metadata can take vastly different times to process due to variations in JPEG compression and regional Cloud Vision API load. The result is extremely unstable API submission rates, which make it impossible to maximize quota usage, even when using both intra-VM and inter-VM coordination and call orchestration. To address this, we built a single OCR engine that supports both a self-contained end-to-end montage+OCR mode for ongoing realtime production OCR and a split-workload mode for bulk historical OCR.
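To make the image-processing stage concrete, here is a minimal sketch of bursting a broadcast into 1fps frames with ffmpeg and packing them into montage grids with ImageMagick. It assumes the video has already been streamed to local disk; the filenames, grid size and frames-per-sheet values are illustrative placeholders rather than our production settings.

```python
import glob
import subprocess

def burst_and_montage(video_path: str, work_dir: str, tile: str = "6x6") -> list[str]:
    # Burst the broadcast into 1fps still frames with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", f"{work_dir}/frame-%06d.jpg"],
        check=True,
    )
    frames = sorted(glob.glob(f"{work_dir}/frame-*.jpg"))

    # Pack each run of 36 frames into a single montage grid sheet so that one
    # Cloud Vision call covers 36 seconds of airtime (36 = 6x6 tile grid).
    montages = []
    tiles_per_sheet = 36
    for i in range(0, len(frames), tiles_per_sheet):
        out = f"{work_dir}/montage-{i // tiles_per_sheet:04d}.jpg"
        subprocess.run(
            ["montage", *frames[i:i + tiles_per_sheet],
             "-tile", tile, "-geometry", "+0+0", out],
            check=True,
        )
        montages.append(out)
    return montages
```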
For ongoing realtime production OCR, the engine simply performs the entire pipeline from start to finish, since realtime processing API workloads place us nowhere near our maximum API quotas.
For bulk historical OCR we need to achieve API submission rates that sit just barely under the maximum hard quota for each API. To achieve this, the complete fleet of processing nodes performs only the image-processing and montaging stage. Every machine runs as many montage engines as it has cores (slightly oversubscribed to overlap network-bound work with computation). Upon completing a broadcast, each engine uploads its montage frames to GCS, writes the necessary info about those frames to BigTable and appends the broadcast to a second OCR-only queue.
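The hand-off at the end of the montage stage might look roughly like the sketch below. The bucket, project, instance, table and column-family names are hypothetical, and we use a Redis list purely as a stand-in for the OCR-only queue, since the actual queue mechanism is not described here.

```python
import json
from google.cloud import bigtable, storage
import redis

def finish_montage(broadcast_id: str, montage_paths: list[str]) -> None:
    # Upload each montage sheet for this broadcast to GCS.
    bucket = storage.Client().bucket("example-montage-bucket")
    uris = []
    for path in montage_paths:
        blob = bucket.blob(f"{broadcast_id}/{path.rsplit('/', 1)[-1]}")
        blob.upload_from_filename(path)
        uris.append(f"gs://example-montage-bucket/{blob.name}")

    # Record the montage frame info for this broadcast in BigTable.
    table = (bigtable.Client(project="example-project")
             .instance("example-instance").table("broadcasts"))
    row = table.direct_row(broadcast_id.encode())
    row.set_cell("ocr", "montage_frames", json.dumps(uris).encode())
    row.commit()

    # Append the broadcast to the OCR-only queue for the API submission VMs.
    redis.Redis(host="example-queue-host").rpush("OCRONLY", broadcast_id)
```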
Of the entire OCR fleet, four 32-core VMs are designated for API submission. They divide the maximum per-minute Cloud Vision API quota evenly among themselves and maintain fixed submission rates that keep the fleet just under that ceiling.
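The arithmetic behind those fixed rates is simple: divide the per-minute quota across the submission VMs and their workers, then pace each worker on a fixed interval. The quota value below is a hypothetical figure for illustration, not our actual Cloud Vision limit.

```python
import time

QUOTA_PER_MINUTE = 1800        # hypothetical fleet-wide Cloud Vision quota
SUBMISSION_VMS = 4             # VMs dedicated to API submission
WORKERS_PER_VM = 32            # one submission worker per core

per_worker_per_minute = QUOTA_PER_MINUTE / (SUBMISSION_VMS * WORKERS_PER_VM)
interval_seconds = 60.0 / per_worker_per_minute   # pause between submissions

def paced(submit) -> None:
    """Call submit() at a fixed cadence that keeps the fleet just under quota."""
    while True:
        started = time.monotonic()
        submit()
        # Sleep off whatever remains of this worker's slot before the next call.
        time.sleep(max(0.0, interval_seconds - (time.monotonic() - started)))
```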
Thus, the overall architecture looks like this. There are two queues: MONTAGE and OCRONLY. The complete OCR fleet runs as many MONTAGE engines as there are cores on each VM. As each MONTAGE process finishes, it writes its frames to GCS, updates the BigTable entry for the broadcast with the necessary frame info and appends the broadcast to the OCRONLY queue. Four VMs run OCR engines that monitor only the OCRONLY queue and simply submit the frames to the Cloud Vision API, record the results back to GCS and update BigTable upon completion.
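A sketch of one OCRONLY consumer iteration is shown below, reusing the hypothetical bucket, table and queue names from the earlier sketches; in practice the Cloud Vision calls would be wrapped in the fixed-rate pacing described above.

```python
import json
from google.cloud import bigtable, storage, vision
import redis

queue = redis.Redis(host="example-queue-host")
annotator = vision.ImageAnnotatorClient()
results_bucket = storage.Client().bucket("example-ocr-results-bucket")
table = (bigtable.Client(project="example-project")
         .instance("example-instance").table("broadcasts"))

def ocr_next_broadcast() -> None:
    # Block until the montage fleet hands us another broadcast.
    _, broadcast_id = queue.blpop("OCRONLY")
    broadcast_id = broadcast_id.decode()

    # Look up the montage frame URIs the montage engine recorded in BigTable.
    row = table.read_row(broadcast_id.encode())
    uris = json.loads(row.cells["ocr"][b"montage_frames"][0].value)

    for i, uri in enumerate(uris):
        # OCR each montage frame directly from GCS via Cloud Vision.
        image = vision.Image(source=vision.ImageSource(image_uri=uri))
        response = annotator.document_text_detection(image=image)
        # Persist the raw OCR response alongside the broadcast's other output.
        results_bucket.blob(f"{broadcast_id}/ocr-{i:04d}.json").upload_from_string(
            vision.AnnotateImageResponse.to_json(response)
        )

    # Mark the broadcast as OCR-complete.
    done = table.direct_row(broadcast_id.encode())
    done.set_cell("ocr", "status", b"complete")
    done.commit()
```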
This decouples the CPU-intensive image processing from the API-driven OCR submission sufficiently to maintain near-constant API load.
You can see the final results below. The first graph shows the best-possible throughput under our previous coordination system, in which all OCR processes performed both montaging and OCR and attempted to coordinate globally to maintain a fixed API throughput. The second shows the load under the new system. In fact, the new system is so efficient that the sharp downward spikes you see are the montaging fleet being unable to keep up with the API submission rate; we are rapidly expanding the montaging fleet to address this.