As we ramp up our work to expand OCR to the complete Internet Archive Television News Archive, the key to making OCR at this scale tractable is to sample each broadcast at 1fps and pack those frames into a single montage image grid to be OCR'd all at once, rather than attempting to OCR each frame individually. Previous experiments have demonstrated minimal difference in accuracy between single-frame OCR and packed-image OCR, but those efforts focused on traditional text overlays. What about a far more difficult case: partially obscured, angled text that is blurry and suffers from heavy video artifacts in a low-resolution SD broadcast?
When presented with a single isolated frame, GCP's Cloud Vision API was able to transcribe the majority of the text in the image. Yet, when presented with that same image as part of a grid montage layout, the API performed markedly worse. No single factor appears to drive the better or poorer performance: even reducing the total number of images in the montage has minimal impact compared with which images surround the given frame. This suggests that while a montage-based approach to at-scale video OCR can reduce costs by upwards of 200x, it comes at the cost of reduced accuracy on edge-case background text under challenging conditions. At the same time, television news OCR use cases are primarily focused on standard onscreen text, so this presents less of a challenge for our specific needs.
Let's use this specific frame, at the 59-second mark in the broadcast, as our benchmark and attempt to extract the background text using GCP's Cloud Vision:
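For reference, a minimal sketch of that single-frame benchmark, assuming the broadcast has been downloaded locally as video.mp4 (an illustrative filename) and reusing the same [BUCKET] and [YOURPROJECTID] placeholders as the montage workflow below:

# extract just the frame at the 59 second mark
ffmpeg -nostdin -hide_banner -loglevel panic -ss 59 -i ./video.mp4 -frames:v 1 ./frame59.jpg
# upload it and OCR it with Cloud Vision's TEXT_DETECTION feature
gsutil cp ./frame59.jpg gs://[BUCKET]/frame59.jpg
curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[BUCKET]/frame59.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' > ./frame59.json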
This yields the single-frame benchmark results of:
รวมสิรินธรเพื่อการฟื้นฟูสมรรถภา cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม - แรมเซ็นทราศูนย์ราชการและค
What if we pack the first 61 seconds of the broadcast into an 8×8 grid to allow us to OCR many frames at once?
We'll use the following code:
# stage the extracted frames in a RAM disk
mkdir /dev/shm/tmp/
# stream the broadcast from GCS and extract one frame per second
timeout -s 9 130m gcloud -q --no-user-output-enabled storage cat gs://[BUCKET]/video.mp4 | ffmpeg -nostdin -threads 1 -hide_banner -loglevel panic -t $ARG_VIDEODUR -i - -vf "fps=1,scale=iw*sar:ih" -start_number 0 "/dev/shm/tmp/frame-%06d.jpg"
# pack the frames into a single montage grid (the -tile setting and frame range are adjusted for each grid size tested below)
time montage /dev/shm/tmp/frame-%06d.jpg[0-224] -tile 15x15 -geometry +10+10 -background black ./montage.jpg; ls -alh montage*
# upload the montage and OCR it with Cloud Vision
gsutil cp ./montage.jpg gs://[BUCKET]/montage.jpg
time curl -s -H "Content-Type: application/json; charset=utf-8" -H "x-goog-user-project:[YOURPROJECTID]" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '{ "requests": [ { "image": { "source": { "gcsImageUri": "gs://[BUCKET]/montage.jpg" } }, "features": [ {"type":"TEXT_DETECTION"} ] } ] }' > ./montage.json
This time that same frame yields similar text, but is missing the first line:
ing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม - *แรมเซ็นทราศูนย์ราชการและค
What if we reduce our window to just the 9 seconds ending with this frame, packed into a 3×3 grid?
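Assuming the 0-indexed frame filenames from the workflow above, so that the benchmark frame is roughly frame-000059, the 3×3 variant might look like:

# pack the 9 frames covering seconds 51-59 into a 3x3 grid
montage /dev/shm/tmp/frame-%06d.jpg[51-59] -tile 3x3 -geometry +10+10 -background black ./montage.jpg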
This yields nearly the same results as the single frame, though with a few characters of difference:
สามสิรินธรเพื่อการฟื้นฟูสมรรถภ ing Rehabilitation Service ระหว่างวันที่ ๓๑ มีนาคม แรมเซ็นทราศูนย์ราชการและค
What if we go the opposite direction and pack a 15×15 grid of images into a single sheet? With 640×480 frames and 10-pixel borders, that works out to roughly 9,900×7,500 pixels, or about 74 megapixels, making it the largest grid that will fit under Cloud Vision's 75MP limit:
This time we get similar results as we got for the 61 second grid, but slightly worse:
ing Rehabilitation Servic มารา มิ แอร์ประจานวงการและค
What if instead of JPG, we use WEBP for our montage format? This reduces the filesize from 18.3MB down to just 4.8MB:
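The only change from the JPEG montage is the output filename (ImageMagick selects the WEBP encoder from the extension), assuming the same 15×15 layout as the previous test:

montage /dev/shm/tmp/frame-%06d.jpg[0-224] -tile 15x15 -geometry +10+10 -background black ./montage.webp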
Unfortunately, the text extraction doesn't improve, even though the text is slightly clearer:
ing Rehabilitation Servic มาราต้น อ มิกลม
What if instead of generating JPG frames from the source video, we generate WEBP frames directly from ffmpeg and then montage those? Interestingly, this actually yields worse results:
wing Rehabilitation Service มา
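For reference, a sketch of how the WEBP frames might be generated, assuming ffmpeg's libwebp encoder at its default settings and the same 15×15 layout:

# extract WEBP frames instead of JPG (fed from the same gcloud storage cat pipe as before)
ffmpeg -nostdin -threads 1 -hide_banner -loglevel panic -t $ARG_VIDEODUR -i - -vf "fps=1,scale=iw*sar:ih" -start_number 0 -c:v libwebp "/dev/shm/tmp/frame-%06d.webp"
montage /dev/shm/tmp/frame-%06d.webp[0-224] -tile 15x15 -geometry +10+10 -background black ./montage.webp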
What if we increase the quality of the underlying WEBP montage? Unfortunately, using lossless yields a file that is too large (56MB):
-define webp:lossless=true
What if we set the WebP compression method to 6, which consumes the most CPU time but yields the best compression/quality tradeoff?
-define webp:method=6
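In context, the define slots into the montage command, assuming the earlier JPG frames and 15×15 WEBP montage (the 4.79MB file referenced below):

montage /dev/shm/tmp/frame-%06d.jpg[0-224] -tile 15x15 -geometry +10+10 -background black -define webp:method=6 ./montage.webp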
This yields an image of 4.73MB vs the original 4.79MB and only minimally improves the output, though it does restore more of the Thai text, suggesting that WebP compression is the culprit:
ing Rehabilitation Service แนวทารกช่วงคอและที
What if we adjust its "-quality" setting? Using a quality of 100, we get a 58.3MB file, with 50 we get 3.6MB, with 75 we get 4.6MB, with 90 we get 9.2MB, with 95 we get 14MB and with 99 we get 19.49MB. Let's repeat our WEBP frames -> WEBP montage using a quality setting of 99 for the montage. Unfortunately, this actually yields worse results:
ing Rehabilitation Service มาน เช
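For reference, a quality/filesize sweep like the one above could be scripted roughly as follows, assuming the WEBP frames and the same 15×15 layout:

# generate one montage per quality setting and compare sizes
for Q in 50 75 90 95 99 100; do
  montage /dev/shm/tmp/frame-%06d.webp[0-224] -tile 15x15 -geometry +10+10 -background black -quality $Q ./montage-q$Q.webp
done
ls -alh montage-q*.webp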
What if we use the original JPG frames, but this time create a WEBP montage with 99 quality? Again, minimal change:
wing Rehabilitation Service มหานที เบ
What if we resize the 640×480 SD frames up to 1024×768 in FFMPEG as it outputs them, using WEBP as both the frame and montage formats? Given that WEBP has far more limited dimensions (16K x 16K vs JPEG's 65K x 65K), we'll also reduce our grid to 9×9 to keep under the API's 75MP limit:
ffmpeg -i - -vf "fps=1,scale=iw*sar:ih,scale=1024:768" -start_number 0 -c:v libwebp -lossless 1 "/dev/shm/tmp/frame-%06d.webp"
This yields a 5.6MB file, with not much improvement:
cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม - และ
What if we output 1024×768 JPEG frames and build a JPEG montage from them? This again requires reducing to a 9×9 grid. Unfortunately, this yields very poor quality results:
cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม - ແລະວັນພຍາຄຸນທ່ານສະດີ 1000
What if we boost the ffmpeg frame quality to "-q:v 1" and the montage quality to 95? Again, not much change:
cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม -
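Roughly, assuming the same 1024×768 frames and 9×9 layout, the two quality flags slot in as:

# highest-quality JPEG frames from ffmpeg, fed from the same gcloud storage cat pipe as before
ffmpeg -nostdin -threads 1 -hide_banner -loglevel panic -t $ARG_VIDEODUR -i - -vf "fps=1,scale=iw*sar:ih,scale=1024:768" -start_number 0 -q:v 1 "/dev/shm/tmp/frame-%06d.jpg"
montage /dev/shm/tmp/frame-%06d.jpg[0-80] -tile 9x9 -geometry +10+10 -background black -quality 95 ./montage.jpg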
What if we have FFMPEG output standard frames and then resize them with ImageMagick, outputting quality-95 images?
# resize the extracted frames to 1024x768 at quality 95 with ImageMagick
mkdir /dev/shm/tmp-RESIZED/
time find /dev/shm/tmp/*.jpg | parallel --eta 'convert {} -resize 1024x768 -quality 95 /dev/shm/tmp-RESIZED/{/}'
# montage the resized frames into a 9x9 grid
time montage /dev/shm/tmp-RESIZED/frame-%06d.jpg[0-80] -tile 9x9 -geometry +10+10 -background black ./montage.jpg; ls -alh montage*
And we're back to our problematic output:
cing Rehabilitation Service ระหว่างวันที่ 20 มีนาคม - ຂນແລະວັນພາສູບທ່າທາຍແລະດີ
What if we use the same workflow, but output to a WEBP montage? This doesn't improve things much:
wing Rehabilitation Service ระว่างวันที่ ๒๐ มีนาคม
What if we use a 4×20 grid with the upscaled JPEG frames and WEBP montage? We're actually almost to the single-frame output, minus the first line:
ing Rehabilitation Service หว่างวันที่ ๒๐ มีนาคม - และเพราศูนย์ราชการและค
What if we use our upscaled JPEG frames and a 1×80 JPEG montage? This is the best we've gotten so far from our montage approaches, though it is missing the first line and the last line appears to contain high error:
ing Rehabilitation Service ระหว่างวันที่ 20 มีนาคม ที่แรมเซ็นทราศูนย์ราชการและค รอยสนธรเพื่อการเป็นผู้สมองเป็น
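For reference, these strip and grid variants differ only in the -tile argument, which ImageMagick reads as columns×rows; assuming the resized frames, the 4×20 grid above and this 1×80 column would be roughly:

montage /dev/shm/tmp-RESIZED/frame-%06d.jpg[0-79] -tile 4x20 -geometry +10+10 -background black ./montage.webp
montage /dev/shm/tmp-RESIZED/frame-%06d.jpg[0-79] -tile 1x80 -geometry +10+10 -background black ./montage.jpg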
What if we go back to the original SD JPEG frames and 1×80 JPEG montage? The results are little better:
cing Rehabilitation Service มาว่างวันที่ 20 มีนาคม - ที่แรมเซ็นทราศูนย์ราชการและค สวยเป็นธรเพื่อการขึ้นฟูสมองเบ
What if we make a 1×13 montage using the source SD JPEG frames and a JPEG montage? This time we are finally back to our single-frame results:
อสิรินธรเพื่อการฟื้นฟูสมรรถภ cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม ผู้แรมเซ็นทราศูนย์ราชการและค
What if we go back to our 1×80 montage, but use 120 pixels of black space between frames instead of 20?
ing Rehabilitation Service มาว่าวันที่ 2 มีนาคม กรมเซ็นทราศูนย์ราชการและๆ
The results are no better, suggesting it is not the gap between text blocks that is the problem, but rather something else.
What if we use an 80×1 grid? This time the results are really poor:
ing Rehabilitation Service ราชิ มิน
What about a 1×50 grid using frames 16-65 and back to 10 pixel borders? This is nearly identical to the single-frame OCR:
ของสิรินธรเพื่อการฟื้นฟูสมรรถภ cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม มีแรมเซ็นทราศูนย์ราชการและค
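For reference, assuming the source SD frames, that 1×50 strip would be roughly:

montage /dev/shm/tmp/frame-%06d.jpg[16-65] -tile 1x50 -geometry +10+10 -background black ./montage.jpg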
What if we keep everything the same, but move up 25 images in the sequence, so the frames in question are in the middle, giving us frames 41-90? This time we're back to our poorer results. This suggests that it is not the number of frames in the image, but rather the overall mixture of images that affects the results:
cing Rehabilitation Service ระหว่างวันที่ ๒๐ มีนาคม แรมเซ็นทราศูนย์ราชการและค
What if we use the same workflow, but use a border of 800 pixels between images? Strangely, this causes the API to output only English text for the entire montage sheet, underscoring just how sensitive the API is to the overall image mixture:
ing Rehabilitation Service