Looking back at the lessons we've learned over the past week optimizing the speed and cost of video OCR, why did we embark upon this series of experiments? The answer lies in the intractable cost of applying SOTA OCR to the Internet Archive's Television News Archive, a quarter-century of global television news spanning 327 channels across 50 countries on 5 continents in over 150 languages. Applying the most advanced video OCR to the complete archive would cost more than $70M, while even downsampling the complete archive to 1fps still images and applying image OCR would cost more than $42M. Using the creative, highly optimized workflow outlined over the past week, we can reduce that cost to an estimated $363,000, and likely far less given that a large portion of the archive is in SD rather than HD resolution. Over the coming weeks we'll be putting this workflow into production and sharing our experiences here, and we will use the eventual archive-scale OCR results to power a new generation of the TV Explorer that will allow dual keyword search of both spoken word and onscreen textual transcripts.
Compared with speech ASR, video and image OCR are vastly more expensive because they must process far more data. Gold standard moving image OCR costs $0.15/minute, meaning OCR'ing a single hour-long video costs $9, while using ffmpeg to split it into 1fps frames and OCR'ing each frame individually through Cloud Vision would cost $6. To OCR a full 24-hour day of a single channel would cost $216 for Cloud Video or $144 for Cloud Vision. At the scale of the complete quarter-century TV News Archive, such costs simply become intractable.
Representing videos as montages, by contrast, opens the door to batch processing large numbers of frames in a single request, dramatically reducing cost and massively increasing throughput. Using our single-column vertical layout and native resolution recommendations, together with Cloud Vision's 75M pixel image maximum for OCR, an HD broadcast (1280 x 720) sampled at 1fps can be distributed into image montage sheets containing 75000000 / 1290 / 720 = ~80 frames each (we use 1290 instead of 1280 to account for the 10 pixel black border we put between adjacent images to ensure OCR robustly separates them). At this rate, OCR'ing the video's 3,665 frames at full native HD resolution at 1fps would take 3665 / 80 = 46 montage images (rounded up) for around $0.069, instead of $6 if we OCR'd each image individually.
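The montage arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the per-image price of $0.0015 is an assumption inferred from the figures in this post (46 montage images for around $0.069), not a quoted rate.

```python
import math

MAX_PIXELS = 75_000_000    # Cloud Vision OCR pixel maximum cited above
FRAME_W, FRAME_H = 1280, 720
BORDER = 10                # black border allowance, as in the 1290 figure above
PRICE_PER_IMAGE = 0.0015   # assumption: implied by 46 montages costing ~$0.069


def frames_per_montage(width, height, border=BORDER, max_pixels=MAX_PIXELS):
    """Whole frames that fit in one montage under the pixel maximum."""
    return max_pixels // ((width + border) * height)


def montage_cost(total_frames, per_montage, price=PRICE_PER_IMAGE):
    """Number of montage images needed (rounded up) and their total cost."""
    montages = math.ceil(total_frames / per_montage)
    return montages, montages * price


per_montage = frames_per_montage(FRAME_W, FRAME_H)
montages, cost = montage_cost(3665, per_montage)
print(per_montage, montages, round(cost, 3))
```

Passing a smaller frame size to `frames_per_montage` shows how quickly the number of frames per montage grows for lower-resolution broadcasts.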
Compared with the $144 to $216 per channel per day OCR costs above, this works out to $1.66 per channel per day to OCR at native resolution at 1fps. At the scale of the entire Archive, this works out to $362,480 to OCR the entire quarter-century Archive at native resolution, and even less considering that as much as half the Archive is in SD resolution, which dramatically increases the number of frames we can fit per montage.
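As a rough check on the per-day figures, the sketch below scales the three hourly costs quoted in this post to a 24-hour day. The labels and the helper name are illustrative only.

```python
HOURS_PER_DAY = 24

# Hourly OCR costs for one channel, taken from the figures in this post.
HOURLY_COST = {
    "video_ocr": 9.00,        # $0.15/minute moving image OCR
    "per_frame_ocr": 6.00,    # one image OCR request per 1fps frame
    "montage_ocr": 0.069,     # 46 montage images per hour-long video
}


def daily_cost(hourly_cost, hours=HOURS_PER_DAY):
    """Cost of OCR'ing one channel for a full day at the given hourly cost."""
    return hourly_cost * hours


daily = {name: round(daily_cost(cost), 2) for name, cost in HOURLY_COST.items()}

for name, cost in daily.items():
    print(f"{name}: ${cost:,.2f} per channel per day")
```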
Stay tuned as we ramp up our experiments with reducing the costs of applying advanced SOTA AI models like ASR and OCR to make massive video and image archives searchable for journalists and scholars trying to better understand our global world.