TV Visual Explorer: Using Still Image Video Surrogates With Video AI

Last month we explored what Google's Cloud Video API sees in an episode of Russian television's "Antifake" show. In that case, we analyzed the original MPEG4 archival video object, so the API was processing native video content. Neural video models are designed specifically for moving image content, raising the question of how a state-of-the-art video model might fare on a sequence of still image snapshots reconstructed into a video. In essence, what might it look like to use a surrogate video object constructed from 1/4fps screen captures of the broadcast in place of the video itself? And could this offer a reasonable way of applying neural video models non-consumptively to sequences of 1/4fps still images of a broadcast?

With the launch of the TV Visual Explorer, we make available a ZIP file of the full resolution preview images used to generate the thumbnail grid for each broadcast. The ZIP file consists of full resolution screen captures, one frame every 4 seconds across the entire course of the broadcast, capturing its core visual narrative arc. In the case of the June 2, 2022 Antifake broadcast above, view it in the TV Visual Explorer and click on the download icon at the top-right of the Explorer page, then click on the "Download Full Resolution Thumbnails" link to download the ZIP file of the images.
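As a rough sketch of the preparation step (the ZIP and directory names here are illustrative and should be adjusted to match the actual download), the images can be unpacked and renumbered into the sequentially named files that ffmpeg's image sequence input expects:

# Unpack the downloaded thumbnail ZIP and renumber the frames as 1.jpg, 2.jpg, ...
unzip thumbnails.zip -d ./RAW/
mkdir -p ./VIDEO/
i=1
for f in $(ls ./RAW/*.jpg | sort -V); do
  cp "$f" "./VIDEO/$i.jpg"
  i=$((i+1))
done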

Convert the sequence of still image captures into a surrogate MP4 video object using ffmpeg, where X is the input frame rate controlling how quickly the images are played back:

ffmpeg -framerate X -i ./VIDEO/%d.jpg -vcodec libx264 -y -an video.mp4

Here is the resulting video surrogate for the Antifake show above, constructed from the preview images and assembled to display each preview image for 1/4 of a second (4fps), yielding a roughly 5m7s movie. In short, we took the preview images for this broadcast, which sample the broadcast at one frame every 4 seconds, and arranged them into an MP4 file that displays each of them in sequence for exactly 1/4 of a second:
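For reference, the command above with the input frame rate set to 4 produces a surrogate of this kind (the output filename here is illustrative):

ffmpeg -framerate 4 -i ./VIDEO/%d.jpg -vcodec libx264 -y -an video_4fps.mp4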

Analyzing this through the Video AI API yields the following JSON annotations:
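For those wishing to reproduce this kind of analysis, a minimal sketch of submitting a surrogate video to the Video AI API through its REST interface might look like the following (the GCS bucket, filename and feature list are placeholders, and the returned long-running operation must be polled until it completes):

# Submit the surrogate video (assumed to have been copied to a GCS bucket) for annotation
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://videointelligence.googleapis.com/v1/videos:annotate" \
  -d '{"inputUri": "gs://YOUR_BUCKET/video_4fps.mp4", "features": ["LABEL_DETECTION", "TEXT_DETECTION", "SHOT_CHANGE_DETECTION"]}'

# The call returns an operation name; poll it until "done" is true to retrieve the JSON annotations
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://videointelligence.googleapis.com/v1/OPERATION_NAME"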

The results, while reasonable, aren't as good as those of the native video file because the API is designed for true motion video, rather than a sequence of still images concatenated together.

Similarly, here is the same process, but with the images displayed even faster, each preview image shown for just 1/8 of a second (8fps), yielding a surrogate video of 2m33s:
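A surrogate at this faster rate can be generated by simply doubling the input frame rate in the command above (output filename again illustrative):

ffmpeg -framerate 8 -i ./VIDEO/%d.jpg -vcodec libx264 -y -an video_8fps.mp4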

Analyzing this through the Video AI API yields the following JSON annotations:

These results are more sparse than those of the 4fps surrogate, reflecting the neural model's struggle to reconcile the sparse sampling of the still images with its dependence on motion sequences, artifacts and transitions.

As a comparison, here is a 30 second clip from the source video that contains several textual sequences:
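For those recreating this comparison, a short clip like this can be cut from the source MPEG4 file with ffmpeg (the start offset and filenames here are purely illustrative):

# Extract a 30-second clip without re-encoding (cut points will snap to keyframes)
ffmpeg -ss 00:05:00 -i source_broadcast.mp4 -t 30 -c copy clip_30s.mp4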

And the resulting JSON annotations:

Comparing this sequence, which can also be found in the full-broadcast annotation file, showcases the power of video OCR, in which the neural model uses motion and artifacting over time to maximize its recovery of onscreen text. Download these two MP4 and JSON files and view them in the Video Intelligence API Visualizer to see just how powerful state-of-the-art video-based OCR is.
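As a quick way of skimming the recovered text outside of the visualizer, the OCR results in the downloaded JSON file can be listed with jq (the field names below assume the REST API's camelCase JSON; if the file is a raw operation response, prefix the path with .response):

# Print every onscreen text snippet recovered by the text detection feature
jq -r '.annotationResults[].textAnnotations[].text' annotations.json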

In conclusion, reconstructing sampled still image sequences into surrogate video files unsurprisingly yields less accurate results with neural video models than applying them to the original source video content. This is exactly as expected, since video models are designed to take advantage of the motion of video to achieve greater accuracy. Instead, a better approach is to analyze the Explorer's preview image sequences through a tool designed specifically for still images: GCP's Cloud Vision API. The Vision API has the added benefit of being able to perform the equivalent of a reverse Google Images search across the open web for each frame, finding connections between the online and television news worlds.
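As a rough sketch of that alternative approach (bucket name and frame filename are placeholders), a single preview image can be submitted to the Cloud Vision API for OCR and web detection, the latter providing the reverse image search style matches described above:

# Annotate one preview frame with web detection and OCR
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://vision.googleapis.com/v1/images:annotate" \
  -d '{"requests": [{"image": {"source": {"imageUri": "gs://YOUR_BUCKET/VIDEO/42.jpg"}}, "features": [{"type": "WEB_DETECTION"}, {"type": "TEXT_DETECTION"}]}]}'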