Scanning Television News At Scale To Find An Excerpted Clip

Earlier today, a clip was posted to Twitter that excerpted a Russian television news broadcast, but the tweet did not include any information about which broadcast, channel or time of day the clip was from. How might we scan Russian television news at scale to locate this clip?

In this particular case, the distinctive color scheme of the clip readily identified it as coming from NTV, and the Visual Explorer's thumbnail grid interface made it possible to skim the last two days of broadcasts in under a minute and identify the source broadcast:

The clip in question is highly similar to quite a few sequences in this two-hour broadcast. Rather than manually reviewing all of those sequences to find the most similar one, could we use simple image similarity search to scan the entire broadcast automatically?

First, we'll use youtube-dl to download the video from the Twitter post:

apt-get -y install youtube-dl
pip install --upgrade youtube_dl
youtube-dl --output video.mp4 [TWEETURL]

Note that even on a brand-new VM, after installing youtube-dl you'll need to run "pip install --upgrade youtube_dl" to upgrade it to the latest version that includes the current extraction templates for Twitter; otherwise you'll get an error about not being able to identify the video.

This will save the video to "video.mp4" in the local directory.

Now we'll install ffmpeg and ImageMagick:

apt-get -y install ffmpeg
apt-get -y install imagemagick

Now we'll download the preview image ZIP file from the Visual Explorer for that broadcast. Remember that this contains the full-resolution versions of the images that make up the thumbnail grid, which are sampled one frame every 4 seconds:
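As a sketch, fetching and unpacking the preview ZIP might look like the following. Note that the exact URL pattern here is an assumption based on the broadcast identifier; use the actual download link shown in the Visual Explorer if it differs:

```shell
# Broadcast identifier from the Visual Explorer (channel_date_time_showname)
ID="NTV_20221028_110000_Mesto_vstrechi"

# ASSUMPTION: the preview ZIP is addressable by broadcast ID like this;
# substitute the actual download link from the Visual Explorer if it differs.
URL="https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/${ID}.zip"

# Guarded so the sketch can be read/run without network access
if [ "${DO_DOWNLOAD:-0}" = "1" ]; then
  curl -s -O "$URL"
  unzip -q "${ID}.zip"
fi
```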


To optimize the speed of our similarity scanning, we'll resize frames from the Twitter video down to the resolution of the original television broadcast, so first we check its size:

identify ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000001.jpg
>./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000001.jpg JPEG 1024x576 1024x576+0+0 8-bit sRGB 32786B 0.000u 0:00.000

This tells us that the broadcast was at 1024×576 resolution. We'll use this number in a moment.

Now, to scan the broadcast preview images from the ZIP file for matches with the Twitter video, we'll need to extract a single frame from the Twitter video to scan for. Let's start by extracting the first frame and resizing it down to the broadcast resolution of 1024×576:

ffmpeg -i video.mp4 -qmin 1 -qscale:v 1 -vframes 1 -f image2 findframe.jpg
convert findframe.jpg -resize 1024x576 findframe.resized.jpg
identify findframe.resized.jpg

We don't technically have to resize the Twitter frame; we could leave it at its original resolution and the comparisons would still work, but resizing it yields roughly a 3x speedup.

This yields the following frame:

We can then use ImageMagick's "compare" utility to compare this frame to every frame in the Visual Explorer's ZIP file to score its similarity, then compile those together into a single list and sort them by similarity:

time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.jpg" | parallel --eta 'compare -channel all -metric MSE {} findframe.resized.jpg null: 2> {}.sim'
rm FRAMESIMS; time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.sim" | parallel --eta "perl -ne 'print \"\$1\t{}\n\" if /\(([\d\.]+)\)/' {} >> FRAMESIMS"
sort -n FRAMESIMS | head -10

Of course, "similarity" can be defined a number of ways and compare supports several different metrics. In this case we'll use MSE ("mean error squared, average of the channel error squared") across all three (RGB) channels.

On a 48-core N1 VM, scanning the entire two-hour broadcast for this clip took just 33 seconds. The most similar frames are:

0.102174 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-001375.jpg.sim
0.115978 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-001163.jpg.sim
0.116128 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000481.jpg.sim
0.116368 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000705.jpg.sim
0.116521 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000527.jpg.sim
0.11655 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000271.jpg.sim
0.116632 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000405.jpg.sim
0.116799 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000988.jpg.sim
0.117049 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000179.jpg.sim
0.117067 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000643.jpg.sim

Converting from frame numbers to time offsets, we get the top three most similar clips:
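Since the preview frames are sampled once every 4 seconds, converting a frame number to a time offset is simple arithmetic, assuming frame 000001 corresponds to the start of the broadcast. A quick sketch for the top match:

```shell
# Convert a Visual Explorer frame number to an offset into the broadcast,
# assuming frame 000001 is sampled at 0 seconds and one frame every 4 seconds.
frame_to_offset() {
  SECS=$(( ($1 - 1) * 4 ))
  printf '%02d:%02d:%02d\n' $((SECS / 3600)) $((SECS % 3600 / 60)) $((SECS % 60))
}

frame_to_offset 1375   # top MSE match above -> 01:31:36 into the broadcast
```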

Alternatively, compare also supports perceptual hashing, which may be more robust to slight image changes such as camera movement:

time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.jpg" | parallel --eta 'compare -channel all -metric phash {} findframe.resized.jpg null: 2> {}.sim'
rm FRAMESIMS; time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.sim" | parallel --eta "perl -ne 'print \"\$1\t{}\n\" if /([\d\.]+)/' {} >> FRAMESIMS"
sort -n FRAMESIMS | head -10

Note that the phash metric outputs just a single numeric score, rather than the two numbers of MSE, so the parsing code is slightly adjusted. Note too that perceptual hashes are more expensive to compute, so this run takes 1m52s to complete. It yields:

6.76445 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000179.jpg.sim
7.6352 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000260.jpg.sim
8.98304 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000419.jpg.sim
9.69809 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-001375.jpg.sim
9.74784 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-001622.jpg.sim
10.6769 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000226.jpg.sim
10.8733 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-001238.jpg.sim
10.9124 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000657.jpg.sim
11.2085 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000279.jpg.sim
12.0758 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000734.jpg.sim

The top three most similar clips are:

One challenge with the matches above is that the first frame of the Twitter video is a generic wide shot of the studio that appears throughout the broadcast. Using a more distinctive shot from the clip might help us narrow down our matches.

Scroll through the video clip on Twitter until you see a specific moment that appears somewhat distinctive and might help more uniquely identify the clip. In this case, 44 seconds into the clip there is a close-up of one of the presenters with a corresponding chyron. Using that timecode, we can extract that specific frame from the video:

rm findframe.jpg; ffmpeg -i video.mp4 -ss 00:00:44 -frames:v 1 findframe.jpg
convert findframe.jpg -resize 1024x576 findframe.resized.jpg

Note how we added "-ss 00:00:44 -frames:v 1" to tell ffmpeg to extract a single frame at offset 44 seconds into the video.

This yields the following frame:

Now let's repeat our fast MSE metric with this frame:

time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.jpg" | parallel --eta 'compare -channel all -metric MSE {} findframe.resized.jpg null: 2> {}.sim'
rm FRAMESIMS; time find ./NTV_20221028_110000_Mesto_vstrechi/ -maxdepth 1 -name "*.sim" | parallel --eta "perl -ne 'print \"\$1\t{}\n\" if /\(([\d\.]+)\)/' {} >> FRAMESIMS"
sort -n FRAMESIMS | head -10

Which yields:

0.0469903 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000742.jpg.sim
0.0641648 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000741.jpg.sim
0.0659819 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000743.jpg.sim
0.0660478 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000766.jpg.sim
0.0665276 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000740.jpg.sim
0.0665374 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000760.jpg.sim
0.0685784 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000738.jpg.sim
0.0692427 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000309.jpg.sim
0.0701141 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000739.jpg.sim
0.0701304 ./NTV_20221028_110000_Mesto_vstrechi/NTV_20221028_110000_Mesto_vstrechi-000753.jpg.sim

This time we can see that there are actually two clusters of frames: a tight cluster spanning frames 000738-000766 and a lone match at frame 000309:
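One way to surface such clusters automatically is to sort the matched frame numbers and break them into runs wherever the gap between consecutive frames exceeds some threshold. A minimal sketch using the frame numbers from the results above (the 15-frame, i.e. 60-second, gap threshold is an arbitrary choice):

```shell
# Group top-matching frame numbers into clusters: sort them, then start a
# new cluster whenever consecutive frames are more than 15 frames (60s) apart.
CLUSTERS=$(printf '%s\n' 742 741 743 766 740 760 738 309 739 753 | sort -n |
  awk 'NR==1 {start=$1; prev=$1; next}
       $1 - prev > 15 {print start "-" prev; start=$1}
       {prev=$1}
       END {print start "-" prev}')
echo "$CLUSTERS"   # prints the two clusters: 309-309 and 738-766
```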

Putting this all together, we've shown how we can start with a video from Twitter and scan a 2-hour television broadcast in just half a minute to find appearances of it.

Even more powerfully, it would be trivial to scale this process up to scan an entire day, week, or month of coverage across many channels to find all appearances of a given clip, meaning this approach can serve as a generic template for identifying where video clips were sourced from.
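Scaling up could be as simple as repeating the compare step in each broadcast's preview directory and reporting the best (lowest-MSE) match per broadcast. A hypothetical sketch, assuming one directory per broadcast, each already holding its own FRAMESIMS score file:

```shell
# Report the single most similar frame (lowest MSE score) in one
# broadcast directory's FRAMESIMS file.
best_match() {
  sort -n "$1/FRAMESIMS" | head -n 1
}

# Sweep every scored broadcast directory (hypothetical layout: directory
# names follow the CHANNEL_DATE_TIME_SHOW pattern used above).
for DIR in ./*_*_*/; do
  [ -f "${DIR}FRAMESIMS" ] && printf '%s\t%s\n' "$DIR" "$(best_match "$DIR")"
done
```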

Rather than the brute-force similarity scanning performed here, a production application could compute perceptual hashes or image embeddings for each frame and load the results into an ANN (approximate nearest neighbor) database. That would permit truly realtime scanning: simply upload any clip and instantly receive back a list of all of the places the clip, or pieces of it, were found across the TV News Archive, even if the clip was actually assembled from multiple excerpts glued together.

For misinformation researchers, the template outlined here has enormous potential for allowing arbitrary video clips from television news to be authenticated and connected back to their surrounding contexts, restoring provenance and identifying any modifications that have been made.