Visual Explorer: Clustering An Entire Russian TV News Broadcast By Frame-Level Visual Similarity

Kalev Leetaru

2 years ago

In collaboration with the Internet Archive, the Visual Explorer extracts one frame every 4 seconds from each broadcast to create a "visual ngram" that non-consumptively captures the core visual narratives of the broadcast. What if we took all of those images for a given Russian TV news broadcast and pairwise compared each image to every other image in that broadcast based on pixel-level visual similarity (using a perceptual hash)? The end result would allow us to not only identify contiguous sequences (marking "shot changes"), but, most importantly, to identify repeated content that makes an appearance multiple times throughout a broadcast, ranging from a clip that is aired multiple times at different points in the broadcast to repeated advertisements.

Let's apply this analysis to an episode of Russia 1's 60 Minutes from this past Monday.

First we'll download the every-four-seconds preview images from that broadcast from the Visual Explorer:

wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1_20230313_083000_60_minut.zip
unzip RUSSIA1_20230313_083000_60_minut.zip
find RUSSIA1_20230313_083000_60_minut -depth -name "*.jpg" | wc -l

This broadcast contains 2,265 frames. Clustering the entire broadcast is as simple as the following two lines using an off-the-shelf tool called "findimagedupes":

apt-get -y install findimagedupes 
time find ./RUSSIA1_20230313_083000_60_minut/ -depth -name "*.jpg" -print0 | findimagedupes -f ./FINGERPRINTS-RUSSIA1_20230313_083000_60_minut.db -q -q -t 95% -0 -- - > MATCHES.RUSSIA1_20230313_083000_60_minut

Amazingly, despite computing perceptual hashes for 2,265 images and pairwise comparing every image to every other image, the entire process from start to finish takes just 21 seconds on a 64-core VM. Note that findimagedupes can save its fingerprints to an on-disk database, which we enable here with the -f option. This means that in the future we can hand it a new image or set of images and have it compare them against this broadcast without recomputing all of the fingerprints. In practice this yields comparisons of new images in 0.1 seconds or less.

Here we look for 95% similarity, which is an extremely high threshold that captures only near-duplicates, but this could be reduced to capture more images that are "similar" but not identical.
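
To make the idea of a similarity threshold concrete, here is a minimal sketch of one simple kind of perceptual hash, the "average hash", together with a bit-agreement similarity score. This is an illustration only: findimagedupes computes its own fingerprints internally and its algorithm is not reproduced here, the function names are hypothetical, and treating the threshold as "at least 95% of fingerprint bits agree" is an assumption. The frames are represented as small grayscale grids; in practice an image library would downscale each JPEG to such a grid first.

```python
def average_hash(gray):
    """Return one bit per pixel: 1 if the pixel is brighter than the grid mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def similarity(bits_a, bits_b):
    """Fraction of positions on which two equal-length fingerprints agree."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    same = sum(1 for a, b in zip(bits_a, bits_b) if a == b)
    return same / len(bits_a)


def is_near_duplicate(gray_a, gray_b, threshold=0.95):
    """True if the two grids' fingerprints agree on at least `threshold` of bits."""
    return similarity(average_hash(gray_a), average_hash(gray_b)) >= threshold


# Toy 4x4 grayscale grids (hypothetical data, not taken from the broadcast).
frame_a = [[10, 10, 200, 200]] * 4
frame_b = [[12, 9, 198, 205],      # same layout as frame_a with slight noise
           [11, 10, 201, 199],
           [10, 12, 200, 202],
           [9, 11, 199, 200]]
frame_c = [[200, 200, 10, 10]] * 4  # mirrored layout

print(is_near_duplicate(frame_a, frame_b))  # noisy copy of the same layout
print(is_near_duplicate(frame_a, frame_c))  # different layout
```

Lowering the threshold in this sketch admits pairs that differ in more bits, which mirrors the trade-off described above between near-duplicates and merely "similar" images.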

The end result is a text file called "MATCHES.RUSSIA1_20230313_083000_60_minut" in which each line is a group of image filenames separated by spaces that were all judged to be above the similarity threshold based on the tool's perceptual hash representation of each.
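
As a rough illustration of that file format, the following sketch parses one line of the matches file into (show, frame number) pairs. It assumes the frame filenames follow the pattern SHOW-NNNNNN.jpg and contain no spaces; the function name and the example paths are hypothetical.

```python
import re

# Assumed filename layout: <any path>/<show id>-<zero-padded frame number>.jpg
FRAME_RE = re.compile(r"([^/]+)-(\d+)\.jpg$")


def parse_group(line):
    """Return a sorted list of (show, frame_number) tuples for one match group."""
    group = []
    for path in line.split():
        match = FRAME_RE.search(path)
        if match:  # silently skip tokens that do not look like frame files
            group.append((match.group(1), int(match.group(2))))
    return sorted(group)


example = ("./frames/RUSSIA1_20230313_083000_60_minut-001585.jpg "
           "./frames/RUSSIA1_20230313_083000_60_minut-000637.jpg\n")
for show, frame in parse_group(example):
    print(show, frame)
```

Converting the frame numbers to integers here makes later arithmetic, such as measuring the distance between two matching frames, straightforward.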

We will then process the matches file using a simple Perl script to separate the groups into contiguous sequences of frames (i.e., a single camera shot) versus groups where images have gaps between them, suggesting a clip, camera angle or advertisement being returned to.

For example, here is an advertisement that aired twice, at 12:46PM and 1:02PM, and that the workflow clustered together:

Clip 1 (12:46PM)
Clip 2 (1:02PM)

And another ad that aired three times:

At the same time, even a 95% similarity threshold can yield false positives, like these two images with surprisingly similar color schemes, saturation and layouts. A closer look, though, suggests that a saturation filter was applied to both, which may mean there is more to this similarity than pure coincidence:

Sequences with large gaps are strongly suggestive of advertisements, making this a potential workflow for identifying key advertising trends on Russian television news and how the ad economy is adjusting in the face of global sanctions. The ability to identify strikingly similar images and palettes suggests that other forms of analysis like histogram-based palette analysis, filter detection, texture and structural analyses would all offer powerful insights into the visual narratives of Russian television news.

Perhaps most critically, the efficiency of findimagedupes and its ability to pairwise compare an entire broadcast in just 21 seconds on a 64-core CPU-only system resets expectations around the tractability of scaling such analyses upwards to larger and larger volumes of content.

For those interested in exploring further, the following Perl script processes the above MATCHES file:

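#!/usr/bin/perl
# Splits findimagedupes match groups into sequential runs, gapped groups and cross-show groups.
use strict;
use warnings;

my $in = $ARGV[0] or die "usage: $0 MATCHES_FILE\n";
open(my $FILE, '<', $in) or die "cannot open $in: $!";
open(my $OUTS, '>', "$in.seq") or die "cannot write $in.seq: $!";
open(my $OUTG, '>', "$in.gap") or die "cannot write $in.gap: $!";
open(my $OUTC, '>', "$in.crossshow") or die "cannot write $in.crossshow: $!";
while (my $line = <$FILE>) {
    # One group per line; sorting the zero-padded filenames puts frames in ascending order.
    my @arr = sort grep { length } split(/\s+/, $line);

    my ($lastshow, $lastframe, $hasgap, $crossshow) = ('', 0, 0, 0);
    for (my $i = 0; $i < @arr; $i++) {
        # Filenames look like <path>/<show>-<frame>.jpg
        my ($show, $frame) = $arr[$i] =~ /.*\/(.*?)-(\d+)\.jpg/;
        $arr[$i] = $frame;
        # A gap is a change of show or a jump of more than one frame.
        if ($i > 0 && ($show ne $lastshow || $frame > ($lastframe + 1))) { $hasgap = 1; }
        if ($i > 0 && $show ne $lastshow) { $crossshow = 1; }
        $lastshow = $show; $lastframe = $frame;
    }

    if ($crossshow) { print $OUTC "@arr\n"; next; }
    # For gapped groups, the first column is the last frame number minus the first.
    if ($hasgap) { my $gap = $arr[-1] - $arr[0]; print $OUTG "$gap\t@arr\n"; } else { print $OUTS "@arr\n"; }
}
close($FILE);
close($OUTS);
close($OUTG);
close($OUTC);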

This generates three files: one containing any cross-show matches (".crossshow", not relevant here), one containing sequential matches (".seq", a sequence of images) and one containing matches with gaps between them (".gap").
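
The script labels a whole group as "gapped" as soon as a single frame is missing, so a group may still consist of a few contiguous runs. A possible follow-up step, sketched here under the assumption that the frame numbers have already been parsed into integers, is to split a group into its runs of consecutive frames so that each separate appearance of a clip can be examined on its own. The function name is hypothetical; the example frame numbers come from one of the groups in the gap listing.

```python
def split_runs(frames):
    """Group frame numbers into runs of consecutive frames, in ascending order."""
    runs = []
    for frame in sorted(frames):
        if runs and frame == runs[-1][-1] + 1:
            runs[-1].append(frame)   # extends the current run
        else:
            runs.append([frame])     # starts a new run
    return runs


group = [522, 523, 524, 525, 532, 533, 534, 535, 536]
print(split_runs(group))
# → [[522, 523, 524, 525], [532, 533, 534, 535, 536]]
```

Here the group resolves into two runs separated by six missing frames, which looks more like a shot being briefly interrupted than a clip being aired again much later.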

Matches with gaps between them, with the first column giving the span in frames between the first and last image of the group (as computed by the script, the last frame number minus the first):

4 001654 001656 001657 001658
20 001602 001622
1049 001176 001384 001415 002009 002194 002225
818 001397 001994 002215
2 000238 000240
4 000590 000591 000593 000594
948 000637 001585
2 002071 002073
14 000522 000523 000524 000525 000532 000533 000534 000535 000536
6 000816 000819 000822
5 000057 000062
30 001204 001205 001234
4 002143 002147
6 000871 000877
814 001389 002203
10 001467 001477
7 000665 000672
5 001630 001635
3 000508 000511
2 000367 000369
12 001469 001474 001475 001481
2036 000020 002056
707 001041 001748
1043 001159 002202
33 001051 001084
818 001368 002006 002186
597 001398 001995
34 001268 001302
1043 001161 002204
18 000423 000441
234 001153 001387
5 000614 000618 000619
917 000197 001049 001052 001054 001083 001108 001109 001110 001112 001113 001114
638 001369 002007
7 001218 001220 001222 001225
29 000417 000418 000420 000424 000427 000429 000430 000432 000443 000446
223 001191 001414
7 000058 000063 000064 000065
12 002114 002121 002122 002123 002124 002125 002126
3 002117 002120
2 001088 001090
5 000277 000282
12 002096 002108
6 001470 001476
6 001242 001243 001248
179 002008 002187
26 000993 001019
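
Because the Visual Explorer samples one frame every 4 seconds, the first column can be turned into an approximate duration. The following sketch does that arithmetic for a few of the values listed above; the function names are hypothetical, and the result is only as precise as the 4-second sampling interval.

```python
SECONDS_PER_FRAME = 4  # one preview frame every 4 seconds


def gap_to_seconds(gap_frames):
    """Approximate elapsed time, in seconds, spanned by a gap given in frames."""
    return gap_frames * SECONDS_PER_FRAME


def format_duration(seconds):
    """Render a number of seconds as e.g. '63m12s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


for gap in (4, 948, 2036):
    print(gap, format_duration(gap_to_seconds(gap)))
# → 4 0m16s
# → 948 63m12s
# → 2036 135m44s
```

A span of a few frames amounts to seconds within a single segment, while spans of hundreds or thousands of frames correspond to content that reappears many minutes or even hours later in the broadcast.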

Similarly, sequential matches:

002069 002070
000352 000353 000354
002088 002089 002090
000318 000319
000361 000362
000575 000576 000577 000578
001765 001766 001767 001768 001769
000339 000340 000341
000364 000365
000358 000359 000360
000156 000157 000158
000615 000616
000251 000252
001251 001252
000235 000236 000237
001853 001854 001855 001856 001857 001858 001859 001860 001861 001862 001863 001864 001865 001866 001867 001868
001439 001440
001697 001698
000596 000597
000331 000332
001972 001973
000580 000581
001487 001488
001548 001549
001649 001650
000173 000174
000435 000436 000437 000438 000439
000586 000587 000588
001551 001552 001553 001554 001555 001556
001560 001561 001562
000186 000187
000283 000284
000259 000260 000261 000262
001669 001670 001671 001672
002132 002133
001525 001526
000406 000407 000408 000409 000410
000911 000912
000570 000571 000572
000882 000883
000854 000855 000856 000857 000858 000859 000860 000861 000862 000863 000864 000865 000866 000867 000868
001181 001182
000373 000374
001212 001213
001617 001618
001826 001827 001828 001829 001830 001831 001832 001833 001834 001835 001836
001199 001200 001201
001528 001529
000474 000475
000089 000090
000083 000084
000337 000338
000350 000351
001673 001674
000219 000220 000221 000222 000223 000224 000225 000226
000888 000889

We hope this inspires you to explore new kinds of visual analyses on this collection!

This analysis is part of an ongoing collaboration between the Internet Archive and its TV News Archive, the multi-party Media-Data Research Consortium and GDELT.