Natural Language Visual Search Of Television News Using OpenAI's CLIP

What might it look like to visually search television news using not a pre-defined list of categories, but arbitrary natural language searches? Last year OpenAI released the CLIP system, which is an open architecture and model that was trained on images and their associated captions, learning to associate words and phrases with their underlying visual representations. Applying CLIP to two sample Russian television news broadcasts, we find that we can enter textual searches in natural language and receive back a list of clips in the broadcast that match our query. While the results are not perfect, they demonstrate the incredible potential for fundamentally new kinds of search and discovery over vast television news archives that leverage the rapid innovations and advances in open visual AI systems.

To explore how systems like CLIP might enable arbitrary visual search of television news, we use a convenient tool called "rclip" that wraps CLIP in a simple command line interface, making it trivial to explore on any image collection.

First we'll install rclip following its installation instructions. Simply browse to its releases page, download the latest release's ".AppImage" file, rename it to "rclip" and make it executable:

wget https://github.com/yurijmikhalevich/rclip/releases/download/v1.2.4/rclip-v1.2.4-x86_64.AppImage
mv ./rclip-v1.2.4-x86_64.AppImage ./rclip
chmod 755 rclip

We also install ImageMagick so we can create thumbnail grid summaries of the results, along with GNU parallel, which the thumbnailing step below uses:

apt-get -y install imagemagick parallel

To explore this idea of visual television news search, we'll use today's 11:30AM-2PM MSK episode of Russia 1's "60 Minutes".

Remember that the Visual Explorer samples each broadcast by extracting one frame every 4 seconds to construct its thumbnail grid. The full-resolution versions of the images underlying that grid are available for download as a ZIP file, with more than 1.9 billion images now available.
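Because frames are sampled at a fixed 4-second interval, a frame's number can be turned back into an approximate position in the broadcast. The sketch below is illustrative only: "frame_offset" is a hypothetical helper, and it assumes that file names end in a zero-padded frame number after the last hyphen and that frame 000001 sits at offset 0, neither of which is confirmed here.

```shell
# Hypothetical helper: print the approximate HH:MM:SS offset of a frame file.
# Assumes one frame every 4 seconds and that frame 000001 is at offset 0.
frame_offset() {
  fo_name=${1##*/}          # drop any directory prefix
  fo_name=${fo_name%.jpg}   # drop the extension
  fo_num=${fo_name##*-}     # keep the digits after the last hyphen
  # Strip leading zeros so the shell does not read the number as octal.
  fo_num=$(printf '%s' "$fo_num" | sed 's/^0*//')
  fo_secs=$(( (${fo_num:-1} - 1) * 4 ))
  printf '%02d:%02d:%02d\n' \
    $(( fo_secs / 3600 )) $(( fo_secs % 3600 / 60 )) $(( fo_secs % 60 ))
}

frame_offset RUSSIA1_20221024_083000_60_minut-000737.jpg   # prints 00:49:04
```

Under those assumptions, frame 737 falls a little over 49 minutes into the broadcast.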

We'll use this representation of the broadcast to analyze it. We'll start by downloading the broadcast's ZIP file from the Visual Explorer, unpacking it and making a copy of rclip in the directory:

mkdir RUSSIA1_20221024_083000_60_minut
cd RUSSIA1_20221024_083000_60_minut
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1_20221024_083000_60_minut.zip
unzip RUSSIA1_20221024_083000_60_minut.zip
cp ../rclip .
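
As a quick sanity check on the unpacked archive, the number of images can be compared with what the 4-second sampling rate would predict. This is a rough sketch: the 150-minute length is an assumption taken from the 11:30AM-2PM listing, and the real count may differ somewhat.

```shell
# Assumed broadcast length, taken from the 11:30AM-2PM listing: 150 minutes.
minutes=150
# One frame every 4 seconds.
expected=$(( minutes * 60 / 4 ))
# Count the JPEG frames unpacked into the current directory.
actual=$(find . -maxdepth 1 -name '*.jpg' | wc -l | tr -d ' ')
echo "expected about $expected frames, found $actual"
```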

That's it! Now it's time to visually search the collection in plain English!

Given the attention on nuclear issues, let's search for "nuclear". By default, rclip returns only the first 5 results, so we'll ask for the top 100:

time ./rclip -t 100 "nuclear"

Rclip outputs the top 100 matches, listing for each image its score and full path. The first several rows are shown here:

score filepath
0.284 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000737.jpg"
0.281 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000800.jpg"
0.279 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000798.jpg"
0.278 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000212.jpg"
0.270 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000211.jpg"
0.265 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001425.jpg"
0.263 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001260.jpg"
0.262 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001685.jpg"
0.261 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000025.jpg"
0.259 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001678.jpg"
0.258 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000635.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001939.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000538.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001761.jpg"
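
Since each row carries a score, the listing can be trimmed to the stronger matches before building a collage. The following is a minimal sketch, assuming the two-column "score filepath" layout shown above with quoted paths; "filter_by_score" is a hypothetical helper, and any cutoff is arbitrary, since the scores are only meaningful relative to one another within a query.

```shell
# Hypothetical helper: keep only matches scoring at or above a cutoff.
# Reads rclip's default "score filepath" listing on standard input.
filter_by_score() {
  awk -v min="$1" '
    NR == 1 && $1 == "score" { next }     # skip the header row
    $1 + 0 >= min + 0 {
      path = $0
      sub(/^[^ \t]+[ \t]+/, "", path)     # drop the score column
      gsub(/^"|"$/, "", path)             # drop the surrounding quotes
      print path
    }'
}
```

It could be used as ./rclip -t 100 "nuclear" | filter_by_score 0.27 to keep only the rows scoring 0.27 or higher.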

Rclip can also output just the filenames without their scores using its "-f" option. We can combine that output with ImageMagick to trivially generate a collage of the results:

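# Clear any results left over from a previous search
rm -f MATCHES
# Save the paths of the top 100 matches, one per line
time ./rclip -f -t 100 "nuclear" > MATCHES; wc -l MATCHES
# Make a 250-pixel-wide thumbnail next to each matched image
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
# List the thumbnail filenames in the same order as the matches
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
# Tile the thumbnails six across into a single collage
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
# Remove the temporary thumbnails
rm *.thumb.jpg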

You can see the results below, in order of their scores, with the strongest match at top left. The majority of the matches feature either the nuclear warning symbol or footage of what appear to be nuclear explosions.
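
Since the same search-and-collage sequence is repeated for every query, it can be wrapped in a small shell function. This is only a sketch under stated assumptions: "search_montage" and the "RCLIP" variable are hypothetical names, and it uses a plain loop rather than GNU parallel, so thumbnailing runs serially.

```shell
# Hypothetical wrapper: search_montage QUERY OUTPUT [TOP]
# RCLIP is an assumed variable naming the rclip binary to run.
RCLIP=${RCLIP:-./rclip}

search_montage() {
  sm_query=$1
  sm_out=$2
  sm_top=${3:-100}
  sm_list=$(mktemp) || return 1
  # "rclip -f" prints one matching file path per line, strongest match first.
  "$RCLIP" -f -t "$sm_top" "$sm_query" |
  while IFS= read -r sm_path; do
    [ -n "$sm_path" ] || continue
    # Write a 250-pixel-wide thumbnail next to the image and record its name.
    convert "$sm_path" -resize 250x "$sm_path.thumb.jpg" &&
      printf '%s\n' "$sm_path.thumb.jpg"
  done > "$sm_list"
  # Tile the thumbnails six across, in score order.
  montage @"$sm_list" -geometry +0+0 -tile 6x -background black "$sm_out"
  sm_status=$?
  # Remove only the thumbnails this call created, then the list itself.
  while IFS= read -r sm_thumb; do rm -f "$sm_thumb"; done < "$sm_list"
  rm -f "$sm_list"
  return "$sm_status"
}
```

With this in place, a search becomes a single call such as search_montage "nuclear" ../results-nuclear.jpg, and giving each query its own output file keeps earlier collages from being overwritten.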

What about a "television studio"?

rm MATCHES
time ./rclip -f -t 100 "television studio" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

While there are a few false positives, the imagery captures both the main circular set and at least a few more traditional news anchor desk scenes.

Or a "drone"?

rm MATCHES
time ./rclip -f -t 100 "drone" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

The results are a mixture of images of drones, images of drone controllers and imagery from drones, though there are also a number of false positives.

Searching for "CNN logo" returns a number of CNN results, but also a number of false positives:

rm MATCHES
time ./rclip -f -t 100 "cnn logo" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

What about a more abstract concept like "two people talking"?

rm MATCHES
time ./rclip -f -t 100 "two people talking" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

Or a "crowd"?

rm MATCHES
time ./rclip -f -t 100 "crowd" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

The host Olga Skabeyeva wears a distinctive blue sports jacket that sets her apart from the other presenters. Could we search on that sports jacket to catalog her appearances?

rm MATCHES
time ./rclip -f -t 100 "blue suit" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
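
To turn a list of matching frames into a catalog of appearances, neighboring frames can be merged into runs, since frames are only 4 seconds apart. The sketch below assumes file names end in a zero-padded frame number after the last hyphen; "group_runs" is a hypothetical helper, and the gap of 2 frames is an arbitrary choice. The sample input reuses frame numbers from the "nuclear" listing purely for illustration.

```shell
# Hypothetical helper: collapse matched frames into runs of nearby frames.
# $1 is the largest frame-number gap still treated as one run (default 2).
group_runs() {
  # Keep only the frame number from each path, without leading zeros.
  sed -n 's/.*-0*\([0-9][0-9]*\)\.jpg"*$/\1/p' |
  sort -n |
  awk -v gap="${1:-2}" '
    NR == 1 { start = prev = $1; next }
    $1 - prev > gap { print start "-" prev; start = $1 }   # close the run
    { prev = $1 }
    END { if (NR > 0) print start "-" prev }'
}

printf '%s\n' x-000737.jpg x-000800.jpg x-000798.jpg x-000212.jpg x-000211.jpg |
  group_runs 2
# prints:
# 211-212
# 737-737
# 798-800
```

Running it over a saved match list, as in group_runs 2 < MATCHES, would give one line per run of first and last frame numbers.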

What about "China"?

rm MATCHES
time ./rclip -f -t 100 "china" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

This appears to pick up onscreen textual mentions of China in addition to imagery of its leader and flag.

If we instead search for "Chinese flag" we get more tightly focused results:

rm MATCHES
time ./rclip -f -t 100 "chinese flag" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
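
One way to see how much a narrower query overlaps with a broader one is to save each query's matches to its own file and intersect the two lists. This is a minimal sketch; "common_matches" and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: list the files that appear in both result lists.
common_matches() {
  cm_a=$(mktemp) && cm_b=$(mktemp) || return 1
  # comm expects sorted input, so sort (and de-duplicate) each list first.
  sort -u "$1" > "$cm_a"
  sort -u "$2" > "$cm_b"
  comm -12 "$cm_a" "$cm_b"     # keep only lines present in both inputs
  rm -f "$cm_a" "$cm_b"
}
```

For example, after saving the two searches as MATCHES.china and MATCHES.flag, common_matches MATCHES.china MATCHES.flag | wc -l would count the frames returned by both.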

What if we try a different broadcast, in this case a 1TV broadcast from September 28, 2022 at 12:15PM? This broadcast features a number of maps:

rm MATCHES
time ./rclip -f -t 100 "map" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

What about a more complicated query like "soldier in front of Russian flag"?

rm MATCHES
time ./rclip -f -t 100 "soldier in front of russian flag" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

What about "Tucker Carlson"? The results appear to largely reflect onscreen textual mentions of his name, though the images that show his face rank among the stronger matches.

rm MATCHES
time ./rclip -f -t 100 "tucker carlson" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg

In the end, while the results are not perfect, they demonstrate the incredible potential for fundamentally new kinds of search and discovery over vast television news archives that leverage the rapid innovations and advances in open visual AI systems.