What might it look like to visually search television news using not a pre-defined list of categories, but arbitrary natural language queries? Last year OpenAI released CLIP, an open architecture and model trained on images and their associated captions that learns to associate words and phrases with their underlying visual representations. Applying CLIP to two sample Russian television news broadcasts, we find that we can enter textual searches in natural language and receive back a list of clips in the broadcast that match our query. While the results are not perfect, they demonstrate the incredible potential for fundamentally new kinds of search and discovery over vast television news archives that leverage the rapid innovations and advances in open visual AI systems.
To explore how systems like CLIP might enable arbitrary visual search of television news, we use a convenient tool called "rclip" that wraps CLIP in a simple command line interface, making it trivial to explore on any image collection.
First we'll install rclip following its installation instructions. Simply browse to its releases page, download the latest release's ".AppImage" file, rename it to "rclip" and make it executable:
wget https://github.com/yurijmikhalevich/rclip/releases/download/v1.2.4/rclip-v1.2.4-x86_64.AppImage
mv ./rclip-v1.2.4-x86_64.AppImage ./rclip
chmod 755 rclip
We also install ImageMagick so we can create thumbnail grid summaries of the results:
apt-get -y install imagemagick
To explore this idea of visual television news search, we'll use today's episode of Russia 1's 60 Minutes 11:30AM-2PM MSK.
Recall that the Visual Explorer samples each broadcast by extracting one frame every 4 seconds to construct its thumbnail grid. The full-resolution versions of the images underlying that grid are available for download as a ZIP file, with more than 1.9 billion images now available.
We'll use this representation of the broadcast to analyze it. We'll start by downloading the broadcast's ZIP file from the Visual Explorer, unpacking it and making a copy of rclip in the directory:
mkdir RUSSIA1_20221024_083000_60_minut
cd RUSSIA1_20221024_083000_60_minut
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1_20221024_083000_60_minut.zip
unzip RUSSIA1_20221024_083000_60_minut.zip
cp ../rclip .
That's it! Now it's time to visually search the collection in plain English!
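As a quick sanity check on the download, the number of frames to expect can be estimated from the 4-second sampling rate. This is only a rough sketch: "expected_frames" is a hypothetical helper name, and the real count may differ slightly depending on how the first and last frames of a broadcast are handled.

```shell
# Hypothetical helper: estimate how many frames a broadcast should yield
# when it is sampled at one frame every 4 seconds.
expected_frames() {
  minutes=$1
  echo $(( minutes * 60 / 4 ))
}

# This broadcast runs 11:30AM-2PM MSK, i.e. 150 minutes.
expected_frames 150        # prints 2250

# To compare against what was actually unpacked, count the JPEGs:
# ls *.jpg | wc -l
```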
Given the attention on nuclear issues, let's search for "nuclear". By default, rclip returns only the first 5 results, so we'll ask for the top 100:
time ./rclip -t 100 "nuclear"
Rclip outputs the top 100 matches, listing each image's score and full path. The first 14 are shown here:
score filepath
0.284 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000737.jpg"
0.281 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000800.jpg"
0.279 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000798.jpg"
0.278 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000212.jpg"
0.270 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000211.jpg"
0.265 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001425.jpg"
0.263 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001260.jpg"
0.262 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001685.jpg"
0.261 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000025.jpg"
0.259 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001678.jpg"
0.258 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000635.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001939.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000538.jpg"
0.256 "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-001761.jpg"
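Because the Visual Explorer samples one frame every 4 seconds, the frame number at the end of each filename can be turned into an approximate offset into the broadcast. The sketch below assumes frames are numbered from 000001 and that frame N sits roughly (N-1)*4 seconds into the broadcast; that numbering convention is an assumption, and "frame_offset" is a hypothetical helper name.

```shell
# Hypothetical helper: convert a frame filename into an approximate
# HH:MM:SS offset, ASSUMING frame N is captured about (N-1)*4 seconds in.
frame_offset() {
  # Strip any surrounding double quotes, as in rclip's scored output.
  path=$(printf '%s' "$1" | tr -d '"')
  name=$(basename "$path" .jpg)
  frame=${name##*-}          # digits after the last hyphen, e.g. 000737
  awk -v n="$frame" 'BEGIN {
    s = (n - 1) * 4
    printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
  }'
}

# The top "nuclear" match above:
frame_offset "/RCLIP/RUSSIA1_20221024_083000_60_minut/RUSSIA1_20221024_083000_60_minut-000737.jpg"
# prints 00:49:04
```

The offset is relative to the start of the recording, so it would need to be added to the broadcast's start time to get a wall-clock time.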
Rclip can also output just the filenames without their scores using its "-f" option. We can combine that output with ImageMagick to trivially generate a collage of the results:
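# Clear any previous results, then write the top 100 matching filepaths to MATCHES
rm MATCHES
time ./rclip -f -t 100 "nuclear" > MATCHES; wc -l MATCHES
# Resize each match to a 250-pixel-wide thumbnail
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
# Build the list of thumbnail filenames in score order
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
# Assemble the thumbnails into a 6-column collage, then clean up
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg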
You can see the results below, in order of their scores, with the strongest match at top left. The majority of the matches feature either the nuclear warning symbol or footage of what appear to be nuclear explosions.
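Since adjacent frames are only 4 seconds apart, runs of nearby matching frames can be collapsed into rough clip ranges. The following is a minimal sketch rather than part of the original workflow: "frames_to_clips" is a hypothetical helper, and the gap of 2 frames used to decide when one clip ends and another begins is an arbitrary choice.

```shell
# Hypothetical helper: read rclip filepaths on stdin and print ranges of
# frame numbers, merging frames that are at most $1 frames apart (default 2).
frames_to_clips() {
  sed -n 's/.*-\([0-9][0-9]*\)\.jpg"\{0,1\}$/\1/p' |
    sort -n |
    awk -v gap="${1:-2}" '
      function emit() { if (have) printf "%d-%d\n", start, prev }
      { n = $1 + 0 }
      !have          { start = n; prev = n; have = 1; next }
      n - prev <= gap { prev = n; next }
                     { emit(); start = n; prev = n }
      END            { emit() }'
}

# Using the top five "nuclear" matches from above:
printf '%s\n' \
  /RCLIP/x/RUSSIA1_20221024_083000_60_minut-000737.jpg \
  /RCLIP/x/RUSSIA1_20221024_083000_60_minut-000800.jpg \
  /RCLIP/x/RUSSIA1_20221024_083000_60_minut-000798.jpg \
  /RCLIP/x/RUSSIA1_20221024_083000_60_minut-000212.jpg \
  /RCLIP/x/RUSSIA1_20221024_083000_60_minut-000211.jpg |
  frames_to_clips 2
# prints:
# 211-212
# 737-737
# 798-800
```

In practice the input would come from something like ./rclip -f -t 100 "nuclear" | frames_to_clips 2.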
What about a "television studio"?
rm MATCHES
time ./rclip -f -t 100 "television studio" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
While there are a few false positives, the imagery captures both the main circular set and at least a few more traditional news anchor desk scenes.
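The same handful of commands is repeated for every query, and each run overwrites ../results.jpg. One possible way to tidy this up is a small wrapper that saves each collage under a name derived from the query. This is a sketch under assumptions: "query_slug" and "search_montage" are hypothetical helpers, the frames are assumed to live in the current directory alongside rclip, and a plain loop stands in for GNU parallel.

```shell
# Hypothetical helper: turn a free-text query into a filename-safe slug.
query_slug() {
  printf '%s' "$1" |
    tr '[:upper:]' '[:lower:]' |
    tr -cs 'a-z0-9' '_' |
    sed 's/^_//; s/_$//'
}

# Hypothetical wrapper around the search-and-collage steps used above.
# Runs in a subshell so its variables do not leak into the caller.
search_montage() (
  query=$1
  top=${2:-100}
  out="../results_$(query_slug "$query").jpg"

  ./rclip -f -t "$top" "$query" > MATCHES

  # Create a 250-pixel-wide thumbnail per match and record it in score order.
  while IFS= read -r f; do
    convert "$f" -resize 250x "$f.thumb.jpg"
    printf '%s\n' "$f.thumb.jpg"
  done < MATCHES > MATCHES.THUMBS

  montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black "$out"
  rm -f ./*.thumb.jpg
)

query_slug "Television Studio"     # prints television_studio

# Example call (requires rclip and ImageMagick):
# search_montage "television studio" 100
```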
Or a "drone"?
rm MATCHES
time ./rclip -f -t 100 "drone" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
The results are a mixture of images of drones, images of drone controllers and imagery from drones, though there are also a number of false positives.
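One possible way to trim weaker matches is to apply a score cutoff to rclip's scored output rather than taking a fixed top 100. The sketch below assumes the "score filepath" output format shown earlier; "filter_by_score" is a hypothetical helper, and the cutoff of 0.27 is an arbitrary illustration that has not been evaluated and would likely need tuning per query.

```shell
# Hypothetical helper: keep only rclip result lines whose score is at
# least $1. The header line is skipped because its first field is not numeric.
filter_by_score() {
  awk -v min="$1" '$1 ~ /^[0-9]*\.?[0-9]+$/ && $1 + 0 >= min + 0 { print }'
}

# Sample lines in the format rclip printed for "nuclear" above:
printf '%s\n' \
  'score filepath' \
  '0.284 "/RCLIP/x/RUSSIA1_20221024_083000_60_minut-000737.jpg"' \
  '0.270 "/RCLIP/x/RUSSIA1_20221024_083000_60_minut-000211.jpg"' \
  '0.256 "/RCLIP/x/RUSSIA1_20221024_083000_60_minut-001939.jpg"' |
  filter_by_score 0.27
# prints the 0.284 and 0.270 lines only

# Against a live index this might look like:
# ./rclip -t 100 "drone" | filter_by_score 0.27
```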
Searching for "CNN logo" returns a number of CNN results, but also a number of false positives:
rm MATCHES
time ./rclip -f -t 100 "cnn logo" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
What about a more abstract concept like "two people talking"?
rm MATCHES
time ./rclip -f -t 100 "two people talking" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
Or a "crowd"?
rm MATCHES
time ./rclip -f -t 100 "crowd" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
The host Olga Skabeyeva wears a distinctive blue sports jacket that sets her apart from the other presenters. Could we search on that sports jacket to catalog her appearances?
rm MATCHES
time ./rclip -f -t 100 "blue suit" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
What about "China"?
rm MATCHES
time ./rclip -f -t 100 "china" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
This appears to pick up onscreen textual mentions of China in addition to imagery of its leader and flag.
If we instead search for "Chinese flag" we get more finely focused results.
rm MATCHES
time ./rclip -f -t 100 "chinese flag" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
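To get a rough sense of how much a narrower query overlaps a broader one, the two filepath lists can be intersected. This is a minimal sketch: "common_matches" is a hypothetical helper, and the MATCHES.CHINA and MATCHES.FLAG filenames in the usage comment are illustrative names, not files created earlier.

```shell
# Hypothetical helper: print the filepaths that appear in both result lists.
# comm needs sorted input, so each list is sorted into a temporary file first.
common_matches() (
  a=$(mktemp)
  b=$(mktemp)
  sort "$1" > "$a"
  sort "$2" > "$b"
  comm -12 "$a" "$b"
  rm -f "$a" "$b"
)

# Possible usage, saving each query's matches under its own name:
# ./rclip -f -t 100 "china" > MATCHES.CHINA
# ./rclip -f -t 100 "chinese flag" > MATCHES.FLAG
# common_matches MATCHES.CHINA MATCHES.FLAG | wc -l
```

Note that the output is in sorted filename order, not score order.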
What if we take a different broadcast, in this case a 1TV broadcast from September 28, 2022 at 12:15PM? This broadcast features a number of maps:
rm MATCHES
time ./rclip -f -t 100 "map" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
What about a more complicated query like "soldier in front of Russian flag"?
rm MATCHES
time ./rclip -f -t 100 "soldier in front of russian flag" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
What about "Tucker Carlson"? This appears to be largely recognizing onscreen textual mentions of his name, though it does feature images with his face as stronger matches.
rm MATCHES
time ./rclip -f -t 100 "tucker carlson" > MATCHES; wc -l MATCHES
time cat MATCHES | parallel --eta 'convert {} -resize 250x {}.thumb.jpg'
perl -ne 'print "$1.thumb.jpg\n" if /^(.*?)$/' ./MATCHES > MATCHES.THUMBS
montage @MATCHES.THUMBS -geometry +0+0 -tile 6x -background black ../results.jpg
rm *.thumb.jpg
In the end, while the results are not perfect, they demonstrate the incredible potential for fundamentally new kinds of search and discovery over vast television news archives that leverage the rapid innovations and advances in open visual AI systems.