Adding Confidence Scores To Tracking A Year Of Tucker Carlson On Russia 1's "60 Minutes"

Last month, in collaboration with the Internet Archive's TV News Archive, we demonstrated scanning a year of Russia 1's "60 Minutes" for all appearances of Tucker Carlson. Let's repeat that analysis with a more advanced tool that also reports a distance score between each extracted face and the source face, allowing us to post-filter the results to remove false positives, identify the strongest matches, etc. Here we analyze all 342 episodes of 60 Minutes archived by the Archive from May 19, 2022 to March 17, 2023, totaling 757,559 images representing 50,504 minutes of airtime. Of those images, 237,551 (31.4%) were identified as containing no human faces, meaning more than two-thirds of 60 Minutes' airtime over the past year contained at least one human face, a reminder of the importance of faces to television news. In total, 760,887 unknown faces and 1,921 faces matching Tucker Carlson were identified.
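
As a quick sanity check, the headline figures above can be reproduced with a few lines of shell arithmetic. This is only a sketch: the 4-second frame interval is an assumption taken from the clip script later in this post, and the function name is ours.

```shell
#!/bin/sh
# Figures quoted in the text above.
frames=757559       # total images analyzed
no_face=237551      # images in which no face was detected
secs_per_frame=4    # assumption: one frame per 4 seconds of airtime

# Print total airtime in minutes and the share of frames with and without faces.
airtime_summary() {
    awk -v f="$frames" -v n="$no_face" -v s="$secs_per_frame" 'BEGIN {
        printf "minutes=%.0f\n", f * s / 60
        printf "no_face_pct=%.1f\n", 100 * n / f
        printf "face_pct=%.1f\n", 100 * (f - n) / f
    }'
}

airtime_summary
```

This prints 50504 minutes, 31.4% of frames without a face and 68.6% with at least one.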

First, we'll install a few helper apps we'll need:

apt-get -y install parallel
apt-get -y install jq

Now we'll download the inventory file of all of the Russia 1 shows archived by the Archive and compile just the 60 Minutes episodes (NOTE that since the Archive's EPG data for Russia 1 does not begin until May 2022, we'll miss the first two months of broadcasts):

start=20220101; end=20230318; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
mkdir JSON
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1.{}.inventory.json -P ./JSON/'
find ./JSON/ -depth -name '*.json' | parallel --eta 'jq -r ".shows[].id" {}' > IDS.tmp
grep '60_minut' IDS.tmp | sort > IDS
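
Before downloading anything, it can be useful to confirm that the episode list covers the expected date range. The sketch below counts IDs per calendar month. It assumes each ID has a YYYYMMDD date as its second underscore-separated field (a pattern inferred from the regular expressions in the clip script later in this post), and the function name is hypothetical.

```shell
#!/bin/sh
# Count show IDs per calendar month. Reads one ID per line on stdin.
# Assumption: the second "_"-separated field of each ID is a YYYYMMDD date.
episodes_per_month() {
    awk -F_ '
        $2 ~ /^[0-9]+$/ && length($2) == 8 { n[substr($2, 1, 6)]++ }
        END { for (m in n) print m, n[m] }' | LC_ALL=C sort
}

# Usage: episodes_per_month < IDS
```

Each output line is a YYYYMM month followed by its episode count, so gaps such as the missing early months are easy to spot.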

Now we'll download all of the video ngram files for these broadcasts and unpack them (given the number of small files this is best done in a RAM disk on a large-memory VM):

mkdir IMAGES
time cat IDS | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/{}.zip -P ./IMAGES/'
time find ./IMAGES/ -depth -name '*.zip' | parallel --eta 'unzip -n -q -d ./IMAGES/ {} && rm {}'
time find ./IMAGES/ -depth -name "*.jpg" | wc -l

Now we'll install a package called "face_recognition" along with its build dependencies:

apt-get -y install build-essential
apt-get -y install cmake
pip3 install face_recognition

Create a subdirectory called "KNOWN" that contains one image of each of the faces you wish to search for, with the filename you want to appear as the recognized face. Here we'll download our Tucker Carlson face:

mkdir KNOWN
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/2022-tve-facedetect-scan-tuckerface.png 
mv 2022-tve-facedetect-scan-tuckerface.png KNOWN/TuckerCarlson.png

You can have multiple faces in the KNOWN directory, allowing you to search for multiple faces at once in a single pass.

The off-the-shelf CLI tool called "face_recognition" requires a single flat directory of images, rather than nested subdirectories, so rather than modify the script, we'll simply move all of the images to a single directory:

mkdir FRAMES
time find ./IMAGES/ -depth -type d -name 'RUSS*' | parallel --eta 'mv {}/*.jpg ./FRAMES'

Then, running advanced facial recognition over the entire archive requires just a single command:

time face_recognition --cpus -1 --show-distance true --tolerance 0.52 ./KNOWN/ ./FRAMES/ > RESULTS.ALL.TXT&

We used a 64-core CPU-only VM with 400GB of RAM, allowing a 200GB RAM disk to maximize IO. Here we set the "cpus" parameter to "-1" to use all 64 cores.

Based on several small-scale initial tests, we found that matches with distance scores (how "dissimilar" the face is from the example face, where a lower score means they are more similar) above 0.52 were exclusively false positives, so we set tolerance to 0.52 (only faces with a distance equal to or lower than this are kept) and use "show-distance" to record all distances. In all, this takes just over 4 hours to complete.
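
One way to explore a threshold like this on a small test run is to bucket the reported distances and look at where matches cluster. The following is a minimal sketch rather than the procedure we actually used. It assumes the three-column comma-separated output that "show-distance" produces, and the function name is hypothetical.

```shell
#!/bin/sh
# Bucket the distance column (third field) into bins of width 0.01
# for a single label. Reads the results file on stdin.
distance_histogram() {
    awk -F, -v name="$1" '
        NF == 3 && $2 == name && $3 ~ /^[0-9.]+$/ {
            # The small epsilon guards against floating-point values
            # such as 28.999999 when a distance sits exactly on a bin edge.
            bin = int($3 * 100 + 0.000001)
            count[sprintf("%.2f", bin / 100)]++
        }
        END { for (b in count) print b, count[b] }' | LC_ALL=C sort -n
}

# Usage: distance_histogram TuckerCarlson < RESULTS.ALL.TXT
```

Each output line is the lower edge of a bin followed by the number of faces in it, which makes it easy to see how many matches sit just under a candidate cutoff.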

This produces a massive CSV file of roughly 1 million lines with 3 columns (filename, identified face, distance of the face from the example face). A frame in which no faces were found appears on a single line with "no_persons_found". Each detected face is listed on its own line with either "unknown_person" or "TuckerCarlson", so a frame containing four faces appears four times in the file. For faces identified as matching Tucker Carlson, the last column contains the distance score of the face from the Tucker Carlson example. Note that for unknown reasons (likely due to a multiprocess issue), a handful of lines are corrupt, with two lines concatenated together.
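
A quick tally of the label column is a handy way to check a results file of this shape and to see how many lines are corrupt. The sketch below assumes a well-formed line has exactly three comma-separated fields and that filenames contain no commas. The function name is hypothetical.

```shell
#!/bin/sh
# Tally the labels in the second column and count malformed lines.
# Reads the results file on stdin.
summarize_results() {
    awk -F, '
        NF != 3 { bad++; next }     # e.g. two records fused onto one line
        { label[$2]++ }
        END {
            for (l in label) print l, label[l]
            print "malformed", bad + 0
        }' | LC_ALL=C sort
}

# Usage: summarize_results < RESULTS.ALL.TXT
```

The counts for "no_persons_found", "unknown_person" and "TuckerCarlson" should roughly match the totals quoted at the top of this post, with the "malformed" line showing how many records were lost to concatenation.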

You can download the complete file:

Here we extract and sort just the Tucker Carlson matches:

grep Tucker RESULTS.ALL.TXT | sort > 60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT

This yields the final list of Tucker Carlson matches:

The more sophisticated facial extraction and recognition workflow used here reduces the false positive rate of our earlier example. Most importantly, the addition of the distance scores in the third column makes it possible now to post-filter the results to remove any final false positives by simply filtering the CSV file by a lower threshold, rather than reprocessing the entire image archive.
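
Post-filtering of this kind needs nothing more than awk. As a minimal sketch, assuming the three-column layout described above (the function name and the output filename in the usage line are hypothetical):

```shell
#!/bin/sh
# Keep only rows for one label whose distance is at or below a cutoff.
# Reads the results file on stdin and writes matching rows to stdout.
filter_matches() {
    awk -F, -v name="$1" -v max="$2" '
        # Adding 0 forces a numeric rather than a string comparison.
        NF == 3 && $2 == name && ($3 + 0) <= (max + 0)'
}

# Usage: filter_matches TuckerCarlson 0.50 < RESULTS.ALL.TXT > FILTERED.TXT
```

Because it skips any line that does not have exactly three fields, this also drops the handful of corrupt concatenated lines.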

We can then process the full list of frame-level matches into distinct "clips" using a simple Perl script:

#!/usr/bin/perl

use POSIX qw(strftime);
use Time::Local qw( timegm );

open(FILE, $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
while(<FILE>) {
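    # Each line is: image filename, matched name, distance from the example face.
    ($image, $name, $dist) = split/,/, $_;
    $dist+=0;
    # Skip matches above the distance cutoff given as the second argument.
    if ($dist > $ARGV[1]) { next; }
    # Split the filename into the show ID and the frame number.
    ($ID, $FRAME) = $image=~/^(.*?)\-(\d+)\.jpg/;
    # Count matching frames per broadcast date for the timeline.
    ($year, $mon, $day) = $ID=~/_(\d\d\d\d)(\d\d)(\d\d)/;
    $TIMELINE{"$mon/$day/$year"}++;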
    if ($ID ne $LASTID || ($FRAME - $LASTFRAME) > 4) {
        ($CHAN) = $ID=~/^(.*?)_\d\d\d\d/;
        ($year, $mon, $day, $hour, $min, $sec) = $ID=~/(\d\d\d\d)(\d\d)(\d\d)_(\d\d)(\d\d)(\d\d)/;
        $timestamp = timegm( $sec, $min, $hour, $day, $mon-1, $year-1900 ) + ( ($FRAME - 1) * 4);

        $date = strftime('%m/%d/%Y %H:%M:%S', gmtime($timestamp));

        $dist = sprintf("%0.2f", $dist);
        print "<li><a href=\"https://api.gdeltproject.org/api/v2/tvv/tvv?id=$ID&play=$timestamp\" TARGET=_BLANK>$CHAN: $date UTC (Frame: $FRAME) (Dist: $dist)</A></li>\n";
        $CLIPS++;
    }
    $LASTID = $ID;
    $LASTFRAME = $FRAME;
    $UNIQ_IDS{$ID}++;
    $TOTFRAMES++;
}
close(FILE);

open(OUT, ">./TIMELINE.TXT");
foreach $key (keys %TIMELINE) { print OUT "$key\t$TIMELINE{$key}\n"; };
close(OUT);

$sec = $TOTFRAMES * 4;
$cnt = scalar(keys %UNIQ_IDS); print "Found $cnt IDs & $CLIPS Clips & $TOTFRAMES Frames = $sec seconds...\n";

Which is run as (retaining our current distance threshold of 0.52):

./makeintoclips.pl ./60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT 0.52
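
The grouping rule at the heart of the script can be checked on its own: a new clip starts whenever the show ID changes or the frame number jumps by more than 4. The awk sketch below mirrors that rule under a few assumptions: the input is sorted, already filtered to the desired name and cutoff, and each filename ends in "-FRAME.jpg". Unlike the Perl regular expression, it splits on the last hyphen in the filename, and the function name is hypothetical.

```shell
#!/bin/sh
# Count clips and frames in sorted "ID-FRAME.jpg,label,distance" rows.
# A new clip starts when the ID changes or the frame gap exceeds max_gap.
count_clips() {
    awk -F, -v max_gap="${1:-4}" '
        {
            file = $1
            sub(/\.jpg$/, "", file)
            cut = 0                          # position of the last "-"
            for (i = length(file); i > 0; i--)
                if (substr(file, i, 1) == "-") { cut = i; break }
            if (cut == 0) next               # not an ID-FRAME filename
            id = substr(file, 1, cut - 1)
            frame = substr(file, cut + 1) + 0
            if (id != last_id || frame - last_frame > max_gap) clips++
            last_id = id; last_frame = frame; frames++
        }
        END { print clips + 0, frames + 0 }'
}

# Usage: count_clips 4 < 60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT
```

It prints the number of clips followed by the number of frames. Multiplying the frame count by 4 gives the approximate seconds of airtime, as in the last line of the Perl script.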

We can see that while this removes some of the false positives from our previous analysis, it still leaves some, like this one, in which one of the angry Biden images in the background is mistaken for Tucker Carlson:

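Looking up that entry in "60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT", we can see that it has a score of 0.494917665824298, which is toward the upper end of what our threshold allows. Let's reduce the cutoff a bit to 0.50 (since the script keeps every face at or below the cutoff, this particular frame would only drop out at a cutoff below 0.4949):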

./makeintoclips.pl ./60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT 0.50

This may remove some genuine matches, but also reduces false positives. Of course, you can use grep to see what matches would be removed at each threshold to fine-tune the cutoff:

./makeintoclips.pl ./60MINUTES-20220326-20230314-face_recog-TUCKERCARLSON.TXT 0.50 | grep -F '(Dist: 0.50)'

Here are the final matches using 0.50 as a cutoff, matching 91 distinct episodes across 185 clips totaling 1,890 frames representing 7,560 seconds of airtime:

That's all there is to it!

 

This analysis is part of an ongoing collaboration between the Internet Archive and its TV News Archive, the multi-party Media-Data Research Consortium and GDELT.