A Massive New Year-Long Face Detection Dataset Over Belarusian, Russian & Ukrainian TV News: Who Is Telling The Invasion Story?

Today we are excited to announce a massive new dataset applying open source facial detection tools to a quarter-billion seconds of Belarusian, Russian & Ukrainian television news spanning more than a year of invasion coverage, tabulating more than 46 million appearances of human faces in an effort to understand who is telling the story of Russia's invasion of Ukraine. The table below shows the end result of this enormous effort: scanning 56M Visual Explorer preview images (the one-frame-every-4-seconds images that form the Visual Explorer thumbnail grid), representing 224M seconds (3.7M minutes / 62,385 hours / 2,600 days / 7.1 years) of airtime, through open source face detection tools to identify every sufficiently large human face from each channel's start through the end of April 2023. For each face appearance, a 128-dimension embedding vector is generated that allows clustering and search, meaning you can take a photograph of President Biden, for example, and catalog every appearance of him across the seven channels below. The Internet Archive's TV News Archive began monitoring the channels at different times, so the table also lists the first available date of broadcasts analyzed here (Russia Today extends back more than a decade, but only 2022-present is included). Due to technical issues there are gaps in the monitoring of some channels, in a few cases extending to a week or more, but overall this dataset represents a nearly complete picture of who told the invasion story over the past year:

Channel | Start Date | Total Airtime (Sec) | Total VE Images | Images With Faces | % Frames With Faces | Total Faces | Avg Faces Per Image
1TV | 3/26/2022 | 31,900,644 | 7,975,161 | 4,575,136 | 57.37 | 6,588,148 | 1.44
BelarusTV | 5/16/2022 | 22,925,680 | 5,731,420 | 3,133,677 | 54.68 | 4,315,129 | 1.38
Espreso | 4/26/2022 | 31,612,496 | 7,903,124 | 5,833,239 | 73.81 | 9,128,455 | 1.56
NTV | 3/26/2022 | 34,106,072 | 8,526,518 | 4,345,343 | 50.96 | 5,618,571 | 1.29
Russia 1 | 3/26/2022 | 30,929,972 | 7,732,493 | 4,549,309 | 58.83 | 6,474,474 | 1.42
Russia 24 | 4/26/2022 | 30,860,056 | 7,715,014 | 4,480,763 | 58.08 | 6,070,841 | 1.35
Russia Today | 1/1/2022 | 42,252,444 | 10,563,111 | 6,185,045 | 58.55 | 8,421,122 | 1.36
GRAND TOTAL | | 224,587,364 | 56,146,841 | 33,102,512 | 58.96 | 46,616,740 | 1.41

One insight that immediately leaps out from the table above is just how much Espreso differs from the others in the prevalence of human faces: 74% of its airtime over the past year featured at least one human face, compared with just 51% on NTV. Russia 1, Russia Today, Russia 24 and 1TV all hover around 58%, with BelarusTV slightly lower at 55%. This likely reflects Espreso's wartime footing, with its heavy emphasis on informing society, whereas the Russian channels have continued to project normalcy in their programming mix and presentation.

Espreso also leads in the average number of faces per frame. In other words, of the images in which at least one human face is visible, does the typical frame depict a single person speaking to the camera in anchor-style newsreading or field reporting, or does it feature multiple visible faces? The high density of split-screen, panel-style and multi-speaker broadcasts on Espreso is reflected in its average of 1.56 faces per image, compared with 1.29 on NTV.
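Both metrics are simple ratios over the table's raw counts. Here is a quick Python sketch reproducing the Espreso and NTV figures discussed above:

# Raw counts for Espreso and NTV, taken from the table above.
channels = {
    "Espreso": {"images": 7903124, "images_with_faces": 5833239, "faces": 9128455},
    "NTV":     {"images": 8526518, "images_with_faces": 4345343, "faces": 5618571},
}

for name, c in channels.items():
    pct = 100.0 * c["images_with_faces"] / c["images"]  # % of frames with at least one face
    avg = c["faces"] / c["images_with_faces"]           # faces per face-bearing frame
    print(f"{name}: {pct:.2f}% of frames contain a face; {avg:.2f} faces per such frame")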

Searching The Dataset

Searching this massive dataset is simple using the face search script we released last month. First, spin up a VM with sufficient disk or RAM disk to house the embedding JSON files for the channels and time period you are interested in. The complete dataset is around 150GB, so searching all of it requires a significant amount of local storage.

Now, install these two dependencies needed by the search script:

pip3 install argparse
pip3 install numpy

And download the script itself:

wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/search_faceembeddings.py
chmod 755 ./search_faceembeddings.py

Now compile the list of people you want to search for. Here, we'll download official portraits of Biden, Putin and Zelenskyy to search for the three of them:

mkdir KNOWNFACES
wget https://upload.wikimedia.org/wikipedia/commons/9/9c/Volodymyr_Zelensky_Official_portrait.jpg
mv Volodymyr_Zelensky_Official_portrait.jpg KNOWNFACES/VolodymyrZelenskyy.jpg
wget https://www.whitehouse.gov/wp-content/uploads/2021/04/P20210303AS-1901-cropped.jpg
mv P20210303AS-1901-cropped.jpg KNOWNFACES/JoeBiden.jpg
wget http://static.kremlin.ru/media/events/press-photos/orig/41d3e9385e34ebc0e3ba.jpeg
mv 41d3e9385e34ebc0e3ba.jpeg KNOWNFACES/VladimirPutin.jpg

Now we'll need to install the embedding server to create a 128-dimension facial embedding vector representation of each of these known faces. First, we'll install the necessary prerequisites:

apt-get update
apt-get -y upgrade
apt-get update
apt-get -y upgrade
apt-get -y install build-essential
apt-get -y install python3-pip
apt-get -y install cmake
apt-get -y install jq
apt-get -y install parallel
apt-get -y install unzip
pip3 install face_recognition
pip3 install requests

Then we'll install the server itself:

wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/server_faceembeddings.py
chmod 755 server_faceembeddings.py

And run a single copy of it (you can run more copies if you have a large number of known faces to scan for):

time seq 1 1 | xargs -n1 -P0 sh -c 'while true; do ./server_faceembeddings.py 8088 || [ $? -eq 129 ] && break; sleep 1; done &'

Now create the face embedding for each of these three known faces:

time find ./KNOWNFACES/ -depth -name "*.jpg" | parallel --eta "curl -s -f -X POST http://localhost:8088/faceembed -H 'Content-Type: application/json' -d '{\"id\": \"{/.}\", \"file\":\"{}\"}' >> {.}.json"
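For those who prefer Python to curl, here is a minimal sketch of the same request. It uses only the /faceembed endpoint and the id/file fields visible in the curl call above, and writes out the server's response verbatim, whatever its format:

import glob, os, requests

# Generate an embedding for each known face by POSTing its local path to
# the embedding server, mirroring the parallel/curl invocation above.
for path in glob.glob("./KNOWNFACES/*.jpg"):
    face_id = os.path.splitext(os.path.basename(path))[0]  # e.g. "JoeBiden"
    resp = requests.post("http://localhost:8088/faceembed",
                         json={"id": face_id, "file": os.path.abspath(path)},
                         timeout=60)
    resp.raise_for_status()
    # Append the server's output next to the image, as the curl pipeline does.
    with open(os.path.splitext(path)[0] + ".json", "a") as f:
        f.write(resp.text)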

Either way, this converts each known face into the same 128-dimension embedding vectors the dataset uses, so that we can search for them. Note that you only have to do this when you change the known persons directory, adding or updating the sample images for each person. When you are done, you can shut down the server – it is only needed to generate these known-face embeddings. To exit the server:

time seq 1 $(ps -ef | grep server_faceembeddings.py | wc -l) | parallel -j 1 --eta "curl -s -f -X POST http://localhost:8088/exitserver -H 'Content-Type: application/json' -d '{}'"

Once you have the known-face embeddings, you can copy the resulting JSON files from the KNOWNFACES directory to any other VM. In this way, you can use a large VM with sufficient memory to run the embedding server and create the known-face embeddings, then copy them to a cluster of small VMs that perform the actual search, scaling out as needed.

Alternatively, you can run the embedding server on a web server and allow users to upload a sample image via a web interface; the uploaded image would be analyzed internally via the embedding server to generate its embedding, which could then be searched in real time via an ANN (Approximate Nearest Neighbor) engine like GCP's Vertex Matching Engine or Elasticsearch's ANN search. For now, we'll just use our simple command-line search script.
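Whichever engine performs it, the search itself reduces to a nearest-neighbor test over the 128-dimension vectors. A minimal NumPy sketch of that test, assuming the embeddings have already been parsed into a matrix (this is illustrative, not the search script's actual code; the 0.52 threshold mirrors the search invocation below):

import numpy as np

# Hypothetical stand-ins: one known-face embedding and a matrix holding one
# row per face detected on air, parsed from the .faceembed.json files.
rng = np.random.default_rng(0)
known = rng.random(128)
candidates = rng.random((100_000, 128))

# Euclidean distance from the known face to every candidate at once; any
# face within the threshold counts as a match.
distances = np.linalg.norm(candidates - known, axis=1)
matches = np.flatnonzero(distances <= 0.52)
print(f"{matches.size} matches out of {len(candidates)} faces")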

Now, let's download the embeddings for the channel and time period we are interested in searching. In this case, we want to search for Biden, Putin and Zelenskyy across all Russia 1 airtime from March 1 to 31, 2023 (around 1.4GB of embeddings):

start=20230301; end=20230331; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
mkdir JSON
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1.{}.inventory.json -P ./JSON/'
rm -f IDS; find ./JSON/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS'
mkdir EMBEDDINGS
cat IDS | parallel 'curl -s https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer_lenses/{}.faceembed.json -o ./EMBEDDINGS/{}.faceembed.json'

And finally, run the search:

time ./search_faceembeddings.py --knownfaces ./KNOWNFACES/ --searchfaces ./EMBEDDINGS/ --threshold 0.52 --outfile ./matches.json

This uses only a single core, yet takes just 52 seconds to search for all three known faces across the 491,955 human faces that appeared on Russia 1 over the full month. In total it yields 4,595 matches, meaning that just 0.93% of all human face appearances on Russia 1 during March were one of these three leaders.

What is the breakdown across the three leaders on Russia 1 in March 2023?

cat matches.json | jq -r .face | grep VladimirPutin | wc -l
cat matches.json | jq -r .face | grep JoeBiden | wc -l
cat matches.json | jq -r .face | grep VolodymyrZelenskyy | wc -l

Unsurprisingly, Putin leads with 2,863 matches, compared with Biden's 1,231 matches and Zelenskyy's 501 matches.
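The same tally in a few lines of Python, under the assumption (implied by the jq calls above) that matches.json contains one JSON object per line with a "face" field naming the matched person:

import json
from collections import Counter

# Tally the matches per known face, assuming one JSON object per line
# with a "face" field, as the jq pipeline above implies.
counts = Counter()
with open("matches.json") as f:
    for line in f:
        if line.strip():
            counts[json.loads(line)["face"]] += 1

for face, n in counts.most_common():
    print(face, n)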

You can download the complete matches.json results to see the full list of matches the search uncovered.

The dataset currently runs only through the end of April 2023, but we are exploring expanding it.

Creating The Dataset

To create this immense dataset, we used the face_recognition Python library that wraps dlib's facial analysis functions. To perform the actual detection and embedding creation, we used the server wrapper we created last month for face_recognition. First, spin up a high-memory VM with as many cores as possible. In our case, we used a cluster of 64-core VMs with 400GB of RAM each. The memory is needed both for the large memory-resident models and for the RAM disk used to cope with the large number of small files and their high IO load.

Now install the following packages (we alternate update and upgrade at the start because some upgrade cycles require a second pass):

apt-get update
apt-get -y upgrade
apt-get update
apt-get -y upgrade
apt-get -y install build-essential
apt-get -y install python3-pip
apt-get -y install cmake
apt-get -y install jq
apt-get -y install parallel
pip3 install face_recognition
pip3 install requests

Then download the server:

wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/server_faceembeddings.py
chmod 755 server_faceembeddings.py

Now we'll launch 1.5x as many servers as we have CPU cores. This oversubscription helps absorb per-request overhead, achieving nearly 100% CPU utilization and substantially better throughput than a one-to-one mapping of servers to cores:

time seq 1 $(( $(nproc) + ($(nproc) / 2) )) | xargs -n1 -P0 sh -c 'while true; do ./server_faceembeddings.py 8088 || [ $? -eq 129 ] && break; sleep 1; done &'

Note the additional "while…done" wrapper around the invocation of the server, which is new compared with our initial release. Under high system load, Python scripts can encounter unrecoverable execution scenarios and exit. While the image being processed at that moment is lost, the wrapper automatically restarts the server so a new one takes its place. The exception is when the server exits via SIGHUP, which the server uses internally to shut itself down deliberately when requested.
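The mechanics behind that test: a process terminated by signal N exits with status 128+N, so SIGHUP (signal 1) yields the 129 checked by the wrapper above. A hypothetical Python sketch of how a server's exit handler might terminate itself this way (the real server's internals may differ):

import os, signal

def shutdown_requested():
    # Raising SIGHUP against ourselves makes the shell see exit status
    # 128 + 1 = 129, which the restart wrapper treats as a deliberate
    # shutdown rather than a crash to recover from.
    os.kill(os.getpid(), signal.SIGHUP)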

To exit the servers, just use:

time seq 1 $(ps -ef | grep server_faceembeddings.py | wc -l) | parallel -j 1 --eta "curl -s -f -X POST http://localhost:8088/exitserver -H 'Content-Type: application/json' -d '{}'"
ps -ef | grep server_faceembeddings | wc -l

If you encounter issues, you can hard-kill them via:

ps -ef | grep server_faceembeddings | awk '{print $2}' | parallel 'kill {}'

This will launch the facial detection and embedding servers on the local VM, which run as daemons monitoring port 8088, each ready to accept an image to process. In this case, on our 64-core VM, this yields 96 server processes. Recall that we set SO_REUSEADDR and SO_REUSEPORT on our port bind, so all 96 servers monitor port 8088 and the kernel queues incoming requests and round-robins them across the servers for us. To minimize IO requirements, we use the server's ability to read local images directly: we simply pass it a fully-qualified filename on the local filesystem, which avoids shuttling image bytes over the network and allows the most efficient possible image loading. Coupled with the RAM disk, this minimizes file IO latency.
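For reference, a minimal sketch of the kind of bind that makes this port sharing work (illustrative only; the actual server's socket code may differ):

import socket

# Each server process binds port 8088 like this; with SO_REUSEPORT set on
# every socket, the kernel accepts them all on the same port and spreads
# incoming connections across the processes.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
sock.bind(("127.0.0.1", 8088))
sock.listen(128)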

Remember that these are live servers monitoring port 8088 on the VM on which they are running. They feature only minimal error handling, so they should only be used on a VM where port 8088 traffic is restricted to localhost or a trusted VPC network.

To use these servers, we'll need a script that takes a broadcast from the Visual Explorer, downloads its preview image ZIP, unpacks that ZIP, runs all of its images through the embedding servers, packages the output together into a single JSON file and writes the results back to GCS.

The following script does all of that. Save it as "embed_show.sh":

#!/bin/sh
# Usage: ./embed_show.sh ARCHIVEID
# Download the broadcast's preview-image ZIP from the Visual Explorer and unpack it.
wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/$1.zip -P ./IMAGES/
unzip -n -q -d ./IMAGES/ ./IMAGES/$1.zip
rm -f ./IMAGES/$1.zip
# Start from a clean output file for this broadcast.
rm -f ./EMBEDDINGS/$1.faceembed.json
# Optional: inventory the frames to be analyzed (uncomment to record frame counts).
#find ./IMAGES/$1/ -depth -name "*.jpg" > ./EMBEDDINGS/$1.lst
# Feed every frame to the embedding servers in parallel, flocking the shared
# output file so concurrent appends do not interleave.
find ./IMAGES/$1/ -depth -name "*.jpg" | parallel -j 96 --line-buffer "curl -s -f -X POST http://localhost:8088/faceembed -H 'Content-Type: application/json' -d '{\"id\": \"{/.}\", \"file\":\"{}\"}' | flock -x -w 20 ./EMBEDDINGS/{= s/.*\/(.*?)\-\d+\.jpg\$/\$1/ =}.faceembed.json -c \"cat >> ./EMBEDDINGS/{= s/.*\/(.*?)\-\d+\.jpg\$/\$1/ =}.faceembed.json\""
# Remove the unpacked frames and ship the finished embeddings to GCS.
rm -rf ./IMAGES/$1/
gsutil -q cp ./EMBEDDINGS/$1.faceembed.json gs://[YOURBUCKET]/
rm -f ./EMBEDDINGS/$1.faceembed.json

Despite all of the tasks it performs, the script is fairly simple. It accepts a single command-line argument (the Archive ID of the broadcast to process) and downloads and unpacks its ZIP file. The commented-out line inventories all of the frames to be analyzed and saves them to a ".lst" file; it can be uncommented to count the total number of processed frames. The real workhorse is the line right beneath it, which compiles a list of all images from the Visual Explorer ZIP file and feeds them in parallel (using 96 processes) to our servers. Buried within that line is a deceptively simple flock one-liner. Since all 96 processes write to the same output JSON file for the broadcast, if we simply appended the server outputs in parallel using a standard file append, lines would overlap with one another even with line-level buffering, resulting in corrupt JSON. The inline flock prevents that, allowing only one process at a time to write its output to the shared file.
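The same serialize-the-append pattern, expressed in Python for anyone replicating this outside the shell (hypothetical filename; fcntl is Unix-only):

import fcntl

def append_locked(path, record):
    # Take an exclusive advisory lock before appending so that concurrent
    # writers emit whole records rather than interleaved fragments: the
    # same guarantee the inline flock gives the 96 curl processes.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.write(record)
        fcntl.flock(f, fcntl.LOCK_UN)

append_locked("./EMBEDDINGS/example.faceembed.json", '{"id": "frame-000001"}\n')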

Then make the script executable:

chmod 755 embed_show.sh

Now you can run this on a single show via:

mkdir IMAGES
mkdir EMBEDDINGS
./embed_show.sh RUSSIA1_20230313_083000_60_minut

In practice, IO, network assembly and management overhead mean that only a fraction of our servers are inside their compute-intensive core loop at any moment, so on our 64-core GCE VMs we typically run 5 embed_show.sh processes in parallel. In theory this yields 480 in-flight requests at any moment (5 scripts x 96 parallel requests each), but in reality the processes are typically at different stages, from downloading and unpacking a new broadcast, to waiting on a file lock, to uploading the completed results of a finished broadcast.

To analyze all Russia 1 shows on a single VM using 5 concurrently running processing scripts, you can use:

start=20230301; end=20230331; while [[ ! $start > $end ]]; do echo $start; start=$(date -d "$start + 1 day" "+%Y%m%d"); done > DATES
mkdir JSON
time cat DATES | parallel --eta 'wget -q https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1.{}.inventory.json -P ./JSON/'
rm -f IDS; find ./JSON/ -depth -name '*.json' | parallel --eta 'cat {} | jq -r .shows[].id >> IDS'
mkdir IMAGES
mkdir EMBEDDINGS
rm -f PAR.LOG
cat IDS | parallel -j 5 --resume --joblog ./PAR.LOG './embed_show.sh {}'&

That's all there is to it: complete facial detection and embedding at scale over television news.