Using Google's Cloud Vision API To Contextualize An Entire News Broadcast Through Reverse Image Search

Google Cloud Vision API's "Web Entities" feature performs what amounts to a reverse image search across the open web. Given an image, it identifies exact matches and highly similar images, returns their URLs, and analyzes their captions to return the top topics and entities used to describe the image wherever it appears, across all of the languages Google supports. This is an extraordinary capability: it allows us to take a frame from a news broadcast and identify its context and provenance, including detailed descriptions in the EXIF metadata of copies found on the web, from protest imagery to a COVID-19 video clip.

What would it look like to apply this process at scale, in fully automated fashion, to a television news broadcast? In other words, what if we took a news broadcast, converted it into one image frame per second, searched each frame with Cloud Vision's Web Entities, compiled a list of every appearance of those frames across the open web along with their top captioned topics and entities, and then fetched each matching image and scanned it for EXIF metadata? The end result would be a fully contextualized chronology of a television news broadcast with the provenance of each clip and a deeper description of its contents, along with pointers to the same footage and imagery on the open web.

Fully automated without any human intervention, such a system could completely reimagine what it's like to fact check television news. It opens the door to automated systems that scan for contradictions between how a clip is described in the on-air textual overlays and spoken narration and how it is described on the web, and that offer viewers contextual links to more detail about the clip's provenance.

COVID-19 news coverage offers a particularly rich testing ground for this approach given how much coverage in the early part of the pandemic was sourced from other countries that permitted news cameras into their hospitals (US hospitals were more reluctant to permit filming due to medical privacy) and how the mixture of US and international footage was rarely credited on air in a way that allowed viewers to know precisely what they were seeing.

For those who just want to jump right to the output, see the final results:

To explore what this automated contextualization might look like, let's use the Visual Global Entity Graph 2.0 (VGEG 2.0) to rank all of the news programming on BBC News London, CNN, MSNBC and Fox News thus far this year by the total seconds of air time devoted to medical imagery. Using the following query we will get back a list of shows this year ranked by how much of their footage depicted medical content of some kind:

SELECT iaShowID, count(1) cnt FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, UNNEST(entities) entity WHERE DATE(date) >= "2020-01-01" and (entity.name='medical' OR entity.name='physician' OR entity.name='surgeon' OR entity.name='hospital' OR entity.name='medical equipment') group by iaShowID order by cnt desc

This yields:

ShowID Seconds
CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon 3532
CNNW_20200406_010000_CNN_Tonight_with_Don_Lemon 3470
BBCNEWS_20200805_023000_Stacey_Dooley_and_the_Lockdown… 2695
BBCNEWS_20200424_023000_Panorama 2429
BBCNEWS_20200531_183000_Inside_the_Red_Zone 2366
BBCNEWS_20200530_093000_Inside_the_Red_Zone 2330
BBCNEWS_20200807_023000_Stacey_Dooley_and_the_Lockdown… 2276
BBCNEWS_20200411_050000_Breakfast 2108
MSNBCW_20200406_020000_Dateline_Extra 1762
MSNBCW_20200406_040000_Dateline_Extra 1728

The broadcast with the most medical imagery overall was a CNN special edition of CNN Tonight With Don Lemon on April 5, 2020 from 10-11PM PDT in which almost the entirety of the hour-long broadcast (3,532 seconds) depicted medical scenes of some kind.
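The show IDs above appear to follow a STATION_YYYYMMDD_HHMMSS_Show_Name pattern with the start time in UTC. That is an assumption inferred from this ID and the stated 10PM PDT air time, not something the dataset documentation is quoted as saying here. Under that assumption, a minimal Python sketch for recovering the local air time of a broadcast, or of any offset within it, might look like this (the function names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset for Pacific Daylight Time. This is an assumption that
# holds for this April broadcast but not for winter dates.
PDT = timezone(timedelta(hours=-7), "PDT")


def parse_show_id(show_id: str):
    """Split STATION_YYYYMMDD_HHMMSS_Show_Name into (station, start_utc, name)."""
    station, day, clock, name = show_id.split("_", 3)
    start_utc = datetime.strptime(day + clock, "%Y%m%d%H%M%S")
    start_utc = start_utc.replace(tzinfo=timezone.utc)
    return station, start_utc, name.replace("_", " ")


def airtime(show_id: str, offset_seconds: int) -> datetime:
    """Pacific wall-clock time for a given offset (seconds) into the show."""
    _, start_utc, _ = parse_show_id(show_id)
    return (start_utc + timedelta(seconds=offset_seconds)).astimezone(PDT)


if __name__ == "__main__":
    show = "CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon"
    station, start_utc, name = parse_show_id(show)
    print(station, "|", name, "|", start_utc.astimezone(PDT).isoformat())
    # 576 is used purely as an example offset in seconds.
    print(airtime(show, 576).strftime("%Y-%m-%d %I:%M:%S %p %Z"))
```

Using a fixed offset rather than a time zone database keeps the sketch dependency-free, at the cost of being wrong outside daylight saving time.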

To maximize the accuracy of Cloud Vision's visual search, we generated a series of full-resolution preview images, one per second, using ffmpeg:

mkdir CACHE_THUMBS
gsutil cat [GCS_VIDEODIR/VIDEO.MP4] | ffmpeg -hide_banner -loglevel panic -f mp4 -i pipe: -vf "fps=1" -start_number 0 ./CACHE_THUMBS/CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-%06d.jpg

These were then uploaded back to a private GCS directory:

gsutil -m -q cp ./CACHE_THUMBS/*.jpg [GCS_THUMBSDIR]

We then compiled a list of all of the frames:

rm -f FRAMELIST.TXT
find ./CACHE_THUMBS/*.jpg | parallel --eta -j 1 'echo {/.} >> FRAMELIST.TXT'

To manually test what the Web Entities results might look like for a given second of airtime, use the following curl request:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '
{
  "requests": [
    {
      "image": {
        "source": {
          "gcsImageUri": "[GCS_THUMBSDIR/CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-000141.jpg]"
        }
      },
      "features": [
        {
          "maxResults": 200,
          "type": "WEB_DETECTION"
        }
      ]
    }
  ]
}' > ./CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-000141.json

This can be collapsed down to a single line (note the escaped quote marks) and used with GNU Parallel to run the complete collection of 3,600 frames through Cloud Vision with 20 in flight at any moment (adjust this depending on your quotas):

mkdir CACHE_JSON
cat FRAMELIST.TXT | parallel -j 20 --eta "curl -s -H \"Content-Type: application/json\" -H \"Authorization: Bearer $(gcloud auth print-access-token)\" https://vision.googleapis.com/v1/images:annotate -d '{ \"requests\": [ { \"image\": { \"source\": { \"gcsImageUri\": \"[GCS_THUMBSDIR/]{}.jpg\" } }, \"features\": [ { \"maxResults\": 200, \"type\": \"WEB_DETECTION\" } ] } ] }' > ./CACHE_JSON/{}.json"
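As an alternative to hand-escaping the request, the JSON body could be generated programmatically so that it is always well formed. The following is a minimal Python sketch that builds the same WEB_DETECTION request shown above. The function name and the example bucket path are hypothetical:

```python
import json


def build_request(gcs_uri: str, max_results: int = 200) -> str:
    """Return the JSON body for a single-image WEB_DETECTION request."""
    body = {
        "requests": [
            {
                "image": {"source": {"gcsImageUri": gcs_uri}},
                "features": [
                    {"type": "WEB_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }
    return json.dumps(body)


if __name__ == "__main__":
    # "gs://example-bucket/thumbs/" is a placeholder, not a real location.
    print(build_request("gs://example-bucket/thumbs/frame-000141.jpg"))
```

The output can be written to a file and passed to curl with "-d @request.json", which sidesteps the nested quoting entirely.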

To skip this step, we've made the contents of CACHE_JSON available as a gzipped tar file.

When this completes, download the "parsecvwebentities_step1_compileurls.pl" Perl script and run it. It reads the Cloud Vision Web Entities output for each frame and compiles a list of all of the "fullMatchingImages" and "partialMatchingImages" entries. In this case we want only exact matches for the image, not similar-looking images, so we skip the "visuallySimilarImages" section. The script deduplicates the URLs, randomizes their order and writes the final list to a file called "MASTERIMAGELIST.TXT". Each row contains one unique image URL: the first column is the URL itself and the second is its MD5 hash, which we will use in a moment as the on-disk filename for its EXIF metadata. Even if we sanitize URLs by removing everything other than letters and numbers, the resulting filenames are in many cases too long for Linux, so using the MD5 hashes as filenames avoids this issue.

./parsecvwebentities_step1_compileurls.pl "./CACHE_JSON/*.json"
wc -l MASTERIMAGELIST.TXT
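For readers who want to see the logic of this step without the Perl script, here is a rough Python equivalent. It is a sketch of the behavior described above, not the script itself. It assumes each JSON file holds a standard images:annotate response (a "responses" array whose items contain "webDetection") and that the MD5 is taken over the UTF-8 bytes of the URL. The function names are hypothetical:

```python
import glob
import hashlib
import json
import random


def extract_match_urls(response: dict) -> set:
    """Collect full and partial match URLs from one images:annotate response."""
    urls = set()
    for item in response.get("responses", []):
        web = item.get("webDetection", {})
        # "visuallySimilarImages" is deliberately left out.
        for section in ("fullMatchingImages", "partialMatchingImages"):
            for image in web.get(section, []):
                if "url" in image:
                    urls.add(image["url"])
    return urls


def compile_master_list(json_glob: str, out_path: str) -> int:
    """Write 'URL<TAB>MD5' rows in random order; return the row count."""
    urls = set()
    for path in glob.glob(json_glob):
        try:
            with open(path, encoding="utf-8") as handle:
                urls |= extract_match_urls(json.load(handle))
        except (json.JSONDecodeError, OSError):
            continue  # skip empty or failed responses
    rows = sorted(urls)
    random.shuffle(rows)
    with open(out_path, "w", encoding="utf-8") as out:
        for url in rows:
            digest = hashlib.md5(url.encode("utf-8")).hexdigest()
            out.write(f"{url}\t{digest}\n")
    return len(rows)
```

Calling compile_master_list("./CACHE_JSON/*.json", "MASTERIMAGELIST.TXT") would produce a file in the same two-column layout. Shuffling the rows spreads requests to any one host across the later fetch step.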

Next install the open source ExifTool:

apt-get -y install exiftool

This will install a Perl script that you can call as "exiftool ./image.jpg" to extract all of an image's EXIF metadata.

We now use GNU Parallel again to fetch each of the images from above and pipe them to ExifTool and write its output to disk (note that you will need to adjust the parallelism depending on whether the images cluster onto a smaller set of domains rather than being spread across the web):

mkdir CACHE_EXIF
cat MASTERIMAGELIST.TXT | parallel --eta -j 200 --colsep '\t' 'curl -s {1} | exiftool - > ./CACHE_EXIF/{2}.txt'

When this completes, you will have a directory filled with the extracted and computed EXIF metadata for all of the images that were found by Cloud Vision across the open web that it determined to be exact matches of the imagery in the broadcast.

To skip this step, we've made the contents of CACHE_EXIF available as a gzipped tar file.
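On the parallelism caveat for the fetch step: one way to judge whether the image URLs cluster onto a few domains is to count them per host before choosing a "-j" value. The following is a minimal Python sketch assuming the two-column "URL<TAB>MD5" layout of MASTERIMAGELIST.TXT described earlier. The function name and sample rows are hypothetical:

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(lines):
    """Count image URLs per host from tab-delimited 'URL<TAB>MD5' rows."""
    counts = Counter()
    for line in lines:
        url = line.split("\t", 1)[0].strip()
        if not url:
            continue  # ignore blank lines
        host = urlsplit(url).hostname or "(unparsed)"
        counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical rows standing in for the contents of MASTERIMAGELIST.TXT.
    sample = [
        "https://img.example.com/a.jpg\thash-a",
        "https://img.example.com/b.jpg\thash-b",
        "https://cdn.example.org/c.jpg\thash-c",
    ]
    counts = host_counts(sample)
    total = sum(counts.values())
    for host, n in counts.most_common(10):
        print(f"{n:6d}  {n / total:6.1%}  {host}")
```

If one or two hosts account for most of the list, a lower level of parallelism is likely the more polite choice.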

There are myriad relevant EXIF metadata fields, but we will focus here on five of the most relevant:

  • Artist. This gives us the photographer/agency that captured the image, giving us provenance.
  • Copyright. This lists the name of the image provider or creative agency that owns the copyright, giving us additional detail in tracking down the full details of the image.
  • Date/Time Original. This lists the date and time that the author claims the image was taken. Note that this information can be easily overridden and so is not a guarantee of the image's creation date/time, but for professionally produced imagery may be helpful in narrowing down when it was taken.
  • Description. This field is only present on a small subset of images, but if provided typically provides a rich verbose description of what the image depicts in what amounts to an extended image caption, often with the date, time, location, and subject. Often keyword searching a search engine with this text will yield the original source of the image.
  • Image Description. This is used as an alternative to the "Description" field.

The information above can also occur in several other fields, but for the purposes of this pilot we look only in the precise fields listed above. We also do not attempt to correct damaged metadata entries in which the character encoding is invalid (in the Visual Global Knowledge Graph we perform advanced decoding recovery and provide the full path to each field).
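To make the field selection concrete, here is a minimal Python sketch that pulls just these five fields out of one of the saved ExifTool text files. It assumes ExifTool's default "Name : value" text layout and, as in the pilot, makes no attempt to repair invalid encodings. The function names are hypothetical and the sample text is hand-written in that layout:

```python
FIELDS = (
    "Artist",
    "Copyright",
    "Date/Time Original",
    "Description",
    "Image Description",
)


def parse_exiftool_text(text: str) -> dict:
    """Pull the fields of interest out of ExifTool's 'Name : value' output."""
    found = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as timestamps
        # contain colons of their own.
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name in FIELDS and value and name not in found:
            found[name] = value
    return found


def describe(found: dict) -> str:
    """Prefer Description, falling back to Image Description."""
    return found.get("Description") or found.get("Image Description", "")


if __name__ == "__main__":
    sample = (
        "File Type                       : JPEG\n"
        "Artist                          : Ulet Ifansasti\n"
        "Copyright                       : 2020 Getty Images\n"
        "Date/Time Original              : 2020:03:04 00:00:00+00:00\n"
    )
    print(parse_exiftool_text(sample))
```

Keeping only the first occurrence of each field is a design choice for this sketch. A file that repeats a field under several metadata groups would need a more careful policy.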

Next download our second Perl script, "parsecvwebentities_step2_makefinalvideosummary.pl", which parses these EXIF files, combines them with the computed topical entities output by Web Entities and compiles a final spreadsheet:

time ./parsecvwebentities_step2_makefinalvideosummary.pl "./CACHE_JSON/*.json" "./CACHE_EXIF/" CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-CONTEXTSHEET

This outputs two files: an HTML chronology and a final tab-delimited file named "CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-CONTEXTSHEET.TXT" with the following columns:

  • ShowID. This is the Internet Archive's unique ID for this show.
  • Offset. The offset in seconds from the start of the broadcast.
  • Topics. The list of topics assigned by Cloud Vision based on how instances of the image were captioned across the open web. Note that for images with few online matches, these topics may have more limited relevance.
  • Date/Time Original. The contents of the EXIF metadata field stating the claimed capture date of the imagery.
  • Photographer. The contents of the EXIF metadata field stating the photographer.
  • Copyright. The contents of the EXIF metadata field stating the copyright holder of the image.
  • Description. The contents of the EXIF metadata field containing the description of the image.
  • ThumbnailURL. A low resolution preview thumbnail of the frame.
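The column layout above can be read back with a few lines of Python. This is a minimal sketch that assumes the file has exactly these eight tab-delimited columns in the order listed and that topics are comma-separated. Whether the file includes a header row is not stated, so the sketch skips any row whose Offset is not numeric. The column keys and function name are hypothetical:

```python
import csv
import io

COLUMNS = (
    "ShowID", "Offset", "Topics", "DateTimeOriginal",
    "Photographer", "Copyright", "Description", "ThumbnailURL",
)


def read_context_sheet(handle):
    """Yield one dict per data row of the tab-delimited context sheet."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS) or not row[1].isdigit():
            continue  # skips a header row, if any, and malformed lines
        record = dict(zip(COLUMNS, row))
        record["Offset"] = int(record["Offset"])
        record["Topics"] = [t for t in record["Topics"].split(",") if t]
        yield record


if __name__ == "__main__":
    # A shortened, hand-written row in the same layout.
    sample = io.StringIO(
        "SHOW\t576\tMask,Indonesia\t2020:03:04 00:00:00+00:00\t"
        "Ulet Ifansasti\t2020 Getty Images\tA caption\t"
        "http://example.com/thumb.jpg\n"
    )
    for record in read_context_sheet(sample):
        print(record["Offset"], record["Photographer"], record["Topics"])
```

From here it is straightforward to filter, for example, to only those seconds of airtime for which a photographer or copyright holder was recovered.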

Looking through this file you'll see entries like this:

CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon 576 Mask,Indonesia,2019-20 coronavirus pandemic,Coronavirus disease 2019,Coronavirus,Health,Surgical mask,Mask,Centers for Disease Control and Prevention,Photograph,Stock photography 2020:03:04 00:00:00+00:00 Ulet Ifansasti 2020 Getty Images YOGYAKARTA, INDONESIA – MARCH 04: A teacher shows students how to use a mask made from tissue at the Cinta Bangsa kindergarten as Indonesia reports two cases of COVID-19 on March 4, 2020 in Yogyakarta, Indonesia. Two women in Indonesia have tested positive for Coronavirus after having had contact with a Japanese national and are being treated at Sulianti Saroso infectious diseases hospital in Jakarta. The news comes as heath experts express concerns that cases are going unreported in Indonesia, a country that has close economic ties to China including being one of the top ten destinations for travellers from Wuhan, the epicentre of the epidemic. (Photo by Ulet Ifansasti/Getty Images) *** BESTPIX *** http://data.gdeltproject.org/televisionexplorer/thumbnails/CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon-000576.jpg

In this case, at 10:09PM PDT during the broadcast there is an image of a teacher placing a mask on a student at the front of a classroom. The data above says that the Vision API found other copies of this image across the web that were captioned with topics like "Mask,Indonesia,2019-20 coronavirus pandemic,Coronavirus disease 2019,Coronavirus,Health,Surgical mask,Mask,Centers for Disease Control and Prevention,Photograph,Stock photography" (remember these are not assigned by visually analyzing the image – these labels come from examining what topics are used to caption copies of the image on the web), that the image claims to be taken March 4, 2020, that its photographer was Ulet Ifansasti, it is owned by Getty Images and its description is "YOGYAKARTA, INDONESIA – MARCH 04: A teacher shows students how to use a mask made from tissue at the Cinta Bangsa kindergarten as Indonesia reports two cases of COVID-19 on March 4, 2020 in Yogyakarta, Indonesia. Two women in Indonesia have tested positive for Coronavirus after having had contact with a Japanese national and are being treated at Sulianti Saroso infectious diseases hospital in Jakarta. The news comes as heath experts express concerns that cases are going unreported in Indonesia, a country that has close economic ties to China including being one of the top ten destinations for travellers from Wuhan, the epicentre of the epidemic. (Photo by Ulet Ifansasti/Getty Images) *** BESTPIX ***."

Another example occurs at 10:14PM PDT, in which a clip of a technician looking through a microscope is actually a stock video clip from iStock by Getty Images dated February 5, 2020 and created by JHDT Productions.

The script also outputs an HTML chronology of the broadcast, letting you scroll through it second by second to see how Cloud Vision API's Web Entities saw the broadcast:

Instead of processing every second of the broadcast, could we reduce the computational load by analyzing just the first frame of each camera shot? The Cloud Video API produces a list of all of the "shot changes" in each broadcast, the points at which the scene changes dramatically. Using the following query we can compile the list of shot changes for this broadcast (despite its name, the VGEG 2.0 numShotChanges field counts the number of "shots", not "shot changes", so we have to look for rows in which there were at least two shots within a second of airtime):

SELECT showOffset FROM `gdelt-bq.gdeltv2.vgegv2_iatv` WHERE DATE(date) = "2020-04-06" and iaShowId='CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon' and numShotChanges>1 order by showOffset asc

This reduces the number of seconds of airtime we need to process from 3,600 to 925, or roughly 26% of the original frame count.
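To turn the query results into a reduced frame list, something like the following Python sketch could be used. It assumes that a frame's six-digit suffix equals its offset in seconds, which matches the thumbnail naming seen in this article but is an assumption all the same. The function names are hypothetical:

```python
def frames_for_offsets(show_id: str, offsets) -> list:
    """Map shot-change offsets (seconds) to FRAMELIST.TXT-style frame names."""
    return [f"{show_id}-{int(o):06d}" for o in sorted(set(offsets))]


def fraction_kept(selected: int, total: int) -> float:
    """Share of the original frames that still need to be processed."""
    return selected / total


if __name__ == "__main__":
    show = "CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon"
    # Three made-up offsets standing in for the query output.
    for name in frames_for_offsets(show, [12, 141, 576]):
        print(name)
    print(f"{fraction_kept(925, 3600):.1%} of frames retained")
```

The resulting names could be written to a file and used in place of FRAMELIST.TXT in the GNU Parallel step.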

The chronology below shows the same results as above, but with shot transitions highlighted in grey. Note that since the preview image represents the first frame of the second of airtime, it may not capture the moment of transition during that second of airtime:

To create this chronology we downloaded the results of the SELECT query above, saved them to a file called "OFFSETS" and used a modified Perl script, "parsecvwebentities_step2_makefinalvideosummary-withshot.pl":

./parsecvwebentities_step2_makefinalvideosummary-withshot.pl "./CACHE_JSON/*.json" "./CACHE_EXIF/" CNNW_20200406_050000_CNN_Tonight_with_Don_Lemon

Looking closely at this version, it is clear that limiting the analysis to only shot changes would miss quite a bit of the available information. This suggests that the precise angle of a shot is critically important: looking at just a single frame from a shot, no matter how consistent that shot is, leaves a significant number of matches undiscovered.

Putting this all together, we started with a television news broadcast and used Cloud Vision API's Web Entities to search the open web for matches for each second of airtime, scanning those online copies for EXIF metadata and analyzing their captions to produce a second-by-second contextualized chronology of the broadcast. In this case the age of the broadcast means there has been time for many of the clips to circulate across the web, ensuring richer metadata than would be available for a live broadcast of novel footage (though the reliance on stock imagery means many of these images would have had sufficient metadata even at the time of broadcast). Nonetheless, this demonstrates a powerful approach to enriching our understanding of television news using a fully automated workflow.

Take a look through the final contextualization spreadsheet yourself and see what you find!