Using Visual Explorer & Google's Cloud Vision API To Non-Consumptively Annotate A Russian Television News Broadcast

What would it look like to use the latest neural computer vision approaches to non-consumptively scan global television news and identify the objects and activities depicted, perform OCR in more than 300 languages, recognize geographic landmarks, count the number of faces, identify major logos and, most powerfully of all, perform a reverse Google Images search to identify where visuals from that broadcast have appeared across the open web and how they were captioned? Today we explore combining the TV News Visual Explorer and Google's Cloud Vision API to do precisely that with a Russian television news broadcast.

This past June, in collaboration with the Internet Archive's Television News Archive, we released the Television News Visual Explorer, which makes television news "skimmable" by converting broadcasts into interactive thumbnail grids. Each broadcast is converted into a sequence of preview images, with one frame extracted every 4 seconds, capturing the core visual narrative of the broadcast in a non-consumptive representation, similar to ngrams for text. While this representation is fairly coarse and a lot can happen in the gaps between frames, extensive experimentation across a range of broadcasts from around the world showed that a 4 second sampling interval represents an optimal balance that still captures the core narrative of a broadcast.
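
Since frames are sampled at a fixed 4 second interval, the sequence number embedded in each preview image's filename can also be mapped back to an approximate offset within the broadcast. Here is a minimal shell sketch, assuming the frames are numbered sequentially from 1 at the start of the broadcast:

# Approximate broadcast offset of preview frame 88 (eg "...-000088.jpg"),
# assuming sequential numbering from 1 and one frame every 4 seconds.
FRAME=88
OFFSET=$(( (FRAME - 1) * 4 ))
echo "Frame $FRAME is roughly ${OFFSET}s ($((OFFSET / 60))m$((OFFSET % 60))s) into the broadcast"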

When viewing a broadcast in the Visual Explorer, you can download a ZIP file containing the full-resolution versions of the sampled frames seen in the thumbnail grid, making it possible to run automated visual analysis tools over this non-consumptive representation. How can we combine this ZIP file with the Vision API?

For this example, we will use the broadcast "Факты" ("Facts") that aired on Russia24 this past Tuesday at 7PM Moscow time. View this broadcast in the Visual Explorer, click on the download icon at the top right of the page and copy the URL of the preview images ZIP file, then download and unzip it to a directory of images:

wget http://data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA24_20220830_160000_Fakti.zip
unzip RUSSIA24_20220830_160000_Fakti.zip

Next, compile a list of the filenames of all of the preview frames, write it to FRAMELIST.TXT and then upload the images to your GCS bucket:

rm -f FRAMELIST.TXT
find ./RUSSIA24_20220830_160000_Fakti/*.jpg | parallel --eta -j 1 'echo {/.} >> FRAMELIST.TXT'
time gsutil -m -q cp ./RUSSIA24_20220830_160000_Fakti/*.jpg gs://[YOURBUCKET]/RUSSIA24_20220830_160000_Fakti/

This particular demo copies the images to a temporary GCS bucket to make the API submission simpler, but you could also BASE64 encode the image data and submit it directly to the API.
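
For reference, here is a minimal sketch of that inline alternative for a single frame, using the API's "content" field in place of "source.gcsImageUri" (writing the request to a temporary file keeps the large BASE64 payload off the command line):

# Submit a single frame's bytes inline as BASE64 rather than via GCS.
IMG=./RUSSIA24_20220830_160000_Fakti/RUSSIA24_20220830_160000_Fakti-000088.jpg
echo "{ \"requests\": [ { \"image\": { \"content\": \"$(base64 -w 0 $IMG)\" }, \"features\": [ { \"type\": \"LABEL_DETECTION\" } ] } ] }" > request.json
curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d @request.json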

Next, we'll run a single test image through the Vision API to make sure everything works. We're going to request Web Entities (a reverse image search across the open web), geographic landmark detection, labels (depicted objects and activities), logo detection, face detection (to count the number of faces) and 300+ language OCR.

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://vision.googleapis.com/v1/images:annotate -d '
{
  "requests": [
    {
      "image": {
        "source": {
          "gcsImageUri": "gs://[YOURBUCKET]/RUSSIA24_20220830_160000_Fakti/RUSSIA24_20220830_160000_Fakti-000088.jpg"
        }
      },
      "features": [
        {
          "maxResults": 200,
          "type": "WEB_DETECTION"
        },
        {
          "maxResults": 200,
          "type": "LANDMARK_DETECTION"
        },
        {
          "maxResults": 200,
          "type": "LABEL_DETECTION"
        },
        {
          "type": "LOGO_DETECTION"
        },
        {
          "type": "FACE_DETECTION"
        },
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}' > ./RUSSIA24_20220830_160000_Fakti-000088.json

You can see the full output from the API and the sheer richness of the annotations it produces (jump to the bottom of the page to see the OCR and Web Entities):
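
If you have the jq utility installed, you can quickly pull specific annotations out of this JSON. The sketch below extracts the frame's complete OCR'd text and its Web Entity labels (field names per the Vision API v1 response format):

# The first textAnnotations entry contains the frame's complete OCR'd text.
jq -r '.responses[0].textAnnotations[0].description' ./RUSSIA24_20220830_160000_Fakti-000088.json
# Web Entity labels from the reverse image search (some entities lack a label).
jq -r '.responses[0].webDetection.webEntities[].description // empty' ./RUSSIA24_20220830_160000_Fakti-000088.json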

Now let's collapse the JSON query above into a single line and run it over the entire collection of 945 preview images:

mkdir RUSSIA24_20220830_160000_Fakti-CVAPI
cat FRAMELIST.TXT | parallel -j 20 --eta "curl -s -H \"Content-Type: application/json\" -H \"Authorization: Bearer $(gcloud auth print-access-token)\" https://vision.googleapis.com/v1/images:annotate -d '{ \"requests\": [ { \"image\": { \"source\": { \"gcsImageUri\": \"gs://[YOURBUCKET]/RUSSIA24_20220830_160000_Fakti/{}.jpg\" } }, \"features\": [ { \"maxResults\": 200, \"type\": \"WEB_DETECTION\" },{\"maxResults\":200,\"type\":\"LANDMARK_DETECTION\"},{\"maxResults\":200,\"type\":\"LABEL_DETECTION\"},{\"type\":\"LOGO_DETECTION\"},{\"type\":\"FACE_DETECTION\"},{\"type\":\"TEXT_DETECTION\"} ] } ] }' > ./RUSSIA24_20220830_160000_Fakti-CVAPI/{}.json"

This will output the API's annotation of each preview image to its own JSON file.
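
With one JSON file per frame, simple shell pipelines can already surface broadcast-level patterns. As a sketch, the commands below tally the most common labels across the broadcast and list the frames with the most detected faces (frames without faces have no faceAnnotations field, which jq's length treats as 0):

# Most common labels across all 945 frames.
jq -r '.responses[0].labelAnnotations[]?.description' ./RUSSIA24_20220830_160000_Fakti-CVAPI/*.json | sort | uniq -c | sort -nr | head -20
# Frames with the most detected faces.
for F in ./RUSSIA24_20220830_160000_Fakti-CVAPI/*.json; do echo -e "$(jq '.responses[0].faceAnnotations | length' $F)\t$(basename $F .json)"; done | sort -nr | head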

You can download the final ZIP file containing all of the annotations of this broadcast to see what the results look like for an entire broadcast:

Surprisingly, despite the incredible richness of the annotations being performed, the final cost to annotate this entire hour-long broadcast with all 6 requested annotation types is just $11. Performing OCR alone over the broadcast costs $1.50, falling to $0.60 per hour-long broadcast at higher volumes. Adding logo detection alongside OCR costs $3.00 per broadcast, or $1.20 at higher volumes.
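
For those following along with the Vision API's published per-image pricing (at the time of writing, roughly $1.50 per 1,000 images for most features and $3.50 per 1,000 for Web Detection, after the free tier), the arithmetic checks out: 945 frames x ($1.50 x 5 features + $3.50 Web Detection) / 1,000 ≈ $10.40, which rounds to the $11 observed above.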

We hope this experiment showcases the power of coupling the Visual Explorer's downloadable preview image ZIP files with machine vision to enable rapid cataloging and assessment of the visual narratives of television news from across the world.