What Google's Video AI API Sees In Television News From Around The World

With the release of the new EMEA Television News Archive yesterday, we wanted to explore what television news broadcasts from around the world look like through the eyes of a state-of-the-art video annotation system like Google's Video AI API. In June we showcased this with a Russian television news broadcast, using the Video AI API to perform multilingual OCR, label detection (identifying the objects and activities depicted) and shot change detection (identifying the distinct camera framings). Today we've repeated this process for the two sample broadcasts from each EMEA channel in yesterday's blog post: 18 broadcasts from 9 channels.

In this case, the 18 broadcasts are in archival MPEG2 format, so we first had to convert them to MP4 for processing with the Video AI API. Unfortunately, our traditional stream-conversion workflow is complicated by the fact that ffmpeg's MP4 muxer needs random access to its output in order to write the file's index when encoding finishes. Widely cited options like "-movflags faststart" and "-movflags frag_keyframe+empty_moov" are often suggested to enable streaming transcodes, but they yield MP4 files with more limited player compatibility. Thus, in this case we simply downloaded the videos to a RAM disk on a 64-core VM and converted them locally:

time ffmpeg -nostdin -hide_banner -loglevel panic -i ./[SHOW].mpg ./[SHOW].mp4&

This quickly converted all of the shows into MP4 files, which we then shipped back to GCS.
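For the full batch, that per-show command can be wrapped in a small loop that backgrounds one ffmpeg process per file and then waits for all of them to finish. A minimal sketch (the `convert_all` helper name and the assumption that the .mpg files already sit on the RAM disk are our own, not part of the workflow above):

```shell
# convert_all: transcode every archival MPEG2 file in the working directory
# to MP4, one background ffmpeg process per file, then wait for all of the
# jobs to finish. Assumes the .mpg files have already been copied to the
# RAM disk (e.g. under /dev/shm) with enough cores to run them concurrently.
convert_all() {
  for f in ./*.mpg; do
    [ -e "$f" ] || continue    # no .mpg files matched: nothing to do
    ffmpeg -nostdin -hide_banner -loglevel panic -i "$f" "${f%.mpg}.mp4" &
  done
  wait    # block until every background ffmpeg job has exited
}
```

With 18 broadcasts and 64 cores, all of the conversions run concurrently, so the wall-clock time is roughly that of the slowest single file.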

Once the broadcasts had been transcoded, we submitted them to the Video AI API for processing using the same configuration we used for the Russian broadcast in June:

curl -s -X POST \
  -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'inputUri': 'gs://[BUCKET]/[SHOW].mp4',
    'features': [
      'LABEL_DETECTION',
      'SHOT_CHANGE_DETECTION',
      'TEXT_DETECTION'
    ],
    'location_id': 'us-east1',
    'videoContext': {
      'labelDetectionConfig': {
        'labelDetectionMode': 'SHOT_AND_FRAME_MODE',
        'stationaryCamera': false,
        'model': 'builtin/latest'
      },
      'shotChangeDetectionConfig': {
        'model': 'builtin/latest'
      }
    },
    'outputUri': 'gs://[BUCKET]/[SHOW].full.json'
  }" "https://videointelligence.googleapis.com/v1/videos:annotate"
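Note that the annotate call returns immediately with the name of a long-running operation rather than the annotations themselves; the JSON only appears at the outputUri once that operation completes. A sketch of checking on it (the `op_url` and `check_op` helper names are our own; pass the "name" field from the annotate response):

```shell
# op_url: build the polling URL for a long-running Video AI operation from
# the "name" field returned by the videos:annotate call, e.g.
# "projects/[PROJECT]/locations/us-east1/operations/[ID]".
op_url() {
  printf 'https://videointelligence.googleapis.com/v1/%s' "$1"
}

# check_op: fetch the operation's current status; the response includes
# '"done": true' once the annotations have been written to the outputUri.
check_op() {
  curl -s \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "$(op_url "$1")"
}
```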

You can see the final results for each broadcast below. Note that the annotation files range from 50MB to 250MB.

You can also try viewing these annotations using the Video Intelligence API Visualizer, though some of the files may be too large to load. Download any of the JSON files above to your local computer, then drag-and-drop it onto the "Your JSON" box at the top-right of the Visualizer page. Since the video of each broadcast is not downloadable, you'll get an error that the JSON file doesn't match the stock demo video on the page, and you won't be able to click on an entity to see it in place in the video. However, you can view the broadcast in the TV Visual Explorer and use the timelines and timecode offsets reported by the Visualizer to eyeball roughly where in the broadcast each annotation falls. The OCR data is particularly useful for scanning chyrons for the names and affiliations of commentators, to better understand who is telling each story in a given broadcast, and for copy-pasting the onscreen text of textual reports into Google Translate to understand what they say.
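To make that kind of chyron scanning easier offline, the recognized text can also be pulled out of a downloaded annotation file with a few lines of scripting. A sketch, assuming python3 is available and that the output JSON uses the v1 camelCase field names (annotationResults, textAnnotations) — verify those names against your own files:

```shell
# extract_ocr: print each piece of recognized on-screen text from one
# downloaded Video AI annotation file. The field names below
# (annotationResults, textAnnotations) are our assumption about the v1
# output schema; adjust them if your JSON differs.
extract_ocr() {
  python3 -c '
import json, sys
doc = json.load(open(sys.argv[1]))
for res in doc.get("annotationResults", []):
    for t in res.get("textAnnotations", []):
        print(t.get("text", ""))
' "$1"
}

# Usage: extract_ocr ./[SHOW].full.json | sort -u
```

Piping the output through "sort -u" collapses the many repeated per-frame detections of the same chyron into a deduplicated list of distinct onscreen strings.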

These example annotations offer a first glimpse at how neural video analysis tools may be able to help us make better sense of global television news, including rich semantic annotation and indexing.