With the recent general availability release of Imagen on Vertex AI, Google's new generative AI foundation model for images, we've already explored its image captioning capabilities on a handful of selected images. Today we'll showcase how it can be applied at scale to generate frame-by-frame visual descriptions of two television news broadcasts: one from the United States and one from Russia.
Why generate visual descriptive captions for the news? Like closed captioning for the hearing impaired, visual captions could offer a powerful new assistive capability for the visually impaired: given the importance of visual storytelling to television news, they would give those viewers a far greater understanding of what is being depicted onscreen. They could also enable new kinds of visual search and at-scale analysis of the visual metaphors used to tell the world's most pressing stories each day, and especially how those metaphors vary across the world.
Google's Imagen accepts any image as input and returns a set of up to three potential captions describing it. To demonstrate image captioning at scale, we've applied the API to two broadcasts: an ABC Evening News broadcast and a Russia 1 episode of 60 Minutes. We've made a special display interface that shows the same thumbnail grid you see in the Visual Explorer, but instead of clicking on thumbnails to play clips, you simply move your mouse over any image to see a popup listing the captions the API returned for that image.
- ABC Evening News Broadcast.
- Original Broadcast. (Click to play clips)
- Captioned Display. (Mouse over thumbnails to see captions – no playable clips).
- Russia 1 Episode Of 60 Minutes.
- Original Broadcast. (Click to play clips)
- Captioned Display. (Mouse over thumbnails to see captions – no playable clips).
Overall the results are quite reasonable, if not remarkable, and rely heavily on the OCR'd text contained in the image itself. Captions are extremely brief, often along the lines of "man talking on tv show," which is less descriptive than it could be even for studio scenes: such scenes frequently feature inset imagery, camera angles juxtaposing specific iconography, facial expressions and other elements that are important to understanding the visual impact of the scene. Images with onscreen text often yield the most detailed descriptions, as the model uses that text as a guide to understanding the image contents, once again demonstrating the immense importance of understanding onscreen text in television news.
For the technically-minded, let's take a look at how we produced these two displays.
Imagen's image captioning capabilities are trivial to integrate into an application workflow: simply convert any image to BASE64 encoding and submit it to the API, which after a few seconds will return up to three potential descriptive captions for the image. There are no prompts to manage or other settings to worry about tuning. You can see the full technical readout on parameters and workflow in the documentation.
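For reference, a captioning request is an ordinary Vertex AI predict call. Based on our reading of the documentation (the field names below come from the docs, not from anything specific to our pipeline), the request body submitted to the imagetext model looks roughly like this, with sampleCount controlling how many candidate captions are returned:

```
{
  "instances": [
    { "image": { "bytesBase64Encoded": "<YOUR BASE64-ENCODED IMAGE>" } }
  ],
  "parameters": { "sampleCount": 3 }
}
```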
To make it easy for you, we've wrapped up the complete workflow for interacting with the captioning API into a simple Perl script that accepts an image filename and reads it from disk, BASE64 encodes it, submits it to the API and writes the results to disk:
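The actual script can be downloaded using the commands below. As a rough sketch of what such a wrapper can look like (this is our illustrative reconstruction, assuming the imagetext:predict REST endpoint above and a gcloud-minted access token, rather than the exact contents of the script):

```perl
#!/usr/bin/perl
# Illustrative sketch only - the actual script downloaded below may differ.
# Accepts an image filename, BASE64-encodes the image, submits it to the
# Vertex AI imagetext:predict endpoint and writes the JSON response to disk.
use strict;
use warnings;
use MIME::Base64;

my $file = $ARGV[0] or die "usage: $0 image.jpg\n";
exit 0 if -s "$file.json"; # skip images we've already captioned...

# read the image and BASE64-encode it as a single unwrapped line...
open(my $IN, '<:raw', $file) or die "unable to open $file: $!";
my $image = do { local $/; <$IN> };
close($IN);
my $b64 = encode_base64($image, '');

# write the JSON request body to a temporary file...
open(my $REQ, '>', "$file.request") or die "unable to write request: $!";
print $REQ '{"instances":[{"image":{"bytesBase64Encoded":"' . $b64 .
           '"}}],"parameters":{"sampleCount":3}}';
close($REQ);

# submit to the API with a gcloud-minted access token and save the response...
my $token = `gcloud auth print-access-token`;
chomp($token);
system('curl -s -X POST -H "Authorization: Bearer ' . $token .
       '" -H "Content-Type: application/json" ' .
       '-d @"' . $file . '.request" ' .
       '"https://us-central1-aiplatform.googleapis.com/v1/projects/[YOURPROJECTID]' .
       '/locations/us-central1/publishers/google/models/imagetext:predict" ' .
       '> "' . $file . '.json"');
unlink("$file.request");
```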
Make sure to change "[YOURPROJECTID]" in the script to your GCP project ID. Note that this is a trivial demo script without any error handling or retry logic.
Let's see it in action. First we download the script (making sure to change the project ID in it), then we download a sample image and run it. It will take a few seconds and then generate the output you see below.
```
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/exp_captionimagevertexaiimagetext.pl
chmod 755 exp_captionimagevertexaiimagetext.pl
wget https://blog.gdeltproject.org/wp-content/uploads/2022-stratcom-istanbul.png
time ./exp_captionimagevertexaiimagetext.pl ./2022-stratcom-istanbul.png
cat 2022-stratcom-istanbul.png.json
```

```
{
  "predictions": [
    "a man stands in front of a stage that says sg strategic summit 22",
    "a man stands in front of a stage that says strategic summit 22",
    "a man stands in front of a stage that says strategic summit"
  ],
  "deployedModelId": "6747203681382301696"
}
```
Let's use this script to annotate our two broadcasts:
```
#download and unpack the thumbnails...
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/KGO_20230204_013000_ABC_World_News_Tonight_With_David_Muir.zip
unzip KGO_20230204_013000_ABC_World_News_Tonight_With_David_Muir.zip
wget https://storage.googleapis.com/data.gdeltproject.org/gdeltv3/iatv/visualexplorer/RUSSIA1_20230215_083000_60_minut.zip
unzip RUSSIA1_20230215_083000_60_minut.zip

#download our API wrapper (NOTE: change the [YOURPROJECTID] in the script to your GCP Project ID):
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/exp_captionimagevertexaiimagetext.pl
chmod 755 exp_captionimagevertexaiimagetext.pl

#now run the API over all of the images
#NOTE: do NOT increase the number of parallel jobs, as the default quotas restrict you to a limited number of captioning requests per minute
#NOTE: due to the lack of error handling in the demo script, you'll likely need to run these several times to caption all images (the script automatically skips images it has already processed)
time find ./RUSSIA1_20230215_083000_60_minut/ -maxdepth 1 -name "*.jpg" | parallel --eta -j 1 './exp_captionimagevertexaiimagetext.pl {}'
time find ./KGO_20230204_013000_ABC_World_News_Tonight_With_David_Muir/ -maxdepth 1 -name "*.jpg" | parallel --eta -j 1 './exp_captionimagevertexaiimagetext.pl {}'
```
That's truly all there is to captioning the images! The end result will be a directory of JSON files, one per thumbnail image, in the format seen above.
How do we convert all of those JSON files into the interactive displays above? We've made a second Perl script that parses through the JSON files and compiles them into the JavaScript-based interactive HTML pages you saw above:
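The actual compiler script is downloaded below. As an illustrative sketch of the core idea (again our reconstruction, not the script itself), it simply globs the per-image JSON files, extracts the candidate captions and writes an HTML page in which each thumbnail's mouseover tooltip lists its captions:

```perl
#!/usr/bin/perl
# Illustrative sketch only - the actual script downloaded below may differ.
# Globs the per-image JSON caption files in the given broadcast directory and
# compiles them into a single HTML page in which mousing over each thumbnail
# displays the captions the API returned for it.
use strict;
use warnings;
use JSON::PP;

my $dir = $ARGV[0] or die "usage: $0 broadcastdirectory/\n";
$dir =~ s/\/+$//;
(my $show = $dir) =~ s/^.*\///;

my $html = "<html><head><title>$show</title></head><body>\n";
foreach my $jsonfile (sort glob("$dir/*.json")) {
    (my $img = $jsonfile) =~ s/\.json$//;

    # parse the API response and pull out the candidate captions...
    open(my $FH, '<', $jsonfile) or next;
    my $data = eval { decode_json(do { local $/; <$FH> }) };
    close($FH);
    next if !$data || ref($data->{predictions}) ne 'ARRAY';

    # attach the captions to the thumbnail as its mouseover tooltip...
    my $tooltip = join(' | ', @{$data->{predictions}});
    $tooltip =~ s/"/&quot;/g;
    $html .= "<img src=\"$img\" title=\"$tooltip\" style=\"width:200px;margin:2px;\">\n";
}
$html .= "</body></html>\n";

open(my $OUT, '>', "$show.captioned.html") or die "unable to write output: $!";
print $OUT $html;
close($OUT);
```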
Running it is as simple as:
```
wget https://storage.googleapis.com/data.gdeltproject.org/blog/2022-tv-news-visual-explorer/exp_captionimagevertexaiimagetext_compileresults.pl
chmod 755 exp_captionimagevertexaiimagetext_compileresults.pl
time ./exp_captionimagevertexaiimagetext_compileresults.pl ./KGO_20230204_013000_ABC_World_News_Tonight_With_David_Muir/
time ./exp_captionimagevertexaiimagetext_compileresults.pl ./RUSSIA1_20230215_083000_60_minut/
```
The outputs will be two HTML pages, "RUSSIA1_20230215_083000_60_minut.captioned.html" and "KGO_20230204_013000_ABC_World_News_Tonight_With_David_Muir.captioned.html".
As you can see, the pipeline here is fairly trivial to implement and offers a first glimpse of how descriptive image captioning might one day provide additional ways of understanding the visual storytelling of television news.