The GDELT Project

Automated Image Captioning: Experiments With Google's New Imagen on Vertex AI Generative AI Service

With Imagen on Vertex AI, Google's new image-based generative AI service, now generally available, let's explore how it captions a variety of news and other images that we've used to date in our tests of multimodal LLMs. We'll use its visual captioning capability, which accepts an uploaded image and returns up to three textual captions describing the contents of the image at a high level. In this case we'll use the web interface to the API.
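For those calling the service programmatically rather than through the web interface, the request can be sketched as a standard Vertex AI prediction call. The sketch below builds the JSON body for the captioning model; the project ID and region are placeholders, and the `imagetext` model name and field names (`bytesBase64Encoded`, `sampleCount`) follow the Vertex AI prediction API conventions as we understand them, so treat the exact shape as an assumption to verify against the current documentation.

```python
import base64
import json

# Placeholder GCP project and region -- substitute your own values.
PROJECT_ID = "my-project"
LOCATION = "us-central1"

# Assumed prediction endpoint for the Imagen captioning model ("imagetext").
ENDPOINT = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/"
    f"{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/"
    "imagetext:predict"
)

def build_caption_request(image_bytes: bytes, sample_count: int = 3) -> dict:
    """Build the JSON body requesting up to three captions for one image."""
    return {
        "instances": [
            {
                "image": {
                    # The image is sent inline as base64-encoded bytes.
                    "bytesBase64Encoded": base64.b64encode(image_bytes).decode("utf-8")
                }
            }
        ],
        # sampleCount controls how many candidate captions come back (max 3).
        "parameters": {"sampleCount": sample_count, "language": "en"},
    }

body = build_caption_request(b"\x89PNG...", sample_count=3)
print(json.dumps(body, indent=2))
```

The body would then be POSTed to `ENDPOINT` with an OAuth bearer token; the response carries the candidate captions in its `predictions` array.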

Overall, the results are reasonable, if not especially descriptive. The API appears to combine visual assessment with OCR text extraction to generate its final descriptions. This means that when an image contains text, that text tends to dominate the description, even when a better caption would focus on the visuals of the image. Captions are entirely lowercase and contain no punctuation, such as "the routes of trade between china mongolia and russia", meaning they may be more difficult for downstream NLP tools to process. The API also prioritizes text from top to bottom in the image, so text that appears towards the top will be treated as the dominant description over text that appears lower. For technical charts and graphs this can correctly ascribe the chart to its title, but in more complex examples it can mean chart labels or random onscreen text are misinterpreted as the focus of the image. Unsurprisingly, it performs better on common consumer web imagery than on novel news imagery – a consistent theme we've observed with all current multimodal LLMs.
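Because the captions arrive lowercase and unpunctuated, a small normalization pass can help downstream NLP tools that expect sentence-like input. The helper below is a minimal sketch of our own devising, not part of the API: it sentence-cases the caption and appends a period. A fuller pipeline might also restore proper-noun casing (e.g. "china" to "China") with a truecasing model.

```python
def normalize_caption(caption: str) -> str:
    """Sentence-case a raw Imagen caption and add terminal punctuation so
    that sentence splitters, parsers, and NER taggers handle it better."""
    caption = caption.strip()
    if not caption:
        return caption
    # Capitalize the first character only; the rest is left untouched.
    caption = caption[0].upper() + caption[1:]
    # Append a period if the caption lacks terminal punctuation.
    if caption[-1] not in ".!?":
        caption += "."
    return caption

print(normalize_caption("the routes of trade between china mongolia and russia"))
# -> The routes of trade between china mongolia and russia.
```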

In general, the captioning tool offers a basic visual description that can be used in a variety of downstream classification and reasoning tasks, though given its highly concise text-centric labeling, it performs better on user-submitted consumer imagery and well-structured, consistent visuals like charts and graphs than on the more free-form and novel environment of news imagery.

Let's take a look at the API in action, starting with this image titled "Golden_Retriever_Dukedestiny01_drvd" from Wikipedia:

Here are the automatically generated captions:

And let's look at a selection of other images.

Here the model refuses to produce any output due to a policy rejection.

Here again the model refuses to produce output, categorizing the image as a policy violation.

Let's test on a few of the images used to announce GPT-4's multimodal capabilities.

Credit: Barnorama

Credit: Unknown

Credit: Unknown