Automated Image Captioning: Experiments With Google's New Imagen on Vertex AI Generative AI Service

With Imagen on Vertex AI, Google's new image-based generative AI service, now generally available, let's explore how it captions a variety of news and other images that we've used to date in our tests of multimodal LLMs. We'll use its visual captioning capability, which accepts an uploaded image and returns up to three textual captions describing the contents of the image at a high level. In this case we'll use the web interface to the API.
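We used the web interface for everything below, but the same captioning call can also be made programmatically. As a rough sketch only, here is how a request body for it might be assembled. The endpoint pattern, the `imagetext` model name and the `bytesBase64Encoded`, `sampleCount` and `language` field names reflect our understanding of the public Vertex AI REST reference rather than anything tested in this post, and `build_caption_request` is a hypothetical helper name, so check the current documentation before relying on any of it.

```python
import base64
import json

# Assumed endpoint pattern; the project and location values are placeholders.
ENDPOINT_TEMPLATE = (
    "https://{location}-aiplatform.googleapis.com/v1/projects/{project}"
    "/locations/{location}/publishers/google/models/imagetext:predict"
)


def build_caption_request(image_bytes: bytes, sample_count: int = 3,
                          language: str = "en") -> dict:
    """Build a JSON-serializable captioning request body (hypothetical helper)."""
    # The service returns up to three captions, so reject anything outside 1..3.
    if not 1 <= sample_count <= 3:
        raise ValueError("sample_count must be between 1 and 3")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "instances": [{"image": {"bytesBase64Encoded": encoded}}],
        "parameters": {"sampleCount": sample_count, "language": language},
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    body = build_caption_request(b"placeholder image bytes")
    print(ENDPOINT_TEMPLATE.format(location="us-central1", project="my-project"))
    print(json.dumps(body)[:80] + "...")
```

The body would then be POSTed to the endpoint with an OAuth bearer token. That step is left out here because it requires project credentials.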

Overall, the results are reasonable, if not especially descriptive. The API appears to combine visual assessment with OCR text extraction to generate its final descriptions. As a result, when an image contains text, that text tends to dominate the description, even when a better caption would focus more on the visuals of the image. Captions are entirely lowercase and largely omit sentence punctuation such as commas, as in "the routes of trade between china mongolia and russia", which may make them more difficult for downstream NLP tools to process. It also appears to prioritize text from top to bottom, meaning text towards the top of the image is treated as the dominant description over text that appears lower. For technical charts and graphs this can correctly tie the chart to its title, but in more complex examples it can mean that chart labels or incidental onscreen text are misinterpreted as the focus of the image. Unsurprisingly, it performs better on common consumer web imagery than on novel news imagery, a consistent theme we've observed with all current multimodal LLMs.
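Because so many of the captions below follow a "visual description that says onscreen text" pattern, one simple downstream step is to separate the two halves before further processing. The following is a minimal sketch under that assumption. The helper names `tidy_caption` and `split_caption` are our own, and the "that says" heuristic is based only on the examples in this post, not on any documented output format.

```python
import re
from typing import Tuple

# Heuristic: the captions in this post introduce OCR text with "that says".
_SAYS = re.compile(r"\s+that says\s+", re.IGNORECASE)
# Some captions leave a space before punctuation, e.g. "taxi ."
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,;:!?])")


def tidy_caption(caption: str) -> str:
    """Remove stray spaces before punctuation and trim the ends."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", caption).strip()


def split_caption(caption: str) -> Tuple[str, str]:
    """Split a caption into (visual description, quoted onscreen text).

    The second element is an empty string when no "that says" is present.
    """
    parts = _SAYS.split(caption.strip(), maxsplit=1)
    visual = parts[0]
    text = parts[1] if len(parts) == 2 else ""
    # Drop the surrounding double quotes and padding the model sometimes adds.
    text = text.strip().strip('"').strip()
    return visual, text


if __name__ == "__main__":
    example = 'a news studio with a man on the screen that says " the biden admin promised to "'
    print(split_caption(example))
    print(tidy_caption("a man is ironing clothes on the back of a yellow taxi ."))
```

Captions with no onscreen text pass through unchanged as the visual half, so the same function can be applied to every caption.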

In general, the captioning tool offers a basic visual description that can be used in a variety of downstream classification and reasoning tasks, though given its highly concise, text-centric labeling, it performs better on user-submitted consumer imagery and on well-structured, consistent visuals like charts and graphs than on the more free-form and novel environment of news imagery.

Let's take a look at the API in action, starting with this image titled "Golden_Retriever_Dukedestiny01_drvd" from Wikipedia:

Here are the automatically generated captions:

  • a dog wearing a blue collar is standing in the grass
  • a dog with a blue collar is standing in the grass
  • a dog wearing a blue collar is standing in a field
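The three captions above differ by only a word or two, and a downstream task will often want just one of them. One possible approach, sketched here as an illustration and not something the API itself does, is to keep the caption that shares the most words with the others, using Jaccard similarity over whitespace-separated tokens. The function names are our own.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consensus_caption(captions: List[str]) -> str:
    """Return the caption with the highest mean similarity to the others.

    Ties are resolved in favor of the earliest caption in the list.
    """
    if not captions:
        raise ValueError("need at least one caption")
    token_sets = [set(c.split()) for c in captions]

    def mean_similarity(i: int) -> float:
        sims = [jaccard(token_sets[i], t)
                for j, t in enumerate(token_sets) if j != i]
        return sum(sims) / len(sims) if sims else 1.0

    best = max(range(len(captions)), key=mean_similarity)
    return captions[best]


if __name__ == "__main__":
    dog = [
        "a dog wearing a blue collar is standing in the grass",
        "a dog with a blue collar is standing in the grass",
        "a dog wearing a blue collar is standing in a field",
    ]
    print(consensus_caption(dog))
```

For the golden retriever captions this selects the first one, which shares "wearing" with the third caption and "the grass" with the second.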

And let's look at a selection of other images.

Here the model refuses to produce any output due to a policy rejection.

  • a boy holding a blue balloon and a sign that says ministerstvo zdrave srbije
  • a boy holding a blue balloon and a sign that says ministerstvo zdrave crne
  • a boy is holding a blue balloon and a sign that says ministerstvo zdrave srbije

Here again the model refuses to produce output, categorizing the image as a policy violation.

  • a russian television show shows a woman talking about russian federations
  • a russian tv show shows a woman talking about russian federations
  • a russian television show shows a woman talking about russian federations being carried out

  • a man sits at a desk in front of a screen that says ms-21 import substitution program
  • a man sits at a desk in front of a computer screen that says ms-21 import substitution program
  • a man sits at a desk in front of a screen that says " yes you yes but no contracts well i 'm "

  • a news report about civilians shelled in donetsk republic russia
  • a news report says civilians shelled donetsk republic russia
  • a news report says civilians shelled in donetsk republic russia

  • a news studio with a man on the screen that says " the biden admin promised to "
  • a newsroom with a man on the screen that says " the biden admin promised to "
  • a news studio with a man on the screen that says " the biden admin promised to sabotage "

  • two men on a fox news show discussing u.s. warnings of renewed russian attacks on ukraine
  • two men on a fox news show discussing u.s. warns of renewed russian attacks on ukraine
  • two men are on a fox news show discussing u.s. warnings of renewed russian attacks on ukraine

  • a large crowd of people gathered in front of the supreme court building in washington d.c.
  • a large crowd of people gathered in front of the supreme court building
  • a large crowd of people gathered in front of the supreme court in washington d.c.

  • a nuclear explosion is being shown on expresso
  • a large explosion in the sky with expresso written on the bottom
  • a nuclear explosion is shown on expresso

  • a map of ukraine is shown on a tv channel
  • a map of ukraine is shown on a 7 hd channel
  • a map of ukraine is shown in a foreign language

  • the front of the supreme court building in washington d.c.
  • the supreme court of the united states of america
  • the supreme court building in washington d.c. with a blue sky in the background

  • a map of the world showing the routes of trade between china mongolia and russia
  • a map of the world showing the routes of trade between china and other countries
  • a map of the world showing various routes between china and other countries

  • a u.s. air force plane is being loaded with people
  • a u.s. air force plane is surrounded by people
  • a u.s. air force plane is surrounded by a crowd of people

Let's test on a few of the images used to announce GPT-4's multimodal capabilities.

  • a phone is plugged into a cable that says elecom on it
  • a phone is plugged into a cable that says elecom
  • a phone is plugged into a vga cable

Credit: Barnorama

  • a man is ironing clothes on the back of a yellow taxi .
  • a man is ironing clothes on the back of a taxi .
  • a man is ironing clothes on the back of a yellow suv .

Credit: Unknown

  • a tray of fried chicken in the shape of the earth
  • a tray of fried chicken in the shape of a world map
  • a tray of fried chicken in the shape of a map of the world

Credit: Unknown

  • a drawing of a stick figure talking about statistical learning
  • a cartoon of a stick figure talking about statistical learning
  • a drawing of a stick figure talking about statistical learning and neural networks

  • a graph showing daily meat consumption per person in 1997
  • a graph showing the daily meat consumption per person in 1997
  • a graph shows the daily meat consumption per person in 1997

  • a graph showing the consumption of corn worldwide in 2021 and 2022 by country in million bushels
  • a graph showing the consumption of corn worldwide in 2021 and 2022
  • a graph showing the consumption of corn worldwide in 2021-2022 by country in million bushels