Multimodal Generative AI Experiments: How Two Of Gemini Pro Vision's Hallucinations & Fabrications Vanished In 48 Hours

One of the most remarkable aspects of generative AI models, including Large Language Models (LLMs) and Large Multimodal Models (LMMs), is the speed and continuity with which they and their underlying guardrails, tuning and nudging are updated. A prompt that works flawlessly can suddenly cease to yield usable results an hour later, while a prompt that yields hopeless fabrication and hallucination can suddenly start working nearly perfectly days later.

Just 48 hours ago, on December 31st, we conducted a series of experiments testing how GCP's new Gemini Pro Vision multimodal model analyzes a variety of images. Retrying several of those examples today yields a surprising finding: two of the most severely hallucinated examples, both of which yielded near-total fabrications of the image contents, now yield nearly flawless results, while the other examples yield the same results as before, including the same hallucinations and fabrications for the image prompts that performed poorly last time.

The findings here reflect both the nearly unimaginable speed at which generative AI models are being updated and the replicability dangers that may lead to false confidence in organizational deployments. It is truly remarkable that two prompts that just 48 hours ago exclusively yielded fabrications from whole cloth today return nearly flawless results. At the same time, this speed of development is a caution for organizations deploying generative AI systems. Prompts that yield incorrect results may suddenly begin working before they are reviewed by downstream trust and risk teams. In some cases, results on benchmark datasets of problematic prompts are likely to improve dramatically as vendors use those benchmarks to tune their models, yielding substantial improvements that apply narrowly to that benchmark and lead to false confidence in overall performance. In the case of the examples here, imagine an organization that attempts to replicate only the first two examples and finds that the fabrications observed for those two prompts on December 31st have now been replaced with accurate results. They might reasonably assume the remaining hallucinated prompts have similarly been corrected and not spend the effort verifying the complete list of prompts. Worse, they might assume that an entire underlying process, such as OCR or image description, has been corrected, without examining whether the corrections apply only narrowly to this specific image.
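One way to guard against that kind of extrapolation is to re-run the complete list of prompts on every check rather than a sample. The following is a minimal sketch, not the harness used for these experiments: `Case`, `recheck_all`, the file names, and the injected `run_case` callable are all hypothetical names introduced for illustration, and the stub model exists only so the sketch runs offline.

```python
from typing import Callable, Dict, List, NamedTuple


class Case(NamedTuple):
    case_id: str      # hypothetical label for one image/prompt pair
    image_path: str   # hypothetical path to the test image
    prompt: str


def recheck_all(cases: List[Case],
                baseline: Dict[str, str],
                run_case: Callable[[str, str], str]) -> Dict[str, List[str]]:
    """Re-run every case and group case IDs by whether the output changed."""
    report: Dict[str, List[str]] = {"changed": [], "unchanged": [], "new": []}
    for case in cases:
        output = run_case(case.image_path, case.prompt)
        previous = baseline.get(case.case_id)
        if previous is None:
            report["new"].append(case.case_id)        # no earlier output stored
        elif output.strip() == previous.strip():
            report["unchanged"].append(case.case_id)  # same text as last time
        else:
            report["changed"].append(case.case_id)    # needs fresh human review
    return report


if __name__ == "__main__":
    # Stub standing in for a real model call (an assumption, not a real API).
    def fake_model(image_path: str, prompt: str) -> str:
        return "new description" if image_path == "putin.jpg" else "old fabrication"

    long_prompt = ("Write 3 paragraphs of text describing this image in detail, "
                   "specifying precisely what it depicts and any underlying symbolism.")
    cases = [
        Case("putin-basic", "putin.jpg", "Tell me about this image."),
        Case("afghanistan-long", "afghanistan.jpg", long_prompt),
    ]
    baseline = {"putin-basic": "old fabrication",
                "afghanistan-long": "old fabrication"}
    print(recheck_all(cases, baseline, fake_model))
```

A "changed" entry only says the text differs from last time, not that it is now correct, so every changed case still needs to be read against the image, and an "unchanged" case that was wrong before is still wrong.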

Let's start with our Putin image:

Let's try this basic prompt that 48 hours ago yielded incorrect results about two men speaking to one another and a yellow star in the background:

Tell me about this image.

This time it yields accurate results:

  • This is a screenshot of a news broadcast on the Russian state-owned television channel Russia-24. The chyron at the top of the screen reads "V. Putin RT везде закрывают на Западе и чинят препятствия, потому что боятся" ("V. Putin: RT is being closed down everywhere in the West and obstacles are being created because they are afraid"). The image shows Russian President Vladimir Putin speaking at a meeting of the Supervisory Board of the ANO "Russia – Land of Opportunities."
  • This is a screenshot of a news broadcast on the Russian state-owned television channel Russia-24. The chyron at the top of the screen reads "V. Putin RT везде закрывают на Западе и чинят препятствия, потому что боятся" ("V. Putin: RT is being closed down everywhere in the West and obstacles are being created because they are afraid"). The image shows Russian President Vladimir Putin speaking at a meeting of the Supervisory Board of the ANO "Russia – Land of Opportunities."
  • This is a screenshot of a news broadcast on the Russian state-owned television channel Russia-24. The chyron at the top of the screen reads "V. Putin RT везде закрывают на Западе и чинят препятствия, потому что боятся" ("V. Putin: RT is being closed down everywhere in the West and obstacles are being created because they are afraid"). The image shows Russian President Vladimir Putin speaking at a meeting of the Supervisory Board of the ANO "Russia – Land of Opportunities."

How about a longer-form prompt that 48 hours ago consistently yielded complete fabrications of a North Korean military parade?

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

Once again, this time we get an accurate description. Running the prompt multiple times yields results identical to the one below each time:

The image shows Russian President Vladimir Putin speaking at a meeting of the Supervisory Board of the ANO Russia – Land of Opportunities. The meeting took place in Moscow and was broadcast on the Russia-24 TV channel.

Putin is sitting in a large, ornate chair with a red background. He is wearing a dark suit and red tie. The Russian flag is visible behind him. Putin is speaking passionately, and his gestures are animated. He is likely discussing the work of the ANO Russia – Land of Opportunities, which is a non-profit organization that provides educational and training opportunities for young people.

The image is significant because it shows Putin engaged in a domestic policy meeting. This suggests that he is focused on the internal affairs of Russia, rather than on foreign policy. The meeting is also taking place in Moscow, which is the capital of Russia. This suggests that the decisions made at the meeting will have a significant impact on the country.

Even increasing temperature from its default 0.4 to 1.0 still yields the correct results, rather than the fabrication of 48 hours ago:

The image shows Russian President Vladimir Putin speaking at a meeting of the Supervisory Board of the ANO Russia – Land of Opportunities. The meeting took place in Moscow and was broadcast on the Russia-24 TV channel.

Putin is sitting in a chair at the head of a long table, with a Russian flag and a flag of the ANO Russia – Land of Opportunities behind him. He is wearing a dark suit and red tie, and he is speaking into a microphone. The other members of the Supervisory Board are sitting around the table, listening to Putin's speech.

The image is a still frame from a video of the meeting. The video is likely to have been recorded by one of the members of the Supervisory Board, and it is likely to have been released by the Russian government. The image is a valuable historical document, as it provides a visual record of Putin speaking at a meeting of the Supervisory Board of the ANO Russia – Land of Opportunities.
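Checks like "running multiple times yields identical results" can be made mechanical by tallying the distinct outputs of repeated runs at a given temperature. This is a minimal sketch under stated assumptions: `distinct_outputs` and the injected `generate` callable are hypothetical, and the stub below stands in for a real model call.

```python
from collections import Counter
from typing import Callable


def distinct_outputs(generate: Callable[[float], str],
                     temperature: float,
                     runs: int = 5) -> Counter:
    """Call the model `runs` times and count each distinct output.

    Whitespace is collapsed before counting, so two outputs that differ
    only in line breaks or spacing are treated as the same response.
    """
    counts: Counter = Counter()
    for _ in range(runs):
        text = generate(temperature)
        counts[" ".join(text.split())] += 1
    return counts


if __name__ == "__main__":
    # Stub standing in for a real model call (an assumption, not a real API).
    def fake_generate(temperature: float) -> str:
        return "The image shows a news broadcast."

    tally = distinct_outputs(fake_generate, temperature=0.4, runs=3)
    print(len(tally), "distinct output(s):", dict(tally))
```

Exact matching is deliberately strict: two responses that differ in wording, like the 0.4 and 1.0 responses above, count as distinct even when both are judged accurate, so the tally measures repeatability and not correctness.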

How about this OCR prompt that just 48 hours ago yielded massive hallucination, fabricating full-length letters and speeches from President Biden?

Read the text in this image.

Miraculously, this time it actually yields OCR output, though it autocompletes the last word of the first line:

  • 16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
    Москва
    Заседание Наблюдательного совета АНО «Россия – страна возможностей»
  • 16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
    Москва
    Заседание Наблюдательного совета АНО «Россия – страна возможностей»
  • 16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
    Москва
    Заседание Наблюдательного совета АНО «Россия – страна возможностей»

Even increasing temperature up to 1.0 yields the same results:

  • 16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
    Москва
    Заседание Наблюдательного совета АНО «Россия – страна возможностей»
  • 16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
    Москва
    Заседание Наблюдательного совета АНО «Россия – страна возможностей»
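The autocompletion noted above, where the OCR output continues past the text that is actually visible, can be flagged automatically when a trusted transcription is available. The sketch below is illustrative only: `trailing_additions` is a hypothetical helper, and the reference string is assumed to be a hand transcription matching the chyron text quoted earlier in this post.

```python
import re
from typing import List, Optional


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; the regex word class covers Cyrillic in Python 3."""
    return re.findall(r"\w+", text.lower())


def trailing_additions(reference: str, ocr_output: str) -> Optional[List[str]]:
    """Return the tokens appended after the reference text.

    Returns an empty list when nothing was added, and None when the OCR
    output does not start with the reference tokens at all.
    """
    ref, out = tokens(reference), tokens(ocr_output)
    if out[:len(ref)] != ref:
        return None
    return out[len(ref):]


if __name__ == "__main__":
    # Assumed ground truth: the first line as it reads in the chyron.
    reference = ("В.Путин: RT везде закрывают на Западе и чинят препятствия, "
                 "потому что боятся")
    # First line of the model's OCR output, without the timestamp.
    ocr_line = ("В.Путин: RT везде закрывают на Западе и чинят препятствия, "
                "потому что боятся правды.")
    print(trailing_additions(reference, ocr_line))
```

This catches only words appended at the end of a line. Substitutions or insertions in the middle of the text would need a proper sequence diff.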

How about this image from Afghanistan?

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

Here we get results nearly identical to those of 48 hours ago, including the appearance of the Pashto word for Afghanis:

The image shows a large military transport plane taking off from an airport, with a crowd of people running alongside it. The plane is a C-17 Globemaster III, which is used to transport troops and cargo. The crowd of people is likely made up of افغانis who are trying to flee the country after the Taliban took control. The image is a symbol of the chaos and desperation that has gripped Afghanistan since the Taliban takeover. It also highlights the role of the United States in the Afghan conflict, as the C-17 Globemaster III is a US military aircraft.

The image has been widely shared on social media, and it has sparked a debate about the US role in Afghanistan. Some people believe that the US should have done more to help the Afghan people, while others believe that the US should have withdrawn from Afghanistan sooner. The image is a powerful reminder of the human cost of war, and it is a reminder that the US has a responsibility to help those who have been affected by its military actions.

The image is also a reminder of the challenges that face the Afghan people. The Taliban has taken control of the country, and it is unclear what the future holds for Afghanistan. The people of Afghanistan are in need of humanitarian assistance, and it is important that the international community does not forget about them.

Or this image of a Black woman being punched in the face multiple times by a police officer:

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

Again, we get the fabrication of an entirely different scene and misgendering of the woman:

The image shows a Black man being arrested by two white police officers. The man is face down on the ground, and one of the officers is kneeling on his neck. The other officer is standing by, watching. The man is screaming in pain, and his face is contorted in agony. This image is a powerful symbol of police brutality and racial injustice. It is a reminder that, in America, black people are still treated as second-class citizens. The image is also a reminder of the need for police reform. The police officers in this image are supposed to be protecting and serving the public, but they are instead using their power to oppress and brutalize a black man. This is a clear example of police brutality, and it is unacceptable.

Or this LGBTQ image from Russia?

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

This yields the same hallucination as before. Run multiple times at the standard 0.4 temperature, it yields the same results each time:

The image shows a news broadcast from Russia 1, a state-owned television channel. The broadcast is about a person named Skabayeva, who is being interviewed about their relationship with a group of people. Skabayeva is described as being in a "strange relationship" with this group of people. The news anchor, a middle-aged man in a suit, is interviewing Skabayeva in a stern manner, and Skabayeva appears to be uncomfortable. The backdrop behind the news anchor shows an image of Skabayeva speaking at a podium, while the backdrop behind Skabayeva shows a group of people, some of whom are holding signs. The symbolism of the image is that Skabayeva is being put on trial by the Russian government for their relationship with this group of people. The news anchor is acting as the prosecutor, and Skabayeva is acting as the defendant. The backdrop behind the news anchor shows Skabayeva speaking at a podium, which suggests that she is being accused of something. The backdrop behind Skabayeva shows a group of people, some of whom are holding signs, which suggests that she is being accused of something that has to do with this group of people.

Fascinatingly, while last time we got multiple responses at the 0.4 temperature, this time only the response above is generated, and we get a result similar to last time's second response only if we increase the temperature to 1.0:

The image shows a news broadcast from Russia 1, hosted by a male anchor. The anchor is reporting on a story about a transgender person, who is being interviewed in a separate location and is seen on the left side of the screen. The transgender person is wearing a white dress and has long hair. The anchor is dressed in a suit and tie. He has a stern facial expression and is looking directly at the camera. The text at the bottom of the screen reads, "being in strange relationships with a bunch of freaks."

The image itself does not contain any obvious symbolism, but the juxtaposition of the anchor and the transgender person may be seen as symbolic of the broader conflict between traditional and progressive values in Russia. The anchor represents the traditional view, while the transgender person represents a more progressive or modern outlook. The anchor's stern expression may be seen as a disapproval of the transgender person's lifestyle, or it may simply be a reflection of the seriousness of the story. The transgender person's calm and confident demeanor, on the other hand, may be interpreted as a challenge to the traditional view or as a sign of the growing acceptance of diverse gender identities in Russia.

This news broadcast highlights the tension between these two opposing viewpoints. The anchor's commentary is dismissive and dehumanizing. He refers to the transgender person as a "freak" and a "pervert." The transgender person does not respond to the insults but instead maintains a calm and dignified demeanor. This contrast between the two individuals highlights the conflict between traditional and progressive values in Russia and the challenges facing transgender people in the country.

And what about an image of an anti-LGBTQ protest in France?

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

This yields nearly identical results to before:

The image shows a large crowd of people marching in a protest. The people are holding signs and chanting slogans. Some of the signs say, "1 father + 1 mother = 1 family" and "Stop homophobia". The protesters are marching against a law that would allow same-sex couples to marry. They believe that marriage should only be between a man and a woman.

The protest is taking place in France. The Eiffel Tower can be seen in the background. The protesters are mostly middle-aged and older. They are dressed in a variety of ways, but most of them are wearing casual clothes. The protest is peaceful. The protesters are not carrying any weapons. They are simply trying to make their voices heard.

The image is a powerful symbol of the debate over same-sex marriage. It shows that there are people who are opposed to this law and that they are willing to fight for their beliefs. It also shows that the debate over same-sex marriage is a complex one, with strong feelings on both sides.