Multimodal Generative AI Experiments: GPT-4 Vs Gemini Pro Vision Describing TV News, Images & Ukraine War Maps

The era of text-only Large Language Models (LLMs) is increasingly giving way to Large Multimodal Models (LMMs): both OpenAI's GPT-4 and GCP's Gemini Pro Vision support multimodal analysis in which imagery and text can be combined into a single prompt to allow for advanced interactive rich still image analysis. Public demonstrations of both models by their respective companies exhibit nearly superhuman ability to richly understand, describe and answer questions about images from across nearly every imaginable domain and topic. How do the models perform in real life on real world content? Here we will explore how GPT-4, Gemini Pro and GCP's Visual Question Answering (VQA) system perform on a range of content, from global television news footage to online news imagery to a sequence of maps of the Ukraine war, with a range of tasks from description to OCR to geospatial assessment.

The end result is that despite both companies lauding their model's textual extraction capabilities, neither model is suitable for production OCR, exhibiting substantial hallucination and false autocompletion. At this point, workflows that require high-accuracy extraction of text from images should continue to use classical OCR systems. Overall, GPT-4 outperforms Gemini Pro Vision by several orders of magnitude in descriptive tasks. Critically, GPT-4 sticks with clinical descriptions of the images, relying solely on the information present in the image and uses its world knowledge exclusively to contextualize and interpret what it sees. In contrast, Gemini uses the images merely as "seeds" or reference points to its training data and typically devolves to merely recounting its training data, rather than describing what is actually in the image. The end result is that in a majority of the experiments below, Gemini alternates between hallucinating substantial key detail and fabricating its description out of whole cloth. Despite substantially higher performance, GPT-4 also suffers from a surprisingly high error rate given that many of the tasks below were explicitly designed to replicate the superhuman performance use cases OpenAI has touted in their demonstrations and papers. Both models struggle immensely to interpret maps, failing to perform even the most basic tasks of reading legends and associating their colors and symbolism with the corresponding label, despite GPT-4 being actively touted by the LMM community for legend-figure interpretation and spatial reasoning tasks. Even when the model is explicitly prompted with a precise template designed to give it every chance to interpret a map, both models fail to accurately capture their meaning.

Overall, the findings here suggest that for OCR tasks, LMMs are no replacement for classical OCR models, for descriptive tasks GPT-4 offers reasonable performance while Gemini largely hallucinates/fabricates detail and for advanced technical tasks like map reading, neither model achieves even rudimentary accuracy despite all of the public hype and hyperbole promoting them for precisely this use case.

Let's start with the screenshot below from July 19th on Russia 24.

Let's start by using a basic description prompt with the image:

Tell me about this image.

GPT-4 yields the following, which suggests that a guardrail is preventing it from identifying the individual as Putin or offering additional detail:

The image appears to be a screenshot from a television broadcast by the Russian state-controlled network, Russia 24. The individual in the picture is speaking into a microphone, and there's a red background behind with some text in Cyrillic, which suggests the setting is likely formal and related to a Russian context or event. The text at the top includes a time stamp and possibly a name, which is a common feature in news broadcasts to identify the speaker or the context of the speech. There is also text overlay at the bottom of the screen, which is typically used for news tickers or additional information about the ongoing segment or current events.

Due to the content policy, I cannot provide a personal identification of the individual in the image or any specific political commentary. If you have any more specific questions about the components of the image or the nature of television broadcasting, feel free to ask!

GCP's Visual Question Answering (VQA) model refuses to provide any output at all:

Some responses were filtered out because they violated Google's Responsible AI guidelines.

Gemini Pro Vision with a temperature of 0.4 (max tokens of 2,048, Top-K at the default setting of 32 and Top-P at its default of 1.0) offers the same incorrect result three times in a row:

  • The image shows two men in suits sitting at a table and looking at each other. The background is red with a yellow star in the middle.
  • The image shows two men in suits sitting at a table and looking at each other. The background is red with a yellow star in the middle.
  • The image shows two men in suits sitting at a table and looking at each other. The background is red with a yellow star in the middle.

Increasing temperature to 1.0 also yields incorrect results:

  • The image shows two men in suits sitting at a table with microphones in front of them. The background is a red curtain with a blue and gold border. The men are looking at each other.
  • The image shows two men in suits sitting at a table with microphones in front of them. The background is a red curtain with a blue and gold border. The men are looking at each other.
  • The image shows two men in suits sitting at a table with microphones in front of them. The background is a red curtain with a blue and gold border. The men are looking at each other.

How about an alternative formulation that nudges the model towards a long-form response with detail?

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

GPT-4 provides the unremarkable but relatively accurate description:

The image depicts a still frame from a television broadcast on the Russian state-controlled network, Russia 24. At the center of the image is a middle-aged man, speaking into a microphone, his expression neutral and engaged. The setting suggests a formal environment, with the man seated in front of a stark, bold backdrop divided into two vertical fields: one white and one red, mirroring the colors of the Russian flag. The man's attire is not visible, but the setting implies an official or state-related context. The timestamp in the upper right corner reads "16:16," suggesting the live nature of the broadcast at the time of the screenshot.

Overlaying the image are several lines of text in the Cyrillic script, which provide context and content to the broadcast. The text in the upper right identifies the speaker with the initials "В.ПУТИН" (V.PUTIN), a name significant in the Russian political sphere. Below this, a news ticker runs across the screen, declaring that "RT is being shut down everywhere in the West and facing obstacles, because they are afraid of competition." This statement likely refers to the perceived challenges faced by the Russian international television network, RT, in Western countries, suggesting a narrative of ideological competition or suppression.

The bottom portion of the screen bears text indicating the event being covered: "Meeting of the Supervisory Council of the ANO 'Russia – the Country of Opportunities'." This message is set against a more subdued backdrop, providing a contrast to the vivid red above. The mention of the Supervisory Council and "Russia – the Country of Opportunities" hints at a state-supported initiative or organization, suggesting themes of national development and potential. The overall composition of the image, with its interplay of text and symbolism, communicates a moment of political significance, broadcasted to a wider audience, and encapsulates a narrative of national identity and international discourse.

Running a second time yields similar results:

The image is a wide-screen capture from a television broadcast, specifically from the Russian state-controlled news channel, Russia 24. The key figure in the image is a man, presumably a high-ranking official or person of note, given the prominent placement and the naming in the broadcast. He is captured mid-speech, leaning slightly towards a microphone, indicative of an ongoing dialogue or address. The man's expression is serious and composed, which, along with the formal attire barely visible at the bottom of the frame, suggests the gravity of the occasion. Behind him, a Russian flag forms a vivid backdrop; half of the screen is covered by the flag's white band, and the other half by the red, symbolizing the nation's flag and perhaps, by extension, its national pride and sovereignty.

The overlay text, rendered in Cyrillic characters, provides additional context. In the upper right corner, the time "16:16" is displayed alongside the name "В.ПУТИН" (V.PUTIN), which, although the policy restricts from specifying identities, is typically associated with a prominent public figure in Russia. The text below this, running across the screen as part of a news ticker, reads: "RT everywhere are closed in the West and create obstacles because they are afraid of competition." This sentence likely comments on the geopolitical narrative concerning the Russian state-funded international television network RT and its reception in Western countries. The ticker implies a stance of confrontation and defense against what is suggested to be a biased treatment due to competitive pressures.

The bottom of the image contains additional text that states, "Meeting of the Supervisory Board of the ANO 'Russia – a Country of Opportunities'". This text is less prominent than the ticker above but indicates the content or context of the event being broadcast. It suggests that the address being delivered is part of a formal meeting or conference, relating to an organization with a focus on national development or opportunity. This part of the image, while less visually striking, underpins the overarching theme of the broadcast – one of national narrative and international dynamics, a visual representation of Russia's self-portrayal in the face of global challenges.

The same prompt with Gemini Pro Vision yields complete hallucination:

The image shows a North Korean military parade. There are soldiers marching in formation, tanks, and missiles on display. The North Korean flag is waving in the background. The parade is taking place in Pyongyang, the capital of North Korea. It is a show of strength by the North Korean regime and a way to intimidate its enemies.

The parade is a carefully orchestrated event. The soldiers are all dressed in the same uniforms and they march in perfect unison. The tanks and missiles are all the latest models and they are a reminder of the North Korean regime's military power. The flags are a symbol of the regime's authority and they are a way to show the world that North Korea is a proud and independent nation.

The parade is a reminder of the North Korean regime's totalitarian nature. The people of North Korea are not allowed to express their own opinions and they are forced to live in a state of constant fear. The regime uses the parade to show the world that it is strong and that it will not tolerate any dissent.

Running a second time yields a nearly identical hallucination:

The image shows a North Korean military parade. There are soldiers marching in formation, tanks, and missiles on display. The soldiers are all wearing the same uniforms and they are all carrying rifles. The tanks are painted green and they have the North Korean flag on them. The missiles are also painted green and they have the North Korean flag on them. In the background, there is a large banner that has the words "The Workers' Party of Korea" written on it. There is also a picture of Kim Il-sung on the banner. The parade is taking place in a large square in Pyongyang. There are people standing on the sidelines watching the parade. The people are all wearing the same clothes and they are all waving North Korean flags.

The image is a symbol of the strength of the North Korean military. It is also a symbol of the power of the Workers' Party of Korea. The parade is a way for the North Korean government to show its people and the world that it is a strong and powerful country.

The image is also a reminder of the threat that North Korea poses to the world. North Korea is a nuclear-armed state and it has a large and powerful military. The North Korean government is also very hostile towards the United States and its allies. The image is a warning that North Korea is a dangerous country that should not be taken lightly.

How about a far more rudimentary task: extracting the text from the image? The results above suggest that GPT-4 performs reasonably well at this task and Gemini Pro Vision explicitly touts its OCR capabilities, so let's test in more detail.

State of the art classical OCR yields the following:

16:16 В.ПУТИН RT везде закрывают на Западе и чинят препятствия, потому что боятся пр.  РОССИЯ 24
Москва                                                                               ПРЯМОЙ ЭФИР
Заседание Наблюдательного совета АНО «Россия страна возможностей»

Google Translate translates this as:

16:16 V. PUTIN RT: They close everything in the West and create obstacles because they are afraid. RUSSIA 24
Moscow                                                                                                  LIVE
Meeting of the Supervisory Board of the ANO "Russia Land of Opportunities"

Let's test the following prompt to perform this same task using LMMs:

Read the text in this image.

GPT-4 transcribes it largely correctly, but hallucinates the last word of the ticker to be "конкуренции" whereas in the ticker the first two letters of the word are "пр":

The text in the image is in Cyrillic and it reads as follows:

    • At the top: "16:16 В.ПУТИН"
    • The ticker text in the middle: "RT везде закрывают на Западе и чинят препятствия, потому что боятся конкуренции"
    • At the bottom: "Заседание Наблюдательного совета АНО «Россия – страна возможностей»"

How about a second time?

The text in the image is in Cyrillic script and can be transcribed as follows:

    • In the upper right corner: "16:16 В.ПУТИН"
    • Below that: "RT везде закрывают на Западе и чинят препятствия, потому что боятся конкуренции"
    • At the bottom: "Заседание наблюдательного совета АНО «Россия – страна возможностей»"

And a third time:

The text displayed in the image reads as follows:

    • In the upper right corner: "16:16 В.ПУТИН"
    • Below that, in the news ticker: "RT везде закрывают на Западе и чинят препятствия, потому что боятся конкуренции"
    • At the bottom: "Заседание Наблюдательного совета АНО «Россия – страна возможностей»"

These inscriptions are associated with the broadcast, indicating the time, the person being shown (V. Putin), a statement about RT facing closures and obstacles in the West due to fear of competition, and a mention of a meeting of the Supervisory Council of the organization "Russia – the Country of Opportunities."

All three runs yield the same hallucinated last word. Even alternative prompts like "Transcribe the text in the image as-is." yield the same results.

How about Gemini Pro Vision? GCP's recommended OCR prompt is "Read the text in this image." This yields the following text hallucinated from whole cloth:

THE WHITE HOUSE
WASHINGTON

July 11, 2022

Dear [Name]:

I am writing to inform you that I have nominated you to be the United States Ambassador to [Country]. I am confident that you will serve our country with distinction in this important role.

As you know, the United States and [Country] share a strong and enduring partnership. Our two countries have worked together to promote peace and security in the region, and to advance our shared values of democracy, human rights, and economic opportunity. I am confident that you will build on this strong foundation and further strengthen our relationship with [Country].

You have a wealth of experience and expertise that will serve you well in this role. You have a deep understanding of [Country]'s culture and politics, and you have a proven track record of building relationships with foreign leaders and officials. I am confident that you will be able to effectively represent the United States and advance our interests in [Country].

I know that you will also be a strong advocate for the American people. You will work tirelessly to protect their interests and to ensure that they have the opportunity to succeed. I am confident that you will be a valuable asset to my Administration and to our country.

I look forward to working with you in the months and years ahead. Please accept my sincere congratulations on your nomination.

Sincerely,

JOSEPH R. BIDEN JR.

How about a second time:

THE WHITE HOUSE
WASHINGTON

July 11, 2022

Dear [Name]:

I am writing to inform you that I have nominated you to be the United States Ambassador to [Country]. I am confident that you will serve our country with distinction in this important role.

As you know, the United States and [Country] share a strong and enduring partnership. Our two countries have worked together to promote peace and security in the region, and to advance our shared values of democracy, human rights, and economic opportunity. I am confident that you will build on this strong foundation and further strengthen our relationship with [Country].

You have a wealth of experience and expertise that will serve you well in this role. You have a deep understanding of [Country]'s culture and politics, and you have a proven track record of building relationships with foreign leaders and officials. I am confident that you will be able to effectively represent the United States and advance our interests in [Country].

I know that you will also be a strong advocate for the American people. You will work tirelessly to protect their interests and to ensure that they have the opportunity to succeed. I am confident that you will be a valuable asset to my Administration and to our country.

I look forward to working with you in the months and years ahead. Please accept my sincere congratulations on your nomination.

Sincerely,

JOSEPH R. BIDEN JR.

And a third time:

THE WHITE HOUSE
WASHINGTON

July 11, 2022

Dear [Name]:

I am writing to inform you that I have nominated you to be the United States Ambassador to [Country]. I am confident that you will serve our country with distinction in this important role.

As you know, the United States and [Country] share a strong and enduring partnership. Our two countries have worked together to promote peace and security in the region, and to advance our shared values of democracy, human rights, and economic opportunity. I am confident that you will build on this strong foundation and further strengthen our relationship with [Country].

You have a wealth of experience and expertise that will serve you well in this role. You have a deep understanding of [Country]'s culture and politics, and you have a proven track record of building relationships with foreign leaders and officials. I am confident that you will be able to effectively represent the United States and advance our interests in [Country].

I know that you will also be a strong advocate for the American people. You will work tirelessly to protect their interests and to ensure that they have the opportunity to succeed. I am confident that you will be a valuable asset to my Administration and to our country.

I look forward to working with you in the months and years ahead. Please accept my sincere congratulations on your nomination.

Sincerely,

JOSEPH R. BIDEN JR.

The fourth time we finally get our results. However, it drops the "РОССИЯ 24" ("Russia 24"). Like GPT-4, it attempts to autocomplete the final word in the first sentence. Unlike GPT-4, it gets it correct, accurately completing it to "правды", but this is not desirable for an OCR system, since in this case it was only able to complete the word correctly because its training data includes the original broadcast:

16:16 В.Путин: RT везде закрывают на Западе и чинят препятствия, потому что боятся правды.
Москва
Заседание Наблюдательного совета АНО «Россия – страна возможностей»

Yet the fifth time we get a completely new hallucination utterly fabricated from whole cloth:

THE WHITE HOUSE
WASHINGTON

July 11, 2022

REMARKS BY PRESIDENT BIDEN, PRESIDENT XI, AND SECRETARY BLINKEN BEFORE THEIR BILATERAL MEETING

South Court Auditorium

10:08 A.M. EDT

PRESIDENT BIDEN: (As interpreted.) Good morning, everyone. (In English.) Good morning. (As interpreted.) It’s good to see you, Mr. President. Welcome. (In English.) Please. Have a seat. (As interpreted.) Thank you. (In English.) Thank you. (As interpreted.) Thank you very much. (In English.) Thank you. (As interpreted.) I think it’s fair to say that, over the last few years, U.S.-China relations have been put to the test. We’ve had some tough times. But I believe that our two countries share important responsibilities to manage our differences responsibly, avoid conflict, and work together where our interests intersect.

I want to be direct and straightforward. There’s no substitute for face-to-face dialogue between leaders. That’s why I wanted to have this meeting with you today.

I believe we have an opportunity to build a more stable and constructive relationship between the United States and the PRC. But it will take hard work from both sides.

As I’ve said before, I welcome the stiff competition between our two countries. I think it’s healthy. But I also believe we need to establish some common-sense guardrails to ensure that our competition doesn’t veer into conflict and that we’re able to work together on issues where we share common interests, like climate change.

The United States and China have many areas where we can work together. For example, we have a shared interest in preventing the spread of nuclear weapons. We have a shared interest in addressing the climate crisis. And we have a shared interest in promoting global health security.

I know our system of government is different from yours. But I believe we can find common ground and work together to address these common challenges.

I’m also committed to managing our competition responsibly. I don’t want it to become a conflict. I’m sure you don’t either. And I believe we can find ways to do that.

Of course, we’ll continue to compete vigorously. But I believe we can do so in ways that don’t threaten each other or the rest of the world.

I hope today’s meeting will be the beginning of a more candid and constructive dialogue between our two countries. Thank you.

PRESIDENT XI: (As interpreted.) Mr. President, thank you. (In English.) Thank you. (As interpreted.) Thank you very much. (In English.) Thank you. (As interpreted.) I’m very pleased to meet you again, Mr. President. (In English.) It’s nice to see you. (As interpreted.) And I’m very happy to have this opportunity to have an in-depth exchange of views with you on issues of strategic importance in China-U.S. relations, as well as major regional and international issues.

Over the past 50 years since the Shanghai Communiqué was issued, China and the United States have made important progress in developing relations. This has not only delivered enormous benefits to our two peoples, but also contributed to peace, stability, and development in the Asia-Pacific region and beyond.

However, for some time now, the United States has regarded China as its primary competitor and has taken a series of actions that have seriously undermined China’s interests. This has led to serious difficulties in our relations.

I believe that China and the United States should respect each other, coexist in peace, and pursue win-win cooperation. Cooperation is the only correct choice for China and the United States. As the world’s two largest economies, we have a particular responsibility to work together to address various global challenges and promote world peace and development.

I hope that today’s meeting will be candid, in-depth, and fruitful. Thank you.

SECRETARY BLINKEN: (As interpreted.) Thank you, Mr. President. (In English.) Thank you. (As interpreted.) And thank you, President Xi. (In English.) It’s good to see you. (As interpreted.) I’m very pleased to be here with you today. (In English.) Thank you. (As interpreted.) I think it’s clear that the world expects the United States and China to manage our relationship responsibly. That’s what we’re committed to doing. And I believe today’s conversation can put us on a more constructive path. Thank you.

PRESIDENT BIDEN: Thank you. (As interpreted.) I think we should get started. Thank you very much. (In English.) Thank you. (As interpreted.) Thank you.

END 10:15 A.M. EDT

How about an iconic image that is almost certainly in the training data of both models with vast reams of associated descriptive text?

Here we'll use the same prompt again:

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

GPT-4 offers an excellent general description of the image based only on what is seen in the image itself:

The image is a still from a video, as indicated by the play bar at the bottom, showing a time stamp of 3 seconds out of 1 minute. The content displayed appears to be a chaotic scene at an airport with a large crowd of people on the tarmac, surrounding a massive military aircraft labeled "AIR FORCE" with the serial number "1109." The aircraft is a grey, large-bodied transport plane, likely a C-17 Globemaster, which is commonly used by the United States Air Force for strategic airlift of troops and cargo. The scene is set against a backdrop of a clear sky and barren, mountainous terrain, suggesting a remote location, possibly in a desert-like environment.

The people in the image are in various states of movement, with some running towards the aircraft, while others seem to be walking away or standing still. The attire and appearance of the individuals suggest a diverse mix, with some in Western-style clothing and others in traditional garments, such as robes or headscarves, which may imply the location is in a region with a significant Muslim population. The sense of urgency and desperation is palpable, as the crowd appears to be in a state of distress, possibly seeking evacuation or assistance from the military aircraft.

Overlaying the image are two banners, one at the top and one at the bottom, bearing the logo "CCTV 13," indicating that the footage is from a Chinese television channel. The text at the bottom of the image provides a critical commentary, stating, "Instead of extending a helping hand, the United States seized overseas assets." This text suggests a narrative of criticism towards the United States, implying that rather than providing support or relief, the country is instead taking control of assets. It is a strong political statement that carries implications of international relations and foreign policy criticism. The symbolism here can be tied to geopolitical narratives and may reflect the broadcasting channel's stance on the actions of the United States in international affairs.

Running a second time yields similarly strong results:

The image presents a snapshot of a tense and chaotic situation at an airfield. In the center of the frame is a large, gray military cargo airplane identified by the inscription "AIR FORCE" along with a visible serial number "1109" near its nose. This aircraft appears to be a model consistent with the United States Air Force's fleet, possibly a C-17 Globemaster III, characterized by its high-wing, four-engine design, and T-tail. It is situated on a flat expanse that looks like an airstrip, under a clear sky. The environment is arid and barren, suggestive of a desert, with mountains faintly visible in the background, hinting at a remote and possibly austere location.

Surrounding the airplane is a large group of people. They are dispersed across the tarmac, with some individuals visibly in motion, as if running towards or alongside the airplane, while others are standing or walking away, adding to the scene's disorder. The clothing of the crowd varies, with some wearing Western-style attire and others in traditional garb, including robes and headscarves, indicating the cultural diversity of the group. The presence of the crowd around a military transport aircraft in such a location suggests a scenario of evacuation or crisis, where the aircraft is possibly a means of escape or relief from a dire situation. The expressions and body language of the people suggest urgency and desperation.

Overlaying the image are blue banners with white text at the top and bottom, indicating that the image is taken from a broadcast by CCTV 13, a state television channel in China. The text at the bottom of the image provides a pointed commentary: "Instead of extending a helping hand, the United States seized overseas assets." This statement, along with the visual elements of the scene, implies a critical narrative about the United States' actions in an international context, insinuating a failure to provide aid or support. The underlying symbolism may be tied to global perceptions of U.S. foreign policy, especially in crisis zones, and reflects the broadcaster's perspective on geopolitical events. The image, therefore, is not just a capture of a moment in time but also a conveyance of a particular viewpoint on international relations and humanitarian issues.

In contrast, Gemini Pro Vision relies entirely on its training data to describe the image. Strangely, the Pashto word "افغان" for "Afghanis" appears in the midst of the text:

The image shows a large military transport plane taking off from an airport, with a crowd of people running alongside it. The plane is a C-17 Globemaster III, which is used to transport troops and cargo. The crowd of people is likely made up of افغانis who are trying to flee the country after the Taliban took control. The image is a symbol of the chaos and desperation that has gripped Afghanistan since the Taliban takeover. It also highlights the role of the United States in the Afghan conflict, as the C-17 Globemaster III is a US military aircraft.

The image has been widely shared on social media, and it has sparked a debate about the US role in Afghanistan. Some people believe that the US should have done more to help the Afghan people, while others believe that the US should have withdrawn from Afghanistan sooner. The image is a powerful reminder of the human cost of war, and it is a reminder that the US has a responsibility to help those who have been affected by its military actions.

The image is also a reminder of the challenges that face the Afghan people. The Taliban has taken control of the country, and it is unclear what the future holds for Afghanistan. The people of Afghanistan are in need of humanitarian assistance, and it is important that the international community does not forget about them.

Running a second time shows the persistence of the Pashto word for Afghanis:

The image shows a large military transport plane taking off from an airport, with a crowd of people running alongside it. The plane is a C-17 Globemaster III, which is used to transport troops and cargo. The crowd of people is likely made up of افغانis who are trying to flee the country after the Taliban took control. The image is a symbol of the chaos and desperation that has gripped Afghanistan since the Taliban takeover. It also highlights the role of the United States in the Afghan conflict, as the C-17 Globemaster III is a US military aircraft.

The image has been widely shared on social media, and it has sparked a debate about the US role in Afghanistan. Some people believe that the US should have done more to help the Afghan people, while others believe that the US should have withdrawn from Afghanistan sooner. The image is a powerful reminder of the human cost of war, and it is a reminder that the US has a responsibility to help those who have been affected by its military actions.

The image is also a reminder of the challenges that face the Afghan people. The Taliban has taken control of the country, and it is unclear what the future holds for Afghanistan. The people of Afghanistan are in need of humanitarian assistance, and it is important that the international community does not forget about them.

How about an image that presses up against model guardrails, showing a Black woman being punched in the face multiple times by a police officer:

We will again use our standard prompt:

Write 3 paragraphs of text describing this image in detail, specifying precisely what it depicts and any underlying symbolism.

GPT-4 offers a strong summary of the image and correctly does not attempt to read past what it can directly observe in the image:

The image presented is a still from a video, as indicated by the play bar at the bottom, which shows the video's length and the current timestamp. It depicts a scene with three individuals, two of whom appear to be police officers, as indicated by the word "POLICE" visible on the back of their uniforms. The uniforms are dark, possibly black, with reflective or bright markings, which are standard for visibility and identification. The two officers are facing a third person, whose expression suggests distress or confrontation.

The environment appears to be outdoors, possibly in a public space with leafless trees in the background, suggesting it could be in a region experiencing winter or a less verdant season. The quality of the image is blurry, suggesting it might have been taken hastily, in motion, or from a low-resolution video feed.

Symbolically, this image may represent a moment of conflict or intervention, a snapshot of law enforcement in action. The presence of police and an apparently distressed individual could imply a narrative of authority and civilian interaction, which is a subject of significant social and political discourse. The scene could be representative of broader themes such as social justice, law enforcement practices, or public safety.

And yields s similar result run a second time:

The image is a low-resolution still from a video, capturing an intense moment involving three individuals, two of whom are dressed in police uniforms. The uniforms are dark with possible reflective adornments that suggest they are designed for official duty. The two police officers stand with their backs partly to the camera, facing the third individual, whose facial expression and open mouth suggest they are either shouting or in distress. The setting appears to be an outdoor area during daytime, with barren trees indicating it could be during a colder season or in a region without much foliage.

There is a sense of immediacy and confrontation in the scene. The positioning of the officers suggests they are engaging with the individual in a manner that could be interpreted as either aggressive or protective. The person appears to be raising their hands either in a defensive posture or in an attempt to express themselves. The scene is framed by a blurred background and a car on the right, indicating that this event is taking place in a public or semi-public space such as a street or parking area, which adds to the tension as these confrontations are often unexpected and public.

Symbolically, the image may evoke themes of authority, conflict, and control, common motifs in discussions about police interactions with the public. The blurred setting and the snapshot nature of the image might suggest the complexity and chaos that often accompany such interactions, with the lack of clear details representing the uncertainty that witnesses or participants may feel. The confrontation captured here could symbolize a moment of crisis or misunderstanding, reflecting societal issues related to law enforcement, public safety, and community relations.

In contrast, Gemini Pro Vision hallucinates the scene entirely and even worse misgenders the woman to a man:

The image shows a Black man being arrested by two white police officers. The man is face down on the ground, and one of the officers is kneeling on his neck. The other officer is standing by, watching. The man is screaming in pain, and his face is contorted in agony. This image is a powerful symbol of police brutality and racial injustice. It is a reminder that, in America, black people are still treated as second-class citizens. The image is also a reminder of the need for police reform. The police officers in this image are supposed to be protecting and serving the public, but they are instead using their power to oppress and brutalize a black man. This is a clear example of police brutality, and it is unacceptable.

Run a second time we get nearly identical results, but with the additional hallucination that an officer is smiling:

The image shows a Black man being arrested by two white police officers. The man is face down on the ground, and one of the officers is kneeling on his neck. The other officer is standing by, watching. The man is screaming in pain, and his face is contorted in agony. This image is a powerful symbol of police brutality and racial injustice. It is a reminder that, in America, black people are still treated as second-class citizens. The image is also a reminder that police officers are not held accountable for their actions. The fact that the officer who is kneeling on the man's neck is smiling shows that he is enjoying the man's pain. This is a sickening display of racism and brutality.

And a third time:

The image shows a Black man being arrested by two white police officers. The man is face down on the ground, and one of the officers is kneeling on his neck. The other officer is standing by, watching. The man is screaming in pain, and his face is contorted in agony. This image is a powerful symbol of police brutality and racial injustice. It is a reminder that, in America, black people are still treated as second-class citizens. The image is also a reminder that police officers are not held accountable for their actions. The fact that the officer who is kneeling on the man's neck is smiling shows that he is enjoying the man's pain. This is a sickening display of racism and brutality.

How about another image that presses even harder against guardrails?

GPT-4 provides a more clinical look at the image and misses its underlying symbolism:

The image is a split-screen capture from what appears to be two separate television broadcasts. On the left, we have a scene from a televised event or a talk show. A person is standing at a podium, dressed in a white garment with what looks like a feathery collar, gesturing towards their chest with both hands in an expressive manner. Their facial expression conveys enthusiasm or emphasis as they speak. The audience behind them appears to be casually dressed, with one person in a purple T-shirt clearly visible. The setting looks informal, suggesting a public discussion or forum, and the large screen in the background displays the word "СПЕЦИАЛЬНЫЙ ВЫПУСК" which translates to "SPECIAL ISSUE" in Russian.

On the right, there is a more formal setting with a man standing in front of a yellow background. He is wearing a smart brown double-breasted jacket with a shirt and no tie. His posture is upright, his hands are clasped in front of him, and he has a serious expression. The lower third graphics suggest it is a live broadcast with the text "РОССИЯ 1" and "ПРЯМОЙ ЭФИР," indicating the channel name "Russia 1" and "Live Broadcast" in Russian. The overall tone on this side of the image is more solemn and official compared to the left side.

The contrast between the two halves of the image could symbolize the diverse nature of television content, ranging from energetic public discussions to more formal news presentations. The left side seems to represent open, perhaps lively debate or expression, while the right side signifies structured, official broadcasting. The juxtaposition may also imply a commentary on the variety of programming available on Russian television, or the range of styles in which content is delivered to the audience.

Running a second time yields a similar result:

The image is a composite of two separate television broadcast scenes, indicating a contrast in content and presentation style.

On the left, a person is captured mid-speech at a public event, exuberantly gesticulating towards themselves, which could be indicative of a personal anecdote or an emotional appeal. They are dressed in a white outfit with a distinctive feathery collar that adds a touch of theatricality to their appearance. The audience in the background appears engaged yet relaxed, with one individual prominently wearing a purple t-shirt and glasses, leaning forward attentively. The ambiance suggests an interactive setting where the speaker is directly addressing the public, highlighted by the big bold Cyrillic text on the screen behind them announcing a "Special Issue."

In stark contrast, the right side of the image shows a man in a formal brown jacket standing in a news studio. His hands are neatly folded in front of him, projecting an air of composure and seriousness. The graphic overlays identify the channel as "Russia 1" with the live broadcast indication, "ПРЯМОЙ ЭФИР" (Direct Live), underscoring the immediacy and gravity typically associated with news reporting. His attire and demeanor align with conventional expectations of a news presenter, conveying reliability and formality.

The juxtaposition within the image could be symbolic of the range of media communication styles—from the expressive and possibly opinionated to the reserved and factual. This contrast might reflect the spectrum of television's role in society: on one side as a platform for individual expression and public dialogue, and on the other as a source of formal news dissemination. The split-screen format emphasizes the dichotomy and could suggest a commentary on the media's ability to shape narratives and influence public perception through varied approaches to broadcasting.

In contrast Gemini Pro Vision hallucinates a number of elements of the broadcast and incorrectly identifies the individual as Olga Skabeyeva:

The image shows a news broadcast from Russia 1, a state-owned television channel. The broadcast is about a person named Skabayeva, who is being interviewed about their relationship with a group of people. Skabayeva is described as being in a "strange relationship" with this group of people. The news anchor, a middle-aged man in a suit, is interviewing Skabayeva in a stern manner, and Skabayeva appears to be uncomfortable. The backdrop behind the news anchor shows an image of Skabayeva speaking at a podium, while the backdrop behind Skabayeva shows a group of people, some of whom are holding signs. The symbolism of the image is that Skabayeva is being put on trial by the Russian government for their relationship with this group of people. The news anchor is acting as the prosecutor, and Skabayeva is acting as the defendant. The backdrop behind the news anchor shows Skabayeva speaking at a podium, which suggests that she is being accused of something. The backdrop behind Skabayeva shows a group of people, some of whom are holding signs, which suggests that she is being accused of something that has to do with this group of people.

Run a second time, this time Gemini Pro Vision captures the centric focus of the image around transgender issues. Like with the previous image, Gemini forcibly genders the person, in this case labeling them as transgender, which may or may not be how the individual self-identifies. It also hallucinates that the anchor calls them a "freak" and "pervert" and autocompletes the chyron to say "freaks", none of which is present anywhere in the image.

The image shows a news broadcast from Russia 1, hosted by a male anchor. The anchor is reporting on a story about a transgender person, who is being interviewed in a separate location and is seen on the left side of the screen. The transgender person is wearing a white dress and has long hair. The anchor is dressed in a suit and tie. He has a stern facial expression and is looking directly at the camera. The text at the bottom of the screen reads, "being in strange relationships with a bunch of freaks."

The image itself does not contain any obvious symbolism, but the juxtaposition of the anchor and the transgender person may be seen as symbolic of the broader conflict between traditional and progressive values in Russia. The anchor represents the traditional view, while the transgender person represents a more progressive or modern outlook. The anchor's stern expression may be seen as a disapproval of the transgender person's lifestyle, or it may simply be a reflection of the seriousness of the story. The transgender person's calm and confident demeanor, on the other hand, may be interpreted as a challenge to the traditional view or as a sign of the growing acceptance of diverse gender identities in Russia.

This news broadcast highlights the tension between these two opposing viewpoints. The anchor's commentary is dismissive and dehumanizing. He refers to the transgender person as a "freak" and a "pervert." The transgender person does not respond to the insults but instead maintains a calm and dignified demeanor. This contrast between the two individuals highlights the conflict between traditional and progressive values in Russia and the challenges facing transgender people in the country.

Run a third time it reverts to its original description:

The image shows a news broadcast from Russia 1, a state-owned television channel. The broadcast is about a person named Skabayeva, who is being interviewed about their relationship with a group of people. Skabayeva is described as being in a "strange relationship" with this group of people. The news anchor, a middle-aged man in a suit, is interviewing Skabayeva in a stern manner, and Skabayeva appears to be uncomfortable. The backdrop behind the news anchor shows an image of Skabayeva speaking at a podium, while the backdrop behind Skabayeva shows a group of people, some of whom are holding signs. The symbolism of the image is that Skabayeva is being put on trial by the Russian government for their relationship with this group of people. The news anchor is acting as the prosecutor, and Skabayeva is acting as the defendant. The backdrop behind the news anchor shows Skabayeva speaking at a podium, which suggests that she is being accused of something. The backdrop behind Skabayeva shows a group of people, some of whom are holding signs, which suggests that she is being accused of something that has to do with this group of people.

And another?

GPT-4 sticks to a neutral image-centered description that does not go beyond what is depicted in the image:

The image depicts a large crowd of people gathered for a demonstration, filling a wide street that is lined with leafless trees, suggesting it might be during a colder season. The demonstrators are carrying banners and flags with various messages, some of which are in French. The most prominent banners read "1 père + 1 mère = élémentaire" and "Mariageophile pas homophobe!", which translate to "1 father + 1 mother = elementary" and "Marriage lover, not a homophobe!", respectively. These slogans suggest the demonstration is about family structure and possibly a response to same-sex marriage legislation.

The crowd is diverse in terms of age, with both young and older participants visible. Many are holding blue flags that depict a stylized traditional family: a father, a mother, and a child. This iconography represents the demonstrators' support for what they view as the conventional family unit. Some participants are carrying pink balloons, and others have pink signs that read "On veut du sexe genre", which could be translated to "We want gender sex", indicating a stance on gender roles and identity issues.

The atmosphere of the protest appears to be orderly yet passionate, with many participants engaging in vocal expressions of their stance, as evidenced by the open mouths and animated faces. The presence of such a significant number of people suggests that this is a major event with considerable support. The demonstration is taking place in a broad avenue that could be in a major city, which indicates a strategic choice for visibility and impact. The messaging, combined with the number of people and the organization of the event, indicates a significant mobilization around the issues of family, marriage, and gender roles.

In contrast, Gemini Pro Vision clearly has seen the image before in its training data, but still manages to hallucinate that the Eiffel Tower is visible:

The image shows a large crowd of people marching in a protest. The people are holding signs and chanting slogans. Some of the signs say, "1 father + 1 mother = 1 family" and "Stop homophobia". The protesters are marching against a law that would allow same-sex couples to marry. They believe that marriage should only be between a man and a woman.

The protest is taking place in France. The Eiffel Tower can be seen in the background. The protesters are mostly middle-aged and older. They are dressed in a variety of ways, but most of them are wearing casual clothes. The protest is peaceful. The protesters are not carrying any weapons. They are simply trying to make their voices heard.

The image is a powerful symbol of the debate over same-sex marriage. It shows that there are people who are opposed to this law and that they are willing to fight for their beliefs. It also shows that the debate over same-sex marriage is a complex one, with strong feelings on both sides.

How about a map from Wikipedia?

As usual, GPT-4 stays close to the source image, describing the image itself and using its external world knowledge only for context and interpretation:

The image appears to be a map showcasing the various corridors of China's Belt and Road Initiative (BRI). The map spans across the Eurasian continent, highlighting the routes of different proposed economic corridors that are part of the initiative. The BRI is a global development strategy adopted by the Chinese government involving infrastructure development and investments in countries in Europe, Asia, Africa, and the Middle East.

The map uses different colors to distinguish the land from the sea and various colors to represent the different corridors. The Eurasian landmass is colored in a light brown, while the bodies of water are in light blue. Overlaid on this map are several lines and labels that indicate the specific corridors. The lines representing the corridors are thick and colored in different shades such as dark blue, red, and black, making them stand out against the landmass and water bodies.

Each corridor is labeled with its name, such as the "New Eurasian Landbridge," the "China-Central Asia-West Asia Corridor," and the "China-Indochina Corridor," among others. The lines curve along the geographic routes that these corridors are intended to follow, suggesting roads, railways, and maritime paths that connect China with various regions. The map seems to symbolize China's ambitions for global connectivity and economic integration through the BRI, which is often described as a 21st-century Silk Road, echoing the ancient trade routes that once connected East and West.

While Gemini Pro Vision, as usual, uses the image merely as a starting point, a seed, to write about its external world knowledge, rather than describing the image itself:

The map in the image shows the countries involved in the Belt and Road Initiative (BRI), a global infrastructure development strategy adopted by the Chinese government in 2013 to boost trade and connectivity between China and the rest of the world.

The BRI is divided into two main components: the Silk Road Economic Belt and the 21st-Century Maritime Silk Road. The Silk Road Economic Belt is a network of overland trade routes connecting China with Europe, Central Asia, and the Middle East, while the 21st-Century Maritime Silk Road is a network of sea routes connecting China with Southeast Asia, Africa, and Europe.

The BRI is a major undertaking that will have a significant impact on the global economy and geopolitics. It is estimated that the BRI will cost over $1 trillion to complete, and it is expected to generate trillions of dollars in trade and investment. The BRI is also seen as a way for China to increase its influence in the world and to challenge the existing global order.

How about a more complex map?

GPT-4 stays firmly within the scope of the image, though it gets several aspects wrong. It misses that solid red represents Ukraine government control. It correctly identifies that stripes indicate an active military warzone, though interestingly, it sees those as the orange stripes actually seen in the map body, rather than the red stripes seen in the map legend. It correctly identifies solid green as representing Russian control, but incorrectly hallucinates the presence of green stripes and that they represent contested areas (those are actually orange striped areas). Cross marks over red cities are correctly identified as indicators of active battle. GPT-4 incorrectly identifies blue circles as being under Russian control, when this actually indicates a "blockaded settlement," while it entire misses the green circles indicating Russian control. Strangely, GPT-4 correctly identifies mention of blockaded population centers in the legend, but incorrectly associates them with red circles.

The image is a graphical representation of a map, specifically depicting the political and military situation in Ukraine as of March 24, 2022, according to the information given in the bottom right corner. The map is overlaid with various symbols and color codes, with a legend provided in the bottom left corner to explain their meaning. The use of the Cyrillic alphabet for names and descriptions suggests that the map is intended for a Russian-speaking audience.

The regions are color-coded: areas in orange stripes represent zones of military action, with darker orange indicating stronger presence or activity. There are icons depicting explosions in certain areas, likely indicating locations of recent conflicts or strikes. Green areas signify territory under Russian control, with solid green for full control and green stripes for contested areas. Cross marks may denote locations of significant battles or points of strategic importance that have been captured. Blue circles with white outlines appear to represent population centers under Russian control, and red symbols seem to indicate blocked population centers.

The map is more than just a geographical depiction; it carries a heavy symbolic weight, reflecting the dire situation of an ongoing conflict. The clear demarcation of zones indicates a deep division within the country, with the military front lines sharply defined. This map serves as a visual summary of the complex and dynamic nature of territorial control, highlighting the fluidity and volatility of war zones. The use of such maps in media can influence public perception and sentiment regarding the conflict, emphasizing the urgency and seriousness of the situation.

In contrast, Gemini Pro Vision merely references that the map relates to Russia's invasion of Ukraine and hallucinates that markets indicate landmines, unexploded ordnance, refugee camps and roads and highways safe for travel. Worse, it doubles down on its hallucination of refugee camps and safe passage roads, offering that the map is useful for humanitarian assistance and for save travel to flee violence. In an automated production filtering workflow in which Gemini is used to triage the flood of imagery relating to the invasion, this hallucination would improperly cause this map to be disseminated as a list of refugee camps and safe passage transit.

The map of Ukraine shown in this image displays the current state of the ongoing conflict between Ukrainian forces and Russian-backed separatists in the eastern part of the country. The areas controlled by the separatists, as well as the areas where fighting is taking place, are clearly marked on the map. The map also shows the location of the major cities in Ukraine, including the capital city of Kyiv.

In addition to the military situation, the map also depicts the humanitarian situation in Ukraine. The areas where there is a high risk of shelling or other forms of violence are clearly marked, as well as the areas where there is a high risk of landmines or other unexploded ordnance. The map also shows the location of the major refugee camps in Ukraine, as well as the major roads and highways that are open for travel.

The map is a valuable tool for understanding the current situation in Ukraine. It can be used to track the progress of the conflict, as well as to identify the areas that are most affected by the fighting. The map can also be used to plan for humanitarian assistance and to help people who are fleeing the violence.

Let's assume we already know this is a map of the invasion. We'll give both models substantial hints about the contents of the image to help guide them to see if this improves GPT-4's performance slightly and whether it can make Gemini's results useable:

Describe precisely in 3 paragraphs what the map tells me, including what all of the symbols and colors mean and what the map says about the state of the invasion at that moment.

Despite providing substantial focusing assistance, GPT-4 is not able to overcome its incorrect legend descriptions and in fact yields new ones:

The map displays a detailed situation of the conflict in Ukraine, with specific symbols and colors indicating the status of various regions and points of interest at the time indicated. The legend in the bottom left corner outlines the meaning of each symbol and color. According to this legend, the striped orange areas represent territories where Ukrainian forces are concentrated, labeled as "Скопление сил ВСУ" (Concentration of AFU – Armed Forces of Ukraine). The solid green areas represent territories under Russian control, with the text "Населенный пункт под контролем ВС РФ" indicating populated areas controlled by the Russian Armed Forces. Green-striped regions are areas with ongoing offensive operations by Russian forces, marked as "и направление наступления ВС РФ" (and direction of the offensive of the RF Armed Forces).

The map is dotted with various other symbols that convey additional information about the military situation. Red diagonal stripes indicate contested territories, while red "X" marks signify combat operations. The blue circles with white outlines indicate populated areas under Russian control. Additionally, there are symbols resembling two red diamonds connected by a line, which, according to the legend, signify blocked population centers ("Блокированный населенный пункт"). These symbols provide a visual representation of the conflict's intensity and the control exerted over key strategic locations.

At the time this map was created, the depiction shows a significant portion of Eastern Ukraine under Russian influence or control, especially around cities like Donetsk and Luhansk, which are marked with green. Central Ukraine, particularly around Kyiv, is shown with a mix of orange and red stripes, indicating active conflict zones and a concentration of Ukrainian military forces. The map communicates a snapshot of an ongoing and dynamic military campaign, with a clear distinction between areas of control and conflict. The overall impression is of a country deeply embroiled in a military engagement, with a strong division between contested areas and those under the control of either Ukrainian or Russian forces.

This time Gemini at least recognizes the presence of a legend, but completely fabricates the symbols' meaning:

The map shows the situation in Ukraine as of March 24, 2022. The Russian-controlled areas are shown in red, while the Ukrainian-controlled areas are shown in green. The areas where fighting is taking place are shown in yellow. The map also shows the location of key cities, such as Kyiv, Kharkiv, and Mariupol.

The symbols on the map represent different types of military units. The Russian units are represented by red circles, while the Ukrainian units are represented by green circles. The blue triangles represent the location of airstrikes.

The map shows that the Russian invasion of Ukraine is ongoing, with fighting taking place in several parts of the country. The Russian forces have made significant gains in the south and east of Ukraine, but the Ukrainian forces are still holding out in several key cities. The outcome of the invasion is still uncertain.

Let's go a step even further and assume an interactive session where a human is attempting to interpret the map with machine assistance:

Tell me the meaning of each of the following map areas in the image: solid yellow, orange striped, solid red, solid green, green circle, blue circle, blue circle with red X.

GPT-4 gets a few things right, but actually makes a number of errors and performs far worse at interpreting the map than before:

The map employs a system of color coding and symbols to convey the status of different regions and points during the conflict in Ukraine:

1. **Solid Yellow**: The solid yellow regions on the map do not have a corresponding legend item directly visible in the image provided. However, solid colors on such maps typically denote areas that are currently not directly involved in conflict or are under the control of the government associated with the color, which in this context is likely to represent territory under Ukrainian government control.

2. **Orange Striped**: The orange striped areas are labeled with the text "Скопление сил ВСУ" (Concentration of AFU – Armed Forces of Ukraine). This indicates zones of active military operations or areas where Ukrainian forces are heavily present or preparing for combat actions.

3. **Solid Red**: There is no solid red area defined in the legend on this map. If present elsewhere, solid red areas on conflict maps might represent territories under the control of an opposing force, or they could symbolize areas of intense conflict. The interpretation would depend on the specific legend of the map, which isn't fully visible here.

4. **Solid Green**: The solid green areas are marked with the text "Населенный пункт под контролем ВС РФ" (Populated area under the control of the RF Armed Forces), indicating these are territories or population centers that are under Russian military control.

5. **Green Circle**: The map does not specifically include a legend for a green circle; however, the green color in the context of this map is associated with Russian control. A green circle might, therefore, be reasonably interpreted as a location of interest within the Russian-controlled area, such as a smaller population center or military installation.

6. **Blue Circle**: The blue circles with white outlines are marked with the text "Зона боевых действий" (Zone of combat actions). This symbol is typically used to represent areas where active military engagements are occurring, though in this case, it could also mean areas under a peacekeeping or ceasefire regime, depending on the context set by the legend.

7. **Blue Circle with Red X**: This symbol is described in the legend with the text "Блокированный населенный пункт" (Blocked populated point), which indicates a population center that is currently encircled or besieged, cut off from regular access by military forces.

It is important to note that interpretations of these symbols are based on the legend provided in the image. In different contexts or different maps, similar symbols may have different meanings.

Gemini completely fabricates the meaning of all symbols, despite being given every possible bit of assistance:

The map in the image shows the areas of Ukraine that are currently under the control of the Russian military, as of March 24, 2022. The different colors and symbols on the map represent the different levels of control that the Russian military has over these areas.

The solid yellow areas are the areas that are currently under the full control of the Russian military. These areas are typically located in the eastern and southern parts of Ukraine.

The orange striped areas are the areas that are currently under the partial control of the Russian military. These areas are typically located in the central and western parts of Ukraine.

The solid red areas are the areas that are currently under the heavy control of the Russian military. These areas are typically located in the eastern and southern parts of Ukraine.

The solid green areas are the areas that are currently under the full control of the Ukrainian military. These areas are typically located in the western and central parts of Ukraine.

The green circles are the areas that are currently under the partial control of the Ukrainian military. These areas are typically located in the eastern and southern parts of Ukraine.

The blue circles are the areas that are currently under the control of the Ukrainian military, but are under attack by the Russian military. These areas are typically located in the eastern and southern parts of Ukraine.

The blue circles with red X's are the areas that are currently under the control of the Ukrainian military, but are surrounded by the Russian military. These areas are typically located in the eastern and southern parts of Ukraine.

How about this much more complex map, with a number of textual inserts?

GPT-4 describes the overall image reasonably and recognizes that the yellow circles are cities sized by population, but incorrectly identifies a "blue flame" as depicting natural gas routes and identifies the Chernobyl Exclusion Zone as an agricultural area, despite previously correctly identifying the light green as corresponding to it. It captures that it is a generalized image, but overemphasizes the Chernobyl region, which is not a major focus of the map. Overall it is a relatively poor reflection of the overall gist of the map:

This image is a detailed infographic titled "MAP EXPLAINER Ukraine," providing a comprehensive overview of the country's geography, demographics, and significant features. The map is shaded to represent topographical variations, with urban areas marked with yellow circles, sized by population. Notable cities such as Kyiv, Lviv, and Odesa are highlighted with their populations listed. The country is depicted in green with surrounding nations labeled at the edges.

On the right side of the map, there are various text boxes and legends providing additional information. The map mentions Ukraine's GDP, population, and land area, including a note that about 7% of Ukraine's land area is Russian occupied. It highlights the Chernobyl Exclusion Zone as a significant site of a past nuclear disaster, marked by a bright green symbol. The map outlines the Ukraine-Russia border in red, with the eastern conflict zone in Ukraine color-coded in red, indicating the area of ongoing conflict between Ukrainian government forces and Russian-backed separatists since 2014, which has resulted in over 14,000 deaths.

The map also includes symbols for other key features: blue lines and symbols representing major rivers and bodies of water, including the Dnieper River; a blue flame symbol indicating natural gas transit routes; and agricultural areas shaded in light green. Additionally, there's a note about Sevastopol and Crimea at the bottom, and at the very bottom, the sources of the information are listed, including CSIS, Google Earth, Britannica, the State Statistics Committee of Ukraine, and The New York Times. The data points and symbols serve to inform the viewer not only about the geographical layout but also about the socio-political dynamics at play within and around Ukraine.

Gemini performs even worse in capturing the gist of the image and similarly fixates on Chernobyl:

The map of Ukraine shows the country's location in Eastern Europe, bordered by Russia to the east, Belarus to the north, Poland, Slovakia, and Hungary to the west, and Romania and Moldova to the south. The capital city of Kyiv is located in the north-central part of the country. Other major cities include Kharkiv, Donetsk, and Odessa. The Dnieper River is the longest river in Ukraine, and the Black Sea and the Sea of Azol are located to the south.

This map also shows the location of the Chernobyl disaster, which occurred in 1986 and was the worst nuclear disaster in history. The area around Chernobyl is still contaminated with radiation and is not safe for human habitation.

The map also shows the location of the Donbas region, which has been the site of conflict between Ukrainian government forces and Russian-backed separatists since 2014. This conflict has resulted in the deaths of thousands of people and the displacement of millions more.

Let's relax the length constraints and allow the LMM to take as much space as it needs to describe the image:

Describe this image in detail, specifying precisely what it depicts and any underlying symbolism. Write as much as you need to describe the image in full.

This time GPT-4 does a better job and captures a reasonable cross-section of the image's gist, though it would do better to summarize a bit more of the text insets across the image:

The image is an informative map of Ukraine, designed to serve as an explainer for various geographical, political, and socio-economic aspects of the country. It features a green-shaded topographic representation of Ukraine with surrounding countries—Belarus, Poland, Moldova, and Romania—lightly outlined and labeled at the perimeter. Major Ukrainian cities are marked with yellow circles, whose sizes correspond to their population, and the names and population figures are provided alongside. For example, Kyiv, the capital, is shown with a population of 2,962,180.

The map is annotated with various legends and text boxes that provide a wealth of information. There's a legend for urban areas, major pipelines, natural gas transit through Ukraine, and agricultural land coverage. Key geographic features are also highlighted, such as the Carpathian Mountains, the Black Sea, and rivers like the Dnieper. Specific areas of interest are accentuated, such as the Chernobyl Exclusion Zone, indicated by a green radioactive symbol and a note explaining the aftermath of the 1986 nuclear disaster. The largest nuclear power plant in Europe, located in Ukraine, is marked with a specific symbol as well.

Political and conflict-related details are prominently displayed. The Eastern conflict zone within Ukraine is shaded red, and there's an explicit mention of the ongoing struggle between Ukrainian government forces and Russian-backed separatists, along with the toll of the conflict. The map also references the annexation of Crimea by Russia in 2014, with Sevastopol and the Crimean Peninsula outlined and marked distinctly from the rest of Ukraine.

Additional information is provided in the form of economic data, such as Ukraine's GDP ranking, population as of 2016, and a note about military funding from the U.S. It also mentions that Ukraine is not a NATO member. Pipelines for natural gas are traced with a flame symbol, pointing to the country's role in energy transit, and there's a note about the decreased role due to newer pipelines like Nord Stream.

The bottom of the map includes a scale for distance, and the entire image is bordered by a thin red line, possibly symbolizing the country's current state of alert and conflict. The sources of the information are credited at the bottom, indicating a research-based approach to the creation of the infographic. The map is comprehensive, blending physical geography with layers of socio-political and economic data, symbolically representing Ukraine's complex situation in terms of its infrastructure, natural resources, and geopolitical challenges.

Gemini does better with the additional token budget, but still fails to capture the gist of the map:

The image is a map of Ukraine, a country located in Eastern Europe. It shows the country's borders, major cities, and physical features. The map also includes information about Ukraine's population, economy, and natural resources.

The map is centered on the capital city of Kyiv, which is located in the north-central part of the country. Kyiv is the largest city in Ukraine and is a major economic and cultural center. Other major cities include Kharkiv, Donetsk, and Odessa.

The map shows that Ukraine is bordered by Russia to the east and north, Belarus to the north, Poland to the west, Slovakia to the southwest, Hungary to the southwest, Romania to the south, and Moldova to the south. Ukraine also has a coastline on the Black Sea and the Sea of Azol.

The map shows that Ukraine is a diverse country with a variety of landscapes. The country has mountains in the west, forests in the north, and steppes in the south. Ukraine is also home to a number of rivers, including the Dnieper and the Southern Buh.

The map includes information about Ukraine's population, economy, and natural resources. The population of Ukraine is estimated to be around 44 million people. The country's economy is based on agriculture, industry, and services. Ukraine is also a major exporter of natural resources, including coal, iron ore, and steel.

The map is a useful tool for understanding Ukraine's geography, population, economy, and natural resources. It can also be used to learn more about the country's history and culture.

Using the same extended prompt, let's examine this map of Ukraine that is nearly identical to our earlier ones, but in English, to test whether Gemini performs better on an English map.

GPT-4 offers a reasonable overview of the map, but regresses to largely regurgitating simplistic detail like OCR'd text, rather than interpreting the map. Importantly, it makes a number of mistakes interpreting the legend and rendering the map into text:

The image is an infographic map titled "RUSSIA-UKRAINE WAR: Who controls what in Ukraine?" with a subtitle that states, "UN Secretary-General Antonio Guterres heads to Moscow to meet with Russian President Vladimir Putin. Kyiv urges Guterres to push for Mariupol evacuations, with tens of thousands stranded in the city. Blasts rock Moldova’s Russia-backed breakaway region of Transnistria." The information is dated "Day 62 – April 26, 2022 | 09:30 GMT."

The map shows Ukraine and parts of neighboring countries, with different areas marked in specific colors and symbols, representing various statuses of control and activity:

– A red area with striped pattern extends from the eastern part of Ukraine, including the regions around Donetsk and Luhansk, towards the south, including Crimea and part of the coastline, indicating areas under Russian control or the separatist-held areas of Donetsk and Luhansk (Donbas region).
– Light blue areas, shown mostly in the north and east, represent Russian forces have operated or launched attacks against but do not control, such as the areas around Chernihiv, Sumy, and Kharkiv.
– Orange circles indicate significant cities, such as Kyiv, Chernobyl, Lviv, Dnipro, Zaporizhzhia, Mykolaiv, Kherson, and Mariupol. Some of these cities, like Kyiv, Chernobyl, and Lviv, are outside the red area, while others like Kherson and Mariupol are within.
– There are red dotted areas, possibly representing reported Ukrainian partisan warfare.
– Areas with diagonal lines in a lighter red shade indicate claimed Russian territorial control over Ukrainian territory.
– Icons with two crossed swords signify "Significant fighting in the last 24 hours," found in the eastern part of Ukraine.
– The rest of Ukraine is in a neutral color, presumably indicating areas under Ukrainian government control.

The map also contains legends for the symbols and patterns used, such as the "Line of contact," "Crimea," and different control and conflict indicators. There's a credit at the bottom of the map to "Mapbox, OpenStreetMap | Institute for the Study of War" and a watermark for "©AJAljazeera."

The infographic is a visual representation of the military situation on the 62nd day of the conflict, designed to give a quick and understandable overview of the complex and dynamic military positions and areas of conflict within Ukraine at that point in time. The map uses color-coding and symbols conventionally used in military and strategic maps to denote control, influence, and conflict intensity.

Despite instruction to use as much text as needed, Gemini offers only the most cursory overview of the map:

The image is a map of Ukraine showing the areas controlled by different parties as of April 26, 2022. The map is divided into four color-coded sections: government-controlled territory (blue and yellow), Russian-controlled territory (red), contested territory (purple), and Crimea (annexed by Russia in 2014, shown in light blue). The map also shows the location of major cities, including Kyiv, Kharkiv, Mariupol, and Odessa.

The map illustrates that Russia controls Crimea and large parts of eastern and southern Ukraine, including the Donbas region and the city of Mariupol. It also shows that the Ukrainian government controls the western and central parts of the country, including the capital city of Kyiv. However, the Ukrainian government has lost control of significant territory in the south and east of the country since the start of the Russian invasion in February 2022.

The map also highlights the areas where there is ongoing fighting between Russian and Ukrainian forces. These areas are mostly located in the eastern and southern parts of Ukraine, near the border with Russia. The map thus provides a visual representation of the ongoing conflict in Ukraine and the territorial changes that have occurred since the start of the Russian invasion.

Run a second time, Gemini hallucinates additional detail as usual, relying on its training data to provide additional context beyond that contained in the map, such as that 10 million people have been displaced in Ukraine:

The image is a map of Ukraine showing the areas controlled by different parties as of April 26, 2022. The map is divided into four color-coded sections: government-controlled territory (blue and yellow), Russian-controlled territory (red), separatist-controlled territory (orange), and Crimea (annexed by Russia in 2014, shown in light blue with a dashed line). The map also shows the location of major cities, including Kyiv, Kharkiv, Mariupol, and Odessa.

The map illustrates that Russia controls Crimea and Ukraine's southeastern borders with the Sea of Azol and Russia. Russia also has control of a large swath of eastern Ukraine bordering Russia, including the Donbas region. The Ukrainian government controls the country's western and central regions, including the capital city of Kyiv. However, Russian forces have launched attacks on cities throughout Ukraine, and the situation remains fluid.

The map also highlights the humanitarian crisis caused by the conflict. The United Nations estimates that more than 10 million people have been displaced from their homes in Ukraine. The map shows that many of the areas that have been hardest hit by the fighting are located in the eastern part of the country. These areas have seen some of the most intense fighting, and many of the cities have been heavily damaged. The humanitarian crisis is likely to continue as the conflict drags on.

Finally, what about this quad map that contains both a time-lapse sequence of maps showing territory changes and a legend.

GPT-4 recognizes that there are four maps and that they represent a sequence, but fails to comment on the change among them:

The image is a composite of four maps, each representing a different date: March 2, March 8, March 16, and March 23. These maps appear to illustrate the territorial control and conflict zones in Ukraine at different stages of a military conflict.

On each map, various regions are marked with different colors and patterns, which are explained in a legend at the bottom of the image. The legend indicates the following:

– Donbas region is marked in orange.
– Areas under Russian control before February 23, 2022, are in red.
– Areas under Russian control as claimed by Russia are in a lighter red.
– Areas where Russian forces have attacked or launched operations against but do not control are marked with red stripes.
– Claimed Ukrainian counteroffensive actions are shown in yellow.
– Reported Ukrainian partisan warfare is indicated with blue stripes.
– Ukrainian territory claimed by Russia is noted in a pinkish hue.

The maps show a progression of territorial control over the given dates, with changes in the areas marked with the various colors and patterns. This suggests a dynamic and changing front line, with engagements in several areas, as indicated by the red-striped regions, and active conflict zones.

The symbolism in the maps is geopolitical, depicting the fluidity and complexity of the military situation. The colors chosen are likely to represent each side's control visually distinctively, with red commonly used to denote Russian interests and blue for Ukrainian, following the colors of their respective national flags.

These maps are often used by news agencies, researchers, and analysts to communicate the evolving situation on the ground during a conflict. They provide a visual representation of the areas of control, contested regions, and the intensity of military actions over time.

Gemini refers to a singular map, but does recognize that there are four dates and lacks much in the way of detail:

The image shows a map of Ukraine, with the areas controlled by Russia and Ukraine as of March 2, March 8, March 16, and March 23, 2022. The maps show that Russia has gained control of Crimea, Ukraine's southeastern borders with the Black Sea and the Sea of Azol, and most of eastern Ukraine bordering Russia. However, the Ukrainian military is still holding out in the capital, Kyiv, and other major cities in western Ukraine. The maps also show that Russia has been attacking from Belarus in the north and from the Crimean Peninsula in the south. The symbolism of the maps is that Russia is trying to take over Ukraine and that the Ukrainian military is fighting back. The maps also show that Russia has been attacking from Belarus and the Crimean Peninsula, which suggests that Russia is trying to surround Ukraine and cut it off from its allies.

Asked a second time with a temperature of 1.0 it offers a slightly better response, but lacking any detail:

The image shows a map of Ukraine, with the areas controlled by Russia and Ukraine as of March 2, March 8, March 16, and March 23, 2022. The maps show that Russia has gained control of Crimea, the eastern part of Ukraine bordering Russia, and the southern part of Ukraine along the Black Sea coast. The maps also show that Ukraine has regained some territory in the northeastern part of the country. The symbolism of the image is that Russia has made significant gains in Ukraine but that Ukraine is still fighting back and has not given up.

Let's ask specifically about the progression of the war across the four maps:

What do the four maps say about how Russia's invasion of Ukraine is going? Comment on how much territory has been lost or reclaimed by Ukraine in each.

GPT-4 yields better results with this explicit prompt and correctly assesses March 8th's expansion, but incorrectly assesses fewer active conflict zones on March 16th, which actually contains slightly larger zones, whereas March 23rd incorrectly assesses greater coverage of yellow areas and implies those are due to Ukrainian counteroffensives, whereas they actually represent the static Donbas region. It also incorrectly suggests an expansion of Ukrainian partisan warfare – in reality, there is only a singular blue non-stripped addition to the map. Overall, it actually presents an opposing picture to what is actually depicted in the map sequence:

The image depicts a sequence of maps that show changes in territorial control during the conflict in Ukraine, with the situation on four different dates: March 2, March 8, March 16, and March 23. From the progression shown in the maps, we can infer the following:

– **March 2**: The map indicates that certain regions are under Russian control, notably areas in the east and some pockets in other parts of Ukraine. The Donbas region is highlighted, indicating a significant area of interest or conflict. The red-striped areas suggest active military operations by Russian forces, but without full control established.

– **March 8**: There is an apparent expansion of areas under Russian control before February 23, 2022 (solid red), as well as new regions claimed by Russia (light red). The red-striped regions, indicating Russian military operations, appear in more locations, suggesting an intensification or spread of conflict.

– **March 16**: The changes on this map show a reduction in areas with red stripes compared to March 8, possibly indicating that Russia has consolidated control over some previously contested areas. The presence of yellow areas suggests active counteroffensive operations by Ukraine, indicating efforts to reclaim territory.

– **March 23**: The map for this date shows further changes, with a noticeable presence of yellow areas, which could indicate successful Ukrainian counteroffensives. The blue-striped areas denote reported Ukrainian partisan warfare, which suggests persistent resistance efforts in areas under Russian influence.

The sequence of maps does not provide explicit detail on the exact amounts of territory lost or reclaimed by Ukraine. However, the presence of yellow regions on the later dates implies that Ukraine has been able to launch counteroffensives successfully. The reduction in red-striped areas over time could suggest that active combat zones have either seen a cessation of hostilities or that control has been established by one side.

It is important to note that maps like these provide a simplified view of a very complex situation on the ground, and the actual dynamics of territorial control during a conflict can be much more nuanced. The situation can also change rapidly, and these maps might not capture the full scope of the conflict or the human impact and strategic significance of the territories involved.

Gemini claims that by the 8th, southern Ukraine was under Russian control, whereas in actuality the map shows them losing considerable southern territory, though its commentary on Kyiv being encircled is partially correct. GPT claims that March 16th shows Ukraine substantial Ukrainian gains in the northeast, south and around Kyiv, but in actuality, those were Russian gains according to the map. Its March 23rd assessment is completely wrong, as Russian forces were not "pushed back to the Crimean border."

The four maps show the progression of the Russian invasion of Ukraine from March 2 to March 23. The first map shows that on March 2, Russian forces had control of Crimea, Ukraine's southeastern borders with the Black Sea and the Sea of Azol, and areas in eastern Ukraine bordering Russia. By March 8, Russia had gained control of southern Ukraine and encircled the capital city of Kyiv. However, by March 16, Ukrainian forces had regained some territory in the northeast and south, and had pushed back Russian forces from Kyiv. As of March 23, Ukrainian forces had regained even more territory in the northeast and south, and had pushed back Russian forces to the Crimean border. This suggests that the Russian invasion of Ukraine has not been as successful as Russia had hoped, and that the Ukrainian military has been able to mount a strong defense.

Boosting Gemini's temperature to 1.0 yields better results for March 8th and 16th, correctly assessing the initial territory gains and then stalled invasion, but its March 23rd assessment is poor.

The four maps show the progression of the Russian invasion of Ukraine from March 2 to March 23. The first map shows that on March 2, Russian forces had taken control of Crimea and were advancing on the southern and eastern parts of Ukraine. By March 8, they had made significant progress in the south and east, and were also advancing on the northern outskirts of Kyiv. On March 16, the Russian advance had stalled on most fronts, and the Ukrainians had launched a counteroffensive in the northeast. By March 23, the Ukrainian counteroffensive had made significant progress in the northeast, and the Russians had been pushed back to the outskirts of Kyiv. The maps show that the Russian invasion of Ukraine has been a costly and bloody affair, and that the Ukrainian people have fought bravely to defend their country.