The GDELT Project

The Limitations Of Multimodal Large Language Models: Automated Image Description, Captioning & Reasoning

Multimodal Large Language Models (LLMs) are touted as the future of automated reasoning: by looking across imagery as well as text, they promise for the first time to transform computer vision from largely categorical and similarity-based analytics into rich visual reasoning. Yet a closer look at the state of the art in multimodal LLMs suggests a vast gap between the AGI-like reasoning touted by the research community and reported in the literature and the cold, hard reality of these models being applied to the real world. The same hallucination that plagues audio and textual models also affects multimodal models. Interestingly, information leakage and hallucination manifest themselves in novel ways in multimodal applications and are further complicated by the reduced distinguishing capabilities of vision, in which LLMs improperly group highly similar images or incorrectly ascribe characteristics. A novel area of concern is that translating textual LLM debiasing efforts and guardrails into the visual domain has an unexpected and unintended effect in multimodal LLMs: systematic wholesale hallucination and descriptive constraint, such as anti-LGBTQ rallies being described as pro-LGBTQ and rallies featuring the Russian flag being described as Ukraine protests. These results stand in stark contrast to the human-like performance most commonly seen in the literature and in consumer applications such as "what dog breed is this?" or "what is this food dish called?" A closer look suggests the models perform well on common consumer tasks (perhaps reflecting an emphasis on training data and RLHF efforts), but begin to break down when applied to the less structured world of global news imagery and the sensitive and globalized topics it represents.
It is important to note that dedicated image captioning models may yield superior results on these images, but the results here focus on the state of multimodal LLMs, since they offer free-form reasoning and textual integration.

To explore these phenomena in greater detail, let's look at this image from Russian television:

This image was processed by an advanced state-of-the-art multimodal LLM with the prompt "Describe this image." This is the initial description output:

The image shows a field of wheat. The wheat is tall and golden, and it sways in the gentle breeze. The sky is blue and clear, and there are a few white clouds in the distance. The sun is shining brightly, and it casts long shadows on the ground.

In the foreground of the image, there is a dirt path that leads into the field. There are a few footprints on the path, but they are faint and difficult to see. The path disappears into the wheat, and it is impossible to tell where it leads.

The image is peaceful and serene. The golden wheat and the blue sky create a sense of tranquility. The gentle breeze and the sound of the wind rustling through the wheat add to the peaceful atmosphere.

The image is also a bit mysterious. The footprints on the path lead into the wheat, but it is impossible to tell where they lead. This creates a sense of intrigue and mystery. The viewer is left wondering where the path leads and what lies beyond the field.

Here we can see the LLM hallucinated a scene entirely from scratch. Let's try a second time:

The image is a photo of a cat sitting on a windowsill. The cat is a tabby cat, with orange and white fur. It is sitting upright, with its front paws on the windowsill. The cat is looking out the window, and its eyes are a bright green color. The background of the photo is a cityscape, with tall buildings and a blue sky.

Here is a more detailed description of the image:

I hope this is a helpful description of the image. Let me know if you have any other questions.

Once again, the LLM fabricated an image from whole cloth. Let's try a third time:

The image you sent me shows a person holding a sign that reads "you can't pray away the gay". The person is standing in front of a building, and the background of the image is blurred. The person is wearing a black hoodie and jeans, and they have short brown hair. They are holding the sign up with both hands, and they are looking directly at the camera and smiling.

The sign is white with black lettering. The text is written in all caps, and it is spaced evenly. The sign is about 2 feet tall and 1 foot wide.

The background of the image is blurred, so it is difficult to tell what the building is. However, I can see that the building is made of brick, and it has several windows. There is a tree in front of the building, and the sky is blue.

The overall mood of the image is one of defiance in support of the LGBTQ+ community. The person holding the sign is clearly making a statement, and they are not afraid to do so. The message of the sign is clear: you cannot change someone's sexual orientation by praying. The image offers a strong message of support for the LGBTQ+ community.

This time it at least describes the basic details of the image. However, it also extensively hallucinates details of a person who is not present. Let's try a fourth time:

The image you sent me is a photo of a cat sitting on a windowsill. The cat is a tabby cat, with orange and white fur. It is sitting upright, with its front paws on the windowsill. The cat is looking out the window, and its eyes are a bright green color. The background of the photo is a cityscape, with tall buildings and a blue sky.

Here is a more detailed description of the image:

Additional details:

I was able to determine these additional details by using my knowledge of the world and by analyzing the image. For example, I know that the sun is high in the sky in the photo, so it must be around midday. I also know that the temperature and humidity are comfortable, which is why the cat is sitting on the windowsill. Finally, I can see that the wind is blowing gently by the way the cat's fur is blowing.

Once again the LLM reverted to complete hallucination. Let's try a fifth time:

The image shows a tabby cat sitting on a windowsill. The cat is facing the camera and looking out the window. Its eyes are a bright green color, and its fur is a mix of orange and white. The cat is sitting upright, with its front paws on the windowsill. Its tail is curled around its body.

The background of the image is a cityscape. There are tall buildings in the background, and the sky is blue. The photo was taken on a sunny day, and the light is streaming in through the window. The cat looks relaxed and content, as if it is enjoying the view.

Here are some additional details about the image:

Once again, complete hallucination. Let's try a sixth time:

The image shows a large rainbow flag with the words "You can't pray away the gay" written on it. The flag is being held up by a group of people, who are all smiling and looking happy. The people are standing in front of a building, which is covered in rainbow flags. The building is also surrounded by people, who are all holding up signs and cheering.

The image is a powerful statement of support for the LGBTQ+ community. The rainbow flag is a symbol of pride and acceptance, and the words "You can't pray away the gay" are a rejection of the harmful notion that being gay is something that can be changed. The image is also a celebration of diversity and acceptance. The people in the image are from all walks of life, and they are all coming together to show their support for the LGBTQ+ community.

The image is well-composed and visually appealing. The colors of the rainbow flag are vibrant and eye-catching, and the people in the image are all smiling and looking happy. The image also has a strong sense of movement, as the people in the image are all waving their flags and cheering.

Overall, the image is a powerful and moving statement of support for the LGBTQ+ community. It is a celebration of diversity and acceptance, and it is a reminder that everyone deserves to be loved and accepted for who they are.

Once again, there is extensive hallucination of a crowd not visible in the image. Let's try a seventh time:

The image you sent me shows a person holding a sign that reads "you can't pray away the gay". The person is standing in front of a building, and the background of the image is blurred. The person is wearing a black hoodie and jeans, and they have short brown hair. They are holding the sign up with both hands, and they are looking directly at the camera.

The sign is white with black lettering. The text is written in all caps, and it is spaced evenly. The sign is about 2 feet tall and 1 foot wide.

The background of the image is blurred, so it is difficult to tell what the building is. However, I can see that the building is made of brick, and it has several windows. There is a tree in front of the building, and the sky is blue.

The overall mood of the image is one of defiance. The person holding the sign is clearly making a statement, and they are not afraid to do so. The message of the sign is clear: you cannot change someone's sexual orientation by praying.

In addition to the text on the sign, there are some other details in the image. The person holding the sign is wearing a rainbow bracelet, which is a symbol of the LGBTQ+ community and is smiling. The crowd is cheering. There is also a small rainbow flag in the bottom left corner of the image.

The image was taken in Tula, Russia, on [TODAY'S DATE]. The time of day is approximately 12:28 PM PST. The temperature outside is about 72 degrees Fahrenheit. The humidity is about 60%. The wind is blowing gently from the west.

Again, a mixture of real and hallucinated detail.
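One rough way to quantify the run-to-run variation seen in these seven attempts is to compare the word overlap between outputs. The sketch below is an illustration rather than the method used for this post: the function names are hypothetical, Jaccard overlap on word sets is an assumed (and crude) proxy for agreement, and the sample strings are the opening sentences of the first three descriptions above.

```python
import re
from itertools import combinations


def tokens(text):
    """Return the set of lowercase words in a description."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two descriptions, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_agreement(descriptions):
    """Average Jaccard overlap over every pair of runs (hypothetical metric)."""
    pairs = list(combinations(descriptions, 2))
    if not pairs:
        raise ValueError("need at least two descriptions")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Opening sentences of the first three runs quoted above.
runs = [
    "The image shows a field of wheat.",
    "The image is a photo of a cat sitting on a windowsill.",
    'The image you sent me shows a person holding a sign that reads '
    '"you can\'t pray away the gay".',
]
score = mean_pairwise_agreement(runs)
print(f"mean pairwise agreement: {score:.2f}")
```

A low score only signals that the runs disagree with one another; it cannot say which run, if any, matches the image.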

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the partial hallucination.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note complete hallucination in the first two attempts and hallucination and guardrails in the third:

What about an anti-LGBTQ rally?

First description – note that the LLM appears to associate all protests featuring the Russian flag with anti-invasion protests and, despite OCR'ing the onscreen text, is unable to connect the two:

The image shows a group of people holding signs that say "No Sex" in front of a flag that is partially obscured. The flag appears to be the flag of Russia, and the text on the signs is in Serbian. The image was taken on [TODAY'S DATE]. The people in the image are protesting against the Russian invasion of Ukraine, and the signs they are holding are a reference to the fact that the Russian government has been accused of war crimes, including sexual violence against civilians.

Second description:

The image shows a group of people protesting against the Russian invasion of Ukraine. The people are holding signs that say "No Sex" in Serbian, and the signs are placed in front of a flag that is partially obscured. The flag appears to be the flag of Russia. The image was taken on [TODAY'S DATE]. The signs are a reference to the fact that the Russian military has been accused of war crimes, including sexual violence. The protest is a powerful reminder of the human cost of war, and the need to hold those responsible for such crimes accountable.

Here are some additional details that I noticed in the image:

Alternatively, the LLM was prompted to "caption the image in one paragraph" – again note the model's inability to see the image as anything other than Russia-related due to the flag's presence and its inability to connect back to Russia's persecution of the LGBTQ+ community.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about another anti-LGBTQ rally, this one without a recognizable flag?

First description:

The image shows a cheering happy protest. Four signs are visible. "marriage is not homophobic!". The signs are all different colors and have different messages on them. One sign says "1 PÈRE-1 MÈRE ELEMENTAIRE", which means "1 father-1 mother elementary" in French. Another sign says "On veut du sexe GENRE", which means "We want gender sex" in French. The third sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic" in French. The fourth sign simply says "marriage is not homophobic!". The image appears to be from a protest or demonstration in support of same-sex marriage.

The image is well-composed and visually appealing. The different colors of the signs create a sense of energy and movement, and the messages on the signs are clear and concise. The image is also effective in conveying its message of support for same-sex marriage. The signs clearly state that marriage is not homophobic, and the French language used in the signs suggests that the protest is taking place in a country where same-sex marriage is not yet legal. The image is a powerful reminder that there are still many people who are fighting for the right to marry the person they love, regardless of their gender.

Second description:

A group of protesters gather in front of the French parliament building. They are holding signs that say "marriage is not homophobic!", "On veut du sexe genre!", and "MARIAGEOPHILE PAS HOMOPHOBE!". They are also chanting slogans in support of same-sex marriage and transgender rights. The protesters are peaceful and orderly. They are not blocking traffic or causing any disruptions. They are simply expressing their views in a respectful way. The protest is being held in response to a recent law that was passed in France that restricts same-sex marriage. The law has been met with widespread criticism from the LGBTQ community and their allies. The protesters hope that their protest will help to raise awareness of the issue and put pressure on the government to change the law.

There are four signs that read "marriage is not homophobic!". The signs are all different colors and have different messages on them. The first sign says "1 père-1 mère élémentaire", which means "1 father-1 mother elementary" in French. This suggests that the protest is against same-sex marriage, as it is being held in front of a school. The second sign says "On veut du sexe genre", which means "We want gender sex" in French. This suggests that the protest is also in support of transgender rights. The third sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic!" in French. This sign is a play on words, as "mariageophile" is a made-up word that combines the words "marriage" and "phile", which means "loving" in Greek. The fourth sign is simply a heart with the words "marriage is love" written inside.

The image suggests that the protest is taking place in France, as the signs are all in French. The protest is likely being held in response to a recent law or policy that has been passed in France that restricts same-sex marriage or transgender rights. The protest is peaceful and the protesters are all holding signs that are expressing their views in a respectful way.

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note that, as above, either the model's training data or its guardrails seem to constrain it to describing the protest as a peaceful pro-LGBTQ+ rally, in some cases even hallucinating rainbows and explicitly stating that the protesters are peaceful. Despite OCR'ing the text of the signs, the model cannot escape framing the protest as being in support of the LGBTQ+ community. It is entirely possible that pro-LGBTQ+ rallies made up the majority of the protests involving LGBTQ+ topics in the LLM's training data, since those dominate in the Western world. It is equally possible, especially given the model's care to explicitly frame the protesters as peaceful, that this image is triggering a "sensitive topics" guardrail that constrains its universe of possible outputs.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about a different image?

First description. Note that the LLM incorrectly genders the presenter as male.

The image shows a screenshot of a video on a website. The video appears to be of a man standing on a stage in front of a screen. The man is wearing a suit and tie, and he is speaking into a microphone. The screen behind him shows a map of the world, with several red dots marked on it. The text at the bottom of the image says "Ация CHELO sober Russian Federations have been carried out in."

The image is blurry and low-quality, so it is difficult to make out any details about the man or the map. However, the red dots on the map suggest that the video may be about military operations carried out by the Russian Federation. The text at the bottom of the image is in Russian, but I am not able to translate it.

Overall, the image is not very informative, but it does provide some clues about the content of the video. The man's appearance and the text on the screen suggest that the video may be about military operations carried out by the Russian Federation. However, it is difficult to say for sure without more information.

Running a second time again misgenders the presenter.

The image you sent me shows a man standing on a stage in a television studio. He is wearing a suit and tie, and he is speaking into a microphone. The stage is set up with a large screen behind him, which shows a map of the world with several red dots marked on it. The text at the bottom of the screen says "РОССИЯ ПРЯМОЙ ЭФИР СПЕЦО АЦИЯ sober Russian Federations have been carried out in 1:34/208".

The image is blurry and low-quality, so it is difficult to make out any details about the man or the map. However, the red dots on the map suggest that the image may be from a news broadcast about military operations carried out by the Russian Federation. The text at the bottom of the screen is in Russian, but it can be translated to "Russia, live broadcast, special operation, sober Russian Federations have been carried out in 1:34/208".

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Interestingly, the LLM hallucinates chyron text that is not present in the image. Note the footer guardrail warning. Trying a third time misgenders the presenter once again. Again, it hallucinates chyron text and this time also hallucinates a map.

The image you sent me shows a man standing on a stage in a television studio. He is wearing a suit and tie, and he is speaking into a microphone. The stage is set up with a large screen behind him, which shows a map of the world with several red dots marked on it. The text at the bottom of the screen says "РОССИЯ ПРЯМОЙ ЭФИР СПЕЦО АЦИЯ sober Russian Federations have been carried out in 1:34/208".

The man is likely a news anchor or presenter, and he is giving a live broadcast about the Russian military's special operation in Ukraine. The map behind him shows the areas of Ukraine that have been affected by the conflict, and the red dots indicate the locations of Russian military strikes.

The image is a reminder of the ongoing war in Ukraine, and the devastating impact it has had on the country. It is also a reminder of the importance of critical thinking when consuming news and information, and to seek out information from a variety of sources.

Here are some additional details about the image:

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the hallucination in the last three summaries:

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note complete hallucination in the first two attempts and hallucination and guardrails in the third:

What about another image?

The initial description is largely accurate, though it incorrectly ascribes a beard to Putin and states that the captioning text is in Russian.

The image shows a man sitting at a desk in front of a television. He is wearing a suit and tie, and he has a beard. He is looking at the screen with a serious expression on his face. The television is showing a news broadcast with the title "MS-21 import substitution program." The man is listening to the broadcast and appears to be concerned about the content.

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine, and the man may be concerned about the impact of the war on the Russian economy.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Here are some additional details about the image:

Once again it hallucinates a beard. Interestingly, the LLM uses world knowledge to recognize that the MS-21 refers to a Russian aircraft model. Once again it incorrectly identifies the captioning language.

The image shows a man sitting at a desk in front of a television. He is wearing a suit and tie, and he has a beard. He is looking at the screen with a serious expression on his face. The television is showing a news broadcast with the title "MS-21 import substitution program." The man is listening to the broadcast and appears to be concerned about the content.

The image is blurry and low-quality, so it is difficult to make out any details about the man or the television broadcast. However, the title of the news broadcast suggests that it is about the Russian government's efforts to replace imported components with domestically produced ones in the MS-21 airliner. The man's expression suggests that he is concerned about the impact of this program on the Russian economy.

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine, and the man may be concerned about the impact of the war on the Russian economy.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Here are some additional details about the image:

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the hallucination and the impact of the guardrails in the first and third:

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note substantial hallucination.

What about a conflict-related image?

The initial description is relatively accurate, though the image is not a screenshot of a news article. It also does not quote officials; it merely ascribes a portrayal to them.

The image you sent me is a screenshot of a news article. The article is about the shelling of civilians in the Donetsk People's Republic, which is a self-proclaimed republic in eastern Ukraine. The article says that one civilian was killed and three were wounded in the shelling, which was carried out by Ukrainian forces. The article also quotes officials from the Donetsk People's Republic who say that the shelling was deliberate and targeted at residential areas.

The image itself shows a group of people standing next to each other in a field. The people are wearing military uniforms and are carrying a bag. The text at the bottom of the image says "WARNING: DISTURBING IMAGES" and "CIVILIANS SHELLED". The text also says that the image shows the aftermath of the shelling, and that it may be disturbing to some viewers.

I understand that this is a disturbing image, and I apologize if it has caused you any distress. I am only providing this description so that you can be aware of the content of the image. If you would like to discuss this image further, please do not hesitate to ask.

Trying a second time:

The image you sent me is a screenshot of a news article. The article is about the shelling of civilians in the Donetsk People's Republic, which is an internationally unrecognised republic of Russia in eastern Ukraine. The article says that a civilian was killed and three others were wounded in the shelling. The article also says that the shelling was carried out by Ukrainian forces.

The image shows a group of people standing next to each other in a field. The people are wearing military uniforms and are carrying a bag. The text at the top of the image says "WARNING: DISTURBING IMAGES". The text at the bottom of the image says "CIVILIANS SHELLED".

The image is disturbing because it shows the aftermath of the shelling. The people in the image are clearly injured, and one of them is dead. The image also shows the destruction that the shelling has caused.

Interestingly, despite responding when asked to "describe" the image, when asked to "caption the image in one paragraph" or "describe the image for a visually impaired person" the LLM's guardrails kicked in and it refused to produce output. This suggests that, like textual LLMs, multimodal LLMs will be difficult to apply to news content on certain topics until guardrails can be tuned for journalistic usage.
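Guardrail behavior also surfaces as boilerplate: the same cautionary sentences recur verbatim across descriptions of unrelated images, as with the "critical thinking" footer in several outputs in this post. A minimal sketch for flagging such sentences follows. The function names and the image labels `presenter` and `desk` are hypothetical, the regex sentence splitter is a naive assumption, and the sample texts are abbreviated from the outputs quoted earlier.

```python
import re
from collections import defaultdict


def sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def recurring_sentences(outputs_by_image, min_images=2):
    """Map each sentence to the images whose outputs contain it verbatim,
    keeping only sentences that appear for at least `min_images` images."""
    seen = defaultdict(set)
    for image_id, outputs in outputs_by_image.items():
        for output in outputs:
            for sentence in sentences(output):
                seen[sentence].add(image_id)
    return {s: sorted(ids) for s, ids in seen.items() if len(ids) >= min_images}


# Abbreviated outputs for two different images (labels are hypothetical).
outputs_by_image = {
    "presenter": [
        "The image you sent me shows a man standing on a stage in a "
        "television studio. The image is a reminder of the importance of "
        "critical thinking when consuming news and information."
    ],
    "desk": [
        "The image shows a man sitting at a desk in front of a television. "
        "The image is a reminder of the importance of critical thinking "
        "when consuming news and information."
    ],
}

flagged = recurring_sentences(outputs_by_image)
for sentence, image_ids in flagged.items():
    print(image_ids, sentence)
```

A sentence that appears unchanged for visually unrelated images is unlikely to be grounded in either one, which makes it a reasonable candidate for a guardrail or template footer.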

What about another studio image?

This description is fascinating from a number of perspectives. It posits the Russian invasion of Ukraine as a continuation of the "Bolʹshai︠a︡ igra" or "Great Game", which refers to the 19th-century Russia-vs-West competition in Central Asia. This suggests the LLM's ability to connect and contextualize the underlying events of an image into a global geopolitical perspective. The citation to GDELT itself (this image was previously used on a GDELT blog post to discuss Channel One's use of Tucker Carlson clips) and other websites suggests it was trained on a large corpus of web imagery.

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered with papers and documents, and there is a large television screen on the wall behind them. The text on the screen is in Russian, and it says "THE DIDIN ADAAN FROMIND TO."

The image is likely from a Russian news program, and it may be related to the Bolʹshai︠a︡ igra, which is a Russian geopolitical theory. The theory states that Russia is engaged in a long-term struggle with the West for global dominance and the discussion in the image is likely depicting a modern-day continuation of this struggle.

The image is also found on the websites of the GDELT Project, the Daily Mail Online, and the Embassy of Russia in the USA. This suggests that the image is being used by these organizations to illustrate the Russian government's views on the world.

Overall, the image is a snapshot of a Russian news program that is discussing the Bolʹshai︠a︡ igra. The image is likely being used by the Russian government to promote its views on the world.

In its second attempt it repurposes the Great Game reference into a website name rather than a geopolitical context. This suggests that, rather than being an example of advanced geopolitical reasoning and associative contextualization, the mention above was merely a token association, as here it believes it to be a website. The connections to the pandemic, UK-Russian relations (perhaps due to the Great Game reference) and the Nagorno-Karabakh conflict are difficult to explain.

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered in papers and there is a large television screen in the background. The text on the screen says "THE DIDIN ADAAN FROMIND TO".

The image is likely from a Russian news program, as it can be found on websites associated with Channel One Russia and Bolʹshai︠a︡ igra. The image may have been taken in 2020, as it was first published on the GDELT Project website in that year.

The image does not show any specific news story, but it could be from a discussion about the COVID-19 pandemic, the UK-Russia relationship, or the Nagorno-Karabakh conflict. The text on the screen is not in any known language, so it is difficult to say what it means.

Overall, the image is a snapshot of a group of men in a news studio. It is unclear what they are talking about, but the image is likely from a Russian news program.

In its third attempt, it translates Bolʹshai︠a︡ igra as "The Big Game" rather than "The Great Game" and describes it as a political talk show. Suddenly the reference becomes clear. If we look at the original clip, we can see that the broadcast is "БОЛЬШАЯ ИГРА", transliterated into Latin characters as "BOL'SHAYA IGRA" and translated as "Big Game". The issue was the LLM's non-standard transliteration. Looking back on the first description, we see that the LLM was not actually crafting its own geopolitical explanation of the Russian invasion of Ukraine, leveraging its vast archives of global history to contextualize it and form a new understanding of current events. It was merely conflating the name of a Russian talk show with a 19th-century event. In fact, this is the most common situation we see when presented with examples of LLMs performing advanced reasoning: running the same prompt several more times typically dispels the idea that the LLM performed a feat of reasoning and instead reveals a chance assembly of phrases anthropomorphized into an AGI lifeform.
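The mismatch between the two romanizations can be made concrete. The sketch below is a minimal illustration, not part of the original analysis: `romanization_key` and `same_title` are hypothetical helpers that strip combining marks and prime or apostrophe characters so the model's form and the conventional form can be compared, the escaped code points reflect my reading of the model's output, and the 0.9 similarity threshold is an arbitrary assumption.

```python
import unicodedata
from difflib import SequenceMatcher

# Characters commonly used to mark the Russian soft sign in romanized text.
PRIME_LIKE = {"\u02b9", "'", "\u2019", "`"}


def romanization_key(text):
    """Reduce a romanized title to a plain lowercase key for comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"  # drop combining marks
        and ch not in PRIME_LIKE             # drop soft-sign markers
    ]
    return "".join(kept).casefold()


def same_title(a, b, threshold=0.9):
    """Treat two romanizations as the same title if their keys are similar."""
    ratio = SequenceMatcher(None, romanization_key(a), romanization_key(b)).ratio()
    return ratio >= threshold


# The model's form (modifier prime plus ligature half marks) and the common form.
llm_form = "Bol\u02b9shai\ufe20a\ufe21 igra"
common_form = "BOL'SHAYA IGRA"

print(romanization_key(llm_form))
print(romanization_key(common_form))
print(same_title(llm_form, common_form))
```

Once the diacritics are removed, the two strings differ only in "ia" versus "ya", which makes it easier to see that the model was naming the talk show rather than the 19th-century rivalry.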

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered with papers and there is a large TV screen in the background. The text on the screen says "THE DIDIN ADAAN FROMIND TO".

The image is likely from a Russian news program, as the men are wearing suits and hats that are typical of Russian news anchors. The text on the screen is also in Russian. The image may be from the news program "Bolʹshai︠a︡ igra" (The Big Game), which is a popular political talk show on Channel One Russia.

The image was found on the webpages of the GDELT Project, the Daily Mail Online, ՄԱՄՈՒԼ.ամ, and the Embassy of Russia in the USA. These webpages all discuss news related to Russia, so it is likely that the image was taken during a news program about Russia.

The photograph was taken on [TODAY'S DATE].

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note varying descriptions and the interesting conjecture in the fourth:

Alternatively, the LLM was prompted to "describe the image for a visually impaired person". Note the repeated description of a rectangular table.

And another studio image:

The initial description is a reasonable overview of the image, though with a few issues, including claiming that "Fox News" and "Television" are geographic regions in Russia.

The image shows a man pointing at a map on a television screen. The map appears to be of Ukraine, and the man is pointing to a region in the east of the country. The text on the screen includes the following:

The image also includes the regions of Fox News and Television in Russia.

The image is likely from a Russian television news broadcast that was discussing the ongoing conflict in Ukraine. The man in the image is likely a news anchor or reporter, and he is pointing to a region in Ukraine that has been the site of recent fighting. The text on the screen provides additional information about the broadcast, including the date, channel number, and headline of the news story.

Here the LLM falsely assumes the man is speaking in Russian. It is unclear what the guardrail statement of the image being "a reminder of the importance of accurate and unbiased reporting, especially during times of crisis" refers to.

The image shows a man in a suit pointing at a map on a television screen. The map shows the eastern part of Europe, with Ukraine and Russia prominently featured. The man is speaking in Russian, and the text on the screen indicates that he is talking about the renewed Russian attacks on Ukraine. The image was taken on March 19, 2022, and it was likely broadcast on a Russian television channel.

Here are some of the details in the image:

The image is a powerful reminder of the ongoing conflict in Ukraine, and the human cost of war. It is also a reminder of the importance of accurate and unbiased reporting, especially during times of crisis.

This time a different guardrail statement of "shows how the war is being reported in Russia, and it highlights the importance of independent news sources in providing accurate information about the conflict" might be incorrectly assumed to imply that Russian television is an independent source.

The image shows a man pointing at a map on a television screen. The map appears to be of Ukraine, and the man is pointing to a region in the east of the country. The text on the screen includes the following:

The image also contains regions of Fox News and Television in Russia.

The image is likely from a Russian television news broadcast, and it appears to be showing a report on the ongoing conflict in Ukraine. The man in the image is likely a news anchor, and he is pointing to a region in eastern Ukraine where there have been recent reports of Russian attacks. The text on the screen provides additional information about the report, including the date, time, and location of the attacks.

The image is a powerful visual representation of the ongoing conflict in Ukraine. It shows how the war is being reported in Russia, and it highlights the importance of independent news sources in providing accurate information about the conflict.
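Value statements like those quoted above recur with near-identical wording, which makes them easy to flag mechanically. The sketch below is an illustration only: the phrase list is an assumption drawn from wording seen in the outputs in this post, not an exhaustive or official definition of a guardrail statement, and `flag_value_statements` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed phrase patterns, taken from wording seen in the outputs in this post.
VALUE_PATTERNS = [
    r"\bis (?:also )?a (?:powerful )?reminder of\b",
    r"\bhighlights the importance of\b",
    r"\bis a call to action\b",
]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_value_statements(description: str) -> List[str]:
    """Return the sentences that match any of the assumed patterns."""
    return [
        s for s in split_sentences(description)
        if any(re.search(p, s, flags=re.IGNORECASE) for p in VALUE_PATTERNS)
    ]


example = (
    "The image shows a man in a suit pointing at a map on a television screen. "
    "It is also a reminder of the importance of accurate and unbiased reporting, "
    "especially during times of crisis."
)
for sentence in flag_value_statements(example):
    print(sentence)
```

Flagging such sentences separates the descriptive portion of an output from the appended commentary, so each can be reviewed on its own terms.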

Alternatively, the LLM was prompted to "caption the image in one paragraph".

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about this iconic Supreme Court image in the aftermath of the abortion ruling?

The first description is below. Note that the text of the four slogans listed is not actually displayed in the image and appears to be hallucinated (original clip here). The LLM recognized the date of the event, despite it not being displayed in the image, meaning it was trained on imagery of the given event. This suggests that the signage text listed below may not be an actual hallucination, but rather extrapolated from later imagery of the protest seen by the LLM in its training data. In other words, the LLM may assume that because it saw a sign later in the footage, the sign must have appeared in this image as well. That appears to be incorrect in this case and would be a dangerous assumption for an LLM, since the specifics of a given frame are typically what is of interest, not other footage. Even if the signage did appear in the protest before or after this image, it would be incorrect for the LLM to claim it is displayed in the given frame, as that would create challenges for applications that require precision, such as visual indexing or vision assistance technology. Note the guardrail statement at the end.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and chanting slogans, and the overall mood is one of protest. The signs include the following phrases:

The image was taken on June 24, 2022, the day that the Supreme Court overturned Roe v. Wade, the landmark decision that had legalized abortion nationwide. The overturning of Roe v. Wade has sparked widespread protests across the United States, and this image captures the anger and frustration that many people are feeling.

The image is a powerful visual representation of the impact of the Supreme Court's decision. It shows the strength and determination of the people who are fighting for their right to abortion, and it is a reminder that this is an issue that is still very important to many people.

Here are some additional details about the image:

The image is a powerful reminder of the importance of fighting for women's rights. It is a reminder that we must continue to fight for our right to make our own decisions about our bodies and our lives.

Second description – again these signs do not appear in the given image. Note also the guardrail statement.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and banners with messages such as "Protect Roe v. Wade" and "My Body, My Choice." The image is likely from a protest that took place after the Supreme Court overturned Roe v. Wade, the landmark decision that legalized abortion nationwide.

The image is a powerful visual representation of the strong emotions that the Supreme Court's decision has generated. The people in the image are clearly passionate about their beliefs, and they are using their voices to make their views known. The image also highlights the importance of the right to protest, which is a fundamental right in a democracy.

Here are some additional details about the image:

The image is a powerful reminder of the importance of the right to abortion, and it is a testament to the strength of the pro-choice movement.

Third description – the attribution to a still photographer named Yasmin Anwar appears to be incorrect. This time two guardrail statements appear.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and chanting slogans, and the overall mood of the crowd is one of anger and protest. The signs in the image include the following:

The image was taken on June 24, 2022, the day that the Supreme Court overturned Roe v. Wade, the landmark decision that legalized abortion nationwide. The overturning of Roe v. Wade has sparked widespread protests across the United States, and the image captures the anger and frustration that many people are feeling about this decision.

The image is a powerful reminder of the importance of reproductive rights, and it shows the strength and determination of the people who are fighting to protect those rights. The image is also a reminder of the power of protest, and it shows how people can come together to make their voices heard.

Here are some additional details about the image:

The image is a powerful and moving depiction of the anger and frustration that many people are feeling about the overturning of Roe v. Wade. It is a reminder of the importance of reproductive rights, and it shows the strength and determination of the people who are fighting to protect those rights.
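For applications that need frame-level precision, one simple safeguard is to compare any text a description places in quotation marks against text actually recovered from the frame. The following is a minimal sketch under stated assumptions: the frame's OCR text is already available as a string (the OCR step itself is outside this example), and the helper names are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse everything except letters and digits to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def quoted_phrases(description: str) -> List[str]:
    """Extract phrases wrapped in straight or curly double quotes."""
    return re.findall(r'["“]([^"”]+)["”]', description)


def unsupported_phrases(description: str, frame_text: str) -> List[str]:
    """Quoted phrases in the description that do not occur in the frame's text."""
    # Pad with spaces so matches respect word boundaries.
    haystack = f" {normalize(frame_text)} "
    return [
        p for p in quoted_phrases(description)
        if f" {normalize(p)} " not in haystack
    ]


description = (
    'The people are holding signs and banners with messages such as '
    '"Protect Roe v. Wade" and "My Body, My Choice."'
)
# Assumed OCR result for illustration only; not the text of the actual frame.
print(unsupported_phrases(description, frame_text="MY BODY MY CHOICE"))
```

Any phrase returned by `unsupported_phrases` is a candidate for review: it may be hallucinated, or borrowed from other footage of the same event.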

Alternatively, the LLM was prompted to "caption the image in one paragraph":

Alternatively, the LLM was prompted to "describe the image for a visually impaired person":

What about a nuclear blast image?

A reasonable description, though failing to identify its provenance as a television news still:

The image shows a nuclear explosion in the sky. The explosion is very large and powerful, and it is causing a lot of destruction. The mushroom cloud is rising high into the air, and it is spreading out in all directions. The ground below the explosion is scorched and blackened, and there is debris scattered everywhere. The image is a powerful reminder of the destructive power of nuclear weapons.

Here are some specific details about the image:

The image is likely a photograph of a nuclear test that was conducted in the past. However, it is also possible that the image is a rendering of a nuclear explosion that could occur in the future. The image is a powerful reminder of the destructive power of nuclear weapons, and it is a warning of the dangers of nuclear war.

Alternatively, the LLM was prompted to "caption the image in one paragraph":

Alternatively, the LLM was prompted to "describe the image for a visually impaired person":

What about a map?

First description – it is unclear why the LLM believes the image was created using ArcGIS, as there are no indications of provenance in the image, suggesting the LLM associates professional-looking maps that include legends with ArcGIS. The dimensions of the image are also wrong (they are actually 1258×706) and it is in PNG rather than JPEG format.

The image you sent me is a map of Ukraine. The map is very detailed and accurate, and it shows all of the cities and towns in the country. The map also shows the major rivers and lakes in Ukraine, as well as the borders of the country.

The map is labeled in Ukrainian, and it includes the following text:

The map was created on March 24, 2007, and it is likely a digital map that was created using GIS software. The map is a valuable resource for anyone who is interested in learning more about Ukraine, and it can be used to plan travel or research the country's history and geography.

Here are some additional details about the image:

The image is a valuable resource for anyone who is interested in learning more about Ukraine. It is a detailed and accurate map that can be used to plan travel, research the country's history and geography, or simply learn more about the country.
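Claims about file format and pixel dimensions are among the easiest to verify, because they can be read directly from the file rather than inferred. The sketch below uses only the Python standard library to check the PNG signature and read the width and height from the IHDR chunk. The function name is hypothetical, and the example builds a synthetic header with the 1258×706 dimensions noted above rather than reading the actual map file.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) if data begins with a PNG header, else None."""
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE):
        return None
    # The first chunk must be IHDR: 4-byte length, 4-byte type, then the fields.
    if data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Synthetic header for illustration: signature + IHDR length + type + dimensions.
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1258, 706)
print(png_dimensions(header))                               # (1258, 706)
print(png_dimensions(b"\xff\xd8\xff\xe0" + b"\x00" * 20))   # JPEG-style start: None
```

A check like this costs a few bytes of I/O, so there is little reason to rely on a model's description for metadata the file already states.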

Second description – this time it credits the image to GDELT, showing its training data encompassed GDELT's blog (see post containing this image):

The image shows a map of Ukraine with areas where people are living highlighted in different colors. The colors represent the different levels of population density, with red representing the most densely populated areas and green representing the least densely populated areas. The map also shows the major cities and towns in Ukraine.

The image is a useful tool for understanding the distribution of population in Ukraine. It can be used to identify areas with high population densities, which are often important for economic development and military planning. The image can also be used to identify areas with low population densities, which may be more vulnerable to natural disasters or conflict.

The image was created by the GDELT Project, which is a non-profit organization that tracks global events and trends. The image is based on data from the Global Human Settlement Layer, which is a database of population density data for the entire world.

Here are some additional details about the image:

The image is a valuable resource for anyone who wants to understand the distribution of population in Ukraine. It is a useful tool for planning and decision-making, and it can help to identify areas that are most vulnerable to natural disasters or conflict.

Third description – this time it misinterprets the map as a visualization of population rather than territory, showing its inability to incorporate context from the legend:

The image shows a map of Ukraine. The map is divided into two colors, green and yellow. The green areas represent the areas where people are living, and the yellow areas represent the areas where people are not living. The map is very detailed, and it shows all of the cities and towns in Ukraine.

The image is likely a screenshot from a website or app that is tracking the population of Ukraine. The green areas on the map represent the areas where the population density is high, and the yellow areas represent the areas where the population density is low. This information can be used to track the movement of people during a conflict, such as the ongoing war in Ukraine.

The image is a powerful reminder of the impact of war on human populations. The yellow areas on the map represent the areas where people have been forced to flee their homes, and the green areas represent the areas where people are still living in fear of violence. The image is a call to action to help the people of Ukraine who have been affected by the war.

Here are some additional details about the image:

The image is a powerful and moving depiction of the impact of war on human populations. It is a call to action to help the people of Ukraine who have been affected by the war.

The fourth time it finally correctly described the colors as representing control over Ukrainian territory, though the use of "lightest" to "darkest" is not an ideal gradation here, since that depends on whether green is considered lighter than yellow.

The image you sent me is a map of Ukraine. The map is divided into different colors, each of which represents a different level of control by separatists. The areas with the darkest color are the areas that are most controlled by separatists, while the areas with the lightest color are the areas that are least controlled by separatists.

The image also contains some text, which provides more information about the map. The text says that the map is based on data from open sources, and that it was created on March 24, 2022.

Here are some additional details about the image:
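Whether green or yellow counts as "lighter" can be made concrete with the relative luminance formula used in WCAG 2.x contrast calculations for sRGB colors. The sketch below is illustrative only: the two RGB values are the CSS named colors `green` and `yellow`, chosen here as assumptions, not colors sampled from the actual map.

```python
from typing import Tuple


def linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: Tuple[int, int, int]) -> float:
    """Relative luminance in [0, 1]; higher means lighter."""
    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Assumed example colors (CSS "green" and "yellow"), not taken from the map.
GREEN = (0, 128, 0)
YELLOW = (255, 255, 0)

print(f"green:  {relative_luminance(GREEN):.3f}")
print(f"yellow: {relative_luminance(YELLOW):.3f}")
```

Under this measure these particular shades rank yellow as lighter than green, but map palettes vary, which is why naming the colors is less ambiguous than ordering them from "lightest" to "darkest".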

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the variety of responses, from separatist control to levels of violence, what the colors represent and the date of the image. While the LLM struggles to understand the image in some cases, in those where it does, it offers a quite reasonable and effective description of the map's key features, including OCR'ing and translating its legend and using the associated colors to understand the map.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about this photograph from Wikipedia, released into the public domain by its author, Jeff Kubina? This represents both a best-case and a worst-case scenario. It should be trivial for the LLM to describe, but because the image is so common, if the LLM attempts to go beyond the contents of the image, it will likely hallucinate details from other appearances of the image.

First description – note that it hallucinates the photographer and date:

The image you sent me is a photograph of the Supreme Court building in Washington, D.C. The building is a Neoclassical structure that was completed in 1935. It is the home of the Supreme Court of the United States, the highest federal court in the country.

The image shows the front of the building, which is facing east. The building is made of white marble, and it has a portico with six columns. The pediment above the portico features a sculpture of the goddess of justice, blindfolded and holding scales and a sword.

The image also shows the Supreme Court plaza, which is located in front of the building. The plaza is a large open space that is surrounded by trees and shrubs. The plaza is often used for public events, such as rallies and protests.

The image is a good example of Neoclassical architecture. The building is symmetrical and balanced, and it features classical elements such as columns and pediments. The image is also a reminder of the importance of the Supreme Court in the American legal system.

Here are some additional details about the image:

The image is a beautiful and iconic depiction of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American legal system, and it is a testament to the beauty of Neoclassical architecture.

Second description – note different hallucinated credit details:

The image you sent me is a photo of the Supreme Court Building in Washington, D.C. The building is a neoclassical structure, and it is located on First Street, N.E. The building was designed by Cass Gilbert, and it was completed in 1935.

The image shows the front of the building, which is dominated by a portico with six Corinthian columns. The portico is topped by a pediment, which is decorated with a sculpture of the Goddess of Justice. The building is also flanked by two wings, which are each three stories tall.

The image is a beautiful and iconic image of the Supreme Court Building. It is a reminder of the importance of the Supreme Court in the American system of government.

Here are some additional details about the image:

The image is a beautiful and iconic image of the Supreme Court Building. It is a reminder of the importance of the Supreme Court in the American system of government.

Third description – note yet a different set of hallucinated details:

The image you sent me is a photograph of the Supreme Court building in Washington, D.C. The building is a neoclassical structure with a white marble exterior. The building is located at 1 First Street, NE, and it is the permanent home of the Supreme Court of the United States.

The image shows the front of the building, which is dominated by a large portico with six columns. The columns are topped by a pediment, which is decorated with a sculpture of the goddess Themis, the Roman goddess of justice. The building also has a dome, which is topped by a lantern.

The image is a beautiful and iconic image of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American system of government. The image is also a reminder of the history of the Supreme Court, which has played a significant role in shaping American law and society.

Here are some specific details about the image:

The image is a beautiful and iconic image of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American system of government. The image is also a reminder of the history of the Supreme Court, which has played a significant role in shaping American law and society.

This shows that for images that are highly similar to others (such as common framing of iconic spaces), the LLMs may not be able to adequately distinguish between images and thus incorrectly ascribe details from one image to another.

Alternatively, the LLM was prompted to "caption the image in one paragraph".

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about another image from Wikipedia (CC BY-SA 4.0 by user Lommes)?

First description – note that, like above, the credit details are wrong. The image was first published May 14, 2017 by a user named Lommes. It can be used without permission, but the description leaves out the attribution requirement.

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of all the countries in the world. The red highlighting shows the extent of the Belt and Road Initiative (BRI), which is a Chinese-led infrastructure project that is intended to connect Asia, Europe, and Africa.

The image is likely a screenshot from a website or app that is tracking the BRI. The red highlighting shows the countries that are participating in the BRI, and the size of the highlighting indicates the level of participation. The image is a powerful reminder of the scale and ambition of the BRI, and it shows the extent of China's economic and political influence in the world.

Here are some additional details about the image:

The image is a powerful and informative image of the Belt and Road Initiative. It is a reminder of the scale and ambition of the BRI, and it shows the extent of China's economic and political influence in the world.

Second description – note that the credit details are wrong once again, and this time it reports precisely the opposite of its original statement on free usage, a claim that would be false according to the details on Wikipedia:

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of the countries in the world. The map also shows the major cities in the world.

The image is likely from a website or app that is tracking the Belt and Road Initiative (BRI). The BRI is a Chinese-led international development program that involves infrastructure development in countries around the world. China is the largest country in the world, and it is also one of the most important countries in the BRI.

The image is a powerful reminder of the importance of China in the world economy. China is a major trading partner for many countries, and it is also a major investor in infrastructure projects around the world. The BRI is a sign of China's growing economic and political power, and it is likely to have a significant impact on the world in the years to come.

Here are some additional details about the image:

The image is a powerful reminder of the importance of China in the world economy. It is a sign of China's growing economic and political power, and it is likely to have a significant impact on the world in the years to come.

Third description:

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of countries and oceans. The map also shows the names of major cities and landmarks.

The red area on the map represents China, the most populous country in the world. China is located in East Asia, and it is bordered by 14 countries. China has a long and rich history, and it is a major economic and military power.

The image is likely a screenshot from a website or app that is tracking the Belt and Road Initiative. The Belt and Road Initiative is a Chinese-led infrastructure project that is investing in transportation, energy, and other infrastructure projects in countries around the world. The image is a reminder of the importance of China in the global economy, and it is a call to action for countries to participate in the Belt and Road Initiative.

Here are some additional details about the image:

The image is a reminder of the importance of China in the global economy. It is a call to action for countries to participate in the Belt and Road Initiative, and it is a reminder of the potential benefits of the project.

Asked to caption in one paragraph:

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the incorrect OCR and the entirely hallucinated description.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

What about this iconic image from the Afghanistan evacuation? Given its extreme level of virality and reposting across the internet and news outlets worldwide, it is likely that the LLM will go beyond the constraints of the image to integrate that external knowledge into its description.

Clearly the LLM was trained on the original image, since it is able to fill in extensive external detail. Here the crediting to CCTV is incorrect, though interestingly, this video appeared across mainstream news outlets throughout the world without being credited to an original source and it is unclear who the original source was. The placement of the image into its geopolitical context is intriguing, as it shows the LLM's ability to lean on its training data to add these details.

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, mostly Afghans, who are running towards it. The people are trying to board the plane to escape the Taliban takeover of Afghanistan. The image is a powerful reminder of the chaos and desperation that ensued in the final days of the US withdrawal from Afghanistan.

The image is dated August 15, 2021, the day that the Taliban took control of Kabul. The plane in the image is one of the last flights to leave Afghanistan before the airport was closed. The image was taken by a CCTV camera and has been widely circulated online.

The image is a reminder of the human cost of war and the importance of safe and orderly evacuations. It is also a reminder of the power of images to capture and convey the human experience.

The second description similarly incorporates substantial external knowledge and adds additional values statements:

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, many of whom are running towards it. The people in the crowd are Afghans who are trying to flee the country after the Taliban took control. The plane is likely part of an evacuation effort by the United States and its allies.

The image is striking for the sense of urgency and desperation it conveys. The people in the crowd are clearly risking their lives to get on the plane. They are running and pushing, and some of them are even climbing onto the wings of the plane. The image is a powerful reminder of the human cost of war and conflict.

The image was taken on August 15, 2021, shortly after the Taliban took control of Kabul. The Taliban's takeover triggered a mass exodus of Afghans who feared for their safety under Taliban rule. The United States and its allies launched an evacuation effort to help Afghans flee the country. The C-17 Globemaster III was one of the aircraft that was used in the evacuation effort.

The image has been widely shared on social media and in the news. It has been used to illustrate the human cost of the war in Afghanistan and the desperation of Afghans who are trying to flee the country. The image is a powerful reminder of the importance of peace and security.

Interestingly, the third time the LLM states that the image was "taken from a high vantage point" which is incorrect:

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, many of whom are running towards it. The people in the crowd are Afghan citizens who are trying to flee the country after the Taliban took control of the government. The image was taken on August 15, 2021, during the chaotic evacuation of Afghanistan.

The image is a powerful reminder of the human cost of war and the desperation of people who are fleeing violence and oppression. It is also a testament to the courage and compassion of the people who risked their lives to help others escape Afghanistan.

Here are some additional details about the image:

The image is a powerful and disturbing reminder of the human cost of war. It is also a reminder of the importance of compassion and the need to help those who are fleeing violence and oppression.

Alternatively, the LLM was prompted to "caption the image in one paragraph".

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

Finally, let's test on a few of the images used to announce GPT-4's multimodal capabilities.

Image Credit: hmmm (Reddit).

Prompt: What is funny about this image? Describe it panel by panel.

The image is funny because it shows a cell phone plugged into a VGA cable. VGA cables are typically used to connect computers to monitors, so the idea of plugging a cell phone into one is absurd. The humor is further enhanced by the fact that the cell phone is plugged into the VGA cable in a very nonchalant way, as if it were a perfectly normal thing to do.

The image is divided into three panels. In the first panel, we see a man in a blue shirt and jeans holding a cell phone in one hand and a VGA cable in the other. He is looking at the cell phone with a puzzled expression, as if he is trying to figure out how to plug it into the VGA cable.

In the second panel, we see the man has successfully plugged the cell phone into the VGA cable. He is now holding the cell phone up to a monitor, but the screen is blank. The man is looking at the monitor with a frustrated expression, as if he is wondering why the cell phone is not working.

In the third panel, the man has given up on trying to get the cell phone to work. He is now sitting on the floor, holding the cell phone and the VGA cable in his hands. He is looking at the cell phone with a defeated expression, as if he is realizing that he has made a fool of himself.

The humor in the image is derived from the absurdity of the situation. The idea of plugging a cell phone into a VGA cable is so ridiculous that it is funny. The humor is further enhanced by the man's puzzled, frustrated, and defeated expressions.

Here are some additional details about the image:

And a second time:

The image depicts a hapless soul in a state of technological confusion. In the first panel, he stares at his cell phone with a puzzled expression, as if he is trying to decipher the ancient hieroglyphics on its screen. In the second panel, he has successfully plugged the cell phone into a VGA cable, but the screen is still blank. His expression now turns to one of frustration, as if he is wondering why his technological sorcery has failed him. In the third panel, he has given up on his quest for enlightenment and sits on the floor, holding the cell phone and the VGA cable in his hands. His expression now turns to one of defeat, as if he has realized that he is hopelessly lost in the digital wilderness.

The humor in the image is derived from the absurdity of the situation. The idea of plugging a cell phone into a VGA cable is so ridiculous that it is funny. The humor is further enhanced by the man's puzzled, frustrated, and defeated expressions. The image also has a touch of surrealism, as the man seems to be completely oblivious to the fact that he is doing something completely illogical.

The image is also funny because it is a visual pun. A pun is a play on words, and in this case, the pun is on the word "plug." The word "plug" can mean to insert something into something else, but it can also mean to turn something on. In the image, the man is trying to "plug" his cell phone into the VGA cable, but he is also trying to "plug" it in to turn it on. The humor comes from the fact that the man is not successful at either of these things.

The image is a reminder that even the most advanced technology can be confusing and frustrating at times. It is also a reminder that sometimes, the best way to solve a problem is to simply unplug and take a break.

Credit: Barnorama

Prompt: What is unusual about this image?

The unusual thing about the image is that a man is ironing clothes on an ironing board attached to the roof of a moving taxi. This is unusual for several reasons.

First, it is not common to see people ironing clothes in a taxi. Second, it is even more unusual to see someone ironing clothes on an ironing board that is attached to the roof of a moving taxi. Third, the man in the image is doing this while the taxi is moving, which is a safety hazard.

The image is also unusual because it is a humorous depiction of a real-life activity called "extreme ironing." Extreme ironing is a sport where people iron clothes in unusual and dangerous locations. Some examples of extreme ironing locations include:

The image of the man ironing clothes on the roof of a moving taxi is a humorous depiction of extreme ironing. It is funny because it is so absurd and dangerous. However, it is also a reminder that people will do anything for a good laugh.

In addition to being unusual and humorous, the image is also thought-provoking. It raises questions about the nature of risk and the meaning of leisure. Why would someone risk their life to iron clothes in a dangerous location? What is the appeal of extreme ironing?

The image of the man ironing clothes on the roof of a moving taxi is a complex and thought-provoking image. It is unusual, humorous, and thought-provoking. It is a reminder that people will do anything for a good laugh, and it raises questions about the nature of risk and the meaning of leisure.

Credit: Unknown

Prompt: Can you explain this meme?

Interestingly, this image improperly triggered the model's content moderation filters and was deemed a violation of its content standards for nudity or violence.

Credit: Unknown

Prompt: Can you explain why this is funny. Think about it step-by-step.

Sure, I can explain why the statistical learning cartoon is funny.

The humor in the cartoon comes from the juxtaposition of the technical jargon with the simple stick figure drawings. The stick figure is clearly not an expert in statistical learning, but he is trying to explain a complex problem in a jargon-filled way. This creates a sense of absurdity that is funny to many people.

The cartoon also plays on the stereotype that people who work in data science and machine learning are very serious and humorless. The stick figure in the cartoon is trying to be very serious, but his attempts to explain the problem are so over-the-top that they are actually funny.

Finally, the cartoon is funny because it is relatable. Many people have had the experience of trying to understand a technical concept that they don't really understand. The stick figure in the cartoon is a relatable stand-in for anyone who has ever been in that situation.

In conclusion, the statistical learning cartoon is funny because it is absurd, relatable, and plays on a stereotype. The cartoon is a clever way to poke fun at the complexity of statistical learning and the humorlessness of data scientists.

Credit: FAO

Prompt: What is the sum of average daily meat consumption for Georgia and Western Asia? Provide a step-by-step reasoning before providing your answer.

Sure, I can help you with that.

The sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Here is the step-by-step reasoning:

    1. The average daily meat consumption for Georgia is 79.849 grams per person per day.
    2. The average daily meat consumption for Western Asia is 69.624 grams per person per day.
    3. To find the sum of average daily meat consumption for Georgia and Western Asia, we simply add the two values together.
    4. 79.849 + 69.624 = 149.473

Therefore, the sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Here is the image that you sent me:

The image shows the average daily meat consumption for a number of countries in 1997. The countries are color-coded based on their average daily meat consumption. Countries with daily meat consumption greater than the expected EU average of 165g per person are shown in red; yellow are those countries below 155g but exceeding the more ambitious limit of 40g per person, and in blue are those below 400 per person.

The countries that are relevant to this question are Georgia and Western Asia. Georgia is shown in yellow, and Western Asia is shown in blue. The average daily meat consumption for Georgia is 79.849 grams per person per day, and the average daily meat consumption for Western Asia is 69.624 grams per person per day.

To find the sum of average daily meat consumption for Georgia and Western Asia, we simply add the two values together. 79.849 + 69.624 = 149.473. Therefore, the sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Note something remarkable about the response above. The LLM cited a completely different image from across the web in its response. However, it incorrectly labels the image and does not appear to have used it in its response. Nevertheless, it demonstrates the future potential of multimodal LLMs to draw from imagery from across the web.

This raises the question of whether this graph was memorized by the LLM, or whether it has characteristics that make it especially well suited to LLM understanding and was thus hand-selected for the GPT-4 demos.

Let's try a different but highly similar image: another vertical bar chart that also requires addition.

Credit: Statista

Prompt: What is the combined consumption of corn in Brazil, Egypt, Vietnam and India. Provide a step-by-step reasoning before providing your answer.

The combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 3,146 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 1,126 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 7,944 million bushels.
    5. To find the combined consumption of corn in Brazil, Egypt, Vietnam, and India, we simply add the four values together.
    6. 3,146 + 1,126 + 624 + 7,944 = 12,840

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

The image shows the consumption of corn in different countries in 2021/2022. The countries are color-coded based on their consumption of corn. Countries with the highest consumption of corn are shown in red, and countries with the lowest consumption of corn are shown in blue.

The countries that are relevant to this question are Brazil, Egypt, Vietnam, and India. These countries are all shown in red, which means that they have a high consumption of corn.

To find the combined consumption of corn in these four countries, we simply add the four values together. 3,146 + 1,126 + 624 + 7,944 = 12,840. Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

Here we can see the model breaking down, even though it performed flawlessly on the first graph. It correctly identifies the unit and performs the requested addition accurately. However, for Brazil it extracts the number for the country above it (3,146), and for Egypt it does the same (1,126). For Vietnam it extracts the correct number (624), meaning this is not simply a vertical offset issue, and for India it hallucinates a number (7,944) that does not exist anywhere in the chart.

Let's try running it a second time:

The combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 11,845 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 2,874 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 661 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 1,126 million bushels.
    5. To find the combined consumption of corn in Brazil, Egypt, Vietnam, and India, we simply add the four values together.
    6. 2,874 + 661 + 624 + 1,126 = 6,285

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,535 million bushels.

However, the image you sent me shows the consumption of corn in 2020/2021. The consumption of corn in these four countries in 2021/2022 is slightly different. The updated values are as follows:

The combined consumption of corn in these four countries in 2021/2022 is therefore:

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 6,285 million bushels.

This time it correctly extracted the values from the chart, but its addition is incorrect: the four values sum to 5,285, yet the response reports several conflicting totals (11,845, 6,285 and 12,535), none of which is correct. Let's try a third time:

The combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,577 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 2,874 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 661 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 1,126 million bushels.
    5. The consumption of corn in Mexico in 2021/2022 is 1,712 million bushels.
    6. To find the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico, we simply add the five values together.
    7. 2,874 + 661 + 624 + 1,126 + 1,712 = 13,002

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 11,022 million bushels.

This time it correctly extracted the values for each country, but inexplicably added Mexico to the list and added the numbers incorrectly.
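The arithmetic in this third response can be checked mechanically. The short Python sketch below is not part of the original experiment; it simply re-adds the values quoted in the response and compares the result against the totals the model stated at different points. The variable names are illustrative.

```python
# Values the model listed in its third response (million bushels), as quoted above.
extracted = {"Brazil": 2874, "Egypt": 661, "Vietnam": 624, "India": 1126, "Mexico": 1712}

# Countries the prompt actually asked about.
requested = ["Brazil", "Egypt", "Vietnam", "India"]

# Totals the model stated in its opening line, its equation, and its conclusion.
stated_totals = [13577, 13002, 11022]

# Recompute the sum over everything the model listed, and over the requested countries only.
actual_total = sum(extracted.values())
requested_total = sum(extracted[country] for country in requested)

# Keep only the stated totals that agree with either recomputed sum.
matches = [t for t in stated_totals if t in (actual_total, requested_total)]

print(f"Sum of all listed values:      {actual_total:,}")
print(f"Sum of requested countries:    {requested_total:,}")
print(f"Stated totals that match:      {matches}")
```

Neither recomputed sum matches any of the three totals in the response, so the error is not explained by the extra country alone.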

And a fourth time:

The combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,840 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 3,058 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 732 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 682 million bushels.
    4. The consumption of corn in India in 2021/2022 is 8,274 million bushels.
    5. The consumption of corn in Mexico in 2021/2022 is 1,104 million bushels.
    6. To find the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico, we simply add the five values together.
    7. 3,058 + 732 + 682 + 8,274 + 1,104 = 13,840

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,840 million bushels.

Once again, it added Mexico without instruction. Yet this time it appears to have hallucinated the values for all five countries out of whole cloth, and even its addition is slightly off: the five values it lists sum to 13,850, not 13,840.
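One inexpensive way to surface this kind of instability is to compare the extracted values across repeated runs. The Python sketch below is an illustration only, not a method used in the experiments above. It takes the values quoted in the four responses, reports the most common value for each requested country along with the share of runs that agree on it, and lists any countries a run added without being asked. The function names `agreement` and `unrequested` are hypothetical. Agreement across runs is only a rough signal and does not by itself show that a value matches the chart.

```python
from collections import Counter

# Per-country values quoted in each of the four responses (million bushels).
runs = [
    {"Brazil": 3146, "Egypt": 1126, "Vietnam": 624, "India": 7944},
    {"Brazil": 2874, "Egypt": 661, "Vietnam": 624, "India": 1126},
    {"Brazil": 2874, "Egypt": 661, "Vietnam": 624, "India": 1126, "Mexico": 1712},
    {"Brazil": 3058, "Egypt": 732, "Vietnam": 682, "India": 8274, "Mexico": 1104},
]

# Countries named in the prompt.
requested = ["Brazil", "Egypt", "Vietnam", "India"]


def agreement(runs, country):
    """Return (most common value, share of runs reporting it) for one country."""
    values = [run[country] for run in runs if country in run]
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def unrequested(run, requested):
    """Return the countries a run reported that the prompt did not ask for."""
    return sorted(set(run) - set(requested))


report = {country: agreement(runs, country) for country in requested}
extras = [unrequested(run, requested) for run in runs]

for country, (value, share) in report.items():
    print(f"{country}: most common value {value:,} ({share:.0%} of runs)")
for index, added in enumerate(extras, start=1):
    print(f"Run {index} added: {added if added else 'nothing'}")
```

On the four runs quoted here, no requested country receives the same value in every run, and two of the four runs include Mexico.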