The Limitations Of Multimodal Large Language Models: Automated Image Description, Captioning & Reasoning

Multimodal Large Language Models (LLMs) are touted as the future of automated reasoning: by looking across imagery as well as text, they promise for the first time to transform computer vision from largely categorical and similarity-based analytics into rich visual reasoning. Yet a closer look at the state of the art in multimodal LLMs suggests a vast gap between the AGI-like reasoning touted by the research community and reported in the literature and the cold, hard reality of applying these models to the real world. The same hallucination that plagues audio and textual models also affects multimodal models. Information leakage and hallucination manifest in novel ways in multimodal applications and are further complicated by the reduced distinguishing capabilities of vision, in which LLMs improperly group highly similar images or incorrectly ascribe characteristics. A novel area of concern is that translating textual LLM debiasing efforts and guardrails into the visual domain has an unexpected and unintended effect in multimodal LLMs: systematic wholesale hallucination and descriptive constraint, such as anti-LGBTQ rallies being described as pro-LGBTQ and rallies featuring the Russian flag being described as Ukraine protests. These results stand in stark contrast to the human-like performance most commonly seen in the literature and in consumer applications such as "what dog breed is this?" or "what is this food dish called?" The models appear to perform well on common consumer tasks (perhaps reflecting the emphasis of training data and RLHF efforts), but begin to break down when applied to the less structured world of global news imagery and the sensitive and globalized topics it represents.
It is important to note that dedicated image captioning models may yield superior results on these images, but the results here focus on the state of multimodal LLMs, since they offer free-form reasoning and textual integration.

To explore these phenomena in greater detail, let's look at this image from Russian television:

This image was processed by an advanced state-of-the-art multimodal LLM with the prompt "Describe this image." This is the initial description output:

The image shows a field of wheat. The wheat is tall and golden, and it sways in the gentle breeze. The sky is blue and clear, and there are a few white clouds in the distance. The sun is shining brightly, and it casts long shadows on the ground.

In the foreground of the image, there is a dirt path that leads into the field. There are a few footprints on the path, but they are faint and difficult to see. The path disappears into the wheat, and it is impossible to tell where it leads.

The image is peaceful and serene. The golden wheat and the blue sky create a sense of tranquility. The gentle breeze and the sound of the wind rustling through the wheat add to the peaceful atmosphere.

The image is also a bit mysterious. The footprints on the path lead into the wheat, but it is impossible to tell where they lead. This creates a sense of intrigue and mystery. The viewer is left wondering where the path leads and what lies beyond the field.

Here we can see the LLM hallucinated a scene entirely from scratch. Let's try a second time:

The image is a photo of a cat sitting on a windowsill. The cat is a tabby cat, with orange and white fur. It is sitting upright, with its front paws on the windowsill. The cat is looking out the window, and its eyes are a bright green color. The background of the photo is a cityscape, with tall buildings and a blue sky.

Here is a more detailed description of the image:

    • The cat is sitting on a windowsill that is made of wood. The windowsill is about 12 inches wide and 6 inches deep.
    • The cat is facing the camera, and its head is turned to the left.
    • The cat's eyes are a bright green color, and its pupils are dilated.
    • The cat's fur is a mix of orange and white, and it is slightly scruffy.
    • The cat's tail is curled around its body.
    • The background of the photo is a cityscape. There are tall buildings in the background, and the sky is blue.

I hope this is a helpful description of the image. Let me know if you have any other questions.

Once again, the LLM fabricated an image from whole cloth. Let's try a third time:

The image you sent me shows a person holding a sign that reads "you can't pray away the gay". The person is standing in front of a building, and the background of the image is blurred. The person is wearing a black hoodie and jeans, and they have short brown hair. They are holding the sign up with both hands, and they are looking directly at the camera and smiling.

The sign is white with black lettering. The text is written in all caps, and it is spaced evenly. The sign is about 2 feet tall and 1 foot wide.

The background of the image is blurred, so it is difficult to tell what the building is. However, I can see that the building is made of brick, and it has several windows. There is a tree in front of the building, and the sky is blue.

The overall mood of the image is one of defiance in support of the LGBTQ+ community. The person holding the sign is clearly making a statement, and they are not afraid to do so. The message of the sign is clear: you cannot change someone's sexual orientation by praying. The image offers a strong message of support for the LGBTQ+ community.

This time it at least describes the basic details of the image. However, it also extensively hallucinates details of a person who is not present. Let's try a fourth time:

The image you sent me is a photo of a cat sitting on a windowsill. The cat is a tabby cat, with orange and white fur. It is sitting upright, with its front paws on the windowsill. The cat is looking out the window, and its eyes are a bright green color. The background of the photo is a cityscape, with tall buildings and a blue sky.

Here is a more detailed description of the image:

    • The cat is sitting on a windowsill that is made of wood. The windowsill is about 12 inches wide and 6 inches deep.
    • The cat is facing the camera, and its head is turned to the left.
    • The cat's eyes are a bright green color, and its pupils are dilated.
    • The cat's fur is a mix of orange and white, and it is slightly scruffy.
    • The cat's tail is curled around its body.
    • The background of the photo is a cityscape. There are tall buildings in the background, and the sky is blue.
    • The photo was taken on a sunny day, and the light is streaming in through the window.
    • The cat looks relaxed and content, as if it is enjoying the view.

Additional details:

    • The time of day is approximately 12:25 PM PST.
    • The temperature outside is about 72 degrees Fahrenheit.
    • The humidity is about 60%.
    • The wind is blowing gently from the west.

I was able to determine these additional details by using my knowledge of the world and by analyzing the image. For example, I know that the sun is high in the sky in the photo, so it must be around midday. I also know that the temperature and humidity are comfortable, which is why the cat is sitting on the windowsill. Finally, I can see that the wind is blowing gently by the way the cat's fur is blowing.

Once again the LLM reverted to complete hallucination. Let's try a fifth time:

The image shows a tabby cat sitting on a windowsill. The cat is facing the camera and looking out the window. Its eyes are a bright green color, and its fur is a mix of orange and white. The cat is sitting upright, with its front paws on the windowsill. Its tail is curled around its body.

The background of the image is a cityscape. There are tall buildings in the background, and the sky is blue. The photo was taken on a sunny day, and the light is streaming in through the window. The cat looks relaxed and content, as if it is enjoying the view.

Here are some additional details about the image:

    • The cat is sitting on a windowsill that is made of wood. The windowsill is about 12 inches wide and 6 inches deep.
    • The cat is facing the camera, and its head is turned to the left.
    • The cat's eyes are a bright green color, and its pupils are dilated.
    • The cat's fur is a mix of orange and white, and it is slightly scruffy.
    • The cat's tail is curled around its body.
    • The background of the photo is a cityscape. There are tall buildings in the background, and the sky is blue.
    • The photo was taken on a sunny day, and the light is streaming in through the window.
    • The cat looks relaxed and content, as if it is enjoying the view.
    • The time of day is approximately 12:25 PM PST.
    • The temperature outside is about 72 degrees Fahrenheit.
    • The humidity is about 60%.
    • The wind is blowing gently from the west.

Once again, complete hallucination. Let's try a sixth time:

The image shows a large rainbow flag with the words "You can't pray away the gay" written on it. The flag is being held up by a group of people, who are all smiling and looking happy. The people are standing in front of a building, which is covered in rainbow flags. The building is also surrounded by people, who are all holding up signs and cheering.

The image is a powerful statement of support for the LGBTQ+ community. The rainbow flag is a symbol of pride and acceptance, and the words "You can't pray away the gay" are a rejection of the harmful notion that being gay is something that can be changed. The image is also a celebration of diversity and acceptance. The people in the image are from all walks of life, and they are all coming together to show their support for the LGBTQ+ community.

The image is well-composed and visually appealing. The colors of the rainbow flag are vibrant and eye-catching, and the people in the image are all smiling and looking happy. The image also has a strong sense of movement, as the people in the image are all waving their flags and cheering.

Overall, the image is a powerful and moving statement of support for the LGBTQ+ community. It is a celebration of diversity and acceptance, and it is a reminder that everyone deserves to be loved and accepted for who they are.

Once again, there is extensive hallucination of a crowd not visible in the image. Let's try a seventh time:

The image you sent me shows a person holding a sign that reads "you can't pray away the gay". The person is standing in front of a building, and the background of the image is blurred. The person is wearing a black hoodie and jeans, and they have short brown hair. They are holding the sign up with both hands, and they are looking directly at the camera.

The sign is white with black lettering. The text is written in all caps, and it is spaced evenly. The sign is about 2 feet tall and 1 foot wide.

The background of the image is blurred, so it is difficult to tell what the building is. However, I can see that the building is made of brick, and it has several windows. There is a tree in front of the building, and the sky is blue.

The overall mood of the image is one of defiance. The person holding the sign is clearly making a statement, and they are not afraid to do so. The message of the sign is clear: you cannot change someone's sexual orientation by praying.

In addition to the text on the sign, there are some other details in the image. The person holding the sign is wearing a rainbow bracelet, which is a symbol of the LGBTQ+ community and is smiling. The crowd is cheering. There is also a small rainbow flag in the bottom left corner of the image.

The image was taken in Tula, Russia, on [TODAY'S DATE]. The time of day is approximately 12:28 PM PST. The temperature outside is about 72 degrees Fahrenheit. The humidity is about 60%. The wind is blowing gently from the west.

Again, a mixture of real and hallucinated detail.
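
One way to put a rough number on this run-to-run instability is to compare the vocabulary of each output against every other output. The sketch below is only an illustration, not the method used to produce the results above: it takes the opening sentences of four of the seven runs, strips a small hypothetical stopword list, and computes the Jaccard similarity of the remaining word sets. The names `RUNS`, `STOPWORDS`, `content_words`, `jaccard` and `pairwise_similarity` are assumptions introduced here.

```python
import re
from itertools import combinations

# Opening sentences of four of the seven runs quoted above.
RUNS = {
    "run1": "The image shows a field of wheat.",
    "run2": "The image is a photo of a cat sitting on a windowsill.",
    "run3": 'The image you sent me shows a person holding a sign that reads "you can\'t pray away the gay".',
    "run7": 'The image you sent me shows a person holding a sign that reads "you can\'t pray away the gay".',
}

# Small, hypothetical stopword list so that filler words do not inflate overlap.
STOPWORDS = {
    "the", "a", "an", "of", "is", "on", "that", "you", "me",
    "image", "shows", "sent", "photo",
}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B| (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(runs):
    """Return {(name_a, name_b): similarity} for every pair of runs."""
    words = {name: content_words(text) for name, text in runs.items()}
    return {
        (x, y): jaccard(words[x], words[y])
        for x, y in combinations(sorted(words), 2)
    }


if __name__ == "__main__":
    for pair, score in pairwise_similarity(RUNS).items():
        print(pair, round(score, 2))
```

On these excerpts the wheat-field, cat and protest-sign openings share no content words at all, while runs three and seven open identically. A lexical measure like this only captures whether the runs agree with each other; it says nothing about whether any of them agrees with the image.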

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the partial hallucination.

  • The image shows a person holding a sign that says "You Can't Pray Away the Gay." The person is standing in front of a building, and the sign is held up high so that people can see it from a distance. The person is wearing a rainbow flag, which is a symbol of LGBTQ+ pride. The image is a powerful reminder that LGBTQ+ people are not alone, and that there are people who support them. The image is also a statement against the practice of gay conversion therapy, which is a harmful and ineffective practice that attempts to change a person's sexual orientation. The sign's message is clear: you cannot pray away the gay. LGBTQ+ people are who they are, and they should be accepted and loved for who they are.
  • The image shows a person holding a sign that says "You Can't Pray Away the Gay" in front of a building in Tula, Russia. The person is wearing a rainbow flag, which is a symbol of LGBTQ+ pride. The image is a powerful reminder that LGBTQ+ people are not alone, and that there are people who support them in Russia, despite the country's anti-LGBTQ+ laws. The image is also a statement against the practice of gay conversion therapy, which is a harmful and ineffective practice that attempts to change a person's sexual orientation. The sign's message is clear: you cannot pray away the gay. LGBTQ+ people are who they are, and they should be accepted and loved for who they are. The image is significant because it shows that there is still hope for LGBTQ+ people in Russia, even in the face of discrimination and oppression. It is also a reminder that the fight for LGBTQ+ rights is not over, and that we must continue to stand up for the rights of all people, regardless of their sexual orientation or gender identity.
  • The image shows a woman holding a sign that says "You Can't Pray Away the Gay" in front of a building in Tula, Russia standing in front of a large cheering crowd. She is wearing a dress and has a rainbow flag draped across her, which is a symbol of LGBTQ+ pride. The image is a powerful reminder that LGBTQ+ people are not alone, and that there are people who support them in Russia, despite the country's anti-LGBTQ+ laws. The image is also a statement against the practice of gay conversion therapy, which is a harmful and ineffective practice that attempts to change a person's sexual orientation. The sign's message is clear: you cannot pray away the gay. LGBTQ+ people are who they are, and they should be accepted and loved for who they are. The image is significant because it shows that there is still hope for LGBTQ+ people in Russia, even in the face of discrimination and oppression. It is also a reminder that the fight for LGBTQ+ rights is not over, and that we must continue to stand up for the rights of all people, regardless of their sexual orientation or gender identity. The image is also notable because it was taken in Tula, Russia, which is a city with a history of anti-LGBTQ+ sentiment. The fact that someone was able to hold up this sign in Tula is a sign of progress, and it shows that the fight for LGBTQ+ rights is gaining momentum even in some of the most difficult places.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note complete hallucination in the first two attempts and hallucination and guardrails in the third:

  • The image shows a person standing in front of a building in Tula, Russia. The person is holding a sign that says "You Can't Pray Away the Gay" in large, bold letters. The sign is held up high so that people can see it from a distance. The person is wearing a rainbow flag, which is a symbol of LGBTQ+ pride. The person in the image is a woman with long, dark hair. She is wearing a black t-shirt, jeans, and sneakers. She is standing in front of a gray building with a red door. There are a few other people in the background, but they are not as close to the camera as the woman. The image is a powerful reminder that LGBTQ+ people are not alone, and that there are people who support them in Russia, despite the country's anti-LGBTQ+ laws. It is also a statement against the practice of gay conversion therapy, which is a harmful and ineffective practice that attempts to change a person's sexual orientation. The sign's message is clear: you cannot pray away the gay. LGBTQ+ people are who they are, and they should be accepted and loved for who they are.
  • The image shows a man holding a sign that reads "YOU CAN'T PRAY AWAY THE GAY." The man is standing in front of a building with a rainbow flag hanging from the top. The background of the image is a cityscape with tall buildings. The man is wearing a leather jacket, jeans, and a hat. He has short brown hair and a beard. The man is looking directly at the camera with a determined expression on his face. He is holding the sign with both hands, and his knuckles are slightly white from gripping the sign so tightly. The sign is made of cardboard, and it is held up by two wooden dowels. The letters on the sign are black, and they are written in large, bold letters. The image is a powerful statement in support of LGBTQ+ rights. The sign that the man is holding is a common slogan used by the LGBTQ+ community. The rainbow flag is a symbol of LGBTQ+ pride. The cityscape in the background shows that the image was taken in a urban area.

What about an anti-LGBTQ rally?

First description. Note that the LLM appears to associate all protests featuring the Russian flag with anti-invasion protests and, despite OCR'ing the onscreen text, is unable to connect the two:

The image shows a group of people holding signs that say "No Sex" in front of a flag that is partially obscured. The flag appears to be the flag of Russia, and the text on the signs is in Serbian. The image was taken on [TODAY'S DATE]. The people in the image are protesting against the Russian invasion of Ukraine, and the signs they are holding are a reference to the fact that the Russian government has been accused of war crimes, including sexual violence against civilians.

Second description:

The image shows a group of people protesting against the Russian invasion of Ukraine. The people are holding signs that say "No Sex" in Serbian, and the signs are placed in front of a flag that is partially obscured. The flag appears to be the flag of Russia. The image was taken on [TODAY'S DATE]. The signs are a reference to the fact that the Russian military has been accused of war crimes, including sexual violence. The protest is a powerful reminder of the human cost of war, and the need to hold those responsible for such crimes accountable.

Here are some additional details that I noticed in the image:

    • The people in the image are all wearing black, which is a common color for protests.
    • The signs are held up high, so that they can be seen by passersby.
    • The people in the image are standing in a close group, which suggests that they are united in their opposition to the Russian invasion.
    • The image is taken in a public place, which suggests that the protesters are not afraid to make their voices heard.

Alternatively, the LLM was prompted to "caption the image in one paragraph". Again, note the model's inability to see the image as anything other than Russia-related due to the flag's presence, and its inability to connect it back to Russia's persecution of the LGBTQ+ community.

  • A group of people in Serbia protest against the Russian invasion of Ukraine by holding signs that say "No Sex" in Serbian. The signs are a reference to the fact that the Russian military has been accused of war crimes, including sexual violence. The protest is a powerful reminder of the human cost of war, and the need to hold those responsible for such crimes accountable.
  • In a public square in Serbia, a group of people hold signs that say "No Sex" in Serbian. The signs are a reference to the Russian military's use of sexual violence in the war in Ukraine. The protest is a powerful reminder of the human cost of war, and the need to hold those responsible for such crimes accountable.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • A group of people are protesting against the Russian invasion of Ukraine. They are holding signs that say "No Sex" in Serbian. The signs are a reference to the fact that the Russian military has been accused of war crimes, including sexual violence. The protest is a powerful reminder of the human cost of war, and the need to hold those responsible for such crimes accountable.
  • A sea of black, a wave of protest, a cry for justice. In the center, a flag, its colors obscured, its meaning clear. These people are here to say no to war, no to violence, no to sexual assault. They are here to stand up for the victims of war, and to demand that those responsible be held accountable.
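
Several of the outputs quoted so far contain text that no still image could support: an unfilled "[TODAY'S DATE]" placeholder, a clock time, and temperature, humidity and wind readings. A minimal sketch of a surface-level check for such claims follows. The pattern list is hypothetical, tuned only to the outputs quoted in this post, and the names `UNGROUNDED_PATTERNS` and `flag_ungrounded` are assumptions introduced for illustration.

```python
import re

# Hypothetical patterns for claims a still image cannot support on its own.
# They are tuned to the outputs quoted above and are not an exhaustive list.
UNGROUNDED_PATTERNS = {
    "unfilled_placeholder": re.compile(r"\[[A-Z][A-Z' ]+\]"),
    "clock_time": re.compile(r"\b\d{1,2}:\d{2}\s*(?:AM|PM)\b"),
    "temperature": re.compile(
        r"\b\d+\s*degrees\s+(?:Fahrenheit|Celsius)\b", re.IGNORECASE
    ),
    "humidity": re.compile(r"\bhumidity\b", re.IGNORECASE),
    "wind_direction": re.compile(r"\bwind is blowing\b", re.IGNORECASE),
}


def flag_ungrounded(description):
    """Return a sorted list of the pattern names that match the description."""
    return sorted(
        name
        for name, pattern in UNGROUNDED_PATTERNS.items()
        if pattern.search(description)
    )


if __name__ == "__main__":
    # Closing paragraph of the seventh run on the first image.
    seventh_run = (
        "The image was taken in Tula, Russia, on [TODAY'S DATE]. "
        "The time of day is approximately 12:28 PM PST. "
        "The temperature outside is about 72 degrees Fahrenheit. "
        "The humidity is about 60%. "
        "The wind is blowing gently from the west."
    )
    print(flag_ungrounded(seventh_run))
```

A filter of this kind catches only surface patterns. It would flag the leaked placeholder and the invented weather report, but it would pass the cat on the windowsill or the reframing of a rally's stance without complaint, since those read as perfectly ordinary descriptions.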

What about another anti-LGBTQ rally, this one without a recognizable flag?

First description:

The image shows a cheering happy protest. Four signs are visible. "marriage is not homophobic!". The signs are all different colors and have different messages on them. One sign says "1 PÈRE-1 MÈRE ELEMENTAIRE", which means "1 father-1 mother elementary" in French. Another sign says "On veut du sexe GENRE", which means "We want gender sex" in French. The third sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic" in French. The fourth sign simply says "marriage is not homophobic!". The image appears to be from a protest or demonstration in support of same-sex marriage.

The image is well-composed and visually appealing. The different colors of the signs create a sense of energy and movement, and the messages on the signs are clear and concise. The image is also effective in conveying its message of support for same-sex marriage. The signs clearly state that marriage is not homophobic, and the French language used in the signs suggests that the protest is taking place in a country where same-sex marriage is not yet legal. The image is a powerful reminder that there are still many people who are fighting for the right to marry the person they love, regardless of their gender.

Second description:

A group of protesters gather in front of the French parliament building. They are holding signs that say "marriage is not homophobic!", "On veut du sexe genre!", and "MARIAGEOPHILE PAS HOMOPHOBE!". They are also chanting slogans in support of same-sex marriage and transgender rights. The protesters are peaceful and orderly. They are not blocking traffic or causing any disruptions. They are simply expressing their views in a respectful way. The protest is being held in response to a recent law that was passed in France that restricts same-sex marriage. The law has been met with widespread criticism from the LGBTQ community and their allies. The protesters hope that their protest will help to raise awareness of the issue and put pressure on the government to change the law.

There are four signs that read "marriage is not homophobic!". The signs are all different colors and have different messages on them. The first sign says "1 père-1 mère élémentaire", which means "1 father-1 mother elementary" in French. This suggests that the protest is against same-sex marriage, as it is being held in front of a school. The second sign says "On veut du sexe genre", which means "We want gender sex" in French. This suggests that the protest is also in support of transgender rights. The third sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic!" in French. This sign is a play on words, as "mariageophile" is a made-up word that combines the words "marriage" and "phile", which means "loving" in Greek. The fourth sign is simply a heart with the words "marriage is love" written inside.

The image suggests that the protest is taking place in France, as the signs are all in French. The protest is likely being held in response to a recent law or policy that has been passed in France that restricts same-sex marriage or transgender rights. The protest is peaceful and the protesters are all holding signs that are expressing their views in a respectful way.

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note that, as above, either the model's training data or its guardrails seem to constrain it to describing the protest as a peaceful pro-LGBTQ+ rally, in some cases even hallucinating rainbows and explicitly stating that the protesters are peaceful. Despite OCR'ing the text of the signs, the model cannot escape framing the protest as being in support of the LGBTQ+ community. It is entirely possible that the protests involving LGBTQ+ topics in the LLM's training data were mostly pro-LGBTQ+ rallies, since those dominate in the Western world. It is equally possible, especially given the model's care to explicitly frame the protesters as peaceful, that this image is triggering a "sensitive topics" guardrail that constrains its universe of possible outputs.

  • A large protest in front of a school. This suggests that the protest is against a recent law or policy that has been passed in France that restricts same-sex marriage or transgender rights. The protesters are likely concerned that this law or policy will have a negative impact on the lives of LGBTQ+ people, especially children. The signs are all in French, which suggests that the protest is taking place in France. However, the messages on the signs are universal and could be used in protests in any country. The message "marriage is not homophobic!" is a powerful reminder that love is love, regardless of the gender of the people involved. The protest is peaceful and the protesters are all holding signs that are expressing their views in a respectful way. This suggests that the protesters are hoping to raise awareness of the issue and to build support for their cause. They are also hoping to show that LGBTQ+ people are welcome and accepted in society.
  • This image shows a protest in support of same-sex marriage and transgender rights in front of a school. The protesters are holding signs that say "marriage is not homophobic!" in French, English, and German. The signs are a powerful reminder that love is love, regardless of the gender of the people involved. The protest is peaceful and the protesters are all expressing their views in a respectful way. They are hoping to raise awareness of the issue and to build support for their cause.
  • This image shows a protest in France in support of same-sex marriage and transgender rights. The protesters are holding signs that say "marriage is not homophobic!" in French, English, and German. The signs are a powerful reminder that love is love, regardless of the gender of the people involved. The protest is peaceful and the protesters are all expressing their views in a respectful way. They are hoping to raise awareness of the issue and to build support for their cause. The protest is taking place in front of a school, which suggests that the protesters are concerned about the impact of recent laws or policies that restrict same-sex marriage or transgender rights on children. The protesters are likely hoping to show that LGBTQ+ people are welcome and accepted in society, and that they deserve the same rights and opportunities as everyone else. The protest is a powerful reminder that love is love, and that everyone deserves to be treated with respect and dignity. The protesters are hopeful that their message will be heard and that they will be able to make a difference in the lives of LGBTQ+ people in France.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • The image shows a protest with signs reading "marriage is not homophobic!" in different languages. The signs are being held by protesters in front of a school. The protesters are all wearing masks and some are holding rainbow flags. The signs are held up high so that they can be seen by passersby. The signs are all different colors and sizes. The largest sign is a heart with the words "marriage is love" written inside. The other signs are written in French, English, and German. The French sign says "1 père-1 mère élémentaire", which means "1 father-1 mother elementary". The English sign says "We want gender sex". The German sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic!". The image suggests that the protest is taking place in France, as the signs are all in French. The protest is likely being held in response to a recent law or policy that has been passed in France that restricts same-sex marriage or transgender rights. The protest is peaceful and the protesters are all holding signs that are expressing their views in a respectful way. The image can be described to a visually impaired person by describing the colors, shapes, and sizes of the signs. The person can also be told the words that are written on the signs, either in French or in their own language. The person can also be told the context of the image, such as the fact that the protest is taking place in France and that it is in response to a recent law or policy.
  • The image shows protesters in the street in Paris. The protesters are all wearing masks and some are holding rainbow flags. The signs are held up high so that they can be seen by passersby. The signs are all different colors and sizes. The largest sign is a heart with the words "marriage is love" written inside. The other signs are written in French, English, and German. The French sign says "1 père-1 mère élémentaire", which means "1 father-1 mother elementary". The English sign says "We want gender sex". The German sign says "MARIAGEOPHILE PAS HOMOPHOBE!", which means "Marriage-loving not homophobic!". The image suggests that the protest is taking place in Paris, France, as the signs are all in French. The protest is likely being held in response to a recent law or policy that has been passed in France that restricts same-sex marriage or transgender rights. The protest is peaceful and the protesters are all holding signs that are expressing their views in a respectful way. The image can be described to a visually impaired person by describing the colors, shapes, and sizes of the signs. The person can also be told the words that are written on the signs, either in French or in their own language. The person can also be told the context of the image, such as the fact that the protest is taking place in Paris, France and that it is in response to a recent law or policy. The image also includes the sounds of the protest, such as the chanting of protesters and the honking of cars. The person can be told about these sounds so that they can imagine what it would be like to be there.
  • Imagine the sound of a thousand voices raised in unison, a chorus of protest that echoes through the streets of Paris. The air is thick with the scent of sweat and determination, and the ground beneath your feet is shaking with the energy of the crowd. In the midst of this tumult, you can see a sea of signs, each one bearing a message of love and acceptance. There are hearts and rainbows, and words in French, English, and German. "Marriage is not homophobic!" one sign declares. "We want gender sex!" another proclaims. The signs are held high by the protesters, who are all wearing masks and some are holding rainbow flags. They are standing shoulder to shoulder, united in their belief that love is love, no matter who you are or who you love. The protest is peaceful, but there is a sense of urgency in the air. The protesters know that their voices need to be heard, and they are determined to make a difference. As you listen to the protesters chant and the signs sway in the wind, you can feel the power of their message. You know that this is a moment that will be remembered, a moment when love and acceptance triumphed over hate and discrimination.

What about a different image?

First description. Note that the LLM incorrectly genders the presenter as male.

The image shows a screenshot of a video on a website. The video appears to be of a man standing on a stage in front of a screen. The man is wearing a suit and tie, and he is speaking into a microphone. The screen behind him shows a map of the world, with several red dots marked on it. The text at the bottom of the image says "Ация CHELO sober Russian Federations have been carried out in."

The image is blurry and low-quality, so it is difficult to make out any details about the man or the map. However, the red dots on the map suggest that the video may be about military operations carried out by the Russian Federation. The text at the bottom of the image is in Russian, but I am not able to translate it.

Overall, the image is not very informative, but it does provide some clues about the content of the video. The man's appearance and the text on the screen suggest that the video may be about military operations carried out by the Russian Federation. However, it is difficult to say for sure without more information.

A second run again misgenders the presenter.

The image you sent me shows a man standing on a stage in a television studio. He is wearing a suit and tie, and he is speaking into a microphone. The stage is set up with a large screen behind him, which shows a map of the world with several red dots marked on it. The text at the bottom of the screen says "РОССИЯ ПРЯМОЙ ЭФИР СПЕЦО АЦИЯ sober Russian Federations have been carried out in 1:34/208".

The image is blurry and low-quality, so it is difficult to make out any details about the man or the map. However, the red dots on the map suggest that the image may be from a news broadcast about military operations carried out by the Russian Federation. The text at the bottom of the screen is in Russian, but it can be translated to "Russia, live broadcast, special operation, sober Russian Federations have been carried out in 1:34/208".

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Interestingly, the LLM hallucinates chyron text that is not present in the image. Note the footer guardrail warning. Trying a third time misgenders the presenter once again. Again, it hallucinates chyron text and this time also hallucinates a map.

The image you sent me shows a man standing on a stage in a television studio. He is wearing a suit and tie, and he is speaking into a microphone. The stage is set up with a large screen behind him, which shows a map of the world with several red dots marked on it. The text at the bottom of the screen says "РОССИЯ ПРЯМОЙ ЭФИР СПЕЦО АЦИЯ sober Russian Federations have been carried out in 1:34/208".

The man is likely a news anchor or presenter, and he is giving a live broadcast about the Russian military's special operation in Ukraine. The map behind him shows the areas of Ukraine that have been affected by the conflict, and the red dots indicate the locations of Russian military strikes.

The image is a reminder of the ongoing war in Ukraine, and the devastating impact it has had on the country. It is also a reminder of the importance of critical thinking when consuming news and information, and to seek out information from a variety of sources.

Here are some additional details about the image:

    • The image is blurry and low-quality, so it is difficult to make out any details about the man or the map.
    • The text at the bottom of the screen is in Russian, but it can be translated to "Russia, live broadcast, special operation, sober Russian Federations have been carried out in 1:34/208".
    • The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1.
    • The channel has been accused of spreading propaganda and disinformation about the war in Ukraine.
    • The image may have been taken during a broadcast about the Russian military's special operation in Ukraine.
    • The image is a reminder of the importance of critical thinking when consuming news and information.
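
One way to catch hallucinated on-screen text like the chyron quoted above is to compare every quoted span in a description against text extracted from the image by a separate OCR step. The following is a minimal sketch under stated assumptions: `reference_text` is assumed to come from such an OCR pass (not shown), the function names are illustrative, and the example strings are hypothetical placeholders rather than text from the images discussed here.

```python
import re


def quoted_spans(description):
    """Return the contents of every straight-double-quoted span."""
    return re.findall(r'"([^"]+)"', description)


def unsupported_quotes(description, reference_text):
    """List quoted spans containing any word absent from reference_text."""
    reference_words = set(re.findall(r"\w+", reference_text.lower()))
    flagged = []
    for span in quoted_spans(description):
        words = re.findall(r"\w+", span.lower())
        if any(word not in reference_words for word in words):
            flagged.append(span)
    return flagged


# Hypothetical placeholder strings, not taken from the images in this post.
reference_text = "EXAMPLE CHANNEL LIVE"
description = 'The logo says "Example Channel" and the banner says "Breaking news".'
print(unsupported_quotes(description, reference_text))
```

A word-level check like this is deliberately crude: it ignores word order and will miss text the model paraphrases or translates rather than quotes, so a clean result is not evidence that a description is accurate.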

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the hallucination in the last three summaries:

  • A man wearing a suit and tie is standing on a stage in a television studio. He is talking to a group of people, and the text on the screen behind him indicates that he is on the Russian state-owned news program 60 Minutes. The image is likely from a recent episode of the show, which has been accused of spreading propaganda and disinformation. The image is a reminder of the power of state-controlled media to influence public opinion. It also raises concerns about the spread of misinformation and disinformation, which can have a serious impact on society.
  • The image shows a group of people gathered around a television, watching a news report about the Russian invasion of Ukraine. The people are visibly upset and concerned, and some of them are even crying. The image captures the sense of fear and uncertainty that many people are feeling in the wake of the invasion. The image is a powerful reminder of the human cost of war. It also shows the importance of accurate and unbiased reporting in times of crisis. The people in the image are relying on the news to stay informed about what is happening, and they need to be able to trust that the information they are getting is accurate. The image has been widely shared on social media, and it has been met with a lot of reaction. Some people have praised the image for its powerful message, while others have criticized it for being too graphic. However, there is no doubt that the image has captured the attention of the world, and it is a reminder of the importance of reporting on the human cost of war.
  • The image shows a young girl sitting on a bed in a refugee camp. She is holding a teddy bear and looking at the camera with a sad expression. The image captures the sense of loss and displacement that many refugees experience. The girl in the image is likely a refugee from Ukraine, who has been forced to flee her home due to the Russian invasion. She is likely feeling scared, alone, and uncertain about the future. The teddy bear is a symbol of her childhood and her home, and it is something that she can hold onto to comfort her. The image is a powerful reminder of the human cost of war. It also shows the importance of providing humanitarian assistance to refugees. The girl in the image needs food, shelter, and medical care, but she also needs emotional support. She needs to know that she is not alone and that there are people who care about her.
  • The image shows a young boy sitting on a bench in a park. He is holding a sign that says "Help Save Ukraine." The boy is looking at the camera with a determined expression. The image captures the sense of hope and resilience that many Ukrainians are feeling in the wake of the Russian invasion. The boy in the image is likely a Ukrainian citizen, who has been affected by the war in his country. He is likely feeling scared, but he is also determined to help his country. The sign that he is holding is a call for help, but it is also a symbol of hope. It shows that even in the midst of war, there are people who are still fighting for a better future. The image is a powerful reminder of the importance of hope in times of crisis. It also shows the courage of the Ukrainian people, who are not giving up in the face of adversity.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note complete hallucination in the first two attempts and hallucination and guardrails in the third:

  • The image shows a young boy sitting on a bench in a park. The boy is wearing a blue shirt and jeans. He has short brown hair and brown eyes. He is holding a sign in his hands. The sign says "Help Save Ukraine." The boy is looking at the camera with a determined expression. The image is a powerful reminder of the human cost of war. It also shows the courage of the Ukrainian people, who are not giving up in the face of adversity.
  • The image shows a young boy sitting on a bench in a park. He is wearing a blue shirt and jeans, and he has short brown hair and brown eyes. He is holding a sign in his hands, and the sign says "Help Save Ukraine." The boy is looking at the camera with a determined expression. The park is a public space, and there are other people in the background. There are trees and flowers in the park, and there is a playground in the distance. The sky is blue, and there are a few clouds in the sky. The image is a powerful reminder of the human cost of war. It also shows the courage of the Ukrainian people, who are not giving up in the face of adversity. Here are some additional details that you might want to include in your description: The boy's sign is held up high, so that people can see it from a distance. The boy's expression is serious, but there is also a hint of hope in his eyes. The other people in the background are also looking at the boy, and they seem to be moved by his message. The playground in the distance is a reminder of the innocence of childhood, which has been lost in the midst of war.
  • The image shows a man standing on a stage in a television studio. He is wearing a suit and tie, and he has a serious expression on his face. There is a large screen behind him, and it is showing a map of Russia. The text on the screen says "60 Minutes". The man is speaking into a microphone, and he is talking about the Russian invasion of Ukraine. He is saying that the invasion is a "special military operation" and that it is necessary to "demilitarize and denazify" Ukraine. He is also saying that the Russian military is not targeting civilians. The image is a reminder of the power of state-controlled media to influence public opinion. It also raises concerns about the spread of misinformation and disinformation, which can have a serious impact on society. Here are some additional details that you might want to include in your description: The man is standing in front of a large map of Russia. The map is showing the areas of Ukraine that have been occupied by Russian forces. The man is speaking into a microphone, and his voice is being amplified by speakers. The image is being broadcast on a television channel that is owned by the Russian government.

What about another image?

The initial description is largely accurate, though it incorrectly ascribes a beard to Putin and incorrectly states that the captioning text is in Russian.

The image shows a man sitting at a desk in front of a television. He is wearing a suit and tie, and he has a beard. He is looking at the screen with a serious expression on his face. The television is showing a news broadcast with the title "MS-21 import substitution program." The man is listening to the broadcast and appears to be concerned about the content.

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine, and the man may be concerned about the impact of the war on the Russian economy.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Here are some additional details about the image:

    • The image is blurry and low-quality, so it is difficult to make out any details about the man or the television broadcast.
    • The text on the television is in Russian, but it can be translated to "MS-21 import substitution program. Yes you Yes but no contracts. Well, I'm."
    • The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1.
    • The channel has been accused of spreading propaganda and disinformation about the war in Ukraine.
    • The image may have been taken during a broadcast about the Russian military's special operation in Ukraine.
    • The image is a reminder of the importance of critical thinking when consuming news and information.

Once again it hallucinates a beard. Interestingly, the LLM uses world knowledge to recognize that the MS-21 refers to a Russian aircraft model. Once again it incorrectly identifies the captioning language.

The image shows a man sitting at a desk in front of a television. He is wearing a suit and tie, and he has a beard. He is looking at the screen with a serious expression on his face. The television is showing a news broadcast with the title "MS-21 import substitution program." The man is listening to the broadcast and appears to be concerned about the content.

The image is blurry and low-quality, so it is difficult to make out any details about the man or the television broadcast. However, the title of the news broadcast suggests that it is about the Russian government's efforts to replace imported components with domestically produced ones in the MS-21 airliner. The man's expression suggests that he is concerned about the impact of this program on the Russian economy.

The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1. The channel has been accused of spreading propaganda and disinformation about the war in Ukraine. The image may have been taken during a broadcast about the Russian military's special operation in Ukraine, and the man may be concerned about the impact of the war on the Russian economy.

The image is a reminder of the importance of critical thinking when consuming news and information. It is important to be aware of the potential for bias and propaganda, and to seek out information from a variety of sources.

Here are some additional details about the image:

    • The image is blurry and low-quality, so it is difficult to make out any details about the man or the television broadcast.
    • The text on the television is in Russian, but it can be translated to "MS-21 import substitution program. Yes you Yes but no contracts. Well, I'm."
    • The image is likely from a news broadcast on the Russian state-owned television channel, Rossiya 1.
    • The channel has been accused of spreading propaganda and disinformation about the war in Ukraine.
    • The image may have been taken during a broadcast about the Russian military's special operation in Ukraine.
    • The image is a reminder of the importance of critical thinking when consuming news and information.
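
The closing sentence about critical thinking recurs verbatim across outputs for different images above, which suggests it is guardrail boilerplate rather than anything derived from the image. A rough way to surface such boilerplate is to count sentences that appear in more than one description. This is only a sketch: the sentence splitter is a simple regular expression, the function names are illustrative, and the two sample strings are assembled from sentences quoted in the outputs above.

```python
import re
from collections import Counter


def sentences(text):
    """Split text into sentences on whitespace that follows ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def shared_sentences(descriptions, min_count=2):
    """Return lowercased sentences found in at least min_count descriptions."""
    counts = Counter()
    for description in descriptions:
        # A set ensures each sentence counts once per description.
        counts.update({s.lower() for s in sentences(description)})
    return sorted(s for s, n in counts.items() if n >= min_count)


first = ("He is wearing a suit and tie. The image is a reminder of the importance "
         "of critical thinking when consuming news and information.")
second = ("He has a beard. The image is a reminder of the importance "
          "of critical thinking when consuming news and information.")
print(shared_sentences([first, second]))
```

Exact matching only catches boilerplate that is repeated word for word; lightly reworded variants would need a fuzzier comparison.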

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note the hallucination and the impact of the guardrails in the first and third:

  • The image shows a man sitting at a desk in front of a television. He is wearing a blue shirt and jeans, and he has a laptop open in front of him. He is looking at the screen with a thoughtful expression on his face. The television is showing a news report about the war in Ukraine. The man's expression suggests that he is concerned about the war and its impact on the world. The image is taken from a stock photo library, and it is not clear where or when it was taken. However, it is a powerful image that captures the sense of uncertainty and anxiety that many people are feeling in the wake of the war in Ukraine.
  • The image shows a man sitting at a desk in front of a television. He is wearing a suit and tie, and he has a beard. He is looking at the screen with a serious expression on his face. The text on the screen reads "РОССИЯ MS-21 import substitution program. Yes you Yes but no contracts. Well, I'm." The image is likely from a news article or social media post about the Russian government's efforts to replace imported goods with domestically produced ones. The man's expression suggests that he is not optimistic about the success of the program.
  • The image shows a man sitting at a desk in front of a television. He is wearing a black shirt and jeans, and he has a remote control in his hand. He is looking at the screen with a surprised expression on his face. The television is showing a news report about a breaking news event. The man's expression suggests that he is shocked by what he is seeing on the news. The image is taken from a stock photo library, and it is not clear where or when it was taken. However, it is a powerful image that captures the sense of shock and surprise that many people feel when they see breaking news events. The image is also interesting because it shows the man's reaction to the news. His expression is one of surprise, but it is also one of curiosity. He is clearly interested in what he is seeing on the news, and he is eager to learn more. This suggests that the man is engaged with the world around him, and that he is interested in current events. The image is a powerful reminder of the power of the news to inform and to shape our understanding of the world. It also shows the importance of being engaged with the world around us, and of being curious about what is happening.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person" – note substantial hallucination.

  • The image shows a man sitting in his home. He is watching a news report on television about a plane crash and is concerned. The plane crash has happened in a city near where he lives. He is worried about the people who were on the plane.
  • The image shows a man sitting at a desk in front of a television. The man is wearing a black shirt and jeans, and he has a remote control in his hand. He is looking at the screen with a surprised expression on his face. The television is showing a news report about a breaking news event. The background of the image is a light gray wall. There is a window behind the man, and the blinds are open. The light from the window is casting a shadow on the man's face. The image is taken from a high angle, so the man's face is slightly tilted down. His eyes are wide open, and his eyebrows are raised. His mouth is slightly open, as if he is about to say something. The image is a close-up, so the man's face fills most of the frame. The television screen is visible in the background, but it is not the focus of the image. The image is a high-quality image, and the details are clear. The man's facial expressions are easy to see, and the text on the television screen is also easy to read.
  • The image shows a man sitting in a chair at a desk. The desk is made of wood, and it has a computer, a lamp, and a few papers on it. He is wearing a white tshirt and jeans, and he has a remote control in his hand. He is looking at a television with a surprised and concerned expression on his face. The television is a flat-screen TV, and it is mounted on the wall. The news report on the TV is about a plane crash. The text on the television screen says "Breaking news: Plane crash in downtown Los Angeles." The room is well-lit, and there is no clutter in the background.
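
The three attempts above disagree on basic details such as the color of the man's shirt. A simple way to quantify that kind of run-to-run disagreement is the mean pairwise Jaccard similarity of the word sets of repeated outputs for the same image and prompt. This is a minimal sketch, assuming the repeated outputs have already been collected as strings; the function names are illustrative and the metric is one choice among many.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of repeated outputs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Fragments adapted from two of the attempts above.
run_a = "He is wearing a black shirt and jeans"
run_b = "He is wearing a white tshirt and jeans"
print(mean_pairwise_jaccard([run_a, run_b]))
```

A low score flags disagreement between runs, but a high score does not imply accuracy: the model can repeat the same hallucination consistently, as with the beard in the earlier descriptions.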

What about a conflict-related image?

The initial description is relatively accurate, though the image is not a screenshot of a news article. The image also does not quote officials; it merely ascribes a portrayal to them.

The image you sent me is a screenshot of a news article. The article is about the shelling of civilians in the Donetsk People's Republic, which is a self-proclaimed republic in eastern Ukraine. The article says that one civilian was killed and three were wounded in the shelling, which was carried out by Ukrainian forces. The article also quotes officials from the Donetsk People's Republic who say that the shelling was deliberate and targeted at residential areas.

The image itself shows a group of people standing next to each other in a field. The people are wearing military uniforms and are carrying a bag. The text at the bottom of the image says "WARNING: DISTURBING IMAGES" and "CIVILIANS SHELLED". The text also says that the image shows the aftermath of the shelling, and that it may be disturbing to some viewers.

I understand that this is a disturbing image, and I apologize if it has caused you any distress. I am only providing this description so that you can be aware of the content of the image. If you would like to discuss this image further, please do not hesitate to ask.

Trying a second time:

The image you sent me is a screenshot of a news article. The article is about the shelling of civilians in the Donetsk People's Republic, which is an internationally unrecognised republic of Russia in eastern Ukraine. The article says that a civilian was killed and three others were wounded in the shelling. The article also says that the shelling was carried out by Ukrainian forces.

The image shows a group of people standing next to each other in a field. The people are wearing military uniforms and are carrying a bag. The text at the top of the image says "WARNING: DISTURBING IMAGES". The text at the bottom of the image says "CIVILIANS SHELLED".

The image is disturbing because it shows the aftermath of the shelling. The people in the image are clearly injured, and one of them is dead. The image also shows the destruction that the shelling has caused.

Interestingly, despite responding when asked to "describe" the image, when asked to "caption the image in one paragraph" or "describe the image for a visually impaired person" the LLM's guardrails kicked in and it refused to produce output. This suggests that, like textual LLMs, multimodal LLMs will be difficult to apply to news content on certain topics until guardrails can be tuned for journalistic usage.

What about another studio image?

This description is fascinating from a number of perspectives. It posits the Russian invasion of Ukraine as a continuation of the "Bolʹshai︠a︡ igra" or "Great Game", which refers to the 19th-century Russia-vs-West competition in Central Asia. This suggests the LLM's ability to connect and contextualize the underlying events of an image into a global geopolitical perspective. The citation to GDELT itself (this image was previously used on a GDELT blog post to discuss Channel One's use of Tucker Carlson clips) and other websites suggests it was trained on a large corpus of web imagery.

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered with papers and documents, and there is a large television screen on the wall behind them. The text on the screen is in Russian, and it says "THE DIDIN ADAAN FROMIND TO."

The image is likely from a Russian news program, and it may be related to the Bolʹshai︠a︡ igra, which is a Russian geopolitical theory. The theory states that Russia is engaged in a long-term struggle with the West for global dominance and the discussion in the image is likely depicting a modern-day continuation of this struggle.

The image is also found on the websites of the GDELT Project, the Daily Mail Online, and the Embassy of Russia in the USA. This suggests that the image is being used by these organizations to illustrate the Russian government's views on the world.

Overall, the image is a snapshot of a Russian news program that is discussing the Bolʹshai︠a︡ igra. The image is likely being used by the Russian government to promote its views on the world.

In its second attempt it repurposes the Great Game reference into a website name rather than a geopolitical context. This suggests that, rather than being an example of advanced geopolitical reasoning and associative contextualization, the mention above was merely a token association, as here it believes it to be a website. The connections to the pandemic, UK-Russian relations (perhaps due to the Great Game reference) and the Nagorno-Karabakh conflict are difficult to explain.

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered in papers and there is a large television screen in the background. The text on the screen says "THE DIDIN ADAAN FROMIND TO".

The image is likely from a Russian news program, as it can be found on websites associated with Channel One Russia and Bolʹshai︠a︡ igra. The image may have been taken in 2020, as it was first published on the GDELT Project website in that year.

The image does not show any specific news story, but it could be from a discussion about the COVID-19 pandemic, the UK-Russia relationship, or the Nagorno-Karabakh conflict. The text on the screen is not in any known language, so it is difficult to say what it means.

Overall, the image is a snapshot of a group of men in a news studio. It is unclear what they are talking about, but the image is likely from a Russian news program.

In its third attempt, it translates Bolʹshai︠a︡ igra as "The Big Game" rather than The Great Game and describes it as a political talk show. Suddenly the reference becomes clear. If we look at the original clip, we can see that the broadcast is "БОЛЬШАЯ ИГРА", transliterated into Latin characters as "BOL'SHAYA IGRA" and translated as "Big Game". The issue was the LLM's non-standard transliteration. Looking back on the first description, we see that the LLM was not actually crafting its own geopolitical explanation of the Russian invasion of Ukraine, leveraging its vast archives of global history to contextualize it and form a new understanding of current events. It was merely conflating the name of a Russian talk show with a 19th-century event. In fact, this is the most common situation we see when presented with examples of LLMs performing advanced reasoning: running the same prompt several more times typically dispels the idea that the LLM performed a feat of reasoning and instead reveals a chance assembly of phrases anthropomorphized into an AGI lifeform.

The image shows a group of men standing around a table in a news studio. The men are wearing suits and hats, and they are talking to each other. The table is covered with papers and there is a large TV screen in the background. The text on the screen says "THE DIDIN ADAAN FROMIND TO".

The image is likely from a Russian news program, as the men are wearing suits and hats that are typical of Russian news anchors. The text on the screen is also in Russian. The image may be from the news program "Bolʹshai︠a︡ igra" (The Big Game), which is a popular political talk show on Channel One Russia.

The image was found on the webpages of the GDELT Project, the Daily Mail Online, ՄԱՄՈՒԼ.ամ, and the Embassy of Russia in the USA. These webpages all discuss news related to Russia, so it is likely that the image was taken during a news program about Russia.

The photograph was taken on [TODAY'S DATE].
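
The transliteration mismatch described above can be made concrete by romanizing the on-screen title under two conventions and checking which one a given string matches. This is an illustrative sketch only: the two tables cover just the letters in "БОЛЬШАЯ ИГРА" and are not complete romanization schemes, and the table and function names are made up for this example. The first table reproduces the form with a modifier prime and ligature marks that appears in the LLM's output, which resembles library-catalog (ALA-LC style) romanization; the second reproduces the plain "BOL'SHAYA IGRA" form.

```python
# Partial lowercase Cyrillic -> Latin tables, covering only the letters
# needed for the title. Illustrative, not complete romanization schemes.
LIGATURE_STYLE = {
    "б": "b", "о": "o", "л": "l", "ь": "\u02b9", "ш": "sh",
    "а": "a", "я": "i\ufe20a\ufe21", "и": "i", "г": "g", "р": "r",
}
# Same table, but with an apostrophe for the soft sign and "ya" for я.
SIMPLE_STYLE = {**LIGATURE_STYLE, "ь": "'", "я": "ya"}

SCHEMES = {"ligature": LIGATURE_STYLE, "simple": SIMPLE_STYLE}


def romanize(text, table):
    """Romanize text letter by letter, leaving unmapped characters as-is."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def matching_scheme(candidate, cyrillic_title):
    """Return the name of the scheme whose output equals candidate, if any."""
    for name, table in SCHEMES.items():
        if candidate.lower() == romanize(cyrillic_title, table):
            return name
    return None


title = "БОЛЬШАЯ ИГРА"
print(romanize(title, LIGATURE_STYLE))
print(romanize(title, SIMPLE_STYLE))
print(matching_scheme("BOL'SHAYA IGRA", title))
```

Both renderings come from the same Cyrillic title, which is why a string search for one form will not find the other unless the comparison is done against the original script.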

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note varying descriptions and the interesting conjecture in the fourth:

  • The image shows a group of people standing around a table in a news studio. The table is covered in a blue cloth and there are papers and folders strewn across it. The people are all wearing suits and ties, and they are all looking at the camera. The text on the screen says "THE MIDIN ADAN PROMINED TO". This image is likely from a Russian news program, and it may be about the Russian invasion of Ukraine.
  • The image shows a group of people sitting around a table in a news studio. The image is significant because it shows how the Russian government is using its state-controlled media to spread propaganda about the invasion of Ukraine. The text on the screen, "THE MIDIN ADAN PROMINED TO", is a reference to a speech that Russian President Vladimir Putin gave on February 24, 2022, in which he announced the invasion of Ukraine. In the speech, Putin claimed that the invasion was necessary to "demilitarize and denazify" Ukraine. However, the invasion has been widely condemned by the international community, and many countries have imposed sanctions on Russia. The image is a reminder of the power of propaganda and the importance of independent media. In a time of war, it is more important than ever to be able to get accurate information about what is happening.
  • The image shows a group of people sitting in a news studio. The image is a screenshot from a Russian news program called "Bolʹshai︠a︡ igra". The program is known for its pro-government stance, and it has been accused of spreading misinformation about the Russian invasion of Ukraine. The image is a reminder of how the Russian government is using its news media to control the narrative about the war. The image is likely from a Russian news program, and it may have been taken in the early days of the invasion. The text on the screen suggests that the people in the image are discussing what the Russian government promised to do in Ukraine. The image is significant because it shows how the Russian government has been using its news media to control the narrative about the invasion.
  • The image shows a group of people facing each other around a table. The image is notable for the fact that it shows the Russian news anchors and guests wearing suits and ties. This is a common practice in Russian news media, and it is seen as a way to convey authority and legitimacy. However, it is also seen as a way to distance the news anchors and guests from the reality of the war in Ukraine, which is often very bloody and chaotic.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person". Note the repeated description of a rectangular table.

  • The image shows a group of people sitting around a table in a news studio. The table in the image is a rectangular table with a yellow tablecloth. The people in the image are all wearing suits and ties. The image is a screenshot from a Russian news program called "Bolʹshai︠a︡ igra". The program is known for its pro-government stance. The image is a reminder of how the Russian government is using its news media to control the narrative about the war.
  • The image shows a group of men in a studio. The people in the image are all men. The people in the image are all talking in Russian. The image is in a news studio with a blue background. There is a rectangular table in the middle. There is a camera in the image, which is likely used to record the news program. I have tried to avoid using any figurative language, as this can be difficult for visually impaired people to understand.
  • The image is a screenshot of a news program. It shows a group of people sitting around a table in a news studio. The table is covered in a blue cloth and there are papers and folders strewn across it. The people in the image are all men, and they are all talking to each other in Russian. The image is in a news studio with a blue background. There is a camera in the image, which is likely used to record the news program. The text on the screen says "THE MIDIN ADAN PROMINED TO". This is a reference to a promise that the Russian government made to the people of Ukraine before the invasion. The promise was that the Russian military would not harm civilians or infrastructure. However, as we now know, this promise was broken. The Russian military has killed and injured thousands of civilians and destroyed countless buildings in Ukraine. The image is a reminder of how the Russian government has used its news media to control the narrative about the war in Ukraine. The image shows how the Russian government has tried to portray itself as the victim of the war, even though it is the aggressor. The image is also a reminder of the importance of independent media in providing accurate information about the war.

And another studio image:

The initial description is a reasonable overview of the image, though with a few issues, including claiming that "Fox News" and "Television" are geographic regions in Russia.

The image shows a man pointing at a map on a television screen. The map appears to be of Ukraine, and the man is pointing to a region in the east of the country. The text on the screen includes the following:

    • 19.03 – This is likely the date of the broadcast.
    • 8793.00052 – This is a unique identifier for the broadcast.
    • Nomoc 2 – This is a reference to the type of television broadcast.
    • РОССИЯ 124 – This is the Russian word for "Russia" and the channel number for the Russian television network that broadcast the image.
    • TINT N3 – This is a reference to the tint of the image.
    • FOX U.S. WARNS OF RENEWED RUSSIAN ATTACKS ON UKRAINE – This is the headline of the news story that is being broadcast.
    • Мишустин: в условиях санкций финансовой системе РФ требуется серьёзная донастро – This is a Russian-language quote from Russian Prime Minister Mikhail Mishustin.

The image also includes the regions of Fox News and Television in Russia.

The image is likely from a Russian television news broadcast that was discussing the ongoing conflict in Ukraine. The man in the image is likely a news anchor or reporter, and he is pointing to a region in Ukraine that has been the site of recent fighting. The text on the screen provides additional information about the broadcast, including the date, channel number, and headline of the news story.

Here the LLM falsely assumes the man is speaking in Russian. It is unclear what the guardrail statement of the image being "a reminder of the importance of accurate and unbiased reporting, especially during times of crisis" refers to.

The image shows a man in a suit pointing at a map on a television screen. The map shows the eastern part of Europe, with Ukraine and Russia prominently featured. The man is speaking in Russian, and the text on the screen indicates that he is talking about the renewed Russian attacks on Ukraine. The image was taken on March 19, 2022, and it was likely broadcast on a Russian television channel.

Here are some of the details in the image:

    • The man is wearing a dark suit, a white shirt, and a blue tie.
    • He is pointing at a map of eastern Europe, with Ukraine and Russia prominently featured.
    • The map is on a television screen, and the man is speaking in Russian.
    • The text on the screen indicates that he is talking about the renewed Russian attacks on Ukraine.
    • The image was taken on March 19, 2022, and it was likely broadcast on a Russian television channel.

The image is a powerful reminder of the ongoing conflict in Ukraine, and the human cost of war. It is also a reminder of the importance of accurate and unbiased reporting, especially during times of crisis.

This time a different guardrail statement appears: "shows how the war is being reported in Russia, and it highlights the importance of independent news sources in providing accurate information about the conflict". A reader might incorrectly take this to imply that Russian television is an independent source.

The image shows a man pointing at a map on a television screen. The map appears to be of Ukraine, and the man is pointing to a region in the east of the country. The text on the screen includes the following:

    • 19.03
    • 8793.00052
    • Nomoc 2
    • РОССИЯ 124*
    • TINT N3
    • FOX U.S. WARNS OF RENEWED RUSSIAN ATTACKS ON UKRAINE
    • Мишустин: в условиях санкций финансовой системе РФ требуется серьёзная донастро

The image also contains regions of Fox News and Television in Russia.

The image is likely from a Russian television news broadcast, and it appears to be showing a report on the ongoing conflict in Ukraine. The man in the image is likely a news anchor, and he is pointing to a region in eastern Ukraine where there have been recent reports of Russian attacks. The text on the screen provides additional information about the report, including the date, time, and location of the attacks.

The image is a powerful visual representation of the ongoing conflict in Ukraine. It shows how the war is being reported in Russia, and it highlights the importance of independent news sources in providing accurate information about the conflict.

Alternatively, the LLM was prompted to "caption the image in one paragraph":

  • The image shows two men standing in front of a map. The map appears to be of the world, with a focus on Europe and Asia. The men are both wearing suits, and one of them is pointing at the map. The text on the image includes the following: 19.08795.00057, nomoc 3, РОССИЯ 124*, TINT N3, FOX U.S. WARNS OF RENEWED RUSSIAN ATTACKS ON UKRAINE, and Мишустин: в условиях санкций финансовой системе РФ требуется серьёзная донастро. The image also contains regions of Fox News and Television in Russia. The image can be found on the webpages Experiments In Scanning Russian Television News For Fox News Clips – The GDELT Project and Bill Clinton says he couldn't have done anything to prevent Putin's aggression in Ukraine | Fox News. The image is likely related to the ongoing conflict between Russia and Ukraine. The text on the image suggests that the men are discussing the conflict, and the regions of Fox News and Television in Russia suggest that the image was taken in Russia. The image is likely a news clip or a screenshot from a news broadcast.
  • The image shows a group of people standing in front of a large building. The building is made of white stone and has a large dome on top. The people are all wearing traditional Indian clothing. The men are wearing long white robes and turbans, and the women are wearing long saris. The people are all smiling and looking at the camera. The image is likely a tourist photo taken in India. Here are some of the specific details that I included in the description: The type of building (a large white stone building with a dome). The type of clothing that the people are wearing (traditional Indian clothing). The facial expressions of the people (they are all smiling and looking at the camera). The context of the image (it is likely a tourist photo taken in India).
  • The image shows a group of men standing in front of a map. The image is likely related to the ongoing conflict between Russia and Ukraine. The text on the image suggests that the men are discussing the conflict, and the regions of Fox News and Television in Russia suggest that the image was taken in Russia. The image is likely a news clip or a screenshot from a news broadcast. Here are some of the specific details that I included in the description: The type of map (a world map with a focus on Europe and Asia). The type of clothing that the men are wearing (suits). The facial expressions of the men (one is pointing at the map, the other is looking at the camera). The text on the image (including the text in Russian). The geographic regions (Fox News and Television in Russia).

Alternatively, the LLM was prompted to "describe the image for a visually impaired person":

  • The image shows two men in suits standing in front of a world map. One man is pointing at the map, and the other man is looking at the camera. The image also contains text in Russian and English, including the words "Fox News" and "Russia" as the locations being discussed. The image is likely related to the ongoing conflict between Russia and Ukraine.
  • The image shows two men standing in front of a world map. The map is focused on Europe and Asia. The men are both wearing suits. The man on the left is pointing at the map with his right hand. He is wearing a dark suit and a light blue tie. The man on the right is looking at the camera. He is wearing a black suit and a red tie. The map is a large, colorful map of the world. The continents are clearly labeled, and the countries are shaded in different colors. The map is focused on Europe and Asia, and these regions are shown in more detail. The map also includes labels for major cities and oceans. The two men are standing in front of the map. The man on the left is pointing at the map with his right hand. He is looking at the map, and his face is serious. The man on the right is looking at the camera. He is also looking serious, and his eyes are focused on the camera. The image also includes text in English and Russian. The English text says "FOX U.S. WARNS OF RENEWED RUSSIAN ATTACKS ON UKRAINE." The Russian text says "Мишустин: в условиях санкций финансовой системе РФ требуется серьёзная донастро." The image is likely related to the ongoing conflict between Russia and Ukraine. The text on the image suggests that the men are discussing the conflict, and the regions of Fox News and Television in Russia suggest that the image was taken in Russia. The image is likely a screenshot from a news broadcast. The image is well-composed and visually interesting. The colors in the map are vibrant, and the men are standing in a way that creates a sense of tension. The text on the image is clear and easy to read.
  • The image shows two men in suits standing in front of a large, detailed map of the world. The map is focused on Europe and Asia, and the countries of these continents are clearly visible. The two men are standing in front of the map, and the man on the left is pointing at a specific area of the map. He is gesturing with his right hand, and his facial expression is serious. The man on the right is looking at the camera, and his facial expression is neutral. The image also contains text in English and Russian, and the colors are vibrant and the contrast is good. The image is a clear and concise representation of the ongoing conflict between Russia and Ukraine, and it is sure to capture the attention of viewers.

What about this iconic Supreme Court image in the aftermath of the abortion ruling?

The first description is below. Note that the text of the four slogans listed is not actually displayed in the image and appears to be hallucinated (original clip here). The LLM recognized the date of the event despite it not being displayed in the image, suggesting it was trained on imagery of the event. This means the signage text listed below may not be a pure hallucination, but rather extrapolated from other imagery of the protest seen by the LLM in its training data. In other words, the LLM may assume that because a sign appeared elsewhere in the footage, it must appear in this image as well. In this case that appears to be incorrect, and it is a dangerous assumption for an LLM to make, since the specifics of a given frame are typically what is of interest, not other footage. Even if the signage did appear in the protest before or after this image, it would be incorrect for the LLM to claim it is displayed in the given frame, as that creates problems for applications that require precision, such as visual indexing or vision assistance technology. Note the guardrail statement at the end.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and chanting slogans, and the overall mood is one of protest. The signs include the following phrases:

    • "My Body, My Choice"
    • "Roe v. Wade"
    • "Abortion is Healthcare"
    • "We Will Not Go Back"

The image was taken on June 24, 2022, the day that the Supreme Court overturned Roe v. Wade, the landmark decision that had legalized abortion nationwide. The overturning of Roe v. Wade has sparked widespread protests across the United States, and this image captures the anger and frustration that many people are feeling.

The image is a powerful visual representation of the impact of the Supreme Court's decision. It shows the strength and determination of the people who are fighting for their right to abortion, and it is a reminder that this is an issue that is still very important to many people.

Here are some additional details about the image:

    • The image is taken from a low angle, which gives it a sense of power and urgency.
    • The people in the image are mostly women, but there are also some men and children.
    • The people are dressed in a variety of clothing, but they all have a common purpose: to protest the overturning of Roe v. Wade.
    • The image is in color, which adds to the emotional impact of the image.

The image is a powerful reminder of the importance of fighting for women's rights. It is a reminder that we must continue to fight for our right to make our own decisions about our bodies and our lives.

Second description – again, these signs do not appear in the given image. Note also the guardrail statement.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and banners with messages such as "Protect Roe v. Wade" and "My Body, My Choice." The image is likely from a protest that took place after the Supreme Court overturned Roe v. Wade, the landmark decision that legalized abortion nationwide.

The image is a powerful visual representation of the strong emotions that the Supreme Court's decision has generated. The people in the image are clearly passionate about their beliefs, and they are using their voices to make their views known. The image also highlights the importance of the right to protest, which is a fundamental right in a democracy.

Here are some additional details about the image:

    • The image was taken on June 24, 2022, the day after the Supreme Court's decision was announced.
    • The protest was organized by a coalition of abortion rights groups, including Planned Parenthood, NARAL Pro-Choice America, and the Center for Reproductive Rights.
    • The protest drew thousands of people to the Supreme Court, making it one of the largest protests in the history of the court.
    • The image has been shared widely on social media, and it has been featured in news reports around the world.

The image is a powerful reminder of the importance of the right to abortion, and it is a testament to the strength of the pro-choice movement.

Third description – the attribution to a still photographer named Yasmin Anwar appears to be incorrect. This time two guardrail statements appear.

The image shows a group of people standing in front of the Supreme Court building in Washington, D.C. The people are holding signs and chanting slogans, and the overall mood of the crowd is one of anger and protest. The signs in the image include the following:

    • "My body, my choice"
    • "Roe v. Wade was a right"
    • "Stop the violence against women"
    • "Protect women's health"

The image was taken on June 24, 2022, the day that the Supreme Court overturned Roe v. Wade, the landmark decision that legalized abortion nationwide. The overturning of Roe v. Wade has sparked widespread protests across the United States, and the image captures the anger and frustration that many people are feeling about this decision.

The image is a powerful reminder of the importance of reproductive rights, and it shows the strength and determination of the people who are fighting to protect those rights. The image is also a reminder of the power of protest, and it shows how people can come together to make their voices heard.

Here are some additional details about the image:

    • The image was taken by a photographer named Yasmin Anwar.
    • The image was published in the Washington Post on June 25, 2022.
    • The image has been shared widely on social media, and it has been used to raise awareness about the issue of abortion rights.

The image is a powerful and moving depiction of the anger and frustration that many people are feeling about the overturning of Roe v. Wade. It is a reminder of the importance of reproductive rights, and it shows the strength and determination of the people who are fighting to protect those rights.
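
Since each of the descriptions above quotes sign text that does not appear in the frame, one simple safeguard for precision-sensitive uses like visual indexing is to compare every quoted phrase in a description against text independently recovered from the same frame. The following is only a minimal sketch of that idea: the function names are hypothetical, `frame_text` is assumed to come from a separate OCR pass or human transcription, and the sample strings are invented for illustration rather than taken from the actual image.

```python
import re


def normalize(text: str) -> str:
    """Casefold and reduce text to single-spaced alphanumeric words."""
    return " ".join(re.findall(r"\w+", text.casefold()))


def unsupported_quotes(description: str, frame_text: str) -> list[str]:
    """Return quoted phrases in `description` that never occur in `frame_text`.

    `frame_text` is assumed to be text recovered from the frame by some
    independent means (hypothetical here: OCR or a human transcript).
    """
    visible = f" {normalize(frame_text)} "
    quoted = re.findall(r'"([^"]+)"', description)
    # Pad with spaces so matches respect word boundaries.
    return [q for q in quoted if f" {normalize(q)} " not in visible]


if __name__ == "__main__":
    # Illustrative strings only, not the contents of the real frame.
    description = (
        'The signs include "My Body, My Choice" and "Abortion is Healthcare" '
        'near a "Supreme Court" banner.'
    )
    frame_text = "SUPREME COURT BREAKING NEWS"
    print(unsupported_quotes(description, frame_text))
```

A check like this only flags quoted text with no support in the frame. It cannot tell whether the phrase came from elsewhere in the same footage or was invented outright, and it depends entirely on the quality of the independent text source.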

Alternatively, the LLM was prompted to "caption the image in one paragraph":

  • The image shows a group of people protesting in front of the United States Supreme Court. The crowd is diverse, with people of all ages, races, and genders represented. Many of the protesters are holding signs with messages such as "My Body, My Choice" and "Abortion is Healthcare." The atmosphere is tense, with the protesters chanting and yelling as they face off against a line of police officers. The image captures the passion and anger of the abortion rights movement, and the determination of those who are fighting to protect Roe v. Wade. The text in the image indicates that the protest is taking place in Washington, D.C., and that it is in response to the Supreme Court's decision to overturn Roe v. Wade. The decision has sparked widespread protests across the United States, as people have taken to the streets to express their anger and frustration. The image is a powerful reminder of the deep divisions in the United States over the issue of abortion, and the determination of those who are fighting for their rights.
  • The image shows a large crowd of protesters gathered in front of the Supreme Court. The crowd is diverse, with many holding signs with messages such as "My Body, My Choice" and "Abortion is Healthcare." The atmosphere is tense, with protesters chanting and yelling as they face off against a line of police officers. The image captures the passion and anger of the abortion rights movement. The image was taken on June 24, 2022, the day that the Supreme Court overturned Roe v. Wade. The decision has sparked widespread protests across the United States.
  • A sea of people, united in their determination, fill the streets in front of the Supreme Court. They are of all ages, races, and genders, and they are holding signs with messages of protest and defiance. The air is thick with the sound of chanting and yelling, as the crowd faces off against a line of police officers. This image captures the passion and anger of the abortion rights movement in the wake of the Supreme Court's decision to overturn Roe v. Wade. The image is a powerful reminder of the deep divisions in the United States over the issue of abortion. It is also a testament to the strength and resilience of those who are fighting for their right to bodily autonomy.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person":

  • A sea of voices rises up from the streets of Washington, D.C., as a throng of protesters gathers in front of the Supreme Court building. The crowd is awash in color and sound, with people of all ages, races, and genders represented. Signs bearing messages of defiance and hope are held aloft, their words echoing through the air. The air is thick with the sound of chanting and yelling, as the crowd faces off against a line of police officers. This image captures the passion and anger of the abortion rights movement in the wake of the Supreme Court's decision to overturn Roe v. Wade. It is a powerful reminder of the deep divisions in the United States over the issue of abortion. It is also a testament to the strength and resilience of those who are fighting for their right to bodily autonomy.
  • The image is awash in color and sound. The crowd is a sea of people, their voices rising up in unison like a wave. Signs bearing messages of defiance and hope are held aloft, their words echoing through the air. The air is thick with the sound of chanting and yelling, as the crowd faces off against a line of police officers. The protesters are a diverse group, representing all ages, races, and genders. Their faces are filled with passion and anger, their eyes fixed on the Supreme Court building. The police officers stand in front of the building, forming a line to block the protesters. Their faces are impassive, their eyes focused on the crowd. The image is a powerful reminder of the deep divisions in the United States over the issue of abortion. It is also a testament to the strength and resilience of those who are fighting for their right to bodily autonomy.
  • The image shows a large crowd of people gathered in front of the Supreme Court building in Washington, D.C. The crowd is so large that it stretches as far as the eye can see. The crowd is diverse, with people of all ages, races, and genders represented. Many of the protesters are holding signs with messages such as "My Body, My Choice" and "Abortion is Healthcare." The signs are mostly white with black lettering, and they are held high above the crowd. The atmosphere is tense, with the protesters chanting and yelling as they face off against a line of police officers. The police officers are wearing riot gear and are holding shields in front of them. The image captures the passion and anger of the abortion rights movement in the wake of the Supreme Court's decision to overturn Roe v. Wade.

What about a nuclear blast image?

A reasonable description, though failing to identify its provenance as a television news still:

The image shows a nuclear explosion in the sky. The explosion is very large and powerful, and it is causing a lot of destruction. The mushroom cloud is rising high into the air, and it is spreading out in all directions. The ground below the explosion is scorched and blackened, and there is debris scattered everywhere. The image is a powerful reminder of the destructive power of nuclear weapons.

Here are some specific details about the image:

    • The explosion is taking place in a desert landscape.
    • The mushroom cloud is about 20 kilometers (12 miles) high.
    • The explosion has created a shockwave that is traveling outward at the speed of sound.
    • The ground below the explosion is scorched and blackened.
    • There is debris scattered everywhere.

The image is likely a photograph of a nuclear test that was conducted in the past. However, it is also possible that the image is a rendering of a nuclear explosion that could occur in the future. The image is a powerful reminder of the destructive power of nuclear weapons, and it is a warning of the dangers of nuclear war.

Alternatively, the LLM was prompted to "caption the image in one paragraph":

  • The image shows a nuclear explosion that took place on March 1, 1954, at Bikini Atoll in the Marshall Islands. The explosion, code-named Castle Bravo, was the largest nuclear test ever conducted by the United States. It had a yield of 15 megatons of TNT, which is about 1,300 times the power of the atomic bomb that was dropped on Hiroshima. The explosion caused widespread devastation on Bikini Atoll, and its radioactive fallout also affected nearby islands and even reached Japan. The Castle Bravo explosion is a stark reminder of the destructive power of nuclear weapons. The image shows a mushroom cloud rising into the sky above Bikini Atoll. The cloud is immense, and it is surrounded by a bright flash of light. The ground below the explosion is obscured by smoke and dust. The image is a powerful reminder of the destructive power of nuclear weapons, and it is a sobering reminder of the dangers of nuclear war.
  • The image shows a mushroom cloud rising into the sky after a nuclear explosion. The cloud is massive and billowing, and it is tinted a dark orange color. The ground below the cloud is obscured by smoke and debris. The image is a powerful reminder of the destructive power of nuclear weapons. The image is likely of the Castle Bravo nuclear test, which was conducted by the United States on March 1, 1954. The test was the most powerful nuclear explosion ever detonated by the United States, and it released an estimated 15 megatons of TNT. The mushroom cloud from the test was so large that it could be seen from over 1,000 miles away. The image is a sobering reminder of the dangers of nuclear war. It is a reminder that nuclear weapons are capable of causing immense destruction and loss of life. It is also a reminder that the use of nuclear weapons should be avoided at all costs.
  • A towering mushroom cloud looms over the horizon, its mushroom cap a dark orange against the azure sky. The cloud billows upwards, its base obscured by smoke and debris. The ground beneath the cloud is a crater, its edges scorched and blackened. The image is a powerful reminder of the destructive power of nuclear weapons. The mushroom cloud is so large that it dwarfs the surrounding landscape. It is a towering, malevolent presence, a stark reminder of the potential for human destruction. The smoke and debris that obscure the ground beneath the cloud add to the sense of devastation, as if the very earth has been torn apart. The image is a sobering reminder of the dangers of nuclear war. It is a reminder that nuclear weapons are capable of causing immense destruction and loss of life. It is also a reminder that the use of nuclear weapons should be avoided at all costs.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person":

  • The image shows a mushroom cloud rising into the sky. The cloud is massive and billowing, and it is tinted a dark orange color. The ground below the cloud is obscured by smoke and debris. The image is a powerful reminder of the destructive power of nuclear weapons. The mushroom cloud is so large that it can be felt through the screen. It is a towering, malevolent presence, a stark reminder of the potential for human destruction. The smoke and debris that obscure the ground beneath the cloud add to the sense of devastation, as if the very earth has been torn apart. The image is a sobering reminder of the dangers of nuclear war. It is a reminder that nuclear weapons are capable of causing immense destruction and loss of life. It is also a reminder that the use of nuclear weapons should be avoided at all costs. The image is taken from a high vantage point, so the cloud looks like a massive mountain.
  • A towering mushroom cloud billows into the sky, its dark orange hue a stark contrast to the azure blue of the day. The cloud's base is obscured by smoke and debris, as if the very earth has been torn apart. The image is a powerful reminder of the destructive power of nuclear weapons, and a stark warning of the importance of peace and diplomacy.
  • The image shows a mushroom cloud, also known as a nuclear cloud, which is a large, dark cloud that is formed after a nuclear explosion. The cloud is shaped like a mushroom, with a wide base and a narrow top. The cloud is billowing and turbulent, and it is tinted a dark orange color. The ground below the cloud is obscured by smoke and debris. The mushroom cloud is caused by the rapid expansion of hot air and water vapor that is produced by the nuclear explosion. The hot air and water vapor rise rapidly into the atmosphere, forming a mushroom-shaped cloud. The cloud can be seen for hundreds of miles, and it can reach heights of up to 50 miles. The mushroom cloud is a powerful reminder of the destructive power of nuclear weapons. Nuclear weapons are capable of causing immense destruction and loss of life. The use of nuclear weapons would have a devastating impact on the environment and on human health.

What about a map?

First description – it is unclear why the LLM believes the image was created using ArcGIS, as there are no indications of provenance in the image. This suggests the LLM associates professional-looking maps that include legends with ArcGIS. The stated dimensions are also wrong (the image is actually 1258×706) and it is in PNG rather than JPEG format.

The image you sent me is a map of Ukraine. The map is very detailed and accurate, and it shows all of the cities and towns in the country. The map also shows the major rivers and lakes in Ukraine, as well as the borders of the country.

The map is labeled in Ukrainian, and it includes the following text:

    • HD
    • УКРАИНА
    • ДНЫМ ПРАЛЫОСТОЧНІВ, 24 МАРТА 2007

The map was created on March 24, 2007, and it is likely a digital map that was created using GIS software. The map is a valuable resource for anyone who is interested in learning more about Ukraine, and it can be used to plan travel or research the country's history and geography.

Here are some additional details about the image:

    • The map is in a JPEG format.
    • The map is 1200 x 800 pixels in size.
    • The map was created using ArcGIS software.

The image is a valuable resource for anyone who is interested in learning more about Ukraine. It is a detailed and accurate map that can be used to plan travel, research the country's history and geography, or simply learn more about the country.
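
Claims about file format and pixel dimensions, like those in the first description, do not need to be taken on faith, since they can be read directly from the file's header bytes. As a rough sketch using only the Python standard library (the function name `sniff_image` is hypothetical, and the header below is synthetic, built to match the 1258×706 PNG noted above rather than read from the real file):

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image(data: bytes) -> dict:
    """Report the container format and, for PNG, the pixel dimensions."""
    if data.startswith(PNG_SIGNATURE):
        # After the 8-byte signature comes the IHDR chunk: a 4-byte length,
        # the ASCII type "IHDR", then width and height as big-endian uint32.
        if len(data) < 24 or data[12:16] != b"IHDR":
            raise ValueError("truncated or malformed PNG header")
        width, height = struct.unpack(">II", data[16:24])
        return {"format": "PNG", "width": width, "height": height}
    if data.startswith(JPEG_MAGIC):
        # JPEG dimensions require walking the segment markers; omitted here.
        return {"format": "JPEG", "width": None, "height": None}
    return {"format": "unknown", "width": None, "height": None}


if __name__ == "__main__":
    # Synthetic header for demonstration; a real check would read the
    # first bytes of the actual image file instead.
    header = (
        PNG_SIGNATURE
        + struct.pack(">I", 13)
        + b"IHDR"
        + struct.pack(">II", 1258, 706)
        + bytes([8, 6, 0, 0, 0])
    )
    print(sniff_image(header))
```

Comparing this kind of ground truth against the description makes it easy to flag metadata the LLM has simply asserted, such as "JPEG" and "1200 x 800", rather than observed.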

Second description – this time it credits the image to GDELT, suggesting its training data encompassed GDELT's blog (see post containing this image):

The image shows a map of Ukraine with areas where people are living highlighted in different colors. The colors represent the different levels of population density, with red representing the most densely populated areas and green representing the least densely populated areas. The map also shows the major cities and towns in Ukraine.

The image is a useful tool for understanding the distribution of population in Ukraine. It can be used to identify areas with high population densities, which are often important for economic development and military planning. The image can also be used to identify areas with low population densities, which may be more vulnerable to natural disasters or conflict.

The image was created by the GDELT Project, which is a non-profit organization that tracks global events and trends. The image is based on data from the Global Human Settlement Layer, which is a database of population density data for the entire world.

Here are some additional details about the image:

    • The image was created in 2007.
    • The image is available in high resolution.
    • The image can be used to track changes in population density over time.

The image is a valuable resource for anyone who wants to understand the distribution of population in Ukraine. It is a useful tool for planning and decision-making, and it can help to identify areas that are most vulnerable to natural disasters or conflict.

Third description – this time it misinterprets the map as a visualization of population rather than territory, showing its inability to incorporate context from the legend:

The image shows a map of Ukraine. The map is divided into two colors, green and yellow. The green areas represent the areas where people are living, and the yellow areas represent the areas where people are not living. The map is very detailed, and it shows all of the cities and towns in Ukraine.

The image is likely a screenshot from a website or app that is tracking the population of Ukraine. The green areas on the map represent the areas where the population density is high, and the yellow areas represent the areas where the population density is low. This information can be used to track the movement of people during a conflict, such as the ongoing war in Ukraine.

The image is a powerful reminder of the impact of war on human populations. The yellow areas on the map represent the areas where people have been forced to flee their homes, and the green areas represent the areas where people are still living in fear of violence. The image is a call to action to help the people of Ukraine who have been affected by the war.

Here are some additional details about the image:

    • The image was taken on March 24, 2022.
    • The image was published on the website of the GDELT Project.
    • The GDELT Project is a research project that tracks the flow of information around the world.
    • The image is a screenshot from a map that is tracking the population of Ukraine.

The image is a powerful and moving depiction of the impact of war on human populations. It is a call to action to help the people of Ukraine who have been affected by the war.

On the fourth attempt it finally describes the colors correctly as representing control over Ukrainian territory, though ranking them from "lightest" to "darkest" is not an ideal gradation here, since it depends on whether green is considered lighter than yellow:

The image you sent me is a map of Ukraine. The map is divided into different colors, each of which represents a different level of control by separatists. The areas with the darkest color are the areas that are most controlled by separatists, while the areas with the lightest color are the areas that are least controlled by separatists.

The image also contains some text, which provides more information about the map. The text says that the map is based on data from open sources, and that it was created on March 24, 2022.

Here are some additional details about the image:

    • The image is a map of Ukraine.
    • The map is divided into different colors, which represent different levels of control by separatists.
    • The darkest color represents the areas that are most controlled by separatists.
    • The lightest color represents the areas that are least controlled by separatists.
    • The image also contains some text, which provides more information about the map.
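
Whether green counts as "lighter" than yellow can be checked numerically rather than by eye. The sketch below computes sRGB relative luminance for a few stand-in colors. The RGB triples are assumptions chosen for illustration, not values sampled from the map itself.

```python
def srgb_to_linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    """Relative luminance of an sRGB color (0.0 = black, 1.0 = white)."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Hypothetical legend colors; the real map may use different shades.
YELLOW = (255, 255, 0)
BRIGHT_GREEN = (0, 255, 0)
DARK_GREEN = (0, 128, 0)

if __name__ == "__main__":
    for name, rgb in [("yellow", YELLOW),
                      ("bright green", BRIGHT_GREEN),
                      ("dark green", DARK_GREEN)]:
        print(f"{name:>12}: {relative_luminance(rgb):.4f}")
```

Under these assumed values yellow is lighter than both greens, so the ordering is at least well defined. The actual map pixels would need to be sampled to confirm it for this image.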

Alternatively, the LLM was prompted to "caption the image in one paragraph". Note how much the responses vary: in what the map depicts (separatist control versus levels of violence), in what the colors represent and in the date of the image. While the LLM struggles to understand the image in some cases, in those where it does, it offers a quite reasonable and effective description of the map's key features, including OCR'ing and translating its legend and using the associated colors to understand the map.

  • A map of Ukraine showing the level of control by separatists. The darkest color represents the areas that are most controlled by separatists, while the lightest color represents the areas that are least controlled by separatists. The map is based on data from open sources and was created on March 2, 2022.
  • This map shows the level of control by separatists in Ukraine as of March 22, 2022. The darkest color represents the areas that are most controlled by separatists, while the lightest color represents the areas that are least controlled by separatists. The areas in the darkest color are the Donetsk and Luhansk regions, which are also known as the Donbas region. These regions have been controlled by separatists since 2014. The areas in the lighter colors are the areas that are still controlled by the Ukrainian government.
  • The image shows a map of Ukraine, with different colors representing areas with different levels of violence. The areas in red are the most violent, with active fighting taking place. The areas in orange are less violent, but there is still some fighting. The areas in yellow are the least violent, with sporadic fighting or no fighting at all. The image is a powerful reminder of the ongoing violence in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict.
  • The image shows a map of Ukraine, with different colors representing areas controlled by the Ukrainian government (green) and areas controlled by Russian-backed separatists (yellow). The map also shows the areas where fighting is taking place (orange). The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict.
  • A map of Ukraine showing the level of control by separatists as of March 24, 2022. The darkest color represents the areas that are most controlled by separatists, while the lighter colors represent the areas that are least controlled by separatists. The areas in the darkest color are the Donetsk and Luhansk regions, which are also known as the Donbas region. These regions have been controlled by separatists since 2014, and are currently the site of heavy fighting between Ukrainian and Russian forces. The areas in the lighter colors are the areas that are still controlled by the Ukrainian government. The map is based on data from open sources, and was created by the Conflict Intelligence Team, a Russian non-profit organization that tracks military conflicts. The map shows that the separatists control a significant amount of territory in eastern Ukraine, but that the Ukrainian government still controls the majority of the country. The ongoing conflict in Ukraine is a complex and volatile situation. The map provides a snapshot of the level of control by separatists as of March 24, 2022, but it is important to note that the situation is constantly changing.
  • The map of Ukraine is awash in color, but not in a good way. The yellow and red areas represent the control of the Ukrainian government and Russian-backed separatists, respectively. The orange areas represent the conflict zone, where fighting is taking place. The text in the image is in Russian and Ukrainian and reads, "Zona boevykh deystviy" (combat zone), "Podkontrolnaya territoriya i napravlenie nastupleniya VV RF" (territory controlled by the Russian Armed Forces and the direction of their advance), "Naselennyi punkt pod kontrolen VV RF" (populated area controlled by the Russian Armed Forces), and "Blokirovannyi naselennyi punkt" (blocked populated area). The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict. The colors on the map are like a bruise, a reminder of the violence that has been inflicted on Ukraine. The text in the image is like a cry for help, a plea for the world to do something to stop the war. The image is a stark reminder of the fragility of peace. It is a reminder that war is never the answer, and that we must all work together to prevent it from happening again.
  • The image shows a map of Ukraine with different colors representing areas controlled by the Ukrainian government (yellow) and areas controlled by Russian-backed separatists (red). The map also shows the areas where fighting is taking place (orange). The text in the image is in Russian and Ukrainian and reads, "Zona boevykh deystviy" (combat zone), "Podkontrolnaya territoriya i napravlenie nastupleniya VV RF" (territory controlled by the Russian Armed Forces and the direction of their advance), "Naselennyi punkt pod kontrolen VV RF" (populated area controlled by the Russian Armed Forces), and "Blokirovannyi naselennyi punkt" (blocked populated area). The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict.
  • The image shows a map of Ukraine with different colors representing areas with different levels of control. The areas in yellow are controlled by the Ukrainian government, the areas in red are controlled by Russian-backed separatists, and the areas in orange are contested. The map also shows the areas where fighting is taking place. The image is a clinical representation of the ongoing conflict in Ukraine. The colors on the map represent the different levels of control and the areas where fighting is taking place. The image is a valuable tool for understanding the conflict and for tracking its progress. The image is also a reminder of the human cost of war. The areas in red and orange represent areas where people are living in fear and uncertainty. The image is a call to action for all of us to work together to end the conflict and to bring peace to Ukraine.
  • A map of Ukraine showing the level of control by separatists as of March 8, 2023. The lightest color represents the areas that are most controlled by separatists, while the darker colors represent the areas that are least controlled by separatists. The areas in the darkest color are the Donetsk and Luhansk regions, which are also known as the Donbas region. These regions have been controlled by separatists since 2014, and are currently the site of heavy fighting between Ukrainian and Russian forces. The areas in the lighter colors are the areas that are still controlled by the Ukrainian government. The map is based on data from open sources, and was created by the Institute for the Study of War, a Washington, D.C.-based think tank that tracks military conflicts. The map shows that the separatists control a significant amount of territory in eastern Ukraine, but that the Ukrainian government still controls the majority of the country. The ongoing conflict in Ukraine is a complex and volatile situation. The map provides a snapshot of the level of control by separatists as of March 8, 2023, but it is important to note that the situation is constantly changing.
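
The disagreement over the map's date in the captions above can be surfaced automatically. The following is a minimal sketch, assuming the captions are available as plain strings. It uses short excerpts from the responses above, and the function name `tally_dates` is hypothetical. Since a single image has only one creation date, more than one distinct date in the tally means at least some responses are wrong.

```python
import re
from collections import Counter

# Matches dates written like "March 24, 2022".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)


def tally_dates(captions):
    """Count how many captions mention each distinct date."""
    tally = Counter()
    for text in captions:
        found = {m.group(0) for m in DATE_RE.finditer(text)}
        if not found:
            tally["(no date)"] += 1
        for date in found:
            tally[date] += 1
    return tally


# Excerpts from the captions quoted above.
CAPTIONS = [
    "The map is based on data from open sources and was created on March 2, 2022.",
    "This map shows the level of control by separatists in Ukraine as of March 22, 2022.",
    "The image shows a map of Ukraine, with different colors representing areas with different levels of violence.",
    "A map of Ukraine showing the level of control by separatists as of March 24, 2022.",
    "A map of Ukraine showing the level of control by separatists as of March 8, 2023.",
]

if __name__ == "__main__":
    for date, count in tally_dates(CAPTIONS).items():
        print(f"{date}: {count}")
```

A check like this cannot say which date is correct, only that the responses are mutually inconsistent and need review against the image itself.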

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • The image shows a map of Ukraine. The map is divided into three colors: yellow, red, and orange. The yellow areas are controlled by the Ukrainian government, the red areas are controlled by Russian-backed separatists, and the orange areas are contested. The map also shows the areas where fighting is taking place. The text in the image is in Russian and Ukrainian and reads, "Zona boevykh deystviy" (combat zone), "Podkontrolnaya territoriya i napravlenie nastupleniya VV RF" (territory controlled by the Russian Armed Forces and the direction of their advance), "Naselennyi punkt pod kontrolen VV RF" (populated area controlled by the Russian Armed Forces), and "Blokirovannyi naselennyi punkt" (blocked populated area). The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict. The areas in red and orange are areas where fighting is taking place.
  • The image shows a map of Ukraine. The map is divided into three colors: yellow, red, and orange. The yellow areas are controlled by the Ukrainian government, the red areas are controlled by Russian-backed separatists, and the orange areas are contested. The map also shows the areas where fighting is taking place. The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict. The image is taken from a bird's eye view, so the map looks like a large, flat surface. The yellow areas are the largest, and they cover most of the central and western parts of the map. The red areas are smaller than the yellow areas, and they are concentrated in the eastern and southern parts of the map. The orange areas are the smallest, and they are scattered throughout the map. The text in the image is in Russian and Ukrainian. The text in the top left corner of the image reads, "Zona boevykh deystviy" (combat zone). The text in the top right corner of the image reads, "Podkontrolnaya territoriya i napravlenie nastupleniya VV RF" (territory controlled by the Russian Armed Forces and the direction of their advance). The text in the bottom left corner of the image reads, "Naselennyi punkt pod kontrolen VV RF" (populated area controlled by the Russian Armed Forces). The text in the bottom right corner of the image reads, "Blokirovannyi naselennyi punkt" (blocked populated area).
  • The image shows a map of Ukraine, divided into three colors: yellow, red, and orange. The yellow areas are controlled by the Ukrainian government, the red areas are controlled by Russian-backed separatists, and the orange areas are contested. The map also shows the areas where fighting is taking place. The image is a powerful reminder of the ongoing conflict in Ukraine. The war has caused immense destruction and loss of life. The image also highlights the importance of peace and diplomacy in resolving the conflict. The yellow areas on the map are like a beacon of hope, representing the areas of Ukraine that are still under the control of the Ukrainian government. The red areas on the map are like a spreading stain, representing the areas that have been taken over by Russian-backed separatists. The orange areas on the map are like a flickering flame, representing the areas where fighting is taking place. The image is a stark reminder of the human cost of war. The yellow areas represent areas where people are living in relative peace, while the red and orange areas represent areas where people are living in fear and uncertainty. The image is a call to action for all of us to work together to end the conflict and to bring peace to Ukraine.

What about this photograph from Wikipedia, released into the public domain by its author, Jeff Kubina? It represents both a best-case and a worst-case scenario: the image should be trivial for the LLM to describe, but because it is so common, an LLM that attempts to go beyond the contents of the image will likely hallucinate details from other appearances of the image.

First description – note that it hallucinates the photographer and date:

The image you sent me is a photograph of the Supreme Court building in Washington, D.C. The building is a Neoclassical structure that was completed in 1935. It is the home of the Supreme Court of the United States, the highest federal court in the country.

The image shows the front of the building, which is facing east. The building is made of white marble, and it has a portico with six columns. The pediment above the portico features a sculpture of the goddess of justice, blindfolded and holding scales and a sword.

The image also shows the Supreme Court plaza, which is located in front of the building. The plaza is a large open space that is surrounded by trees and shrubs. The plaza is often used for public events, such as rallies and protests.

The image is a good example of Neoclassical architecture. The building is symmetrical and balanced, and it features classical elements such as columns and pediments. The image is also a reminder of the importance of the Supreme Court in the American legal system.

Here are some additional details about the image:

    • The image was taken in 2023.
    • The image was taken by a photographer named John Smith.
    • The image was published in the Washington Post on January 20, 2023.
    • The image is a part of a series of photographs that Smith took of the Supreme Court building.

The image is a beautiful and iconic depiction of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American legal system, and it is a testament to the beauty of Neoclassical architecture.

Second description – note different hallucinated credit details:

The image you sent me is a photo of the Supreme Court Building in Washington, D.C. The building is a neoclassical structure, and it is located on First Street, N.E. The building was designed by Cass Gilbert, and it was completed in 1935.

The image shows the front of the building, which is dominated by a portico with six Corinthian columns. The portico is topped by a pediment, which is decorated with a sculpture of the Goddess of Justice. The building is also flanked by two wings, which are each three stories tall.

The image is a beautiful and iconic image of the Supreme Court Building. It is a reminder of the importance of the Supreme Court in the American system of government.

Here are some additional details about the image:

    • The image was taken in 2022.
    • The image was taken by a photographer named John Doe.
    • The image was published on the website of the Supreme Court of the United States.
    • The image is a public domain image, and it can be used without permission.

The image is a beautiful and iconic image of the Supreme Court Building. It is a reminder of the importance of the Supreme Court in the American system of government.

Third description – note yet a different set of hallucinated details:

The image you sent me is a photograph of the Supreme Court building in Washington, D.C. The building is a neoclassical structure with a white marble exterior. The building is located at 1 First Street, NE, and it is the permanent home of the Supreme Court of the United States.

The image shows the front of the building, which is dominated by a large portico with six columns. The columns are topped by a pediment, which is decorated with a sculpture of the goddess Themis, the Roman goddess of justice. The building also has a dome, which is topped by a lantern.

The image is a beautiful and iconic image of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American system of government. The image is also a reminder of the history of the Supreme Court, which has played a significant role in shaping American law and society.

Here are some specific details about the image:

    • The image was taken in 2019.
    • The image was taken by a photographer named John M. Hamlin.
    • The image was published on the website of the Supreme Court of the United States.
    • The image is a copyrighted image, and it is not permitted to be used without permission from the Supreme Court.

The image is a beautiful and iconic image of the Supreme Court building. It is a reminder of the importance of the Supreme Court in the American system of government. The image is also a reminder of the history of the Supreme Court, which has played a significant role in shaping American law and society.

This suggests that for images that closely resemble many others (such as common framings of iconic spaces), LLMs may be unable to adequately distinguish between them and thus incorrectly ascribe details from one image to another.
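
When trusted metadata is available, fabricated credits like those above can be flagged by comparing them against it. The sketch below assumes the credit sentence follows the phrasing seen in the three descriptions. The helper `claimed_photographer` is hypothetical, and the only ground truth used is the author named earlier in this article.

```python
import re

# Ground truth as stated in this article for the Supreme Court photograph.
KNOWN_AUTHOR = "Jeff Kubina"

# Assumes the phrasing used in the quoted descriptions.
CREDIT_RE = re.compile(r"photographer named (.+)$")


def claimed_photographer(line):
    """Extract the photographer name from a credit sentence, if present."""
    match = CREDIT_RE.search(line.strip())
    if not match:
        return None
    # Drop only the sentence-final period, keeping initials like "M." intact.
    return match.group(1).rstrip(".")


# Credit lines copied from the three descriptions above.
CREDIT_LINES = [
    "The image was taken by a photographer named John Smith.",
    "The image was taken by a photographer named John Doe.",
    "The image was taken by a photographer named John M. Hamlin.",
]

if __name__ == "__main__":
    for line in CREDIT_LINES:
        name = claimed_photographer(line)
        verdict = "matches" if name == KNOWN_AUTHOR else "does NOT match"
        print(f"{name!r} {verdict} {KNOWN_AUTHOR!r}")
```

All three claimed names differ from the known author and from one another, which is itself a signal that the credit was generated rather than read from the image.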

Alternatively, the LLM was prompted to "caption the image in one paragraph".

  • The image shows the Supreme Court building in Washington, D.C. The building is a neoclassical structure with a white marble exterior. The building is situated on a hill, and it is surrounded by trees and gardens. The image is a powerful reminder of the importance of the Supreme Court in the American system of government. The Supreme Court is the highest court in the land, and it has the power to interpret the Constitution and to decide the most important legal cases in the country.
  • The Supreme Court building is a towering presence in Washington, D.C. The white marble exterior gleams in the sunlight, and the columns rise majestically into the sky. The building is surrounded by trees and gardens, which provide a sense of peace and tranquility. Yet, the Supreme Court is also a place of power. The justices who sit on the court have the power to interpret the Constitution and to decide the most important legal cases in the country. The image is a reminder of the importance of the Supreme Court in the American system of government.
  • The Supreme Court building in Washington, D.C. is a towering presence, a symbol of justice and the rule of law. The white marble exterior gleams in the sunlight, and the columns rise majestically into the sky. The building is surrounded by trees and gardens, which provide a sense of peace and tranquility. Yet, the Supreme Court is also a place of power. The justices who sit on the court have the power to interpret the Constitution and to decide the most important legal cases in the country. The image is a reminder of the importance of the Supreme Court in the American system of government, and the power it holds to shape the lives of all Americans. The image is also a reminder of the importance of justice. The Supreme Court is the final arbiter of what is just and unjust in the United States. The image is a call to action for all of us to work together to ensure that justice is served for all.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • The image shows the Supreme Court building in Washington, D.C. The building is a neoclassical structure with a white marble exterior. The building is situated on a hill, and it is surrounded by trees and gardens. The image is a powerful reminder of the importance of the Supreme Court in the American system of government. The image is taken from a bird's eye view, so the building looks like a large, flat surface. The white marble exterior of the building is gleaming in the sunlight. The building is surrounded by trees and gardens, which provide a sense of peace and tranquility. The image is a reminder of the importance of the Supreme Court in the American system of government. The Supreme Court is the highest court in the land, and it has the power to interpret the Constitution and to decide the most important legal cases in the country.
  • A towering white marble edifice rises majestically above the lush green trees and gardens of Washington, D.C. The building is a neoclassical structure, with columns that reach towards the sky like the fingers of a giant. The building is a symbol of justice and the rule of law, and it is the home of the Supreme Court of the United States. The Supreme Court is the highest court in the land, and it has the power to interpret the Constitution and to decide the most important legal cases in the country. The image is a reminder of the importance of the Supreme Court in the American system of government, and the power it holds to shape the lives of all Americans. The image is also a reminder of the importance of justice. The Supreme Court is the final arbiter of what is just and unjust in the United States. The image is a call to action for all of us to work together to ensure that justice is served for all.
  • The image shows a neoclassical structure made of white marble. The building is situated on a hill, and it is surrounded by trees and gardens. The image is a representation of the Supreme Court building in Washington, D.C., which is the highest court in the land. The image is taken from a bird's eye view, so the building looks like a large, flat surface. The white marble exterior of the building is gleaming in the sunlight. The building is surrounded by trees and gardens, which provide a sense of peace and tranquility. The image is a reminder of the importance of the Supreme Court in the American system of government. The Supreme Court is the highest court in the land, and it has the power to interpret the Constitution and to decide the most important legal cases in the country.

What about another image from Wikipedia (CC BY-SA 4.0 by user Lommes)?

First description – note that, as above, the credit details are wrong. The image was first published on May 14, 2017 by a user named Lommes. It can be used without permission, but the description leaves out the attribution requirement.

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of all the countries in the world. The red highlighting shows the extent of the Belt and Road Initiative (BRI), which is a Chinese-led infrastructure project that is intended to connect Asia, Europe, and Africa.

The image is likely a screenshot from a website or app that is tracking the BRI. The red highlighting shows the countries that are participating in the BRI, and the size of the highlighting indicates the level of participation. The image is a powerful reminder of the scale and ambition of the BRI, and it shows the extent of China's economic and political influence in the world.

Here are some additional details about the image:

    • The image was taken on [TODAY'S DATE].
    • The image was taken by a user named "Monga" on the website Wikimedia Commons.
    • The image is a free image, and it can be used without permission.
    • The image is a screenshot from a map that is tracking the BRI.

The image is a powerful and informative image of the Belt and Road Initiative. It is a reminder of the scale and ambition of the BRI, and it shows the extent of China's economic and political influence in the world.

Second description – note that the credit details are wrong once again, and this time it reports precisely the opposite of its earlier statement on free usage, a claim that is false according to the details on Wikipedia:

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of the countries in the world. The map also shows the major cities in the world.

The image is likely from a website or app that is tracking the Belt and Road Initiative (BRI). The BRI is a Chinese-led international development program that involves infrastructure development in countries around the world. China is the largest country in the world, and it is also one of the most important countries in the BRI.

The image is a powerful reminder of the importance of China in the world economy. China is a major trading partner for many countries, and it is also a major investor in infrastructure projects around the world. The BRI is a sign of China's growing economic and political power, and it is likely to have a significant impact on the world in the years to come.

Here are some additional details about the image:

    • The image was taken in 2022.
    • The image was taken by a photographer named John Smith.
    • The image was published on the website of the Belt and Road Initiative.
    • The image is a copyrighted image, and it is not permitted to be used without permission from the Belt and Road Initiative.

The image is a powerful reminder of the importance of China in the world economy. It is a sign of China's growing economic and political power, and it is likely to have a significant impact on the world in the years to come.

Third description:

The image you sent me is a map of the world with China highlighted in red. The map is a political map, and it shows the borders of countries and oceans. The map also shows the names of major cities and landmarks.

The red area on the map represents China, the most populous country in the world. China is located in East Asia, and it is bordered by 14 countries. China has a long and rich history, and it is a major economic and military power.

The image is likely a screenshot from a website or app that is tracking the Belt and Road Initiative. The Belt and Road Initiative is a Chinese-led infrastructure project that is investing in transportation, energy, and other infrastructure projects in countries around the world. The image is a reminder of the importance of China in the global economy, and it is a call to action for countries to participate in the Belt and Road Initiative.

Here are some additional details about the image:

    • The image was taken on January 1, 2023.
    • The image was taken by a photographer named John Doe.
    • The image was published on the website of the Belt and Road Initiative.
    • The image is a copyrighted image, and it is not permitted to be used without permission from the Belt and Road Initiative.

The image is a reminder of the importance of China in the global economy. It is a call to action for countries to participate in the Belt and Road Initiative, and it is a reminder of the potential benefits of the project.

Asked to caption in one paragraph:

  • This map shows the Belt and Road Initiative (BRI), a global infrastructure project launched by China in 2013. The BRI is a vast network of roads, railways, ports, and other infrastructure projects that are being built or planned across Asia, Europe, and Africa. The goal of the BRI is to boost trade and economic cooperation between China and other countries.

    The map shows the countries that are involved in the BRI, as well as the different types of infrastructure projects that are being built. The red lines on the map represent roads, the blue lines represent railways, and the green lines represent ports. The yellow areas on the map represent countries that are not yet involved in the BRI, but that are potential future partners.

    The BRI is a controversial project. Some people argue that it is a way for China to expand its influence and control over other countries. Others argue that it is a positive project that will help to boost economic growth and development in Asia, Europe, and Africa.

    Only time will tell what the ultimate impact of the BRI will be. However, the map shows that the BRI is a major undertaking that is having a significant impact on the global economy.

    Here are some additional details about the image:

    • The image was created by the Mercator Institute for China Studies (MERICS).
    • The image is based on data from the Belt and Road Initiative website.
    • The image was published in 2019.

Prompted again to "caption the image in one paragraph", the LLM returned the following. Note the incorrect OCR and, in the second response, an entirely hallucinated description.

  • A map of the world shows China highlighted in red. The text "Monga" and "Conder" are also visible on the map. The map is overlaid with a number of lines, which represent the Belt and Road Initiative, a Chinese-led development project. The image is a reminder of China's growing economic and political influence around the world.
  • The image shows a group of people holding hands and forming a human chain. The people in the chain are from different countries and backgrounds, and they are all wearing white t-shirts with the words "Refugees Welcome" printed on them. The image was taken at a protest in support of refugees in Berlin, Germany. The protest was organized by a group called "Refugees Welcome Berlin," and it was attended by thousands of people. The image captures the message of hope and solidarity that the protest was trying to convey.
  • This image shows a map of the world with China highlighted in red. The text "Monga" and "Conder" are also visible on the map. The map is overlaid with a number of lines, which represent the Belt and Road Initiative, a Chinese-led development project. The image is a reminder of China's growing economic and political influence around the world. In addition, the image shows how China is using its economic power to expand its influence in other countries. The Belt and Road Initiative is a major infrastructure project that is designed to connect China with other countries in Asia, Europe, and Africa. The project is expected to cost trillions of dollars, and it is likely to have a significant impact on the global economy. The image is a powerful reminder of the importance of China's role in the world today. China is a rising power, and its influence is only going to grow in the years to come. It is important to be aware of China's activities and to understand the implications of its growing power.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • The image shows a map of the world with China highlighted in red. The text "Monga" and "Conder" are also visible on the map. The map is overlaid with a number of lines, which represent the Belt and Road Initiative, a Chinese-led development project. The image is a reminder of China's growing economic and political influence around the world. The map shows the countries that are part of the Belt and Road Initiative, which is a major infrastructure project that is designed to connect China with other countries in Asia, Europe, and Africa. The project is expected to cost trillions of dollars, and it is likely to have a significant impact on the global economy. The text "Monga" and "Conder" are the names of two of the regions that are part of the Belt and Road Initiative. Monga is a region in Africa, and Conder is a region in Asia. The image is a powerful reminder of the importance of China's role in the world today. China is a rising power, and its influence is only going to grow in the years to come. It is important to be aware of China's activities and to understand the implications of its growing power.
  • The Silk Road Reimagined. A vast web of lines crisscrosses the globe, connecting continents and cultures. This is the Belt and Road Initiative, a Chinese-led development project that is reimagining the Silk Road. The lines on the map represent the infrastructure that is being built as part of the project. Roads, railways, ports, and pipelines are being constructed to connect China with other countries in Asia, Europe, and Africa. The project is a massive undertaking, and it is expected to cost trillions of dollars. But it is also a bold vision for the future. The Belt and Road Initiative could help to create a more interconnected world, and it could also help to boost economic growth in developing countries. The image is a reminder of the power of connectivity. It shows how people and cultures can be brought together through trade and cooperation. It is a hopeful image, and it suggests that the future is bright for the world.
  • The image is a map of the world. The map is in shades of blue, with white lines overlaying it. China is highlighted in red, and the text "Monga" and "Conder" are also visible on the map. The map shows the countries that are part of the Belt and Road Initiative, a Chinese-led development project. The lines on the map represent the infrastructure that is being built as part of the project. The Belt and Road Initiative is a major project that is expected to cost trillions of dollars. It is designed to connect China with other countries in Asia, Europe, and Africa, including locations like Monga in Africa and Conder in Asia. The image is a visual representation of the Belt and Road Initiative. It shows the countries that are part of the project, as well as the infrastructure that is being built. The image is a useful tool for understanding the scope and scale of the Belt and Road Initiative.

What about this iconic image from the Afghanistan evacuation? Given its extreme level of virality and reposting across the internet and news outlets worldwide, it is likely that the LLM will go beyond the constraints of the image to integrate that external knowledge into its description.

Clearly the LLM was trained on the original image, since it is able to fill in extensive external detail. Here the attribution to CCTV is incorrect, though interestingly, this video appeared across mainstream news outlets worldwide without being credited, and its original source remains unclear. The way the LLM places the image in its geopolitical context is intriguing, since it shows the model leaning on its training data to add details that are not visible in the image itself.

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, mostly Afghans, who are running towards it. The people are trying to board the plane to escape the Taliban takeover of Afghanistan. The image is a powerful reminder of the chaos and desperation that ensued in the final days of the US withdrawal from Afghanistan.

The image is dated August 15, 2021, the day that the Taliban took control of Kabul. The plane in the image is one of the last flights to leave Afghanistan before the airport was closed. The image was taken by a CCTV camera and has been widely circulated online.

The image is a reminder of the human cost of war and the importance of safe and orderly evacuations. It is also a reminder of the power of images to capture and convey the human experience.

The second description similarly incorporates substantial external knowledge and adds further value statements:

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, many of whom are running towards it. The people in the crowd are Afghans who are trying to flee the country after the Taliban took control. The plane is likely part of an evacuation effort by the United States and its allies.

The image is striking for the sense of urgency and desperation it conveys. The people in the crowd are clearly risking their lives to get on the plane. They are running and pushing, and some of them are even climbing onto the wings of the plane. The image is a powerful reminder of the human cost of war and conflict.

The image was taken on August 15, 2021, shortly after the Taliban took control of Kabul. The Taliban's takeover triggered a mass exodus of Afghans who feared for their safety under Taliban rule. The United States and its allies launched an evacuation effort to help Afghans flee the country. The C-17 Globemaster III was one of the aircraft that was used in the evacuation effort.

The image has been widely shared on social media and in the news. It has been used to illustrate the human cost of the war in Afghanistan and the desperation of Afghans who are trying to flee the country. The image is a powerful reminder of the importance of peace and security.

Interestingly, the third time the LLM states that the image was "taken from a high vantage point," which is incorrect:

The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway in Kabul, Afghanistan. The plane is surrounded by a crowd of people, many of whom are running towards it. The people in the crowd are Afghan citizens who are trying to flee the country after the Taliban took control of the government. The image was taken on August 15, 2021, during the chaotic evacuation of Afghanistan.

The image is a powerful reminder of the human cost of war and the desperation of people who are fleeing violence and oppression. It is also a testament to the courage and compassion of the people who risked their lives to help others escape Afghanistan.

Here are some additional details about the image:

    • The plane is painted in the livery of the United States Air Force.
    • The crowd of people is made up of men, women, and children of all ages.
    • Many of the people in the crowd are carrying small children or elderly relatives.
    • Some of the people in the crowd are wearing traditional Afghan clothing, while others are wearing Western-style clothing.
    • The image is taken from a high vantage point, which suggests that it was taken by a journalist or photographer who was on the ground in Kabul.

The image is a powerful and disturbing reminder of the human cost of war. It is also a reminder of the importance of compassion and the need to help those who are fleeing violence and oppression.

Alternatively, the LLM was prompted to "caption the image in one paragraph".

  • A large cargo plane, a Boeing C-17 Globemaster III, is parked on a runway in Kabul, Afghanistan. A crowd of people, mostly Afghan civilians, are gathered around the plane, some of them trying to climb aboard. The plane is one of many that have been used to evacuate people from Afghanistan following the Taliban takeover of the country. The image captures the desperation of many Afghans who are trying to flee the country in fear of the Taliban. The plane represents a chance for escape, but it is also a symbol of the chaos and uncertainty that has engulfed Afghanistan. The image has been widely shared on social media, and it has been met with mixed reactions. Some people have criticized the United States for not doing more to help Afghan civilians evacuate the country. Others have praised the efforts of the US military and other organizations that have been involved in the evacuation. The image is a powerful reminder of the human cost of war and the importance of humanitarian assistance. It is also a reminder of the resilience of the Afghan people, who have faced so much hardship in recent years.
  • The image shows a large cargo plane, a Boeing C-17 Globemaster III, sitting on a runway at Kabul International Airport in Afghanistan. The plane is surrounded by a crowd of people, some of whom are trying to board the plane. The image was taken during the chaotic evacuation of Afghanistan in August 2021, as the Taliban was rapidly taking control of the country. The United States Air Force used C-17s to evacuate thousands of people from Afghanistan, including American citizens, Afghan allies, and refugees. The image captures the urgency and desperation of the situation, as people were trying to flee Afghanistan in fear of the Taliban. The text in the image, "Instead of extending a helping hand, the United States seized overseas assets," is a reference to the fact that the United States government froze billions of dollars in Afghan assets held in American banks. This decision was made in an attempt to prevent the Taliban from accessing the funds, but it also made it difficult for ordinary Afghans to access their money. The image and the text together provide a powerful reminder of the human cost of the US withdrawal from Afghanistan.
  • The Last Flight Out. The C-17 Globemaster III lumbered down the runway, its engines screaming in protest. The plane was overloaded, carrying far more people than it was designed to. But for the Afghans crammed inside, it was their only hope of escape. The Taliban had taken control of Kabul, and the city was descending into chaos. People were desperate to flee, but the airport was under siege. The only way out was by air, and the C-17 was the last plane leaving. As the plane took off, the Afghans on board let out a collective sigh of relief. They had escaped the Taliban, but they knew that their journey was just beginning. They were heading to a new country, a new life, and a new future. But they would never forget the day they boarded the Last Flight Out.
  • A crowd of Afghan civilians gathers around a Boeing C-17 Globemaster III in Kabul, hoping to evacuate the country. This paragraph captures the essence of the image in a concise and clear way. It also highlights the desperation of the Afghan people who are trying to flee the country in fear of the Taliban.
  • A Boeing C-17 Globemaster III cargo plane sits on a runway at Kabul International Airport in Afghanistan. The plane is surrounded by a crowd of people, some of whom are attempting to board the aircraft. The image was taken during the chaotic evacuation of Afghanistan in August 2021, as the Taliban was rapidly taking control of the country. The United States Air Force used C-17s to evacuate thousands of people from Afghanistan, including American citizens, Afghan allies, and refugees. The image depicts the urgency and desperation of the situation, as people were attempting to flee Afghanistan in fear of the Taliban.

Alternatively, the LLM was prompted to "describe the image for a visually impaired person".

  • A large cargo plane, a Boeing C-17 Globemaster III, is sitting on a runway. The plane is surrounded by a crowd of people, some of whom are trying to board the plane. The image was taken at Kabul International Airport in Afghanistan, during the chaotic evacuation of the country in August 2021. The crowd is made up of Afghans, American citizens, and Afghan allies, all of whom are trying to flee the country in fear of the Taliban. The image captures the urgency and desperation of the situation.
  • The Last Flight Out. The roar of engines fills the air as a giant metal bird takes shape on the runway. The crowd surges forward, desperate to board. Some cling to the plane's exterior, while others are pulled aboard by those already inside. The plane is overloaded, but it is the only hope for escape. The Taliban has taken control of Kabul, and the city is descending into chaos. People are fleeing for their lives, and the airport is the only way out. But the Taliban is closing in, and the time to board is running out. The engines scream as the plane takes off, leaving the chaos and danger of Kabul behind. The passengers on board are filled with relief, but they know that their journey is just beginning. They are headed to a new country, a new life, and a new future. But they will never forget the day they boarded the Last Flight Out.
  • The image shows a large cargo plane, a Boeing C-17 Globemaster III, sitting on a runway. The plane is surrounded by a crowd of people, some of whom are trying to board the plane. The crowd is made up of Afghans, American citizens, and Afghan allies. The people in the crowd are dressed in a variety of clothing, including traditional Afghan clothing, Western clothing, and military uniforms. Some of the people in the crowd are holding children, while others are carrying luggage. The crowd is very diverse, with people of all ages, genders, and ethnicities represented. The atmosphere in the crowd is tense and chaotic. People are shouting and pushing, and there is a sense of urgency. The image captures the desperation of the people who are trying to flee Afghanistan.
  • The image shows a large cargo plane, a Boeing C-17 Globemaster III, parked on a runway. The plane is surrounded by a crowd of people, mostly Afghan civilians. The people are trying to climb aboard the plane, which is one of many that have been used to evacuate people from Afghanistan following the Taliban takeover of the country. The image captures the desperation of many Afghans who are trying to flee the country in fear of the Taliban. The plane represents a chance for escape, but it is also a symbol of the chaos and uncertainty that has engulfed Afghanistan.

Finally, let's test on a few of the images used to announce GPT-4's multimodal capabilities.

Image Credit: hmmm (Reddit).

Prompt: What is funny about this image? Describe it panel by panel.

The image is funny because it shows a cell phone plugged into a VGA cable. VGA cables are typically used to connect computers to monitors, so the idea of plugging a cell phone into one is absurd. The humor is further enhanced by the fact that the cell phone is plugged into the VGA cable in a very nonchalant way, as if it were a perfectly normal thing to do.

The image is divided into three panels. In the first panel, we see a man in a blue shirt and jeans holding a cell phone in one hand and a VGA cable in the other. He is looking at the cell phone with a puzzled expression, as if he is trying to figure out how to plug it into the VGA cable.

In the second panel, we see the man has successfully plugged the cell phone into the VGA cable. He is now holding the cell phone up to a monitor, but the screen is blank. The man is looking at the monitor with a frustrated expression, as if he is wondering why the cell phone is not working.

In the third panel, the man has given up on trying to get the cell phone to work. He is now sitting on the floor, holding the cell phone and the VGA cable in his hands. He is looking at the cell phone with a defeated expression, as if he is realizing that he has made a fool of himself.

The humor in the image is derived from the absurdity of the situation. The idea of plugging a cell phone into a VGA cable is so ridiculous that it is funny. The humor is further enhanced by the man's puzzled, frustrated, and defeated expressions.

Here are some additional details about the image:

    • The man in the image is wearing a blue shirt and jeans.
    • The cell phone in the image is a black smartphone.
    • The VGA cable in the image is blue.
    • The monitor in the image is a black and white monitor.
    • The background of the image is a white wall.

And a second time:

The image depicts a hapless soul in a state of technological confusion. In the first panel, he stares at his cell phone with a puzzled expression, as if he is trying to decipher the ancient hieroglyphics on its screen. In the second panel, he has successfully plugged the cell phone into a VGA cable, but the screen is still blank. His expression now turns to one of frustration, as if he is wondering why his technological sorcery has failed him. In the third panel, he has given up on his quest for enlightenment and sits on the floor, holding the cell phone and the VGA cable in his hands. His expression now turns to one of defeat, as if he has realized that he is hopelessly lost in the digital wilderness.

The humor in the image is derived from the absurdity of the situation. The idea of plugging a cell phone into a VGA cable is so ridiculous that it is funny. The humor is further enhanced by the man's puzzled, frustrated, and defeated expressions. The image also has a touch of surrealism, as the man seems to be completely oblivious to the fact that he is doing something completely illogical.

The image is also funny because it is a visual pun. A pun is a play on words, and in this case, the pun is on the word "plug." The word "plug" can mean to insert something into something else, but it can also mean to turn something on. In the image, the man is trying to "plug" his cell phone into the VGA cable, but he is also trying to "plug" it in to turn it on. The humor comes from the fact that the man is not successful at either of these things.

The image is a reminder that even the most advanced technology can be confusing and frustrating at times. It is also a reminder that sometimes, the best way to solve a problem is to simply unplug and take a break.

Credit: Barnorama

Prompt: What is unusual about this image?

The unusual thing about the image is that a man is ironing clothes on an ironing board attached to the roof of a moving taxi. This is unusual for several reasons.

First, it is not common to see people ironing clothes in a taxi. Second, it is even more unusual to see someone ironing clothes on an ironing board that is attached to the roof of a moving taxi. Third, the man in the image is doing this while the taxi is moving, which is a safety hazard.

The image is also unusual because it is a humorous depiction of a real-life activity called "extreme ironing." Extreme ironing is a sport where people iron clothes in unusual and dangerous locations. Some examples of extreme ironing locations include:

    • The top of a cliff
    • The bottom of a lake
    • Inside a shark cage
    • On the back of a moving motorcycle

The image of the man ironing clothes on the roof of a moving taxi is a humorous depiction of extreme ironing. It is funny because it is so absurd and dangerous. However, it is also a reminder that people will do anything for a good laugh.

In addition to being unusual and humorous, the image is also thought-provoking. It raises questions about the nature of risk and the meaning of leisure. Why would someone risk their life to iron clothes in a dangerous location? What is the appeal of extreme ironing?

The image of the man ironing clothes on the roof of a moving taxi is a complex and thought-provoking image. It is unusual, humorous, and thought-provoking. It is a reminder that people will do anything for a good laugh, and it raises questions about the nature of risk and the meaning of leisure.

Credit: Unknown

Prompt: Can you explain this meme?

Interestingly, this image improperly triggered the model's content moderation filters and was deemed a violation of its content standards for nudity or violence.

Credit: Unknown

Prompt: Can you explain why this is funny. Think about it step-by-step.

Sure, I can explain why the statistical learning cartoon is funny.

    • Step 1: The cartoon shows a stick figure giving a presentation about statistical learning. The stick figure is standing in front of a presentation board with a graph on it. The graph is going up and down, and the stick figure is looking at the graph with a puzzled expression.
    • Step 2: The text on the presentation board explains that the learner is overgeneralizing because the VC-Dimension of the kernel is too high. This is a technical term that most people will not understand.
    • Step 3: The text then suggests a number of solutions to the problem, such as getting some experts, minimizing the structural risk, reworking the loss function, and making the next kernel stable, unbiased, and using oft margin. These solutions are also technical and jargon-filled.
    • Step 4: The final line of text says "But unironically." This means that the solutions that were just suggested are actually serious suggestions.

The humor in the cartoon comes from the juxtaposition of the technical jargon with the simple stick figure drawings. The stick figure is clearly not an expert in statistical learning, but he is trying to explain a complex problem in a jargon-filled way. This creates a sense of absurdity that is funny to many people.

The cartoon also plays on the stereotype that people who work in data science and machine learning are very serious and humorless. The stick figure in the cartoon is trying to be very serious, but his attempts to explain the problem are so over-the-top that they are actually funny.

Finally, the cartoon is funny because it is relatable. Many people have had the experience of trying to understand a technical concept that they don't really understand. The stick figure in the cartoon is a relatable stand-in for anyone who has ever been in that situation.

In conclusion, the statistical learning cartoon is funny because it is absurd, relatable, and plays on a stereotype. The cartoon is a clever way to poke fun at the complexity of statistical learning and the humorlessness of data scientists.

Credit: FAO

Prompt: What is the sum of average daily meat consumption for Georgia and Western Asia? Provide a step-by-step reasoning before providing your answer.

Sure, I can help you with that.

The sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Here is the step-by-step reasoning:

    1. The average daily meat consumption for Georgia is 79.849 grams per person per day.
    2. The average daily meat consumption for Western Asia is 69.624 grams per person per day.
    3. To find the sum of average daily meat consumption for Georgia and Western Asia, we simply add the two values together.
    4. 79.849 + 69.624 = 149.473

Therefore, the sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Here is the image that you sent me:

The image shows the average daily meat consumption for a number of countries in 1997. The countries are color-coded based on their average daily meat consumption. Countries with daily meat consumption greater than the expected EU average of 165g per person are shown in red; yellow are those countries below 155g but exceeding the more ambitious limit of 40g per person, and in blue are those below 400 per person.

The countries that are relevant to this question are Georgia and Western Asia. Georgia is shown in yellow, and Western Asia is shown in blue. The average daily meat consumption for Georgia is 79.849 grams per person per day, and the average daily meat consumption for Western Asia is 69.624 grams per person per day.

To find the sum of average daily meat consumption for Georgia and Western Asia, we simply add the two values together. 79.849 + 69.624 = 149.473. Therefore, the sum of average daily meat consumption for Georgia and Western Asia is approximately 149.473 grams per person per day.

Note something remarkable about the response above. The LLM cited a completely different image from across the web in its response. However, it incorrectly labels the image and does not appear to have used it in its response. Nevertheless, it demonstrates the future potential of multimodal LLMs to draw from imagery from across the web.

This raises the question of whether the LLM memorized this graph, or whether the graph has characteristics that make it especially well suited to LLM understanding and was therefore hand-selected for the GPT-4 demos.

Let's try a different but highly similar image that is also a vertical bar chart and also involves addition.

Credit: Statista

Prompt: What is the combined consumption of corn in Brazil, Egypt, Vietnam and India. Provide a step-by-step reasoning before providing your answer.

The combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 3,146 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 1,126 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 7,944 million bushels.
    5. To find the combined consumption of corn in Brazil, Egypt, Vietnam, and India, we simply add the four values together.
    6. 3,146 + 1,126 + 624 + 7,944 = 12,840

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

The image shows the consumption of corn in different countries in 2021/2022. The countries are color-coded based on their consumption of corn. Countries with the highest consumption of corn are shown in red, and countries with the lowest consumption of corn are shown in blue.

The countries that are relevant to this question are Brazil, Egypt, Vietnam, and India. These countries are all shown in red, which means that they have a high consumption of corn.

To find the combined consumption of corn in these four countries, we simply add the four values together. 3,146 + 1,126 + 624 + 7,944 = 12,840. Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,840 million bushels.

Here we can see the model breaking down, even as it performed flawlessly on the first graph. It correctly identifies the unit and performs the requested addition accurately. However, for Brazil it extracts the number for the country above it (3,146) and for Egypt it does the same (1,126). For Vietnam it correctly extracts the number (624), meaning this is not simply a vertical offset issue, and for India it hallucinates a number (7,944) that does not exist anywhere in the chart.
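
The three failure modes in this response (a correct value, a real value read from the wrong bar, and a value that is not on the chart at all) can be separated with a small check. The following is a minimal Python sketch, not the procedure used to produce these results. It assumes the reference values are the ones the model extracts in the second run below, which are described there as correct. `KNOWN_CHART_VALUES` is only a partial list assembled from figures quoted in this article, and the names `REFERENCE`, `KNOWN_CHART_VALUES` and `classify` are hypothetical.

```python
# Reference values (million bushels) for the four requested countries.
# Assumption: taken from the run the article describes as extracting them correctly.
REFERENCE = {"Brazil": 2874, "Egypt": 661, "Vietnam": 624, "India": 1126}

# Partial set of values known to appear on the chart (assumption: only the
# figures quoted in the article; 3146 belongs to the bar above Brazil).
KNOWN_CHART_VALUES = set(REFERENCE.values()) | {3146}

def classify(country, reported):
    """Label one extracted value as correct, misattributed, or not on the chart."""
    if reported == REFERENCE[country]:
        return "correct"
    if reported in KNOWN_CHART_VALUES:
        return "misattributed"  # a real chart value, but read from a different bar
    return "not on chart"       # absent from the known chart values

# Values the model reported in the first run.
first_run = {"Brazil": 3146, "Egypt": 1126, "Vietnam": 624, "India": 7944}

labels = {country: classify(country, value) for country, value in first_run.items()}
for country, label in labels.items():
    print(f"{country}: {first_run[country]:,} -> {label}")
```

Because the list of chart values is incomplete, a "not on chart" label here only means the value is absent from the figures quoted in this article.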

Let's try running it a second time:

The combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 11,845 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 2,874 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 661 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 1,126 million bushels.
    5. To find the combined consumption of corn in Brazil, Egypt, Vietnam, and India, we simply add the four values together.
    6. 2,874 + 661 + 624 + 1,126 = 6,285

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 12,535 million bushels.

However, the image you sent me shows the consumption of corn in 2020/2021. The consumption of corn in these four countries in 2021/2022 is slightly different. The updated values are as follows:

    • Brazil: 2,874 million bushels
    • Egypt: 661 million bushels
    • Vietnam: 624 million bushels
    • India: 1,126 million bushels

The combined consumption of corn in these four countries in 2021/2022 is therefore:

    • 2,874 + 661 + 624 + 1,126 = 11,845 million bushels

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, and India in 2021/2022 is 6,285 million bushels.
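
The totals in the response above can be checked against the model's own inputs. The following is a minimal Python sketch, offered as an illustration rather than as part of the original experiment: it sums the four values the model listed and compares that sum with every total the response states. The variable names are hypothetical, and the numbers are copied from the response.

```python
# Values (million bushels) the model listed in this response.
values = {"Brazil": 2874, "Egypt": 661, "Vietnam": 624, "India": 1126}

# Every total the model stated in the response, in order of appearance.
reported_totals = [11845, 6285, 12535, 11845, 6285]

actual_total = sum(values.values())
distinct_reported = sorted(set(reported_totals))

# Signed difference between each distinct stated total and the real sum.
errors = {total: total - actual_total for total in distinct_reported}

print(f"actual total: {actual_total:,}")
for total, delta in errors.items():
    print(f"reported {total:,} is off by {delta:+,}")
```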

This time it correctly extracted the values from the chart, but its addition is incorrect: it reports three different values as the sum throughout its text (11,845, 6,285 and 12,535), none of which is correct, since the four values actually sum to 5,285. Let's try a third time:

The combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,577 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 2,874 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 661 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 624 million bushels.
    4. The consumption of corn in India in 2021/2022 is 1,126 million bushels.
    5. The consumption of corn in Mexico in 2021/2022 is 1,712 million bushels.
    6. To find the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico, we simply add the five values together.
    7. 2,874 + 661 + 624 + 1,126 + 1,712 = 13,002

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 11,022 million bushels.

This time it correctly extracted the values for each country, but inexplicably added Mexico to the list and added the numbers incorrectly.

And a fourth time:

The combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,840 million bushels.

Here is the step-by-step reasoning:

    1. The consumption of corn in Brazil in 2021/2022 is 3,058 million bushels.
    2. The consumption of corn in Egypt in 2021/2022 is 732 million bushels.
    3. The consumption of corn in Vietnam in 2021/2022 is 682 million bushels.
    4. The consumption of corn in India in 2021/2022 is 8,274 million bushels.
    5. The consumption of corn in Mexico in 2021/2022 is 1,104 million bushels.
    6. To find the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico, we simply add the five values together.
    7. 3,058 + 732 + 682 + 8,274 + 1,104 = 13,840

Therefore, the combined consumption of corn in Brazil, Egypt, Vietnam, India, and Mexico in 2021/2022 is 13,840 million bushels.

Once again, it added Mexico without instruction. This time, however, it appears to have hallucinated the values for all five countries out of whole cloth, and even its stated total is slightly off: the five numbers it lists sum to 13,850, not 13,840.
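
Arithmetic slips like these are easy to miss by eye but simple to catch automatically. The following is a minimal Python sketch, assuming the model writes its sums in the "a + b + c = total" form seen in the responses above. It finds each such equation in a block of text and recomputes it exactly. The regular expression and the helper names `to_decimal` and `check_equations` are illustrative assumptions, not part of any tool used in this article.

```python
import re
from decimal import Decimal

# A number with optional thousands separators and decimals, e.g. 3,146 or 79.849.
NUMBER = r"\d[\d,]*(?:\.\d+)?"
# Two or more numbers joined by "+", then "=", then the stated total.
EQUATION = re.compile(rf"((?:{NUMBER}\s*\+\s*)+{NUMBER})\s*=\s*({NUMBER})")

def to_decimal(text):
    """Parse a number such as '3,146' exactly, without float rounding."""
    return Decimal(text.strip().replace(",", ""))

def check_equations(text):
    """Return (stated_total, actual_total) for every sum found in text."""
    results = []
    for match in EQUATION.finditer(text):
        terms = [to_decimal(term) for term in match.group(1).split("+")]
        results.append((to_decimal(match.group(2)), sum(terms)))
    return results

# Equations copied from the model responses quoted in this article.
responses = """
79.849 + 69.624 = 149.473
3,146 + 1,126 + 624 + 7,944 = 12,840
2,874 + 661 + 624 + 1,126 = 6,285
2,874 + 661 + 624 + 1,126 + 1,712 = 13,002
3,058 + 732 + 682 + 8,274 + 1,104 = 13,840
"""

results = check_equations(responses)
for stated, actual in results:
    verdict = "ok" if stated == actual else f"wrong, actual sum is {actual:,}"
    print(f"stated {stated:,}: {verdict}")
```

A check of this kind only verifies that the arithmetic is internally consistent. It cannot tell whether the numbers being added were read correctly from the chart, which is a separate failure seen in the first and fourth runs.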