The Dangers Of Image GenAI: Both GPT-4 & Gemini 1.5 Pro Thought An Imagen 2 Image Was A Real Photograph

Continuing our experiments from yesterday on LMM textual descriptions of generative AI imagery, let's test how GPT-4 and Gemini 1.5 Pro describe this image produced by Google's latest Imagen 2 model.

The end result demonstrates both the enormous brittleness of modern LMMs and the frightening reality that LMMs may confidently assert that non-surrealist generated images are genuine photographs, greatly contributing to the dangers of deep fakes spreading on the internet. Surprisingly, despite the image being generated by Google's own Imagen 2 model, when asked to describe the image, Gemini 1.5 Pro provides a rich description of a completely unrelated image of a snow-covered mountain pine forest, while GPT-4 offers a fairly accurate description of the actual image. However, both models confidently assert that the image is a genuine original photograph, with GPT-4 suggesting that at worst it may have been slightly enhanced to give it more vivid coloration. Neither model notes the telltale signs of image generation, such as the physically impossible hot air balloon blotches at lower right or the irregularities in some of the building structures.

The inability of either GPT-4 or Gemini to recognize the obvious signs of generative image AI in the image offers a frightening warning about the current state of generative AI. Automated workflows that use LMMs to scan realtime streams of images for deep fakes will clearly struggle more the closer an image is to a real photograph, which is precisely the use case where they are most needed. Worse, as LMMs are increasingly positioned as tools for ordinary web users, a user who encounters the image and plugs it into GPT-4 or Gemini to ask if it is faked will be told it is actually a real photograph.
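
One defensive pattern that follows from this is to treat an LMM's "it is a photograph" answer as no evidence at all, rather than as a clearance. The following is only a minimal sketch of that idea and not a tested detection pipeline: `Verdict`, `Checker` and `triage` are hypothetical names invented for illustration, and the model calls are left abstract rather than tied to any vendor's API.

```python
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    GENERATED = "generated"
    PHOTOGRAPH = "photograph"
    UNSURE = "unsure"


# Hypothetical interface: any function that takes image bytes and returns
# a Verdict. In practice it would wrap a call to an LMM; here it is abstract.
Checker = Callable[[bytes], Verdict]


def triage(image: bytes, checkers: Iterable[Checker]) -> str:
    """Route one image based on the verdicts of several checkers.

    Assumption illustrated here: a "photograph" verdict never clears an
    image, because the models may confidently call generated images real.
    Only a "generated" verdict is acted upon automatically.
    """
    verdicts = [check(image) for check in checkers]
    if Verdict.GENERATED in verdicts:
        return "flag"
    # No checker objected, which is not treated as proof of authenticity.
    return "needs_review"
```

Under this policy, the image in this post would land in "needs_review" rather than being cleared, since both models called it a photograph and neither verdict is treated as proof.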

How about an image created by Google's Imagen 2 model?

Let's try describing the image:

Describe this image in extreme detail. Tell me every single thing in the image as if I was unable to see the photo myself.

Bizarrely, despite the image being created by Google's own Imagen 2 model, Gemini describes a completely different and utterly unrelated scene:

The image is a beautiful landscape photograph of a snow-capped mountain range at sunset. The sky is a vibrant mix of colors, with streaks of fiery orange, pink, and purple clouds contrasting against a deep blue backdrop. The sun, partially hidden behind the mountains, casts a warm glow over the entire scene.

The mountain range dominates the foreground, with jagged peaks reaching towards the sky. The snow covering the mountains is pristine white, reflecting the colors of the sunset. Some areas of the mountains are bare, revealing dark rock formations and adding texture to the landscape.

In the middle ground, there is a valley filled with a thick pine forest. The trees are mostly green, but some have hints of autumn colors, suggesting the changing season. The forest stretches as far as the eye can see, creating a sense of vastness and depth.

A few wispy clouds drift across the sky, adding a touch of movement to the otherwise still scene. The overall atmosphere is one of peace and tranquility, with the beauty of nature on full display.

In contrast, GPT-4 offers a strong textual rendition of the image:

This image captures a breathtaking scene of a cityscape at what appears to be sunrise or sunset, judging by the golden hue of the sky. The city is densely packed with buildings that exhibit traditional architectural styles, featuring domes and minarets, indicative of Islamic influence, likely placing this scene in a region such as Turkey.

In the foreground, a rocky outcrop dominates the left side of the image, casting a large shadow and suggesting it is either dawn or dusk. The city extends into the middle ground, characterized by a mix of red and blue domes and slender minarets. One particularly large structure, possibly a mosque, is central in the image with its massive dome and flanking semi-domes. This building is a historic one with visible signs of aging and patina on its surface. It is surrounded by a mix of residential and smaller religious buildings.

Above the city, the sky is filled with numerous hot air balloons, floating serenely. They vary in color, with shades of red, orange, and some with stripes or patterns. The balloons are in various stages of altitude, with some closer to the horizon, and others appearing higher in the sky.

The horizon itself is hazy, blending the sea and sky into a gradient of warm oranges, yellows, and blues. The sun, while not directly visible, is implied by the light that bathes the entire scene, reflecting off the surfaces of the buildings and the balloons, and creating a glow that gives the city a dreamy, ethereal quality.

In the very background, there are layers of hills or mountains that fade into the mist, giving a sense of vastness and depth to the landscape. The sky transitions from the warm colors near the horizon to a deeper blue as it extends upwards, suggesting the presence of either early morning light or the last vestiges of daylight.

The overall impression is one of a historic, culturally rich city waking up or winding down, with the balloons providing a quiet, almost magical form of contemplation or celebration. The image is likely to be a composite or enhanced in some way to emphasize the beauty and tranquility of the scene.

Next, let's ask each model whether it is a photograph:

Is this a photograph?

Gemini offers that it is "highly likely" to be a photograph and flags no details that suggest it might be digitally manipulated:

Based on the information I have, it is highly likely that the image is a photograph. My description of the image includes details that are characteristic of photographs, such as the realistic depiction of light and shadow, the natural textures of the mountains and trees, and the subtle variations in color.

However, I cannot be completely certain without additional information. It is possible that the image is a very realistic digital painting or rendering.

Concerningly, GPT-4 also believes it is a photograph. It suggests the image may have been digitally enhanced, but still treats it as, at its core, a real photograph:

The image you provided appears to be a photograph, but given the dramatic lighting, vibrant colors, and the picturesque arrangement of elements such as the hot air balloons and the historic buildings, it may have been digitally enhanced or edited for artistic effect. Such enhancements can include adjusting the colors, contrast, and composition to create a more visually striking or aesthetically pleasing image. Without examining the original file or having more context, it's difficult to determine the extent of any digital alterations.