Generative AI Experiments: Using GPT-4 And Gemini 1.5 Pro To Analyze A DALL-E Image

Last week we used DALL-E to visualize a "television news archive". Let's explore how two leading large multimodal models (LMMs), GPT-4 and Gemini 1.5 Pro, describe this image.

The end result is that GPT-4 provides far superior textual descriptions of the image, with vivid, rich detail that captures not only the image as a whole but also its minute details and likely emotional connotations. In contrast, Gemini requires additional prompting to provide anything beyond a cursory description, and even at its most detailed it does not compare with the output of GPT-4.

Below you can see the image we used. Click on it to download the full-resolution version.

Let's see how Gemini 1.5 Pro describes it:

Describe this image in detail.

This yields a very short description. It is unclear why it believes the image conveys chaos and confusion:

 

The image shows a dark and stormy cityscape. In the foreground, there are two rows of old-fashioned television sets, stretching into the distance. The TVs are all showing different images of people, and the screens are flickering and distorted. The street between the TVs is wet and reflective, and there is a lightning bolt striking in the distance. The overall effect is one of chaos and confusion.

What if we ask for much more detail?

Describe this image in extreme detail. Tell me every single thing in the image as if I was unable to see the photo myself.

This time we get more detail:

 

The image is a digital painting of a dark and stormy cityscape. In the foreground, there are two rows of old-fashioned television sets, stretching into the distance as far as the eye can see. The TVs are stacked on top of each other, reaching a height of about three stories. They are all showing different images of people, mostly news anchors and politicians. The screens are flickering and distorted, and the colors are oversaturated.

The street between the TVs is wet and reflective, and there are puddles of water on the ground. The buildings in the background are tall and imposing, and they are lit up by the flashes of lightning. The sky is dark and stormy, and there are thick clouds overhead. A single, massive lightning bolt is striking in the distance, illuminating the city with a bright white light.

The overall effect is one of chaos and confusion. The flickering TVs, the distorted images, and the stormy weather all create a sense of unease and disorientation.

Can it detect that this was generated by DALL-E?

Is it possible that this image was manipulated in any way?

Strangely, despite describing it as a "digital painting" above, when asked explicitly if the image has been manipulated, Gemini concludes only that it is "possible" but that it is also "possible that the image is entirely original":

Yes, it is possible that the image was manipulated in some way. For example, the artist may have used digital editing software to add the lightning bolt, or to make the colors more saturated. However, it is also possible that the image is entirely original. Without more information, it is difficult to say for sure.

However, if we phrase the question slightly differently:

How likely would you say it is that the image is a real photograph?

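This time it concludes that the image is "very unlikely" to be a real photograph, but only because the composition is "too perfect", the lighting is "too dramatic" and the lightning bolt is too precisely placed: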

It is very unlikely that the image is a real photograph. The composition of the image is too perfect, and the lighting is too dramatic. Additionally, it would be very difficult to capture a lightning bolt in such a precise location. It is more likely that the image is a digital painting or a photo manipulation.

What about GPT-4?

Describe this image in detail.

GPT-4 gives us a far richer and more detailed description of the image right out of the gate, without any of the additional prompting that Gemini required:

This image depicts a futuristic or dystopian scene with an array of television screens lining the walls of what appears to be a vast, narrow corridor. The screens display various images, some showing groups of people, while others exhibit places or scenes, perhaps representing surveillance or media overload.

Above this corridor, a dramatic sky unfolds with dark, ominous clouds gathering in the center, illuminated by a powerful bolt of lightning that strikes down towards the horizon. The lightning is the main source of light in this otherwise dark scene, casting a blue-white glow that reflects off the wet ground, suggesting recent rain.

The architecture surrounding the corridor is reminiscent of towering skyscrapers, with lights dotting their facades, stretching into the night sky. This scene might be interpreted as a commentary on modern society's constant surveillance and the overwhelming presence of media in our lives, set against the backdrop of nature's uncontrolled and raw power. The overall atmosphere is one of awe mixed with a sense of foreboding.

Let's ask for extensive detail:

Describe this image in extreme detail. Tell me every single thing in the image as if I was unable to see the photo myself.

This gives us an immense amount of detail – far beyond what Gemini was able to provide:

The image portrays a scene that combines elements of technology with the majesty of a natural storm. The setting appears to be a long and narrow alleyway flanked by high walls of stacked television screens on both sides. The televisions are old-fashioned, reminiscent of the 1980s or 1990s models, with curved screens and bulky bodies. They are mounted in metal frames that hold them securely in place, arrayed from the floor to high above, creating a claustrophobic canyon of electronics.

Each television displays a different image, creating a collage of blues, whites, and various other colors. The scenes within the screens are too small to discern in detail, but they seem to show a mix of urban landscapes, groups of people, and possibly scenes from different movies or shows, suggesting a media-saturated environment.

The sky above is tumultuous, with dark storm clouds roiling in the center where the lightning emanates. The lightning bolt is thick and forks into two main branches with several smaller offshoots, lighting up the scene with a bluish-white intensity. It appears to strike at a point beyond the alleyway, between the silhouettes of towering skyscrapers that line the horizon. These buildings have numerous lights, some are illuminated windows, others are exterior lights, which add to the urban feel of the environment.

The ground reflects the stormy sky and the bright flash of the lightning, suggesting it's a smooth, wet surface, possibly of metal or another reflective material. The alleyway leads to a vanishing point in the distance, giving the image a one-point perspective that draws the eye towards the lightning strike. The reflection on the ground adds to the depth of the scene, mirroring the storm and part of the television walls.

The overall effect is one of being in a futuristic or alternate reality where nature's raw power intersects with human technology. The lightning seems to be an uncontrollable force juxtaposed against the structured array of electronic devices, possibly signifying a tension between the natural world and human attempts to capture or represent reality. The atmosphere is charged with energy and suggests themes of surveillance, media saturation, and the awe-inspiring power of the natural elements.

Is it manipulated?

Is it possible that this image was manipulated in any way?

Unlike Gemini, which said only that the image might have been manipulated or might be entirely original, GPT-4 concludes that it is likely digital artwork or a digitally manipulated photograph and provides detailed reasoning to justify its conclusion:

The image appears to be a digital artwork or a digitally manipulated photograph rather than an unedited photograph of a real scene. Here are some indicators that suggest the image is likely the result of digital manipulation or creation:

1. **Unnatural Composition**: The scene combines elements in a way that does not typically occur in natural settings, such as a corridor lined with old-fashioned televisions displaying various images under a stormy sky.

2. **Aesthetic Consistency**: The image has a coherent aesthetic that is usually achieved through digital means, like color grading or the application of filters, to give the entire scene a uniform look and feel.

3. **Lighting and Reflections**: The lighting and reflections, especially the way the lightning illuminates the clouds and reflects off the wet ground, are dramatic and stylized. Such effects are often enhanced or created using photo editing software.

4. **Element Uniformity**: The repetitive pattern of televisions with different images suggests a possibility of cloning or copy-pasting within a digital art program.

5. **Surreal Environment**: The environment depicted combines elements in a surreal manner that is more typical of artistic expression than a real-life occurrence, indicating a creative design rather than a candid photograph.

6. **Quality of Details**: The details in the image, such as the textures of the clouds and the surfaces, might be too perfect or too consistent, which can be a sign of digital painting or rendering.

Considering these aspects, it's plausible that this image was crafted using digital tools like photo editing software, 3D modeling, or digital painting applications. The result is a piece that likely represents a concept or theme through artistic expression rather than an attempt to capture or document a real scene.

And again:

How likely would you say it is that the image is a real photograph?

Again, GPT-4 provides a more thorough response, concluding that it is "highly unlikely" that the image is a real photograph:

Based on the analysis of the image, it is highly unlikely that this is a real photograph. The composition, the surreal elements, and the overall aesthetic strongly suggest that it is a piece of digital art or a digitally manipulated image rather than a photograph of an actual physical location. The scene appears to be conceptual, designed to evoke certain ideas or themes rather than to document reality. It’s common in digital art to create scenes that combine elements in ways that would be impossible or highly improbable in the real world, which seems to be the case with this image.
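
The comparison in this post boils down to sending the same four prompts to each model and reading the replies side by side. For readers who want to repeat the experiment, here is a minimal sketch of that protocol in Python. It is not the code used for this post: the `ModelFn` signature and the `run_comparison` helper are hypothetical names introduced for illustration, each model is represented by a caller-supplied function rather than a real API client, and word count is only a crude stand-in for how "detailed" a reply is.

```python
from typing import Callable, Dict, List

# The four prompts used in this post, in the order they were asked.
PROMPTS: List[str] = [
    "Describe this image in detail.",
    "Describe this image in extreme detail. Tell me every single thing in "
    "the image as if I was unable to see the photo myself.",
    "Is it possible that this image was manipulated in any way?",
    "How likely would you say it is that the image is a real photograph?",
]

# Hypothetical signature (an assumption, not any vendor's API):
# takes (image_bytes, prompt) and returns the model's text reply.
ModelFn = Callable[[bytes, str], str]


def run_comparison(image: bytes, models: Dict[str, ModelFn]) -> Dict[str, List[dict]]:
    """Send every prompt to every model and record each reply with its word count."""
    results: Dict[str, List[dict]] = {}
    for name, ask in models.items():
        rows = []
        for prompt in PROMPTS:
            reply = ask(image, prompt)
            # Word count is a rough proxy for level of detail, nothing more.
            rows.append({"prompt": prompt, "reply": reply, "words": len(reply.split())})
        results[name] = rows
    return results


if __name__ == "__main__":
    # Stand-in models so the sketch runs without any API access.
    def terse(image: bytes, prompt: str) -> str:
        return "A stormy city street lined with televisions."

    def verbose(image: bytes, prompt: str) -> str:
        return "A long corridor of stacked televisions beneath a storm. " * 3

    out = run_comparison(b"", {"terse": terse, "verbose": verbose})
    for name, rows in out.items():
        print(name, [row["words"] for row in rows])
```

To use it against real models, replace the two stand-in functions with wrappers around whichever client libraries you have access to. Keeping the prompts in one shared list ensures both models see identical wording, which matters given how much Gemini's answer changed when the manipulation question was rephrased.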