Generative AI Experiments: Using GPT-4 And Gemini 1.5 Pro To Analyze Imagen 2 Images

Kalev Leetaru

2 years ago

With the public availability of Gemini 1.5 Pro, let's compare how two major LMMs (GPT-4 and Gemini 1.5 Pro) describe two images of human faces produced by Imagen 2. The end result is that Gemini frequently hallucinates, writing descriptions of unrelated images. Even when it describes the actual image, its descriptions are brief and miss key details, requiring additional prompting to produce any real amount of description. In contrast, GPT-4 produces rich, vivid descriptions right out of the gate and does not suffer from the wholesale hallucination of Gemini. Interestingly, while Gemini completely misses the stark incongruities of the two images, GPT-4 picks up on them and calls them out explicitly.

Let's start with a descriptive prompt:

Describe this image in extreme detail. Tell me every single thing in the image as if I was unable to see the photo myself.

Gemini 1.5 Pro offers only the most basic description:

The image is a portrait of a woman wearing a red and orange headscarf. Her face is partially obscured by the scarf, but her eyes are visible and are a deep, golden brown. The background of the image is a blur of colors, with a mosque and hot air balloons in the distance. The woman's skin is smooth and her features are delicate. She has a small nose and full lips. Her hair is pulled back and tucked under the headscarf. The overall effect of the image is one of mystery and intrigue.

Asked a second time, it generates more output:

The woman's headscarf is made of a soft, flowing fabric. It is red on the outside and orange on the inside. The scarf is wrapped around her head in a way that covers her hair and ears, but leaves her face exposed. The scarf is tied in a knot at the back of her head.

The woman's eyes are the most striking feature of her face. They are a deep, golden brown color and they seem to glow in the light. Her eyes are almond-shaped and her eyelashes are long and thick. Her eyebrows are dark and well-defined.

The woman's nose is small and straight. Her lips are full and slightly parted. She has a small chin and a long, slender neck.

Will running the prompt a third time yield an even more detailed description? Unfortunately, the third run instead describes a different image:

The image is a close-up of a young woman's face. She has long, dark hair that is pulled back into a ponytail. She has brown eyes and is wearing a light amount of makeup. She is smiling slightly and looking directly at the camera.

The background of the image is a blur of green and brown. It is possible that she is standing in front of some trees or bushes.

The image is well-lit and the woman's face is in focus. The overall tone of the image is warm and inviting.
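One rough way to catch this kind of drift when rerunning the same prompt is to check each response for a few anchor terms taken from a description already known to match the image. The sketch below is only an illustration under stated assumptions: the helper names, the anchor terms, and the `min_hits` threshold are hypothetical choices and were not part of the experiment above, and the sample strings are short excerpts of the three Gemini responses.

```python
import re


def mentions(text: str, anchors: set[str]) -> set[str]:
    """Return the anchor terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return anchors & words


def flag_off_topic(responses: list[str], anchors: set[str], min_hits: int = 1) -> list[int]:
    """Return indices of responses that mention fewer than min_hits anchor terms."""
    return [
        i for i, response in enumerate(responses)
        if len(mentions(response, anchors)) < min_hits
    ]


# Hypothetical anchor terms, picked by hand from the first description.
anchors = {"headscarf", "scarf", "mosque", "balloons"}

# Excerpts of the three responses quoted above.
runs = [
    "The image is a portrait of a woman wearing a red and orange headscarf.",
    "The woman's headscarf is made of a soft, flowing fabric.",
    "The image is a close-up of a young woman's face. She has long, dark hair "
    "that is pulled back into a ponytail.",
]

print(flag_off_topic(runs, anchors))  # → [2]
```

A keyword check like this is crude: it only signals that a response has stopped mentioning the expected subject matter, not that the rest of the description is accurate.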

In contrast, GPT-4 offers a rich, vivid description the very first time. Notably, while Gemini completely misses the surreal nature of the image, GPT-4 immediately picks up on it:

This image is a digital artwork that features a highly stylized and vividly colored portrait of a woman. The woman is depicted from the shoulders up, facing forward with her head slightly turned to her left. She has a realistic and detailed face with an intense gaze. Her eyes are large, with dark, defined eyebrows, long lashes, and hazel-colored irises that stand out against the warm tones of her skin. She has a straight nose and full, glossy lips that are slightly parted.

The woman is wearing a headscarf that is bright orange on the outside and deep maroon on the inside. The scarf frames her face and covers her hair, although some wavy brown strands can be seen on her left side. The fabric of the scarf looks soft and has folds and creases, indicating its texture and the way it drapes over her head and shoulders.

In a surreal twist, the center of her forehead is transformed into a canvas depicting a serene landscape. This landscape seems to replace where her forehead would typically be and features a clear blue sky with a few scattered clouds. In the distance, there are hot air balloons floating gently upwards, and a mosque with a single minaret can be seen, suggesting an Eastern or Middle Eastern setting, possibly alluding to a place like Cappadocia in Turkey. The hot air balloons have various designs, and their colors range from red to pink and blue, contrasting with the blue sky. The horizon line of the landscape is roughly at the level of her eyebrows.

The background of the image is a gradient of sky blue at the top, shifting to a warmer yellow-orange hue towards the bottom, complementing the colors of the headscarf and the scene depicted on the woman's forehead. The lighting of the image is soft, with the light source coming from the upper left side, casting subtle shadows on the right side of her face and neck.

The overall impression is one of a fusion between realistic portraiture and imaginative, dream-like elements, creating a sense of mystery and enchantment. The woman's expression is serene and contemplative, and the landscape adds a layer of depth and narrative to the image, suggesting a connection between the woman and the place depicted on her forehead. The artwork is highly detailed, with a focus on the play of light and shadow, the richness of color, and the juxtaposition of reality with fantasy elements.

Let's try a different image using the same prompt:

Once again, Gemini offers only a cursory description:

The image is a portrait of an older man with a serious expression. He is wearing a black suit and white shirt, and his head is covered with a red hood. The hood is draped over his head and shoulders, and it casts a shadow over his face. The man's eyes are dark and intense, and his mustache is thick and gray. The background of the image is a light gray color.

Bizarrely, asked a second time, Gemini once again wanders off to describe a completely unrelated scene:

The image is a portrait of a young woman with long, flowing hair. She is wearing a white dress and a flower crown. The woman is standing in a field of flowers, and the sun is shining down on her. The woman's eyes are closed, and she has a peaceful expression on her face. The background of the image is a blur of green and yellow.

In extreme detail, the woman's hair is a light brown color, and it is styled in loose waves. The flower crown is made of white and yellow flowers, and it is perched on top of her head. The woman's dress is made of a sheer, white fabric, and it has a flowing skirt. The field of flowers is full of yellow dandelions and white daisies. The sun is shining brightly, and it is casting a warm glow over the entire scene.

Once again, GPT-4 offers a rich, vivid description on the very first try:

This image is a digitally created or manipulated portrait of an older man whose expression is serious and intense. The man's face is the central focus of the image, with his features sharply defined and highlighted against a neutral background.

The man has a stern and prominent brow, deep-set eyes with dark, heavy eyebrows, and a piercing gaze directed straight at the viewer. His eyes are dark, adding to the intensity of his expression. He has a large, hawkish nose that casts a slight shadow on the right side of his face, and his skin appears weathered with visible lines and creases that suggest age and experience. His cheeks are slightly sunken beneath high cheekbones.

He sports a thick, well-groomed mustache that covers his upper lip and runs down to the corners of his mouth, which is set in a firm line. His chin is strong with a cleft, and his jawline is square and pronounced.

The man is wearing a crisp white shirt with a stiff, pointed collar, and a classic black tie which is neatly knotted at the throat. This attire suggests formality and perhaps professionalism or authority.

Over his head and shoulders is draped a bright red fabric that frames his face. The fabric is smooth and has a silky sheen to it, with folds and creases that suggest a thick, heavy material. It wraps around his head and under his chin in a way that might remind one of traditional or religious garments, though it is not specific to any known cultural dress. The red of the fabric is vivid and is the only color in the image aside from the man's skin tone and the black and white of his attire, making it stand out dramatically.

The background is plain and light-colored, ensuring that all attention is drawn to the man and the striking red fabric. The lighting in the image is even and appears to be coming from the front, casting very soft shadows and giving the man's face a slightly three-dimensional effect. The overall mood of the image is one of intensity and perhaps an underlying narrative that is not immediately clear, inviting the viewer to contemplate the story behind the man's stern appearance.

Given how realistic this image is, what if we ask if it is a real photograph?

What is the likelihood that this is a real photograph?

Gemini argues that it may be a digital painting, but only because it sees a completely different, highly stylized, computer-generated image of a woman with a skirt, yellow dandelions, white daisies and the like:

It is difficult to say definitively whether or not the image is a real photograph without more information. However, there are a few factors that suggest it may be a digital painting or other type of artwork.

First, the image is very stylized. The woman's hair and dress are flowing in an unrealistic way, and the lighting is very dramatic. Additionally, the background is blurred, which is a common technique used in digital art to create a sense of depth and focus.

Second, the image is very high quality. It is possible to create very realistic digital paintings, and this image is certainly detailed enough to be one.

Of course, it is also possible that the image is a real photograph that has been heavily edited. However, given the factors mentioned above, it is more likely that the image is a digital painting or other type of artwork.

In contrast, GPT-4 rates the likelihood that it is a real photograph as "high":

The likelihood that this is a real photograph of a man with a red fabric draped over his head is high. The detailed description suggests the image shows a realistically rendered portrait with photographic qualities such as sharp focus, natural skin textures, lifelike shadows, and realistic clothing materials.

However, without knowing the context or the source of the image, it is impossible to determine with absolute certainty whether it is an unaltered photograph, a digitally manipulated image, or a photorealistic illustration. Digital editing techniques can create images that closely mimic the look of a traditional photograph, making it challenging to distinguish between a real photo and a digitally altered or illustrated one.

If the image has characteristics typical of digital manipulation, such as an unusual level of sharpness, color saturation, or an amalgamation of elements not typically seen together, this may suggest that the photo has been edited or is a composite. However, these techniques are also used in modern photography for aesthetic purposes, so their presence does not necessarily mean the image is not based on a real photograph.

The qualities of the image as described—such as the precision of the man's attire, the lighting, and the texture of the fabric—do suggest it could very well have started as a photograph before any potential digital enhancement.