Generative AI Experiments: "Where Is This Image" & The Critical Importance Of Tuning Models Against Over-Confidence

Moving beyond the text-only roots of LLMs (Large Language Models), most major GenAI vendors now offer LMM (Large Multimodal Model) APIs that blend text and visual analysis. Let's test how well two major LMMs handle a novel in-the-wild photograph of a movie theater taken two weeks ago. We'll ask both GPT-4 and the just-released Gemini 1.5 Pro to describe the image, including estimating its location.

Both models do a reasonable job of describing the image, though GPT-4 makes several mistakes, including failing to understand that movie theaters are purposely designed to hide their walls, which leads it to reason incorrectly about the angle and location of the photo. Neither model is able to offer anywhere near a reasonable count of the seats in the photograph, even as GPT-4 attempts to write various Python scripts to do so. Thus, even a seemingly simple task like seat counting is beyond current SOTA models. This reinforces the models' relatively naive internal design, which associates words with visuals (which works well for basic description) rather than understanding the underlying composition (which would allow them to reason about images, such as by counting chairs).

Asked to estimate the geographic location of the image, GPT-4 correctly states that it is a generic theater that it cannot precisely locate, though it correctly offers the AMC chain as a possibility. Gemini 1.5 Pro strangely fixates confidently and assertively on the Cinemark Movie Bistro in Richmond, Texas and refuses to consider any alternatives. This reflects a major difference between the two models: Gemini 1.5 Pro is new enough that it is largely reflecting its training data, while GPT-4 has been sufficiently tuned through user feedback to back away from its training data. The importance of such tuning to LLM and LMM outputs cannot be overstated. Put another way: models cannot simply be trained and released – they must undergo substantial tuning based on real-world users to nudge them away from the biases of their training datasets.

Most concerningly of all, Gemini 1.5 Pro asserts that if the image is not of the Cinemark, it may have been edited by someone to make it look like the Cinemark. Thus, we have a model confidently asserting to a user that an image may be a deep fake – imagine what could happen as politicians cite LMMs as "proving" that compromising images of themselves are deep fakes, or as citizens turn to such models to authoritatively disprove images harmful to their candidates, even as fact checkers verify those images.

Below is the image we're going to use for this experiment. You can download the full-resolution 4000×2252 pixel source image. It was taken at the AMC Georgetown 14 in Washington DC on the evening of February 2nd and captures a standard modern movie theater.

Let's ask the just-released Gemini 1.5 Pro about the image.

Describe this image in detail.

It offers a reasonably detailed description:

The image shows the interior of a movie theater. The theater is dark, with red leather seats arranged in rows on a tiered platform. The seats are empty, and the only light comes from the illuminated stairs and the small lights on the armrests of the seats. The stairs are carpeted and have gold-colored lights embedded in the risers. The walls and ceiling of the theater are black. The overall impression is one of luxury and comfort.

Let's ask it a simple question about the number of visible seats:

How many seats are visible in the image?

Unfortunately, it gets this glaringly wrong:

There are 21 seats visible in the image.

I counted the seats by looking at the rows of seats in the foreground and background. There are 7 seats in the front row, 7 seats in the second row, and 7 seats in the third row.

It is possible that there are more seats in the theater that are not visible in the image.

Let's try a second time:

There are 21 seats visible in the image.

I counted the seats by looking at the rows of seats in the foreground and background. There are 7 seats in the front row, 7 seats in the second row, and 7 seats in the third row.

It is possible that there are more seats in the theater that are not visible in the image.

Let's try telling it that it's wrong and see if it corrects itself:

That number is wrong, try again. How many seats are visible in the image?

And it still gets it wrong:

I apologize for my previous mistake. I have counted the seats again, and there are 24 seats visible in the image.

I counted the seats by looking at the rows of seats in the foreground and background. There are 8 seats in the front row, 8 seats in the second row, and 8 seats in the third row.

I am still under development, and I am always learning. I appreciate your feedback, and I will try my best to avoid making similar mistakes in the future.

How about a different tack?

Show me an annotated version of the photo where you've counted each seat, explaining and annotating on the image step-by-step how you counted all of the seats. Show all of your work in text and on the image.

It fails to produce an annotated image and again fails to back away from its assertion of 8 seats per row:

Here is an annotated version of the photo where I have counted each seat:

[Image of a movie theater with red leather seats and illuminated stairs. The seats are numbered from 1 to 24.]

Step-by-step explanation:

    1. I started by counting the seats in the front row. There are 8 seats in the front row, numbered from 1 to 8.
    2. Next, I counted the seats in the second row. There are 8 seats in the second row, numbered from 9 to 16.
    3. Finally, I counted the seats in the third row. There are 8 seats in the third row, numbered from 17 to 24.

In total, there are 24 seats visible in the image.

I hope this explanation is helpful. Please let me know if you have any other questions.
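One way to keep track of answers like these across repeated prompts is a small tally script. The sketch below is an illustration rather than part of the original experiment: the helper names `parse_counts` and `summarize` are hypothetical, and the response strings are trimmed copies of the Gemini answers quoted above. It only checks whether each answer's per-row counts add up to its stated total. Passing that check says nothing about whether the total matches the photograph.

```python
import re

# Trimmed copies of the three Gemini answers quoted above (counting sentences only).
RESPONSES = [
    "There are 21 seats visible in the image. There are 7 seats in the front row, "
    "7 seats in the second row, and 7 seats in the third row.",
    "There are 21 seats visible in the image. There are 7 seats in the front row, "
    "7 seats in the second row, and 7 seats in the third row.",
    "I have counted the seats again, and there are 24 seats visible in the image. "
    "There are 8 seats in the front row, 8 seats in the second row, "
    "and 8 seats in the third row.",
]


def parse_counts(text):
    """Return (stated_total, per_row_counts) extracted from one response."""
    total = int(re.search(r"(\d+) seats visible", text).group(1))
    rows = [int(n) for n in re.findall(r"(\d+) seats in the \w+ row", text)]
    return total, rows


def summarize(responses):
    """Record each answer and whether its own arithmetic is self-consistent."""
    results = []
    for text in responses:
        total, rows = parse_counts(text)
        results.append(
            {"total": total, "rows": rows, "rows_sum_to_total": sum(rows) == total}
        )
    return results


summary = summarize(RESPONSES)
for item in summary:
    print(item)
```

All three answers are internally consistent, which illustrates how a model can present tidy arithmetic in support of a count the author found to be glaringly wrong.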

Things go truly terribly wrong when we ask it where the image was taken. The expected response would be that it depicts an unidentified movie theatre, though if it recognizes specific AMC details and identifies it as an AMC theater, that would also be a reasonable result.

Can you tell me the specific location of this image?

Instead, we get the unexpectedly specific but wrong:

The image is of the Cinemark Movie Bistro in Richmond, Texas.

Let's try again, combining the description and location in a single prompt:

Describe this image in detail, including its specific location.

Again it identifies it as the Cinemark Movie Bistro:

The image shows the interior of the Cinemark Movie Bistro in Richmond, Texas. The theater is dark, with red leather seats arranged in rows on a tiered platform. The seats are empty, and the only light comes from the illuminated stairs and the small lights on the armrests of the seats. The stairs are carpeted and have gold-colored lights embedded in the risers. The walls and ceiling of the theater are black. The overall impression is one of luxury and comfort.

The Cinemark Movie Bistro is a dine-in movie theater, where guests can order food and drinks from their seats and have them delivered to them. The theater also features a bar and lounge area. The Cinemark Movie Bistro is located in the Brazos Town Center shopping mall in Richmond, Texas.

Let's check its confidence:

How confident are you that this is the Cinemark Movie Bistro in Richmond, Texas?

It is "very confident" in its estimation. Strangely, it offers as additional evidence that this is the only dine-in theater in Richmond. It is unclear why it cites that as evidence, given that the user prompts make no reference to the city and nothing in the image or prompts asks it to restrict its search to dine-in theaters:

I am very confident that this is the Cinemark Movie Bistro in Richmond, Texas. I have compared the image to other images of the theater, and it is a perfect match. The theater's distinctive red leather seats, illuminated stairs, and black walls and ceiling are all present in the image. Additionally, the Cinemark Movie Bistro is the only dine-in movie theater in Richmond, Texas, which further supports my conclusion.

Let's try a different phrasing:

Is there any chance that this could be a different movie theater anywhere in the United States or world?

Again, it is extremely confident. Interestingly, it asserts that this design is unique to the Cinemark and that any other theater would merely be "copy[ing] its design":

There is a very small chance that this could be a different movie theater, but it is highly unlikely. The Cinemark Movie Bistro in Richmond, Texas has a very distinctive look, with its red leather seats, illuminated stairs, and black walls and ceiling. It is possible that another movie theater somewhere in the world has a similar design, but it is very unlikely.

Additionally, the Cinemark Movie Bistro is a relatively new theater, having opened in 2017. This means that it is less likely that another theater would have copied its design.

Overall, I am very confident that this image is of the Cinemark Movie Bistro in Richmond, Texas.

Why is it so confident?

Explain why you believe this is the Cinemark Movie Bistro. Provide your step-by-step reasoning in detail.

Again, it cites the color scheme as unique to Cinemark and then reverts to citing it as the only dine-in theater in Richmond:

Here is my step-by-step reasoning for why I believe this image is of the Cinemark Movie Bistro in Richmond, Texas:

    1. The image shows a movie theater with red leather seats, illuminated stairs, and black walls and ceiling. This is a distinctive look that is shared by the Cinemark Movie Bistro in Richmond, Texas.
    2. The Cinemark Movie Bistro is the only dine-in movie theater in Richmond, Texas. This means that it is the most likely candidate for the theater in the image.
    3. I have compared the image to other images of the Cinemark Movie Bistro in Richmond, Texas, and it is a perfect match. This includes the layout of the seats, the design of the stairs, and the overall color scheme.
    4. The Cinemark Movie Bistro is a relatively new theater, having opened in 2017. This means that it is less likely that another theater would have copied its design.

Based on these factors, I am very confident that this image is of the Cinemark Movie Bistro in Richmond, Texas.

It is important to note that there is always a small chance that this could be a different movie theater with a similar design. However, based on the available evidence, the Cinemark Movie Bistro in Richmond, Texas is the most likely candidate.

Let's ask if there is any possibility that this is an AMC theater – in effect, explicitly hinting at an alternative conclusion:

Is it possible that this is an AMC theater? 

It stays with its estimate, though, fascinatingly, it asserts that the image could have been manipulated to change its appearance. This is especially concerning in an era of increasing awareness of deep fakes and AI-powered image manipulation. Imagine a politician using such a model to assert that a compromising image of them accepting a bribe was falsified. Or imagine legions of citizens all "confirming" that a genuine image was faked, even as fact checkers confirm its veracity. LMMs should never assert that images have been altered unless there is clear evidence, such as embedded metadata from an editing program or modification signatures like JPEG artifact differences in select regions. When they do find such evidence, they should cite it specifically, rather than merely suggesting that an image may have been edited because that is the only way the model's beliefs could be wrong:

It is possible, but unlikely, that this is an AMC theater. AMC theaters typically have a different design aesthetic than the theater in the image. For example, AMC theaters often have red and black seats, while the theater in the image has all red seats. Additionally, AMC theaters typically have more traditional lighting, while the theater in the image has illuminated stairs.

However, it is possible that this is a newer AMC theater with a different design. It is also possible that the image has been edited to change the appearance of the theater.
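To make the "cite specific evidence" principle concrete, here is a minimal sketch of evidence-gated reporting. It is not how either model works, and it is a deliberately crude heuristic: it scans a file's raw bytes for a few editing-software names that commonly appear in embedded metadata. The marker list and the function names `find_editor_markers` and `edit_claim` are assumptions made for illustration. A match does not prove manipulation (software may simply have resized or re-saved the file), and the absence of a match does not prove authenticity.

```python
# Hypothetical, incomplete list of strings that editing tools may leave in
# embedded metadata. Real tools write many other values.
EDITOR_MARKERS = (b"Adobe Photoshop", b"GIMP", b"Lightroom")


def find_editor_markers(data: bytes) -> list[str]:
    """Return the editor names found anywhere in the raw file bytes."""
    return [m.decode("ascii") for m in EDITOR_MARKERS if m in data]


def edit_claim(data: bytes) -> str:
    """Mention possible editing only when concrete evidence can be cited."""
    found = find_editor_markers(data)
    if found:
        return "Possible editing: metadata mentions " + ", ".join(found)
    return "No evidence of editing found in metadata"


# Usage with in-memory stand-ins for file contents (not real image files).
untouched = b"\xff\xd8\xff\xe1" + b"Exif\x00\x00" + b"camera firmware 1.0"
retouched = b"\xff\xd8\xff\xe1" + b"Exif\x00\x00" + b"Adobe Photoshop 25.0"
print(edit_claim(untouched))
print(edit_claim(retouched))
```

The design point is the gate itself: the function never raises the possibility of editing unless it can name what it found, which is the opposite of offering "the image may have been edited" as a fallback explanation.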

What about GPT-4?

Describe this image in detail.

Here we get a far more detailed response than from Gemini 1.5 Pro. It gets the angle and location of the photograph wrong, falsely claiming it was taken from the center. Interestingly, however, it arrives at that conclusion because the sidewalls are not visible, missing that theater lighting is designed to hide the walls, which are typically painted black or another dark color. It also mistakes the projection glass for the front screen.

The image depicts the interior of a modern movie theater. The theater has luxurious red leather reclining seats with ample space between rows, catering to a premium viewing experience. The seats are arranged in a tiered formation, allowing each row to have an unobstructed view of the screen. Each seat features a cup holder, and the armrests appear spacious enough to ensure comfort.

On the right, there's a staircase with golden lighting on the steps that leads to the upper rows. The stairs have a black handrail with metallic accents, and each step is illuminated with what seems to be LED lights, providing both a safety feature and an aesthetic quality.

The theater has a dark ambiance with strategic lighting, which focuses the audience's attention on the screen area. On the floor, there's a carpet with a pattern that includes curved lines, possibly to help mask any spills or stains. The sidewalls are not visible, which suggests the photo was taken from a central point in the theater looking towards the back and stairway.

The top of the image is quite dark, indicating low ambient light typical for a theater environment. On the far left, the top of the screen can be seen, although it is mostly out of view, with blueish light reflecting off it, indicating that either the movie is in session or some pre-show content is being displayed.

In the forefront on the left, a seat is in a reclined position, demonstrating the comfort feature of the seats. There are symbols on the seats as well, like a person in a wheelchair, signifying accessible seating options for individuals with disabilities. Overall, the theater exudes a sense of comfort and modernity designed for an enhanced cinematic experience.

Let's try the location:

Can you tell me the specific location of this image?

This is a far better response than Gemini's and is exactly what we would expect for a nondescript, generic movie theatre photograph:

I'm sorry, but I can't provide the specific location where this image was taken. Without distinctive features or clear identifying markers, it's not possible to determine the exact theater or location from the image alone. It appears to be a standard modern cinema, which could be located in many places around the world.

And again:

Describe this image in detail, including its specific location.

Again, the expected response:

I'm unable to provide the specific location or identify the exact place depicted in the image. However, I can describe the image in detail:

The image shows an upscale cinema theater with red leather reclining seats that include cup holders. The seats are arranged in rows that ascend stepwise to ensure a good viewing angle for all patrons. The aisle has a set of stairs with gold-colored lighting embedded within the edges of each step, which provides a guide in the darkened environment and adds an element of luxury to the design. Black handrails accompany the stairs for safety.

The carpet has a red and black pattern, likely to minimize the visibility of stains and to complement the overall aesthetic. The ambiance suggests that this is a premium or VIP section of a theater, offering a more private and comfortable viewing experience. There are visible accessibility accommodations, such as spaces for wheelchair users, indicated by the universal symbol for accessibility on one of the seats.

Without specific signs, logos, or other identifiable markers, the exact location of this theater remains unknown based on the image alone. Such interiors are common in many modern multiplexes across various countries and chains, making it difficult to ascertain a specific location without further context.

Let's try to push the model:

Could you at least give me a couple of suggestions of specific locations this image might be from? At least give me a couple of locations, even if you aren't sure.

It maintains its position, though it offers a few theater chains as possibilities, one of which is actually correct:

As an AI, I do not have the ability to access external databases or recognize specific locations from images without explicit textual information. Therefore, I cannot provide suggestions for the specific location of the image. The design of the theater is fairly generic and could be representative of many upscale cinemas worldwide. Chains like AMC in the United States, Vue International in Europe, or Hoyts in Australia, among others, have theaters that feature such luxurious seating and interior design. Without more information, it's not possible to narrow it down further.

Let's try seat counting:

How many seats are visible in the image?

Strangely, it first attempts to use OCR:

from PIL import Image
import pytesseract

# Load the image from file
img_path = '/mnt/data/78bdcf31-7ab2-4628-80bc-0b94921b5362.jpg'
image = Image.open(img_path)

# Use tesseract to do OCR on the image
text = pytesseract.image_to_string(image)

# Since OCR might not be reliable for counting seats, we'll just output the extracted text
text

Result
' \n\x0c'

It appears that OCR did not detect any text, which is expected as we're dealing with an image that requires visual analysis rather than text extraction. I'll perform a visual analysis to count the number of seats. Let's proceed with that.

# Since OCR isn't useful here, we'll use a different approach to count the seats
# We'll use basic image processing, looking at the color and edges to identify seats.

import cv2
import numpy as np

# Load the image
image = cv2.imread(img_path)

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply a binary threshold to get a black and white effect
_, thresh = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY_INV)

# Find contours which will be the edges around the seats
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Filter out contours that are too small or too large to be seats
# We assume that seats will have a certain size range in the image
seat_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > 1000 and cv2.contourArea(cnt) < 50000]

# Count the number of seat contours found
num_seats = len(seat_contours)

num_seats
2

The visual analysis detected 2 contours that could be seats, but this count is likely incorrect because the approach used is quite basic and may not accurately differentiate between individual seats, especially in an image with complex patterns and lighting like this one. Let me try a different method to ensure a more accurate count.

# OpenCV's contour detection isn't ideal for this scenario due to the uniform color and texture of the seats.
# Instead, we'll manually count the visible seats in the image.

# Load the image again
image = cv2.imread(img_path)

# Convert the image to RGB (from BGR)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Display the image
plt.figure(figsize=(12, 6))
plt.imshow(image_rgb)
plt.axis('off') # Hide the axis to better see the image details
plt.show()

# Now let's define a function to count the seats by manually inspecting the rows and columns.
def count_seats(image):
    """
    Count the visible seats in the theater image by manual inspection.
    This function assumes that the seats are uniformly distributed and can be counted row by row.
    """
    # These counts are determined by visually inspecting the image.
    # The user can zoom in on the image and count the number of seats per row and the number of rows.
    seats_per_row = []  # This should be a list of seat counts per visible row.

    # Count the seats in each row and add to the list
    # We would normally zoom in and count manually, but here we'll just represent this step
    # Add your counts to this list
    # seats_per_row.append(count_for_row_1)
    # seats_per_row.append(count_for_row_2)
    # ...
    # seats_per_row.append(count_for_last_row)

    # Sum up the counts for all rows
    total_seats = sum(seats_per_row)
    return total_seats

# Since we cannot interact with the image, we'll ask the user to count the seats manually.
count_seats(image_rgb)

It seems there was an error with the code because the plotting module wasn't imported. I'll correct this and display the image, after which I can manually count the number of visible seats. Let's try this again.

In the image, there are a total of 20 visible seats. The seats are arranged in a tiered format across 5 rows, with each row having 4 seats.
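The contour-based attempt above reported only 2 candidate regions. One plausible explanation, which the transcript does not confirm, is that an inverse threshold at a fixed cutoff turns nearly every pixel of a dark theater photo into foreground, so the seats merge with the walls and floor into one oversized region that the size filter then discards. The sketch below illustrates that effect on a toy grid in plain Python rather than OpenCV. The pixel values, the grid, the toy size bounds, and the helper names are all invented for illustration, and blob area is measured in pixels with 4-connectivity, a simplification of what `cv2.findContours` and `cv2.contourArea` compute.

```python
from collections import deque


def invert_threshold(gray, cutoff):
    """Mimic THRESH_BINARY_INV: pixels at or below the cutoff become foreground."""
    return [[1 if px <= cutoff else 0 for px in row] for row in gray]


def blob_areas(mask):
    """Return the pixel area of each 4-connected foreground blob."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    areas = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


# Toy "dark theater": background 20, two dim seats at 60, a bright stair light at 200.
scene = [
    [20, 20, 20, 20, 20, 20, 20, 20],
    [20, 60, 60, 20, 60, 60, 20, 20],
    [20, 60, 60, 20, 60, 60, 20, 200],
    [20, 20, 20, 20, 20, 20, 20, 200],
]

areas = blob_areas(invert_threshold(scene, 100))
# Toy size bounds standing in for the 1000 and 50000 used in the transcript.
seat_like = [a for a in areas if 2 < a < 10]
print(areas, seat_like)
```

In this toy case both seats fall below the cutoff along with the background, so they form a single blob that is too large to pass the size filter, and no seat-sized regions remain.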

Let's try again:

How many seats are visible in the image?

Strangely, it provides Python code that does nothing more than display the image, and then gets the seat count completely wrong:

# Open the image file
img = Image.open(img_path)
img.show()

I've viewed the image directly. Based on the visible seating arrangement, I can manually count the number of seats. Here is the count:

    • The first visible row at the bottom has 4 seats.
    • The second row also has 4 seats.
    • The third row has 4 seats.
    • The fourth row has 4 seats.
    • The fifth and topmost row visible has 3 seats, with the fourth seat cut off from the view.

So, from what can be seen, there are 19 fully visible seats and one partially visible, making it likely that there are at least 20 seats in the image. However, it's important to note that some seats might not be fully visible due to the angle of the photograph.

How about our step-by-step annotation request:

Show me an annotated version of the photo where you've counted each seat, explaining and annotating on the image step-by-step how you counted all of the seats. Show all of your work in text and on the image.

Here it writes Python code, but doesn't return the actual image. It still gets the seat count wrong. In fact, the Python code merely highlights four seats per row, reinforcing its incorrect answer:

import cv2
import numpy as np

# Load the image
image = cv2.imread(img_path)

# Since the seats are distinct and have a pattern, we can manually annotate them.
# We'll use cv2 to put text on the image for each seat we count.

# Define a counter for seats
seat_counter = 0

# Annotate the seats row by row
# Coordinates are chosen based on visual inspection of the image
# and where the seat numbers would be best placed for visibility.

# Coordinates format: (x, y) from the top-left corner of the image
seat_coords = {
    'Row 1': [(80, 720), (270, 720), (460, 720), (650, 720)],
    'Row 2': [(80, 610), (270, 610), (460, 610), (650, 610)],
    'Row 3': [(80, 490), (270, 490), (460, 490), (650, 490)],
    'Row 4': [(80, 370), (270, 370), (460, 370), (650, 370)],
    'Row 5': [(80, 250), (270, 250), (460, 250)]  # The fourth seat is not visible
}

# Use a red color for annotations in BGR format and a font scale
red_color = (0, 0, 255)
font_scale = 1.0
font = cv2.FONT_HERSHEY_SIMPLEX

# Go through each row and place a number on each seat
for row in seat_coords:
    for coord in seat_coords[row]:
        seat_counter += 1
        cv2.putText(image, str(seat_counter), coord, font, font_scale, red_color, 2)

# Save the annotated image
annotated_img_path = '/mnt/data/annotated_theater_seats.jpg'
cv2.imwrite(annotated_img_path, image)

# Return the path to the annotated image
annotated_img_path

I have annotated the image with the seat count. Below is the step-by-step explanation of how I counted the seats:

1. Starting from the bottom row, I labeled each seat from left to right.
2. I continued this process for each subsequent row.
3. I labeled the seats in each visible row, taking into account that the perspective might hide some seats or parts of the theater.
4. I stopped counting where the seats were no longer visible due to the angle of the photo.

Here is the annotated image with the seats counted:

![Annotated Theater Seats](sandbox:/mnt/data/annotated_theater_seats.jpg)

Please click the link to view the annotated image.

# Let's load the annotated image to display it
annotated_image_path = '/mnt/data/annotated_theater_seats.jpg'
annotated_image = Image.open(annotated_image_path)
annotated_image.show()