Earlier this week we demonstrated the limitations of using Large Multimodal Models (LMMs) like ChatGPT and Gemini to detect corrupted recordings in the TV News Archive. In that test, ChatGPT 4o hallucinated a vividly rich scene from whole cloth and produced Python code to scan for visual artifacts that did not do what the model claimed, while Gemini 1.5 Pro correctly identified the video as corrupt. Let's try a different broadcast, this one with a distinctive Atari-like visual corruption. Even when explicitly told that the video is from a news broadcast, ChatGPT 4o once again fails to detect any problems. This time Gemini 1.5 Pro fares little better: it sees the video as gaming footage, and where it does report corruption, it does so on the basis of elements it has hallucinated, much as ChatGPT did last time.
Let's examine this broadcast from last September that has a strange, distinctive Atari-style corruption:
You can download the file to run your own LMM experiments:
Let's ask Gemini 1.5 Pro to describe the video and check for corruption-related artifacts:
Describe the visual contents of this video in detail. What does it depict? Does the video show any signs of corrupted signal via artifacts in the video? If so, how much of the video appears to be corrupt?
While recognizing that it is a largely static scene, Gemini fails to see it as a corrupted file:
The video displays a pixelated scene resembling an abstract landscape or terrain, reminiscent of early video game graphics. A strip of green, designed as if it were a plateau with a slight incline, occupies the upper left corner. Several gray pixelated shapes, suggestive of rocks or small structures, are placed on this green area. The main visual focus is a large, green triangular form that dominates the lower right, resembling a steeply sloped hill or mountain, also with gray pixelated rock-like shapes scattered along its upward face. The overall style is simplistic and blocky, characteristic of retro digital art or gaming aesthetics. There are no signs of corrupted signals or artifacts. The image maintains its intended quality and design throughout its duration, which appears to be of a static screen.
Of course, since we didn't tell the model that this is a news broadcast, it could be a perfectly legitimate capture from an old video game. Let's run it again from scratch with that additional context to see if its behavior changes:
Describe the visual contents of this television news broadcast in detail. What does it depict? Does the video show any signs of corrupted signal via artifacts in the video? If so, how much of the video appears to be corrupt?
This time Gemini does detect corruption, but only by hallucinating artifacts that are not present in the video, which undermines its conclusion:
The entire video depicts a low-resolution animation of a pixellated green hill with gray rocks, resembling a side view from a classic video game. The hill is divided into two sections separated by a black void. The upper section is a flat, horizontal platform with a small, diagonal slope at the left end. A gray rock sits on this platform near the slope. The lower section shows a steeper hill ascending diagonally from left to right. This section has three more gray rocks placed at different heights. The video exhibits significant signs of a corrupted signal. Throughout the entire duration, flickering vertical lines and blocks of various colors (primarily pink, purple, and green) disrupt the image. These artifacts constantly shift and change, affecting almost every frame and appearing across the entire screen. The pixelation of the original content is made even more distorted by the corruption. This constant and pervasive artifacting indicates a heavily degraded video signal, significantly affecting the overall quality.
Run a third time from scratch, Gemini identifies the video as a video game. The only corruption it reports is a hallucinated logo in the top-left corner. Strangely, despite being explicitly told that this is a news broadcast, it raises no concern about game footage appearing in one (which is entirely possible, but should still raise questions given the lack of any onscreen visuals denoting such):
The video displays a static image of a screen capture of an early video game. The game screen is divided roughly into two halves, separated by a black expanse across the middle. Both halves of the screen depict sloped green pixelated hills. Each hill features gray rocks on various level tiers. The game screen is not centered in the video frame, rather it is shifted toward the upper left corner, and occupies just over half of the overall space. The remaining portions of the video frame are black. There is a prominent "P" overlaid in transparent white in the top left corner of the video frame from beginning to end. The video does show signs of signal corruption. The white "P" appears to be a static video artifact throughout the duration. There is no visible disruption or distortion to the game screen imagery, only the persistent "P" artifact. Only a very small portion of the video, the "P", is affected.
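Because the three Gemini runs above disagree with one another, a single response is weak evidence either way. The following is a minimal sketch of how such repeated from-scratch trials might be organized. The `ask` callable is hypothetical: it stands in for whatever client code opens a fresh session, attaches the video, and returns the model's reply, and it is not part of any real SDK. The verdict labels are assumed to be assigned by a human reviewer after reading each response.

```python
from collections import Counter
from typing import Callable, Dict, List

# Prompts quoted from this article; the second adds the news-broadcast context.
QUESTION = (
    "What does it depict? Does the video show any signs of corrupted signal "
    "via artifacts in the video? If so, how much of the video appears to be corrupt?"
)
PROMPTS: Dict[str, str] = {
    "no_context": "Describe the visual contents of this video in detail. " + QUESTION,
    "news_context": (
        "Describe the visual contents of this television news broadcast in detail. "
        + QUESTION
    ),
}


def run_trials(ask: Callable[[str], str], prompt: str, runs: int = 3) -> List[str]:
    """Call `ask` once per run and collect the replies in order.

    `ask` is a hypothetical wrapper supplied by the caller. It is assumed to
    start a new conversation each time so that no run can see an earlier one.
    """
    return [ask(prompt) for _ in range(runs)]


def tally(labels: List[str]) -> Counter:
    """Count reviewer-assigned verdicts such as 'corrupt' or 'clean'."""
    return Counter(labels)


def is_consistent(labels: List[str]) -> bool:
    """Return True only when every run received the same verdict."""
    return len(set(labels)) <= 1
```

Taking each model's stated conclusion at face value, a reviewer might label the three Gemini runs above `clean`, `corrupt`, and `corrupt`, for which `is_consistent` returns `False`. That disagreement is itself a reason not to trust any one run.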
Similar to last time, ChatGPT 4o detects no issues with the video:
Describe the visual contents of this video in detail. What does it depict? Does the video show any signs of corrupted signal via artifacts in the video? If so, how much of the video appears to be corrupt?
Yielding:
The video appears to be free of any corrupted signals or artifacts.
Let's try explicitly telling it that it is a news broadcast:
Describe the visual contents of this television news broadcast in detail. What does it depict? Does the video show any signs of corrupted signal via artifacts in the video? If so, how much of the video appears to be corrupt?
As before, ChatGPT detects no problems with the video:
Based on the extracted frames from the video, here's an analysis of the visual contents and the presence of any corrupted signals:
Visual Contents:
General Setup: The video depicts a television news broadcast. Common features of news broadcasts include a presenter, news graphics, text overlays, and possibly video inserts or footage.
Graphics and Presentation: If present, the frames may include elements such as a news ticker, station logo, or text captions summarizing headlines or providing additional context. The visual style and color scheme are typical of television broadcasts, with clear separation between the presenter and graphical elements.
Content Specifics: Depending on the frames, the visuals might show a news anchor, studio backdrop, or live footage related to a news story. Background elements may include dynamic or static designs, potentially highlighting a country's cultural or thematic elements.
Corruption and Artifacts:
Presence of Artifacts: No immediately obvious signs of significant corruption, such as pixelation, freezing, or screen tearing, were observed in the sample frames. Signal issues (e.g., color banding, blocky compression artifacts) can sometimes appear in isolated areas or during transitions between scenes. If these issues exist, they were not prominent in the sampled frames.
Extent of Corruption: If any corruption is present, it appears minimal and not disruptive to understanding the broadcast content. The majority of the video appears to maintain its intended visual clarity.