Behind The Scenes: Identifying Failed Recordings: Using Large Multimodal Models Like ChatGPT & Gemini: Part 3

Continuing our series examining whether Large Multimodal Models (LMMs) like ChatGPT and Gemini might be able to help us identify corrupt video files in the TV News Archive, let's try two Chinese broadcasts that have clear audio and corrupted video. As usual, ChatGPT largely fails to detect the corruption: it misses it entirely in the first video, and flags the second as corrupt in one run but not when run again from scratch. Gemini correctly identifies both videos as corrupt, but consistently hallucinates a number of details about them.

Let's start with this corrupt broadcast from last year that does feature usable audio:

You can download the clip below to try with your own LMM experiments:

Let's try ChatGPT 4o:

Describe the visual contents of this television news broadcast in detail. What does it depict? Does the video show any signs of corrupted signal via artifacts in the video? If so, how much of the video appears to be corrupt?

As usual, ChatGPT identifies no problems with the video – again because of its overreliance on writing trivial Python scripts to do the analysis:

The video is approximately 30.16 seconds long, has a frame rate of 25 frames per second, and a resolution of 640x356 pixels. I will now analyze the visual content and check for any signs of corrupted signal or artifacts.

I have analyzed frames extracted from the video at 2-second intervals. Each frame is evaluated for mean pixel value to detect signs of visual anomalies or corruption.

Key observations:

The mean pixel values are consistent across most frames, suggesting no significant visual corruption.
Minor fluctuations in mean pixel values are within normal ranges for variations in broadcast content.
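
To see why a mean-pixel check of this kind can miss smearing, here is a minimal Python sketch. It is not the script ChatGPT actually ran, which is not shown in its reply; `frame_mean`, `sample_indices` and the toy frame values are hypothetical and chosen purely for illustration. The smeared frame replaces every column with its average, wiping out all vertical detail, yet its mean pixel value is identical to that of the clean frame.

```python
def frame_mean(frame):
    """Mean pixel value of a grayscale frame given as a list of rows."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def sample_indices(duration_s, fps, step_s):
    """Frame indices sampled at a fixed time interval (e.g. every 2 seconds)."""
    total_frames = int(duration_s * fps)
    step_frames = int(step_s * fps)
    return list(range(0, total_frames, step_frames))


# Hypothetical 4x4 grayscale frame with plenty of detail (toy data).
clean = [
    [10, 200, 30, 180],
    [220, 40, 190, 20],
    [20, 210, 20, 170],
    [230, 30, 200, 30],
]

# Simulated vertical smear: each column collapses to its own average,
# so every row of the frame becomes identical.
column_means = [sum(col) // len(col) for col in zip(*clean)]
smeared = [column_means[:] for _ in clean]

print(frame_mean(clean), frame_mean(smeared))
print(len(sample_indices(30.16, 25, 2)))
```

Using the duration and frame rate reported in the reply, sampling every 2 seconds examines only 16 frames, and a per-frame average discards the spatial structure where this kind of corruption lives.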

Asked the same question, Gemini 1.5 Pro hallucinates that the video shows a nightclub, but does correctly identify that it is corrupted:

The clip seems to show a nighttime or dark indoor scene with pink and purple lighting effects, possibly showing people dancing in a club environment, but it is difficult to tell for sure. There is a rhythmic thumping sound heard in the background as well, suggesting the presence of music.

Almost all of the visual data is corrupted. The image appears broken up into thin, vertical bands or lines, exhibiting a severe form of video distortion. These bands are smeared and stretched vertically, obscuring any underlying details of the scene. There are also noticeable color distortions, with random pixelation and flickering colors adding to the corrupted appearance. The constant visual flickering and shifting of the bands further suggests a problem with the video signal, making it nearly impossible to discern any coherent imagery.

Gemini 1.5 Flash offers similar results:

The video shows a blurry and distorted television news broadcast. The visual is predominantly shades of muted pink and purple, making it difficult to discern any clear images or text. The broadcast shows signs of severe signal corruption or artifacts across the entire screen. Vertical lines and pixelation are pervasive, obscuring any potential content.

Essentially, the entire video is visually corrupted; there is no portion that is free of artifacts. It's impossible to determine what news story, if any, is being broadcast due to the level of distortion.

How about this second video, with a richer, more vibrant smearing effect?

And here is the video clip to try with your own experiments:

Remarkably, for the first time in the four tested videos, ChatGPT 4o detects corruption:

Most of the frames are visually corrupt, displaying minimal variance, which suggests heavy artifacts or damaged content. This level of corruption likely affects the visibility of any clear imagery or details. Only 1 frame was deemed visually clear.
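
The reply above cites "minimal variance" as its signal. A rough sketch of what such a check might look like follows, assuming a per-frame pixel variance compared against a cutoff. The actual script and threshold ChatGPT used are not shown, so `frame_variance`, `count_low_variance`, the toy frames and both cutoffs are hypothetical. It illustrates how the same frames can be judged mostly corrupt or entirely clean depending on an arbitrary cutoff.

```python
from statistics import pvariance


def frame_variance(frame):
    """Population variance of all pixel values in a grayscale frame."""
    return pvariance([p for row in frame for p in row])


def count_low_variance(frames, threshold):
    """Number of frames whose pixel variance falls below the cutoff."""
    return sum(1 for f in frames if frame_variance(f) < threshold)


# Hypothetical 2x2 frames (toy data): three nearly flat, one with detail.
flat = [[128, 130], [127, 129]]
detailed = [[10, 240], [230, 20]]
frames = [flat, flat, flat, detailed]

# The verdict depends entirely on the (assumed) cutoff.
print(count_low_variance(frames, 100))  # flat frames fall below this cutoff
print(count_low_variance(frames, 1.0))  # no frame falls below this one
```

If each run writes a fresh script with a different metric or cutoff, opposite verdicts on the same video are one plausible outcome.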

Unfortunately, on a second run ChatGPT reverts to its previous behavior of detecting no corruption:

The video consists of 749 frames, with no frames visibly corrupted. Below are key observations:

Corruption Analysis: All frames appear visually intact with no noticeable corruption or artifacts.
Visual Descriptions: Key moments every 100 frames are described as normal, with visible and seemingly uncorrupted content.

Gemini 1.5 Pro correctly detects that the video is corrupt, but hallucinates a number of details:

This video shows a Chinese news broadcast on a television screen. It begins with a static, gray screen showing illegible white text at the top center. Then, a section of illegible Chinese text with white and purple-blue coloring appears and quickly disappears at the top center of the screen. The screen is completely gray for a few seconds, then a section of illegible text, with what appear to be colorful pixels against a gray background, moves across the lower two thirds of the screen. At timestamp 0:19, the video cuts to another shot of a gray screen, this time with a thick white border on the left. Illegible text again appears in the lower two thirds of the screen.

The entire video shows signs of a corrupted signal, with multicolored pixels strewn across much of the screen. At times, these pixels are the only content other than a completely gray background.

On a second run, Gemini again detects the video as corrupted, but again hallucinates key details:

This video appears to be an excerpt of a Chinese television news broadcast. It opens with a title card written in Chinese. The rest of the video appears to show a news anchor presenting a news report, presumably covering a recent quote by Xi Jinping about the achievements and milestones reached by the People’s Republic of China.

The video shows clear signs of video corruption due to a distorted signal. Artifacts are visible throughout the video, making it difficult to fully understand the visual content being presented. Most of the footage is affected, with only the first title card and the news anchor’s audio being easily distinguishable.