ChatGPT's Unhelpful Take On Visually Summarizing Television News Broadcasts Through Thumbnails

Continuing our series on selecting "representative" frames from television news broadcasts to visually summarize them, thus far we've been experimenting with ffmpeg's built-in "thumbnail" filter. Given all of the attention being paid these days to using foundational LLMs as brainstorming assistants, what might ChatGPT 4o offer in the way of alternative thumbnail selection approaches?

Unfortunately, as we've come to expect from LLMs, the results were far from useful. ChatGPT 4o initially recommended code that averaged every frame of the video into a single mean composite image (a muddy grey blur) and then selected the frame closest to that composite, which turned out to be a single isolated blurry grey frame bearing no resemblance to the overall broadcast. Despite repeated prompting to use ffmpeg's built-in thumbnail selection feature, ChatGPT steadfastly asserted that ffmpeg offers no such capability. Only when explicitly given the actual command to use did it produce the correct code, and even then it refused to accept that ffmpeg can operate on sequences of still images. No amount of prompting was able to dislodge ChatGPT from these false assertions during the brainstorming session. Overall, the results were exceptionally poor, in keeping with what we continue to see from LLMs to date.

Let's start with a simple prompt asking ChatGPT to use either ImageMagick or ffmpeg, two of the most widely known image processing toolkits and ones already deployed throughout our own video processing infrastructure:

Using ImageMagick or ffmpeg, examine a directory of images captured from a video and pick the one that is most representative of the set.

This yields the following explanation:

To select the most representative image from a directory of images using either ImageMagick or FFmpeg, there isn’t a direct built-in tool, but we can approach this by calculating statistical properties (such as histograms or color variance) across the set of images and picking the image that is closest to the average. Here’s how you can do it with either tool.

It then provided the following code (only the filename was changed below), which we applied to this Russian news broadcast:

time convert RUSSIA1_20240911_140000_Vesti/*.jpg -evaluate-sequence mean ./RUSSIA1_20240911_140000_Vesti.AVG.JPG

for img in RUSSIA1_20240911_140000_Vesti/*.jpg; do
    compare -metric RMSE "$img" ./RUSSIA1_20240911_140000_Vesti.AVG.JPG null: 2>> ./differences.txt
    echo "$img" >> ./differences.txt
done
awk '{print $1}' ./differences.txt | sort -n | head -n 10

The code above first computes an image representing the mean of all of the frames in the video. Unfortunately, exactly as one might expect, this is merely a grey mess, since it averages every pixel across the entire broadcast. Fascinatingly, though, we can see the CNN logo in the lower right, contributed by an extended clip of the presidential debate that appears in the broadcast, along with the ghostly image of a figure in the center facing the camera, a composite of the newsreaders, Putin and other figures:

The contents of differences.txt, recording each frame's RMSE distance from this average, look like:

head -10 differences.txt
16244.7 (0.247878)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000001.jpg
30613 (0.467124)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000002.jpg
28715 (0.438163)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000003.jpg
17221.8 (0.262788)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000004.jpg
16943 (0.258533)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000005.jpg
16748 (0.255558)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000006.jpg
21297.5 (0.324979)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000007.jpg
21012.4 (0.320628)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000008.jpg
19609.9 (0.299229)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000009.jpg
21051.3 (0.321222)RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-000010.jpg
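
Note that ChatGPT's final awk line sorts only the raw RMSE values and discards the filenames, so on its own it cannot tell us which frame won. Because compare writes each metric to differences.txt without a trailing newline, the echo that follows appends the filename to the same line, as seen above, so a simple numeric sort of the file (our own adjustment, not ChatGPT's) recovers the closest frame:

# Smallest RMSE = frame closest to the mean composite image.
sort -n ./differences.txt | head -n 1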

In turn, the single frame that is closest to this composite is the following, which does not reflect the overall broadcast in any way:


Let's go back to ChatGPT with our findings and ask for a revised solution:

That didn't work well, it selected a grey blurry mess that was not representative of the images at all. Use one of ffmpeg or ImageMagick's built-in thumbnail selection features.

Unfortunately, it fails to recognize that ffmpeg has a built-in thumbnail feature:

Solution using FFmpeg:

Unfortunately, FFmpeg doesn't have built-in tools like ImageMagick's compare for analyzing structural similarity between images directly. FFmpeg is better suited for video and audio processing. However, you can use FFmpeg to extract features like histograms, and then compare the histograms of each image using a script, but it would require additional logic beyond FFmpeg itself to measure similarity.

Conclusion:

For your purpose, ImageMagick's compare tool with RMSE is the most straightforward and reliable method to use in a shell script to find the most representative image from a set.
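
Setting aside its recommendation, ChatGPT's claim here is simply false: ffmpeg ships filters such as ssim and psnr that compare two images directly, with no external scripting required. As a minimal sketch (frame1.jpg and frame2.jpg are hypothetical stand-ins for any two extracted frames), the following computes structural similarity between two JPEGs entirely within ffmpeg, printing the score to the console log:

# Compare two still images with ffmpeg's built-in SSIM filter.
ffmpeg -i frame1.jpg -i frame2.jpg -lavfi ssim -f null -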

What if we explicitly tell ChatGPT about ffmpeg's thumbnail filter and demand that it use it:

Write an ffmpeg command to process a directory of JPG images and use ffmpeg's built-in "thumbnail" filter to identify the most representative frame of the first 2 minutes.

This yields:

To use FFmpeg’s thumbnail filter to identify the most representative frame from a directory of JPG images over a time range (like the first 2 minutes), the typical usage is for video processing. If you want to find the most representative frame from a video (such as from the first 2 minutes), you can use the -vf thumbnail filter. However, applying it to a collection of still images is not supported, so this answer assumes you're referring to video frames.

It then provides the correct code, but for a video:

ffmpeg -i input_video.mp4 -vf "select='lte(t,120)',thumbnail" -frames:v 1 most_representative_frame.jpg

Despite repeated pushing, ChatGPT 4o refused to budge from its assertion that ffmpeg does not support image sequences. The closest it came was to first assemble the still images into a video and then run the thumbnail filter over that video:

ffmpeg -framerate 1 -i img%03d.jpg -c:v libx264 -r 30 -pix_fmt yuv420p output_video.mp4
ffmpeg -i output_video.mp4 -vf "select='lte(t,120)',thumbnail" -frames:v 1 most_representative_frame.jpg
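
In reality, ffmpeg's image2 demuxer reads still-image sequences natively, so the intermediate video step is unnecessary. As a sketch, assuming the frames follow the numbered naming pattern visible in the listing above and treating each image as one second of video, the thumbnail filter can be applied directly to the image sequence, with a batch size of 120 standing in for the "first 2 minutes":

# Read the JPG sequence directly and pick the most representative
# of the first 120 frames using the built-in thumbnail filter.
ffmpeg -framerate 1 -start_number 1 -i RUSSIA1_20240911_140000_Vesti/RUSSIA1_20240911_140000_Vesti-%06d.jpg -vf "thumbnail=120" -frames:v 1 most_representative_frame.jpg

On builds compiled with glob support, -pattern_type glob with a wildcard such as 'RUSSIA1_20240911_140000_Vesti/*.jpg' works as well.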