ChatGPT's o1-Preview "Reasoning Model" Fares Little Better Than 4o On Our Thumbnail Experiments

Yesterday we explored the ability of ChatGPT's 4o foundational LLM to act as a brainstorming partner in developing better approaches to selecting representative frames that visually summarize television news broadcasts. Unfortunately, ChatGPT 4o yielded extremely poor results: the user had to already know the answer and explicitly instruct the model to use it, and even then the model refused to apply the feature correctly because it would not accept that ffmpeg can process image sequences as well as video file inputs. OpenAI has heavily touted its new o1-preview model as the vanguard of a coming era of advanced "reasoning models" that "spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes." To what degree does this additional "reasoning" yield better results?

The end result is that ChatGPT's vaunted advanced reasoning o1-preview model yields results no better than its baseline 4o model, and neither outputs the trivial answer: ffmpeg's built-in "thumbnail" capability, which achieves the desired result in a single one-liner. Only after being explicitly given the answer did either model provide the correct response. Unfortunately, as usual, the hype and hyperbole around foundational LLMs as developer copilots fails to match reality.
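
For reference, that trivial one-liner (the same command o1-preview eventually produces below once handed the answer) looks like the following, with ffmpeg's thumbnail filter selecting the single most representative frame from the image sequence:

ffmpeg -pattern_type glob -i '*.jpg' -vf "thumbnail" -frames:v 1 representative.jpg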

Let's use a similar prompt as last time:

Using ImageMagick or ffmpeg, examine a directory of images and pick the one that is most representative of the set.

This yields the following output:

Thought for 13 seconds

Identifying the representative image

To choose a representative image from a directory, I suggest calculating the average image of all images. Then, determine the image closest to this average, minimizing the sum of distances.

Adjusting my approach

I’m exploring how to use ImageMagick for averaging and comparing images, including computing root mean squared error. It’s interesting to see ImageMagick’s built-in tools streamline this process.

Mapping image differences

I’m pondering a script to compute RMSE between each image in a directory and a mean image, storing the RMSE values to pinpoint which image is closest to the mean.

Calculating RMSE

I’m thinking through computing the mean image from a set of JPG files and then evaluating the RMSE for each image against this mean, capturing and sorting these values to identify the image with the minimal RMSE.

Automating image analysis

I’m calculating the RMSE of all images relative to a mean image using ImageMagick and shell scripting. The image with the lowest RMSE is identified as the most representative.

To find the most representative image in a directory using ImageMagick or FFmpeg, you can follow a method where you calculate the average image (mean image) of the set and then select the image that is closest to this average. The image closest to the mean is considered the most representative because it minimizes the overall difference to all other images in the set.

Unfortunately, the code it provides is identical to ChatGPT 4o's approach:

#!/bin/bash

# Compute the mean image
convert *.jpg -evaluate-sequence mean mean.jpg

# Initialize an empty file to store RMSE values
rmse_file="rmse_results.txt"
> "$rmse_file"

# Loop through each image and compute RMSE with the mean image
for img in *.jpg; do
  # Skip the mean image if it's in the same directory
  if [[ "$img" == "mean.jpg" ]]; then
    continue
  fi

  # Compute RMSE between the image and the mean image
  err=$(compare -metric RMSE "$img" mean.jpg null: 2>&1 | awk '{print $1}')
  echo "$err $img" >> "$rmse_file"
done

# Find the image with the minimal RMSE
best_img=$(sort -n "$rmse_file" | head -1 | awk '{print $2}')

echo "The most representative image is: $best_img"

Unfortunately, its ffmpeg recommendation is the same mean-and-compare approach merely translated to ffmpeg, using a blend filter that averages only two inputs at a time:

ffmpeg -i image1.jpg -i image2.jpg -filter_complex "[0:v][1:v]blend=all_mode='average'" mean.jpg

Let's tell it the results are bad:

This yields really poor results, try again. Use built-in functions to get the best results.

Unfortunately, we get roughly the same results:

mkdir resized
for img in *.jpg; do
  convert "$img" -resize 100x100 "resized/$img"
done

#!/bin/bash

# Initialize an empty file to store total differences
total_diff_file="total_diff_results.txt"
> "$total_diff_file"

# Get a list of images (use the resized images if you resized them)
images=(resized/*.jpg)

# Total number of images
num_images=${#images[@]}

# Loop through each image
for ((i=0; i<num_images; i++)); do
  img1=${images[$i]}
  total_diff=0
  for ((j=0; j<num_images; j++)); do
    if [[ $i -eq $j ]]; then
      continue
    fi
    img2=${images[$j]}
    # Compute RMSE between img1 and img2
    err=$(compare -metric RMSE "$img1" "$img2" null: 2>&1 | awk '{print $1}')
    total_diff=$(echo "$total_diff + $err" | bc)
  done
  echo "$total_diff $img1" >> "$total_diff_file"
done

# Find the image with the minimal total difference
best_img=$(sort -n "$total_diff_file" | head -1 | awk '{print $2}')

echo "The most representative image is: $best_img"

The full reasoning chain makes no mention of the most obvious choice (ffmpeg's built-in thumbnail filter):

Thought for 46 seconds

Identifying the issue

Piecing together the problem, the assistant suggests using ImageMagick to find the most representative image by computing the mean image and the RMSE difference. The script and explanation are provided.

Identifying image issues

I’m thinking about why the initial method might fail: the mean image might not match the visual variance, and using RMSE might not accurately gauge image similarity due to shifts, rotations, or brightness changes.

Weighing image selection methods

I’m mapping out alternatives for choosing a representative image from an album. Instead of computing a mean image, I’m thinking about histograms or feature descriptors to pinpoint the most representative option. Using ImageMagick to calculate perceptual hash values also seems promising.

Evaluating methods

I’m weighing the use of 'phash' for perceptual hashing, 'identify' for feature extraction, and 'compare' for histogram differences. Clustering or minimizing total distance between two images also seems worth exploring.

Evaluating another approach

I'm thinking about a new method using ImageMagick to find the image closest to all others by calculating total differences and selecting the minimal total difference.

Evaluating image quality

This bash script uses ImageMagick to assess image quality by computing the RMSE between each image pair, summing differences, and identifying the most representative image with the smallest total difference.

Gathering insights

I'm working through how to measure visual similarity using perceptual hashes, a dedicated tool like pHash, or FFmpeg's features, especially for tasks like computing average color.

Weighing possibilities

I'm considering PCA or 'ffmpeg' for perceptual hashes. Initial methods didn't deliver. These options might be more effective.

Developing pairwise comparisons

To mitigate the limitations of mean image comparisons, I’m suggesting evaluating each image by computing the sum of differences (e.g., RMSE) relative to all other images.

Script selection

I’m selecting a bash script that leverages ImageMagick to identify the most representative image by calculating RMSE differences between all images and identifying the one with the smallest total difference.

Implementing histogram comparison

I’m considering an alternative approach that involves comparing image histograms to identify the most representative image.

Pulling together histograms

First, I calculate and store histograms for each image in a dedicated directory. This process is more straightforward than comparing histograms in text form.

Evaluating comparisons

I'm thinking through histogram differences or using perceptual hashing, like DHash, to compare images uniquely and efficiently.

Crafting a new method

Simulating a DHash with ImageMagick involves resizing images to smaller sizes and calculating differences. However, using pairwise comparisons between images to select the closest one might be a more effective approach.

Resizing and analyzing

I’m working through resizing images to minimize computational load and calculating total differences using ImageMagick RMSE. The goal is to identify the most representative image.

Optimizing computational efficiency

I’m focusing on enhancing speed by reducing image dimensions and switching to simpler metrics like MAE. This method minimizes computational time for large image sets.
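
For context, here is the approach the entire 46-second reasoning chain never surfaces. ffmpeg's built-in thumbnail filter computes a histogram of each frame and selects the frame closest to the average of each batch of n frames (100 by default), which is precisely the "most representative frame" behavior being asked for. A minimal sketch (the n=100 batch size shown here is illustrative):

ffmpeg -pattern_type glob -i '*.jpg' -vf "thumbnail=n=100" -frames:v 1 representative.jpg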

Let's go back and give o1-preview the actual answer and see how it performs:

This yields really poor results, try again. Use ImageMagick or ffmpeg's built-in capabilities to select representative frames. Use ffmpeg's built-in "thumbnail" or "scenedetect" filter.

This time we get the requested results for ffmpeg, but the ImageMagick results are still unusable: -evaluate-sequence with median or mode synthesizes a new composite image rather than selecting an actual frame from the set, and the per-image statistics approach reduces each image to a single mean brightness value:

ffmpeg -pattern_type glob -i '*.jpg' -vf "thumbnail" -frames:v 1 representative.jpg
ffmpeg -pattern_type glob -i '*.jpg' -vf "select='gt(scene,0.1)',scale=iw:ih" -frames:v 1 representative.jpg
convert *.jpg -evaluate-sequence median representative.jpg
convert *.jpg -evaluate-sequence mode representative.jpg
for img in *.jpg; do
  mean=$(convert "$img" -format "%[mean]" info:)
  echo "$mean $img" >> image_stats.txt
done
awk '{ total += $1; count++ } END { print total/count }' image_stats.txt
average_mean=$(awk '{ total += $1; count++ } END { print total/count }' image_stats.txt)
awk -v avg="$average_mean" '{ print ($1 - avg)^2, $2 }' image_stats.txt | sort -n | head -1 | awk '{print $2}'

And the reasoning chain:

Thought for 30 seconds

Evaluating image representation

I’m noting the user's request to select representative images using ImageMagick or ffmpeg. The initial mean image comparison approach is unsatisfactory, so alternative methods like pairwise comparisons are considered, leading to a revised script.

Investigating built-in tools

I'm mapping out the use of ImageMagick or ffmpeg's built-in capabilities to select representative frames, considering options like 'thumbnail' and 'select' in ffmpeg.

Evaluating tools

I'm examining ImageMagick and ffmpeg to select representative images. Ffmpeg's 'thumbnail' filter and ImageMagick's 'identify' command for feature extraction are potential options.

Weighing options

I'm considering ImageMagick's 'histogram' for a visual representation or ffmpeg's 'pHash' for perceptual similarity to pinpoint representative frames from given images.

Navigating options

I'm thinking through various methods such as ImageMagick's histogram analysis, k-means clustering, and ffmpeg's 'select' filter, to effectively choose representative frames based on image content.

Evaluating thumbnail filters

I’m considering using 'ffmpeg' to generate a representative frame from a video and exploring ImageMagick's '-evaluate-sequence median' for a median image approach.

Evaluating histogram options

I'm curious about calculating histograms for each image, aiming to identify the median histogram. Also, I’m considering ImageMagick’s 'magick compare' and 'unique' filter.

Employing FFmpeg filters

I’m thinking through FFmpeg filters thumbnail and select to spot key frames. The thumbnail filter excels at identifying representative frames in a video, which also works well for image sequences.

Utilizing FFmpeg

I’m working through using FFmpeg with the Thumbnail Filter to generate a representative image from JPEG files, or alternatively, the Select Filter with Scene Detection for video files.

Showcasing ImageMagick's Versatile Tools

I'm exploring ImageMagick's -evaluate-sequence function to pinpoint median or mode pixel values for images. Adjusting my strategy uncovering a promising approach.

Unfortunately, when asked to provide a code sample that generates multiple thumbnails, o1-preview yields the generic response below, which produces three identical thumbnails because it omits the "-vsync vfr" adjustment necessary to make them differ:

ffmpeg -pattern_type glob -i '*.jpg' -vf "thumbnail" -frames:v 3 representative_%02d.jpg
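
For comparison, a working version based on our earlier experiments would look something like the following sketch, which assumes enough input frames to fill three distinct batches (the n=100 batch size is again illustrative) and adds the vsync adjustment so that each batch's pick is emitted as a distinct output frame:

ffmpeg -pattern_type glob -i '*.jpg' -vf "thumbnail=n=100" -vsync vfr -frames:v 3 representative_%02d.jpg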