The Visual Explorer visually summarizes television news broadcasts through fixed 1/4fps thumbnail grids – an approach developed through extensive human factors experimentation testing various frame sampling methodologies against a range of actual analytic use cases presented by journalists and scholars. Often, however, we want to visually represent a broadcast even more succinctly, such as a single thumbnail image on the main Visual Explorer page for a given date or perhaps a brief 5-10 frame summary card. Is algorithmic or fixed time-based sampling better for such use cases?
Across two examples, one from Belarusian and one from US television news, both approaches appear to yield equally reasonable "representative" summaries of the broadcasts. However, for the example broadcast containing commercial breaks, fixed sampling drew twice as many frames from the commercials, suggesting that algorithmic approaches might offer a small edge, though further comparisons will be required.
Let's start with this broadcast from Belarusian TV. We'll use ffmpeg to sample our existing 1/4fps Visual Explorer thumbnail grid and select one frame every 168 frames and every 84 frames (dividing the broadcast's 841 Visual Explorer frames into precisely 5 and 10 samples).
find BELARUSTV_20240903_190500 | wc -l
ffmpeg -threads 10 -framerate 1 -i "BELARUSTV_20240903_190500/BELARUSTV_20240903_190500-%06d.jpg" -vf "select='not(mod(n\,168))'" -frames:v 5 -vsync vfr keyframes-%01d.jpg
time montage 'keyframes-%d.jpg[1-5]' -tile 5x -geometry +0+0 -background black ./montage.jpg
ffmpeg -threads 10 -framerate 1 -i "BELARUSTV_20240903_190500/BELARUSTV_20240903_190500-%06d.jpg" -vf "select='not(mod(n\,84))'" -frames:v 10 -vsync vfr keyframes-%01d.jpg
time montage 'keyframes-%d.jpg[1-10]' -tile 5x -geometry +0+0 -background black ./montage.jpg
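To make the arithmetic behind the 168 and 84 step sizes explicit, here is a minimal sketch that derives the step from the frame count and lists which frames a modulo-based select would keep. The helper names `step_for` and `sampled_frames` are hypothetical and not part of the Visual Explorer tooling. The time offsets assume the 1/4fps grid means one thumbnail every 4 seconds, with the first thumbnail at the start of the broadcast.

```shell
#!/bin/sh
# Hypothetical helper: integer step that splits `total` frames into `samples` picks.
step_for() {
    total=$1
    samples=$2
    echo $(( total / samples ))
}

# Print the 0-based frame indices a select='not(mod(n\,STEP))' filter would keep,
# capped at `samples` frames (mirroring -frames:v), alongside each frame's
# approximate offset in seconds (assumption: one frame every 4 seconds).
sampled_frames() {
    total=$1
    samples=$2
    step=$(step_for "$total" "$samples")
    i=0
    while [ "$i" -lt "$samples" ]; do
        n=$(( i * step ))
        echo "$n $(( n * 4 ))"
        i=$(( i + 1 ))
    done
}

# 841 frames into 5 samples: a step of 168 frames, or roughly 11 minutes apart.
sampled_frames 841 5
```

For 841 frames, integer division gives steps of 168 and 84 for 5 and 10 samples. Frame 840 would also match the modulo test in both cases, which is why the frame cap matters.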
Here is the 5-frame fixed sample:
And the 10-frame fixed sample:
We'll also use algorithmic sampling – in this case ffmpeg's built-in "thumbnail" filter, designed to "select the most representative frame in a given sequence". We asked it to identify the most representative ("best") frame within each batch of 168 and 84 frames, respectively:
find BELARUSTV_20240903_190500 | wc -l
ffmpeg -threads 10 -framerate 1 -i "BELARUSTV_20240903_190500/BELARUSTV_20240903_190500-%06d.jpg" -vf "thumbnail=168" -frames:v 5 -vsync vfr keyframes-%01d.jpg
time montage 'keyframes-%d.jpg[1-5]' -tile 5x -geometry +0+0 -background black ./montage.jpg
ffmpeg -threads 10 -framerate 1 -i "BELARUSTV_20240903_190500/BELARUSTV_20240903_190500-%06d.jpg" -vf "thumbnail=84" -frames:v 10 -vsync vfr keyframes-%01d.jpg
time montage 'keyframes-%d.jpg[1-10]' -tile 5x -geometry +0+0 -background black ./montage.jpg
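To visualize which stretch of the broadcast each algorithmically chosen frame is drawn from, the sketch below prints the frame range of each batch. It assumes the filter consumes frames in consecutive, non-overlapping groups of the given size. The helper name `batch_ranges` is hypothetical, and how ffmpeg handles a trailing partial batch may vary by version.

```shell
#!/bin/sh
# Hypothetical helper: print the 0-based frame range covered by each full batch
# when frames are consumed in consecutive groups of `batch` frames.
batch_ranges() {
    total=$1
    batch=$2
    start=0
    while [ $(( start + batch )) -le "$total" ]; do
        echo "$start-$(( start + batch - 1 ))"
        start=$(( start + batch ))
    done
    # Frames remaining after the last full batch.
    echo "leftover $(( total - start ))"
}

# 841 frames in batches of 168: five full batches plus one leftover frame.
batch_ranges 841 168
```

Unlike fixed sampling, which always takes the first frame of each range, the algorithmic approach can pick any frame within it.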
In contrast, here are the results for algorithmic selection for 5 frames:
And 10 frames:
What about this CNN broadcast?
Fixed 5:
Vs algorithmic 5:
Fixed 10:
Vs algorithmic 10: