Using FFMPEG's Scene Detection To Generate A Visual Shot Summary Of Television News

The first step in breaking a television news broadcast into "stories" is to segment it into distinct visual "scenes". A scene change can represent a switch between clips, a handoff from the studio to a field reporter or simply a different camera angle. While a newsreader might discuss multiple stories in a row, modern television news production typically marks the transition between stories with a shift to a field reporter or guest, a cut to prerecorded footage or a change of camera angle in the studio. If we can segment a broadcast into scenes, we therefore have a set of critical building blocks for story segmentation. Moreover, even within a single story, the rate at which imagery changes can offer powerful cues to visual framing, and collapsing a broadcast by scenes enables rapid visual summaries.

The widely used open source "ffmpeg" utility offers two different methods for breaking videos into scenes and can even create a mosaic of the first frame of each scene, all in a single command! (Note that Debian users might need to compile a modern ffmpeg from source or use a precompiled binary.)

Let's see how the two approaches fare on the Dec. 17, 2020 3AM PST episode of CNN's New Day With Alisyn Camerota and John Berman.

The simplest approach is to use the "select" filter's built-in "scene" variable, which scores how much each frame differs from the previous one on a scale from 0 to 1 and lets you set the threshold at which a scene division will occur:

time ./ffmpeg -i ./VIDEO.mp4 -vf "select=gt(scene\,0.4),scale=160:-1,tile=6x80" -frames:v 1 -qscale:v 3 preview.jpg
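
For scripting this across many broadcasts, the same command can be assembled as an argument list rather than typed by hand. The following is only a minimal sketch: the function names, defaults and file paths are our own illustrative assumptions and not part of ffmpeg, and it simply reproduces the filter chain used above.

```python
import shlex


def build_mosaic_command(video, threshold=0.4, cols=6, rows=80,
                         thumb_width=160, out="preview.jpg",
                         ffmpeg="./ffmpeg"):
    """Return an ffmpeg argument list for a scene-change mosaic.

    All names and defaults here are illustrative assumptions.
    """
    # The comma inside gt() is escaped so that ffmpeg's filtergraph parser
    # does not read it as a separator between filters.
    filters = ",".join([
        f"select=gt(scene\\,{threshold})",  # keep frames above the threshold
        f"scale={thumb_width}:-1",          # thumbnail width, keep aspect ratio
        f"tile={cols}x{rows}",              # mosaic grid: columns x rows
    ])
    return [ffmpeg, "-i", video, "-vf", filters,
            "-frames:v", "1", "-qscale:v", "3", out]


def tile_capacity(cols, rows):
    """Maximum number of thumbnails that fit in a single mosaic."""
    return cols * rows


cmd = build_mosaic_command("./VIDEO.mp4")
print(shlex.join(cmd))       # shell-quoted form, for logging or copy/paste
print(tile_capacity(6, 80))  # thumbnails available in a 6x80 grid
```

The list can be handed to subprocess.run(cmd, check=True). Since only a single output image is requested, the grid should be sized generously: as we understand it, scene changes beyond the grid's capacity will not appear in that first mosaic.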

The "select" filter keeps only the frames whose scene score exceeds the 0.4 threshold, the "scale" filter resizes them to 160-pixel-wide thumbnails and the "tile" filter sets the number of columns and rows (here 6 by 80) to use in the output mosaic image. The final output is the following:

Perhaps most readily apparent from the sequence above is that the majority of the scene changes occur during commercials. Look closely and actual scenes from the news portion of the broadcast are sparse, because an entire five-minute segment in the studio will be reduced to a single frame if the camera remains locked onto the anchor without moving. An entire story might be made up of only a small number of shots. In contrast, advertisements tend to be extremely fast-paced, filled with rapid-fire cuts that can total five or more entirely different scenes in a single second of airtime. Scene detection of this kind could therefore be a powerful tool for comparing the visual pacing of news and advertising content on television news.
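
One simple way to quantify this difference in pacing is to count scene changes per fixed window of airtime. The sketch below is an illustration only: the function name, the 10-second window and the timestamps are our own assumptions, not measurements from the broadcast.

```python
from collections import Counter


def cuts_per_window(times, window=10.0):
    """Count scene changes in each fixed-length window.

    times  -- scene-change timestamps in seconds
    window -- window length in seconds (an arbitrary illustrative default)
    Returns {window_start_seconds: number_of_cuts} for non-empty windows.
    """
    counts = Counter(int(t // window) for t in times)
    return {i * window: counts[i] for i in sorted(counts)}


# Hypothetical timestamps in seconds, for illustration only.
times = [5.5, 7.0, 9.2, 10.5, 12.5, 14.8, 24.2, 24.4, 25.4, 30.4, 45.5]

for start, n in cuts_per_window(times).items():
    print(f"{start:6.1f}s  {n} cuts")
```

Windows with a sustained high count would be candidates for advertising blocks, while long stretches with no entries at all would point to locked-off studio segments. Any cutoff between the two would need to be calibrated against labeled broadcasts.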

Alternatively, ffmpeg also offers the "scdet" filter, which can additionally write rich information on each scene transition to ffmpeg's log output (standard error by default) for consumption by downstream analytic tools:

time ./ffmpeg -f lavfi -i "movie=VIDEO.mp4,scdet=s=1:t=14" -vf "scale=160:-1,tile=6x85" -frames:v 1 -qscale:v 3 preview.jpg

Here we set parameter "s" to 1 to tell it to pass only scene change frames downstream, while we set the "t" threshold to 14. Note that "scdet" scores run from 0 to 100, unlike the 0 to 1 scale of the "select" filter's "scene" variable. This outputs a list of scene change detections at the same time as it compiles the collage. The first few rows can be seen below, with the score and timecode of each change:

lavfi.scd.score: 18.313, lavfi.scd.time: 5.5055
lavfi.scd.score: 46.487, lavfi.scd.time: 7.007
lavfi.scd.score: 18.181, lavfi.scd.time: 9.24257
lavfi.scd.score: 14.712, lavfi.scd.time: 10.5439
lavfi.scd.score: 21.134, lavfi.scd.time: 12.5459
lavfi.scd.score: 20.481, lavfi.scd.time: 14.8482
lavfi.scd.score: 19.946, lavfi.scd.time: 24.2409
lavfi.scd.score: 81.113, lavfi.scd.time: 24.391
lavfi.scd.score: 41.554, lavfi.scd.time: 25.4254
lavfi.scd.score: 38.609, lavfi.scd.time: 30.397
lavfi.scd.score: 15.266, lavfi.scd.time: 34.034
lavfi.scd.score: 15.522, lavfi.scd.time: 34.0674
lavfi.scd.score: 14.180, lavfi.scd.time: 34.1007
lavfi.scd.score: 14.727, lavfi.scd.time: 34.1341
lavfi.scd.score: 15.993, lavfi.scd.time: 34.1675
lavfi.scd.score: 16.839, lavfi.scd.time: 34.2008
lavfi.scd.score: 16.111, lavfi.scd.time: 34.2342
lavfi.scd.score: 14.451, lavfi.scd.time: 34.2676
lavfi.scd.score: 14.169, lavfi.scd.time: 34.8682
lavfi.scd.score: 14.198, lavfi.scd.time: 34.9015
lavfi.scd.score: 18.594, lavfi.scd.time: 45.5455
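
For downstream analysis, lines in this format can be turned into (score, time) pairs with a small parser. This is a minimal sketch assuming the log lines look exactly like the ones above; the regular expression and function name are our own, and the pattern is left unanchored so that any prefix ffmpeg places before the fields is ignored.

```python
import re

# Matches the two fields shown in the listing above, anywhere in a line.
SCD_LINE = re.compile(
    r"lavfi\.scd\.score:\s*([\d.]+),\s*lavfi\.scd\.time:\s*([\d.]+)"
)


def parse_scdet(log_text):
    """Return a list of (score, time_in_seconds) tuples found in log_text."""
    return [(float(score), float(time))
            for score, time in SCD_LINE.findall(log_text)]


# Three lines copied from the listing above.
sample = """\
lavfi.scd.score: 18.313, lavfi.scd.time: 5.5055
lavfi.scd.score: 46.487, lavfi.scd.time: 7.007
lavfi.scd.score: 81.113, lavfi.scd.time: 24.391
"""

events = parse_scdet(sample)
strongest = max(events)  # tuples compare by score first
print(events)
print(strongest)
```

In practice the text would come from ffmpeg's captured log output rather than from a pasted string.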

The final image can be seen below:

Both montages offer quite reasonable approximations of the key visual scene changes in the broadcast, and their tunable thresholds allow scene detection to be adjusted anywhere from triggering on every slight camera shift to triggering only on large-scale changes.
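
To get a feel for how the threshold trades sensitivity against selectivity without rerunning ffmpeg, the scores logged at a low threshold can be re-filtered at higher ones. The sketch below uses the first ten "scdet" scores listed earlier; the function name and the candidate thresholds are our own illustrative choices, and it assumes a detection counts when its score meets or exceeds the threshold.

```python
def surviving_cuts(scores, thresholds):
    """For each threshold, count the scores that meet or exceed it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# First ten scdet scores from the listing earlier in this post.
scores = [18.313, 46.487, 18.181, 14.712, 21.134,
          20.481, 19.946, 81.113, 41.554, 38.609]

# Candidate thresholds on scdet's 0-100 scale (illustrative choices).
for threshold, count in surviving_cuts(scores, [14, 20, 40, 80]).items():
    print(f"t={threshold:>2}: {count} scene changes")
```

A sweep like this over a full broadcast would show how quickly the detections thin out as the threshold rises, which can help in picking a value for a given use case.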