The first step in breaking a television news broadcast down into "stories" is to segment it into distinct visual "scenes". A scene change can represent a switch between clips, a handoff from the studio to a field reporter or simply a different camera angle. While a newsreader might discuss multiple stories in a row, modern television news production typically marks the transition between stories with a shift to a field reporter or guest, a cut to prerecorded footage or at least a change of camera angle in the studio. Scenes therefore give us a critical set of building blocks for story segmentation. Moreover, even within a single story, the rate at which imagery changes on television news can offer powerful cues to visual framing, while collapsing a broadcast to one frame per scene enables rapid visual summaries.
The widely used open source "ffmpeg" utility offers two different methods for breaking videos into scenes and can even create a mosaic of the first frame of each scene, all in a single command! (Note that Debian users might need to compile a modern ffmpeg from source or use a precompiled binary.)
Let's see how the two approaches fare on the Dec. 17, 2020 3AM PST episode of CNN's New Day With Alisyn Camerota and John Berman.
The simplest approach is to use the "select" filter's built-in "scene" variable, which scores how much each frame differs from the previous one on a scale from 0 to 1 and lets you set the threshold at which a scene division occurs:
time ./ffmpeg -i ./VIDEO.mp4 -vf "select=gt(scene\,0.4),scale=160:-1,tile=6x80" -frames:v 1 -qscale:v 3 preview.jpg
The "select" filter keeps only the frames whose scene score exceeds the threshold (0.4 here), "scale" resizes each kept frame to a 160-pixel-wide thumbnail and "tile" sets the number of columns and rows (6 by 80 here) to use in the output mosaic image. The final output is the following:
Perhaps most readily apparent from the sequence above is that the majority of the scene changes occur during commercials. Look closely and actual scenes from the news portion of the broadcast are sparse, because an entire five-minute segment in the studio is reduced to a single frame if the camera stays locked on the anchor without moving. An entire story might be made up of only a small number of shots. In contrast, advertisements tend to be extremely fast-paced, filled with rapid-fire cuts that can total five or more entirely different scenes in a single second of airtime. This suggests that this kind of scene detection could be a powerful tool for comparing the visual pacing of news and advertising content on television news.
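One rough way to put a number on that difference in pacing is to count scene changes per minute of airtime. The sketch below is only an illustration: the function name, the timestamps and the cutoff of five cuts per minute are all made up for the example and would need tuning against real broadcasts.

```python
from collections import Counter

def cuts_per_minute(cut_times):
    """Count scene changes in each one-minute window of the broadcast."""
    counts = Counter(int(t // 60) for t in cut_times)
    last_minute = int(max(cut_times) // 60) if cut_times else -1
    return [counts.get(m, 0) for m in range(last_minute + 1)]

# Hypothetical scene-change timestamps in seconds (not taken from the broadcast).
cuts = [5.5, 7.0, 9.2, 10.5, 12.5, 14.8, 24.2, 25.4, 30.4, 45.5,
        130.0, 185.0, 186.1, 187.0, 188.4, 189.9]

rates = cuts_per_minute(cuts)
print(rates)  # [10, 0, 1, 5]

# Arbitrary illustrative cutoff: minutes with 5+ cuts are flagged as fast-paced.
busy = [minute for minute, n in enumerate(rates) if n >= 5]
print(busy)   # [0, 3]
```

Minutes with many cuts are candidates for advertising blocks, while long runs of zeros and ones look more like studio segments, though a fixed cutoff like this is a heuristic rather than a classifier.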
Alternatively, ffmpeg also offers the "scdet" filter, which can additionally log rich information on each scene transition to the console (ffmpeg writes its log to standard error) for consumption by downstream analytic tools:
time ./ffmpeg -f lavfi -i "movie=VIDEO.mp4,scdet=s=1:t=14" -vf "scale=160:-1,tile=6x85" -frames:v 1 -qscale:v 3 preview.jpg
Here we set parameter "s" to 1 to tell scdet to pass only the scene change frames downstream, while we set the "t" threshold to 14 (scdet scores run from 0 to 100, unlike the 0 to 1 range of the "scene" variable). This outputs a list of scene change detections at the same time it compiles the collage. The first few rows can be seen below, with the score and timecode of each change:
lavfi.scd.score: 18.313, lavfi.scd.time: 5.5055
lavfi.scd.score: 46.487, lavfi.scd.time: 7.007
lavfi.scd.score: 18.181, lavfi.scd.time: 9.24257
lavfi.scd.score: 14.712, lavfi.scd.time: 10.5439
lavfi.scd.score: 21.134, lavfi.scd.time: 12.5459
lavfi.scd.score: 20.481, lavfi.scd.time: 14.8482
lavfi.scd.score: 19.946, lavfi.scd.time: 24.2409
lavfi.scd.score: 81.113, lavfi.scd.time: 24.391
lavfi.scd.score: 41.554, lavfi.scd.time: 25.4254
lavfi.scd.score: 38.609, lavfi.scd.time: 30.397
lavfi.scd.score: 15.266, lavfi.scd.time: 34.034
lavfi.scd.score: 15.522, lavfi.scd.time: 34.0674
lavfi.scd.score: 14.180, lavfi.scd.time: 34.1007
lavfi.scd.score: 14.727, lavfi.scd.time: 34.1341
lavfi.scd.score: 15.993, lavfi.scd.time: 34.1675
lavfi.scd.score: 16.839, lavfi.scd.time: 34.2008
lavfi.scd.score: 16.111, lavfi.scd.time: 34.2342
lavfi.scd.score: 14.451, lavfi.scd.time: 34.2676
lavfi.scd.score: 14.169, lavfi.scd.time: 34.8682
lavfi.scd.score: 14.198, lavfi.scd.time: 34.9015
lavfi.scd.score: 18.594, lavfi.scd.time: 45.5455
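For downstream analysis, these lines can be turned into (score, time) pairs with a few lines of Python. This is a minimal sketch that assumes the log has already been captured as text. The helper name parse_scdet is our own, and the pattern searches within each line, so it should tolerate whatever prefix ffmpeg adds to log lines.

```python
import re

# Matches the score/time pair anywhere in a log line.
SCD_LINE = re.compile(
    r"lavfi\.scd\.score:\s*([0-9.]+),\s*lavfi\.scd\.time:\s*([0-9.]+)"
)

def parse_scdet(log_text):
    """Return a list of (score, time_in_seconds) tuples found in the log text."""
    return [(float(score), float(time)) for score, time in SCD_LINE.findall(log_text)]

# The first three detections from the listing above.
sample = """\
lavfi.scd.score: 18.313, lavfi.scd.time: 5.5055
lavfi.scd.score: 46.487, lavfi.scd.time: 7.007
lavfi.scd.score: 18.181, lavfi.scd.time: 9.24257
"""

detections = parse_scdet(sample)
print(detections)  # [(18.313, 5.5055), (46.487, 7.007), (18.181, 9.24257)]
```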
The final image can be seen below:
Both montages offer quite reasonable approximations of the key visual scene changes in the broadcast, and their tunable parameters allow the scene change threshold to be adjusted anywhere from triggering on every minute camera shift to triggering only on large-scale changes.
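Because scdet reports a score for every detection, the effect of a stricter threshold can also be explored after the fact without rerunning ffmpeg. The sketch below uses the detections between 24 and 46 seconds from the scdet output above. The two helper functions and the half-second gap are our own illustrative assumptions, not part of ffmpeg.

```python
def keep_above(detections, threshold):
    """Keep only detections whose score meets a stricter threshold."""
    return [(s, t) for s, t in detections if s >= threshold]

def collapse_bursts(detections, min_gap=0.5):
    """Merge detections less than min_gap seconds apart, keeping the top score of each burst."""
    bursts = []  # each entry: [best_score, time_of_best, time_of_last_member]
    for score, time in sorted(detections, key=lambda d: d[1]):
        if bursts and time - bursts[-1][2] < min_gap:
            burst = bursts[-1]
            if score > burst[0]:
                burst[0], burst[1] = score, time
            burst[2] = time
        else:
            bursts.append([score, time, time])
    return [(s, t) for s, t, _ in bursts]

# (score, time) pairs copied from the scdet output shown earlier.
detections = [
    (19.946, 24.2409), (81.113, 24.391), (41.554, 25.4254), (38.609, 30.397),
    (15.266, 34.034), (15.522, 34.0674), (14.180, 34.1007), (14.727, 34.1341),
    (15.993, 34.1675), (16.839, 34.2008), (16.111, 34.2342), (14.451, 34.2676),
    (14.169, 34.8682), (14.198, 34.9015), (18.594, 45.5455),
]

print(len(keep_above(detections, 20)))   # 3
print(len(collapse_bursts(detections)))  # 6
```

Raising the threshold to 20 keeps only the three strongest cuts in this stretch, while collapsing bursts reduces the run of near-consecutive detections around 34 seconds, which looks more like one gradual transition than eight separate scenes, to a single entry.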