Visual Explorer: Master List Of ZIP Files For All 1.5 Million Broadcasts: Enabling At-Scale Non-Consumptive Visual Analysis Spanning 50 Countries Over 20 Years

The TV News Visual Explorer now encompasses selections from 98 channels spanning 50 countries and territories in 35 languages and dialects over 20 years on 5 continents. In all, 1.5 million broadcasts are now available in the Visual Explorer, with more coming online continually as we process the Internet Archive's Television News Archive historical backfile and its contemporary channels. Today we are releasing a master inventory file of all 1.5 million preview ZIP files to enable at-scale non-consumptive analysis.

The Visual Explorer makes each broadcast "skimmable" by extracting one frame at a fixed interval of every 4 seconds to represent the broadcast. These images are arrayed into a thumbnail grid in the Visual Explorer web interface. To enable at-scale non-consumptive visual analysis, a ZIP file is also available for each broadcast, containing the full-resolution versions of the images that make up the thumbnail grid.
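Because frames are sampled at a fixed 4-second interval, the position of an image in the sequence maps directly to an approximate time offset within the broadcast. The following is a minimal sketch of that arithmetic. It assumes frames are numbered sequentially starting at 1 and that the first frame corresponds to the start of the broadcast; neither assumption is confirmed here, so check it against an actual ZIP file. The function names are hypothetical.

```python
FRAME_INTERVAL_SECONDS = 4  # one preview frame every 4 seconds


def frame_offset_seconds(frame_number: int, first_frame_number: int = 1) -> int:
    """Approximate offset, in seconds, of a preview frame from the broadcast start.

    ASSUMPTION: frames are numbered sequentially and the first frame sits at
    offset 0. Adjust first_frame_number if the real numbering differs.
    """
    if frame_number < first_frame_number:
        raise ValueError("frame_number precedes the first frame")
    return (frame_number - first_frame_number) * FRAME_INTERVAL_SECONDS


def format_offset(seconds: int) -> str:
    """Render a second count as HH:MM:SS."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Under these assumptions, a 30-minute item yields about 450 frames, and frame 451 would fall at the 30-minute mark.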

You can download these ZIP files and analyze them through any off-the-shelf image analysis tool. Earlier this month we demonstrated running the ZIP file for a Russian television news broadcast through Google's Cloud Vision API and using the annotations to identify all of the clips from Fox News that were shown during the broadcast to examine how Russian state media is using Fox News coverage to advance its narratives about the invasion.

What if you want to scale up such an analysis, to look at all broadcasts from a given channel during a set of days? For some channels we have EPG program data that includes the name of each show, meaning you could filter to look just at all Tucker Carlson broadcasts, for example.

To help you with this, we've compiled a master inventory of the downloadable preview image ZIP files for all 1.5 million broadcasts as of yesterday:

VISUALEXPLORER-IDLIST-20220929.TXT

You can download this file and filter by channel, date or show name (for channels that provide it) to compile a list of the matching ZIP files to download, making it trivial to curate collections to answer specific research questions.
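The filtering step might look something like the sketch below. The layout of the inventory file is not described here, so the sketch assumes each line contains an identifier of the form CHANNEL_YYYYMMDD_HHMMSS, optionally followed by an underscore-separated show name, either bare or embedded in a ZIP file URL. That pattern, and all of the names below (`ID_PATTERN`, `Broadcast`, `parse_line`, `filter_broadcasts`), are illustrative assumptions to verify against the real file.

```python
import re
from datetime import date, datetime
from typing import Iterable, Iterator, NamedTuple, Optional

# ASSUMPTION: identifiers look like CHANNEL_YYYYMMDD_HHMMSS[_Show_Name].
# The show name is read up to the next '/', '.', or whitespace.
ID_PATTERN = re.compile(r"([A-Za-z0-9]+)_(\d{8})_(\d{6})(?:_([^/\s.]+))?")


class Broadcast(NamedTuple):
    line: str        # the original inventory line, stripped
    channel: str
    start: datetime  # timestamp parsed from the identifier
    show: str        # empty when the identifier carries no show name


def parse_line(line: str) -> Optional[Broadcast]:
    """Parse one inventory line, returning None if it does not match."""
    match = ID_PATTERN.search(line)
    if match is None:
        return None
    channel, ymd, hms, show = match.groups()
    try:
        start = datetime.strptime(ymd + hms, "%Y%m%d%H%M%S")
    except ValueError:
        return None  # digits that do not form a valid timestamp
    return Broadcast(line.strip(), channel, start, (show or "").replace("_", " "))


def filter_broadcasts(
    lines: Iterable[str],
    channel: Optional[str] = None,
    first_day: Optional[date] = None,
    last_day: Optional[date] = None,
    show_contains: Optional[str] = None,
) -> Iterator[Broadcast]:
    """Yield broadcasts matching every filter that was supplied."""
    for line in lines:
        item = parse_line(line)
        if item is None:
            continue
        if channel is not None and item.channel != channel:
            continue
        if first_day is not None and item.start.date() < first_day:
            continue
        if last_day is not None and item.start.date() > last_day:
            continue
        if show_contains is not None and show_contains.lower() not in item.show.lower():
            continue
        yield item
```

An open file handle on the inventory can be passed directly as `lines`, and the `line` field of each result gives the list of matching entries to download. Filtering by `show_contains` only works for channels and periods where the show name is actually present.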

Here are some tips for working with the collection at scale:

EPG Data. Note that only some channels have the EPG data that allows us to split them by show and include the show name in the file name. Others are split monotonically into a new item every 30 minutes on a fixed interval. EPG data can contain periodic errors (such as a show being preempted), and EPG outages can cause a channel to revert to monotonic splitting for a period of time, so always spot check the results when filtering by show name.
Image Dimensions. Typical neural computer vision algorithms resize input images to a fixed lower resolution representation for recognition. When resizing preview images for input, do not hardcode any assumptions about their pixel dimensions. Preview images are captured raw from the underlying MPEG2 or MP4 stream and reflect the actual raw pixel dimensions of the original source video. These will typically fall into standard SD/HD/FHD aspect ratios, but this is not guaranteed. Some channels have nonstandard dimensions. For example, BBC News broadcasts are converted from their native PAL resolution, yielding nonstandard resolutions such as 1024×576 pixels. Critically, the same channel can vary in resolution over the course of a day, with different shows broadcasting at different resolutions. For channels that are split monotonically this can yield unusual results, with aspect ratios changing mid-item and black bars added (though the pixel resolution will be fixed for all images within a given item). Robust processing pipelines should therefore determine image dimensions per item rather than assume them.
Changing Aspect Ratios. In some countries it is not uncommon for the aspect ratio of a broadcast to change mid-broadcast for inserted footage. For example, a studio broadcast might air in native 720p, while archival or contributed footage might have a different aspect ratio. Most US-based television channels will dynamically resize such footage to enforce a consistent fixed aspect ratio using approaches such as fixed or dynamic zoom, pan and scan, etc. Such techniques were used heavily during the pandemic to blend Zoom and other video conferencing feeds into studio feeds. Not all countries use this approach or use it all the time, meaning broadcasts can suddenly have content with different aspect ratios appearing with black bars. Robust pipelines may wish to detect "black bar" sequences to determine whether the aspect ratio has changed mid-broadcast and adjust accordingly.
Comparing Across Countries, Channels And Time. Many research questions revolve around comparative analysis, such as contrasting how two channels covered a major event. It is important to recognize that the technical aspects of television broadcasting can impact these results and must be accounted for either through normalization or other mitigation procedures. At a macro level, differences can include NTSC vs PAL, different frame rates, colorspaces and resolutions. Across the world over the past 20 years, channels have been sourced via terrestrial cable, satellite and online streaming over myriad platforms, each of which enforces its own unique technical standards. The Archive's holdings span 20 years, with some channels transitioning from SD to HD and then potentially to FHD over this time, adding complexity to comparisons even within the same channel over time. A channel might be collected via cable, then transition to satellite or online collection. Some neural vision models may struggle to compare SD content against FHD content accurately and to yield resolution-invariant results. Differing compression algorithms may introduce artifacting a model is less resistant to, such as a channel that transitions from MPEG2 to highly compressed online streaming, with an attendant increase in compression artifacts that damage fine detail. Critically, the transition from incandescent to fluorescent to LED lighting, changing camera technology and changing studio lighting standards can require careful consideration and mitigation strategies. For example, PBS news footage from a decade ago is starkly more muted and darker, more harshly lit and more strongly dominated by the unique color profile of fluorescent lighting than modern PBS footage. This makes the two more difficult to compare and may yield suboptimal results with neural models trained on modern imagery.
Mitigation strategies depend on the underlying neural model. They range from white balancing to a fixed neutral, to color expansion, to measuring how well the application imagery fits the model's training imagery and thresholding out imagery that is too dissimilar. When performing comparative analysis, we strongly recommend spot checking a random set of broadcasts evenly spaced across the comparison period to determine whether any of these issues appear to impact the results.
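A few of the tips above can be sketched in code. The block below is an illustration under stated assumptions, not a reference implementation: `fit_within` computes a resize target from whatever dimensions an image actually has, `content_bounds` finds the region inside black bars using a simple mean-brightness threshold on a grayscale frame held as a list of rows, and `spot_check_sample` draws a random sample spread evenly across a time period. All three names, the threshold of 16, and the plain-list image representation are hypothetical choices made for this example. A real pipeline would decode the images with an imaging library and would likely need a more robust bar detector, since dark scenes can resemble bars.

```python
import random
from datetime import datetime
from typing import List, Sequence, Tuple


def fit_within(width: int, height: int, target: int) -> Tuple[int, int, int, int]:
    """Scale (width, height) to fit a target x target square, keeping aspect ratio.

    Returns (new_width, new_height, pad_left, pad_top) for centered padding.
    Nothing about the source dimensions is assumed in advance.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, (target - new_w) // 2, (target - new_h) // 2


def content_bounds(gray: Sequence[Sequence[int]], threshold: int = 16) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the area inside any black bars.

    gray is a grayscale frame as rows of 0-255 values; right and bottom are
    exclusive. A row or column counts as a bar when its mean brightness is
    at or below threshold (an assumed value that needs tuning).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    rows = [y for y in range(height) if sum(gray[y]) / width > threshold]
    cols = [x for x in range(width) if sum(row[x] for row in gray) / height > threshold]
    if not rows or not cols:
        return (0, 0, width, height)  # entirely dark frame: leave untouched
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def spot_check_sample(
    items: Sequence[Tuple[datetime, str]], bins: int, per_bin: int, seed: int = 0
) -> List[str]:
    """Pick up to per_bin random identifiers from each of bins equal time slices."""
    if not items:
        return []
    rng = random.Random(seed)  # seeded so the sample is reproducible
    start = min(t for t, _ in items)
    end = max(t for t, _ in items)
    span = (end - start).total_seconds() or 1.0
    buckets: List[List[str]] = [[] for _ in range(bins)]
    for t, ident in items:
        index = min(bins - 1, int((t - start).total_seconds() / span * bins))
        buckets[index].append(ident)
    picked: List[str] = []
    for bucket in buckets:
        picked.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return picked
```

Comparing the output of `content_bounds` across the frames of a single item is one way to notice that the aspect ratio changed partway through, since the frame dimensions themselves stay fixed within an item.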

We are tremendously excited to see the kind of research that this immense and incredibly unique new collection enables!

The GDELT Project
