We are tremendously excited to announce today the debut of the new Advertising Inventory Files (AIF) dataset for television news covering more than 1.1 billion seconds of airtime 2009-present using data from the Internet Archive's Television News Archive! This dataset inventories the airtime of each broadcast second by second, cataloging whether it was "News" programming, "Advertising" programming or "Uncaptioned" (in the case of television news this is almost exclusively advertising).
For many kinds of research into television news coverage, including advertising airtime skews results and can distort findings, because advertising campaigns are bursty and may be timed to major societal events. Many advertisements are not captioned, and those that are captioned are typically tangential enough to most queries, or air infrequently enough, to avoid impacting results. For other kinds of queries, especially those looking at micro-scale language shifts or investigating topics like medical discourse (an especially important topic in the midst of a global pandemic) that can be heavily impacted by commercials, it is critical to be able to completely remove all advertising-related captioning from each broadcast. Using this new dataset it is now trivial to precisely determine second by second what is advertising and what is news programming.
Closed captioning tends to lag the actual broadcast by several seconds, up to 5-10 seconds in some cases, depending on the speed of the human transcriptionist and the broadcast encoding system used. Since we are using hidden codes embedded in the captioning channel to determine whether a given second of airtime is an advertisement, the results of this initial AIF dataset are in "captioning time" rather than "video time." Analyses wishing to merge this dataset with the Visual Global Entity Graph (VGEG) 2.0 should anticipate a lag of up to 10 seconds and may wish to treat the 10 seconds preceding each commercial break as likely advertising time as well, depending on their analytic needs. In the coming weeks we will be releasing a second AIF dataset, realigned into "video time" for use with datasets like VGEG 2.0, so stay tuned.
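As a rough illustration of the padding suggested above, the following Python sketch marks the seconds leading up to each commercial break as likely advertising. It assumes the per-second type labels of a single broadcast have already been loaded into a list in airtime order; the function name likely_ad_mask and its lag_seconds parameter are hypothetical and not part of the dataset.

```python
def likely_ad_mask(types, lag_seconds=10):
    """Return one boolean per second: True if that second should be
    treated as advertising once the captioning lag is padded in."""
    # Assumption: anything that is not labeled NEWS counts as advertising.
    is_ad = [t != "NEWS" for t in types]
    padded = list(is_ad)
    for i in range(1, len(is_ad)):
        # A commercial break starts where an ad second follows a news second.
        if is_ad[i] and not is_ad[i - 1]:
            # Also flag up to `lag_seconds` seconds just before the break.
            for j in range(max(0, i - lag_seconds), i):
                padded[j] = True
    return padded


# 15 seconds of news, a 5 second break, then 5 more seconds of news.
types = ["NEWS"] * 15 + ["UNCAPTIONED"] * 2 + ["ADVERTISING"] * 3 + ["NEWS"] * 5
mask = likely_ad_mask(types)
```

Whether a full 10 seconds of padding is appropriate depends on the analysis; a smaller lag_seconds keeps more news airtime at the cost of some advertising leakage.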
How We Detect Commercials
Given the long legacy of "broadcast commercial detection" research and the high error rates of most algorithms, which can reach 5-10% or more, what approach are we using and what is the anticipated accuracy?
On the American television news programming that comprises the majority of the Internet Archive's searchable Television News Archive, the closed captioning channel includes a special field called the "Caption Mode" that specifies how to display the captioning to the viewer. The three most important values for broadcast television news in the US are "Roll-Up", "Pop-On" and "Paint-On." The latter two relate to prerecorded captioning provided to the broadcaster already embedded in the video file ready for broadcast. In the case of US news channels, these modes are used exclusively for captioned commercials, which is why captioning in commercials appears all at once in blocks. The "Roll-Up" code is used only for live-captioned content that is transcribed by a human or ASR system as the broadcast airs, and on these channels it is used only for news programming. This captioning type is the distinctive "scrolling" captioning in which each new line of captioning appears at the bottom of the captioning block and "scrolls" the preceding lines upwards with the oldest line disappearing. Importantly, prerecorded news footage, such as clips of earlier broadcasts that are reaired, still uses the "Roll-Up" captioning code even if its captioning is reused from the earlier broadcast.
Thus, identifying advertisements in a broadcast requires simply scanning the original raw MPEG2 capture file to identify the Caption Mode field for each second of airtime and flagging all airtime that carries a Pop-On or Paint-On code as advertising. In extensive manual spot checks of broadcasts from across the entire 2009-2020 archive, this approach appears to have 100% accuracy, though there will undoubtedly be some sporadic errors.
For those interested in the technical specifics, we use the open source "ccextractor" utility's "TTXT" output format:
cat VIDEOFILE | mbuffer -q -m 10M | ccextractor -out=ttxt -bi - -o ./OUTPUTTRANSCRIPT.TTXT
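The decision rule described in this section can be sketched in a few lines of Python. This is only an illustration of the rule, not the production pipeline: it assumes the Caption Mode for each second has already been extracted into a string (or None for a second with no captioning), and it does not show how to parse the TTXT output. The function name classify_second is hypothetical.

```python
# Caption modes that, on US news channels, mark prerecorded (commercial) captioning.
AD_MODES = {"POP-ON", "PAINT-ON"}
# Caption mode used for live-captioned news programming.
NEWS_MODE = "ROLL-UP"


def classify_second(caption_mode):
    """Map one second's Caption Mode to an airtime type label."""
    if caption_mode is None:
        # Assumption: None represents a second with no captioning at all.
        return "UNCAPTIONED"
    mode = caption_mode.strip().upper()
    if mode in AD_MODES:
        return "ADVERTISING"
    if mode == NEWS_MODE:
        return "NEWS"
    # Any other mode is outside the rule described here, so fail loudly.
    raise ValueError(f"unexpected caption mode: {caption_mode!r}")


labels = [classify_second(m) for m in ["Roll-Up", "Roll-Up", None, "Pop-On", "Paint-On"]]
```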
Coverage
Due to the enormous disk and network bandwidth requirements of scanning the original MPEG2 capture files which run around 1.7-3GB per broadcast, at this time we are computing AIF files only for the following channels back to their starting points in the Television News Archive:
- ABC Evening News + 24/7 KGO (San Francisco ABC Affiliate)
- CBS Evening News + 24/7 KPIX (San Francisco CBS Affiliate)
- NBC Evening News + 24/7 KNTV (San Francisco NBC Affiliate)
- CNN
- MSNBC
- Fox News
- KTVU (San Francisco Fox Affiliate)
- Telemundo (KSTS)
- Univision (KDTV)
- Bloomberg
- CNBC
- Fox Business
- KRON 4 (MyNetworkTV)
- KQED (San Francisco PBS Affiliate) (This station has "sponsorship" messages at the beginning and end of many broadcasts).
NOTE: Coverage runs through December 31st, 2020 and will be extended through present in the coming days.
These stations do not ordinarily carry advertising content and thus are not part of this collection because they do not require advertising inventory files:
- Al Jazeera
- BBC News London
- CSPAN1/CSPAN2/CSPAN3
- DeutscheWelle (DW)
- LinkTV
- Russia Today (RT)
- San Francisco Government Television
File Format
Each broadcast has its own tab-delimited AIF file with one row per second of airtime in the following format:
- date. The date/time of the given second of airtime in UTC, in JSON datetime format (for example, "2012-07-03T00:45:02Z").
- showOffset. The offset in seconds from the start of the broadcast.
- iaShowId. The unique identifier assigned by the Television News Archive to that broadcast.
- station. The channel the broadcast aired on such as "CNN".
- showName. The name of the specific broadcast such as "ABC World News With Diane Sawyer" (some earlier broadcasts may lack this information).
- type. The "type" of this second of airtime. At this time it is either "NEWS", "ADVERTISING" (written "ADVERTISEMENT" in the sample rows below) or "UNCAPTIONED". In almost all cases "UNCAPTIONED" can be treated as advertising, but we distinguish the two for analyses that wish to compare captioned and uncaptioned advertising airtime.
The table below shows a sample extract from the file "KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer" (view the broadcast on the Archive's website), showing the transition from the news programming to a commercial break, with a few seconds of uncaptioned commercial airtime at the start.
2012-07-03T00:45:02Z	902	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	NEWS
2012-07-03T00:45:03Z	903	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	NEWS
2012-07-03T00:45:04Z	904	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	NEWS
2012-07-03T00:45:05Z	905	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	NEWS
2012-07-03T00:45:06Z	906	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	UNCAPTIONED
2012-07-03T00:45:07Z	907	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	UNCAPTIONED
2012-07-03T00:45:08Z	908	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	UNCAPTIONED
2012-07-03T00:45:09Z	909	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	UNCAPTIONED
2012-07-03T00:45:10Z	910	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	UNCAPTIONED
2012-07-03T00:45:11Z	911	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
2012-07-03T00:45:12Z	912	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
2012-07-03T00:45:13Z	913	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
2012-07-03T00:45:14Z	914	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
2012-07-03T00:45:15Z	915	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
2012-07-03T00:45:16Z	916	KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer	KGO	ABC World News With Diane Sawyer	ADVERTISEMENT
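For readers who want to load these rows programmatically, here is a minimal Python sketch of a parser. It assumes the files have no header row and that the six columns appear in the order listed above; the function name parse_aif is hypothetical, and the three sample rows are taken from the extract above.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta, timezone

# Column order as listed in the file format description (assumed, no header row).
FIELDS = ["date", "showOffset", "iaShowId", "station", "showName", "type"]


def parse_aif(text):
    """Parse tab-delimited AIF text into a list of dicts, one per second."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, record))
        row["showOffset"] = int(row["showOffset"])
        # Timestamps look like 2012-07-03T00:45:02Z and are in UTC.
        row["date"] = datetime.strptime(row["date"], "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        rows.append(row)
    return rows


SHOW = "KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer"
NAME = "ABC World News With Diane Sawyer"
sample = "\n".join([
    f"2012-07-03T00:45:05Z\t905\t{SHOW}\tKGO\t{NAME}\tNEWS",
    f"2012-07-03T00:45:06Z\t906\t{SHOW}\tKGO\t{NAME}\tUNCAPTIONED",
    f"2012-07-03T00:45:11Z\t911\t{SHOW}\tKGO\t{NAME}\tADVERTISEMENT",
])

rows = parse_aif(sample)
# Seconds of airtime per type label.
counts = Counter(row["type"] for row in rows)
# Subtracting the offset from the timestamp recovers the broadcast start time.
show_start = rows[0]["date"] - timedelta(seconds=rows[0]["showOffset"])
```

In this sample, show_start works out to 00:30:00 UTC on July 3, 2012, which appears to line up with the "20120703_003000" portion of the show identifier.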
To manually review a given second of airtime, combine the iaShowId and showOffset into the following URL:
- https://archive.org/details/[iaShowId]/start/[showOffset]
So for example to examine 846 seconds into the ABC World News with Diane Sawyer episode "KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer" you would use this URL:
- https://archive.org/details/KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer/start/846
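When reviewing many rows, the same URL pattern can be generated in code. The helper below is a small Python sketch; the function name review_url is hypothetical, and the URL pattern is the one given above.

```python
from urllib.parse import quote


def review_url(ia_show_id, show_offset):
    """Build the archive.org URL that opens a broadcast at a given offset in seconds."""
    # quote() guards against unexpected characters in the identifier.
    return f"https://archive.org/details/{quote(ia_show_id)}/start/{int(show_offset)}"


url = review_url("KGO_20120703_003000_ABC_World_News_With_Diane_Sawyer", 846)
```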
Download The Data
To determine which shows are available, consult the daily inventory files. There is one per day from July 1, 2009 through present, each named by its date in YYYYMMDD format and listing all of the files for that day. For example, for June 1, 2020 the file would be:
- http://data.gdeltproject.org/gdeltv3/iatv/aif/20200601.captime.txt
The inventory files for the most recent 72 hours are updated throughout the day as shows complete processing.
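To walk a range of days, the inventory URLs can be generated from dates. The Python sketch below only builds the URLs, following the example above; it does not download anything and makes no assumption about the layout of the inventory files themselves. The function names inventory_url and inventory_urls are hypothetical.

```python
from datetime import date, timedelta

# Base path taken from the example inventory URL above.
BASE_URL = "http://data.gdeltproject.org/gdeltv3/iatv/aif/"


def inventory_url(day):
    """Return the daily inventory URL for a datetime.date."""
    return f"{BASE_URL}{day:%Y%m%d}.captime.txt"


def inventory_urls(start, end):
    """Return the inventory URLs for every day from start to end, inclusive."""
    n_days = (end - start).days
    return [inventory_url(start + timedelta(days=i)) for i in range(n_days + 1)]


urls = inventory_urls(date(2020, 5, 31), date(2020, 6, 2))
```

Since the most recent 72 hours of inventory files are still being updated, a downloader built on this would likely want to re-fetch the last three days on each run.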