Announcing The Television News Advertising Inventory Files (AIF) Video Time ASR-Aligned Dataset

We're tremendously excited to announce today the debut of the new "video time" Advertising Inventory Files (AIF) dataset for television news. It uses the automatic speech recognition (ASR) of Google's Cloud Video API to precisely align the underlying closed captioning transcript and overlay the program's captioning mode information onto the airtime, flagging each and every second of the broadcast as news, advertising or uncaptioned. The previously-released "captioning time" AIF files use the timecodes embedded in the closed captioning, which typically lag the actual broadcast by a few seconds and can lag by as much as 5-10 seconds. This makes it difficult to use them as precision masks to exclude advertisements from analyses of the Visual Global Entity Graph 2.0, since the captioning boundaries do not align with the actual video footage itself. By using the ASR data to align the captioning transcripts we are able to generate a precisely aligned masking file that identifies the context of each second of airtime.

How We Align Closed Captioning Transcripts

Why do we need to align the closed captioning instead of just using the ASR that is already precisely aligned with the video? The reason is that only the captioning data contains a special field, embedded in each line, that tells whether that line of captioning is an advertisement or news programming. Over the past decade across all of the channels the Internet Archive monitors, this field appears to be 100% accurate in differentiating news from advertising. Thus, in order to overlay this field onto the actual broadcast itself we need to align the captioning so that the captioned words appear precisely when they are spoken, which lets us copy this information over.

Why is aligning closed captioning so difficult? In the US, television news closed captioning is typically generated by human transcriptionists watching the live broadcasts and typing what they hear in realtime. The lag time in their transcription, processing by the captioning company and transmission back to the broadcaster and out to television sets means there is considerable latency. Thus, captioning always lags the actual spoken audio by several seconds, but this delay is non-constant and varies continually throughout the broadcast.

Various solutions exist to align the captioning. Forced aligners use phonetic models and simplistic speech recognition to nudge the timecode of each captioned word forward, though they can struggle with complex words, accents, unusual cadence or prosody and substantial misalignment of spoken audio and captioned transcript. In our case, we use Google Cloud Video's ASR to generate a state-of-the-art machine transcription of the spoken words that is remarkably robust against these issues. We then use "diff" set to minimal changes mode to align the machine and human transcripts.
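To make the diff-based alignment idea concrete, here is a minimal sketch in Python. It is not the production pipeline: it swaps in Python's difflib.SequenceMatcher as a stand-in for the "diff" tool, and the function name, the input shapes and the sample words and timestamps are all hypothetical. It simply copies the ASR timestamp onto every captioned word that the two transcripts have in common.

```python
from difflib import SequenceMatcher


def align_captions_to_asr(caption_words, asr_words):
    """Map caption word indexes to ASR timestamps for words both transcripts share.

    caption_words: list of str, the human captioning in order (hypothetical input shape)
    asr_words: list of (word, start_seconds) tuples from the ASR transcript
    Returns {caption_index: start_seconds} for matched words only.
    """
    cap = [w.lower() for w in caption_words]
    asr = [w.lower() for w, _ in asr_words]
    # difflib stands in for "diff" here; autojunk=False keeps frequent words eligible.
    matcher = SequenceMatcher(a=cap, b=asr, autojunk=False)
    aligned = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned[block.a + k] = asr_words[block.b + k][1]
    return aligned


# Illustrative data only: the captioning condenses what the ASR heard.
captions = ["i", "believe", "this", "is", "important"]
asr = [("i", 10.0), ("really", 10.3), ("do", 10.5), ("believe", 11.2),
       ("that", 11.6), ("this", 11.9), ("is", 12.1), ("important", 12.4)]
aligned = align_captions_to_asr(captions, asr)
```

Words that appear only in the ASR ("really", "do", "that") are simply skipped, while captioned words with no ASR counterpart would be absent from the returned mapping.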

Regardless of how the human captioning is aligned, the real challenge lies in merging the captioning mode information. You see, far from being a gold standard transcript of a broadcast, the human captioning of a typical broadcast is filled with an astounding array of errors: typographical mistakes, the routine exclusion of titles and affiliations of speakers, missing words and passages, gaps of all lengths scattered throughout the broadcast, innumerable paraphrasings and rephrasings, and whole sections missing. Transcriptionists routinely condense complex or fast-paced remarks down to simpler versions, skip over dialog and correct misstatements. In contrast, the machine ASR transcribes the broadcast exactly as it hears it. While it does not transcribe disfluencies like "um," it does transcribe the audio as literally spoken, meaning it will transcribe a passage as said like "i really do, really really do believe that…," whereas the human captioning might say "i believe" or skip the passage entirely.

Overall, around 20% of the airtime of each broadcast is uncaptioned, ranging from uncaptioned advertisements to transcriptionist error, meaning there is substantial missing content from the captioning.

In turn, the ASR transcript can differ from the captioning in its transliterations, in how it splits compound words and in the fact that it transcribes what was actually said even if the person misspoke.

This means that instead of aligning two nearly identical transcripts with a small number of differences, the alignment process must contend with two wildly different transcripts filled with differences throughout and large gaps of all lengths where the human captioning is silent while the ASR picks up the actual spoken word transcript.

A single line of captioning may have multiple words missing or incorrect, making it impossible to simply look for the first and last word of a line to copy its captioning mode data over. The fact that the first and last word(s) of a captioning line can be missing or incorrect (or the words are spoken multiple times in the ASR transcript interspersed with words not in the captioning) makes it nearly impossible to correctly identify where to set the start and stop points of each line to transfer over its captioning mode information to the ASR. To address this, we look at the duration field of the captioning to see how long the given line was instructed to be visible for and use that to project the line forward when we can't find its endpoint. This can lead to errors like in the case of this Rachel Maddow broadcast where if you turn "best" closed captioning on, you'll see how the final captioning line from the show bleeds into the first few seconds of the advertisement, resulting in our incorrectly flagging those few seconds as news content. These errors are relatively rare and should only result in minor bleedover.
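The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, its arguments and the idea of anchoring the projection at the earliest matched word are hypothetical, not the actual implementation.

```python
def project_line_span(matched_times, last_word_matched, duration):
    """Return (start, end) in video time for one captioning line, or None.

    matched_times: ASR timestamps (seconds) of the line's words found in the ASR
    last_word_matched: True if the line's final word was among those matches
    duration: seconds the captioning line was instructed to remain visible
    """
    if not matched_times:
        return None  # nothing in the ASR to anchor this line to
    start = min(matched_times)
    if last_word_matched:
        end = max(matched_times)
    else:
        # Endpoint not found: project the line forward by its display duration.
        end = start + duration
    return (start, end)


# Hypothetical values: the final word of the line was not found in the ASR.
span = project_line_span([100.0, 100.5], last_word_matched=False, duration=4.0)
```

Because the projected end point comes from the display duration rather than from the audio, it can overshoot into whatever follows, which is the bleedover case described above.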

Another difficulty lies in the fact that newsreaders often speak in a "television journalism" cadence that relies on long pauses to emphasize words. This leads to gaps of a few seconds throughout a typical transcript during news programming. We use a postprocessing pass to connect these back together.
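One way such a postprocessing pass might look is sketched below. The per-second label list, the use of None for unlabeled seconds and the 3-second threshold are all assumptions made for illustration; the text above only says the gaps are "a few seconds" long.

```python
def bridge_short_gaps(labels, max_gap=3):
    """Fill short unlabeled runs that sit between two seconds with the same label.

    labels: list with one entry per second of airtime, None where no label was found
    max_gap: longest run of None (in seconds) to bridge; an assumed threshold
    """
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1
            # out[i:j] is one unlabeled run; bridge it only if both neighbors agree.
            if i > 0 and j < n and out[i - 1] == out[j] and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = out[i - 1]
            i = j
        else:
            i += 1
    return out


bridged = bridge_short_gaps(["NEWS", None, None, "NEWS"])
untouched = bridge_short_gaps(["NEWS"] + [None] * 5 + ["NEWS"])
```

Runs longer than the threshold are left alone so that they can be handled by a separate rule.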

A more complex scenario that we see commonly is an uncaptioned gap of 15-60 seconds, typically around 15, 20, 30, 45 or 60 seconds, appearing right in the middle of a news segment. In many cases this is a solitary uncaptioned advertisement, but in other cases it is legitimate news programming such as this 41-second uncaptioned clip of tornado footage during a 2014 Lawrence O'Donnell segment. When encountering these, our rule of thumb is that if the ASR does not record any human speech during that gap we consider it to be uncaptioned news programming, whereas if there is substantial uncaptioned dialog we will flag it as uncaptioned. This isn't perfect and will result in the loss of some small segments of news programming, but works remarkably well in the general case.
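The rule of thumb above might be expressed roughly as follows. The function, its parameters and the word-count threshold are hypothetical: the text does not say how much speech counts as "substantial", so the cutoff is left as a parameter, and the timestamps in the example are made up.

```python
def classify_uncaptioned_gap(gap_start, gap_end, asr_word_times, max_spoken_words=0):
    """Label an uncaptioned gap that falls in the middle of a news segment.

    gap_start, gap_end: gap boundaries in seconds from the start of the broadcast
    asr_word_times: start times (seconds) of every word the ASR transcribed
    max_spoken_words: most ASR words allowed before the gap counts as dialog (assumed)
    """
    spoken = sum(1 for t in asr_word_times if gap_start <= t < gap_end)
    # No speech: treat as news programming. Spoken dialog: leave it flagged uncaptioned.
    return "NEWS" if spoken <= max_spoken_words else "UNCAPTIONED"


# Hypothetical timestamps: a silent 41-second gap, then a gap with dialog in it.
silent_gap = classify_uncaptioned_gap(600, 641, [598.2, 599.1, 641.5])
spoken_gap = classify_uncaptioned_gap(900, 930, [901.0, 902.4, 903.1, 910.8])
```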

Coverage

Given our use of the ASR transcript to align the captioning, we are only able to generate these files for news programming we analyze through Google's Cloud Video API. Thus, these files are created only for shows that are part of the Visual Global Entity Graph 2.0.

File Format

The file format is identical to that used for the "captioning time" AIF files. Each broadcast has its own tab-delimited AIF file with one row per second of airtime in the following format:

date. The date/time of the given second of airtime in JSON datetime format in UTC, such as "2014-04-29T04:18:30Z".
showOffset. The offset in seconds from the start of the broadcast.
iaShowId. The unique identifier assigned by the Television News Archive to that broadcast.
station. The channel the broadcast aired on such as "CNN".
showName. The name of the specific broadcast such as "ABC World News With Diane Sawyer" (some earlier broadcasts may lack this information).
type. The "type" of this second of airtime. At this time it is either "NEWS", "ADVERTISING" or "UNCAPTIONED". In almost all cases "UNCAPTIONED" can be treated as "ADVERTISING" but we distinguish the two for analyses that wish to compare captioned and uncaptioned advertising airtime.
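As a rough illustration of reading this format, the sketch below parses tab-delimited rows into named fields using Python's standard csv module. The AifRow class and parse_aif function are hypothetical helpers, not part of any released tooling, and the two sample rows are written out with explicit tabs.

```python
import csv
import io
from typing import NamedTuple


class AifRow(NamedTuple):
    date: str          # date/time of this second of airtime, in UTC
    show_offset: int   # seconds from the start of the broadcast
    ia_show_id: str    # Television News Archive identifier
    station: str
    show_name: str
    type: str          # "NEWS", "ADVERTISING" or "UNCAPTIONED"


def parse_aif(text):
    """Parse the decompressed text of an AIF file into a list of AifRow."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        date, offset, show_id, station, show_name, kind = fields
        rows.append(AifRow(date, int(offset), show_id, station, show_name, kind))
    return rows


show = "MSNBCW_20140429_040000_The_Rachel_Maddow_Show"
sample = (
    f"2014-04-29T04:18:39Z\t1119\t{show}\tMSNBC\tThe Rachel Maddow Show\tNEWS\n"
    f"2014-04-29T04:18:40Z\t1120\t{show}\tMSNBC\tThe Rachel Maddow Show\tUNCAPTIONED\n"
)
rows = parse_aif(sample)
news_offsets = [r.show_offset for r in rows if r.type == "NEWS"]
```

For a gzipped file on disk, the text could be obtained with something like gzip.open(path, "rt", encoding="utf-8").read() before being passed to parse_aif.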

The table below shows a sample extract from the file "MSNBCW_20140429_040000_The_Rachel_Maddow_Show.aif.vidtime.txt.gz" (view the broadcast on the Archive's website), showing the transition from the news programming to a commercial break, in which the first commercial is uncaptioned.

2014-04-29T04:18:30Z 1110 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:31Z 1111 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:32Z 1112 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:33Z 1113 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:34Z 1114 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:35Z 1115 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:36Z 1116 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:37Z 1117 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:38Z 1118 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:39Z 1119 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show NEWS
2014-04-29T04:18:40Z 1120 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:41Z 1121 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:42Z 1122 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:43Z 1123 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:44Z 1124 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:45Z 1125 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:46Z 1126 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:47Z 1127 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:48Z 1128 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:49Z 1129 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED
2014-04-29T04:18:50Z 1130 MSNBCW_20140429_040000_The_Rachel_Maddow_Show MSNBC The Rachel Maddow Show UNCAPTIONED

To manually review a given second of airtime, combine the iaShowId and showOffset into the following URL:

https://archive.org/details/[iaShowId]/start/[showOffset]

So for example to examine 1110 seconds into the MSNBC The Rachel Maddow Show episode "MSNBCW_20140429_040000_The_Rachel_Maddow_Show" you would use this URL:

https://archive.org/details/MSNBCW_20140429_040000_The_Rachel_Maddow_Show/start/1110
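The same URL can be built programmatically. The helper below is a small sketch; the function name is hypothetical and it does nothing more than fill in the template above.

```python
def review_url(ia_show_id, show_offset):
    """Build the Archive playback URL for one second of a broadcast."""
    return f"https://archive.org/details/{ia_show_id}/start/{int(show_offset)}"


url = review_url("MSNBCW_20140429_040000_The_Rachel_Maddow_Show", 1110)
```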

Download The Data

The data will be available for download later this week.

To determine which shows are available, a daily inventory file, named by date in the format YYYYMMDD, will be available for each day from July 1, 2009 through the present and will list all of the files for that day. For example, for June 1, 2020 the file would be:

http://data.gdeltproject.org/gdeltv3/iatv/aif/20200601.captime.txt

The inventory files for the most recent 72 hours are updated throughout the day as shows complete processing.
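A daily inventory URL can be derived from a date as sketched below. The function name is hypothetical, and the base URL and the "captime.txt" suffix are copied verbatim from the example above; the suffix is exposed as a parameter because it is an assumption that every inventory file follows exactly this naming.

```python
from datetime import date

FIRST_DAY = date(2009, 7, 1)  # earliest inventory date stated above
BASE = "http://data.gdeltproject.org/gdeltv3/iatv/aif/"


def inventory_url(day, suffix="captime.txt"):
    """Return the inventory file URL for a given day, following the example's pattern."""
    if day < FIRST_DAY:
        raise ValueError("inventory files start on July 1, 2009")
    return f"{BASE}{day:%Y%m%d}.{suffix}"


example = inventory_url(date(2020, 6, 1))
```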

The GDELT Project
