Announcing The Television News Advertising Inventory Files (AIF) Modified TTXT Dataset

Expanding on our new Advertising Inventory Files (AIF) Captioning Time dataset, we're particularly interested in understanding the landscape of closed captioning of advertisements on television news. Are most captioned advertisements for consumer products, while advertising related to public health, policy issues and other critical societal topics goes uncaptioned in ways that would disenfranchise members of society who rely upon captioning to understand the news? Is the language of advertising fundamentally different in some way from the language of news, either in word choice or emotional undercurrents? Do advertisements offer counternarratives to the news programming in which they are embedded, perhaps negating editorial decisions of news channels? Have advertisements changed in the Covid-19 era and, either way, are advertisements presenting scenes and promoting behaviors at odds with public health messaging?

To help explore these questions and broaden understanding of working with Captioning Mode data, we are excited to announce the release of modified versions of the raw TTXT files generated by ccextractor for each broadcast. Each modified file uses the raw TTXT output format, but the caption text is removed for all news-related lines and retained only for the POP and PAI caption modes, which are advertisements. These TTXT files are therefore exactly as output by ccextractor, except that news captioning is blanked out and only advertising captioning is retained.
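To make the blanking rule concrete, here is a minimal sketch of what such a filter could look like. It assumes the pipe-delimited layout visible in the example lines in this post (start time, end time, caption mode, text). The function name is hypothetical, and this is an illustration of the rule as described, not the actual code used to produce the dataset.

```python
# Caption modes this post identifies as advertising.
AD_MODES = {"POP", "PAI"}

def blank_non_ad_text(line: str) -> str:
    """Keep the timing and mode fields; drop caption text unless the mode is an ad mode.

    Hypothetical helper: assumes each caption line looks like
    START|END|MODE|TEXT, as in the example lines in this post.
    """
    stripped = line.rstrip("\n")
    parts = stripped.split("|", 3)  # split at most 3 times so text may contain "|"
    if len(parts) < 4:
        return stripped  # not a caption line; pass it through unchanged
    start, end, mode, text = parts
    if mode not in AD_MODES:
        text = ""  # news-related line: retain timing and mode, remove the words
    return "|".join([start, end, mode, text])
```

An advertising line passes through unchanged, while a line in any other mode keeps its timecodes and mode label but loses its text.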

For example, for the July 29, 2010 ABC World News Now 1AM-3AM PST broadcast, the lines surrounding the first commercial break look like:

00:04:24,463|00:04:25,798|POP| Ohhh. Cheesecake.
00:04:25,816|00:04:28,201|POP| Ok. What if I just had
00:04:25,816|00:04:28,201|POP| a small slice?
00:04:28,235|00:04:29,636|POP| I was good today,
00:04:28,235|00:04:29,636|POP| I deserve it!
00:04:29,670|00:04:32,021|POP| Or, I could have a medium slice
00:04:29,670|00:04:32,021|POP| and some celery sticks
00:04:32,072|00:04:33,472|POP|and they would ccel each other
00:04:32,072|00:04:33,472|POP| out, right?
00:04:33,490|00:04:34,941|POP| Or...Ok.
00:04:33,490|00:04:34,941|POP| I could ha one large slice
00:04:34,975|00:04:36,359|POP| and jog in place as I eat it
00:04:36,410|00:04:39,078|POP| Or...Ok. How about one large
00:04:36,410|00:04:39,078|POP| slice while jogging in place
00:04:39,113|00:04:40,113|POP| followed by eight celery...
00:04:40,147|00:04:41,531|POP| MMM
00:04:40,147|00:04:41,531|POP| Raspberrcheesecake...
00:04:41,582|00:04:43,616|POP| I have been thinking about this
00:04:41,582|00:04:43,616|POP| all day.
00:04:43,651|00:04:44,817|POP| Wow, and you've
00:04:43,651|00:04:44,817|POP| lost weight!
00:04:44,835|00:04:45,652|POP|Oh yeah,
00:04:45,686|00:04:48,121|POP|You're welcome. thank you!
00:04:48,188|00:04:49,122|POP| [ Female Announcer ] Yop.
00:04:49,156|00:04:50,790|POP| With 30 delicious flavors
00:04:52,126|00:04:59,065|POP| Yoplait.
00:04:52,126|00:04:59,065|POP| It is so good.
00:06:01,695|00:06:02,945|POP| OUCH! OW! OOPS!
00:06:03,063|00:06:05,298|POP| IT'S NEO TO GO!®
00:06:05,366|00:06:07,083|POP|READY. AIM. PROTECT.
00:06:07,184|00:06:08,935|POP| NEOSPORIN® GIVES YOU
00:06:07,184|00:06:08,935|POP| INFECTION-PROTECTION,
00:06:09,019|00:06:10,086|POP| AND PAIN RELIEF.
00:06:10,187|00:06:12,104|POP| NEO TO GO!®
00:06:10,187|00:06:12,104|POP| PLUS PAIN RELIEF.
00:06:12,189|00:06:14,123|POP| EVERY CUT.
00:06:12,189|00:06:14,123|POP| EVERY TIME.
00:06:12,189|00:06:14,123|POP| EVERYWHERE.

You can see how the "RU3" news lines are blank since we have removed the text of the news-related captioning, while the advertising lines are intact. You can also see how the first advertisement ends at 4m59s, while the following commercial is uncaptioned, followed by another captioned commercial starting at 6m1s. Remember that captioning files do not encode uncaptioned time – blocks of time that do not have associated captions appear as gaps in the timecode sequence.
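Because uncaptioned time shows up only as a jump in the timecodes, it can be recovered by comparing each caption's start time with the latest end time seen so far. The following is a minimal sketch under a few assumptions: the field layout is the one shown in the excerpt above, the part after the comma is a decimal fraction of a second, and the 5-second `min_gap` threshold is an arbitrary illustrative choice. All function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Caption(NamedTuple):
    start: float  # seconds from the start of the broadcast
    end: float
    mode: str
    text: str

def to_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS,FRACTION' to seconds (fraction assumed decimal)."""
    hms, _, fraction = stamp.partition(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    frac = int(fraction) / (10 ** len(fraction)) if fraction else 0.0
    return hours * 3600 + minutes * 60 + seconds + frac

def parse_line(line: str) -> Caption:
    """Parse one START|END|MODE|TEXT line."""
    start, end, mode, text = line.rstrip("\n").split("|", 3)
    return Caption(to_seconds(start), to_seconds(end), mode, text.strip())

def find_gaps(captions: List[Caption], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) spans of at least min_gap seconds with no captions."""
    ordered = sorted(captions, key=lambda c: c.start)
    gaps: List[Tuple[float, float]] = []
    if not ordered:
        return gaps
    latest_end = ordered[0].end
    for cap in ordered[1:]:
        if cap.start - latest_end >= min_gap:
            gaps.append((latest_end, cap.start))
        # Lines of one multi-line caption share timecodes, so track the maximum end.
        latest_end = max(latest_end, cap.end)
    return gaps

# Four lines taken from the excerpt above.
sample = [
    "00:04:49,156|00:04:50,790|POP| With 30 delicious flavors",
    "00:04:52,126|00:04:59,065|POP| Yoplait.",
    "00:04:52,126|00:04:59,065|POP| It is so good.",
    "00:06:01,695|00:06:02,945|POP| OUCH! OW! OOPS!",
]
gaps = find_gaps([parse_line(line) for line in sample])
```

On these four lines, the only span that exceeds the threshold is the one between the end of the first advertisement and the start of the next captioned one. A gap found this way says only that nothing was captioned; it does not say what aired during it.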

You'll notice that nothing delineates the two commercials other than, in this case, a third, uncaptioned commercial that appears between them. Captioning files do not provide information separating distinct stories or commercials, so to find the boundary between two commercials you would need to use semantic similarity estimation or construct a database of known commercials and their scripts.
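As a rough illustration of the database approach, the sketch below matches a caption line against a small dictionary of known scripts by token overlap. This is a deliberately crude stand-in for real semantic similarity estimation. The dictionary, its keys, the function names and the 0.5 threshold are all hypothetical, and the script text is simply copied from the excerpt above.

```python
import re
from typing import Dict, Optional, Set

def tokens(text: str) -> Set[str]:
    """Lowercase word tokens; punctuation and symbols are dropped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def best_match(caption_text: str, known_scripts: Dict[str, str],
               threshold: float = 0.5) -> Optional[str]:
    """Return the key of the script containing the largest share of the caption's tokens.

    Returns None when no script covers at least `threshold` of the tokens.
    """
    cap = tokens(caption_text)
    if not cap:
        return None
    best_name, best_score = None, 0.0
    for name, script in known_scripts.items():
        score = len(cap & tokens(script)) / len(cap)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical database built from lines in the excerpt above.
known_scripts = {
    "yogurt": "Ohhh. Cheesecake. Ok. What if I just had a small slice? "
              "With 30 delicious flavors Yoplait. It is so good.",
    "ointment": "READY. AIM. PROTECT. NEOSPORIN GIVES YOU INFECTION-PROTECTION, "
                "AND PAIN RELIEF. EVERY CUT. EVERY TIME. EVERYWHERE.",
}
```

Labeling each caption this way and looking for the point where the label changes gives a candidate boundary. Captions with dropped characters, such as "ccel" in the excerpt, will match less reliably, which is one reason a fuzzier similarity measure would likely be needed in practice.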

To manually review a given second of airtime, combine the filename (minus the file extensions) and start time into the following URL:


So, for example, to examine the "Ohhh. Cheesecake." line above, you would take the start time of "00:04:24,463" and convert it to seconds (it has the format "HH:MM:SS,SSFRACTION", so this would be 4*60 + 24 = 264):
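The conversion from a start time to whole seconds can be sketched as follows. The function name is hypothetical, and the fractional part after the comma is simply discarded, matching the arithmetic above.

```python
def start_second(stamp: str) -> int:
    """Convert 'HH:MM:SS,SSFRACTION' to whole seconds, dropping the fraction."""
    hms, _, _fraction = stamp.partition(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(start_second("00:04:24,463"))  # 264
```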


For those interested in the technical specifics, we use the open source "ccextractor" utility's "TTXT" output format:

cat VIDEOFILE | mbuffer -q -m 10M | ccextractor -out=ttxt -bi - -o ./OUTPUTTRANSCRIPT.TTXT

We're tremendously excited to see the kinds of new research into the language of advertising, the psychology of applied persuasion and myriad other questions that can be explored with this new dataset!


The Modified TTXT dataset is available for the same channels as the AIF Captioning Time dataset.

Technical Details

For more information on how "captioning mode" data works and how we generate these files, see the AIF Captioning Time dataset announcement.

Download The Data

To determine which shows are available, consult the daily inventory file, available in the format YYYYMMDD from July 1, 2009 through the present, which lists all of the files for that day. For example, for June 1, 2020 the file would be:


The inventory files for the most recent 72 hours are updated throughout the day as shows complete processing.
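When fetching many days of inventory, the YYYYMMDD date keys can be generated programmatically. The sketch below builds only the date portion. The rest of the inventory file address is not reproduced here, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator

FIRST_DAY = date(2009, 7, 1)  # earliest inventory date stated in this post

def inventory_key(day: date) -> str:
    """Format a date as the YYYYMMDD key used by the daily inventory files."""
    if day < FIRST_DAY:
        raise ValueError(f"no inventory before {FIRST_DAY.isoformat()}")
    return day.strftime("%Y%m%d")

def inventory_keys(first: date, last: date) -> Iterator[str]:
    """Yield one YYYYMMDD key per day from first through last, inclusive."""
    day = first
    while day <= last:
        yield inventory_key(day)
        day += timedelta(days=1)

print(inventory_key(date(2020, 6, 1)))  # 20200601
```

Since the files for the most recent 72 hours are still being updated, a downloader built on this would probably want to re-fetch those days rather than cache them.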