Diving deeper into the question of what is driving the high levels of uncaptioned airtime in the 2009-2013 era, we used a copy of the Advertising Airtime Dataset loaded into a temporary BigQuery table to demonstrate that a relatively small number of broadcasts with failed captioning encoding were the primary drivers of this odd artifact.
Once again, using a single SQL query we can instantly surface all of the problematic broadcasts:
SELECT
  iaShowId,
  COUNTIF(type='NEWS') cnt_news,
  COUNTIF(type='UNCAPTIONED') cnt_uncaptioned,
  COUNTIF(type='ADVERTISEMENT') cnt_ad,
  COUNT(1) cnt_tot,
  COUNTIF(type='UNCAPTIONED')/COUNT(1)*100 perc_uncap
FROM `[TEMPTABLE]`
WHERE station='CNN'
  AND (DATE(date) >= "2009-07-01" AND DATE(date) <= "2020-12-31")
GROUP BY iaShowId
HAVING perc_uncap > 25
ORDER BY perc_uncap DESC
This yields results like:
| Row | iaShowId | cnt_news | cnt_uncaptioned | cnt_ad | cnt_tot | perc_uncap |
|---|---|---|---|---|---|---|
| 1 | CNN_20101031_060000_Boiling_Point_Inside_the_Tea_Party | 0 | 1 | 0 | 1 | 100.0 |
| 2 | CNN_20091224_220000_Future_Fast_Forward | 0 | 1 | 0 | 1 | 100.0 |
| 3 | CNN_20100819_090000_Anderson_Cooper_360 | 0 | 1 | 0 | 1 | 100.0 |
| 4 | CNN_20100822_060000_In_America_New_Orleans_Rising | 0 | 1 | 0 | 1 | 100.0 |
| 5 | CNNW_20160207_010000_Anthony_Bourdain_Parts_Unknown | 0 | 1 | 0 | 1 | 100.0 |
| 6 | CNN_20100315_190000_Ricks_List | 0 | 1 | 0 | 1 | 100.0 |
| 7 | CNNW_20171015_140000_Fareed_Zakaria_GPS | 0 | 1 | 0 | 1 | 100.0 |
| 8 | CNN_20100402_053000 | 0 | 1 | 0 | 1 | 100.0 |
We're in the process of updating the AIF dataset to remove these problematic shows, so stay tuned!
Looking more closely at broadcasts around 30% uncaptioned airtime, we can see that the majority of advertisements on CNN are uncaptioned, ranging from broadcasts with almost no commercial captioning to broadcasts with just a few lines of captioned commercials in total.
Thus, while commercials make up a very small portion of total captioning, where they do appear they can dramatically skew findings – the reason for our new AIF dataset!
We can also look at the total percentage of news captioning in each broadcast, which also catches shows where all of the commercials were captioned but the show airtime itself was not:
SELECT
  iaShowId,
  COUNTIF(type='NEWS') cnt_news,
  COUNTIF(type='UNCAPTIONED') cnt_uncaptioned,
  COUNTIF(type='ADVERTISEMENT') cnt_ad,
  COUNT(1) cnt_tot,
  COUNTIF(type='NEWS')/COUNT(1)*100 perc_news
FROM `[TEMPTABLE]`
WHERE station='CNN'
  AND (DATE(date) >= "2009-07-01" AND DATE(date) <= "2020-12-31")
GROUP BY iaShowId
HAVING perc_news < 50
ORDER BY perc_news ASC
This surfaces odd cases like the following broadcast, which reports that 49% of its airtime has news captioning even though a cursory glance shows that the majority of it has no captioning:
- https://archive.org/details/MSNBC_20100427_060000_The_Rachel_Maddow_Show
It turns out that the culprit is this line of captioning, which states that "good night and good luck" should be displayed continuously from 2m8s until 29m3s:
00:02:08,428|00:29:03,491|RU2|GOOD NIGHT AND GOOD LUCK.
This suggests that we cannot accept the end timecode for each captioning block as-is and must instead perform sanity checks on them.
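As a rough illustration of what such a sanity check might look like, the Python sketch below flags caption blocks whose displayed duration is implausibly long. It assumes the pipe-delimited `start|end|mode|text` layout seen in the line above; the 30-second cutoff and the names `parse_timecode` and `suspicious_blocks` are hypothetical choices for illustration, not part of our actual pipeline.

```python
from datetime import timedelta

# Assumed cutoff: a single caption block is unlikely to stay on screen
# this long. The value is illustrative and would need tuning on real data.
MAX_BLOCK_SECONDS = 30.0


def parse_timecode(tc: str) -> timedelta:
    """Parse an 'HH:MM:SS,mmm' timecode into a timedelta."""
    hms, millis = tc.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=int(millis))


def suspicious_blocks(lines, max_seconds=MAX_BLOCK_SECONDS):
    """Return (line, duration_in_seconds) for blocks with implausible durations."""
    flagged = []
    for line in lines:
        # Split on the first three pipes only, so the caption text stays intact.
        start, end, _mode, _text = line.split("|", 3)
        duration = (parse_timecode(end) - parse_timecode(start)).total_seconds()
        if duration < 0 or duration > max_seconds:
            flagged.append((line, duration))
    return flagged


sample = [
    "00:02:08,428|00:29:03,491|RU2|GOOD NIGHT AND GOOD LUCK.",
    "00:35:06,805|00:35:09,056|POP| BY A HEART VALVE PROBLEM.",
]
for line, duration in suspicious_blocks(sample):
    print(f"{duration:.1f}s  {line}")
```

On these two sample lines only the "GOOD NIGHT AND GOOD LUCK" block is flagged, since it claims to remain on screen for roughly 27 minutes.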
Another example can be found in this broadcast:
- https://archive.org/details/CNNW_20130827_070000_Piers_Morgan_Live
In this case, the broadcast purports to be 16.6 hours long. The culprit is this transition, in which the timecode jumps from 35m to 21 hours in the middle of a single advertisement:
00:35:06,805|00:35:09,056|POP| BY A HEART VALVE PROBLEM.
00:35:09,057|21:00:05,365|POP| THAT PUTS JIM AT A GREATER RISK
00:35:09,057|21:00:05,365|POP| OF STROKE.
21:00:05,366|21:00:08,719|POP|TO GET HIS BLOOD TESTED.
21:00:05,366|21:00:08,719|POP| NOT CAUSED
21:00:05,366|21:00:08,719|POP| BY A HEART VALVE PROBLEM.
Here's another example of such a video:
- https://archive.org/details/FOXNEWSW_20130720_180000_The_Journal_Editorial_Report
You can see that it too suffers from this timecode leap:
00:18:27,040|00:18:29,475|POP| GO TO OBAMACARE911.COM
00:18:29,476|00:18:32,010|POP| TO CLAIM YOUR COPY NOW.
04:04:19,128|04:04:21,563|POP| to save big during Sleep
04:04:19,128|04:04:21,563|POP| Train's Triple Choice Sale.
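A leap like this one shows up as a gap between consecutive caption blocks rather than as a single overlong block, so it calls for a slightly different check. The Python sketch below is one possible way to detect it, again assuming the pipe-delimited `start|end|mode|text` layout; the 10-minute gap threshold and the names `to_seconds` and `find_timecode_leaps` are hypothetical.

```python
# Assumed threshold: a silence between caption blocks longer than this is
# treated as a timecode leap. The value is illustrative only.
MAX_GAP_SECONDS = 600.0


def to_seconds(tc: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timecode to seconds."""
    hms, millis = tc.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def find_timecode_leaps(lines, max_gap=MAX_GAP_SECONDS):
    """Return (previous_end, next_start) pairs, in seconds, for each large gap."""
    leaps = []
    latest_end = None
    for line in lines:
        start, end, _mode, _text = line.split("|", 3)
        start_s, end_s = to_seconds(start), to_seconds(end)
        if latest_end is not None and start_s - latest_end > max_gap:
            leaps.append((latest_end, start_s))
        # Track the latest end time seen so far, so that several rows sharing
        # one timecode range do not register as separate gaps.
        latest_end = end_s if latest_end is None else max(latest_end, end_s)
    return leaps


sample = [
    "00:18:27,040|00:18:29,475|POP| GO TO OBAMACARE911.COM",
    "00:18:29,476|00:18:32,010|POP| TO CLAIM YOUR COPY NOW.",
    "04:04:19,128|04:04:21,563|POP| to save big during Sleep",
    "04:04:19,128|04:04:21,563|POP| Train's Triple Choice Sale.",
]
for prev_end, next_start in find_timecode_leaps(sample):
    print(f"leap of {next_start - prev_end:.0f}s after {prev_end:.3f}s")
```

On the four lines above, this reports a single leap of nearly four hours between the end of the first advertisement and the start of the next.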
Here we can see the diagnostic output of ccextractor showing that it sees the correct 30 minutes of video content but that there is a timecode jump:
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: stdin
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ]
[Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: Yes]
[Use pic_order_cnt_lsb for H.264: No]
[Print CC decoder traces: No]
[Target format: .txt]
[Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No][Filter profanity: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: Yes, no timeout] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]
-----------------------------------------------------------------
Reading from standard input
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
Streaming | 00:00
New video information found [1280 * 720] [AR: 03 - 16:9] [FR: 07 - 59.94] [progressive: yes]
TS continuity counter not incremented prev/curr 9/9
XDS Notice: Network call letters now FNC
XDS: ContentAdvisory: US TV Parental Guidelines. Age Rating: None
XDS: XDS Notice: Network is now Fox News Channel
XDS Notice: Program is now News, Fair and Balanced
Streaming | 245:10
Skip forward to the next Sequence or GOP start.
Premature end of file - Transport Stream packet is incomplete (expected 188 bytes, got 92).
Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0
Total frames time: 00:30:58:556 (111402 frames at 59.94fps)
CC type 0: 17842 (NTSC line 21 field 1 closed captions)
CC type 1: 22640 (NTSC line 21 field 2 closed captions)
CC type 2: 44467 (DTVCC Channel Packet Data)
CC type 3: 16226 (DTVCC Channel Packet Start)
incl. pulldown frames: 00:00:30:346 (1819 frames at 59.94fps)
Min PTS: -00:18:40:040
Max PTS: 03:57:54:788
Length: 04:16:34:828
Initial GOP time: 00:00:00:000
Final GOP time: 00:00:43:233+32F
Diff. GOP length: 00:00:43:233+32F (00:00:44:300)
Number of key frames: 2791
Total user data fields: 113069
HDTV type user data fields: 109518
Done, processing time = 7 seconds
Once again, this demonstrates the power of BigQuery to triage a massive dataset in realtime and interactively explore the contours of a set of issues.
