The GDELT Project

A Deeper Dive Into Uncaptioned Television News Airtime

Diving deeper into the question of what is driving the high levels of uncaptioned airtime in the 2009-2013 era, we used a copy of the Advertising Airtime Dataset loaded into a temporary BigQuery table to demonstrate that a relatively small number of broadcasts with failed captioning encoding were the primary drivers of this odd artifact.

Once again, using a single SQL query we can instantly surface all of the problematic broadcasts:

SELECT
  iaShowId,
  COUNTIF(type='NEWS') cnt_news,
  COUNTIF(type='UNCAPTIONED') cnt_uncaptioned,
  COUNTIF(type='ADVERTISEMENT') cnt_ad,
  COUNT(1) cnt_tot,
  COUNTIF(type='UNCAPTIONED')/COUNT(1)*100 perc_uncap
FROM `[TEMPTABLE]`
WHERE station='CNN'
  AND (DATE(date) >= "2009-07-01" AND DATE(date) <= "2020-12-31")
GROUP BY iaShowId
HAVING perc_uncap > 25
ORDER BY perc_uncap DESC

This yields results like:

Row  iaShowId                                                cnt_news  cnt_uncaptioned  cnt_ad  cnt_tot  perc_uncap
1    CNN_20101031_060000_Boiling_Point_Inside_the_Tea_Party  0         1                0       1        100.0
2    CNN_20091224_220000_Future_Fast_Forward                 0         1                0       1        100.0
3    CNN_20100819_090000_Anderson_Cooper_360                 0         1                0       1        100.0
4    CNN_20100822_060000_In_America_New_Orleans_Rising       0         1                0       1        100.0
5    CNNW_20160207_010000_Anthony_Bourdain_Parts_Unknown     0         1                0       1        100.0
6    CNN_20100315_190000_Ricks_List                          0         1                0       1        100.0
7    CNNW_20171015_140000_Fareed_Zakaria_GPS                 0         1                0       1        100.0
8    CNN_20100402_053000                                     0         1                0       1        100.0

We're in the process of updating the AIF dataset to remove these problematic shows, so stay tuned!

Looking more closely at broadcasts with around 30% uncaptioned airtime, we can see that most advertisements on CNN are uncaptioned, ranging from broadcasts with almost no commercial captioning:

To broadcasts with just a few lines of captioned commercials total:

Thus, while commercials make up a very small portion of total captioning, where they do appear they can dramatically skew findings – the reason for our new AIF dataset!

We can also look at the percentage of news captioning in each broadcast, which also catches shows where all of the commercials were captioned but the show airtime itself was not:

SELECT
  iaShowId,
  COUNTIF(type='NEWS') cnt_news,
  COUNTIF(type='UNCAPTIONED') cnt_uncaptioned,
  COUNTIF(type='ADVERTISEMENT') cnt_ad,
  COUNT(1) cnt_tot,
  COUNTIF(type='NEWS')/COUNT(1)*100 perc_news
FROM `[TEMPTABLE]`
WHERE station='CNN'
  AND (DATE(date) >= "2009-07-01" AND DATE(date) <= "2020-12-31")
GROUP BY iaShowId
HAVING perc_news < 50
ORDER BY perc_news ASC

This surfaces odd cases like this one, which reports that 49% of its airtime has news captioning even though a cursory glance shows that most of it has no captioning:

It turns out that the culprit is this line of captioning, which states that "good night and good luck" should be displayed continuously from 2m8s until 29m3s:

00:02:08,428|00:29:03,491|RU2|GOOD NIGHT AND GOOD LUCK.

This suggests that we cannot accept the end timecode of each captioning block as-is and must instead sanity check it.
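One possible form of such a sanity check is sketched below in Python. It assumes the pipe-delimited `start|end|mode|text` layout seen in the caption line above, with timecodes written as `HH:MM:SS,mmm`. The 30-second ceiling (`MAX_BLOCK_SECONDS`) and the helper names are illustrative assumptions, not values or code from the actual pipeline.

```python
from typing import NamedTuple


class CaptionBlock(NamedTuple):
    start: float  # seconds from the start of the broadcast
    end: float    # seconds from the start of the broadcast
    mode: str     # caption mode field, e.g. "RU2" or "POP"
    text: str


def parse_timecode(tc: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timecode into seconds."""
    hms, millis = tc.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0


def parse_line(line: str) -> CaptionBlock:
    """Split one 'start|end|mode|text' caption line into its fields."""
    start, end, mode, text = line.split("|", 3)
    return CaptionBlock(parse_timecode(start), parse_timecode(end), mode, text.strip())


# Assumed ceiling on how long one caption block can plausibly stay on screen.
MAX_BLOCK_SECONDS = 30.0


def is_suspicious(block: CaptionBlock, max_seconds: float = MAX_BLOCK_SECONDS) -> bool:
    """Flag blocks whose end precedes their start or whose duration is implausible."""
    duration = block.end - block.start
    return duration < 0 or duration > max_seconds


block = parse_line("00:02:08,428|00:29:03,491|RU2|GOOD NIGHT AND GOOD LUCK.")
print(f"duration={block.end - block.start:.3f}s suspicious={is_suspicious(block)}")
```

A flagged block could then have its end time clamped (for example, to the start of the next block) rather than being counted as captioned for its full stated duration. Which repair is appropriate is a design choice this sketch leaves open.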

Another example can be found in this broadcast:

In this case, the broadcast purports to be 16.6 hours long. The culprit is this transition, in which the timecode jumps from 35m to 21 hours in the middle of a single advertisement:

00:35:06,805|00:35:09,056|POP| BY A HEART VALVE PROBLEM.
00:35:09,057|21:00:05,365|POP| THAT PUTS JIM AT A GREATER RISK
00:35:09,057|21:00:05,365|POP| OF STROKE.
21:00:05,366|21:00:08,719|POP|TO GET HIS BLOOD TESTED.
21:00:05,366|21:00:08,719|POP| NOT CAUSED
21:00:05,366|21:00:08,719|POP| BY A HEART VALVE PROBLEM.

Here's another example of such a video:

You can see that it too suffers from this timecode leap:

00:18:27,040|00:18:29,475|POP| GO TO OBAMACARE911.COM
00:18:29,476|00:18:32,010|POP| TO CLAIM YOUR COPY NOW.
04:04:19,128|04:04:21,563|POP| to save big during Sleep
04:04:19,128|04:04:21,563|POP| Train's Triple Choice Sale.

Here we can see the diagnostic output of ccextractor showing that it sees the correct 30 minutes of video content but that there is a timecode jump:

CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: stdin
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: Yes]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .txt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No][Filter profanity: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: Yes, no timeout] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Reading from standard input
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
Streaming |  00:00

New video information found
[1280 * 720] [AR: 03 - 16:9] [FR: 07 - 59.94] [progressive: yes]

TS continuity counter not incremented prev/curr 9/9
XDS Notice: Network call letters now FNC
XDS: ContentAdvisory: US TV Parental Guidelines. Age Rating: None
XDS:
  XDS Notice: Network is now Fox News Channel
XDS Notice: Program is now News, Fair and Balanced
Streaming |  245:10
Skip forward to the next Sequence or GOP start.
Premature end of file - Transport Stream packet is incomplete (expected 188 bytes, got 92).

Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Total frames time:        00:30:58:556  (111402 frames at 59.94fps)
CC type 0: 17842 (NTSC line 21 field 1 closed captions)
CC type 1: 22640 (NTSC line 21 field 2 closed captions)
CC type 2: 44467 (DTVCC Channel Packet Data)
CC type 3: 16226 (DTVCC Channel Packet Start)
incl. pulldown frames:  00:00:30:346  (1819 frames at 59.94fps)

Min PTS:                                -00:18:40:040
Max PTS:                                03:57:54:788
Length:                          04:16:34:828

Initial GOP time:          00:00:00:000
Final GOP time:          00:00:43:233+32F
Diff. GOP length:          00:00:43:233+32F     (00:00:44:300)

Number of key frames: 2791
Total user data fields: 113069
HDTV type user data fields: 109518
Done, processing time = 7 seconds
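The mismatch in this report, roughly 31 minutes of frames against a PTS span of more than four hours, can also be checked automatically. The Python sketch below parses the "Total frames time" and "Length" lines from a summary like the one above. It assumes the timestamps are formatted as `HH:MM:SS:mmm` with a millisecond final field. The 60-second tolerance and the function names are illustrative assumptions, not part of ccextractor.

```python
import re

# Matches timestamps such as '00:30:58:556' or '-00:18:40:040'.
TIMESTAMP = re.compile(r"(-?)(\d+):(\d{2}):(\d{2}):(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS:mmm' timestamp (optionally negative) into seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    sign, hours, minutes, seconds, millis = match.groups()
    value = int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0
    return -value if sign else value


def read_field(report: str, label: str) -> float:
    """Return the first timestamp on the report line that begins with 'label:'."""
    for line in report.splitlines():
        if line.strip().startswith(label + ":"):
            match = TIMESTAMP.search(line.split(":", 1)[1])
            if match is not None:
                return to_seconds(match.group(0))
    raise KeyError(label)


def has_timecode_jump(report: str, tolerance_seconds: float = 60.0) -> bool:
    """Flag reports whose PTS span exceeds the decoded frame time by more than the tolerance."""
    frames = read_field(report, "Total frames time")
    length = read_field(report, "Length")
    return length - frames > tolerance_seconds


report = """\
Total frames time:        00:30:58:556  (111402 frames at 59.94fps)
Min PTS:                                -00:18:40:040
Max PTS:                                03:57:54:788
Length:                          04:16:34:828
"""

print(has_timecode_jump(report))
```

Comparing the two durations this way flags a broadcast before its captions are examined at all, which could serve as a cheap first-pass filter.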

Once again, this demonstrates the power of BigQuery to triage a massive dataset in realtime and interactively explore the contours of a set of issues.