![](https://blog.gdeltproject.org/wp-content/uploads/2022-tve-visual-channel-comparer-all-1064x410.png)
Only a portion of the TV News Archive's broadcasts contains broadcaster-provided closed captioning, but because that captioning is largely human-transcribed rather than automatically generated, it tends to be more accurate than ASR transcripts (even if it is often less complete). Even more powerfully, for coax-originated broadcasts, the underlying MPEG2 streams contain "captioning mode" information that, for advertising-supported channels, uniquely identifies all advertisements. We will be leveraging this to exclude ads from the next generation of the TV Explorer, making analytic timelines far more accurate for journalists and scholars. Work is underway to compile this mode information for all captioned broadcasts in the Archive. With the first million broadcasts processed to date, we thought we'd share some preliminary statistics.
Over 1.1 million broadcasts have been analyzed, totaling 3.93 billion seconds of airtime. Of that airtime, 3.24 billion seconds (82.5%) were captioned, with the remaining 17.5% largely being uncaptioned advertising (advertisements frequently air without captioning). Of the captioned airtime, 2.9 billion seconds (89.5%) belong to the main program and 340.7 million seconds (10.5%) are captioned advertisements. In total, around 1 billion seconds (26%) of the analyzed airtime was likely paid advertising, in keeping with our past analyses. These numbers are skewed slightly by the inclusion of some channels with lower ad densities, which we'll explore in more detail in future analyses.
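The implied ad share can be reproduced with simple arithmetic. A minimal sketch using the rounded totals reported above (so the derived percentages differ slightly in the last digit from the exact figures):

```python
# Rounded airtime totals from the first ~1.1M analyzed broadcasts (seconds).
TOTAL_SEC = 3.93e9          # total airtime analyzed
CAPTIONED_SEC = 3.24e9      # captioned airtime
CAPTIONED_AD_SEC = 340.7e6  # captioned advertising airtime

# Uncaptioned airtime is treated here as (mostly) advertising.
uncaptioned_sec = TOTAL_SEC - CAPTIONED_SEC
likely_ad_sec = uncaptioned_sec + CAPTIONED_AD_SEC

print(f"Captioned share: {CAPTIONED_SEC / TOTAL_SEC:.1%}")  # ~82.4%
print(f"Likely ad share: {likely_ad_sec / TOTAL_SEC:.1%}")  # ~26.2%
```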
In total, of the 3.24 billion seconds of captioning (53.9 million minutes / 899K hours), there were 1.67 billion captioning lines containing 7.78 billion words and 43.3 billion characters. The average word is 5.56 characters long and on average 2.4 words are spoken per second. The average captioning line contains 4.65 words.
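The per-word and per-second averages follow directly from those aggregate totals. A quick check using the rounded figures above (the last digit can differ slightly from the exact values):

```python
# Aggregate captioning totals from the analysis above (rounded).
TOT_SECONDS = 3.24e9  # captioned airtime, seconds
TOT_LINES = 1.67e9    # captioning lines
TOT_WORDS = 7.78e9    # words
TOT_CHARS = 43.3e9    # characters

chars_per_word = TOT_CHARS / TOT_WORDS      # ~5.57 characters per word
words_per_second = TOT_WORDS / TOT_SECONDS  # ~2.40 words spoken per second
words_per_line = TOT_WORDS / TOT_LINES      # ~4.66 words per captioning line
```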
Computing these statistics was as simple as:
```sql
SELECT status, COUNT(1) numShows,
       SUM(CAST(secTot AS FLOAT64)) secTot,
       SUM(CAST(secCap AS FLOAT64)) secCap,
       SUM(CAST(secAd AS FLOAT64)) secAd,
       SUM(CAST(secProg AS FLOAT64)) secProg,
       SUM(CAST(totChars AS FLOAT64)) totChars,
       SUM(CAST(totWords AS FLOAT64)) totWords,
       SUM(CAST(totLines AS FLOAT64)) totLines
FROM (
  SELECT JSON_EXTRACT_SCALAR(CCEXTRACT, '$.status') status,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.secTot') secTot,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.secCap') secCap,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.secAd') secAd,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.secProg') secProg,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.totChars') totChars,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.totWords') totWords,
         JSON_EXTRACT_SCALAR(CCEXTRACT, '$.totLines') totLines
  FROM (
    SELECT rowkey,
           (SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)]
              FROM UNNEST(cf.column) WHERE name IN ('DOWN')) DOWN,
           (SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)]
              FROM UNNEST(cf.column) WHERE name IN ('CCEXTRACT')) CCEXTRACT
    FROM `bigtableconnections.digtwin`
    WHERE SAFE_CAST(SUBSTR(rowkey, 0, 8) AS NUMERIC) > 20000000
  )
  WHERE JSON_EXTRACT_SCALAR(DOWN, '$.status') = 'SUCCESS'
    AND LENGTH(JSON_EXTRACT_SCALAR(DOWN, '$.metaCCNum')) > 1
    AND JSON_EXTRACT_SCALAR(DOWN, '$.metaCCNum') NOT IN ('ocr', 'asr')
)
GROUP BY status
ORDER BY numShows DESC
```
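For readers who prefer to see the aggregation logic outside of SQL, the same grouping could be sketched client-side in Python. This is a hypothetical illustration: it assumes rows exported with the `DOWN` and `CCEXTRACT` columns as JSON strings, mirroring the filters and sums in the query above.

```python
import json
from collections import defaultdict

def aggregate(rows):
    """Sum caption statistics per extraction status, mirroring the BigQuery
    query: keep only successful downloads whose captioning is real
    broadcaster captioning (not OCR- or ASR-derived), then total each field.

    Each row is a dict with two JSON strings, 'DOWN' (download metadata)
    and 'CCEXTRACT' (caption-extraction statistics) -- a hypothetical
    export layout of the Bigtable columns queried above.
    """
    fields = ("secTot", "secCap", "secAd", "secProg",
              "totChars", "totWords", "totLines")
    totals = defaultdict(lambda: {"numShows": 0, **{f: 0.0 for f in fields}})
    for row in rows:
        down = json.loads(row["DOWN"])
        if down.get("status") != "SUCCESS":
            continue
        cc_num = down.get("metaCCNum") or ""
        if len(cc_num) <= 1 or cc_num in ("ocr", "asr"):
            continue  # skip OCR/ASR-derived captioning
        cc = json.loads(row["CCEXTRACT"])
        bucket = totals[cc.get("status")]
        bucket["numShows"] += 1
        for f in fields:
            bucket[f] += float(cc.get(f, 0))
    return dict(totals)
```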