Experiments With CCExtractor Using Our BigQuery + Bigtable + GCS Digital Twin

Kalev Leetaru

10 months ago

In December 2020 we unveiled a massive new initiative in collaboration with the Internet Archive's TV News Archive to catalog all of the advertising airtime across the Archive's captioned archives, spanning more than a billion seconds of airtime. The goal was to make it possible for the first time for scholars to analyze the advertising narratives of commercial television news and, ultimately, to filter ads out of keyword searches to improve the relevancy of search results for journalists and scholars. That initiative involved stream-processing more than 2.2 petabytes of MPEG2 video through the "ccextractor" closed captioning tool in order to use the broadcaster-provided captioning mode information that precisely delineates advertising vs show airtime in each broadcast. That analysis looked at only 14 of the Archive's more than 200 channels. Using our new BigQuery + Bigtable + GCS digital twin, we wanted to see how many other channels at the Archive might be candidates for this analysis.

Of the Archive's more than 200 channels, 165 have at least one broadcast purporting to contain a closed captioning stream. From each of those channels we selected one random broadcast and examined both its raw MPEG2 capture stream and its derived MP4 file. While the Archive's modern workflow carries captioning streams through from the MPEG2 to the MP4 derivative, historically this was not the case, so for much of the collection the much larger MPEG2 raw streams must be examined rather than the far more compact MP4 files. Of the 165 sample broadcasts, just 6 had captioning streams in their MP4 files. Interestingly, more than half (91 out of 165) of the Archive's captioned channels were collected through special collection processes around major US elections and other events in which an MP4 stream was the raw source stream, with no corresponding MPEG2 raw stream. Thus, of the 165 captioned channel sample broadcasts examined, just 74 had corresponding MPEG2 captures, of which just 59 yielded non-empty TTXT transcripts. An examination of those TTXT transcripts shows that in all 59 the captioning mode information correctly delineated ad vs show airtime, suggesting that it is worth processing this extended archive.
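
The counts above form a simple funnel, and it can help to see the arithmetic laid out. The short Python sketch below uses only the figures reported in this post; the variable names and the `pct` helper are our own illustrative choices rather than part of any pipeline.

```python
# Funnel of the sampled broadcasts, using only the counts reported in this post.
captioned_channels = 165  # channels with at least one broadcast claiming captions
mp4_only = 91             # special collections whose raw source stream was an MP4
nonempty_ttxt = 59        # MPEG2 captures that yielded a non-empty TTXT transcript

# Channels whose sample broadcast has an MPEG2 capture to examine.
with_mpeg2 = captioned_channels - mp4_only
# MPEG2 captures whose TTXT transcript came back empty.
empty_ttxt = with_mpeg2 - nonempty_ttxt


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("with MPEG2 capture:", with_mpeg2, f"({pct(with_mpeg2, captioned_channels)}% of sampled channels)")
print("empty TTXT transcripts:", empty_ttxt)
print("usable share of MPEG2 captures:", f"{pct(nonempty_ttxt, with_mpeg2)}%")
print("usable share of all sampled channels:", f"{pct(nonempty_ttxt, captioned_channels)}%")
```

In other words, 15 of the 74 MPEG2 samples produced empty transcripts, and a little over a third of the 165 sampled channels look like immediate candidates.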

The digital twin query used for this analysis can be seen below.

SELECT chan, COUNT(1) cnt, STRING_AGG(id ORDER BY RAND() LIMIT 1) sampShows FROM (
  -- Pull the channel and show ID out of each DOWN JSON record.
  SELECT
    JSON_EXTRACT_SCALAR(DOWN, '$.chan') chan,
    JSON_EXTRACT_SCALAR(DOWN, '$.id') id,
    DOWN
  FROM (
    -- Read the first DOWN cell of every row whose key begins with a date in range.
    SELECT
      rowkey,
      ( SELECT ARRAY(SELECT value FROM UNNEST(cell))[OFFSET(0)] FROM UNNEST(cf.column) WHERE name IN ('DOWN') ) DOWN
    FROM `[PROJECTID].bigtableconnections.digtwin`
    WHERE SAFE_CAST(SUBSTR(rowkey, 1, 8) AS NUMERIC) > 20000000
      AND SAFE_CAST(SUBSTR(rowkey, 1, 8) AS NUMERIC) <= 20240831
  )
)
-- Keep successful records whose metaCCNum is non-empty and not 'ocr', 'asr' or '0'.
WHERE JSON_EXTRACT_SCALAR(DOWN, '$.status') = 'SUCCESS'
  AND JSON_EXTRACT_SCALAR(DOWN, '$.metaCCNum') != ''
  AND JSON_EXTRACT_SCALAR(DOWN, '$.metaCCNum') NOT IN ('ocr', 'asr', '0')
GROUP BY chan
ORDER BY cnt DESC
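
For readers who want to follow the query's logic without a BigQuery project, here is a rough Python sketch of the same filter, group and sample steps applied to in-memory records. The JSON field names (`chan`, `id`, `status`, `metaCCNum`) come from the query itself, but the row keys and sample values in `EXAMPLE_ROWS` are hypothetical, as is the assumption that each row key begins with a YYYYMMDD date. It is only an approximation of the SQL semantics, not the code we ran.

```python
import json
import random
from collections import defaultdict

# metaCCNum values the query treats as "no usable caption stream".
EXCLUDED_CC = {"", "ocr", "asr", "0"}


def sample_captioned_channels(rows, rng=random):
    """rows: iterable of (rowkey, DOWN JSON string) pairs.

    Returns (chan, count, one random show id) tuples, largest count first.
    """
    shows_by_chan = defaultdict(list)
    for rowkey, down_json in rows:
        # Approximate SAFE_CAST(SUBSTR(rowkey, 1, 8) AS NUMERIC): skip non-numeric prefixes.
        try:
            date_prefix = int(rowkey[:8])
        except ValueError:
            continue
        if not (20000000 < date_prefix <= 20240831):
            continue
        down = json.loads(down_json)
        cc = down.get("metaCCNum")
        if down.get("status") != "SUCCESS" or cc is None or cc in EXCLUDED_CC:
            continue
        shows_by_chan[down.get("chan")].append(down.get("id"))
    # Like STRING_AGG(id ORDER BY RAND() LIMIT 1): keep one random show per channel.
    result = [(chan, len(ids), rng.choice(ids)) for chan, ids in shows_by_chan.items()]
    return sorted(result, key=lambda r: r[1], reverse=True)


# Hypothetical records for illustration only.
EXAMPLE_ROWS = [
    ("20200101_CNN_a", '{"chan":"CNN","id":"CNN_a","status":"SUCCESS","metaCCNum":"1"}'),
    ("20200102_CNN_b", '{"chan":"CNN","id":"CNN_b","status":"SUCCESS","metaCCNum":"2"}'),
    ("20200103_KGO_a", '{"chan":"KGO","id":"KGO_a","status":"SUCCESS","metaCCNum":"ocr"}'),
    ("20200104_BBC_a", '{"chan":"BBC","id":"BBC_a","status":"FAIL","metaCCNum":"1"}'),
    ("20200105_FOX_a", '{"chan":"FOX","id":"FOX_a","status":"SUCCESS","metaCCNum":"1"}'),
]

for chan, cnt, samp_show in sample_captioned_channels(EXAMPLE_ROWS):
    print(chan, cnt, samp_show)
```

With these example rows, only the CNN and FOX records survive the filters, and each channel contributes one randomly chosen show ID, which is how the single sample broadcast per channel was drawn.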