Behind The Scenes: Identifying Failed Recordings: Examining Curation Metadata

Kalev Leetaru

8 months ago

Any large longitudinal audiovisual archive will have some number of recordings that suffer from technical errors, ranging from minor audio sync delays to corrupted files. How to identify these broadcasts at scale? Let's look at metadata curation notes using our digital twin and what these contemporary archivist notes can tell us about technical issues across an archive. Can they be used to identify all corrupted or missing broadcasts to avoid processing them through expensive AI pipelines as a cost-saving measure? Unfortunately, as we find here, these curation notes only reflect known problematic broadcasts and don't differentiate between broadcasts that are entirely processable and those that are unprocessable. At the same time, the power of digital twins is once again demonstrated as we use it to better understand the landscape of specific curation notes within the collection.

First, let's get a sense of the various kinds of curation notes present across the entire TV News Archive by using a regular expression to parse out the archivist comments:

select comment, count(1) cnt,  from (
  select REGEXP_EXTRACT( JSON_EXTRACT_SCALAR(DOWN, '$.metaCurationNotes'), r'\[comment\](.*?)\[/comment\]') comment, DOWN FROM (
    SELECT  
      rowkey,
      ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN,
    FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 
 ) where JSON_EXTRACT_SCALAR(DOWN, '$.status') ='SUCCESS' and length(JSON_EXTRACT_SCALAR(DOWN, '$.metaCurationNotes')) > 2
) group by comment order by cnt desc

This yields 14 distinct values across 148,092 programs that have this field. The values include entries like "program was likely cut short due to a recording issue", "program likely has audio track damage", "negative duration issues", "program was broadcast with sound only" and other notes that give a sense of the kinds of technical errors that can arise at scale.

Using our digital twin we can then compile a list of all broadcasts that have a specific kind of technical error. For example, here is a blank Russian broadcast that has a curation note that it was truncated, even though it passed into our ingest pipeline. Perhaps we can use this field to identify similar blank broadcasts, so let's compile a list of all broadcasts containing a curation note of "[comment]program was likely cut short due to a recording issue[/comment]":

(
select id, JSON_EXTRACT_SCALAR(DOWN, '$.metaCurationNotes') from (
  select JSON_EXTRACT_SCALAR(DOWN, '$.id') id, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN,
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 
 ) order by rowkey asc
) where JSON_EXTRACT_SCALAR(DOWN, '$.status') ='SUCCESS' and JSON_EXTRACT_SCALAR(DOWN, '$.metaCurationNotes') like '%cut short%' 
)

This yields 46,247 broadcasts, but the majority are like this CTV broadcast that contains normal audio and video, rather than these legitimately corrupted broadcasts. This means that we can't just use this curation note as a proxy for blank broadcasts and exclude them.