One of the most complex and time-consuming aspects of working with vast historical archives is diagnosing and addressing the myriad oddities that surface in the most unexpected ways. For example, as we began to sweep through the complete TV News Archive to identify broadcasts whose ASR or OCR had failed, we stumbled across a strange issue: shows with 15 minutes to an hour of airtime that ffmpeg would treat as if they were 12 to 40 hours long. In turn, this was causing highly unpredictable failure states in downstream AI workflows that were not designed for single 40-hour input files. After considerable in-depth diagnostic analysis, it turned out that these broadcasts had corrupted MPEG2 streams with invalid duration information. Many had entirely valid video and audio data for the expected duration, meaning we didn't want to simply discard them. Even if we did, how would we identify them without streaming the header chunks of many petabytes of video files and scanning them all? Here we used our BigQuery + Bigtable digital twin to identify all broadcasts with substantial deviations between their airtime and their reported MPEG2 file durations, then recomputed their durations based on airtime and preserved their original MPEG2 durations in a new field that will allow us to analyze these files in greater depth down the road.
From an SVOT (Single Version Of the Truth) standpoint, this offers a critical reminder that in large audiovisual archives there will often be multiple sources for a given attribute, and they may not always align with one another. In this case, we were originally advised years ago to use the MPEG2 file duration as the most accurate value, in order to accommodate slight mismatches between recorded and reported times caused by stream reconstruction from the component source files. Here, however, we discovered that while the file duration is in theory the most accurate measure (since it comes from the file itself), airtime is a better one in practice, since it allows us to sidestep file corruption issues.
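At its core, the consistency check is simple: a broadcast is suspect when its container-reported duration deviates from its scheduled airtime by more than a threshold. A minimal Python sketch of that logic (the timestamp format and field values here are illustrative assumptions, not the archive's actual schema):

```python
from datetime import datetime

# Deviation beyond which a broadcast's reported duration is considered suspect.
MAX_DELTA_SEC = 30 * 60  # 30 minutes

def is_suspect(meta_start_utc: str, meta_stop_utc: str, dur_sec: float) -> bool:
    """Flag a broadcast whose container-reported duration (dur_sec)
    deviates from its scheduled UTC airtime by more than MAX_DELTA_SEC."""
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
    start = datetime.strptime(meta_start_utc, fmt)
    stop = datetime.strptime(meta_stop_utc, fmt)
    airtime_sec = (stop - start).total_seconds()
    return abs(dur_sec - airtime_sec) > MAX_DELTA_SEC

# A nominally 30-minute broadcast whose corrupted MPEG2 header claims ~40 hours:
print(is_suspect("2024-01-01 00:00:00", "2024-01-01 00:30:00", 144000))  # True
# The same broadcast with a one-second mismatch is fine:
print(is_suspect("2024-01-01 00:00:00", "2024-01-01 00:30:00", 1801))    # False
```

The SQL below performs this same comparison at archive scale, directly against the digital twin.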
First, we wrote a SQL query to identify all broadcasts with a greater-than-30-minute deviation between the broadcast's UTC start and stop times and its reported duration:
```sql
select id, metaStartTimeUTC, metaStopTimeUTC, durSec, diffSec, ABS(durSec-diffSec) delta, runTime from (
  select id, metaStartTimeUTC, metaStopTimeUTC,
         TIMESTAMP_DIFF(SAFE_CAST(metaStopTimeUTC AS TIMESTAMP), SAFE_CAST(metaStartTimeUTC AS TIMESTAMP), SECOND) diffSec,
         durSec, runTime from (
    select JSON_EXTRACT_SCALAR(DOWN, '$.id') id,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaStartTimeUTC') metaStartTimeUTC,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaStopTimeUTC') metaStopTimeUTC,
           SAFE_CAST(JSON_EXTRACT_SCALAR(DOWN, '$.durSec') AS numeric) durSec,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaRuntime') runTime
    FROM (
      SELECT rowkey,
             ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN,
      FROM `bigtableconnections.digtwin`
      where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
    )
  )
) where ABS(durSec-diffSec) > (30*60) order by delta desc
```
This yielded 18,158 broadcasts that needed to be fixed. We could compute runtime as we did above, or simply use the built-in "runTime" metadata field in each broadcast that reports its airtime. To verify whether the "runTime" metadata field would be a better fit, let's check whether any shows are missing this value or have a value of 0:
```sql
select id, metaStartTimeUTC, metaStopTimeUTC, durSec, diffSec, ABS(durSec-diffSec) delta, runTime from (
  select id, metaStartTimeUTC, metaStopTimeUTC,
         TIMESTAMP_DIFF(SAFE_CAST(metaStopTimeUTC AS TIMESTAMP), SAFE_CAST(metaStartTimeUTC AS TIMESTAMP), SECOND) diffSec,
         durSec, runTime from (
    select JSON_EXTRACT_SCALAR(DOWN, '$.id') id,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaStartTimeUTC') metaStartTimeUTC,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaStopTimeUTC') metaStopTimeUTC,
           SAFE_CAST(JSON_EXTRACT_SCALAR(DOWN, '$.durSec') AS numeric) durSec,
           JSON_EXTRACT_SCALAR(DOWN, '$.metaRuntime') runTime
    FROM (
      SELECT rowkey,
             ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN,
      FROM `bigtableconnections.digtwin`
      where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
    )
  )
) where (runTime is null OR runTime='00:00' OR runTime='00:00:00')
```
In fact, it turns out that 45 broadcasts have a runtime of 0. These broadcasts have valid MPG streams, but durations of a second or less, such as these two broadcasts. Thus, it appears that runtime is a valid surrogate.
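Since the "runTime" field arrives as an 'HH:MM' or 'HH:MM:SS' string (as the query above suggests), using it as a surrogate means parsing it to seconds and treating missing or zero values as invalid so they fall back to another duration source. A hedged sketch of that parsing, assuming only those two formats:

```python
def runtime_to_seconds(run_time):
    """Parse a runTime string ('HH:MM' or 'HH:MM:SS') into seconds.
    Returns None for missing, malformed, or zero runtimes, which must
    fall back to another duration source."""
    if not run_time:
        return None
    try:
        parts = [int(p) for p in run_time.split(":")]
    except ValueError:
        return None
    if len(parts) == 2:        # 'HH:MM'
        hours, minutes, seconds = parts[0], parts[1], 0
    elif len(parts) == 3:      # 'HH:MM:SS'
        hours, minutes, seconds = parts
    else:
        return None
    total = hours * 3600 + minutes * 60 + seconds
    return total if total > 0 else None

print(runtime_to_seconds("01:30"))     # 5400
print(runtime_to_seconds("00:00"))     # None (zero runtime: fall back)
print(runtime_to_seconds("00:45:30"))  # 2730
```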
In this case, we then swept through the entire digital twin and updated the records for all 10M+ broadcasts, since most had deltas of at least one second and many had deltas of several seconds.
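The per-record repair applied during that sweep can be sketched as follows: recompute the duration from the UTC airtime and preserve the original MPEG2-reported value in a new field for later analysis. This is a minimal illustration, not the production code; the field name `mpeg2DurSec` and the timestamp format are assumptions:

```python
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def repair_duration(record: dict) -> dict:
    """Recompute a broadcast's duration from its UTC airtime, preserving
    the original MPEG2-reported duration in a new field ('mpeg2DurSec'
    is a hypothetical name) so corrupted files can be studied later."""
    start = datetime.strptime(record["metaStartTimeUTC"], TS_FMT)
    stop = datetime.strptime(record["metaStopTimeUTC"], TS_FMT)
    record["mpeg2DurSec"] = record["durSec"]           # preserve original value
    record["durSec"] = (stop - start).total_seconds()  # airtime-based duration
    return record

rec = {"metaStartTimeUTC": "2024-01-01 06:00:00",
       "metaStopTimeUTC": "2024-01-01 06:30:00",
       "durSec": 144000}  # corrupted header claims 40 hours
fixed = repair_duration(rec)
print(fixed["durSec"], fixed["mpeg2DurSec"])  # 1800.0 144000
```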