The GDELT Project

Using Our BigQuery + Bigtable + GCS Digital Twin To Track Historical Backfilling Progress

With our new BigQuery + Bigtable digital twin over our GCS archive, we can trivially compile ongoing inventories of our archive over time and perform whole-of-archive sweeps to identify and retry failed ingests. For example, we can use the following SQL to compile a breakdown by status code:

select status, count(1) cnt from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.status') status, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)]  from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 ) 
)  group by status order by cnt desc

Yielding:

status cnt
SUCCESS 7511136
FAILPERM_CORRUPT 38507
FAILTMP_SRTDOWNFAIL 5998
FAILPERM_AVCHECKCORRUPT 3442
FAILPERM_NOEXISTS 1029
FAILTMP_SRTNOTREADY 21
FAILTMP_VIDDOWN 14
FAILTMP_NOVALIDVIDEOFORMAT 13
READY 11
FAILTMP_VIDMD5MISMATCH 11
FAILTMP_VIDNOEXIST 4
FAILTMP_GCSRENAME 2

Similarly, we can compile a by-channel breakdown:

select chan, count(1) cnt from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.chan') chan, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)]  from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 ) 
)  group by chan order by cnt desc

This returns 327 total channels. However, at least a portion of these channels were brief monitoring experiments consisting of only a handful of shows that encountered technical issues. Thus, to arrive at the total number of channels containing at least one successfully monitored broadcast we can simply add a status filter:

select chan, count(1) cnt from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.chan') chan, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)]  from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 )  where JSON_EXTRACT_SCALAR(DOWN, '$.status') = 'SUCCESS'
)  group by chan order by cnt desc

This yields the ultimate total: 260 channels!