The GDELT Project

Using Our BigQuery + Bigtable + GCS Digital Twin To Identify Missing Channels

One of the most powerful aspects of our BigQuery-analyzable Bigtable-based GCS digital twin is the capability it makes possible to ask archive-scale questions like when did each channel switch to HD, the median resolution of active channels by day, diagnostic code breakdowns and even overall geographic growth. This same ability to use simple SQL queries to examine the entire archive makes it possible to perform quality assurance tasks. For example, historically the Visual Explorer only included the Television News Archive's 100 largest channels by monitored broadcasts. Yet there are at least 100 more channels which have been monitored for brief periods of time, such as around an election cycle or as a brief experiment. How can we trivially compile a list of these missing channels? Given that our Bigtable-based digital twin includes an entry for every single broadcast seen by the Archive and marks those that are not currently included in the Visual Explorer as "unconfigured", we can use a trivial SQL query to make a list of all of these channels, the number of broadcasts preserved by the Archive for each and a randomly selected set of sample shows from each for human review and addition, as seen below.

select chan, STRING_AGG(id ORDER BY id DESC LIMIT 5) sampleshows, count(1) cnt from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.status') status, JSON_EXTRACT_SCALAR(DOWN, '$.chan') chan, JSON_EXTRACT_SCALAR(DOWN, '$.id') id, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)]  from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 
 )  
)  where status='FAILTMP_NOCONFIGJSONCHANFILTER' group by chan order by cnt desc