The GDELT Project

Compiling A List Of All Non-News Broadcasts In The TV News Archive From Business Channels Over The Past Decade: Part 2

Last month we compiled a list of distinct show names marked as non-news from each of the three business news channels monitored by the Internet Archive's TV News Archive in order to update it, since it can become outdated over time. Our original query just provided a list of show names, making it hard for a human reviewer to evaluate ambiguous show titles. How might we update our original query to provide a list of sample broadcasts for each show and even go a step further by sorting those shows by various criteria? Bigtable's BigQuery integration means that we have access to BigQuery's full range of aggregation and other features, making this as trivial as a few tiny edits to our original SQL query!

As a reminder, this was our original query that returns only the show titles:

select showName, count(1) cnt from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 ) 
) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc

By extracting the show IDs and adding a simple STRING_AGG() we can additionally return a random selection of 10 shows for each show match:

select showName, count(1) cnt, STRING_AGG(id ORDER BY RAND() LIMIT 10) sampShows from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, JSON_EXTRACT_SCALAR(DOWN, '$.id') id, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 ) 
) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc

But what if we want to select the sample shows by a specific criterion, rather than merely randomly choosing them? We can do this trivially by simply changing the "ORDER BY" parameter to any metadata field we record for each broadcast. For example, to return the longest 10 broadcasts for each show title, we can use:

select showName, count(1) cnt, STRING_AGG(id ORDER BY durSec desc LIMIT 10) sampShows from (
select 
  JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, JSON_EXTRACT_SCALAR(DOWN, '$.id') id, JSON_EXTRACT_SCALAR(DOWN, '$.durSec') durSec, DOWN
   FROM (
  SELECT  
    rowkey,
    ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN
  FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000
 ) 
) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc