Last month we compiled a list of distinct show names marked as non-news from each of the three business news channels monitored by the Internet Archive's TV News Archive in order to update it, since it can become outdated over time. Our original query just provided a list of show names, making it hard for a human reviewer to evaluate ambiguous show titles. How might we update our original query to provide a list of sample broadcasts for each show and even go a step further by sorting those shows by various criteria? Bigtable's BigQuery integration means that we have access to BigQuery's full range of aggregation and other features, making this as trivial as a few tiny edits to our original SQL query!
As a reminder, this was our original query that returns only the show titles:
select showName, count(1) cnt from ( select JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, DOWN FROM ( SELECT rowkey, ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 ) ) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc
By extracting the show IDs and adding a simple STRING_AGG() we can additionally return a random selection of 10 shows for each show match:
select showName, count(1) cnt, STRING_AGG(id ORDER BY RAND() LIMIT 10) sampShows from ( select JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, JSON_EXTRACT_SCALAR(DOWN, '$.id') id, DOWN FROM ( SELECT rowkey, ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 ) ) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc
But what if we want to select the sample shows by a specific criterion, rather than merely randomly choosing them? We can do this trivially by simply changing the "ORDER BY" parameter to any metadata field we record for each broadcast. For example, to return the longest 10 broadcasts for each show title, we can use:
select showName, count(1) cnt, STRING_AGG(id ORDER BY durSec desc LIMIT 10) sampShows from ( select JSON_EXTRACT_SCALAR(DOWN, '$.metaProgram') showName, JSON_EXTRACT_SCALAR(DOWN, '$.id') id, JSON_EXTRACT_SCALAR(DOWN, '$.durSec') durSec, DOWN FROM ( SELECT rowkey, ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN FROM `[PROJECTID].bigtableconnections.digtwin` where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 ) ) where JSON_EXTRACT_SCALAR(DOWN, '$.status') like '%FAILPERM_NOTNEWS%' AND JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('FBC') group by showName order by cnt desc