Our massive new collaboration with the Internet Archive to OCR its complete quarter-century Television News Archive spans ten million broadcasts from 300 channels in 50 countries in more than 150 languages. The sheer scale of this initiative and the considerable technical and methodological engineering that has gone into making it tractable mean that it is also highly experimental, from its novel montaging workflow to its underlying technical infrastructure. That experimental nature means we are constantly evolving and tuning various pieces of the system, which sometimes requires rerunning a selection of shows to address a novel edge case we've discovered.
Our Bigtable+BigQuery+GCS digital twin makes these kinds of archive-scale subsamples trivial. For example, to find post-2020 broadcasts from the Belarusian, Russian and Ukrainian channels that encountered a fatal error that prevented them from even being montaged, we can use the following SQL in BigQuery, which runs against our full Bigtable digital twin and returns the list of missing broadcast IDs. Note how it fetches both the DOWN and OCR records associated with each broadcast and keeps only those broadcasts whose DOWN record was successful but that have no corresponding OCR record:
(
  select id from (
    select JSON_EXTRACT_SCALAR(DOWN, '$.id') id, OCR, DOWN FROM (
      SELECT rowkey,
        ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('DOWN') ) DOWN,
        ( select array(select value from unnest(cell))[OFFSET(0)] from unnest(cf.column) where name in ('OCR') ) OCR
      FROM `[PROJECTID].bigtableconnections.digtwin`
      where SAFE_CAST(substr(rowkey, 0, 8) as NUMERIC) > 20000000 and rowkey like '202%'
    )
  )
  where JSON_EXTRACT_SCALAR(DOWN, '$.status') = 'SUCCESS'
    and JSON_EXTRACT_SCALAR(OCR, '$.status') is null
    and JSON_EXTRACT_SCALAR(DOWN, '$.chan') in ('RUSSIA1', 'RUSSIA24', 'RT', 'NTV', '1TV', 'ESPRESO', 'BELARUSTV')
)
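To write the resulting list of IDs straight to GCS as newline-delimited JSON, one option is to wrap the query above in BigQuery's EXPORT DATA statement. The sketch below is purely illustrative: the bucket path is a placeholder and a trivial stand-in SELECT marks where the full query would go:

EXPORT DATA OPTIONS (
  uri = 'gs://[BUCKET]/records-*.json',   -- placeholder bucket; the '*' wildcard is required for sharded output
  format = 'JSON',                        -- written as newline-delimited JSON, one object per line
  overwrite = true
) AS
-- stand-in for the full digital twin query above, which would go here unchanged:
select 'EXAMPLE_BROADCAST_ID' as id;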
With the resulting list of IDs exported to GCS as a newline-delimited JSON (JSONL) file directly from BigQuery, we then extract the IDs from the JSON and rewrap them in the JSON task order record format used by our orchestration infrastructure:
# Download the exported list of broadcast IDs from GCS...
gcloud storage cp gs://[BUCKET]/records.json .
# ...rewrap each record in the task order format and shuffle into the final queue file:
rm tmp.json; cat records.json | jq -c '. + {"mode":"MONTAGE", "queueServer":"[QUEUESERVER]", "ocrQueue":"IAVIDOCRBULK_OCR"}' > tmp.json
shuf tmp.json >> TOQUEUE.json
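For reference, if an exported row is {"id":"[BROADCASTID]"} (the ID shown is just a placeholder), its rewrapped task order record in TOQUEUE.json looks like:

{"id":"[BROADCASTID]","mode":"MONTAGE","queueServer":"[QUEUESERVER]","ocrQueue":"IAVIDOCRBULK_OCR"}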
The resulting TOQUEUE.json file is then passed to the queuing importer to be loaded into the task queues.
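The queuing importer itself is part of our internal orchestration infrastructure, but conceptually it simply appends each task order record onto the named queue on the queue server. Purely as an illustration, and assuming a Redis-style list queue (an assumption, not a description of the actual importer), the equivalent operation would look something like:

# Illustrative only: assumes the task queue is a Redis list, which may not match the real importer.
while read -r record; do
  redis-cli -h [QUEUESERVER] RPUSH IAVIDOCRBULK_OCR "$record"
done < TOQUEUE.json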