Using BigQuery's New Limit-Free DML To Create Our Covid-19 Snippet Datasets

Over the past week and a half we've released two massive new datasets of news snippets (television news and online news) to support research into the global media narrative around Covid-19, from the disease itself to the falsehoods, shortages and price disruptions it has wreaked across the world. Underpinning the creation of these two datasets and their BigQuery tables are BigQuery's new limit-free DML capabilities.

The television news snippets dataset is actually created simply by querying the public TV API 2.0 using its CSV output format. However, since television news broadcasts can take up to 72 hours to complete processing in the Internet Archive's Television News Archive, the most recent three days of data are in a state of constant flux. In the past this meant that BigQuery tables for our television news datasets required a four-day rolling update window to ensure each day was loaded only once it was complete, since BigQuery tables were originally append-only and, until recently, enforced significant limitations on updates.

With BigQuery's new limit-free DML capabilities, the television news snippets BigQuery table is partitioned by day. Each morning we simply copy the updated CSV files to GCS, issue a DELETE command to remove all of the data for each of the previous three days and then insert the new records using BigQuery's external temporary table support, referencing the CSV files in GCS. In fact, as part of the loading process we even convert the CSV-format timestamp field into a native TIMESTAMP field on-the-fly as the records are loaded, using "PARSE_TIMESTAMP("%m/%d/%Y %X", MatchDateTime)".
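To make the conversion concrete, here is a minimal standalone example of that expression that can be pasted into the BigQuery console (the sample value is made up for illustration rather than taken from the dataset):

SELECT PARSE_TIMESTAMP("%m/%d/%Y %X", "03/30/2020 14:05:30") AS MatchDateTimeStamp

This returns the TIMESTAMP value 2020-03-30 14:05:30 UTC, with "%X" matching the HH:MM:SS time portion of the string.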

The final workflow thus becomes (the WHERE clause at the end of the INSERT removes the header rows from the API-produced CSV files):

bq -q --headless=true --project_id [YOURPROJECTID] query --use_legacy_sql=false 'DELETE FROM `[YOURTABLE]` where DATE(MatchDateTimeStamp) = "[THEDATE]" '
bq -q --headless=true --project_id [YOURPROJECTID] query --external_table_definition="data::URL:STRING,MatchDateTime:STRING,Station:STRING,Show:STRING,IAShowID:STRING,IAPreview:STRING,Snippet:STRING@CSV=gs://YOURGCSBUCKET/PATH/[THEDATE]-*.csv" --use_legacy_sql=false 'INSERT INTO `[YOURTABLE]` select *, PARSE_TIMESTAMP("%m/%d/%Y %X", MatchDateTime) MatchDateTimeStamp from `data` where URL like "%archive.org%" '
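Tying it together, a daily driver script could look something like the sketch below. This is a hypothetical illustration rather than our exact production script: it assumes GNU date and gsutil are available, that the CSV files are named by date in YYYY-MM-DD form as implied by the commands above, and that the bracketed placeholders are filled in.

#!/bin/sh
# Hypothetical sketch: refresh the three most recent days, whose broadcasts may still be reprocessing.
for DAYSAGO in 1 2 3; do
    THEDATE=$(date -u -d "${DAYSAGO} days ago" +%Y-%m-%d)
    # Stage the regenerated CSV files for that day into GCS (the local path is illustrative).
    gsutil -m cp "/data/tvsnippets/${THEDATE}-"*.csv gs://YOURGCSBUCKET/PATH/
    # Delete the day's existing rows from the partitioned table...
    bq -q --headless=true --project_id [YOURPROJECTID] query --use_legacy_sql=false "DELETE FROM \`[YOURTABLE]\` WHERE DATE(MatchDateTimeStamp) = '${THEDATE}'"
    # ...then reload them from the CSVs in GCS via an external temporary table.
    bq -q --headless=true --project_id [YOURPROJECTID] query --external_table_definition="data::URL:STRING,MatchDateTime:STRING,Station:STRING,Show:STRING,IAShowID:STRING,IAPreview:STRING,Snippet:STRING@CSV=gs://YOURGCSBUCKET/PATH/${THEDATE}-*.csv" --use_legacy_sql=false "INSERT INTO \`[YOURTABLE]\` SELECT *, PARSE_TIMESTAMP('%m/%d/%Y %X', MatchDateTime) MatchDateTimeStamp FROM \`data\` WHERE URL LIKE '%archive.org%'"
done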

Since the TV API 2.0's CSV files use Excel-style dates, we have to convert them to BigQuery's native TIMESTAMP format, which we do on-the-fly above using the PARSE_TIMESTAMP() function. The final table schema can be seen below (NOTE that this includes the extra "MatchDateTimeStamp:TIMESTAMP" field at the end that is not present in the table definition schema used in the INSERT statement above, since the INSERT schema describes the CSV files rather than the schema of the table being inserted into). The table is partitioned on "MatchDateTimeStamp" by day.

URL:STRING,MatchDateTime:STRING,Station:STRING,Show:STRING,IAShowID:STRING,IAPreview:STRING,Snippet:STRING,MatchDateTimeStamp:TIMESTAMP
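For those creating a similar table of their own, a day-partitioned table with this schema could be created with something like the command below (a sketch reusing the bracketed placeholders above, plus a hypothetical [YOURDATASET]; it is not the exact command we ran):

bq mk --table --time_partitioning_field MatchDateTimeStamp --time_partitioning_type DAY --schema "URL:STRING,MatchDateTime:STRING,Station:STRING,Show:STRING,IAShowID:STRING,IAPreview:STRING,Snippet:STRING,MatchDateTimeStamp:TIMESTAMP" [YOURPROJECTID]:[YOURDATASET].[YOURTABLE]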

For the online news snippets dataset, the process is even simpler. Each morning we simply issue a cron-scheduled query that loads the previous day's entries from a temporary BigQuery table into the public snippets table, truncating snippets longer than 800 characters as it goes.
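In simplified form, that daily cron job boils down to a single bq invocation along the lines of the sketch below (the staging and destination table names and the column names here are hypothetical stand-ins rather than our actual schema):

bq -q --headless=true --project_id [YOURPROJECTID] query --use_legacy_sql=false 'INSERT INTO `[YOURPUBLICTABLE]` SELECT DateTime, URL, Title, SUBSTR(Snippet, 1, 800) Snippet FROM `[YOURSTAGINGTABLE]` WHERE DATE(DateTime) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)'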

We're tremendously excited by the possibilities of limit-free DML and are working to upgrade the rest of our BigQuery data ingest workflows with these exciting new capabilities, so stay tuned!