Using BigQuery DML & External Temporary Tables To Perform Realtime Reformatting Inserts For Television News Ngrams

As we prepare the 2.0 release of our television news ngrams dataset, a guiding design goal has been to insert each show's ngrams into the BigQuery tables the moment the show finishes processing. Complicating this is the fact that it can occasionally take up to three days for the Internet Archive to finish processing a show, which in the past meant we waited until every show on a given day had finished processing before inserting into BigQuery, leaving the BigQuery tables perpetually several days behind our downloadable TSV files. With BigQuery's unlimited insert DML support, coupled with the ability to construct temporary external tables that treat CSV files in GCS as live queryable tables, it is now possible to live-insert our data into BigQuery.

Two challenges remain: handling TSV rather than the CSV files supported natively by the inline DML workflow, and reformatting the ngram dataset's YYYYMMDD date and HHMM time columns into a native BigQuery TIMESTAMP column to support date-based partitioning and BigQuery's many advanced date/time operators.

Ngram datasets are typically distributed in TSV format, and there is a vast landscape of software packages, scripts and applications built for tab-delimited ngram datasets in which each row is a word with a count column and a set of attribute columns. It was important that the 2.0 ngram dataset remain compatible with this landscape of tools, as well as with tools built for our 1.0 ngram dataset. BigQuery natively supports querying CSV files in GCS using its inline table definition format, but TSV files require changing the field delimiter from a comma to a tab.

It turns out that BigQuery makes this incredibly simple. You can use the "bq mkdef" command with an inline field definition list to construct an external table definition file (the empty "" is the source URI, which we fill in below):

bq mkdef --noautodetect --source_format=CSV "" "DATE:INTEGER,STATION:STRING,HOUR:INTEGER,NGRAM:STRING,COUNT:INTEGER,SHOW:STRING" > BQSCHEMA.JSON

This produces the table definition file needed by BigQuery. Edit the BQSCHEMA.JSON file to change "fieldDelimiter" from "," to "\t" to support tab-delimited files.

You will also have to edit the "sourceUris" field to list the specific GCS TSV file(s) you wish to insert.
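If you'd rather script those two edits than make them by hand, a one-liner along the following lines should work, assuming you have the jq utility installed and substituting a placeholder bucket path (neither is part of our actual pipeline):

jq '.csvOptions.fieldDelimiter = "\t" | .sourceUris = ["gs://yourbucket/yourfile.tsv"]' BQSCHEMA.JSON > BQSCHEMA.TMP.JSON && mv BQSCHEMA.TMP.JSON BQSCHEMA.JSON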

The final result becomes:

{
  "csvOptions": {
    "allowJaggedRows": false,
    "allowQuotedNewlines": false,
    "encoding": "UTF-8",
    "fieldDelimiter": "\t",
    "quote": "\"",
    "skipLeadingRows": 0
  },
  "schema": {
    "fields": [
      {
        "name": "DATE",
        "type": "INTEGER"
      },
      {
        "name": "STATION",
        "type": "STRING"
      },
      {
        "name": "HOUR",
        "type": "INTEGER"
      },
      {
        "name": "NGRAM",
        "type": "STRING"
      },
      {
        "name": "COUNT",
        "type": "INTEGER"
      },
      {
        "name": "SHOW",
        "type": "STRING"
      }
    ]
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://yourbucket/yourfile.tsv"
  ]
}

Now you can use the file to perform the insertion like:

bq -q --headless=true --project_id [YOURPROJECTID] query --external_table_definition="data::BQSCHEMA.JSON" --use_legacy_sql=false 'INSERT INTO `gdelt-bq.gdeltv2.iatv_1gramsv2` SELECT DATE, STATION, HOUR, NGRAM, COUNT, SHOW FROM `data`'

In our case, the ngram engine uses the BQSCHEMA.JSON as a template and simply replaces the "sourceUris" field with the ngram file for each insert.
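As a rough illustration of that templating approach (not our actual production code), a small shell loop like the one below could stamp out a per-file definition and run the insert for each TSV file in a bucket. The TEMPLATE.JSON filename and the bucket path are placeholders, and it assumes jq and the Cloud SDK tools are available:

# loop over every TSV file in the bucket
for F in $(gsutil ls gs://yourbucket/*.tsv); do
    # write a per-file table definition by swapping in this file's URI
    jq --arg uri "$F" '.sourceUris = [$uri]' TEMPLATE.JSON > BQSCHEMA.JSON
    # insert this file's rows into the destination table
    bq -q --headless=true --project_id [YOURPROJECTID] query --external_table_definition="data::BQSCHEMA.JSON" --use_legacy_sql=false 'INSERT INTO `gdelt-bq.gdeltv2.iatv_1gramsv2` SELECT DATE, STATION, HOUR, NGRAM, COUNT, SHOW FROM `data`'
done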

How do we handle the timestamp issue?

If you look closely at the insertion command above, BigQuery's DML INSERT statements take the form of a standard BigQuery query against a temporary external table created on-the-fly from the TSV file. This means we can simply add a new TIMESTAMP column to our BigQuery table and parse the date for each row on the fly during the insertion.

Thus, we create our BigQuery table with the same columns as above, plus a TIMESTAMP field added at the front:

TIMESTAMP:TIMESTAMP,DATE:INTEGER,STATION:STRING,HOUR:INTEGER,NGRAM:STRING,COUNT:INTEGER,SHOW:STRING

NOTE that we don't change our BQSCHEMA.JSON file, since that defines the schema of our TSV files, not the BigQuery table.
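For reference, a destination table along these lines could be created with "bq mk"; this is only a sketch, assuming day-level partitioning on the new TIMESTAMP column and a placeholder dataset and table name rather than the exact command used for our production table:

bq mk --table --time_partitioning_field TIMESTAMP --time_partitioning_type DAY yourdataset.yourtable TIMESTAMP:TIMESTAMP,DATE:INTEGER,STATION:STRING,HOUR:INTEGER,NGRAM:STRING,COUNT:INTEGER,SHOW:STRING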

Parsing our YYYYMMDD date field and HHMM time field from the ngram dataset is a simple matter of using "PARSE_TIMESTAMP", appending "00" as the seconds:

PARSE_TIMESTAMP("%Y%m%d%H%M%S",CONCAT(CAST(DATE AS STRING), FORMAT("%04d", HOUR) , "00")) as TIMESTAMP

We also zero-pad the HOUR field to four digits with FORMAT("%04d", HOUR), since otherwise "3:20 AM UTC" is represented as "320"; the padding recasts it as "0320".
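If you want to sanity-check the expression before wiring it into the insert, a quick standalone query with hard-coded sample values (chosen purely for illustration) shows the reformatting:

bq query --use_legacy_sql=false 'SELECT PARSE_TIMESTAMP("%Y%m%d%H%M%S", CONCAT(CAST(20190101 AS STRING), FORMAT("%04d", 320), "00"))'

This should come back as 2019-01-01 03:20:00 UTC.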

Thus, the final insertion statement becomes:

bq -q --headless=true --project_id [YOURPROJECTID] query --external_table_definition="data::BQSCHEMA.JSON" --use_legacy_sql=false 'INSERT INTO `gdelt-bq.gdeltv2.iatv_1gramsv2` SELECT PARSE_TIMESTAMP("%Y%m%d%H%M%S",CONCAT(CAST(DATE AS STRING), FORMAT("%04d", HOUR), "00")) AS TIMESTAMP, DATE, STATION, HOUR, NGRAM, COUNT, SHOW FROM `data`'

That's all there is to it!