Using BigQuery DML & External Temporary Table Querying For Realtime Updates

Over the coming weeks GDELT's television-related BigQuery tables will begin loading data live as it is processed (as it rolls off the 24-hour rolling delay), rather than daily. Because television shows can complete processing out of order and up to 72 hours after airing, GDELT historically live-updated its downloadable TSV/JSON files, but updated the BigQuery version of each dataset only once daily and after a rolling delay of 4 days, ensuring that all videos had completed processing by the time the dataset was loaded into BigQuery.

BigQuery continues to evolve by leaps and bounds, sometimes adding new features every few days. The convergence of BigQuery's advanced DML support with its powerful external temporary table querying means it is now possible to have BigQuery load an update file directly from GCS into a production native table with a single call to its CLI. This means that GDELT can simply point BigQuery at each update file as it becomes available and load it directly into the native production table using the following syntax.

Simply create a file on local disk called "YOURTABLESCHEMA.JSON" that contains the JSON table schema used to create the BigQuery table. Then use the following command to load a newline-delimited JSON file in that format directly from GCS.
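For reference, a BigQuery schema file is just a JSON array of column definitions. A minimal sketch is below; the field names and types are invented for illustration and should be replaced with the actual columns of your table.

```json
[
  {"name": "showName",  "type": "STRING",    "mode": "NULLABLE"},
  {"name": "airDate",   "type": "TIMESTAMP", "mode": "NULLABLE"},
  {"name": "captions",  "type": "STRING",    "mode": "NULLABLE"}
]
```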

time bq --project_id YOURPROJECTID query --external_table_definition="data::./YOURTABLESCHEMA.JSON@NEWLINE_DELIMITED_JSON=gs://GCSPATH/FILE.JSON.GZ" --use_legacy_sql=false 'insert into `YOURDATASET.YOURTABLE` select * from `data`'
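To run this load automatically for each new update file, the command can be assembled programmatically and handed to a process runner. The sketch below is a hypothetical helper (all names, paths, and table identifiers are invented for illustration); it only builds the argument list in the same shape as the command above, which a scheduler could then execute with subprocess.run for every file that appears in GCS.

```python
import shlex


def build_bq_load_command(project_id, schema_path, gcs_uri, table):
    """Assemble the bq CLI invocation that inserts a newline-delimited
    JSON file from GCS into a native table via an external table definition."""
    # The external table definition names the temporary table "data",
    # points at the local schema file, and identifies the GCS source file.
    external_def = "data::{schema}@NEWLINE_DELIMITED_JSON={uri}".format(
        schema=schema_path, uri=gcs_uri)
    # Standard SQL DML: append every row of the external file to the table.
    sql = "insert into `{table}` select * from `data`".format(table=table)
    return [
        "bq", "--project_id", project_id, "query",
        "--external_table_definition=" + external_def,
        "--use_legacy_sql=false",
        sql,
    ]


# Hypothetical values purely for demonstration.
cmd = build_bq_load_command(
    "myproject", "./YOURTABLESCHEMA.JSON",
    "gs://mybucket/update.json.gz", "mydataset.mytable")
print(" ".join(shlex.quote(c) for c in cmd))
```

In a production pipeline the returned list would be passed to subprocess.run (or a job queue) once per update file, mirroring the one-call-per-file pattern described above.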

We're tremendously excited by the newfound possibilities this realtime updating will enable!