Using BigQuery Table Decorators To Lower Query Cost

As GDELT has grown at an exponential rate, the size and complexity of its data archives mean that ever greater computing power is required to work with them. The GDELT 1.0 Events database now totals 376 million records, mentioned a total of 3.7 billion times, while the GDELT 2.0 Events database, which incorporates the massive firehose of local news coverage GDELT monitors and mass machine-translates in realtime from 65 languages, totals 123 million records, mentioned over 388 million times, in just the past year.

Yet perhaps the most difficult GDELT dataset to work with is the GDELT GKG 2.0 archive, which spans 266 million articles monitored in just the past year, with over 2.65TB of computed and extracted metadata. Google BigQuery is perhaps the only query platform in the world today capable of near-realtime exploration of the full GDELT archive, from one-minute mapping to co-occurrence network construction to sentiment mining at 341 million words per second to synthesizing macro-level trends from billions of location mentions and almost a trillion emotional assessments.

This past November we pulled together a cross-section of the BigQuery tutorials we've rolled out on this blog, but one of the challenges of working with GDELT in BigQuery is that it performs a full table scan of each column involved in a query. This means that even if your query filters for only those articles published yesterday, you will still be charged for the full disk bandwidth occupied by all 266 million records in each of those columns.

Happily, BigQuery offers a powerful solution to this known as "table decorators." In a nutshell, a table decorator is a set of flags you add to the table reference that tell BigQuery to look at only the most recent records, ignoring older records as if they don't exist. At the time of this writing, you can tell BigQuery to look at up to the most recent 7 days of records in a given table. Using a table decorator set to the last 7 days on the GKG 2.0 table, BigQuery treats the table as if it were just 54GB instead of 2.65TB, making queries return in a fraction of the time and at a fraction of the cost of scanning the entire table.

Table decorators express time in milliseconds, so to query the GKG over different time intervals (up to a maximum of the last 7 days), use the following:

  • Last Hour: SELECT count(1) FROM [gdelt-bq:gdeltv2.gkg@-3600000-]
  • Last Day: SELECT count(1) FROM [gdelt-bq:gdeltv2.gkg@-86400000-]
  • Last 7 Days: SELECT count(1) FROM [gdelt-bq:gdeltv2.gkg@-604800000-]
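
The decorator value is simply the length of the window in milliseconds (hours × 3,600,000 or days × 86,400,000), so any interval up to the 7-day limit can be constructed the same way. For example, a hypothetical last-3-days query would use 3 × 86,400,000 = 259,200,000:

  • Last 3 Days: SELECT count(1) FROM [gdelt-bq:gdeltv2.gkg@-259200000-]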

For example, take the query below from our Sample Queries, which compiles a histogram of the top themes appearing in coverage of Greece. (The complex "DATE>…" statement is just computing the date/time of 7 days ago.) It takes around 14 seconds to complete and processes 311GB:

SELECT theme, COUNT(*) AS count
FROM (
  SELECT UNIQUE(REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*', '')) theme
  FROM [gdelt-bq:gdeltv2.gkg]
  WHERE V2Locations LIKE '%Greece%' AND DATE > INTEGER(STRFTIME_UTC_USEC(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -7, 'DAY'), "%Y%m%d%H%M%S"))
)
GROUP BY theme
HAVING count > 10
ORDER BY 2 DESC
LIMIT 300

Despite filtering for only the last 7 days' worth of data, the query above must still process the entire last year's worth of data, since it performs a full table scan on all involved columns (in this case both V2Themes and V2Locations).

If we instead use a table decorator for the last 7 days, BigQuery looks at just the last 7 days' worth of data, so the exact same query now requires just 3.5 seconds and consumes just 6.28GB of data:

SELECT theme, COUNT(*) AS count
FROM (
  SELECT UNIQUE(REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*', '')) theme
  FROM [gdelt-bq:gdeltv2.gkg@-604800000-]
  WHERE V2Locations LIKE '%Greece%'
)
GROUP BY theme
HAVING count > 10
ORDER BY 2 DESC
LIMIT 300
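
The same pattern works for any window up to the 7-day maximum; for instance, to restrict the same theme histogram to just the last 24 hours of coverage, only the FROM clause of the subquery needs to change:

FROM [gdelt-bq:gdeltv2.gkg@-86400000-]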

As you can see, while table decorators are limited to the last 7 days' worth of data, and so are not useful for longer historical queries, they are enormously powerful for near-term queries that extract short-term trends or other details from the most recent hours up through the last week of data, vastly speeding up your queries while achieving massive cost savings in the amount of data scanned by a single query.