Announcing Partitioned GDELT BigQuery Tables

As GDELT has grown to truly extraordinary scale, the sheer magnitude of all of that data has made it increasingly difficult to query and interact with. The GDELT 1.0 Events table now contains 400 million records and is 144GB in size, the GDELT 2.0 GKG table contains 353 million records totaling 3.6TB, the GDELT 2.0 EventMentions table contains 501 million records totaling 104GB and the GDELT 2.0 Events table contains 159 million records totaling 63GB.

Google's BigQuery system is one of the few platforms today that makes it possible to interact with the entire GDELT dataset in near-realtime, performing complex analyses in seconds. However, GDELT's massive size means that a query examining a large number of columns of the GKG might consume a terabyte or more of BigQuery query quota with a single search. Since BigQuery performs a full table scan for each search, you will still consume a huge amount of query quota even when limiting your search to a small time window.

This is where BigQuery's newly added partitioned tables come to the rescue, in which BigQuery under the hood breaks a large table up into a massive collection of physically distinct tables organized by day. Just by adding a special additional SQL command to your search, you can instruct BigQuery to only search the specific days you are interested in, meaning your query quota consumes only the amount of data present on those specific days, rather than the entire 3.6TB GKG table.

All you have to do is to use the "_partitioned" version of each table and add the special _PARTITIONTIME >= TIMESTAMP("2016-06-01") syntax to your query (where TIMESTAMP is the specific start/stop day of data you want to search in YYYY-MM-DD format). See the BigQuery documentation for more details on how to use this command properly.

Take this simple query that counts up the total number of articles discussing Education every 15 minutes from June 1 to June 15 of this year and additionally the number of those articles that mention Syria. This query takes just 8.6 seconds to complete, but processes 423GB of data, consuming a fair bit of your query quota in a single go. Though, one has to admit that it is pretty incredible that BigQuery can crunch through 423GB of data in just 8.6 seconds!

SELECT DATE, sum(V2Locations like '%Syria%') nummatches, count(1) allarts FROM [gdelt-bq:gdeltv2.gkg] where V2Themes 
like '%EDUCATION%' and DATE>=20160601000000 AND DATE<=20160615999999 group by DATE

Here is the same query, but executed against the partitioned version of the table "gkg_partitioned" and adding in the _PARTITIONTIME sytnax. This query takes just 2 seconds and consumes just 15GB, since it looks at only the 15 days of GKG data relevant to the query. That's a decrease in query quota of more than 28 times and a speedup of fourfold!

SELECT DATE, sum(V2Locations like '%Syria%') nummatches, count(1) allarts FROM [gdelt-bq:gdeltv2.gkg_partitioned] where 
V2Themes like '%EDUCATION%' and DATE>=20160601000000 and _PARTITIONTIME >= TIMESTAMP("2016-06-01") 
AND DATE<=20160615999999 and _PARTITIONTIME <= TIMESTAMP("2016-06-15") group by DATE

For the moment we've made partitioned versions available of the GDELT 1.0 Events table and the GDELT 2.0 Events, EventMentions and GKG tables. We will be closely monitoring the other GDELT tables and making partitioned versions of them available as well as they grow too large for traditional querying.