The GDELT Project

Modernizing The 'Top Relationships By Year' And 'Protest Intensity' Queries From Legacy To Standard BigQuery SQL

In our original Google Cloud Platform blog post about GDELT, we offered two example SQL queries to get users started: one that computed the most common dyadic relationship of each year and one that computed the protest intensity of a selected country. Both queries were written in BigQuery's then-standard SQL dialect now called "Legacy SQL" whereas BigQuery today has standardized on "Standard SQL". What would both of these queries look like rewritten into today's Standard SQL?

The original dyadic relationship query looked like this in Legacy SQL:

SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
FROM 
(SELECT Actor1Name, Actor2Name,  Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),  (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name  and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year

Written in Standard SQL this query would look today like:

SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(1) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count(1) DESC) rank FROM (
    (SELECT Actor1Name, Actor2Name,  Year FROM `gdelt-bq.full.events` WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode)
    UNION ALL
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM `gdelt-bq.full.events` WHERE Actor1Name > Actor2Name  and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode)
)  WHERE Actor1Name IS NOT null AND Actor2Name IS NOT null GROUP BY 1, 2, 3 HAVING Count > 100
) WHERE rank=1 ORDER BY Year

Similarly, the protest intensity timeline code in Legacy SQL was:

SELECT MonthYear MonthYear, INTEGER(norm*100000)/1000 Percent
FROM (
SELECT ActionGeo_CountryCode, EventRootCode, MonthYear, COUNT(1) AS c, RATIO_TO_REPORT(c) OVER(PARTITION BY MonthYear ORDER BY c DESC) norm FROM [gdelt-bq:full.events]
GROUP BY ActionGeo_CountryCode, EventRootCode, MonthYear
)
WHERE ActionGeo_CountryCode='UP' and EventRootCode='14'
ORDER BY ActionGeo_CountryCode, EventRootCode, MonthYear;

Modernizing this to Standard SQL it now looks like this:

SELECT MonthYear, Perc from (
SELECT ActionGeo_CountryCode, EventRootCode, MonthYear, ( COUNT(1) / sum(count(*)) OVER(PARTITION BY MonthYear) ) * 100 Perc FROM `gdelt-bq.full.events` GROUP BY ActionGeo_CountryCode, EventRootCode, MonthYear
) WHERE ActionGeo_CountryCode='UP' and EventRootCode='14' ORDER BY MonthYear ASC

In both cases the new Standard SQL queries are largely the same, just more compact and efficient.