The GDELT Project

Using BigQuery's UNNEST To Unroll Count-Based Datasets

Some applications like Google's Timeseries Insights API require that count-based datasets be unrolled since they examine discrete events. For example, the Television News Global Entity Graph 2.0 records how many times a given entity was seen in a particular 15 second interval. Each row is an entity-timeslot pair, counting how many times a given entity was mentioned during those 15 seconds and recording that count under a "numMentions" field.

What if we query the dataset for all CNN entities mentioned at 4:15:15AM UTC on July 4, 2021 on CNN:

SELECT
  FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime,
  STRUCT( CONCAT('Entity',entity.type) as name, entity.name as value) dimensions, entity.numMentions count
FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) entity WHERE DATE(date) = "2021-07-04" and station='CNN' and TIMESTAMP(date)='2021-07-04T04:15:15+00:00'

This yields the following results in which you can see each of the entities were mentioned once in that 15 second period, except for Michael Jackson, who was mentioned 3 times:

Row eventTime dimensions.name dimensions.value count
1
2021-07-04T04:15:15+00:00
EntityORGANIZATION
MTV
1
2
2021-07-04T04:15:15+00:00
EntityPERSON
MAN
1
3
2021-07-04T04:15:15+00:00
EntityPERSON
MICHAEL JACKSON
3
4
2021-07-04T04:15:15+00:00
EntityOTHER
PRESSURE
1
5
2021-07-04T04:15:15+00:00
EntityORGANIZATION
CBS RECORDS
1
6
2021-07-04T04:15:15+00:00
EntityORGANIZATION
LABEL
1
7
2021-07-04T04:15:15+00:00
EntityOTHER
GOOD
1
8
2021-07-04T04:15:15+00:00
EntityNUMBER
80
1
9
2021-07-04T04:15:15+00:00
EntityNUMBER
A MILLION
1

To import this dataset into the Timeseries Insights API, we need to duplicate the Michael Jackson row 3 times so that the final dataset we load into the API has three rows for him and one row for each of the other entities.

It turns out we can accomplish this through a simple combination of UNNEST() and GENERATE_ARRAY():

select FARM_FINGERPRINT(GENERATE_UUID()) groupId, eventTime, dimensions, count from (
  SELECT
    FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime,
    STRUCT( CONCAT('Entity',entity.type) as name, entity.name as value) dimensions, entity.numMentions count
  FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) entity WHERE DATE(date) = "2021-07-04" and station='CNN' and TIMESTAMP(date)='2021-07-04T04:15:15+00:00'
), UNNEST(GENERATE_ARRAY(1, count))

This yields the new table:

Row groupId eventTime dimensions.name dimensions.value count
1
4823357217132463915
2021-07-04T04:15:15+00:00
EntityORGANIZATION
MTV
1
2
-5233750653571321975
2021-07-04T04:15:15+00:00
EntityPERSON
MAN
1
3
-5110034607981475138
2021-07-04T04:15:15+00:00
EntityPERSON
MICHAEL JACKSON
3
4
-848530543353905329
2021-07-04T04:15:15+00:00
EntityPERSON
MICHAEL JACKSON
3
5
-4077076410842308013
2021-07-04T04:15:15+00:00
EntityPERSON
MICHAEL JACKSON
3
6
1146849324958964462
2021-07-04T04:15:15+00:00
EntityOTHER
PRESSURE
1
7
5502283991676827286
2021-07-04T04:15:15+00:00
EntityORGANIZATION
CBS RECORDS
1
8
-2949216086261319355
2021-07-04T04:15:15+00:00
EntityORGANIZATION
LABEL
1
9
-4638770088839180697
2021-07-04T04:15:15+00:00
EntityOTHER
GOOD
1
10
-7864655930198391083
2021-07-04T04:15:15+00:00
EntityNUMBER
80
1
11
529212137132583830
2021-07-04T04:15:15+00:00
EntityNUMBER
A MILLION
1

Note how efficient this duplication is, in that it applies only to each matching entity and does not require any JOINs or other expensive operands!

That's all there is to it!