Using BigQuery’s UNNEST To Unroll Count-Based Datasets

Some applications, like Google's Timeseries Insights API, examine discrete events and therefore require count-based datasets to be unrolled into one row per event. For example, the Television News Global Entity Graph 2.0 records how many times a given entity was seen in a particular 15-second interval. Each entity-timeslot pair stores the number of times that entity was mentioned during those 15 seconds in a "numMentions" field.

Suppose we query the dataset for all entities mentioned on CNN at 4:15:15 AM UTC on July 4, 2021:

SELECT
  FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") AS eventTime,
  STRUCT(CONCAT('Entity', entity.type) AS name, entity.name AS value) AS dimensions,
  entity.numMentions AS count
FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) AS entity
WHERE DATE(date) = "2021-07-04"
  AND station = 'CNN'
  AND TIMESTAMP(date) = '2021-07-04T04:15:15+00:00'

This yields the following results, in which each entity was mentioned once in that 15-second period, except for Michael Jackson, who was mentioned 3 times:

Row	eventTime	dimensions.name	dimensions.value	count
1	2021-07-04T04:15:15+00:00	EntityORGANIZATION	MTV	1
2	2021-07-04T04:15:15+00:00	EntityPERSON	MAN	1
3	2021-07-04T04:15:15+00:00	EntityPERSON	MICHAEL JACKSON	3
4	2021-07-04T04:15:15+00:00	EntityOTHER	PRESSURE	1
5	2021-07-04T04:15:15+00:00	EntityORGANIZATION	CBS RECORDS	1
6	2021-07-04T04:15:15+00:00	EntityORGANIZATION	LABEL	1
7	2021-07-04T04:15:15+00:00	EntityOTHER	GOOD	1
8	2021-07-04T04:15:15+00:00	EntityNUMBER	80	1
9	2021-07-04T04:15:15+00:00	EntityNUMBER	A MILLION	1
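The comma followed by UNNEST(entities) in the query above is what turns each timeslot's array of entities into one row per entity. A minimal sketch of that step on a single hypothetical inline row follows. The `slots` table and its reduced schema are assumptions for illustration, not the real layout of the GDELT table:

```sql
-- Hypothetical one-row stand-in for a timeslot record (assumed, reduced schema).
WITH slots AS (
  SELECT
    TIMESTAMP '2021-07-04 04:15:15+00' AS date,
    [
      STRUCT('PERSON' AS type, 'MICHAEL JACKSON' AS name, 3 AS numMentions),
      STRUCT('ORGANIZATION' AS type, 'MTV' AS name, 1 AS numMentions)
    ] AS entities
)
-- The correlated cross join pairs the timeslot row with each element of its own array.
SELECT date, entity.type, entity.name, entity.numMentions
FROM slots, UNNEST(entities) AS entity;
```

The single timeslot row comes back as two rows, one per entity in its array, with the count still stored in numMentions rather than expressed as repeated rows.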

To import this dataset into the Timeseries Insights API, we need to repeat the Michael Jackson row so that the final dataset we load into the API has three rows for him and one row for each of the other entities.

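It turns out we can accomplish this through a simple combination of UNNEST() and GENERATE_ARRAY(). GENERATE_ARRAY(1, count) builds an array with one element per mention, and cross joining each row with its own unnested array repeats that row once per element: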

SELECT FARM_FINGERPRINT(GENERATE_UUID()) AS groupId, eventTime, dimensions, count
FROM (
  SELECT
    FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") AS eventTime,
    STRUCT(CONCAT('Entity', entity.type) AS name, entity.name AS value) AS dimensions,
    entity.numMentions AS count
  FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) AS entity
  WHERE DATE(date) = "2021-07-04"
    AND station = 'CNN'
    AND TIMESTAMP(date) = '2021-07-04T04:15:15+00:00'
  -- Cross joining with the unnested 1..count array repeats each row count times.
), UNNEST(GENERATE_ARRAY(1, count))

This yields the new table, in which Michael Jackson now appears three times, each time with its own groupId:

Row	groupId	eventTime	dimensions.name	dimensions.value	count
1	4823357217132463915	2021-07-04T04:15:15+00:00	EntityORGANIZATION	MTV	1
2	-5233750653571321975	2021-07-04T04:15:15+00:00	EntityPERSON	MAN	1
3	-5110034607981475138	2021-07-04T04:15:15+00:00	EntityPERSON	MICHAEL JACKSON	3
4	-848530543353905329	2021-07-04T04:15:15+00:00	EntityPERSON	MICHAEL JACKSON	3
5	-4077076410842308013	2021-07-04T04:15:15+00:00	EntityPERSON	MICHAEL JACKSON	3
6	1146849324958964462	2021-07-04T04:15:15+00:00	EntityOTHER	PRESSURE	1
7	5502283991676827286	2021-07-04T04:15:15+00:00	EntityORGANIZATION	CBS RECORDS	1
8	-2949216086261319355	2021-07-04T04:15:15+00:00	EntityORGANIZATION	LABEL	1
9	-4638770088839180697	2021-07-04T04:15:15+00:00	EntityOTHER	GOOD	1
10	-7864655930198391083	2021-07-04T04:15:15+00:00	EntityNUMBER	80	1
11	529212137132583830	2021-07-04T04:15:15+00:00	EntityNUMBER	A MILLION	1
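To experiment with the technique without querying the full table, the same pattern can be run against a few inline rows. This is only a sketch: the `mentions` table and the `entityName` and `copyIndex` names are hypothetical, and the sample values are copied from the results above:

```sql
-- Hypothetical inline sample standing in for the entity/count rows above.
WITH mentions AS (
  SELECT 'MTV' AS entityName, 1 AS numMentions UNION ALL
  SELECT 'MICHAEL JACKSON', 3 UNION ALL
  SELECT 'CBS RECORDS', 1
)
SELECT
  entityName,
  numMentions,
  copyIndex  -- position of this copy within the row's 1..numMentions array
FROM mentions, UNNEST(GENERATE_ARRAY(1, numMentions)) AS copyIndex
ORDER BY entityName, copyIndex;
```

MICHAEL JACKSON comes back three times (copyIndex 1, 2 and 3) and the other two entities once each, for five rows in total. Naming the unnested element is optional. The full query leaves it unnamed because only the repetition matters.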

Note how efficient this duplication is: the array is generated and expanded only for each matching entity row, so it does not require any JOINs against another table or other expensive operations!
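One consequence of expanding each row by its own count is worth checking before reusing this pattern on other count-based datasets: a row whose count is 0 or NULL expands to nothing and drops out of the result, because GENERATE_ARRAY(1, 0) is an empty array and a NULL bound produces a NULL array. The sketch below uses hypothetical inline rows to show this. The zero and NULL counts are made up for illustration and are not taken from the GDELT data:

```sql
-- Hypothetical inline sample; the zero and NULL counts are illustrative only.
WITH mentions AS (
  SELECT 'MICHAEL JACKSON' AS entityName, 3 AS numMentions UNION ALL
  SELECT 'ZERO COUNT', 0 UNION ALL
  SELECT 'NULL COUNT', CAST(NULL AS INT64)
)
SELECT entityName, numMentions
FROM mentions, UNNEST(GENERATE_ARRAY(1, numMentions))
ORDER BY entityName;
```

Only the three MICHAEL JACKSON rows come back. For event data this is usually the desired behavior, since zero mentions means there are no events to load.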

That's all there is to it!

The GDELT Project
