Some applications like Google's Timeseries Insights API require that count-based datasets be unrolled since they examine discrete events. For example, the Television News Global Entity Graph 2.0 records how many times a given entity was seen in a particular 15 second interval. Each row is an entity-timeslot pair, counting how many times a given entity was mentioned during those 15 seconds and recording that count under a "numMentions" field.
What if we query the dataset for all CNN entities mentioned at 4:15:15AM UTC on July 4, 2021 on CNN:
SELECT FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime, STRUCT( CONCAT('Entity',entity.type) as name, entity.name as value) dimensions, entity.numMentions count FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) entity WHERE DATE(date) = "2021-07-04" and station='CNN' and TIMESTAMP(date)='2021-07-04T04:15:15+00:00'
This yields the following results in which you can see each of the entities were mentioned once in that 15 second period, except for Michael Jackson, who was mentioned 3 times:
Row | eventTime | dimensions.name | dimensions.value | count | |
---|---|---|---|---|---|
1 |
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
MTV
|
1
|
|
2 |
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MAN
|
1
|
|
3 |
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MICHAEL JACKSON
|
3
|
|
4 |
2021-07-04T04:15:15+00:00
|
EntityOTHER
|
PRESSURE
|
1
|
|
5 |
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
CBS RECORDS
|
1
|
|
6 |
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
LABEL
|
1
|
|
7 |
2021-07-04T04:15:15+00:00
|
EntityOTHER
|
GOOD
|
1
|
|
8 |
2021-07-04T04:15:15+00:00
|
EntityNUMBER
|
80
|
1
|
|
9 |
2021-07-04T04:15:15+00:00
|
EntityNUMBER
|
A MILLION
|
1
|
To import this dataset into the Timeseries Insights API, we need to duplicate the Michael Jackson row 3 times so that the final dataset we load into the API has three rows for him and one row for each of the other entities.
It turns out we can accomplish this through a simple combination of UNNEST() and GENERATE_ARRAY():
select FARM_FINGERPRINT(GENERATE_UUID()) groupId, eventTime, dimensions, count from ( SELECT FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", date, "UTC") eventTime, STRUCT( CONCAT('Entity',entity.type) as name, entity.name as value) dimensions, entity.numMentions count FROM `gdelt-bq.gdeltv2.gegv2_iatv`, UNNEST(entities) entity WHERE DATE(date) = "2021-07-04" and station='CNN' and TIMESTAMP(date)='2021-07-04T04:15:15+00:00' ), UNNEST(GENERATE_ARRAY(1, count))
This yields the new table:
Row | groupId | eventTime | dimensions.name | dimensions.value | count | |
---|---|---|---|---|---|---|
1 |
4823357217132463915
|
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
MTV
|
1
|
|
2 |
-5233750653571321975
|
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MAN
|
1
|
|
3 |
-5110034607981475138
|
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MICHAEL JACKSON
|
3
|
|
4 |
-848530543353905329
|
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MICHAEL JACKSON
|
3
|
|
5 |
-4077076410842308013
|
2021-07-04T04:15:15+00:00
|
EntityPERSON
|
MICHAEL JACKSON
|
3
|
|
6 |
1146849324958964462
|
2021-07-04T04:15:15+00:00
|
EntityOTHER
|
PRESSURE
|
1
|
|
7 |
5502283991676827286
|
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
CBS RECORDS
|
1
|
|
8 |
-2949216086261319355
|
2021-07-04T04:15:15+00:00
|
EntityORGANIZATION
|
LABEL
|
1
|
|
9 |
-4638770088839180697
|
2021-07-04T04:15:15+00:00
|
EntityOTHER
|
GOOD
|
1
|
|
10 |
-7864655930198391083
|
2021-07-04T04:15:15+00:00
|
EntityNUMBER
|
80
|
1
|
|
11 |
529212137132583830
|
2021-07-04T04:15:15+00:00
|
EntityNUMBER
|
A MILLION
|
1
|
Note how efficient this duplication is, in that it applies only to each matching entity and does not require any JOINs or other expensive operands!
That's all there is to it!