The GDELT Project

Flattening TV News NGrams Using BigQuery

The Television News NGrams 2.0 dataset records how many times a given word was spoken in a given 10 minute period on a given station/show combination. Each row represents a unique TIME+STATION+SHOW+WORD combination. Thus, if the word "trump" appeared 22 times on CNN in a given 10 minute interval on a given show it will appear as a single record, with the "COUNT" field recording 22 mentions. This optimizes space requirements and is the most common representation form for most kinds of analyses. However, some kinds of analyses can require a flattened format in which if a word appeared 22 times it must appear as 22 separate rows. How can one easily flatten the ngrams dataset?

The query below shows that the word "trump" appears in hour 0 of CNN and MSNBC on January 1, 2020 9 and 12 times respectively:

select TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE DATE(TIMESTAMP) = "2020-01-01" and (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') and HOUR=0 and NGRAM='trump' group by TIMESTAMP,STATION,NGRAM

Yielding the results:

Row TIMESTAMP STATION NGRAM COUNT
1
2020-01-01 00:00:00 UTC
CNN
trump
9
2
2020-01-01 00:00:00 UTC
MSNBC
trump
12

What if we want to flatten this such that "trump" appears in 9 rows for CNN and 12 rows for MSNBC? The query below uses GENERATE_ARRAY() paired with UNEST() to do precisely this!

WITH data AS (select TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT FROM `gdelt-bq.gdeltv2.iatv_1gramsv2` WHERE DATE(TIMESTAMP) = "2020-01-01" and (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') and HOUR=0 and NGRAM='trump' group by TIMESTAMP,STATION,NGRAM)
select mentiontime, mentionid, name, word, STATION, id from (
SELECT TIMESTAMP mentiontime,
CONCAT(CAST(TIMESTAMP AS STRING),'-',STATION,'-',NGRAM) mentionid,
'1Gram' name,
NGRAM word,
STATION,
COUNT val,
GENERATE_ARRAY(1,COUNT,1) arr FROM data
),unnest(arr) id

This yields the flattened table below:

Row mentiontime mentionid name word STATION id
1
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
1
2
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
2
3
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
3
4
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
4
5
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
5
6
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
6
7
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
7
8
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
8
9
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-CNN-trump
1Gram
trump
CNN
9
10
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
1
11
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
2
12
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
3
13
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
4
14
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
5
15
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
6
16
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
7
17
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
8
18
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
9
19
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
10
20
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
11
21
2020-01-01 00:00:00 UTC
2020-01-01 00:00:00+00-MSNBC-trump
1Gram
trump
MSNBC
12

Hopefully this is useful for situations where you need to unroll and flatten ngrams data for specific use cases.