The Television News NGrams 2.0 dataset records how many times a given word was spoken in a given 10 minute period on a given station/show combination. Each row represents a unique TIME+STATION+SHOW+WORD combination. Thus, if the word "trump" appeared 22 times on CNN in a given 10 minute interval on a given show, it will appear as a single record, with the "COUNT" field recording 22 mentions. This compact representation minimizes storage and works for most kinds of analyses. However, some analyses require a flattened format in which a word that appeared 22 times must appear as 22 separate rows. How can one easily flatten the ngrams dataset?
The query below shows that in hour 0 of January 1, 2020, the word "trump" appeared 9 times on CNN and 12 times on MSNBC:
SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
WHERE DATE(TIMESTAMP) = "2020-01-01" AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') AND HOUR=0 AND NGRAM='trump'
GROUP BY TIMESTAMP, STATION, NGRAM
Yielding the results:
Row | TIMESTAMP | STATION | NGRAM | COUNT
---|---|---|---|---
1 | 2020-01-01 00:00:00 UTC | CNN | trump | 9
2 | 2020-01-01 00:00:00 UTC | MSNBC | trump | 12
What if we want to flatten this such that "trump" appears as 9 rows for CNN and 12 rows for MSNBC? The query below uses GENERATE_ARRAY() paired with UNNEST() to do precisely this!
WITH data AS (
  SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
  FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
  WHERE DATE(TIMESTAMP) = "2020-01-01" AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') AND HOUR=0 AND NGRAM='trump'
  GROUP BY TIMESTAMP, STATION, NGRAM
)
SELECT mentiontime, mentionid, name, word, STATION, id
FROM (
  SELECT TIMESTAMP mentiontime, CONCAT(CAST(TIMESTAMP AS STRING), '-', STATION, '-', NGRAM) mentionid, '1Gram' name, NGRAM word, STATION, COUNT val, GENERATE_ARRAY(1, COUNT, 1) arr
  FROM data
), UNNEST(arr) id
This yields the flattened table below:
Row | mentiontime | mentionid | name | word | STATION | id
---|---|---|---|---|---|---
1 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 1
2 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 2
3 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 3
4 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 4
5 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 5
6 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 6
7 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 7
8 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 8
9 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 9
10 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 1
11 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 2
12 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 3
13 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 4
14 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 5
15 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 6
16 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 7
17 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 8
18 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 9
19 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 10
20 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 11
21 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 12
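The mechanic doing the work here is GENERATE_ARRAY(1, COUNT, 1), which builds an array [1, 2, ..., COUNT] for each row, while the comma join against UNNEST() of that array fans each row out into COUNT copies, with the array element surfacing as the id column. Below is a minimal standalone sketch of the same pattern, using hypothetical literal values rather than the ngrams table:

```sql
-- Minimal sketch: fan a single row with a count of 3 out into 3 rows.
WITH data AS (
  SELECT 'trump' AS word, 3 AS cnt
)
SELECT word, id
FROM data, UNNEST(GENERATE_ARRAY(1, cnt, 1)) AS id
-- Returns ('trump', 1), ('trump', 2), ('trump', 3).
```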
Hopefully this is useful for situations where you need to unroll and flatten the ngrams data for specific use cases. As a final check, the flattened output can be rolled back up to confirm it reproduces the original counts, as sketched below.
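A hedged sketch of that check, assuming the same source table and filters as above (the flattened CTE is a simplified version of the flattening query, keeping only the columns needed for the roll-up); the grouped counts should come back as 9 for CNN and 12 for MSNBC:

```sql
-- Hypothetical sanity check: COUNT(*) over the flattened rows should match
-- the original COUNT values (9 for CNN, 12 for MSNBC).
WITH data AS (
  SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
  FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
  WHERE DATE(TIMESTAMP) = "2020-01-01"
    AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS')
    AND HOUR=0 AND NGRAM='trump'
  GROUP BY TIMESTAMP, STATION, NGRAM
), flattened AS (
  SELECT TIMESTAMP mentiontime, STATION, NGRAM word, id
  FROM data, UNNEST(GENERATE_ARRAY(1, COUNT, 1)) id
)
SELECT mentiontime, STATION, word, COUNT(*) flattened_rows
FROM flattened
GROUP BY mentiontime, STATION, word
```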