The Television News NGrams 2.0 dataset records how many times a given word was spoken in a given 10 minute period on a given station/show combination. Each row represents a unique TIME+STATION+SHOW+WORD combination. Thus, if the word "trump" appeared 22 times on CNN in a given 10 minute interval on a given show, it will appear as a single record, with the "COUNT" field recording 22 mentions. This compact representation minimizes storage and works for most kinds of analyses. However, some analyses require a flattened format in which a word that appeared 22 times must appear as 22 separate rows. How can one easily flatten the ngrams dataset?
The query below shows that in hour 0 of January 1, 2020, the word "trump" appeared 9 times on CNN and 12 times on MSNBC:
SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
WHERE DATE(TIMESTAMP) = "2020-01-01" AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') AND HOUR=0 AND NGRAM='trump'
GROUP BY TIMESTAMP, STATION, NGRAM
Yielding the results:
Row | TIMESTAMP | STATION | NGRAM | COUNT
---|---|---|---|---
1 | 2020-01-01 00:00:00 UTC | CNN | trump | 9
2 | 2020-01-01 00:00:00 UTC | MSNBC | trump | 12
What if we want to flatten this such that "trump" appears as 9 rows for CNN and 12 rows for MSNBC? The query below uses GENERATE_ARRAY() paired with UNNEST() to do precisely this!
WITH data AS (
  SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
  FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
  WHERE DATE(TIMESTAMP) = "2020-01-01" AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS') AND HOUR=0 AND NGRAM='trump'
  GROUP BY TIMESTAMP, STATION, NGRAM
)
SELECT mentiontime, mentionid, name, word, STATION, id
FROM (
  SELECT TIMESTAMP mentiontime, CONCAT(CAST(TIMESTAMP AS STRING), '-', STATION, '-', NGRAM) mentionid, '1Gram' name, NGRAM word, STATION, COUNT val, GENERATE_ARRAY(1, COUNT, 1) arr
  FROM data
), UNNEST(arr) id
This yields the flattened table below:
Row | mentiontime | mentionid | name | word | STATION | id
---|---|---|---|---|---|---
1 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 1
2 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 2
3 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 3
4 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 4
5 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 5
6 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 6
7 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 7
8 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 8
9 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-CNN-trump | 1Gram | trump | CNN | 9
10 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 1
11 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 2
12 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 3
13 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 4
14 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 5
15 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 6
16 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 7
17 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 8
18 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 9
19 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 10
20 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 11
21 | 2020-01-01 00:00:00 UTC | 2020-01-01 00:00:00+00-MSNBC-trump | 1Gram | trump | MSNBC | 12
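The mechanic doing the work here is GENERATE_ARRAY(1, COUNT, 1), which builds an array [1, 2, ..., COUNT] for each row, while the comma join against UNNEST() of that array fans each row out into COUNT copies, with the array element surfacing as the id column. Below is a minimal standalone sketch of the same pattern, using hypothetical literal values rather than the ngrams table:

```sql
-- Minimal sketch: fan a single row with a count of 3 out into 3 rows.
WITH data AS (
  SELECT 'trump' AS word, 3 AS cnt
)
SELECT word, id
FROM data, UNNEST(GENERATE_ARRAY(1, cnt, 1)) AS id
-- Returns ('trump', 1), ('trump', 2), ('trump', 3).
```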
Hopefully this is useful for situations where you need to unroll and flatten the ngrams data for specific use cases. As a final check, the flattened output can be rolled back up to confirm it reproduces the original counts, as sketched below.
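A hedged sketch of that check, assuming the same source table and filters as above (the flattened CTE is a simplified version of the flattening query, keeping only the columns needed for the roll-up); the grouped counts should come back as 9 for CNN and 12 for MSNBC:

```sql
-- Hypothetical sanity check: COUNT(*) over the flattened rows should match
-- the original COUNT values (9 for CNN, 12 for MSNBC).
WITH data AS (
  SELECT TIMESTAMP, STATION, NGRAM, SUM(COUNT) COUNT
  FROM `gdelt-bq.gdeltv2.iatv_1gramsv2`
  WHERE DATE(TIMESTAMP) = "2020-01-01"
    AND (STATION='CNN' OR STATION='MSNBC' OR STATION='FOXNEWS')
    AND HOUR=0 AND NGRAM='trump'
  GROUP BY TIMESTAMP, STATION, NGRAM
), flattened AS (
  SELECT TIMESTAMP mentiontime, STATION, NGRAM word, id
  FROM data, UNNEST(GENERATE_ARRAY(1, COUNT, 1)) id
)
SELECT mentiontime, STATION, word, COUNT(*) flattened_rows
FROM flattened
GROUP BY mentiontime, STATION, word
```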