The Top Daily Terms By Station On American Television 2009-2019 Using BigQuery And TV Ngrams

Converting piles of unstructured text into structured tables like ngrams opens the door to all kinds of powerful new analyses of the narratives that shape our understanding of the world around us. Last week Felipe Hoffa showed how a single BigQuery query could process the new TV Ngram Dataset to generate a list of the top terms by day according to their TFIDF scores, performing the entire analysis inside of BigQuery.

Felipe's query is extremely flexible, allowing both the document set and background population to be configured.

What would it look like to apply Felipe's BigQuery analysis to the entire 2009-2019 dataset, computing the top 10 terms by station for each day by comparing each day/station's word usage against the background of the entire 2009-2019 period across all stations?

The final query took just 45.3 seconds to compute a decade of top daily terms for all dozen stations in the dataset. Once again, the power of BigQuery combined with the simplicity of ngrams demonstrates the incredible insights we can gain into how our news narratives shape our view of the world around us.

Download The Final Results (3.5 MB CSV).

Note that these terms are a mixture of news coverage and captioned advertisements. Few advertisements are captioned, so most ads are not present in the ngram dataset, but given that advertisements tend to be run in brief high-intensity bursts, brand names will often show up in these entries during ad runs.

And here is the query itself, written in Standard SQL and with an additional wrapper around Felipe's original query to flatten the results for CSV export, as well as setting the number of top terms to 10 and setting the date range to include the entire dataset (the exclusion of KSTS and KDTV is due to their predominately Spanish-language coverage since the other stations are in English):

WITH nested AS (

WITH data AS (
SELECT *
FROM `gdelt-bq.gdeltv2.iatv_1grams`
WHERE DATE BETWEEN 20090702 AND 20190630
AND station NOT IN ('KSTS', 'KDTV')
)
, word_day_station AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, station
FROM data
GROUP BY 1, 3, 4
)
, day_station AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, station
FROM data
GROUP BY 2,3
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, station, a.counts/b.counts tf
FROM word_day_station a
JOIN day_station b
USING(date, station)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%i %s', date, station)) indocs
FROM word_day_station
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%i %s', date, station)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

SELECT date,
ARRAY_AGG(STRUCT(station, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY station) top_words
FROM (
SELECT date, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 5) words
FROM (
SELECT word, date, station, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, station
)
GROUP BY date
ORDER BY date DESC

) select date,station,wordlist from nested, UNNEST(top_words)