Computing TF-IDF Scores From The TV Ngram Dataset In BigQuery

The ever-amazing Felipe Hoffa came up with the following BigQuery query that computes the TF-IDF scores for the new Television News Ngrams Dataset and uses them to surface the most significant words per station per day, all in just a few seconds!

Here are the results of running the query:

And here is the query itself, written in Standard SQL:

WITH data AS (
SELECT *
FROM `gdelt-bq.gdeltv2.iatv_1grams`
WHERE DATE BETWEEN 20190601 AND 20190629
AND station NOT IN ('KSTS', 'KDTV')
)
, word_day_station AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, station
FROM data
GROUP BY 1, 3, 4
)
, day_station AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, station
FROM data
GROUP BY 2,3
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, station, a.counts/b.counts tf
FROM word_day_station a
JOIN day_station b
USING(date, station)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%i %s', date, station)) indocs
FROM word_day_station
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%i %s', date, station)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

SELECT date,
ARRAY_AGG(STRUCT(station, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY station) top_words
FROM (
SELECT date, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 5) words
FROM (
SELECT word, date, station, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, station
)
GROUP BY date
ORDER BY date DESC

Once again, the incredible power of BigQuery is on display, able to take a 1.2 billion record ngram dataset, compute tfidf scores for a set of days and surface their most significant terms, all in just a few seconds with a single query!

Read Felipe's Full Stack Overflow Post.