A Topical Chronology Of A Decade Of Television News Using Cloud Natural Language & BigQuery

Last week we unveiled the results of having Google's Cloud Natural Language API non-consumptively read through the transcripts of a decade of television news from the Internet Archive's Television News Archive, cataloging every entity it found across all of that coverage. The end result was a massive television news spoken entity dataset.

How might we use this enormous dataset to understand the most significant topics and things that defined each day of television news over the past decade, as seen through the eyes of CNN, MSNBC and Fox News, the evening news broadcasts of ABC, CBS and NBC and, for a third of that decade, BBC News London?

It turns out that through the power of BigQuery, doing so requires just a single SQL query and 22 seconds of processing time to construct a chronology of the past decade, surfacing the top topics of each day, broken out by entity type, using TF-IDF.
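In this analysis, every combination of broadcast day and entity type is treated as a single "document," so an entity scores highly on a given day only if it dominates that day's coverage while appearing in relatively few other days. In standard notation, this is a sketch of what the query below computes, with w an entity, d a (day, entity type) document and N the total number of documents:

\mathrm{tf}(w,d) = \frac{\text{mentions of } w \text{ in } d}{\text{total entity mentions in } d},
\qquad
\mathrm{idf}(w) = \log\frac{N}{\lvert\{\, d : w \text{ appears in } d \,\}\rvert},
\qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)

The logarithm is natural, matching BigQuery's single-argument LOG() function, and perennially-mentioned entities are pushed down the rankings while day-specific ones float to the top.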

You can download the final master spreadsheet in CSV format:

TECHNICAL DETAILS

Creating this spreadsheet required only a single query, modified from an original TF-IDF query by Felipe Hoffa:

WITH nested AS (

WITH data AS (
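# one row per entity per broadcast day: the day, entity name, entity type and number of mentions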
SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word,
       entities.numMentions count, entities.type type
FROM `gdelt-bq.gdeltv2.geg_iatv`, UNNEST(entities) entities
)
, word_day_type AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, type
FROM data
GROUP BY 1, 3, 4
)
, day_type AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, type
FROM data
GROUP BY 2, 3
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, type, a.counts/b.counts tf
FROM word_day_type a
JOIN day_type b
USING(date, type)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%s %s', date, type)) indocs
FROM word_day_type
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%s %s', date, type)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

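# for each day, collect the top entities of each type into a comma-separated list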
SELECT date,
ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words
FROM (
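# top ten entities per day and entity type, ranked by TF-IDF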
SELECT date, type, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words
FROM (
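# final TF-IDF score for each entity per day and type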
SELECT word, date, type, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, type
)
GROUP BY date
ORDER BY date DESC

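# flatten the nested per-day arrays into one output row per day and entity type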
) SELECT date, type, wordlist FROM nested, UNNEST(top_words)
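To turn the query results into a downloadable CSV like the spreadsheet above, one option is BigQuery's EXPORT DATA statement, which writes a query's output to Google Cloud Storage (for smaller result sets, the results can also be saved directly from the BigQuery console). The following is a minimal sketch only: the bucket and path are placeholders, and the SELECT shown is a trivial stand-in for the full chronology query above:

EXPORT DATA OPTIONS (
  uri = 'gs://my-example-bucket/tv-chronology/*.csv',  # placeholder bucket; the single * wildcard is required
  format = 'CSV',
  overwrite = true,
  header = true
) AS
# stand-in query: replace with the full chronology query above
SELECT FORMAT_TIMESTAMP("%m/%d/%E4Y", date) date, entities.name word, entities.type type
FROM `gdelt-bq.gdeltv2.geg_iatv`, UNNEST(entities) entities
LIMIT 1000

The wildcard in the URI lets BigQuery shard large result sets across multiple output files in the bucket.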