Last week we unveiled the results of having Google's Cloud Natural Language API non-consumptively read through the transcripts of a decade of television news from the Internet Archive's Television News Archive and catalog all of the entities it found in all of that coverage. The end result was a massive television news spoken entity dataset.
How might we use this enormous dataset to understand the most signficant topics and things that defined each day of television news over the past decade through the eyes of CNN, MSNBC, Fox News and the evening news broadcasts of ABC, CBS and NBC and BBC News London for a third of a decade?
It turns out that through the power of BigQuery, doing so requires just a single SQL query and 22 seconds to construct a chronology of the past decade, surfacing the top topics by type by day using TFIDF.
You can download the final master spreadsheet in CSV format:
Creating this spreadsheet required only a single query, modified from the original by Felipe Hoffa:
WITH nested AS ( WITH data AS ( SELECT FORMAT_TIMESTAMP( "%m/%d/%E4Y", date) date, entities.name word, entities.numMentions count, entities.type type FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities ) , word_day_type AS ( # how many times a word is mentioned in each "document" SELECT word, SUM(count) counts, date, type FROM data GROUP BY 1, 3, 4 ) , day_type AS ( # total # of words in each "document" SELECT SUM(count) counts, date, type FROM data GROUP BY 2,3 ) , tf AS ( # TF for a word in a "document" SELECT word, date, type, a.counts/b.counts tf FROM word_day_type a JOIN day_type b USING(date, type) ) , word_in_docs AS ( # how many "documents" have a word SELECT word, COUNT(DISTINCT FORMAT('%s %s', date, type)) indocs FROM word_day_type GROUP BY 1 ) , total_docs AS ( # total # of docs SELECT COUNT(DISTINCT FORMAT('%s %s', date, type)) total_docs FROM data ) , idf AS ( # IDF for a word SELECT word, LOG(total_docs.total_docs/indocs) idf FROM word_in_docs CROSS JOIN total_docs ) SELECT date, ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words FROM ( SELECT date, type, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words FROM ( SELECT word, date, type, tf.tf * idf.idf tfidf FROM tf JOIN idf USING(word) ) GROUP BY date, type ) GROUP BY date ORDER BY date DESC ) select date,type,wordlist from nested, UNNEST(top_words)