Converting piles of unstructured text into structured tables like ngrams opens the door to all kinds of powerful new analyses of the narratives that shape our understanding of the world around us. Last week Felipe Hoffa showed how a single BigQuery query could process the new TV Ngram Dataset to generate a list of the top terms by day according to their TFIDF scores, performing the entire analysis inside of BigQuery.
Felipe's query is extremely flexible, allowing both the document set and background population to be configured.
What would it look like to apply Felipe's BigQuery analysis to the entire 2009-2019 dataset, computing the top 10 terms by station for each day by comparing each day/station's word usage against the background of the entire 2009-2019 period across all stations?
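The key idea is that each day/station pair is treated as one "document": TF is a word's share of that document's total words, and IDF is the log of total documents over documents containing the word. A minimal Python sketch of that scoring, using invented ngram rows and station names purely for illustration:

```python
import math
from collections import defaultdict

# Toy ngram rows mimicking the BigQuery table: (word, count, date, station).
# All values here are invented for illustration.
rows = [
    ("election", 12, 20190630, "CNN"),
    ("weather",   3, 20190630, "CNN"),
    ("election",  1, 20190630, "FOXNEWS"),
    ("sports",    9, 20190630, "FOXNEWS"),
    ("weather",   5, 20190629, "CNN"),
]

# Each (date, station) pair is one "document".
word_counts = defaultdict(int)      # (word, doc) -> mentions of word in doc
doc_totals = defaultdict(int)       # doc -> total words in doc
docs_with_word = defaultdict(set)   # word -> set of docs containing it

for word, count, date, station in rows:
    doc = (date, station)
    word_counts[(word, doc)] += count
    doc_totals[doc] += count
    docs_with_word[word].add(doc)

total_docs = len(doc_totals)

def tfidf(word, doc):
    # TF: word's share of the document; IDF: log(total docs / docs with word),
    # matching BigQuery's LOG(), which defaults to the natural log.
    tf = word_counts[(word, doc)] / doc_totals[doc]
    idf = math.log(total_docs / len(docs_with_word[word]))
    return tf * idf

# Top terms for one document, highest TF-IDF first.
doc = (20190630, "CNN")
top = sorted(
    {w for (w, d) in word_counts if d == doc},
    key=lambda w: tfidf(w, doc),
    reverse=True,
)
print(top)  # -> ['election', 'weather']
```

Scaled up across a decade of data, this is exactly the computation the query below pushes entirely into BigQuery.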
The final query took just 45.3 seconds to compute a decade of top daily terms for all dozen stations in the dataset. Once again, the power of BigQuery combined with the simplicity of ngrams demonstrates the incredible insights we can gain into how our news narratives shape our view of the world around us.
Note that these terms are a mixture of news coverage and captioned advertisements. Few advertisements are captioned, so most ads are not present in the ngram dataset, but given that advertisements tend to be run in brief high-intensity bursts, brand names will often show up in these entries during ad runs.
And here is the query itself, written in Standard SQL. It wraps Felipe's original query in an additional outer SELECT that flattens the results for CSV export, sets the number of top terms to 10, and sets the date range to span the entire dataset (KSTS and KDTV are excluded because their coverage is predominantly in Spanish, while the other stations broadcast in English):
WITH nested AS (
  WITH data AS (
    SELECT * FROM `gdelt-bq.gdeltv2.iatv_1grams`
    WHERE DATE BETWEEN 20090702 AND 20190630
    AND station NOT IN ('KSTS', 'KDTV')
  ), word_day_station AS (
    # how many times a word is mentioned in each "document"
    SELECT word, SUM(count) counts, date, station
    FROM data
    GROUP BY 1, 3, 4
  ), day_station AS (
    # total # of words in each "document"
    SELECT SUM(count) counts, date, station
    FROM data
    GROUP BY 2, 3
  ), tf AS (
    # TF for a word in a "document"
    SELECT word, date, station, a.counts/b.counts tf
    FROM word_day_station a
    JOIN day_station b USING(date, station)
  ), word_in_docs AS (
    # how many "documents" have a word
    SELECT word, COUNT(DISTINCT FORMAT('%i %s', date, station)) indocs
    FROM word_day_station
    GROUP BY 1
  ), total_docs AS (
    # total # of docs
    SELECT COUNT(DISTINCT FORMAT('%i %s', date, station)) total_docs
    FROM data
  ), idf AS (
    # IDF for a word
    SELECT word, LOG(total_docs.total_docs/indocs) idf
    FROM word_in_docs
    CROSS JOIN total_docs
  )
  SELECT date, ARRAY_AGG(STRUCT(station, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY station) top_words
  FROM (
    SELECT date, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words
    FROM (
      SELECT word, date, station, tf.tf * idf.idf tfidf
      FROM tf
      JOIN idf USING(word)
    )
    GROUP BY date, station
  )
  GROUP BY date
  ORDER BY date DESC
)
SELECT date, station, wordlist
FROM nested, UNNEST(top_words)
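The final SELECT ... FROM nested, UNNEST(top_words) is the flattening wrapper: the inner query yields one row per date holding an array of per-station structs, and UNNEST expands that array into one row per date/station ready for CSV export. A quick Python sketch of that expansion, with illustrative station names and word lists:

```python
# Toy nested result: one row per date, each holding an array of per-station
# structs, mirroring the ARRAY_AGG(STRUCT(...)) output. Values are invented.
nested = [
    {"date": 20190630, "top_words": [
        {"station": "CNN", "wordlist": "election, weather"},
        {"station": "FOXNEWS", "wordlist": "sports"},
    ]},
]

# Equivalent of `SELECT date, station, wordlist FROM nested, UNNEST(top_words)`:
# the cross join against each row's own array yields one flat row per struct.
flat = [
    (row["date"], s["station"], s["wordlist"])
    for row in nested
    for s in row["top_words"]
]
print(flat)
# -> [(20190630, 'CNN', 'election, weather'), (20190630, 'FOXNEWS', 'sports')]
```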