Earlier today we showed how the Cloud Natural Language API could read through a decade of television news transcripts to annotate their core entities, with BigQuery running a TFIDF calculation over the massive dataset to compute the top 10 most signficant entries of each entity type by day. What if we broke that down even further, surfacing the top entities by station/day, to understand how stations compared against one another on a given day?
The end result for entities of type "Event" on October 31, 2019 looks like this (keep in mind that the all-caps nature of closed captioning leads to a higher error rate):
|Date||Station||Type||Top 10 Entries|
|10/31/2019||BBCNEWS||EVENT||ELECTION, MAILAND GENERAL ELECTION, CAMPAIGN, LABOUR, ELECTION CAMPAIGN, BREXIT GENERAL ELECTION, RAID, RUGBY WORLD CUP FINAL, GENERAL ELECTION, WORLD CUP|
|10/31/2019||CNN||EVENT||HEARINGS, TESTIMONY, INVESTIGATION, ELECTION, DEPOSITIONS, WORLD SERIES, IMPEACHMENT INVESTIGATION, MEETING, JULY 10TH MEETING, WASHINGTON NATIONALS CELEBRATING THEIR FIRST EVER WORLD SERIES|
|10/31/2019||FOXNEWS||EVENT||HEARINGS, ELECTION, INVESTIGATION, DEPOSITIONS, SUICIDE, WORLD SERIES, CAMPAIGN, AUTOPSY, EVENT, DEATH|
|10/31/2019||KGO||EVENT||ALCOHOL POISONING DEATH, POLICE CHASE, HEARINGS, PLANE CRASH, DEATH, THE HIGH SCHOOL THEATER GROUP PERFORMING, ND CELEBRATING THE WORLD SERIES, HOSTING A PARADE ON SATURDAY, AUDIO ANNOUNCEMENT, WILDLAND FIRE|
|10/31/2019||KNTV||EVENT||BATTLE, ROLLER COASTER RIDE OF HOPE, RAPID FIRE SPREAD, BYE BYE BYE BYE BYE BYE, FIRES, EXPLOSIVE TESTIMONY, AIR ATTACK, WIND EVENT, BURIED, IMPEACHMENT INVESTIGATION|
|10/31/2019||KPIX||EVENT||WAYONS BROTHERS SHOW, EPIC BATTLE IN AFGHANISTAN, HORSE FATALITIES, PRERACE PRESS CONFERENCE, EVE, RETRIBUTION ATTACK, IMPEACHMENT INQUIRY IN THE HOUSE, WEEKEND RAID, TERROR TAKEDOWN, SO.MAGIC MORNINGS|
|10/31/2019||MSNBC||EVENT||HEARINGS, DEPOSITIONS, JULY 18TH MEETING, TESTIMONY, ELECTION, INVESTIGATION, MEETING, BLASEY FORD TESTIMONY, IMPEACHMENT PROCEEDINGS, IMPEACHMENT TRIAL|
Immediately clear is that CNN, Fox News and MSNBC all largely focused on the impeachment hearings, though Fox News emphasized Epstein's death and the World Series more prominently. KGO, KPIX and KNTV include only the ABC, CBS and NBC evening news broadcasts on those stations and the shorter time frame means they cover a wider range of topics. Interestingly, the much greater variety in their topical focus in the table above aligns with the finding that they are signficantly less correlated. Finally, BBC represents an entirely parallel news environment emphasizing domestic UK and EU news, exactly as would be expected from the lone non-US station in the list above.
The complete spreadsheet can be downloaded in CSV format:
Constructing the spreadsheet above took only a simple SQL query, modified from the original by Felipe Hoffa.
WITH nested AS ( WITH data AS ( SELECT TIMESTAMP_TRUNC(date, DAY) date, station, entities.name word, entities.numMentions count, entities.type type FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities ) , word_day_type AS ( # how many times a word is mentioned in each "document" SELECT word, SUM(count) counts, date, type, station FROM data GROUP BY 1, 3, 4, 5 ) , day_type AS ( # total # of words in each "document" SELECT SUM(count) counts, date, type, station FROM data GROUP BY 2,3,4 ) , tf AS ( # TF for a word in a "document" SELECT word, date, type, station, a.counts/b.counts tf FROM word_day_type a JOIN day_type b USING(date, type, station) ) , word_in_docs AS ( # how many "documents" have a word SELECT word, COUNT(DISTINCT FORMAT('%s %s %s', FORMAT_TIMESTAMP( "%m/%d/%E4Y", date), type, station)) indocs FROM word_day_type GROUP BY 1 ) , total_docs AS ( # total # of docs SELECT COUNT(DISTINCT FORMAT('%s %s %s', FORMAT_TIMESTAMP( "%m/%d/%E4Y", date), type, station)) total_docs FROM data ) , idf AS ( # IDF for a word SELECT word, LOG(total_docs.total_docs/indocs) idf FROM word_in_docs CROSS JOIN total_docs ) SELECT date, station, ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words FROM ( SELECT date, type, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words FROM ( SELECT word, date, type, station, tf.tf * idf.idf tfidf FROM tf JOIN idf USING(word) ) GROUP BY date, type, station ) GROUP BY date,station ORDER BY date DESC ) select FORMAT_TIMESTAMP( "%m/%d/%E4Y", date) day,station,type,wordlist from nested, UNNEST(top_words) order by date asc, type asc, station asc