A By-Station Topical Chronology Of A Decade Of Television News Using Cloud Natural Language & BigQuery

Earlier today we showed how the Cloud Natural Language API could read through a decade of television news transcripts to annotate their core entities, with BigQuery running a TFIDF calculation over the massive dataset to compute the top 10 most signficant entries of each entity type by day. What if we broke that down even further, surfacing the top entities by station/day, to understand how stations compared against one another on a given day?

The end result for entities of type "Event" on October 31, 2019 looks like this (keep in mind that the all-caps nature of closed captioning leads to a higher error rate):

Date Station Type Top 10 Entries
10/31/2019 BBCNEWS EVENT ELECTION, MAILAND GENERAL ELECTION, CAMPAIGN, LABOUR, ELECTION CAMPAIGN, BREXIT GENERAL ELECTION, RAID, RUGBY WORLD CUP FINAL, GENERAL ELECTION, WORLD CUP
10/31/2019 CNN EVENT HEARINGS, TESTIMONY, INVESTIGATION, ELECTION, DEPOSITIONS, WORLD SERIES, IMPEACHMENT INVESTIGATION, MEETING, JULY 10TH MEETING, WASHINGTON NATIONALS CELEBRATING THEIR FIRST EVER WORLD SERIES
10/31/2019 FOXNEWS EVENT HEARINGS, ELECTION, INVESTIGATION, DEPOSITIONS, SUICIDE, WORLD SERIES, CAMPAIGN, AUTOPSY, EVENT, DEATH
10/31/2019 KGO EVENT ALCOHOL POISONING DEATH, POLICE CHASE, HEARINGS, PLANE CRASH, DEATH, THE HIGH SCHOOL THEATER GROUP PERFORMING, ND CELEBRATING THE WORLD SERIES, HOSTING A PARADE ON SATURDAY, AUDIO ANNOUNCEMENT, WILDLAND FIRE
10/31/2019 KNTV EVENT BATTLE, ROLLER COASTER RIDE OF HOPE, RAPID FIRE SPREAD, BYE BYE BYE BYE BYE BYE, FIRES, EXPLOSIVE TESTIMONY, AIR ATTACK, WIND EVENT, BURIED, IMPEACHMENT INVESTIGATION
10/31/2019 KPIX EVENT WAYONS BROTHERS SHOW, EPIC BATTLE IN AFGHANISTAN, HORSE FATALITIES, PRERACE PRESS CONFERENCE, EVE, RETRIBUTION ATTACK, IMPEACHMENT INQUIRY IN THE HOUSE, WEEKEND RAID, TERROR TAKEDOWN, SO.MAGIC MORNINGS
10/31/2019 MSNBC EVENT HEARINGS, DEPOSITIONS, JULY 18TH MEETING, TESTIMONY, ELECTION, INVESTIGATION, MEETING, BLASEY FORD TESTIMONY, IMPEACHMENT PROCEEDINGS, IMPEACHMENT TRIAL

Immediately clear is that CNN, Fox News and MSNBC all largely focused on the impeachment hearings, though Fox News emphasized Epstein's death and the World Series more prominently. KGO, KPIX and KNTV include only the ABC, CBS and NBC evening news broadcasts on those stations and the shorter time frame means they cover a wider range of topics. Interestingly, the much greater variety in their topical focus in the table above aligns with the finding that they are signficantly less correlated. Finally, BBC represents an entirely parallel news environment emphasizing domestic UK and EU news, exactly as would be expected from the lone non-US station in the list above.

The complete spreadsheet can be downloaded in CSV format:

TECHNICAL DETAILS

Constructing the spreadsheet above took only a simple SQL query, modified from the original by Felipe Hoffa.

WITH nested AS (

WITH data AS (
SELECT TIMESTAMP_TRUNC(date, DAY) date, station, entities.name word, entities.numMentions count, entities.type type FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities
)
, word_day_type AS (
# how many times a word is mentioned in each "document"
SELECT word, SUM(count) counts, date, type, station
FROM data
GROUP BY 1, 3, 4, 5
)
, day_type AS (
# total # of words in each "document"
SELECT SUM(count) counts, date, type, station
FROM data
GROUP BY 2,3,4
)
, tf AS (
# TF for a word in a "document"
SELECT word, date, type, station, a.counts/b.counts tf
FROM word_day_type a
JOIN day_type b
USING(date, type, station)
)
, word_in_docs AS (
# how many "documents" have a word
SELECT word, COUNT(DISTINCT FORMAT('%s %s %s', FORMAT_TIMESTAMP( "%m/%d/%E4Y", date), type, station)) indocs
FROM word_day_type
GROUP BY 1
)
, total_docs AS (
# total # of docs
SELECT COUNT(DISTINCT FORMAT('%s %s %s', FORMAT_TIMESTAMP( "%m/%d/%E4Y", date), type, station)) total_docs
FROM data
)
, idf AS (
# IDF for a word
SELECT word, LOG(total_docs.total_docs/indocs) idf
FROM word_in_docs
CROSS JOIN total_docs
)

SELECT date, station,
ARRAY_AGG(STRUCT(type, ARRAY_TO_STRING(words, ', ') AS wordlist) ORDER BY type) top_words
FROM (
SELECT date, type, station, ARRAY_AGG(word ORDER BY tfidf DESC LIMIT 10) words
FROM (
SELECT word, date, type, station, tf.tf * idf.idf tfidf
FROM tf
JOIN idf
USING(word)
)
GROUP BY date, type, station
)
GROUP BY date,station
ORDER BY date DESC

) select FORMAT_TIMESTAMP( "%m/%d/%E4Y", date) day,station,type,wordlist from nested, UNNEST(top_words) order by date asc, type asc, station asc