Last week we unveiled the results of having Google's Cloud Natural Language API non-consumptively read through the transcripts of a decade of television news from the Internet Archive's Television News Archive and catalog all of the entities it found in all of that coverage. The end result was a massive television news spoken entity dataset.
How might we use this enormous dataset to understand the most significant topics and things that defined each day of television news over the past decade through the eyes of CNN, MSNBC, Fox News and the evening news broadcasts of ABC, CBS and NBC, as well as BBC News London for a third of that decade?
Earlier this month we showed how Google's Cloud Inference API could be used to construct a "trending topics" chronology over online news. It turns out that we can apply that exact same process to this enormous television news entity dataset to surface the most significant topics and things discussed each day across the combined airtime of these top television news stations over the past decade.
Here are the final chronologies:
- GEG-TVIA-CONSUMER_GOOD.TXT
- GEG-TVIA-EVENT.TXT
- GEG-TVIA-LOCATION.TXT
- GEG-TVIA-ORGANIZATION.TXT
- GEG-TVIA-OTHER.TXT
- GEG-TVIA-PERSON.TXT
- GEG-TVIA-WORK_OF_ART.TXT
TECHNICAL DETAILS
How did we create the lists above?
Once again, we start by creating a temporary BigQuery table and using a sequence of queries to translate and reshape the BigQuery version of the Television News Captioning Entity Dataset into the format used by the Inference API.
Once again, we use FARM_FINGERPRINT() to generate a unique numeric ID for each "group," which in this case we define as the native resolution of the television news entity dataset: a station-minute.
First, we translate over the station and show names for each station-minute into our new temporary BigQuery table (run the two queries below separately, with your query settings set to append the results to the temporary table):
SELECT FARM_FINGERPRINT( CONCAT(station, CAST(date AS STRING)) ) group_id, 'Station' data_name, station data_value, date start_time
FROM `gdelt-bq.gdeltv2.geg_iatv`

SELECT FARM_FINGERPRINT( CONCAT(station, CAST(date AS STRING)) ) group_id, 'ShowName' data_name, showName data_value, date start_time
FROM `gdelt-bq.gdeltv2.geg_iatv`
Then we translate over all of the entity references that had MID codes and use a join to compute the most common human name for each (the Cloud Natural Language API recognizes a number of name variants for entities with MID codes, so this resolves each entity to its most common name):
SELECT group_id, data_name, a.entity data_value, start_time FROM (
  SELECT FARM_FINGERPRINT( CONCAT(station, CAST(date AS STRING)) ) group_id, CONCAT('Entity', entity.type) data_name, date start_time, entity.mid mid
  FROM `gdelt-bq.gdeltv2.geg_iatv`, UNNEST(entities) entity
  WHERE entity.mid IS NOT NULL
) b JOIN (
  SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid
  FROM `gdelt-bq.gdeltv2.geg_iatv`, UNNEST(entities) entities
  WHERE entities.mid IS NOT NULL
  GROUP BY entities.mid
) a USING(mid)
Now we go back and compile a list of all of the unique column names in the new table:
SELECT data_name, count(1) cnt FROM `[YOURTEMPBIGQUERYTABLE]` group by data_name order by data_name asc
And verify the start and end date ranges:
SELECT DATE(date) day, count(1) FROM `[YOURTEMPBIGQUERYTABLE]` group by day order by day asc
Now we take the list of column names above and form them into the final Inference API load command and instruct the Cloud Inference API to load the dataset directly from BigQuery into a new Inference API dataset called "gdeltbq_gegiatv":
curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name": "gdeltbq_gegiatv",
  "data_names": [
    "EntityCONSUMER_GOOD",
    "EntityEVENT",
    "EntityLOCATION",
    "EntityORGANIZATION",
    "EntityOTHER",
    "EntityPERSON",
    "EntityUNKNOWN",
    "EntityWORK_OF_ART",
    "Station",
    "ShowName"
  ],
  "data_sources": [
    { "uri": "[YOURTEMPBIGQUERYTABLE]" }
  ]
}'
You can check the load status using the command below:
curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets
Once the dataset is loaded, download the PERL script and its corresponding JSON query template.
Using the start and stop dates we determined earlier, run the PERL script over the entire dataset for each entity type in turn. The script cycles over each day from the start date to the end date, issuing a TED query to the Inference API that asks it to surface the most significant entities of the given type for that day:
time ./infer_run_day.pl 20090702 20191031 EntityLOCATION GEG-TVIA-LOCATION.TXT
time ./infer_run_day.pl 20090702 20191031 EntityPERSON GEG-TVIA-PERSON.TXT
time ./infer_run_day.pl 20090702 20191031 EntityEVENT GEG-TVIA-EVENT.TXT
time ./infer_run_day.pl 20090702 20191031 EntityCONSUMER_GOOD GEG-TVIA-CONSUMER_GOOD.TXT
time ./infer_run_day.pl 20090702 20191031 EntityORGANIZATION GEG-TVIA-ORGANIZATION.TXT
time ./infer_run_day.pl 20090702 20191031 EntityOTHER GEG-TVIA-OTHER.TXT
time ./infer_run_day.pl 20090702 20191031 EntityWORK_OF_ART GEG-TVIA-WORK_OF_ART.TXT
Internally, the PERL script is about as simple as it gets, since the Inference API does all of the actual work! The PERL script's job is simply to cycle through the days between the start and end dates and issue a TED query for each to the Inference API. It is the Inference API that, in a fraction of a second, examines the enormous archive of the entities mentioned over a decade of television news coverage and determines the ones that are most significant to the given day.
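To make that loop concrete, here is a minimal PERL sketch of the day-cycling logic described above. This is not the actual downloadable infer_run_day.pl: the template filename, the [DATE] and [DATANAME] placeholders it substitutes, and the ":query" URL suffix are illustrative assumptions, with the real query JSON coming from the template file downloaded above.

#!/usr/bin/perl
# Minimal sketch of the day-cycling loop described above -- NOT the actual
# infer_run_day.pl. The template filename, the [DATE] and [DATANAME]
# placeholders it substitutes, and the ":query" URL suffix are illustrative
# assumptions; the real query JSON comes from the downloaded template.
use strict;
use warnings;
use Time::Local;
use POSIX qw(strftime);

my ($startdate, $enddate, $entitytype, $outfile) = @ARGV;
die "usage: $0 STARTDATE ENDDATE ENTITYTYPE OUTFILE\n" if !defined $outfile;

# Read the JSON query template once (assumed to contain [DATE] and [DATANAME] placeholders).
open(my $tfh, '<', 'infer_query_template.json') or die "missing query template: $!";
my $template = do { local $/; <$tfh> };
close($tfh);

# Convert the YYYYMMDD start/end dates to epoch seconds so we can step one day at a time.
my $start = timegm(0, 0, 0, substr($startdate,6,2), substr($startdate,4,2) - 1, substr($startdate,0,4));
my $end   = timegm(0, 0, 0, substr($enddate,6,2),   substr($enddate,4,2) - 1,   substr($enddate,0,4));

open(my $out, '>', $outfile) or die "cannot write $outfile: $!";
for (my $day = $start; $day <= $end; $day += 86400) {
    my $date = strftime('%Y%m%d', gmtime($day));

    # Fill in this day's date and the requested entity field in the TED query template.
    my $query = $template;
    $query =~ s/\[DATE\]/$date/g;
    $query =~ s/\[DATANAME\]/$entitytype/g;

    # Issue the query to the Inference API and append the day's response to the output file.
    open(my $qfh, '>', 'query.json') or die "cannot write query.json: $!";
    print $qfh $query;
    close($qfh);
    my $response = `curl -s -H "Content-Type: application/json" -H "Authorization: Bearer \$(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets/gdeltbq_gegiatv:query -d \@query.json`;
    print $out "$date\t$response\n";
}
close($out);

Because each day is an independent query, a loop like this can be restarted at any point or run in parallel across entity types, which is why the commands above invoke the script once per entity type.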