The GDELT Project

A By-Station Visual Chronology Of The Top Visual Themes Of Television Evening News 2010-2019 Using Cloud Video + Inference API

Using the new visual entity graph constructed by having the Cloud Video API watch a decade of evening news broadcasts on ABC, CBS and NBC in the Internet Archive's Television News Archive, what can we learn about the most significant visual themes of each evening? Yesterday we demonstrated how we could have BigQuery identify the most significant terms per day using TF-IDF. Alternatively, we can use the Cloud Inference API to perform a much more powerful statistical analysis that surfaces the top 10 most meaningful visual labels of each evening's coverage on each station. Even more powerfully, the same queries can be used for interactive exploration to better understand the findings we unearth.

We start by creating a temporary BigQuery table and translating the per-minute visual labels of the visual entity graph into the Cloud Inference API's format, creating five entity types: all entities seen during each minute, those seen for at least 5 seconds of that minute, at least 10 seconds, at least 30 seconds and at least 50 seconds. This lets us examine everything from all entity co-occurrences, including fleeting ones in which an entity that appears for an entire minute co-appears with an entity visible for just one second, through to entities that appear together for the majority of a minute.

The final results can be seen below:


TECHNICAL DETAILS

How did we create the lists above?

Once again we start by creating a temporary BigQuery table and using a sequence of queries to translate and reshape the BigQuery version of the Television News Visual Entity Dataset into the format used by the Inference API.

Once again we use FARM_FINGERPRINT() to generate unique numeric IDs for each "group" and in this case we define a "group" as the native resolution of the television news entity dataset: a station-minute.
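
As an optional sanity check (not part of the original workflow), you can confirm that these fingerprints are collision free, i.e. that there are exactly as many distinct group IDs as there are distinct station-minutes:

-- the two counts should be identical if FARM_FINGERPRINT yields one unique ID per station-minute
SELECT COUNT(DISTINCT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) )) fingerprints, COUNT(DISTINCT CONCAT(station,CAST(date AS STRING))) station_minutes FROM `gdelt-bq.gdeltv2.vgeg_iatv`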

First, we translate over the station and show names for each station-minute into our new temporary BigQuery table (set your query settings to append each query's results to your temporary table):

SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Station' data_name, station data_value, date start_time FROM `gdelt-bq.gdeltv2.vgeg_iatv`
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'ShowName' data_name, showName data_value, date start_time FROM `gdelt-bq.gdeltv2.vgeg_iatv`

Next we translate over the entities, creating five different entity types, reflecting entities that appeared for at least 1, 5, 10, 30 and 50 seconds, respectively, during the given minute of airtime. This supports queries that surface everything from fleeting appearances to entities that are onscreen for most of a minute.

SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Entity1S' data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Entity5S' data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity WHERE entity.numSeconds >= 5
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Entity10S' data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity WHERE entity.numSeconds >= 10
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Entity30S' data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity WHERE entity.numSeconds >= 30
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, 'Entity50S' data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity WHERE entity.numSeconds >= 50
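
If you prefer, the five queries above can be collapsed into a single append by cross joining each entity against the list of minimum durations. The sketch below assumes every entity has a numSeconds value of at least 1, so its 1-second tier matches the unfiltered query above:

-- emits one row per entity per duration tier it satisfies, labeled Entity1S through Entity50S
SELECT FARM_FINGERPRINT( CONCAT(station,CAST(date AS STRING)) ) group_id, CONCAT('Entity',CAST(threshold AS STRING),'S') data_name, date start_time, entity.name data_value FROM `gdelt-bq.gdeltv2.vgeg_iatv`, UNNEST(entities) entity, UNNEST([1,5,10,30,50]) threshold WHERE entity.numSeconds >= threshold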

Next we compile a list of the fields in our new temporary table.

SELECT data_name, count(1) cnt FROM `[YOURTEMPBIGQUERYTABLE]` group by data_name order by data_name asc
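
As a hypothetical extra check before loading, you can also verify that none of the seven expected field names are missing from the temporary table:

-- returns zero rows if every expected data_name made it into the temporary table
SELECT expected FROM UNNEST(['Station','ShowName','Entity1S','Entity5S','Entity10S','Entity30S','Entity50S']) expected WHERE expected NOT IN (SELECT DISTINCT data_name FROM `[YOURTEMPBIGQUERYTABLE]`)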

And instruct the Cloud Inference API to load the dataset from BigQuery.

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets -d '{
  "name":"gdeltbq_vgegiatv_v1",
  "data_names": [
    "Entity1S",
    "Entity5S",
    "Entity10S",
    "Entity30S",
    "Entity50S",
    "Station",
    "ShowName"
  ],
  "data_sources": [
    { "uri":"[YOURTEMPBIGQUERYTABLE]" }
  ]
}'

And check on its load status.

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" https://infer.googleapis.com/v1/projects/[YOURPROJECTID]/datasets

Next we determine the start and stop dates of the dataset from our temporary BigQuery table.

SELECT min(start_time),max(start_time) FROM `[YOURTEMPBIGQUERYTABLE]`
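
If you'd like the bounds preformatted for the day-loop commands below, a small variant (assuming start_time is a TIMESTAMP column) returns them as YYYYMMDD strings:

-- formats the first and last day as YYYYMMDD for use with infer_run_day.pl
SELECT FORMAT_TIMESTAMP('%Y%m%d', MIN(start_time)) first_day, FORMAT_TIMESTAMP('%Y%m%d', MAX(start_time)) last_day FROM `[YOURTEMPBIGQUERYTABLE]`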

Now download the PERL script and JSON query template and modify the JSON query file to include your GCP Project ID and change the two references to the Cloud Inference API dataset to the name of the dataset you created above.

Then run the PERL script once for each entity type; each run calls the Inference API for every day in the range.

time ./infer_run_day.pl 20100715 20191130 Entity1S VGEG-TVIA-1SECOND.TXT
time ./infer_run_day.pl 20100715 20191130 Entity5S VGEG-TVIA-5SECOND.TXT
time ./infer_run_day.pl 20100715 20191130 Entity10S VGEG-TVIA-10SECOND.TXT
time ./infer_run_day.pl 20100715 20191130 Entity30S VGEG-TVIA-30SECOND.TXT
time ./infer_run_day.pl 20100715 20191130 Entity50S VGEG-TVIA-50SECOND.TXT

Internally, the PERL script is about as simple as it gets, since the Inference API does all of the actual work! The PERL script's job is simply to cycle through the days between the start and end dates and issue a query for each day to the Inference API. It is the Inference API that, in the space of a fraction of a second, examines the enormous archive of visual labels describing a decade of television news coverage and determines the ones that are most significant to the given day.