The inaugural Television News Entity Graph processed the closed captioning of each broadcast on CNN, MSNBC, Fox News and the evening news broadcasts of ABC, CBS and NBC from 2009 to 2019 from the Internet Archive's Television News Archive using Google's Cloud Natural Language API to identify all of the entities mentioned over the past decade of television news reporting on those stations, applying coreference resolution and contextual disambiguation to construct a rich chronology of the topics and things discussed on the news.
Using the same approach we used in October to construct a master list of the top "well known" entities across online news (ie those with MID codes), we've repeated the same process for television entities, compiling a histogram of the top entities that have MID codes.
Of the 16,215,630 total station-minutes in the dataset, there were 410,880,960 total entity annotations, of which 60,654,684 (14.7%) had a MID code. Of the entities with a MID code there were 713,889 distinct name variations resolving to 372,776 distinct MID codes.
The overall "type" distribution of all entities is:
The entities with MID codes fall into seven "types" as seen below:
You can see the top 15 entries below. Interestingly, unlike a literal search on the Television Explorer for "Trump" versus "Obama", which yields 2,680,791 mentions of Trump to just 1,445,809 mentions of Obama, the chart below shows Obama being mentioned more often than Trump and records just 39% the number of Trump mentions found by the Television Explorer. A closer inspection of the data suggests the primary reason for this is that the all-capitalized nature of closed captioning caused the Cloud Natural Language API to be unable to distinguish references to just "TRUMP" from the verb "trump" in order to understand it as a mention of the president, whereas the Television Explorer simply conducts a literal string match to find all of the mentions of the president's last name in a case-insensitive manner. This suggests that future work on automatic capitalization will be critical to ensuring the API's ability to fully understand name references.
The complete ranked list of 372,776 distinct MIDs is available as a UTF8 CSV file:
Compiling the dataset above took just 10 seconds to process 13.4GB into the final dataset.
SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid, APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type, APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl, avg(avgSalience) avgsalience, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities where entities.mid is not null group by entities.mid order by Count desc
Computing the number of distinct entity names can be achieved with:
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null
While the number of distinct MID codes is available through:
SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null
Computing the overall top types is as simple as:
SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity group by type order by count
While computing the top types for those entities with MID codes requires only the addition of a simple filter:
SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null group by type order by count
Once again, BigQuery makes it trivial to explore these vast entity datasets!