Television News GEG: Master Ranked List Of Unique Spoken Entities From Television News 2009-2019

The inaugural Television News Entity Graph processed the closed captioning of each broadcast on CNN, MSNBC, Fox News and the evening news broadcasts of ABC, CBS and NBC from 2009 to 2019 from the Internet Archive's Television News Archive using Google's Cloud Natural Language API to identify all of the entities mentioned over the past decade of television news reporting on those stations, applying coreference resolution and contextual disambiguation to construct a rich chronology of the topics and things discussed on the news.

Using the same approach we used in October to construct a master list of the top "well known" entities across online news (ie those with MID codes), we've repeated the same process for television entities, compiling a histogram of the top entities that have MID codes.

Of the 16,215,630 total station-minutes in the dataset, there were 410,880,960 total entity annotations, of which 60,654,684 (14.7%) had a MID code. Of the entities with a MID code there were 713,889 distinct name variations resolving to 372,776 distinct MID codes.

The overall "type" distribution of all entities is:

Type Count
OTHER 188261768
PERSON 77750586
LOCATION 36585133
NUMBER 27489311
EVENT 20991363
WORK_OF_ART 12542339
DATE 2211849
PRICE 1517236

The entities with MID codes fall into seven "types" as seen below:

Type Count
LOCATION 19595314
PERSON 18524288
OTHER 5302997
WORK_OF_ART 1669310
EVENT 1215466

You can see the top 15 entries below. Interestingly, unlike a literal search on the Television Explorer for "Trump" versus "Obama", which yields 2,680,791 mentions of Trump to just 1,445,809 mentions of Obama, the chart below shows Obama being mentioned more often than Trump and records just 39% the number of Trump mentions found by the Television Explorer. A closer inspection of the data suggests the primary reason for this is that the all-capitalized nature of closed captioning caused the Cloud Natural Language API to be unable to distinguish references to just "TRUMP" from the verb "trump" in order to understand it as a mention of the president, whereas the Television Explorer simply conducts a literal string match to find all of the mentions of the president's last name in a case-insensitive manner. This suggests that future work on automatic capitalization will be critical to ensuring the API's ability to fully understand name references.

Entity MID Type WikipediaURL AvgSalience Count
U.S. /m/09c7w0 LOCATION 0.005993 3949669
PRESIDENT OBAMA /m/02mjmr PERSON 0.134543 1280875
REPUBLICAN /m/07wbk ORGANIZATION 0.009171 1239768
DONALD TRUMP /m/0cqt90 PERSON 0.206801 1045748
DEMOCRATS /m/0d075m ORGANIZATION 0.004926 1017736
WHITE HOUSE /m/081sq LOCATION 0.003894 814832
CONGRESS /m/07t31 ORGANIZATION 0.009805 506107
HILLARY CLINTON /m/0d06m5 PERSON 0.041871 486391
NEW YORK /m/02_286 LOCATION 0.003574 479174
RUSSIA /m/06bnz LOCATION 0.005853 470465
SENATE /m/07t58 ORGANIZATION 0.011019 452868
UK /m/07ssc LOCATION 0.006587 394611
CNN /m/0gsgr ORGANIZATION 0.002264 386340
FBI /m/02_1m ORGANIZATION 0.014512 320678
WASHINGTON /m/081yw LOCATION 0.001002 319718

The complete ranked list of 372,776 distinct MIDs is available as a UTF8 CSV file:


Compiling the dataset above took just 10 seconds to process 13.4GB into the final dataset.

SELECT APPROX_TOP_COUNT(, 1)[OFFSET(0)].value entity,
entities.mid mid,
APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type,
APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl,
avg(avgSalience) avgsalience,
count(1) count 
 FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities
where entities.mid is not null
group by entities.mid order by Count desc

Computing the number of distinct entity names can be achieved with:

SELECT count(distinct( FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null

While the number of distinct MID codes is available through:

SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null

Computing the overall top types is as simple as:

SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity group by type order by count

While computing the top types for those entities with MID codes requires only the addition of a simple filter:

SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null group by type order by count

Once again, BigQuery makes it trivial to explore these vast entity datasets!