Television News GEG: Master Ranked List Of Unique Spoken Entities From Television News 2009-2019

Kalev Leetaru

5 years ago

The inaugural Television News Entity Graph processed the closed captioning of each broadcast on CNN, MSNBC, Fox News and the evening news broadcasts of ABC, CBS and NBC from 2009 to 2019 from the Internet Archive's Television News Archive using Google's Cloud Natural Language API to identify all of the entities mentioned over the past decade of television news reporting on those stations, applying coreference resolution and contextual disambiguation to construct a rich chronology of the topics and things discussed on the news.

Using the same approach we used in October to construct a master list of the top "well known" entities across online news (ie those with MID codes), we've repeated the same process for television entities, compiling a histogram of the top entities that have MID codes.

Of the 16,215,630 total station-minutes in the dataset, there were 410,880,960 total entity annotations, of which 60,654,684 (14.7%) had a MID code. Of the entities with a MID code there were 713,889 distinct name variations resolving to 372,776 distinct MID codes.

The overall "type" distribution of all entities is:

Type	Count
OTHER	188261768
PERSON	77750586
LOCATION	36585133
ORGANIZATION	36579842
NUMBER	27489311
EVENT	20991363
WORK_OF_ART	12542339
CONSUMER_GOOD	6824047
DATE	2211849
PRICE	1517236
PHONE_NUMBER	112688
ADDRESS	14798

The entities with MID codes fall into seven "types" as seen below:

Type	Count
LOCATION	19595314
PERSON	18524288
ORGANIZATION	13677013
OTHER	5302997
WORK_OF_ART	1669310
EVENT	1215466
CONSUMER_GOOD	670296

You can see the top 15 entries below. Interestingly, unlike a literal search on the Television Explorer for "Trump" versus "Obama", which yields 2,680,791 mentions of Trump to just 1,445,809 mentions of Obama, the chart below shows Obama being mentioned more often than Trump and records just 39% the number of Trump mentions found by the Television Explorer. A closer inspection of the data suggests the primary reason for this is that the all-capitalized nature of closed captioning caused the Cloud Natural Language API to be unable to distinguish references to just "TRUMP" from the verb "trump" in order to understand it as a mention of the president, whereas the Television Explorer simply conducts a literal string match to find all of the mentions of the president's last name in a case-insensitive manner. This suggests that future work on automatic capitalization will be critical to ensuring the API's ability to fully understand name references.

Entity	MID	Type	WikipediaURL	AvgSalience	Count
U.S.	/m/09c7w0	LOCATION	https://en.wikipedia.org/wiki/United_States	0.005993	3949669
PRESIDENT OBAMA	/m/02mjmr	PERSON	https://en.wikipedia.org/wiki/Barack_Obama	0.134543	1280875
REPUBLICAN	/m/07wbk	ORGANIZATION	https://en.wikipedia.org/wiki/Republican_Party_(United_States)	0.009171	1239768
DONALD TRUMP	/m/0cqt90	PERSON	https://en.wikipedia.org/wiki/Donald_Trump	0.206801	1045748
DEMOCRATS	/m/0d075m	ORGANIZATION	https://en.wikipedia.org/wiki/Democratic_Party_(United_States)	0.004926	1017736
WHITE HOUSE	/m/081sq	LOCATION	https://en.wikipedia.org/wiki/White_House	0.003894	814832
CONGRESS	/m/07t31	ORGANIZATION	https://en.wikipedia.org/wiki/United_States_Congress	0.009805	506107
HILLARY CLINTON	/m/0d06m5	PERSON	https://en.wikipedia.org/wiki/Hillary_Clinton	0.041871	486391
NEW YORK	/m/02_286	LOCATION	https://en.wikipedia.org/wiki/New_York_City	0.003574	479174
RUSSIA	/m/06bnz	LOCATION	https://en.wikipedia.org/wiki/Russia	0.005853	470465
SENATE	/m/07t58	ORGANIZATION	https://en.wikipedia.org/wiki/United_States_Senate	0.011019	452868
UK	/m/07ssc	LOCATION	https://en.wikipedia.org/wiki/United_Kingdom	0.006587	394611
CNN	/m/0gsgr	ORGANIZATION	https://en.wikipedia.org/wiki/CNN	0.002264	386340
FBI	/m/02_1m	ORGANIZATION	https://en.wikipedia.org/wiki/Federal_Bureau_of_Investigation	0.014512	320678
WASHINGTON	/m/081yw	LOCATION	https://en.wikipedia.org/wiki/Washington_(state)	0.001002	319718

The complete ranked list of 372,776 distinct MIDs is available as a UTF8 CSV file:

Master List of MIDs In GEG-IATV 2009-2019. (10MB).

TECHNICAL DETAILS

Compiling the dataset above took just 10 seconds to process 13.4GB into the final dataset.

SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity,
entities.mid mid,
APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type,
APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl,
avg(avgSalience) avgsalience,
count(1) count 
 FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) entities
where entities.mid is not null
group by entities.mid order by Count desc

Computing the number of distinct entity names can be achieved with:

SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null

While the number of distinct MID codes is available through:

SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null

Computing the overall top types is as simple as:

SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity group by type order by count

While computing the top types for those entities with MID codes requires only the addition of a simple filter:

SELECT entity.type, count(1) count FROM `gdelt-bq.gdeltv2.geg_iatv`, unnest(entities) as entity where entity.mid is not null group by type order by count

Once again, BigQuery makes it trivial to explore these vast entity datasets!