Television News Global Entity Graph 2.0: Master VGEG 2.0 Captioning Entity List

Last week we released the Television News Global Entity Graph 2.0, a compilation of more than 477 million entity annotations resulting from using Google's Cloud Natural Language API to read through the textual transcripts of more than 300,000 broadcasts over the past decade. For those interested in merging these textual annotations with the visual annotations of the Visual Global Entity Graph 2.0, we've released a cross-walk of all 69,836 distinct entities with MID codes that were identified by the Cloud Natural Language API in the transcripts of broadcasts that have been processed to date by the Visual Global Entity Graph.

The table below shows an extract of a few entries from the list. Each row represents one distinct MID code with the first column containing the MID and the second column containing the most common human-readable label assigned to that MID (since multiple references can resolve to single MID like "Donald Trump", "Donald J. Trump", "President Trump", etc all resolving to a single MID code for Donald Trump).

/m/086sx_ AARON SWARTZ
/m/07x31y AARON TAYLOR
/m/010x2dgd AARON TOPPEN
/g/11fhym26gn AARON TREJO
/m/011p3qb4 AARON VAUGHN
/m/08k8l7 AARON WARD
/m/0bx_l4r AARON WILLIAMS
/m/010g6b15 AARON WILLIAMSON
/g/11f0mckdxf AARON WILLIS
/g/11gg_x6gwb AARON ZEBLEY
/g/11f00d365b AARON ZELINSKY
/m/04l0nj AARON'S
/m/023_22 AARP
/m/03cl9wp AARP MAGAZINE

You can download the complete spreadsheet below:

TECHNICAL DETAILS

Compiling this spreadsheet took just a single SQL query in BigQuery:

SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and iaShowId in (SELECT distinct(iaShowId) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`) group by entity.mid order by name asc

To compile the list of textual MID codes from the Natural Language API that also appear as visual MID codes from the Cloud Video API, you can explore using this query:

SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and DATE(date) = "2020-02-01" and entity.mid in (SELECT distinct(entity.mid) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, unnest(entities) entity where length(entity.mid) > 1) group by entity.mid order by name asc

To sort entities by the number of times they appear:

SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name, count(1) count FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and iaShowId in (SELECT distinct(iaShowId) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`) group by entity.mid order by count desc