Last week we released the Television News Global Entity Graph 2.0, a compilation of more than 477 million entity annotations resulting from using Google's Cloud Natural Language API to read through the textual transcripts of more than 300,000 broadcasts over the past decade. For those interested in merging these textual annotations with the visual annotations of the Visual Global Entity Graph 2.0, we've released a cross-walk of all 69,836 distinct entities with MID codes that were identified by the Cloud Natural Language API in the transcripts of broadcasts that have been processed to date by the Visual Global Entity Graph.
The table below shows an extract of a few entries from the list. Each row represents one distinct MID code with the first column containing the MID and the second column containing the most common human-readable label assigned to that MID (since multiple references can resolve to single MID like "Donald Trump", "Donald J. Trump", "President Trump", etc all resolving to a single MID code for Donald Trump).
/m/086sx_ | AARON SWARTZ |
/m/07x31y | AARON TAYLOR |
/m/010x2dgd | AARON TOPPEN |
/g/11fhym26gn | AARON TREJO |
/m/011p3qb4 | AARON VAUGHN |
/m/08k8l7 | AARON WARD |
/m/0bx_l4r | AARON WILLIAMS |
/m/010g6b15 | AARON WILLIAMSON |
/g/11f0mckdxf | AARON WILLIS |
/g/11gg_x6gwb | AARON ZEBLEY |
/g/11f00d365b | AARON ZELINSKY |
/m/04l0nj | AARON'S |
/m/023_22 | AARP |
/m/03cl9wp | AARP MAGAZINE |
You can download the complete spreadsheet below:
TECHNICAL DETAILS
Compiling this spreadsheet took just a single SQL query in BigQuery:
SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and iaShowId in (SELECT distinct(iaShowId) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`) group by entity.mid order by name asc
To compile the list of textual MID codes from the Natural Language API that also appear as visual MID codes from the Cloud Video API, you can explore using this query:
SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and DATE(date) = "2020-02-01" and entity.mid in (SELECT distinct(entity.mid) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`, unnest(entities) entity where length(entity.mid) > 1) group by entity.mid order by name asc
To sort entities by the number of times they appear:
SELECT entity.mid, APPROX_TOP_COUNT(entity.name, 1)[OFFSET(0)].value name, count(1) count FROM `gdelt-bq.gdeltv2.gegv2_iatv`, unnest(entities) entity WHERE length(entity.mid) > 1 and iaShowId in (SELECT distinct(iaShowId) FROM `gdelt-bq.gdeltv2.vgegv2_iatv`) group by entity.mid order by count desc