GEG: Master Ranked List Of 13.9 Million Unique Entities From 103 Million English News Articles 2016-2019

The inaugural GDELT Global Entity Graph (GEG) contains more than 11.4 billion entity annotations from over 103 million English-language articles published 2016-2019, annotated through Google's Cloud Natural Language API. What are the most common entities found in global English-language online news coverage of the past three years?

Out of the 11,416,958,078 total entity annotations, 1,584,550,381 (13.9%) have a non-null MID. Among the annotations with a MID code, there were 27,833,754 distinct entity names resolving to 13,892,261 distinct MID entity IDs (all name variants of an entity resolve to the same MID).
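The proportions quoted above can be checked directly from the raw counts. The short Python sketch below only restates the figures given in this post; the variable names are illustrative and not part of any GDELT tooling.

```python
# Figures quoted in the paragraph above.
total_annotations = 11_416_958_078   # all entity annotations
with_mid = 1_584_550_381             # annotations with a non-null MID
distinct_names = 27_833_754          # distinct entity names among those
distinct_mids = 13_892_261           # distinct MID entity IDs

# Share of annotations that carry a MID, as a percentage.
pct_with_mid = 100 * with_mid / total_annotations

# Average number of distinct name variants per MID.
names_per_mid = distinct_names / distinct_mids

print(f"{pct_with_mid:.1f}% of annotations have a MID")
print(f"{names_per_mid:.2f} distinct names per MID on average")
```

On average, each MID is reached through roughly two distinct names.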

The entities with MID codes fall into eight "types" as seen below:

Type	Count
PERSON	6332329
LOCATION	2849856
ORGANIZATION	2196649
OTHER	1326627
WORK_OF_ART	795123
EVENT	253562
CONSUMER_GOOD	136813
UNKNOWN	1302
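As a quick consistency check, the per-type counts in the table above add up to exactly the 13,892,261 distinct MIDs, which suggests each MID is counted once under a single type. The sketch below reproduces that sum and prints each type's share; the dictionary is simply the table retyped, and the variable names are illustrative.

```python
# Per-type counts copied from the table above.
type_counts = {
    "PERSON": 6_332_329,
    "LOCATION": 2_849_856,
    "ORGANIZATION": 2_196_649,
    "OTHER": 1_326_627,
    "WORK_OF_ART": 795_123,
    "EVENT": 253_562,
    "CONSUMER_GOOD": 136_813,
    "UNKNOWN": 1_302,
}

# Total across all eight types.
total = sum(type_counts.values())
print(f"Total: {total:,}")

# Print each type with its count and share of the total, largest first.
for entity_type, count in sorted(type_counts.items(), key=lambda kv: kv[1], reverse=True):
    share = 100 * count / total
    print(f"{entity_type:<14}{count:>10,}{share:>7.1f}%")
```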

For each entity, its most common name, MID, most common type, most common Wikipedia URL, average salience and total number of appearances were computed. Since the Cloud Natural Language API recognizes many alternative and variant names for entities, the 11.4 billion entity annotations were collapsed by MID code (entities without a MID code were excluded). For each MID code, a list of every textual name associated with that MID was compiled and the most common one selected. The same process was applied to its type (since a name can take on different type roles in different contexts) and to its Wikipedia URL. The average salience of all mentions of the entity across the 103 million articles was also recorded, allowing entities to be examined by how "important" they are on average to the documents they appear in, while the total count reflects how often the entity appeared.
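The collapsing step described above can be illustrated on a handful of records. This is a minimal Python sketch, not the method used to build the dataset: the `mentions` list and its salience values are made up for illustration, `collapse_by_mid` is a hypothetical helper, and it uses exact counts via `collections.Counter`.

```python
from collections import Counter, defaultdict

# Hypothetical toy mentions: (name, mid, type, wikipedia_url, salience).
# Salience values are invented for illustration only.
US_URL = "https://en.wikipedia.org/wiki/United_States"
mentions = [
    ("U.S.", "/m/09c7w0", "LOCATION", US_URL, 0.02),
    ("United States", "/m/09c7w0", "LOCATION", US_URL, 0.01),
    ("U.S.", "/m/09c7w0", "ORGANIZATION", US_URL, 0.03),
    ("Donald Trump", "/m/0cqt90", "PERSON",
     "https://en.wikipedia.org/wiki/Donald_Trump", 0.10),
    ("Some Unlinked Name", None, "OTHER", None, 0.05),  # no MID: excluded
]

def most_common(values):
    """Return the single most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]

def collapse_by_mid(mentions):
    """Group mentions by MID and summarize each group as one row."""
    groups = defaultdict(list)
    for mention in mentions:
        if mention[1] is not None:        # skip entities without a MID
            groups[mention[1]].append(mention)

    rows = []
    for mid, group in groups.items():
        rows.append({
            "entity": most_common(m[0] for m in group),
            "mid": mid,
            "type": most_common(m[2] for m in group),
            "wikipediaurl": most_common(m[3] for m in group),
            "avgsalience": sum(m[4] for m in group) / len(group),
            "count": len(group),
        })
    # Rank by number of appearances, most frequent first.
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows

rows = collapse_by_mid(mentions)
for row in rows:
    print(row)
```

In this toy input the three mentions sharing `/m/09c7w0` collapse into one row named "U.S." with type LOCATION, even though one mention used a different name and another a different type.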

The top 10 entities appear below:

Entity	MID	Type	WikipediaURL	AvgSalience	Count
U.S.	/m/09c7w0	LOCATION	https://en.wikipedia.org/wiki/United_States	0.014269125	31447173
Donald Trump	/m/0cqt90	PERSON	https://en.wikipedia.org/wiki/Donald_Trump	0.102800737	12805048
UK	/m/07ssc	LOCATION	https://en.wikipedia.org/wiki/United_Kingdom	0.010535596	8679941
The Associated Press	/m/0cv_2	ORGANIZATION	https://en.wikipedia.org/wiki/Associated_Press	0.012310849	7978374
Europe	/m/02j9z	LOCATION	https://en.wikipedia.org/wiki/Europe	0.005559554	7367460
New York	/m/02_286	LOCATION	https://en.wikipedia.org/wiki/New_York_City	0.006931887	6991149
Republican	/m/07wbk	ORGANIZATION	https://en.wikipedia.org/wiki/Republican_Party_(United_States)	0.010974562	6672985
Twitter	/m/0289n8t	OTHER	https://en.wikipedia.org/wiki/Twitter	0.004768608	6662767
China	/m/0d05w3	LOCATION	https://en.wikipedia.org/wiki/China	0.024354636	6460596
Democratic	/m/0d075m	PERSON	https://en.wikipedia.org/wiki/Democratic_Party_(United_States)	0.011383625	6126569

The complete list of 13,892,261 distinct MIDs is available as a UTF-8 CSV file:

Master List Of MIDs In GEG-GCNLAPI 2016-2019. (346MB compressed / 937MB uncompressed).
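Once downloaded and decompressed, a file like this can be explored with Python's standard `csv` module. The sketch below is written under assumptions that should be checked against the actual file: that it is comma-separated, and that its six columns follow the order of the top-10 table above (entity, MID, type, Wikipedia URL, average salience, count). Whether the file has a header row is not stated here, so lines whose numeric fields do not parse are skipped. `top_by_salience` is a hypothetical helper name.

```python
import csv
import io

def top_by_salience(lines, min_count=1000, n=5):
    """Return the n highest-salience rows among entities seen at least min_count times."""
    rows = []
    for record in csv.reader(lines):
        if len(record) != 6:
            continue  # unexpected column count: skip
        entity, mid, etype, url, salience, count = record
        try:
            row = (entity, mid, etype, url, float(salience), int(count))
        except ValueError:
            continue  # header row or malformed line
        if row[5] >= min_count:
            rows.append(row)
    # Sort by average salience, highest first.
    rows.sort(key=lambda r: r[4], reverse=True)
    return rows[:n]

# Small in-memory sample built from three rows of the top-10 table above.
sample = io.StringIO(
    "Entity,MID,Type,WikipediaURL,AvgSalience,Count\n"
    "U.S.,/m/09c7w0,LOCATION,https://en.wikipedia.org/wiki/United_States,0.014269125,31447173\n"
    "Donald Trump,/m/0cqt90,PERSON,https://en.wikipedia.org/wiki/Donald_Trump,0.102800737,12805048\n"
    "China,/m/0d05w3,LOCATION,https://en.wikipedia.org/wiki/China,0.024354636,6460596\n"
)

top = top_by_salience(sample, min_count=1_000_000, n=2)
for entity, mid, etype, url, salience, count in top:
    print(entity, salience, count)
```

For the real file, the same function accepts a file object opened with `open(path, newline="", encoding="utf-8")`. Because the function reads line by line, the file is never held in memory as raw text, though the rows that pass the `min_count` filter are.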

TECHNICAL DETAILS

Compiling the dataset above took just a single SQL query and 153 seconds to process 382GB of annotations into the final dataset.

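-- Collapse all entity annotations by MID and rank by total appearances.
SELECT
  -- Most common name, type and Wikipedia URL seen for each MID (approximate).
  APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value AS entity,
  entities.mid AS mid,
  APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value AS type,
  APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value AS wikipediaurl,
  AVG(avgSalience) AS avgsalience,
  COUNT(1) AS count
FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, UNNEST(entities) AS entities
-- Entities without a MID code are excluded.
WHERE entities.mid IS NOT NULL
GROUP BY entities.mid
ORDER BY count DESC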

The GDELT Project
