GEG: Master Ranked List Of 23.9 Million Unique Entities From 200 Million English News Articles 2016-2021

Kalev Leetaru

4 years ago

Interest in structured knowledge graphs is surging as organizations search for more powerful ways of understanding the world. Each day GDELT annotates a small random sample of global online news coverage through Google's Cloud Natural Language API, recording the resulting entities in the Global Entity Graph (GEG). Today the GEG contains more than 21 billion entity annotations of 412 million distinct entity names, spanning more than 200 million articles in 11 languages from 2016 to the present, and it is updated every minute.

The Cloud Natural Language API organizes each entity into one of 13 classifications based on its usage in the article. Which are the most and least common types?

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc

Using the simple SQL query above we get the table below. The catch-all OTHER category is the largest by a wide margin. Among the specific types, people are the most common kind of entity mentioned in the news, followed by locations, organizations, events, numbers and works of art (typically things like publications and other reports being announced in the news).

Row	type	cnt
1	OTHER	10080245264
2	PERSON	3460102811
3	LOCATION	2220884897
4	ORGANIZATION	1839565758
5	EVENT	1105138233
6	NUMBER	1045114567
7	WORK_OF_ART	618669861
8	CONSUMER_GOOD	409501218
9	DATE	191402153
10	PRICE	63708896
11	PHONE_NUMBER	5942363
12	ADDRESS	4918115
13	UNKNOWN	1394571
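
To put those raw counts in proportion, here is a minimal Python sketch that converts the table above into percentage shares of all mentions. The counts are copied from the table; the helper name `type_shares` is ours and purely illustrative.

```python
# Mention counts per entity type, copied from the table above.
TYPE_COUNTS = {
    "OTHER": 10_080_245_264,
    "PERSON": 3_460_102_811,
    "LOCATION": 2_220_884_897,
    "ORGANIZATION": 1_839_565_758,
    "EVENT": 1_105_138_233,
    "NUMBER": 1_045_114_567,
    "WORK_OF_ART": 618_669_861,
    "CONSUMER_GOOD": 409_501_218,
    "DATE": 191_402_153,
    "PRICE": 63_708_896,
    "PHONE_NUMBER": 5_942_363,
    "ADDRESS": 4_918_115,
    "UNKNOWN": 1_394_571,
}


def type_shares(counts):
    """Return (type, percentage of all mentions) pairs, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100 * cnt / total) for name, cnt in ranked]


for name, share in type_shares(TYPE_COUNTS):
    print(f"{name:<14}{share:6.2f}%")
```

Run as-is, it lists OTHER first at a little under half of all mentions, with PERSON next at roughly a sixth.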

The most valuable kinds of entity mentions from the standpoint of machine understanding and automated reasoning are entities for which structured information is available. A reference to a person named "Joe Biden" in a news article is indistinguishable from any other person name without the additional information that he is the current president of the United States. Moreover, without additional information, the names "Joe Biden," "Joseph Biden" and "Joseph Robinette Biden Jr." are not connected to one another.

What is needed is some kind of unique universal identifier that identifies key entities, linking their name variants under a common identifier and attaching it to a wealth of structured descriptors about that entity and its relationship to other entities, enabling complex reasoning.

Google's Natural Language API offers this exact capability, automatically returning a unique ID code (MID/GID) for entities it knows, along with the URL of that entity's Wikipedia page if it exists. What percentage of those 412 million distinct entities in the GEG have assigned codes? (As discussed in a moment, these ID codes can be either MID or GID codes, but for simplicity we'll use "MID" to refer to both types here.)

SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity 
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

Using the queries above we see that of the GEG's 412,418,817 distinct entities, 44,543,323 (10.8%) have assigned MID/GID codes.
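
For readers who think more easily in code than in SQL, the sketch below mirrors the two queries on a handful of in-memory records and then recomputes the percentage from the totals reported above. The sample records are invented for illustration (only the two ID codes are borrowed from the top-15 table later in this post), and `name_coverage` is a hypothetical helper, not part of any GDELT tooling.

```python
# Illustrative stand-ins for unnested GEG entity records (not real rows).
SAMPLE = [
    {"name": "Donald Trump", "mid": "/m/0cqt90"},
    {"name": "Donald Trump", "mid": "/m/0cqt90"},
    {"name": "China", "mid": "/m/0d05w3"},
    {"name": "the committee", "mid": None},
    {"name": "officials", "mid": None},
]


def name_coverage(records):
    """Return (distinct names, distinct names that carry an ID code)."""
    all_names = {r["name"] for r in records}
    names_with_id = {r["name"] for r in records if r["mid"] is not None}
    return len(all_names), len(names_with_id)


total, with_id = name_coverage(SAMPLE)
print(total, with_id)

# Totals reported in the text above.
GEG_DISTINCT_NAMES = 412_418_817
GEG_NAMES_WITH_ID = 44_543_323
print(f"{100 * GEG_NAMES_WITH_ID / GEG_DISTINCT_NAMES:.1f}%")
```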

Of course, some entities may be more common than others, so the queries below count the total number of entity mentions and what percent of those have MID codes:

SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity 
SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

Of the 21,046,588,707 total entity mentions in the GEG, 2,680,674,189 (12.7%) had MID codes. Entities with ID codes thus make up a minority of all entity mentions across the news each day, which is unsurprising given the range of different entity types in the GEG. What if we break this down by entity type?

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null group by entity.type order by cnt desc

The final breakdown below shows that 50% of UNKNOWN mentions have MID codes, as do 44% of LOCATION mentions, 37% of ORGANIZATION mentions, 19% of PERSON mentions, 12% of WORK_OF_ART mentions, 7% of CONSUMER_GOOD mentions, 4% of EVENT mentions and 2% of OTHER mentions. The NUMBER, DATE, PRICE, PHONE_NUMBER and ADDRESS types have none.

Type	With MID	Total	% With MID
OTHER	226,529,516	10,080,245,264	2.25
PERSON	654,485,433	3,460,102,811	18.92
LOCATION	973,468,295	2,220,884,897	43.83
ORGANIZATION	680,416,623	1,839,565,758	36.99
EVENT	40,927,168	1,105,138,233	3.70
NUMBER	0	1,045,114,567	0.00
WORK_OF_ART	74,243,538	618,669,861	12.00
CONSUMER_GOOD	29,991,226	409,501,218	7.32
DATE	0	191,402,153	0.00
PRICE	0	63,708,896	0.00
PHONE_NUMBER	0	5,942,363	0.00
ADDRESS	0	4,918,115	0.00
UNKNOWN	693,961	1,394,571	49.76

Recall that the Natural Language API recognizes an array of name variants for known entities. What is the ratio of recognized names to unique entity identifiers?

SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

The queries above show that the GEG contains 44,544,537 unique names resolving to 23,902,430 unique MID codes, capturing the considerable breadth of the API's ability to recognize name variants.
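
The two counts above answer the question only implicitly, so here is the ratio worked out as a short Python sketch. It uses only the figures just quoted and yields a simple average: it says nothing about how name variants are distributed across entities, and it assumes each name is counted once even if it resolves to more than one ID.

```python
# Figures reported in the text above.
UNIQUE_NAMES_WITH_ID = 44_544_537
UNIQUE_IDS = 23_902_430

# Average number of distinct recognized names per unique ID code.
names_per_id = UNIQUE_NAMES_WITH_ID / UNIQUE_IDS
print(f"{names_per_id:.2f} recognized names per unique ID on average")
```

That works out to a little under two recognized names per ID.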

Of course, the most interesting question is what those specific recognized entities are. The query below compiles all 23.9 million unique MID codes and for each returns its most common textual name as it appeared in news coverage, its most common type and Wikipedia URL, its average document-level salience and its total number of appearances. Note that an entity can have different types depending on its textual context: "Hillary Clinton announced today" yields a PERSON, while "Hillary Clinton's home is in" yields a LOCATION, and so on.

SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity,
  entities.mid mid,
  APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type,
  APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl,
  avg(avgSalience) avgsalience,
  count(1) count
  FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
  where entities.mid is not null
  group by entities.mid
  order by Count desc
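
To make the shape of that rollup concrete, here is a small Python sketch of the same grouping logic applied to a few in-memory records. The records and salience values are invented for illustration (only the ID codes come from the results table below), `rollup` is a hypothetical helper, and the sketch picks the exact most common value where the query uses BigQuery's approximate APPROX_TOP_COUNT. The Wikipedia URL column is omitted to keep it short.

```python
from collections import Counter, defaultdict

# Invented sample of unnested entity records; values are illustrative only.
RECORDS = [
    {"name": "U.S.", "mid": "/m/09c7w0", "type": "LOCATION", "avgSalience": 0.02},
    {"name": "United States", "mid": "/m/09c7w0", "type": "LOCATION", "avgSalience": 0.01},
    {"name": "U.S.", "mid": "/m/09c7w0", "type": "ORGANIZATION", "avgSalience": 0.03},
    {"name": "China", "mid": "/m/0d05w3", "type": "LOCATION", "avgSalience": 0.05},
    {"name": "officials", "mid": None, "type": "PERSON", "avgSalience": 0.04},
]


def most_common(values):
    """Return the single most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]


def rollup(records):
    """Group records by ID code and summarize each group, busiest first."""
    groups = defaultdict(list)
    for rec in records:
        if rec["mid"] is not None:  # same filter as "mid is not null"
            groups[rec["mid"]].append(rec)

    rows = []
    for mid, recs in groups.items():
        rows.append({
            "entity": most_common(r["name"] for r in recs),
            "mid": mid,
            "type": most_common(r["type"] for r in recs),
            "avgsalience": sum(r["avgSalience"] for r in recs) / len(recs),
            "count": len(recs),
        })
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows


for row in rollup(RECORDS):
    print(row)
```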

The complete list of 23,899,566 distinct MIDs and the values above are available as a UTF8 CSV file:

Master List Of MIDs In GEG-GCNLAPI 2016-2021. (586MB compressed).

You can see the top 15 below:

Row	entity	mid	type	wikipediaurl	avgsalience	count
1	U.S.	/m/09c7w0	LOCATION	https://en.wikipedia.org/wiki/United_States	0.015321233087389351	50081461
2	Donald Trump	/m/0cqt90	PERSON	https://en.wikipedia.org/wiki/Donald_Trump	0.09679160921134368	19058565
3	UK	/m/07ssc	LOCATION	https://en.wikipedia.org/wiki/United_Kingdom	0.01195910443517211	15134317
4	China	/m/0d05w3	LOCATION	https://en.wikipedia.org/wiki/China	0.024024387600363576	13163499
5	Europe	/m/02j9z	LOCATION	https://en.wikipedia.org/wiki/Europe	0.006135497649549499	12682037
6	AP	/m/0cv_2	ORGANIZATION	https://en.wikipedia.org/wiki/Associated_Press	0.01276531524191298	11333807
7	COVID-19	/g/11j2cc_qll	OTHER	https://en.wikipedia.org/wiki/Coronavirus_disease_2019	0.01605697602725304	11185535
8	New York	/m/02_286	LOCATION	https://en.wikipedia.org/wiki/New_York_City	0.008451004515370585	11127326
9	Twitter	/m/0289n8t	OTHER	https://en.wikipedia.org/wiki/Twitter	0.006084569065768032	11120386
10	Republican	/m/07wbk	ORGANIZATION	https://en.wikipedia.org/wiki/Republican_Party_(United_States)	0.010449243385157157	9825139
11	Democratic	/m/0d075m	PERSON	https://en.wikipedia.org/wiki/Democratic_Party_(United_States)	0.011473229277320482	9656437
12	Facebook	/m/02y1vz	OTHER	https://en.wikipedia.org/wiki/Facebook	0.015362068051782654	9543894
13	California	/m/01n7q	LOCATION	https://en.wikipedia.org/wiki/California	0.010135325367891837	9522092
14	Russia	/m/06bnz	LOCATION	https://en.wikipedia.org/wiki/Russia	0.017300212593793374	9193481
15	India	/m/03rk0	LOCATION	https://en.wikipedia.org/wiki/India	0.02032965158632092	8321803

Look closely and you'll notice that 14 of the top 15 have MID codes beginning with "/m/", indicating their Freebase heritage. Look more closely, however, and you'll see that COVID-19's code begins with "/g/", indicating its provenance as a Google Knowledge Graph entity. What percentage of entities in the GEG come from Freebase versus Google's Knowledge Graph?

SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc
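
The two LIKE filters above split identifiers purely by prefix. As a minimal sketch of the same split, the Python below applies it to the 15 ID codes from the top-15 table. The function name `id_source` and its labels are ours, chosen for illustration.

```python
from collections import Counter

# ID codes of the top 15 entities, copied from the table above.
TOP_15_IDS = [
    "/m/09c7w0", "/m/0cqt90", "/m/07ssc", "/m/0d05w3", "/m/02j9z",
    "/m/0cv_2", "/g/11j2cc_qll", "/m/02_286", "/m/0289n8t", "/m/07wbk",
    "/m/0d075m", "/m/02y1vz", "/m/01n7q", "/m/06bnz", "/m/03rk0",
]


def id_source(code):
    """Label an ID code by its prefix, as the LIKE filters do."""
    if code.startswith("/m/"):
        return "Freebase"
    if code.startswith("/g/"):
        return "Google Knowledge Graph"
    return "other"


tally = Counter(id_source(code) for code in TOP_15_IDS)
print(tally)
```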

The output of the two queries above can be seen in the table below, which breaks down the total unique MID codes by type (MID codes that have appeared as multiple types are counted under each type in which they appeared). In all, 20,156,803 (84%) of the uniquely identified entities in the GEG come from the Google Knowledge Graph rather than Freebase, showing how fast a knowledge graph must evolve to keep pace with an ever-changing world. In particular, this means that organizations cannot simply build a knowledge graph once and apply it unchanged from then on. Such static knowledge graphs will quickly age and miss new entities. Instead, knowledge graphs must be constantly updated.

Type	Has ID	MID	GID	%MID	%GID
OTHER	3,769,687	1,028,197	2,741,490	27.28	72.72
PERSON	12,617,449	2,148,550	10,468,899	17.03	82.97
LOCATION	5,650,830	1,054,504	4,596,326	18.66	81.34
ORGANIZATION	4,860,657	1,044,274	3,816,383	21.48	78.52
EVENT	687,358	194,575	492,783	28.31	71.69
WORK_OF_ART	1,702,192	577,265	1,124,927	33.91	66.09
CONSUMER_GOOD	701,026	252,263	448,763	35.98	64.02
UNKNOWN	66,459	50,900	15,559	76.59	23.41

At the same time, it is likely that some entities will be mentioned far more often than others. The queries below thus repeat the same analysis, but this time instead of counting unique MID codes, they count the total number of appearances of each entity. In other words, in the table above, "Donald Trump" would be counted as a single MID code under the PERSON category, despite his name appearing far more often in the news than most other names, while in the queries below every mention of his name counts.

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc

This yields a markedly different table, showing that the long tail of new entities absent from Freebase doesn't actually appear that often in the news. While 81% of LOCATION entities have Google Knowledge Graph codes, just 5% of all LOCATION entity mentions across the news seen by GDELT were among those Google Knowledge Graph entities. This makes intuitive sense: as new entities emerge it takes time for them to accumulate mentions, whereas entities that existed in Freebase have been part of the public conversation for longer.

Type	Has ID	MID	GID	%MID	%GID
OTHER	226,529,516	186,523,847	40,005,669	82.34	17.66
PERSON	654,485,433	509,726,906	144,758,527	77.88	22.12
LOCATION	973,468,295	921,706,340	51,761,955	94.68	5.32
ORGANIZATION	680,416,623	611,166,005	69,250,618	89.82	10.18
EVENT	40,927,168	37,069,242	3,857,926	90.57	9.43
WORK_OF_ART	74,243,538	57,144,383	17,099,155	76.97	23.03
CONSUMER_GOOD	29,991,226	25,107,021	4,884,205	83.71	16.29
UNKNOWN	693,961	651,035	42,926	93.81	6.19
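
To see the contrast between the two tables side by side, the sketch below recomputes the Knowledge Graph share for each type, once by unique ID codes and once by mentions. All counts are copied from the two tables above; the helper `gid_share` is a name of our own.

```python
# (Freebase "/m/" count, Knowledge Graph "/g/" count) per type,
# copied from the unique-ID table and the mentions table above.
UNIQUE_IDS = {
    "OTHER": (1_028_197, 2_741_490),
    "PERSON": (2_148_550, 10_468_899),
    "LOCATION": (1_054_504, 4_596_326),
    "ORGANIZATION": (1_044_274, 3_816_383),
    "EVENT": (194_575, 492_783),
    "WORK_OF_ART": (577_265, 1_124_927),
    "CONSUMER_GOOD": (252_263, 448_763),
    "UNKNOWN": (50_900, 15_559),
}
MENTIONS = {
    "OTHER": (186_523_847, 40_005_669),
    "PERSON": (509_726_906, 144_758_527),
    "LOCATION": (921_706_340, 51_761_955),
    "ORGANIZATION": (611_166_005, 69_250_618),
    "EVENT": (37_069_242, 3_857_926),
    "WORK_OF_ART": (57_144_383, 17_099_155),
    "CONSUMER_GOOD": (25_107_021, 4_884_205),
    "UNKNOWN": (651_035, 42_926),
}


def gid_share(freebase, knowledge_graph):
    """Percentage of the combined count that carries a "/g/" code."""
    return 100 * knowledge_graph / (freebase + knowledge_graph)


print(f"{'Type':<14}{'by ID':>8}{'by mention':>12}")
for entity_type in UNIQUE_IDS:
    by_id = gid_share(*UNIQUE_IDS[entity_type])
    by_mention = gid_share(*MENTIONS[entity_type])
    print(f"{entity_type:<14}{by_id:8.2f}{by_mention:12.2f}")
```

For every type in the tables, the Knowledge Graph share of mentions comes out well below its share of unique IDs.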

How might we test this theory? One simple way would be to limit our analysis to only news coverage published since the start of this year, focusing on entities being discussed right now:

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/g/%' group by entity.type order by cnt desc

This yields the table below in which Google Knowledge Graph entities account for a much larger portion of entity mentions, though still not a majority, offering a reminder that a small number of long-standing entities tends to dominate news coverage. It suggests that static knowledge graphs will capture a large portion of entity mentions, but that graphs age quickly, missing more mentions each day. Most importantly, it means that static knowledge graphs will only encode the world as it was rather than as it is.

Type	Has ID	MID	GID	%MID	%GID
OTHER	25,035,360	17,385,741	7,649,619	69.44	30.56
PERSON	68,018,501	45,639,843	22,378,658	67.10	32.90
LOCATION	98,406,356	91,802,549	6,603,807	93.29	6.71
ORGANIZATION	67,661,532	57,172,304	10,489,228	84.50	15.50
EVENT	3,721,561	3,242,638	478,923	87.13	12.87
WORK_OF_ART	6,819,170	4,572,723	2,246,447	67.06	32.94
CONSUMER_GOOD	3,667,290	2,765,229	902,061	75.40	24.60
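
As a rough check on how far the balance moved, the sketch below subtracts the all-time Knowledge Graph share of mentions from the 2021-only share for each type. The percentages are copied from the %GID columns of the two mention tables above (UNKNOWN is left out because it does not appear in the 2021 table), and `share_shift` is a helper name of our own.

```python
# %GID of mentions, copied from the all-time and 2021-only tables above.
GID_SHARE_ALL_TIME = {
    "OTHER": 17.66, "PERSON": 22.12, "LOCATION": 5.32, "ORGANIZATION": 10.18,
    "EVENT": 9.43, "WORK_OF_ART": 23.03, "CONSUMER_GOOD": 16.29,
}
GID_SHARE_2021 = {
    "OTHER": 30.56, "PERSON": 32.90, "LOCATION": 6.71, "ORGANIZATION": 15.50,
    "EVENT": 12.87, "WORK_OF_ART": 32.94, "CONSUMER_GOOD": 24.60,
}


def share_shift(before, after):
    """Percentage-point change for every type present in both tables."""
    return {t: after[t] - before[t] for t in after if t in before}


shift = share_shift(GID_SHARE_ALL_TIME, GID_SHARE_2021)
for entity_type, delta in sorted(shift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entity_type:<14}{delta:+6.2f} points")
```

Every type moves toward the Knowledge Graph, with OTHER and PERSON shifting the most and LOCATION the least, yet no type crosses 50%.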

Hopefully this analysis inspires you with new ideas of how to use the Global Entity Graph!