GEG: Master Ranked List Of 23.9 Million Unique Entities From 200 Million English News Articles 2016-2021

Interest in structured knowledge graphs is surging as organizations search for more powerful ways of understanding the world. Each day GDELT annotates a small random sample of global online news coverage through Google's Cloud Natural Language API, recording the resulting entities in the Global Entity Graph (GEG). Today the GEG contains more than 21 billion entity annotations of 412 million distinct entity names, spanning more than 200 million articles in 11 languages from 2016 to the present, and is updated every minute.

The Cloud Natural Language API organizes each entity into one of 13 classifications based on its usage in the article. Which are the most and least common types?

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc

Using the simple SQL query above we get the table below. Setting aside the catch-all OTHER category, which is the largest by far, people are the most common type of entity mentioned in the news, followed by locations, organizations, events, numbers and works of art (typically things like publications and other reports being announced in the news).

Row type cnt
1 OTHER 10,080,245,264
2 PERSON 3,460,102,811
3 LOCATION 2,220,884,897
4 ORGANIZATION 1,839,565,758
5 EVENT 1,105,138,233
6 NUMBER 1,045,114,567
7 WORK_OF_ART 618,669,861
8 CONSUMER_GOOD 409,501,218
9 DATE 191,402,153
10 PRICE 63,708,896
11 PHONE_NUMBER 5,942,363
12 ADDRESS 4,918,115
13 UNKNOWN 1,394,571

The most valuable kinds of entity mentions from the standpoint of machine understanding and automated reasoning are entities for which structured information is available. A reference to a person named "Joe Biden" in a news article is indistinguishable from any other person name without the additional information that he is the current president of the United States. Moreover, without additional information, the names "Joe Biden," "Joseph Biden" and "Joseph Robinette Biden Jr." are not connected to one another.

What is needed is some kind of unique universal identifier that identifies key entities, linking their name variants under a common identifier and attaching it to a wealth of structured descriptors about that entity and its relationship to other entities, enabling complex reasoning.

Google's Natural Language API offers this exact capability, automatically returning a unique ID code (MID/GID) for entities it knows, along with the URL of that entity's Wikipedia page if it exists. What percent of those 412 million distinct entities in the GEG have assigned codes? (As discussed in a moment, these ID codes can be either MID or GID codes, but we'll use "MID" to refer to both types here for the purposes of simplicity.)

SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity 
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

Using the queries above we see that of the GEG's 412,418,817 distinct entities, 44,543,323 (10.8%) have assigned MID/GID codes.

Of course, some entities may be more common than others, so the queries below count the total number of entity mentions and what percent of those have MID codes:

SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity 
SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

Of the 21,046,588,707 total entity mentions in the GEG, 2,680,674,189 (12.7%) had MID codes, showing that they make up a minority of all entity mentions across the news each day, which is unsurprising given the range of different entity types in the GEG. What if we break this down by entity type?

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null group by entity.type order by cnt desc

The final breakdown is shown below: roughly 50% of UNKNOWN mentions have MID codes, as do 44% of LOCATION mentions, 37% of ORGANIZATION mentions, 19% of PERSON mentions, 12% of WORK_OF_ART mentions, 7% of CONSUMER_GOOD mentions, 4% of EVENT mentions and 2% of OTHER mentions. NUMBER, DATE, PRICE, PHONE_NUMBER and ADDRESS mentions never carry one.

Type With MID Total % With MID
OTHER 226,529,516 10,080,245,264 2.25
PERSON 654,485,433 3,460,102,811 18.92
LOCATION 973,468,295 2,220,884,897 43.83
ORGANIZATION 680,416,623 1,839,565,758 36.99
EVENT 40,927,168 1,105,138,233 3.70
NUMBER 0 1,045,114,567 0.00
WORK_OF_ART 74,243,538 618,669,861 12.00
CONSUMER_GOOD 29,991,226 409,501,218 7.32
DATE 0 191,402,153 0.00
PRICE 0 63,708,896 0.00
PHONE_NUMBER 0 5,942,363 0.00
ADDRESS 0 4,918,115 0.00
UNKNOWN 693,961 1,394,571 49.76
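The percentage column is simply the first query's count divided by the second's. As a minimal sketch, the Python below recomputes it from the counts in the table above (the five types with zero MID mentions are left out); the dictionary and function names are ours, chosen for illustration.

```python
# (mentions with an ID code, total mentions), copied from the table above.
counts = {
    "OTHER": (226_529_516, 10_080_245_264),
    "PERSON": (654_485_433, 3_460_102_811),
    "LOCATION": (973_468_295, 2_220_884_897),
    "ORGANIZATION": (680_416_623, 1_839_565_758),
    "EVENT": (40_927_168, 1_105_138_233),
    "WORK_OF_ART": (74_243_538, 618_669_861),
    "CONSUMER_GOOD": (29_991_226, 409_501_218),
    "UNKNOWN": (693_961, 1_394_571),
}


def pct_with_mid(with_mid, total):
    """Share of mentions that carry an ID code, as a percentage."""
    return round(100 * with_mid / total, 2)


shares = {t: pct_with_mid(w, n) for t, (w, n) in counts.items()}

# Print the types from highest to lowest share.
for entity_type, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{entity_type:<14}{share:6.2f}")
```

Sorting by share rather than by volume makes it easier to see that the small UNKNOWN category is the best covered, while the enormous OTHER category is the least.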

Recall that the Natural Language API recognizes an array of name variants for known entities. What is the ratio of recognized names to unique entity identifiers?

SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null

The queries above show that the GEG contains 44,544,537 unique names resolving to 23,902,430 unique MID codes, capturing the considerable breadth of the API's ability to recognize name variants.
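For readers who want to check the headline ratios quoted so far, here is a small sketch that recomputes them from the reported query results. The variable names are ours; the figures are exactly as quoted in the text, including the two slightly different counts of distinct names with ID codes.

```python
# Figures as reported in the text above.
distinct_names_total = 412_418_817
distinct_names_with_id = 44_543_323   # from the first pair of queries
mentions_total = 21_046_588_707
mentions_with_id = 2_680_674_189
names_resolving_to_ids = 44_544_537   # from the names-versus-IDs queries
unique_ids = 23_902_430

share_of_names = 100 * distinct_names_with_id / distinct_names_total
share_of_mentions = 100 * mentions_with_id / mentions_total
names_per_id = names_resolving_to_ids / unique_ids

print(f"distinct names with an ID code: {share_of_names:.1f}%")
print(f"mentions with an ID code:       {share_of_mentions:.1f}%")
print(f"recognized names per unique ID: {names_per_id:.2f}")
```

In other words, each known entity is recognized under a little under two distinct names on average.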

Of course, the most interesting question lies in what those specific recognized entities are. The query below compiles all 23.9 million unique MID codes and for each returns its most common textual name as it appeared in news coverage, its most common type and Wikipedia URL, its average document-level salience and its total number of appearances. Note that an entity can have different types depending on its textual context: "Hillary Clinton announced today" yields a PERSON, while "Hillary Clinton's home is in" yields a LOCATION, and so on.

SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid, APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type, APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl, avg(avgSalience) avgsalience, count(1) count  FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities where entities.mid is not null group by entities.mid order by Count desc
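To make the grouping logic of the query above concrete, here is a minimal Python sketch of the same idea on a handful of mentions. The two MID codes are taken from this article, but the individual rows are made up for illustration, and the sketch counts exactly, whereas BigQuery's APPROX_TOP_COUNT returns an approximate answer.

```python
from collections import Counter, defaultdict

# Hypothetical mentions as (mid, surface name, type) tuples.
# The rows are invented for illustration, not drawn from the GEG.
mentions = [
    ("/m/0cqt90", "Donald Trump", "PERSON"),
    ("/m/0cqt90", "Trump", "PERSON"),
    ("/m/0cqt90", "Donald Trump", "PERSON"),
    ("/m/0cqt90", "Donald Trump", "ORGANIZATION"),
    ("/m/09c7w0", "U.S.", "LOCATION"),
    ("/m/09c7w0", "United States", "LOCATION"),
    ("/m/09c7w0", "U.S.", "LOCATION"),
]

# Tally every name and type seen for each MID (the GROUP BY step).
names = defaultdict(Counter)
types = defaultdict(Counter)
for mid, name, entity_type in mentions:
    names[mid][name] += 1
    types[mid][entity_type] += 1

# Keep the single most common name and type per MID, plus the mention count.
summary = {
    mid: {
        "entity": names[mid].most_common(1)[0][0],
        "type": types[mid].most_common(1)[0][0],
        "count": sum(names[mid].values()),
    }
    for mid in names
}

# List the MIDs from most to least mentioned (the ORDER BY step).
for mid, row in sorted(summary.items(), key=lambda kv: -kv[1]["count"]):
    print(mid, row)
```

Because only the top name and type survive, a MID that occasionally shows up under a second type still appears once in the ranked list, labeled with its dominant usage.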

The complete list of 23,899,566 distinct MIDs and the values above are available as a UTF8 CSV file:

You can see the top 15 below:

Row entity mid type wikipediaurl avgsalience count
1 U.S. /m/09c7w0 LOCATION https://en.wikipedia.org/wiki/United_States 0.015321233087389351 50,081,461
2 Donald Trump /m/0cqt90 PERSON https://en.wikipedia.org/wiki/Donald_Trump 0.09679160921134368 19,058,565
3 UK /m/07ssc LOCATION https://en.wikipedia.org/wiki/United_Kingdom 0.01195910443517211 15,134,317
4 China /m/0d05w3 LOCATION https://en.wikipedia.org/wiki/China 0.024024387600363576 13,163,499
5 Europe /m/02j9z LOCATION https://en.wikipedia.org/wiki/Europe 0.006135497649549499 12,682,037
6 AP /m/0cv_2 ORGANIZATION https://en.wikipedia.org/wiki/Associated_Press 0.01276531524191298 11,333,807
7 COVID-19 /g/11j2cc_qll OTHER https://en.wikipedia.org/wiki/Coronavirus_disease_2019 0.01605697602725304 11,185,535
8 New York /m/02_286 LOCATION https://en.wikipedia.org/wiki/New_York_City 0.008451004515370585 11,127,326
9 Twitter /m/0289n8t OTHER https://en.wikipedia.org/wiki/Twitter 0.006084569065768032 11,120,386
10 Republican /m/07wbk ORGANIZATION https://en.wikipedia.org/wiki/Republican_Party_(United_States) 0.010449243385157157 9,825,139
11 Democratic /m/0d075m PERSON https://en.wikipedia.org/wiki/Democratic_Party_(United_States) 0.011473229277320482 9,656,437
12 Facebook /m/02y1vz OTHER https://en.wikipedia.org/wiki/Facebook 0.015362068051782654 9,543,894
13 California /m/01n7q LOCATION https://en.wikipedia.org/wiki/California 0.010135325367891837 9,522,092
14 Russia /m/06bnz LOCATION https://en.wikipedia.org/wiki/Russia 0.017300212593793374 9,193,481
15 India /m/03rk0 LOCATION https://en.wikipedia.org/wiki/India 0.02032965158632092 8,321,803

Look closely and you'll notice that 14 of the top 15 have MID codes beginning with "/m/", indicating their Freebase heritage. Look more closely, however, and you'll see that COVID-19's code begins with "/g/", indicating its provenance as a Google Knowledge Graph entity. What percentage of entities in the GEG come from Freebase versus Google's Knowledge Graph?

SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc
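The two LIKE filters above amount to a simple prefix test. As a small illustration, the Python sketch below applies the same test to the 15 ID codes from the top 15 list; the function name and the source labels are ours, not part of the API.

```python
from collections import Counter

# ID codes of the top 15 entities, in rank order, copied from the list above.
top15_ids = [
    "/m/09c7w0", "/m/0cqt90", "/m/07ssc", "/m/0d05w3", "/m/02j9z",
    "/m/0cv_2", "/g/11j2cc_qll", "/m/02_286", "/m/0289n8t", "/m/07wbk",
    "/m/0d075m", "/m/02y1vz", "/m/01n7q", "/m/06bnz", "/m/03rk0",
]


def id_source(code):
    """Classify an ID code by its prefix, mirroring the LIKE filters."""
    if code.startswith("/m/"):
        return "freebase"
    if code.startswith("/g/"):
        return "knowledge_graph"
    return "unrecognized"


by_source = Counter(id_source(code) for code in top15_ids)
print(by_source)
```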

The output of the two queries above can be seen in the table below, which breaks down the total unique MID codes by type (MID codes that have appeared as multiple types are recorded under each type they have appeared in). In all, 20,156,803 (84%) of the 23.9 million uniquely identified entities in the GEG carry Google Knowledge Graph codes rather than Freebase codes, showing how fast a knowledge graph must evolve to keep pace with an ever-changing world. In particular, this means that organizations cannot simply create a knowledge graph as a one-time endeavor and then apply it from there forward. Such static knowledge graphs will quickly age and miss new entities. Instead, knowledge graphs must be constantly updated.

Type Has ID MID GID %MID %GID
OTHER 3,769,687 1,028,197 2,741,490 27.28 72.72
PERSON 12,617,449 2,148,550 10,468,899 17.03 82.97
LOCATION 5,650,830 1,054,504 4,596,326 18.66 81.34
ORGANIZATION 4,860,657 1,044,274 3,816,383 21.48 78.52
EVENT 687,358 194,575 492,783 28.31 71.69
WORK_OF_ART 1,702,192 577,265 1,124,927 33.91 66.09
CONSUMER_GOOD 701,026 252,263 448,763 35.98 64.02
UNKNOWN 66,459 50,900 15,559 76.59 23.41

At the same time, it is likely that some entities will be mentioned far more often than others. The queries below thus repeat the same analysis, but this time instead of counting unique MID codes, they count the total number of appearances of each entity. In other words, in the table above, "Donald Trump" would be counted as a single MID code under the PERSON category, despite his name appearing far more often in the news than most other names, while in the query below every mention of his name counts.

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc

This yields a markedly different table, showing that the long tail of newer entities that lack Freebase IDs does not actually appear that often in the news. While 81% of unique LOCATION entities have Google Knowledge Graph codes, just 5% of all LOCATION entity mentions across the news seen by GDELT were among those Google Knowledge Graph entities. This intuitively makes sense, since as new entities emerge it will take time for them to accumulate mentions, whereas entities that existed in Freebase have been a part of the public conversation for longer.

Type Has ID MID GID %MID %GID
OTHER 226,529,516 186,523,847 40,005,669 82.34 17.66
PERSON 654,485,433 509,726,906 144,758,527 77.88 22.12
LOCATION 973,468,295 921,706,340 51,761,955 94.68 5.32
ORGANIZATION 680,416,623 611,166,005 69,250,618 89.82 10.18
EVENT 40,927,168 37,069,242 3,857,926 90.57 9.43
WORK_OF_ART 74,243,538 57,144,383 17,099,155 76.97 23.03
CONSUMER_GOOD 29,991,226 25,107,021 4,884,205 83.71 16.29
UNKNOWN 693,961 651,035 42,926 93.81 6.19
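To see just how lopsided this is, the sketch below puts the LOCATION rows of the unique-ID table and the mention table side by side and derives a rough average number of mentions per ID. The variable names are ours, and the per-ID averages are only approximate, since an ID that has appeared under several types is counted once under each.

```python
# LOCATION rows from the two tables above, split by ID prefix.
location_unique_ids = {"mid": 1_054_504, "gid": 4_596_326}
location_mentions = {"mid": 921_706_340, "gid": 51_761_955}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


gid_share_of_ids = share(
    location_unique_ids["gid"], sum(location_unique_ids.values())
)
gid_share_of_mentions = share(
    location_mentions["gid"], sum(location_mentions.values())
)

# Rough average number of mentions behind each unique ID.
mentions_per_id = {
    kind: location_mentions[kind] / location_unique_ids[kind]
    for kind in location_unique_ids
}

print(f"GID share of unique LOCATION IDs: {gid_share_of_ids:.2f}%")
print(f"GID share of LOCATION mentions:   {gid_share_of_mentions:.2f}%")
for kind, avg in mentions_per_id.items():
    print(f"average mentions per {kind.upper()} code: {avg:,.1f}")
```

On these figures a Freebase-era location is mentioned hundreds of times on average, while a Knowledge Graph location is mentioned only a handful of times.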

How might we test this theory? One simple way would be to limit our analysis to only news coverage published since the start of this year (2021), focusing on entities being discussed right now:

SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/m/%' group by entity.type order by cnt desc
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/g/%' group by entity.type order by cnt desc

This yields the table below in which Google Knowledge Graph entities account for a much larger portion of entity mentions, though still not a majority, offering a reminder that a small number of long-standing entities tends to dominate news coverage. It suggests that static knowledge graphs will capture a large portion of entity mentions, but that graphs age quickly, missing more mentions each day. Most importantly, it means that static knowledge graphs will only encode the world as it was rather than as it is.

Type Has ID MID GID %MID %GID
OTHER 25,035,360 17,385,741 7,649,619 69.44 30.56
PERSON 68,018,501 45,639,843 22,378,658 67.10 32.90
LOCATION 98,406,356 91,802,549 6,603,807 93.29 6.71
ORGANIZATION 67,661,532 57,172,304 10,489,228 84.50 15.50
EVENT 3,721,561 3,242,638 478,923 87.13 12.87
WORK_OF_ART 6,819,170 4,572,723 2,246,447 67.06 32.94
CONSUMER_GOOD 3,667,290 2,765,229 902,061 75.40 24.60
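One way to quantify the shift is to compare each type's Knowledge Graph share of mentions across the full archive with its share in 2021 coverage alone. The sketch below does this from the mention counts in the two tables above; the names are ours, and UNKNOWN is left out because it does not appear in the 2021 table.

```python
# (mentions with any ID, mentions with a "/g/" ID) from the two tables above.
all_time = {
    "OTHER": (226_529_516, 40_005_669),
    "PERSON": (654_485_433, 144_758_527),
    "LOCATION": (973_468_295, 51_761_955),
    "ORGANIZATION": (680_416_623, 69_250_618),
    "EVENT": (40_927_168, 3_857_926),
    "WORK_OF_ART": (74_243_538, 17_099_155),
    "CONSUMER_GOOD": (29_991_226, 4_884_205),
}
since_2021 = {
    "OTHER": (25_035_360, 7_649_619),
    "PERSON": (68_018_501, 22_378_658),
    "LOCATION": (98_406_356, 6_603_807),
    "ORGANIZATION": (67_661_532, 10_489_228),
    "EVENT": (3_721_561, 478_923),
    "WORK_OF_ART": (6_819_170, 2_246_447),
    "CONSUMER_GOOD": (3_667_290, 902_061),
}


def gid_share(with_id, with_gid):
    """Knowledge Graph ("/g/") share of ID-bearing mentions, in percent."""
    return 100 * with_gid / with_id


# Percentage-point change in GID share when restricting to 2021 coverage.
shift = {
    t: gid_share(*since_2021[t]) - gid_share(*all_time[t]) for t in all_time
}

for entity_type, delta in sorted(shift.items(), key=lambda kv: -kv[1]):
    print(f"{entity_type:<14}{delta:+6.2f} points")
```

Every type moves toward the Knowledge Graph in the 2021 slice, with LOCATION moving the least.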

Hopefully this analysis inspires you with new ideas of how to use the Global Entity Graph!