Interest in structured knowledge graphs is surging as organizations search for more powerful ways of understanding the world. Each day GDELT annotates a small random sample of global online news coverage through Google's Cloud Natural Language API, recording the resulting entities in the Global Entity Graph (GEG). Today the GEG contains more than 21 billion entity annotations covering 412 million distinct entity names across more than 200 million articles in 11 languages, spanning 2016 to the present and updated every minute.
The Cloud Natural Language API organizes each entity into one of 13 classifications based on its usage in the article. Which are the most and least common types?
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc
Using the simple SQL query above we get the table below. Setting aside the catch-all OTHER category, which accounts for nearly half of all annotations, people are the most common type of entity mentioned in the news, followed by locations, organizations, events, numbers and works of art (typically things like publications and other reports being announced in the news).
Row | type | cnt |
---|---|---|
1 | OTHER | 10,080,245,264 |
2 | PERSON | 3,460,102,811 |
3 | LOCATION | 2,220,884,897 |
4 | ORGANIZATION | 1,839,565,758 |
5 | EVENT | 1,105,138,233 |
6 | NUMBER | 1,045,114,567 |
7 | WORK_OF_ART | 618,669,861 |
8 | CONSUMER_GOOD | 409,501,218 |
9 | DATE | 191,402,153 |
10 | PRICE | 63,708,896 |
11 | PHONE_NUMBER | 5,942,363 |
12 | ADDRESS | 4,918,115 |
13 | UNKNOWN | 1,394,571 |
The most valuable kinds of entity mentions from the standpoint of machine understanding and automated reasoning are entities for which structured information is available. A reference to a person named "Joe Biden" in a news article is indistinguishable from any other person name without the additional information that he is the current president of the United States. Moreover, without additional information, the names "Joe Biden," "Joseph Biden" and "Joseph Robinette Biden Jr." are not connected to one another.
What is needed is some kind of unique universal identifier that identifies key entities, linking their name variants under a common identifier and attaching that identifier to a wealth of structured descriptors about the entity and its relationships to other entities, enabling complex reasoning.
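As a minimal sketch of that idea, the Python below links the three name variants mentioned above to one shared identifier and attaches a structured descriptor to it. The identifier value, the dictionary layout and the `resolve` helper are all hypothetical, chosen only for illustration; they are not the API's actual data model.

```python
from typing import Optional

# Hypothetical placeholder identifier, NOT a real Freebase or Knowledge Graph code.
EXAMPLE_ID = "/m/EXAMPLE_ID"

# Each surface form found in text maps to the same identifier.
NAME_TO_ID = {
    "Joe Biden": EXAMPLE_ID,
    "Joseph Biden": EXAMPLE_ID,
    "Joseph Robinette Biden Jr.": EXAMPLE_ID,
}

# Structured descriptors hang off the identifier, not off any one spelling.
ID_TO_FACTS = {
    EXAMPLE_ID: {"type": "PERSON", "role": "president of the United States"},
}


def resolve(name: str) -> Optional[dict]:
    """Return the structured record for a name, or None if the name is unknown."""
    entity_id = NAME_TO_ID.get(name)
    if entity_id is None:
        return None
    return {"id": entity_id, **ID_TO_FACTS[entity_id]}


print(resolve("Joseph Biden"))
print(resolve("Someone Unlisted"))
```

Because every variant resolves to the same record, downstream reasoning can treat mentions of any of the three spellings as the same entity.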
Google's Natural Language API offers this exact capability, automatically returning a unique ID code (MID/GID) for entities it knows, along with the URL of that entity's Wikipedia page if it exists. What percent of those 412 million distinct entities in the GEG have assigned codes? (As discussed in a moment, these ID codes can be either MID or GID codes, but we'll use "MID" to refer to both types here for the purposes of simplicity.)
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity;
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null;
Using the queries above we see that of the GEG's 412,418,817 distinct entities, 44,543,323 (10.8%) have assigned MID/GID codes.
Of course, some entities may be more common than others, so the queries below count the total number of entity mentions and what percent of those have MID codes:
SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity;
SELECT count(1) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null;
Of the 21,046,588,707 total entity mentions in the GEG, 2,680,674,189 (12.7%) had MID codes, showing that they make up a minority of all entity mentions across the news each day, which is unsurprising given the range of different entity types in the GEG. What if we break this down by entity type?
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity group by type order by cnt desc;
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null group by entity.type order by cnt desc;
The final breakdown is shown in the table below: 50% of UNKNOWN mentions have MID codes, as do 44% of LOCATION mentions, 37% of ORGANIZATION mentions, 19% of PERSON mentions, 12% of WORK_OF_ART mentions, 7% of CONSUMER_GOOD mentions, 4% of EVENT mentions and 2% of OTHER mentions. The NUMBER, DATE, PRICE, PHONE_NUMBER and ADDRESS types have no MID codes at all.
Type | With MID | Total | % With MID |
---|---|---|---|
OTHER | 226,529,516 | 10,080,245,264 | 2.25 |
PERSON | 654,485,433 | 3,460,102,811 | 18.92 |
LOCATION | 973,468,295 | 2,220,884,897 | 43.83 |
ORGANIZATION | 680,416,623 | 1,839,565,758 | 36.99 |
EVENT | 40,927,168 | 1,105,138,233 | 3.70 |
NUMBER | 0 | 1,045,114,567 | 0.00 |
WORK_OF_ART | 74,243,538 | 618,669,861 | 12.00 |
CONSUMER_GOOD | 29,991,226 | 409,501,218 | 7.32 |
DATE | 0 | 191,402,153 | 0.00 |
PRICE | 0 | 63,708,896 | 0.00 |
PHONE_NUMBER | 0 | 5,942,363 | 0.00 |
ADDRESS | 0 | 4,918,115 | 0.00 |
UNKNOWN | 693,961 | 1,394,571 | 49.76 |
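The percentages in the table above can be recomputed directly from its two count columns. The short Python sketch below does so; the row data is copied from the table, while the variable and function names are illustrative only.

```python
# (type, mentions with an ID code, total mentions), copied from the table above.
ROWS = [
    ("OTHER", 226_529_516, 10_080_245_264),
    ("PERSON", 654_485_433, 3_460_102_811),
    ("LOCATION", 973_468_295, 2_220_884_897),
    ("ORGANIZATION", 680_416_623, 1_839_565_758),
    ("EVENT", 40_927_168, 1_105_138_233),
    ("NUMBER", 0, 1_045_114_567),
    ("WORK_OF_ART", 74_243_538, 618_669_861),
    ("CONSUMER_GOOD", 29_991_226, 409_501_218),
    ("DATE", 0, 191_402_153),
    ("PRICE", 0, 63_708_896),
    ("PHONE_NUMBER", 0, 5_942_363),
    ("ADDRESS", 0, 4_918_115),
    ("UNKNOWN", 693_961, 1_394_571),
]


def percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimal places."""
    return round(100 * part / total, 2)


coverage = {etype: percent(with_id, total) for etype, with_id, total in ROWS}
total_mentions = sum(total for _, _, total in ROWS)

# Print types from highest to lowest ID coverage.
for etype, value in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{etype:14s} {value:6.2f}%")
print(f"total mentions: {total_mentions:,}")
```

The per-type totals sum to the 21,046,588,707 mentions quoted earlier, which is a useful sanity check that no type has been left out of the table.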
Recall that the Natural Language API recognizes an array of name variants for known entities. What is the ratio of recognized names to unique entity identifiers?
SELECT count(distinct(entity.mid)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null;
SELECT count(distinct(entity.name)) FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid is not null;
The queries above show that the GEG contains 44,544,537 unique names resolving to 23,902,430 unique MID codes, a ratio of roughly 1.86 names per code, capturing the considerable breadth of the API's ability to recognize name variants.
Of course, the most interesting question lies in what those specific recognized entities are. The query below compiles all 23.9 million unique MID codes and, for each, returns its most common textual name as it appeared in news coverage, its most common type and Wikipedia URL, its average document-level salience and its total number of appearances. Note that an entity can have different types depending on its textual context: "Hillary Clinton announced today" yields a PERSON, while "Hillary Clinton's home is in" yields a LOCATION, and so on.
SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, entities.mid mid, APPROX_TOP_COUNT(entities.type, 1)[OFFSET(0)].value type, APPROX_TOP_COUNT(entities.wikipediaUrl, 1)[OFFSET(0)].value wikipediaurl, avg(avgSalience) avgsalience, count(1) count FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities where entities.mid is not null group by entities.mid order by Count desc
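To make the aggregation concrete, here is a rough Python equivalent of what the query computes per MID, run over a handful of made-up rows. The sample rows and salience values are hypothetical, and `collections.Counter` gives exact counts, whereas `APPROX_TOP_COUNT` in BigQuery is an approximate aggregate, so this is a sketch of the logic rather than a reproduction of the query.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical unnested annotations: (name, mid, type, salience).
SAMPLE_ROWS = [
    ("U.S.", "/m/09c7w0", "LOCATION", 0.02),
    ("United States", "/m/09c7w0", "LOCATION", 0.01),
    ("U.S.", "/m/09c7w0", "ORGANIZATION", 0.03),
    ("China", "/m/0d05w3", "LOCATION", 0.04),
    ("Some Local Shop", None, "ORGANIZATION", 0.05),  # no ID: filtered out
]


def summarize(rows):
    """Group rows by MID and report top name, top type, mean salience and count."""
    grouped = defaultdict(list)
    for name, mid, etype, salience in rows:
        if mid is not None:  # mirrors "where entities.mid is not null"
            grouped[mid].append((name, etype, salience))

    summary = []
    for mid, items in grouped.items():
        names = Counter(name for name, _, _ in items)
        types = Counter(etype for _, etype, _ in items)
        summary.append({
            "entity": names.most_common(1)[0][0],
            "mid": mid,
            "type": types.most_common(1)[0][0],
            "avgsalience": mean(s for _, _, s in items),
            "count": len(items),
        })
    # Most frequently mentioned IDs first, as in "order by count desc".
    return sorted(summary, key=lambda row: row["count"], reverse=True)


for row in summarize(SAMPLE_ROWS):
    print(row)
```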
The complete list of 23,899,566 distinct MIDs and the values above are available as a UTF8 CSV file:
- Master List Of MIDs In GEG-GCNLAPI 2016-2021. (586MB compressed).
You can see the top 15 below:
Row | entity | mid | type | wikipediaurl | avgsalience | count |
---|---|---|---|---|---|---|
1 | U.S. | /m/09c7w0 | LOCATION | https://en.wikipedia.org/wiki/United_States | 0.015321233087389351 | 50,081,461 |
2 | Donald Trump | /m/0cqt90 | PERSON | https://en.wikipedia.org/wiki/Donald_Trump | 0.09679160921134368 | 19,058,565 |
3 | UK | /m/07ssc | LOCATION | https://en.wikipedia.org/wiki/United_Kingdom | 0.01195910443517211 | 15,134,317 |
4 | China | /m/0d05w3 | LOCATION | https://en.wikipedia.org/wiki/China | 0.024024387600363576 | 13,163,499 |
5 | Europe | /m/02j9z | LOCATION | https://en.wikipedia.org/wiki/Europe | 0.006135497649549499 | 12,682,037 |
6 | AP | /m/0cv_2 | ORGANIZATION | https://en.wikipedia.org/wiki/Associated_Press | 0.01276531524191298 | 11,333,807 |
7 | COVID-19 | /g/11j2cc_qll | OTHER | https://en.wikipedia.org/wiki/Coronavirus_disease_2019 | 0.01605697602725304 | 11,185,535 |
8 | New York | /m/02_286 | LOCATION | https://en.wikipedia.org/wiki/New_York_City | 0.008451004515370585 | 11,127,326 |
9 | Twitter | /m/0289n8t | OTHER | https://en.wikipedia.org/wiki/Twitter | 0.006084569065768032 | 11,120,386 |
10 | Republican | /m/07wbk | ORGANIZATION | https://en.wikipedia.org/wiki/Republican_Party_(United_States) | 0.010449243385157157 | 9,825,139 |
11 | Democratic | /m/0d075m | PERSON | https://en.wikipedia.org/wiki/Democratic_Party_(United_States) | 0.011473229277320482 | 9,656,437 |
12 | Facebook | /m/02y1vz | OTHER | https://en.wikipedia.org/wiki/Facebook | 0.015362068051782654 | 9,543,894 |
13 | California | /m/01n7q | LOCATION | https://en.wikipedia.org/wiki/California | 0.010135325367891837 | 9,522,092 |
14 | Russia | /m/06bnz | LOCATION | https://en.wikipedia.org/wiki/Russia | 0.017300212593793374 | 9,193,481 |
15 | India | /m/03rk0 | LOCATION | https://en.wikipedia.org/wiki/India | 0.02032965158632092 | 8,321,803 |
Look closely and you'll notice that 14 of the top 15 have MID codes beginning with "/m/", indicating their Freebase heritage. Look more closely, however, and you'll see that COVID-19's code begins with "/g/", indicating its provenance as a Google Knowledge Graph entity. What percentage of entities in the GEG come from Freebase versus Google's Knowledge Graph?
SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc;
SELECT entity.type, count(distinct(entity.mid)) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc;
The output of the two queries above can be seen in the table below, which breaks down the total unique MID codes by type (a MID code that has appeared as multiple types is counted under each type in which it has appeared). In all, 20,156,803 (84%) of the identified entities in the GEG come from the Google Knowledge Graph rather than Freebase, showing how fast a knowledge graph must evolve to keep pace with an ever-changing world. In particular, this means that organizations cannot simply create a knowledge graph as a one-time endeavor and then apply it from there forward. Such static knowledge graphs will quickly age and miss new entities. Instead, knowledge graphs must be constantly updated.
Type | Has ID | MID | GID | %MID | %GID |
---|---|---|---|---|---|
OTHER | 3,769,687 | 1,028,197 | 2,741,490 | 27.28 | 72.72 |
PERSON | 12,617,449 | 2,148,550 | 10,468,899 | 17.03 | 82.97 |
LOCATION | 5,650,830 | 1,054,504 | 4,596,326 | 18.66 | 81.34 |
ORGANIZATION | 4,860,657 | 1,044,274 | 3,816,383 | 21.48 | 78.52 |
EVENT | 687,358 | 194,575 | 492,783 | 28.31 | 71.69 |
WORK_OF_ART | 1,702,192 | 577,265 | 1,124,927 | 33.91 | 66.09 |
CONSUMER_GOOD | 701,026 | 252,263 | 448,763 | 35.98 | 64.02 |
UNKNOWN | 66,459 | 50,900 | 15,559 | 76.59 | 23.41 |
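The prefix test that separates the two families of codes in the queries above is simple enough to mirror in a few lines of Python. The sketch below applies it to the 15 ID codes listed in the top-15 table; the `id_source` function name and its return labels are illustrative assumptions, not part of any API.

```python
from collections import Counter


def id_source(code: str) -> str:
    """Classify an ID code by prefix, like the two LIKE filters in the queries."""
    if code.startswith("/m/"):
        return "freebase"
    if code.startswith("/g/"):
        return "knowledge_graph"
    return "other"


# ID codes of the 15 most-mentioned entities, copied from the top-15 table.
TOP_15_IDS = [
    "/m/09c7w0", "/m/0cqt90", "/m/07ssc", "/m/0d05w3", "/m/02j9z",
    "/m/0cv_2", "/g/11j2cc_qll", "/m/02_286", "/m/0289n8t", "/m/07wbk",
    "/m/0d075m", "/m/02y1vz", "/m/01n7q", "/m/06bnz", "/m/03rk0",
]

tally = Counter(id_source(code) for code in TOP_15_IDS)
print(tally)
```

Among the most-mentioned entities the Freebase codes dominate, the opposite of the picture for unique codes in the table above.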
At the same time, it is likely that some entities will be mentioned far more often than others. The queries below thus repeat the same analysis, but this time instead of counting unique MID codes, they count the total number of appearances of each entity. In other words, in the table above, "Donald Trump" would be counted as a single MID code under the PERSON category, despite his name appearing far more often in the news than most other names, while in the queries below every mention of his name counts.
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/m/%' group by entity.type order by cnt desc;
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where entity.mid like '/g/%' group by entity.type order by cnt desc;
This yields a markedly different table, showing that the long tail of newer entities absent from Freebase does not actually appear that often in the news. While 81% of unique LOCATION ID codes are Google Knowledge Graph codes, just 5% of all LOCATION mentions across the news seen by GDELT refer to those Google Knowledge Graph entities. This makes intuitive sense: as new entities emerge it takes time for them to accumulate mentions, whereas entities that existed in Freebase have been part of the public conversation for longer.
Type | Has ID | MID | GID | %MID | %GID |
---|---|---|---|---|---|
OTHER | 226,529,516 | 186,523,847 | 40,005,669 | 82.34 | 17.66 |
PERSON | 654,485,433 | 509,726,906 | 144,758,527 | 77.88 | 22.12 |
LOCATION | 973,468,295 | 921,706,340 | 51,761,955 | 94.68 | 5.32 |
ORGANIZATION | 680,416,623 | 611,166,005 | 69,250,618 | 89.82 | 10.18 |
EVENT | 40,927,168 | 37,069,242 | 3,857,926 | 90.57 | 9.43 |
WORK_OF_ART | 74,243,538 | 57,144,383 | 17,099,155 | 76.97 | 23.03 |
CONSUMER_GOOD | 29,991,226 | 25,107,021 | 4,884,205 | 83.71 | 16.29 |
UNKNOWN | 693,961 | 651,035 | 42,926 | 93.81 | 6.19 |
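One way to see the long tail is to divide each mention count in the table above by the matching unique-ID count from the earlier table, giving average mentions per ID code. This derived metric is not computed in the original analysis; the Python sketch below is an illustrative calculation over three of the types, with all names chosen for the example.

```python
# (unique Freebase IDs, unique Knowledge Graph IDs), from the unique-ID table.
UNIQUE_IDS = {
    "PERSON": (2_148_550, 10_468_899),
    "LOCATION": (1_054_504, 4_596_326),
    "ORGANIZATION": (1_044_274, 3_816_383),
}

# (Freebase mentions, Knowledge Graph mentions), from the mention table above.
MENTIONS = {
    "PERSON": (509_726_906, 144_758_527),
    "LOCATION": (921_706_340, 51_761_955),
    "ORGANIZATION": (611_166_005, 69_250_618),
}


def mentions_per_id(etype: str) -> tuple:
    """Return average mentions per unique ID as (freebase, knowledge_graph)."""
    freebase_ids, kg_ids = UNIQUE_IDS[etype]
    freebase_mentions, kg_mentions = MENTIONS[etype]
    return freebase_mentions / freebase_ids, kg_mentions / kg_ids


for etype in UNIQUE_IDS:
    freebase_avg, kg_avg = mentions_per_id(etype)
    print(f"{etype:13s} freebase={freebase_avg:8.1f}  knowledge_graph={kg_avg:6.1f}")
```

For each of these types the average Freebase code is mentioned far more often than the average Knowledge Graph code, which is the long-tail pattern described above.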
How might we test this theory? One simple way would be to limit our analysis to only news coverage published since the start of this year, focusing on entities being discussed right now:
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/m/%' group by entity.type order by cnt desc;
SELECT entity.type, count(1) cnt FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entity where DATE(date) >= "2021-01-01" and entity.mid like '/g/%' group by entity.type order by cnt desc;
This yields the table below, in which Google Knowledge Graph entities account for a larger share of mentions for every type listed (PERSON, for example, rises from 22% to 33%), though still not a majority, offering a reminder that a small number of long-standing entities tends to dominate news coverage. It suggests that static knowledge graphs will capture a large portion of entity mentions, but that graphs age quickly, missing more mentions each day. Most importantly, it means that static knowledge graphs will only encode the world as it was rather than as it is.
Type | Has ID | MID | GID | %MID | %GID |
---|---|---|---|---|---|
OTHER | 25,035,360 | 17,385,741 | 7,649,619 | 69.44 | 30.56 |
PERSON | 68,018,501 | 45,639,843 | 22,378,658 | 67.10 | 32.90 |
LOCATION | 98,406,356 | 91,802,549 | 6,603,807 | 93.29 | 6.71 |
ORGANIZATION | 67,661,532 | 57,172,304 | 10,489,228 | 84.50 | 15.50 |
EVENT | 3,721,561 | 3,242,638 | 478,923 | 87.13 | 12.87 |
WORK_OF_ART | 6,819,170 | 4,572,723 | 2,246,447 | 67.06 | 32.94 |
CONSUMER_GOOD | 3,667,290 | 2,765,229 | 902,061 | 75.40 | 24.60 |
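To quantify the shift between the all-time and 2021-only tables, the Python sketch below computes the change in the Knowledge Graph share of mentions, in percentage points, for three of the types. The counts are copied from the two tables; the function and variable names are illustrative only.

```python
# (Freebase mentions, Knowledge Graph mentions) across the whole dataset.
ALL_TIME = {
    "PERSON": (509_726_906, 144_758_527),
    "LOCATION": (921_706_340, 51_761_955),
    "WORK_OF_ART": (57_144_383, 17_099_155),
}

# The same counts restricted to coverage dated 2021-01-01 or later.
SINCE_2021 = {
    "PERSON": (45_639_843, 22_378_658),
    "LOCATION": (91_802_549, 6_603_807),
    "WORK_OF_ART": (4_572_723, 2_246_447),
}


def kg_share(freebase: int, knowledge_graph: int) -> float:
    """Percentage of ID-bearing mentions that carry a Knowledge Graph code."""
    return 100 * knowledge_graph / (freebase + knowledge_graph)


# Positive values mean Knowledge Graph codes are more prominent in recent news.
shift = {
    etype: kg_share(*SINCE_2021[etype]) - kg_share(*ALL_TIME[etype])
    for etype in ALL_TIME
}

for etype, points in shift.items():
    print(f"{etype:12s} {points:+.2f} percentage points")
```

The shift is positive for all three types but much smaller for LOCATION than for PERSON or WORK_OF_ART, consistent with place names changing more slowly than the people and works in the news.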
Hopefully this analysis inspires you with new ideas of how to use the Global Entity Graph!