The GDELT Project

GKG Person Name Histogram 2015-2019

Among the myriad metadata fields computed by the Global Knowledge Graph (GKG) is a list of all person names found in each article. To date, the GKG records 2,458,844,208 mentions of 77,641,688 distinct person names. The list is unnormalized: each entry is the raw name exactly as it appeared in the article, including typographical errors, transliteration differences and spelling variants.

With the release of the new Global Entity Graph (GEG) there has been considerable interest in constructing normalization tables of the GKG's person name list. To assist with such efforts, we've released a master histogram of all unique person names identified in the one billion global news articles processed by the GKG 2015-2019.

All names were algorithmically extracted using language models, so you will undoubtedly find plenty of errors in this list, a mixture of algorithmic errors and errors in the original articles.

The master name list is available for download below as a UTF-8 CSV file. The first column is the lowercased name as extracted from the English machine translation of the article, and the second is the number of articles in which the name was found.
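For readers who want to explore the file programmatically, here is a minimal Python sketch of reading the two-column layout described above and listing the most frequent names. The function names `read_name_counts` and `top_names` and the sample rows are hypothetical, made up for illustration. Because it is not stated whether the file carries a header row, the sketch simply skips any row whose second column is not an integer.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_name_counts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse (name, article_count) rows from the two-column CSV."""
    rows = []
    for record in csv.reader(lines):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        name, raw_count = record[0], record[1]
        try:
            count = int(raw_count)
        except ValueError:
            continue  # skip a header row or any non-numeric count
        rows.append((name, count))
    return rows


def top_names(rows: List[Tuple[str, int]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent names, highest count first."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO('Name,Count\n"john smith",12\njane doe,30\nj smith,3\n')
rows = read_name_counts(sample)
print(top_names(rows, n=2))  # [('jane doe', 30), ('john smith', 12)]
```

To run this against the downloaded file, pass a file object opened with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample. With tens of millions of rows, streaming the rows rather than building a full list may be preferable.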

TECHNICAL DETAILS

Constructing the list above required just a single line of SQL in BigQuery:

SELECT Name, COUNT(1) Count FROM `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(Persons,';')) AS name WHERE LENGTH(Persons) > 3 GROUP BY Name
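To make the logic of the query concrete, the following Python sketch reproduces the same steps on a few in-memory strings: filter out empty fields, split on the semicolon delimiter, and count by name. It illustrates what the SQL computes and is not how the list was actually built. The function name `count_person_names` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def count_person_names(persons_fields: Iterable[Optional[str]]) -> Counter:
    """Count names across semicolon-delimited Persons fields."""
    counts: Counter = Counter()
    for persons in persons_fields:
        # Mirrors WHERE length(Persons) > 3: skip NULL or near-empty fields.
        if persons is None or len(persons) <= 3:
            continue
        # Mirrors UNNEST(SPLIT(Persons, ';')): one entry per delimited name.
        for name in persons.split(";"):
            counts[name] += 1  # mirrors COUNT(1) ... GROUP BY name
    return counts


# Made-up sample records; each string stands in for one article's Persons field.
sample_fields = ["barack obama;angela merkel", "angela merkel", None, ""]
counts = count_person_names(sample_fields)
print(counts.most_common())  # [('angela merkel', 2), ('barack obama', 1)]
```

Like the query, this counts one occurrence per split entry, so the result equals the number of articles only on the assumption that a name appears at most once in each record's Persons field.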

The sharded CSV files exported to GCS were then concatenated, using "tail -n +2" to strip the header row from each shard.
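The header-stripping concatenation described above can also be sketched in Python for environments without a Unix shell. This is an illustrative equivalent, not the command that was actually run. The function name `concat_shards` and the shard file names are hypothetical, and the sketch assumes every shard begins with exactly one header line and that no quoted field contains an embedded newline.

```python
import os
import tempfile
from typing import Sequence


def concat_shards(shard_paths: Sequence[str], out_path: str) -> int:
    """Append every shard to out_path, dropping each shard's first line.

    Returns the number of data lines written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in shard_paths:
            with open(path, "r", encoding="utf-8", newline="") as shard:
                next(shard, None)  # discard the header row of this shard
                for line in shard:
                    # Guard against a shard whose last line lacks a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written


# Demonstration on two small, made-up shards in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, body in enumerate(["Name,Count\na,1\nb,2\n", "Name,Count\nc,3\n"]):
        p = os.path.join(tmp, f"shard-{i}.csv")
        with open(p, "w", encoding="utf-8", newline="") as f:
            f.write(body)
        paths.append(p)
    merged = os.path.join(tmp, "merged.csv")
    n = concat_shards(paths, merged)
    with open(merged, encoding="utf-8", newline="") as f:
        merged_text = f.read()

print(n, repr(merged_text))  # 3 'a,1\nb,2\nc,3\n'
```

Because the header is dropped from every shard, the merged file has no header row. Anyone who wants one would need to write it to the output before appending the shards.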