Geographic FeatureID To Common Name Crosswalk

In support of its global geocoding efforts, GDELT maintains gazetteers containing several hundred million names for locations on Earth across all of the languages it monitors. These range from myriad transliterations and common name variants to inflected and localized forms. When it records a geographic mention in the GKG or EVENTS databases, GDELT stores the actual reference as found in the text, meaning that 100 different articles using 100 different names for the same location will each contain a different value in the "location name" field. This is why we strongly recommend using the FeatureID field (drawn from the GNS/GNIS featureid) for all geographic analysis, since it resolves to the same unique identifier regardless of the name used in the text.
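To illustrate the difference, here is a small Python sketch that counts the same mentions by name and by FeatureID. The identifiers and names are made up for illustration and are not real GDELT records.

```python
from collections import Counter

# Hypothetical (FeatureID, name-as-written) pairs; "FID-1" is one place
# that three articles spelled three different ways.
mentions = [
    ("FID-1", "Example City"),
    ("FID-1", "Exampel Citi"),
    ("FID-1", "Ciudad Ejemplo"),
    ("FID-2", "Sample Town"),
]

by_name = Counter(name for _, name in mentions)
by_featureid = Counter(fid for fid, _ in mentions)

print(len(by_name))           # 4 distinct names
print(by_featureid["FID-1"])  # 3 mentions of the same place
```

Counting by name splits one location into three separate entries, while counting by FeatureID keeps them together.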

However, in many cases you want to display a human-readable name in your application or crosswalk to other geographic datasets that use non-standard geographic IDs. To assist you, we've compiled the following crosswalk, which lists every FeatureID found 25 or more times in 2015-2016 together with the most common English transliterated reference to that location found across all monitored coverage. You can simply match the FeatureID from your GKG or EVENTS export against this list to retrieve the most common name of the location for display.
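A minimal Python sketch of that lookup follows. It assumes the crosswalk has been saved as a tab-delimited text file whose first two columns are the FeatureID and the name; the actual file layout may differ. The function names and sample rows are hypothetical.

```python
import csv
import io


def load_crosswalk(fileobj, delimiter="\t"):
    """Build a FeatureID -> most-common-name dict from an open text file."""
    names = {}
    reader = csv.reader(fileobj, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[0]:
            continue  # skip blank or malformed lines
        names[row[0]] = row[1]
    return names


def display_name(names, featureid, name_in_text):
    """Prefer the crosswalk name; fall back to the name found in the article."""
    return names.get(featureid, name_in_text)


# Made-up rows standing in for the real crosswalk file (assumed layout).
sample = io.StringIO(
    "FID-1\tExample City\t10.5\t20.25\n"
    "FID-2\tSample Town\t-1.0\t2.0\n"
)
names = load_crosswalk(sample)
print(display_name(names, "FID-1", "Exampel Citi"))    # Example City
print(display_name(names, "FID-9", "Unlisted Place"))  # Unlisted Place
```

Falling back to the name from the text keeps rarely mentioned locations, which are absent from the crosswalk because of the 25-mention threshold, displayable.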

TECHNICAL DETAILS

For those interested in how this crosswalk was compiled, here is the BigQuery SQL (written in the legacy SQL dialect) used to generate it:

select featureid, location, lat, long from (
  select featureid, location, min(lat) lat, min(long) long, count(1) cnt,
    DENSE_RANK() OVER (PARTITION BY featureid ORDER BY cnt DESC) dense_rank
  from (
    select
      REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#(.*?)#') as location,
      REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#\d+') as featureid,
      REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#.*?#.*?#\d+') as lat,
      REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#.*?#\d+') as long
    from [gdelt-bq:gdeltv2.gkg]
  ) group by featureid, location
) having dense_rank = 1 and cnt >= 25
order by location asc
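
For readers who would rather experiment locally, the following Python sketch mirrors the logic of the query above on a handful of V2Locations strings. It reuses the same four regular expressions, which imply that each semicolon-delimited block starts with a type digit and a name and ends with latitude, longitude, FeatureID, and a numeric offset. The function names and test values are hypothetical. It is a simplified approximation, not the production pipeline: it omits the min(lat)/min(long) columns and keeps only one name per FeatureID when several tie for first place, whereas the query would keep all tied names.

```python
import re
from collections import Counter

# Same patterns as the query; applied to one ';'-delimited block at a time.
NAME_RE = re.compile(r'^[2-5]#(.*?)#')
FEATUREID_RE = re.compile(r'^[2-5].*#(.*?)#\d+')
LAT_RE = re.compile(r'^[2-5].*#(.*?)#.*?#.*?#\d+')
LONG_RE = re.compile(r'^[2-5].*#(.*?)#.*?#\d+')


def parse_locations(v2locations):
    """Return one dict per block whose type digit is 2-5; skip the rest."""
    rows = []
    for block in v2locations.split(';'):
        matches = [rx.search(block)
                   for rx in (NAME_RE, FEATUREID_RE, LAT_RE, LONG_RE)]
        if any(m is None for m in matches):
            continue  # block does not match the patterns used by the query
        name, featureid, lat, lon = (m.group(1) for m in matches)
        rows.append({"location": name, "featureid": featureid,
                     "lat": lat, "long": lon})
    return rows


def most_common_names(v2locations_values, min_count=25):
    """Map each FeatureID to its most frequent name, if seen >= min_count times."""
    counts = Counter()
    for value in v2locations_values:
        for loc in parse_locations(value):
            counts[(loc["featureid"], loc["location"])] += 1
    best = {}
    for (featureid, name), cnt in counts.items():
        # On ties the first name encountered wins (the query keeps all ties).
        if featureid not in best or cnt > best[featureid][1]:
            best[featureid] = (name, cnt)
    return {fid: name for fid, (name, cnt) in best.items() if cnt >= min_count}
```

For example, passing three made-up records that call FeatureID -12345 "Alpha" and two that call it "Alfa", with min_count=3, yields a single entry mapping -12345 to "Alpha".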