Geographic FeatureID To Common Name Crosswalk May 2018 Update

While we tend to think of geographic locations as having just one name, in reality they can have dozens or even hundreds of alternative names, name variants and name transliterations. When geocoding a news article, GDELT records the location name as it appeared in the article, which is why we recommend using the FeatureID field to aggregate on the geographic fields to ensure correctness. In many analyses you want the ability to display a textual name to the user for each location, so in 2016 we released a crosswalk for all locations that appear more than 25 times in our dataset that maps the FeatureID to the most common textual name used to reference that feature across all mentions of it.

Today we've updated that crosswalk, examining more than three quarters of a billion articles spanning the last three years via the GDELT 2.0 GKG to determine the most common name used over that time period to refer to each location that appeared at least 25 times. To use this dataset, simply perform your analysis as usual and then crosswalk the FeatureIDs against this table to look up the proper human names to display for each.

As a reminder, here is the BigQuery SQL code used to generate the table:

select featureid, location, lat, long from (
select featureid, REGEXP_REPLACE(location, r'^-', '') location, min(lat) lat, min(long) long, count(1) cnt, DENSE_RANK() OVER (PARTITION BY featureid ORDER BY cnt DESC) dense_rank from (
select REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#(.*?)#') as location, REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#\d+') as featureid,
 REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#.*?#.*?#\d+') as lat, REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5].*#(.*?)#.*?#\d+') as long
from [gdelt-bq:gdeltv2.gkg]
) group by featureid, location
) having dense_rank = 1 and cnt >= 25 and featureid is not null
order by location asc

As a quick helper routine, if you're looking to see what the FeatureID of a given location is, this short BigQuery SQL query will return the full GKG location record of the locations containing a given word/phrase in their transliterated English name from the first 100 matching articles (this query takes 1.9s and consumes 171MB of BigQuery quota):

select loc from (
SELECT SPLIT(V2Locations,';') loc FROM [gdelt-bq:gdeltv2.gkg_partitioned] WHERE _PARTITIONTIME >= "2018-05-13 00:00:00" AND _PARTITIONTIME < "2018-05-14 00:00:00" 
) having loc like '%West Bank%' limit 100

 

DOWNLOAD THE CROSSWALK (CSV)