In 2015 we created a map of the geography of the world's media as seen through a year of GDELT's eyes, yielding a clickable map where you could see the top 5 outlets publishing the greatest volume of content about each city on earth.
Today we've updated that analysis, enhancing it from a single absolute volume count to a far more sophisticated self-normalizing query that actually acts as a template from which you can perform incredibly rich geographic analyses on the world's media outlets. The BigQuery SQL query below may look a bit complex, but essentially what it does is process through the entire 2015-present GKG 2.0 dataset, extract all landmark/city/state-level geographic mentions (anything more specific than a country reference), compute a ranking of the top most commonly mentioned locations in each outlet normalized by all locations mentioned in that outlet (a high-volume newswire outlet may have published more articles over the past year mentioning a given city than that city's local press, but as a percent of all locations mentioned by that newswire, the city will actually rank fairly low), and then generates a final output table.
In all, the query below processes more than 6.2 billion mentions of location across more than 756 million global news articles from 65 languages spanning three years of global events totaling 343GB of compact geographic data and performs a complex set of aggregations and normalizations to yield a final result in just 33 seconds.
select location, latitude, longitude, featureid, domain, cnt, ratio, ratio_sum, rank from ( select location, latitude, longitude, featureid, domain, cnt, ratio, SUM(ratio) OVER(PARTITION BY domain ORDER BY ratio) ratio_sum, RANK() OVER(PARTITION BY domain ORDER BY ratio desc) rank FROM ( select location, latitude, longitude, featureid, domain, cnt, RATIO_TO_REPORT(cnt) OVER(PARTITION BY domain) ratio from ( select max(location) location, max(latitude) latitude, max(longitude) longitude, featureid, domain, count(1) cnt from ( SELECT REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#(.*?)#') as location, REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#(.*?)#') as latitude, REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#(.*?)#') as longitude, REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[2-5]#.*?#.*?#.*?#.*?#.*?#.*?#(.*?)#') as featureid, IFNULL(DOMAIN(DocumentIdentifier), 'broadcast_source') as domain FROM [gdeltv2.gkg] WHERE DocumentIdentifier contains 'http' having featureid is not null ) group by domain, featureid having cnt > 50 ) ) ) where rank <= 20 ORDER BY domain, ratio DESC
The output of the query above was then filtered against the Alexa Top 1 Million ranking dataset to mix geographic affinity with visitor popularity. The total set of all domains returned by the query above for each location was sorted by the formula "cnt*score" and the top 20 domains selected for each location. These 20 domains were then sorted by their Alexa ranks (any with ranks above 75,000 were excluded) and the top 10 retained for each location. This combined filtering ensured that the pool of domains considered for each location prioritized a synthesis of total volume of coverage against what percent of the outlet's total coverage is centered on that location. This final pool is then reranked by visitor popularity via their Alexa ranks. Finally, we rewrite the location names to normalize them according to our new May 2018 Update FeatureID-Name Lookup.
The BigQuery query above outputs a number of statistics that can be used to perform all sorts of other kinds of rankings of domains and locations, opening up all sorts of additional possibilities. Or, you can jump right to it and download the results of the query saved to the CSV file below, or the final location-domain rankings in the tab-delimited TXT file below.
We're excited to see what you can do with this new data!