Mapping The Media: A Geographic Lookup Of GDELT's Sources 2015-2021

Three years ago we showed how a single SQL query in BigQuery could process the entire GKG 2.0 and compile the top country most mentioned by each news outlet as an estimate of its likely country of origin. Given that news outlets tend to cover stories in their home country more often than stories elsewhere in the world, this simple approach yields reasonable estimations of the geographic landscape of GDELT's sources. Most importantly, it sidesteps the complexities of "news outlets in exile" that are an important independent source in some countries and regional news outlets, assigning each to the country they prioritize the most. Given that DNS registration records are increasingly masked around the world and the physical location of news outlet servers is increasingly centralized to out-of-country global cloud providers around the world, coupled with the fact that many outlets around the world do not publish clear location information, this estimation approach allows us to at least get a rough understanding of the geographic trends of global news.

When we last computed this lookup, we needed a heavily nested set of queries to process the data. Today, using BigQuery's Standard SQL, we can achieve the same result in just two nested queries, or three if we want flattened CSV output. In the query below, the innermost query compiles a list of all locations mentioned in each news outlet and collapses by the top-level domain of that outlet. The next level up query combines the country codes and counts into a STRUCT and uses the ARRAY_AGG operator to compile the top 5 for each domain. Finally, the outermost query just flattens that array so it can be output as a CSV file.

select domain, topcountry.countrycode countrycode, topcountry.cnt cnt from (
  select domain, ARRAY_AGG(STRUCT(countrycode AS countrycode, cnt as cnt) order by cnt desc limit 5) topcountries from (
    SELECT REGEXP_EXTRACT(location,r'^[1-5]#.*?#(.*?)#') countrycode, IFNULL(NET.REG_DOMAIN(DocumentIdentifier), 'unknown') as domain, count(1) cnt FROM `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(V2Locations,';')) location WHERE LOWER(DocumentIdentifier) like '%http%' group by countrycode, domain having cnt > 5 and countrycode is not null
  ) group by domain
), unnest(topcountries) topcountry order by domain asc, topcountry.cnt desc

You can download the final results below! Country codes are in FIPS 10-4 circa-2015 and may not reflect all modern country divisions.