The GDELT Project

Mapping The Media: A Geographic Lookup Of GDELT's Sources

Given GDELT's unique ability to peer across the world's languages courtesy of its mass machine translation infrastructure, a common application is to compare coverage of major events in different countries. Doing so requires the ability to place news outlets geographically.

At first glance, placing a news outlet on a map might seem simple – just place it where its headquarters or place of legal incorporation is. Thus, the New York Times would be placed in New York City, CNN in Atlanta and so on. The problem is that this approach breaks down quickly.

Take the simple example of the Los Angeles Times. If using "whois" domain registration information, historically one would find its website was registered to the Tribune Company in Chicago, Illinois. Similarly, the website for Nigerian news outlet Vanguard News was long registered to a residential address in the town of Bridgnorth, Shropshire, England, population 12,000, home to a legal entity called Africa Business Information Service.

What about just using the location of an outlet's physical offices instead of its domain information? This too quickly runs into complications. African news service allAfrica has offices in Cape Town, Dakar, Abuja, Monrovia, Nairobi, and Washington, DC, yet its domain is registered to its DC office. In fact, looking globally, smaller news outlets in many parts of the world are often registered to legal entities in other countries. Making matters even more difficult, an ever-increasing number of domains use so-called "domain privacy" shield companies that mask all of their information in whois databases, making it hard to research their ownership and location. Examining incorporation and other legal documents doesn't necessarily offer any clarity beyond domain registrations, since domains are frequently registered to those same foreign legal entities or offices.

Alternatively, one approach that has been used is to perform "IP geolocation" on a news outlet's servers to determine where its website is physically hosted. However, even before the era of consolidated cloud hosting, few news outlets ran their own server clusters in a machine room in their building – they outsourced to local hosting companies, which in turn typically outsourced to regional or US-based data centers. Today, of course, news outlet websites are even further consolidated.

TLD country information, such as ".uk" in a domain, is useful, but the overwhelming majority of news outlets unfortunately don't carry such information. Moreover, some TLDs like ".co" have been largely taken over to indicate things unrelated to geographic location.

Instead, historically, in creating its English and Multilingual geographic news outlet directories, GDELT relied on the strong geographic bias inherent in journalism – the simple fact that news outlets cover events physically proximate to them far more often than they do events on the other side of the world.

With just a single SQL query in Google's BigQuery, we can process more than three quarters of a billion global news articles spanning the past three years to count up every mention of a location anywhere on earth and create a country-level histogram for each outlet that shows which countries it spends the majority of its time reporting about. In just 16 seconds this query processes more than 344GB of geographic data to return the top two countries mentioned most frequently by each online news outlet monitored by GDELT. We then filter the results to exclude outlets where the top two countries are mentioned at similar rates (suggesting the outlet doesn't have a single "primary" country of focus). We also exclude domains that have too few results to generate a meaningful geographic estimate.

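 -- Legacy BigQuery SQL: rank the countries each outlet mentions most often.
 SELECT countrycode, domain, cnt, ratio, ratio_sum, rank FROM (
  -- Cumulative ratio (in ascending ratio order) and per-domain rank of each country.
  SELECT countrycode, domain, cnt, ratio, SUM(ratio) OVER(PARTITION BY domain ORDER BY ratio) ratio_sum, RANK() OVER(PARTITION BY domain ORDER BY ratio DESC) rank FROM (
   -- Share of each domain's counted location mentions that fall in each country.
   SELECT countrycode, domain, cnt, RATIO_TO_REPORT(cnt) OVER(PARTITION BY domain) ratio FROM (
    -- Count mentions per (domain, country), keeping only pairs with more than 50 mentions.
    SELECT domain, countrycode, COUNT(1) cnt FROM (
     -- One row per location mention: pull the country code out of each V2Locations entry.
     SELECT REGEXP_EXTRACT(SPLIT(V2Locations,';'),r'^[1-5]#.*?#(.*?)#') countrycode, IFNULL(DOMAIN(DocumentIdentifier), 'broadcast_source') AS domain
     FROM [gdeltv2.gkg] WHERE DocumentIdentifier CONTAINS 'http' HAVING countrycode IS NOT NULL
    ) GROUP BY domain, countrycode HAVING cnt > 50
   )
  )
 ) WHERE rank <= 3 ORDER BY domain, ratio DESC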
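
The post-query filtering described above (dropping outlets whose top two countries are mentioned at similar rates, and outlets with too few results) could be sketched in Python roughly as follows. This is only an illustrative sketch, not GDELT's actual code: the function name `primary_country` and the thresholds `min_mentions` and `min_gap` are assumptions, since the cutoffs actually used are not stated here. Rows are assumed to be dictionaries carrying the `domain`, `countrycode`, `cnt` and `ratio` columns returned by the query.

```python
from collections import defaultdict


def primary_country(rows, min_mentions=500, min_gap=0.10):
    """Pick a single primary country per domain, or skip the domain.

    rows: iterable of dicts with keys "domain", "countrycode", "cnt", "ratio".
    min_mentions and min_gap are illustrative assumptions, not GDELT's values.
    """
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row["domain"]].append(row)

    primary = {}
    for domain, ranked in by_domain.items():
        # Highest share of mentions first.
        ranked.sort(key=lambda r: r["ratio"], reverse=True)
        top = ranked[0]
        # Too few mentions to make a meaningful estimate.
        if top["cnt"] < min_mentions:
            continue
        # Top two countries too close together: no single primary focus.
        runner_up_ratio = ranked[1]["ratio"] if len(ranked) > 1 else 0.0
        if top["ratio"] - runner_up_ratio < min_gap:
            continue
        primary[domain] = top["countrycode"]
    return primary
```

Domains that fail either check are simply absent from the returned mapping, leaving them to be resolved by other means.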

The results of the query above are remarkable not only for how accurate they are, but also for what they show us through the clarity of three quarters of a billion articles: news outlets across the entire world are immensely geographically biased. This will come as no surprise to anyone in the journalism world, yet it reminds us just how little we hear about the world outside our immediate surroundings when we turn to the news, and that the "filter bubbles" of social media are merely natural extensions of the filter bubbles we have always lived in through the news media.

What's so interesting about these results is that we had originally expected to have to heavily correct them using domain and corporate registration records. Indeed, in previous work over the past 23 years we've made heavy use of such records for georeferencing websites, only to lament the complications and nuances of their geographic information. Over the past several years we've researched the locations of more than 60,000 outlets using a vast array of official record sources that we initially intended to use as the primary basis of this dataset, relying on geographic affinity only for outlets for which we lacked sufficient geographic data. Yet, as we pored over these official records we found all of the problems outlined above and, in the end, the resulting dataset, while technically correct, did not match the reality of how the outlets functioned. It may be the case that for a period of time Vanguard News was based legally in the UK, but readers and researchers will likely consider it to be a Nigerian news outlet, not a UK source. It may be that allAfrica has offices in multiple cities and a domain registered to its DC offices, but Nigeria receives more attention than any other country on its pages. In some parts of the world where minority groups are heavily oppressed and/or suffer violence at the hands of the government, news outlets representing their communities are run from abroad, often in Europe, lending systemic bias to the data that severely impacts its usability. The same problem extends to the contact mailing addresses that outlets self-report on their websites. Overall, as we pored over the registration data, we found that where news outlets are incorporated or register their website domains does not accurately reflect where their reporters are or how the outlets are perceived.

In the end, after extensive research and manual analysis of the various datasets, we eventually chose an affinity-first approach in which we assign outlets first to their TLD-specified country (excluding cases like ".co"), then to their GKG-suggested primary country of focus, then finally to the country where they are incorporated or the entity that controls their domain is registered (excluding domain privacy or management services).
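
The three-step priority order above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the code used to build the dataset: the function name `assign_country`, the lookup tables passed in, and the `GENERIC_USE_TLDS` set (seeded only with the ".co" example from the text) are all hypothetical. Registration lookups are assumed to have already had domain privacy and management services filtered out.

```python
# TLDs that no longer reliably indicate a country. Only ".co" comes from the
# text; extending this set is left to the reader (assumption).
GENERIC_USE_TLDS = {"co"}


def assign_country(domain, tld_countries, gkg_primary, registration_country):
    """Return (country, method) for a domain using an affinity-first fallback.

    tld_countries:        hypothetical map of country-code TLD -> country
    gkg_primary:          hypothetical map of domain -> GKG-suggested primary country
    registration_country: hypothetical map of domain -> country of incorporation
                          or domain registration
    """
    tld = domain.rsplit(".", 1)[-1].lower()
    # 1. Country-specific TLD, skipping TLDs repurposed for non-geographic use.
    if tld not in GENERIC_USE_TLDS and tld in tld_countries:
        return tld_countries[tld], "tld"
    # 2. Primary country of focus suggested by GKG location mentions.
    if domain in gkg_primary:
        return gkg_primary[domain], "gkg"
    # 3. Country of incorporation or of the entity controlling the domain.
    if domain in registration_country:
        return registration_country[domain], "registration"
    # No signal available: the domain is left out of the dataset.
    return None, None
```

Returning the method alongside the country makes it easy to audit how each outlet was placed, for example to review only those assigned through registration records.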

The resulting dataset isn't perfect and you will still spot outliers, like "who.int", which in 2014 focused primarily on Guinea and from 2015 to 2018 emphasized Nigeria more than any other country in its news reporting monitored by GDELT. Yet, overall, it offers a reasonable and highly usable centralized lookup, by country, of the media monitored by GDELT.

It is important to remember that news outlets can change domains over time, with their old domains being purchased and reused for other purposes. For example, 00221.info is listed as a Senegalese news outlet in this dataset, yet visiting the site today reveals it to be a furniture website. It turns out that it used to be the home of an online news outlet called Tak3 Senegal, but somewhere around 2017 the outlet moved to a new home. GDELT's global scale and 40-year timespan mean it takes on the role of a web archive in many ways, capturing the ever-evolving fluidity of the web, so when you spot a domain that looks a bit off, it's worth doing some research to see if the domain was formerly the home of something else that focused on that geography.

Note that not all of the outlets you see here are currently in the GDELT 2.0 GKG. The current 2.0 GKG dataset only extends back to 2015 and in preparation for the full 1979-2015 GDELT 2.0 historical backfile being released later this summer, we've gone ahead and included all of those domains in this dataset. Also note that this dataset will not include domains that did not have a country-specific TLD, did not have a primary geographic focus and for which we were unable to identify a location through registration or other official documentation.

We're tremendously excited to release this new dataset, both because it makes it easier to filter coverage by the geographic location of its source outlet and because of the insights it offers into the geographic bias and affinity that guide the world's journalism. This dataset stands as a testament to the power of mapping the textual geography of the news, to the insights possible through analyzing billions of location mentions across three quarters of a billion articles, and to the power of BigQuery to make sense of it all in just 16 seconds.

Let us know if you spot any errors or have other domains to add to this list! Happy Mapping!

Download The GDELT Geographic Source Lookup