Announcing New Source-Country Crossreferencing Dataset

As GDELT is increasingly used to explore the geography of the global media landscape, many of you have asked for a crossreferencing dataset that maps news outlets to their country of origin. In other words, given an article from a news outlet, you want to be able to determine what country that outlet is based in so that you can compare the differing portrayals of an event or leader across countries. Few news outlets use country-code top-level domains (such as ".uk" or ".au"); most simply end in ".com", meaning a website's domain name is rarely indicative of its country of origin. For example, the domain name "dawn.com" does not readily suggest one of the largest English-language newspapers in Pakistan, while "jpost.com" would not immediately indicate an Israeli newspaper.

Today we announce the release of precisely this dataset: the top 13,155 English-language websites that GDELT has monitored relevant information from, along with the estimated country of origin of each outlet as both a FIPS10-4 country code and a human-readable label. This can be used to filter both GDELT Event and Global Knowledge Graph records down to only those drawn from news outlets originating in a specific country. Keep in mind that no monitoring system can capture the entirety of the media output of a given country. When filtering to a specific country, the absence of an event from the media of that country therefore does not necessarily indicate that no outlet in that country covered that event (only that GDELT did not monitor any such coverage). The presence of an event in the media of that country, however, is often highly instructive, especially the distinctions in the reported details and the emotional and thematic undercurrents of that coverage.
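As a rough illustration of this kind of filtering, the Python sketch below loads the lookup into a dictionary and keeps only the article URLs whose outlet maps to a given FIPS10-4 code. The file layout (tab-delimited columns of domain, country code, country name), the sample rows, and the helper names load_lookup, outlet_domain and filter_by_country are all assumptions made for this example, so check them against the downloaded file before relying on it.

```python
import csv
import io
from urllib.parse import urlparse


def load_lookup(fileobj):
    """Read domain -> FIPS10-4 code from an (assumed) tab-delimited file."""
    lookup = {}
    for row in csv.reader(fileobj, delimiter="\t"):
        if len(row) >= 2:
            lookup[row[0].strip().lower()] = row[1].strip()
    return lookup


def outlet_domain(url):
    """Return the URL's host name without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_by_country(urls, lookup, fips_code):
    """Keep only the URLs whose outlet is assigned to fips_code."""
    return [u for u in urls if lookup.get(outlet_domain(u)) == fips_code]


# Illustrative rows only; the real file's columns and contents may differ.
sample = io.StringIO("dawn.com\tPK\tPakistan\njpost.com\tIS\tIsrael\n")
lookup = load_lookup(sample)

urls = [
    "http://www.dawn.com/news/1",
    "http://www.jpost.com/a",
    "http://example.org/x",
]
pakistan_only = filter_by_country(urls, lookup, "PK")
```

Note that this sketch only strips a leading "www." and would miss other subdomains, and that a URL with no match in the lookup simply means the outlet is not in the dataset, not that it belongs to no country.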

This initial dataset focuses exclusively on English-language web-based news outlets. To estimate the likely country of origin for each outlet, a very simplistic methodology is used, based on traditional journalistic practice: news outlets tend to have elevated coverage of events in their immediate geographic proximity compared with locations elsewhere in the world. In other words, the New York Times is likely to have a higher number of stories about local events in New York City than about those in Moscow, while Dawn is more likely to emphasize local events in Karachi over those in Rio de Janeiro. Using this assumption, the entire GDELT Global Knowledge Graph back to April 2013 was processed to compile a histogram of all locations found in coverage monitored by GDELT from each English-language web-based news outlet. Outlets that had too few monitored articles, or that lacked a substantial separation between the volume of coverage allocated to the top several most-mentioned countries, were excluded.
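The general idea can be sketched in a few lines of Python. This is not the actual GDELT implementation: the function name estimate_country, the two thresholds, the use of total location mentions as a stand-in for article counts, and the comparison of only the top two countries are all assumptions chosen for illustration.

```python
from collections import Counter


def estimate_country(country_mentions, min_mentions=100, min_ratio=2.0):
    """Guess an outlet's country from the countries its coverage mentions.

    country_mentions: iterable of country codes, one per location mention.
    Returns the most-mentioned code, or None if the outlet is excluded.
    Both thresholds are hypothetical values, not GDELT's.
    """
    counts = Counter(country_mentions)

    # Exclude outlets with too little monitored coverage.
    if sum(counts.values()) < min_mentions:
        return None

    ranked = counts.most_common(2)
    top_code, top_count = ranked[0]

    # Exclude outlets whose top country does not clearly stand out.
    if len(ranked) > 1 and top_count < min_ratio * ranked[1][1]:
        return None
    return top_code


clear_case = estimate_country(["PK"] * 80 + ["US"] * 20 + ["IN"] * 10)
too_small = estimate_country(["US"] * 5)
too_close = estimate_country(["US"] * 60 + ["GB"] * 50)
```

In this sketch the first outlet is assigned to Pakistan, while the other two are excluded: one for having too few mentions and one because its top two countries are too close together.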

While imperfect, this approach allows generalized geolocation of outlets regardless of other available information sources. Many news outlets today are physically hosted in a common set of data centers around the world, meaning that "IP geolocation" (ascertaining the physical location of the IP address of the web server hosting the news outlet) is no longer useful in geolocating the news outlet itself. For example, for a number of years the Islamic Republic News Agency (IRNA, the official news agency of the Islamic Republic of Iran) was hosted on a set of web servers physically located in the US state of Virginia, in the same data center that also housed CNN's website. Historically, domain registration records ("whois" lookups) were a useful source of location information, but increasingly domain registrars encourage domain owners to use "privacy" or "masking" services that display the registrar's location information rather than that of the domain owner. While there are commercial services like Alexa that produce locative information for websites, they do not make this data freely available at scale, and the coverage of many of these services tends to be poor for smaller and non-Western countries. Instead, by simply using the metadata GDELT already computes about each outlet, this algorithm is not only able to rapidly assign an estimated location to each news outlet, but can also automatically geolocate new websites as it encounters them over time.

You will note some odd results in the data, such as "who.int" being assigned to Guinea. WHO's assignment is due to the fact that it has focused a large volume of its news coverage thus far this year on the Ebola outbreak in Guinea. "Aljazeera.com" tends to contextualize much of its English-language reporting in terms of its impact on and relation to the United States, yielding a US country assignment, while "Allafrica.com" appears to feature elevated coverage of Nigeria compared with other African states, causing it to be identified as a Nigerian news outlet. Finally, news outlets in smaller countries and dependencies, such as "Thisisjersey.com" in the Bailiwick of Jersey (also known as the Island of Jersey), tend to contextualize local events by providing input and connections back to larger neighboring countries (in this case causing the Jersey Evening Post's website to be assigned to the United Kingdom). However, other than these special cases, the majority of outlets tend to be assigned to their correct country of origin.

Please let us know what you think of this dataset and if you use it in a particularly interesting application that you'd like to share. Please also send us any corrections for news outlets assigned to the wrong country, along with any English-language outlets that are missing and that you think should be added. (Note that this list does NOT currently contain foreign-language outlets other than those that also offer English-language editions.)

Happy mapping!

DOWNLOAD THE DATASET NOW!