Using Web NGrams 3.0 & Custom Media Catalogs To Segment By Country, State Ownership, Partisanship Or Other Attributes

Many research questions that interest media scholars involve segmenting the media by various characteristics, from country of origin to ownership to partisanship. Beyond estimating the most likely country of origin for each news outlet, GDELT does not attempt to curate media outlets into various categories due to the inherent ambiguity and disagreements that tend to emerge from such categorizations. For example, sorting US news outlets into "conservative-leaning" versus "liberal-leaning" versus "neutral" categories will inevitably lead to sharp disagreements as to which outlets belong where.

For thematic, geographic and other searches that can be conducted with the GKG, it has long been trivial to join the GKG search results with an external domain list to segment the results. However, for keyword queries using the DOC 2.0 API, the process has historically been far more involved, requiring a large number of queries, spaced ten seconds apart, to retrieve the complete result list before merging it with the external outlet list.

The new Web News NGrams 3.0 dataset removes all of this complexity and makes outlet segmentation with external lists trivial, since your query returns the complete list of matching URLs, which you can then partition using any external domain list.
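As a rough illustration of that partitioning step, here is a minimal Python sketch that groups a list of result URLs by the label assigned to each outlet in an external catalog. The catalog contents, the label names and the helper names `label_for` and `partition_urls` are all hypothetical, and the domain matching is a simple suffix walk rather than a full registered-domain lookup:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical catalog mapping an outlet's domain to a label of interest.
# These are placeholder domains, not real outlets.
CATALOG = {
    "example-state.ru": "state-owned",
    "example-private.com": "private",
}


def label_for(url, catalog):
    """Return the catalog label for a URL, or 'unlisted' if no domain matches."""
    # hostname is None for strings without a scheme, so fall back to "".
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then drop leading labels one at a time,
    # so 'www.example-state.ru' also matches 'example-state.ru'.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in catalog:
            return catalog[candidate]
    return "unlisted"


def partition_urls(urls, catalog):
    """Group URLs into lists keyed by their catalog label."""
    groups = defaultdict(list)
    for url in urls:
        groups[label_for(url, catalog)].append(url)
    return dict(groups)
```

Calling `partition_urls(urls, CATALOG)` on the URL column of a query result would return one list of URLs per label, with everything not in the catalog collected under "unlisted".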

For example, take the query below, searching Russian-language coverage from January 20, 2022 for Ukraine ("Украина"):

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' limit 100

This yields a table of results like the following:


The URLs on this list come from a range of news outlets. What if we want to look only at outlets whose domains end in ".ru"? While this will miss Russian outlets whose domains end in ".com" or other suffixes, it gives us a simple, albeit flawed, filter:

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' and NET.HOST(url) like '%.ru' limit 100

As expected, this has narrowed our results to news outlets ending in the Russian country code:


But what if we want to narrow our search specifically to news outlets owned by, supported by or connected to the Russian government? In a real application, you would simply use a spreadsheet of all of the outlets of interest, but to keep the example code simple, we'll use just four outlets: TASS, RIA, RT and SputnikNews:

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' and NET.REG_DOMAIN(url) in
(select '' UNION ALL select '' UNION ALL select '' UNION ALL select '')
limit 100

In just over a second we have results like the following:


Following this example, you can use any external list of domains to segment your Web NGrams 3.0 results along any dimension imaginable. For media, you could filter by political partisanship, party affiliation, state or private ownership, state or city location, audience size, target audience or any other factor of methodological interest. While such analyses have long been possible with GDELT's DOC 2.0 API, they required considerable work, whereas the Web NGrams 3.0 dataset now makes such outlet segmentation trivial!
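For a longer outlet list kept in a spreadsheet, the domain filter could be generated rather than typed by hand. The sketch below is one possible way to do that in Python, assuming the spreadsheet has been exported as CSV with a "domain" column and an "ownership" column. Those column names, the placeholder domains and the helper names `domains_with` and `build_query` are illustrative assumptions; only the table name, date, language and ngram come from the queries above:

```python
import csv
import io
import re

# Hypothetical spreadsheet export; column names and domains are placeholders.
CATALOG_CSV = """domain,ownership
example-state.ru,state
example-private.com,private
example-agency.ru,state
"""

# Accept only plain lowercase domain names, so nothing unexpected
# (such as a quote character) can end up inside the SQL text.
DOMAIN_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


def domains_with(csv_text, column, value):
    """Return the domains whose `column` equals `value`, in file order."""
    domains = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[column].strip() != value:
            continue
        domain = row["domain"].strip().lower()
        if not DOMAIN_RE.match(domain):
            raise ValueError(f"unexpected domain value: {domain!r}")
        domains.append(domain)
    return domains


def build_query(domains):
    """Build the example keyword query restricted to the given domains."""
    if not domains:
        raise ValueError("no domains selected")
    in_list = ", ".join(f"'{d}'" for d in domains)
    return (
        "SELECT DISTINCT url FROM `gdelt-bq.gdeltv2.webngrams` "
        "WHERE DATE(date) = '2022-01-20' AND lang = 'ru' "
        "AND ngram = 'Украина' "
        f"AND NET.REG_DOMAIN(url) IN ({in_list}) LIMIT 100"
    )


if __name__ == "__main__":
    state_domains = domains_with(CATALOG_CSV, "ownership", "state")
    print(build_query(state_domains))
```

Swapping "ownership" for any other column in the catalog (partisanship, city, audience size and so on) would yield the corresponding segment. The printed query text can then be pasted into BigQuery.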