Using Web NGrams 3.0 & Custom Media Catalogs To Segment By Country, State Ownership, Partisanship Or Other Attributes

Many research questions that interest media scholars involve segmenting the media by various characteristics, from country of origin to ownership to partisanship. Beyond estimating the most likely country of origin for each news outlet, GDELT does not attempt to curate media outlets into various categories due to the inherent ambiguity and disagreements that tend to emerge from such categorizations. For example, sorting US news outlets into "conservative-leaning" versus "liberal-leaning" versus "neutral" categories will inevitably lead to sharp disagreements as to which outlets belong where.

For thematic, geographic and other searches that can be conducted with the GKG, it has long been trivial to join the search results with an external domain list to segment them. For keyword queries using the DOC 2.0 API, however, the process has historically been far more involved: it required running a massive number of queries, spaced one every ten seconds, to retrieve the complete result list before merging it with the external outlet list.

The new Web News NGrams 3.0 dataset removes all of this complexity and makes outlet segmentation with external lists trivial: your query returns the complete list of matching URLs, which you can then partition using any external domain list.

For example, take the query below, searching Russian-language coverage from January 20, 2022 for Ukraine ("Украина"):

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' limit 100

This yields a table of results like the following:

URL
https://biz.liga.net/ekonomika/tek/article/tseny-na-azs-priblijayutsya-k-32-grnlitr-vinovaty-husity-rossiya-i-omikron
https://www.1tv.ru/news/2022-01-20/419780-utrata_doveriya_i_rekordno_nizkie_reytingi_itogi_pervogo_goda_prezidentstva_dzho_baydena
https://www.unian.net/politics/v-mide-rf-zayavili-chto-kremlyu-nuzhny-zheleznye-garantii-nevstupleniya-ukrainy-v-nato-novosti-ukraina-11677357.html
https://lenta.ru/news/2022/01/20/mcfaul/
https://echo.msk.ru/blog/umarov_a/2965482-echo/comments.html
https://eizvestia.com/politika/2022/01/20/ykraina-na-finalnom-etape-na-pyti-k-energeticheskomy-bezvizy-galyshenko/
https://regnum.ru/news/polit/3482823.html
https://ua-reporter.com/news/novosti-energonezavisimosti-ukrainy
https://www.dw.com/ru/blinken-v-berline-chem-otvetjat-ssha-i-es-esli-rf-napadet-na-ukrainu/a-60499955
https://ria.ru/20220120/amerika-1768636164.html
https://www.1tv.ru/news/2022-01-20/419740-dzho_bayden_peregovory_vashingtona_s_moskvoy_o_nerazmeschenii_amerikanskogo_strategicheskogo_oruzhiya_na_ukraine_vozmozhny
https://www.pravda.ru/politics/1676361-poklonskaya/
https://ukranews.com/news/828083-google-zapustil-novyj-dudl-pomogayushhij-iskat-blizhajshie-tsentry-vaktsinatsii
https://rus.azattyq.org/a/31662577.html
http://www.infotag.md/politics-m9/296615/
http://vlasti.net/news/339393
https://racurs.ua/n165938-zelenskiy-obratilsya-k-ukraincam-iz-za-ugrozy-polnomasshtabnoy-voyny-s-rossiey.html

The URLs on this list come from a range of news outlets. What if we want to look only at outlets whose domains end in ".ru"? While this will miss Russian outlets whose domains end in ".com" or other suffixes, it gives us a simple, albeit flawed, filter:

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' and NET.HOST(url) like '%.ru' limit 100

As expected, this has narrowed our results to outlets whose domains end in the Russian country code:

URL
https://regnum.ru/news/polit/3482141.html
https://riafan.ru/1591148-v-dnr-nazvali-postavki-oruzhiya-ukraine-povysheniem-stavok-v-eskalacii-konflikta-v-donbasse
https://ria.ru/20220120/poroshenko-1768731346.html
https://regnum.ru/news/3481854.html
https://www.mk.ru/politics/2022/01/21/vashington-razreshil-stranam-baltii-peredat-oruzhie-ukraine.html
https://newtimes.ru/articles/detail/207638/
https://www.interfax.ru/world/816536
https://ko.ru/articles/fondovyy-rynok-ne-verit-v-voynu-s-ukrainoy/
https://lenta.ru/news/2022/01/20/bayraktar/
https://aif.ru/society/bla_bayraktar_vms_ukrainy_provodit_patrulirovanie_chernogo_morya_-_smi
https://www.pravda.ru/news/world/1676294-soratnik_zelenskogo_obvinil/
https://ria.ru/20220120/zhirinovskiy-1768739981.html
https://www.pravda.ru/news/world/1676523-erdogan/
https://www.tatar-inform.ru/news/v-kremle-ocenili-ideyu-erdogana-organizovat-vstrecu-putina-i-zelenskogo-5851370
https://regnum.ru/news/3482141.html
https://sputnik-georgia.ru/20220120/263754090.html
https://ria.ru/20220120/bidenthreats-1768745395.html
https://rueconomics.ru/561386-zerkalnyi-otvet-na-novye-antirossiiskie-sankcii-ssha-udarit-po-amerikanskim-biznesmenam
https://www.vedomosti.ru/finance/news/2022/01/20/905751-tsb-zayavil-ob-otsutstvii-planov-zapreta
https://sobesednik.ru/politika/20220120-putin-vsyu-zizn-budet-sozalet-ob-etom
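
The same suffix test can also be applied client-side to a list of URLs that has already been exported from a query. The following is a minimal Python sketch of that idea, not part of GDELT's own tooling: the function name `host_ends_with` is hypothetical, and the sample URLs are copied from the first results table above. It also makes the filter's blind spot concrete, since any outlet on a ".com", ".org" or other domain is dropped no matter where it is based.

```python
from urllib.parse import urlsplit


def host_ends_with(url: str, suffix: str) -> bool:
    """Return True if the URL's hostname ends with `suffix` (e.g. '.ru').

    Hypothetical helper that mirrors the NET.HOST(url) LIKE '%.ru' test.
    urlsplit() lowercases the hostname, so the comparison here is
    case-insensitive.
    """
    host = urlsplit(url).hostname or ""
    return host.endswith(suffix.lower())


# Sample URLs taken from the first results table.
urls = [
    "https://lenta.ru/news/2022/01/20/mcfaul/",
    "https://www.dw.com/ru/blinken-v-berline-chem-otvetjat-ssha-i-es-esli-rf-napadet-na-ukrainu/a-60499955",
    "https://rus.azattyq.org/a/31662577.html",
    "https://ria.ru/20220120/amerika-1768636164.html",
    "https://regnum.ru/news/polit/3482823.html",
]

# Keep only the URLs whose hostname ends in the Russian country code.
ru_urls = [u for u in urls if host_ends_with(u, ".ru")]

for u in ru_urls:
    print(u)
```

Note that the test is applied to the hostname only, so the "/ru/" path segment in the dw.com URL does not cause a false match.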

But what if we want to narrow our search specifically to news outlets owned by, supported by or connected to the Russian government? In a real application, you would use a spreadsheet of all of the outlets of interest, but to keep the example code simple, we'll use just four outlets: TASS, RIA, RT and SputnikNews.

SELECT distinct url FROM `gdelt-bq.gdeltv2.webngrams` WHERE DATE(date) = "2022-01-20" and lang='ru' and ngram='Украина' and NET.REG_DOMAIN(url) in
(select 'tass.ru' UNION ALL select 'ria.ru' UNION ALL select 'rt.com' UNION ALL select 'sputniknews.com')
limit 100

In just over a second we have results like the following:

URL
http://special.tass.ru/ekonomika/13483631
https://tass.ru/mezhdunarodnaya-panorama/13484335
https://ria.ru/20220120/bidenthreats-1768745395.html
https://russian.rt.com/inotv/2022-01-20/RND-perevorachivat-realnost-s-nog
http://special.tass.ru/mezhdunarodnaya-panorama/13484335
https://ria.ru/20220120/poroshenko-1768731346.html
https://russian.rt.com/ussr/news/951472-snbo-danilov-rossia
http://special.tass.ru/politika/13475917
http://special.tass.ru/mezhdunarodnaya-panorama/13483523
https://tass.ru/info/13477685
https://ria.ru/20220120/bayden-1768808071.html
https://russian.rt.com/world/article/950700-ukraina-rossiya-britaniya-provokaciya
http://special.tass.ru/ekonomika/13485385
http://special.tass.ru/mezhdunarodnaya-panorama/13485441
https://ria.ru/20220120/gaz-1768824162.html
https://ria.ru/20220120/arestovich-1768725459.html
https://tass.ru/ekonomika/13485209
https://ria.ru/20220120/zhirinovskiy-1768739981.html
https://tass.ru/ekonomika/13483631
https://tass.ru/politika/13476813
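
Notice that subdomains such as "special.tass.ru" and "russian.rt.com" match the list entries "tass.ru" and "rt.com", because the query compares registered domains rather than full hostnames. The Python sketch below approximates that behavior for URLs processed outside BigQuery. It is an assumption-laden illustration: the name `matching_domain` is hypothetical, and instead of consulting the public suffix data that a true registered-domain lookup relies on, it simply checks the hostname and each of its parent domains against the list. The two approaches should agree as long as every list entry is itself a registered domain.

```python
from typing import AbstractSet, Optional
from urllib.parse import urlsplit

# The four example outlets used in the query above.
STATE_MEDIA = frozenset({"tass.ru", "ria.ru", "rt.com", "sputniknews.com"})


def matching_domain(url: str, domains: AbstractSet[str]) -> Optional[str]:
    """Return the entry of `domains` that the URL's host belongs to, or None.

    Hypothetical helper: tries the full hostname first, then strips one
    leading label at a time (special.tass.ru -> tass.ru -> ru).
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domains:
            return candidate
    return None


print(matching_domain("http://special.tass.ru/ekonomika/13483631", STATE_MEDIA))
print(matching_domain("https://russian.rt.com/ussr/news/951472-snbo-danilov-rossia", STATE_MEDIA))
print(matching_domain("https://lenta.ru/news/2022/01/20/bayraktar/", STATE_MEDIA))
```

Matching on whole labels rather than raw string suffixes matters here: a plain `endswith("rt.com")` test would also accept an unrelated host such as "smart.com".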

Following this example, you can use any external list of domains to segment your Web NGrams 3.0 results along any dimension imaginable. For media, you could filter by political partisanship, party affiliation, state or private ownership, state or city location, audience size, target audience or any other factor of methodological interest. While such analyses have long been possible with GDELT's DOC 2.0 API, they required considerable work, whereas the Web NGrams 3.0 dataset now makes such outlet segmentation trivial!
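
As a rough illustration of working from a spreadsheet rather than a hard-coded list, the Python sketch below reads a small outlet catalog in CSV form and groups a list of result URLs by the segment assigned to each outlet. Everything about the catalog is an assumption made for the example: the column names `domain` and `segment`, the label `state_affiliated`, the fallback label `uncatalogued`, and the function names are all hypothetical, and the four catalog rows are simply the example outlets used earlier. A real catalog would carry whatever attributes your research question needs.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical catalog; in practice this would be a CSV export of your spreadsheet.
CATALOG_CSV = """domain,segment
tass.ru,state_affiliated
ria.ru,state_affiliated
rt.com,state_affiliated
sputniknews.com,state_affiliated
"""


def load_catalog(text):
    """Parse the CSV text into a {domain: segment} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["domain"].strip().lower(): row["segment"].strip() for row in reader}


def segment_of(url, catalog, default="uncatalogued"):
    """Look up the hostname, then each parent domain, in the catalog."""
    labels = (urlsplit(url).hostname or "").split(".")
    for i in range(len(labels)):
        segment = catalog.get(".".join(labels[i:]))
        if segment is not None:
            return segment
    return default


def partition(urls, catalog):
    """Group URLs by catalog segment, preserving their input order."""
    groups = defaultdict(list)
    for url in urls:
        groups[segment_of(url, catalog)].append(url)
    return dict(groups)


# A few URLs taken from the results tables above.
urls = [
    "https://tass.ru/politika/13476813",
    "https://russian.rt.com/ussr/news/951472-snbo-danilov-rossia",
    "https://www.interfax.ru/world/816536",
    "https://lenta.ru/news/2022/01/20/bayraktar/",
]

groups = partition(urls, load_catalog(CATALOG_CSV))
for segment, members in groups.items():
    print(segment, len(members))
```

URLs from outlets missing from the catalog are kept under the fallback label rather than discarded, which makes gaps in the catalog easy to spot.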