Using Web NGrams 3.0 & Custom Media Catalogs To Segment By Country, State Ownership, Partisanship Or Other Attributes

Many research questions that interest media scholars involve segmenting the media by various characteristics, from country of origin to ownership to partisanship. Beyond estimating the most likely country of origin for each news outlet, GDELT does not attempt to curate media outlets into various categories due to the inherent ambiguity and disagreements that tend to emerge from such categorizations. For example, sorting US news outlets into "conservative-leaning" versus "liberal-leaning" versus "neutral" categories will inevitably lead to sharp disagreements as to which outlets belong where.

For thematic, geographic and other searches that can be conducted with the GKG, it has long been trivial to join the GKG search results with an external domain list to segment the results. For keyword queries using the DOC 2.0 API, however, the process has historically been far more involved: it required running a large number of queries, spaced one every ten seconds, to retrieve the complete result list before merging it with the external outlet list.

The new Web News NGrams 3.0 dataset removes all of this complexity and makes outlet segmentation with external lists trivial: a single query returns the complete list of matching URLs, which you can then partition using any external domain list.

For example, take the query below, searching Russian-language coverage from January 20, 2022 for Ukraine ("Украина"):

SELECT DISTINCT url
FROM `gdelt-bq.gdeltv2.webngrams`
WHERE DATE(date) = "2022-01-20" AND lang = 'ru' AND ngram = 'Украина'
LIMIT 100

This yields a table of results like the following:

URL

https://biz.liga.net/ekonomika/tek/article/tseny-na-azs-priblijayutsya-k-32-grnlitr-vinovaty-husity-rossiya-i-omikron

https://www.1tv.ru/news/2022-01-20/419780-utrata_doveriya_i_rekordno_nizkie_reytingi_itogi_pervogo_goda_prezidentstva_dzho_baydena

https://www.unian.net/politics/v-mide-rf-zayavili-chto-kremlyu-nuzhny-zheleznye-garantii-nevstupleniya-ukrainy-v-nato-novosti-ukraina-11677357.html

https://lenta.ru/news/2022/01/20/mcfaul/

https://echo.msk.ru/blog/umarov_a/2965482-echo/comments.html

https://eizvestia.com/politika/2022/01/20/ykraina-na-finalnom-etape-na-pyti-k-energeticheskomy-bezvizy-galyshenko/

https://regnum.ru/news/polit/3482823.html

https://ua-reporter.com/news/novosti-energonezavisimosti-ukrainy

https://www.dw.com/ru/blinken-v-berline-chem-otvetjat-ssha-i-es-esli-rf-napadet-na-ukrainu/a-60499955

https://ria.ru/20220120/amerika-1768636164.html

https://www.1tv.ru/news/2022-01-20/419740-dzho_bayden_peregovory_vashingtona_s_moskvoy_o_nerazmeschenii_amerikanskogo_strategicheskogo_oruzhiya_na_ukraine_vozmozhny

https://www.pravda.ru/politics/1676361-poklonskaya/

https://ukranews.com/news/828083-google-zapustil-novyj-dudl-pomogayushhij-iskat-blizhajshie-tsentry-vaktsinatsii

https://rus.azattyq.org/a/31662577.html

http://www.infotag.md/politics-m9/296615/

http://vlasti.net/news/339393

https://racurs.ua/n165938-zelenskiy-obratilsya-k-ukraincam-iz-za-ugrozy-polnomasshtabnoy-voyny-s-rossiey.html

The URLs on this list come from a wide range of news outlets. What if we want to look only at outlets whose domains end in ".ru"? This will miss Russian outlets whose domains end in ".com" or other suffixes, but it gives us a simple, albeit flawed, filter:

SELECT DISTINCT url
FROM `gdelt-bq.gdeltv2.webngrams`
WHERE DATE(date) = "2022-01-20" AND lang = 'ru' AND ngram = 'Украина'
  AND NET.HOST(url) LIKE '%.ru'
LIMIT 100

As expected, this narrows our results to news outlets whose domains end in the Russian country code:

URL

https://regnum.ru/news/polit/3482141.html

https://riafan.ru/1591148-v-dnr-nazvali-postavki-oruzhiya-ukraine-povysheniem-stavok-v-eskalacii-konflikta-v-donbasse

https://ria.ru/20220120/poroshenko-1768731346.html

https://regnum.ru/news/3481854.html

https://www.mk.ru/politics/2022/01/21/vashington-razreshil-stranam-baltii-peredat-oruzhie-ukraine.html

https://newtimes.ru/articles/detail/207638/

https://www.interfax.ru/world/816536

https://ko.ru/articles/fondovyy-rynok-ne-verit-v-voynu-s-ukrainoy/

https://lenta.ru/news/2022/01/20/bayraktar/

https://aif.ru/society/bla_bayraktar_vms_ukrainy_provodit_patrulirovanie_chernogo_morya_-_smi

https://www.pravda.ru/news/world/1676294-soratnik_zelenskogo_obvinil/

https://ria.ru/20220120/zhirinovskiy-1768739981.html

https://www.pravda.ru/news/world/1676523-erdogan/

https://www.tatar-inform.ru/news/v-kremle-ocenili-ideyu-erdogana-organizovat-vstrecu-putina-i-zelenskogo-5851370

https://regnum.ru/news/3482141.html

https://sputnik-georgia.ru/20220120/263754090.html

https://ria.ru/20220120/bidenthreats-1768745395.html

https://rueconomics.ru/561386-zerkalnyi-otvet-na-novye-antirossiiskie-sankcii-ssha-udarit-po-amerikanskim-biznesmenam

https://www.vedomosti.ru/finance/news/2022/01/20/905751-tsb-zayavil-ob-otsutstvii-planov-zapreta

https://sobesednik.ru/politika/20220120-putin-vsyu-zizn-budet-sozalet-ob-etom
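The same host-suffix filter can also be applied after the fact to a list of URLs you have already exported. The following is a minimal Python sketch, not part of GDELT itself: the helper names `host_of` and `filter_by_host_suffix` are hypothetical, and `urlsplit(...).hostname` is used here as a rough stand-in for BigQuery's NET.HOST().

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def filter_by_host_suffix(urls, suffix):
    """Keep only URLs whose hostname ends with the given suffix, e.g. '.ru'."""
    return [u for u in urls if host_of(u).endswith(suffix)]


# A few URLs taken from the first result list above.
urls = [
    "https://lenta.ru/news/2022/01/20/mcfaul/",
    "https://www.dw.com/ru/blinken-v-berline-chem-otvetjat-ssha-i-es-esli-rf-napadet-na-ukrainu/a-60499955",
    "https://ria.ru/20220120/amerika-1768636164.html",
    "https://rus.azattyq.org/a/31662577.html",
]

ru_only = filter_by_host_suffix(urls, ".ru")
```

Matching on the hostname rather than the full URL string matters here: the dw.com URL contains "/ru/" in its path, but its host does not end in ".ru", so it is correctly excluded.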

But what if we want to narrow our search specifically to news outlets owned by, supported by or connected to the Russian government? To keep the example simple, we'll use just four outlets: TASS, RIA, RT and SputnikNews. In a real application, you would instead use a spreadsheet of all of the outlets of interest:

SELECT DISTINCT url
FROM `gdelt-bq.gdeltv2.webngrams`
WHERE DATE(date) = "2022-01-20" AND lang = 'ru' AND ngram = 'Украина'
  AND NET.REG_DOMAIN(url) IN (
    SELECT 'tass.ru' UNION ALL SELECT 'ria.ru' UNION ALL SELECT 'rt.com' UNION ALL SELECT 'sputniknews.com'
  )
LIMIT 100

In just over a second we have results like the following:

URL

http://special.tass.ru/ekonomika/13483631

https://tass.ru/mezhdunarodnaya-panorama/13484335

https://ria.ru/20220120/bidenthreats-1768745395.html

https://russian.rt.com/inotv/2022-01-20/RND-perevorachivat-realnost-s-nog

http://special.tass.ru/mezhdunarodnaya-panorama/13484335

https://ria.ru/20220120/poroshenko-1768731346.html

https://russian.rt.com/ussr/news/951472-snbo-danilov-rossia

http://special.tass.ru/politika/13475917

http://special.tass.ru/mezhdunarodnaya-panorama/13483523

https://tass.ru/info/13477685

https://ria.ru/20220120/bayden-1768808071.html

https://russian.rt.com/world/article/950700-ukraina-rossiya-britaniya-provokaciya

http://special.tass.ru/ekonomika/13485385

http://special.tass.ru/mezhdunarodnaya-panorama/13485441

https://ria.ru/20220120/gaz-1768824162.html

https://ria.ru/20220120/arestovich-1768725459.html

https://tass.ru/ekonomika/13485209

https://ria.ru/20220120/zhirinovskiy-1768739981.html

https://tass.ru/ekonomika/13483631

https://tass.ru/politika/13476813
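If you would rather export the unfiltered URL list once and segment it locally, the domain match can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `STATE_LINKED`, `matching_domain` and `partition` are hypothetical names, and the suffix comparison is a simplification of NET.REG_DOMAIN(), which relies on the Public Suffix List rather than a plain string match.

```python
from urllib.parse import urlsplit

# Illustrative list only: the four outlets named in the text.
STATE_LINKED = {"tass.ru", "ria.ru", "rt.com", "sputniknews.com"}


def matching_domain(url, domains):
    """Return the listed domain that the URL's host equals or is a
    subdomain of, or None if no listed domain matches."""
    host = (urlsplit(url).hostname or "").lower()
    for domain in domains:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def partition(urls, domains):
    """Split URLs into (on the list, not on the list), preserving order."""
    inside, outside = [], []
    for url in urls:
        target = inside if matching_domain(url, domains) else outside
        target.append(url)
    return inside, outside


urls = [
    "http://special.tass.ru/ekonomika/13483631",
    "https://russian.rt.com/ussr/news/951472-snbo-danilov-rossia",
    "https://lenta.ru/news/2022/01/20/mcfaul/",
    "https://www.pravda.ru/politics/1676361-poklonskaya/",
]

state_linked, other = partition(urls, STATE_LINKED)
```

Requiring a leading dot before the listed domain means that subdomains such as special.tass.ru and russian.rt.com match, while an unrelated host that merely ends in the same letters does not.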

Following this example, you can use any external list of domains to segment your Web NGrams 3.0 results along any dimension imaginable. For media, you could filter by political partisanship, party affiliation, state or private ownership, state or city location, audience size, target audience or any other factor of methodological interest. While such analyses have long been possible with GDELT's DOC 2.0 API, they required considerable work, whereas the Web NGrams 3.0 dataset now makes such outlet segmentation trivial!
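To illustrate segmenting by a labeled catalog rather than a flat list, here is a minimal Python sketch that reads a small CSV catalog and tallies URLs per label. Everything in it is an assumption made for illustration: the column names `domain` and `group`, the label "state-linked", the default label "uncatalogued" and the helper names are all hypothetical, and the catalog holds only the four outlets used in the example above. A real catalog would be your own spreadsheet, with whatever attribute columns your research question calls for.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical catalog; in practice this would be loaded from your own file.
CATALOG_CSV = """domain,group
tass.ru,state-linked
ria.ru,state-linked
rt.com,state-linked
sputniknews.com,state-linked
"""


def load_catalog(csv_text):
    """Parse the CSV text into a {domain: group} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["domain"].strip().lower(): row["group"].strip() for row in reader}


def lookup(host, catalog):
    """Try the full host, then each parent domain, against the catalog."""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in catalog:
            return catalog[candidate]
    return None


def tally_by_group(urls, catalog, default="uncatalogued"):
    """Count URLs per catalog label; unmatched hosts get the default label."""
    counts = Counter()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        counts[lookup(host, catalog) or default] += 1
    return counts


urls = [
    "http://special.tass.ru/ekonomika/13483631",
    "https://russian.rt.com/inotv/2022-01-20/RND-perevorachivat-realnost-s-nog",
    "https://ria.ru/20220120/gaz-1768824162.html",
    "https://lenta.ru/news/2022/01/20/bayraktar/",
    "https://www.interfax.ru/world/816536",
]

counts = tally_by_group(urls, load_catalog(CATALOG_CSV))
```

Looking up each host by walking up its parent domains keeps the cost per URL proportional to the number of labels in the hostname rather than the size of the catalog, which helps when the catalog contains many thousands of outlets.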

The GDELT Project
