A New Multilingual Dataset For Exploring Climate Change Narratives: 4.1 Million News URLs In 63 Languages 2015-2020

For those interested in exploring the global news narrative around climatic change, we are excited to announce today a massive dataset of 4.1 million URLs and page titles of worldwide online news coverage in 63 languages, monitored by GDELT from 2015 to 2020. An article is included if GDELT's machine translation of it into English contained any of the phrases "climate change", "global warming", "climate crisis", "greenhouse gas", "greenhouse gases" or "carbon tax". Machine translation is far from perfect, and the five-year span covered by this dataset has seen enormous improvements in its quality, so you will undoubtedly find numerous errors in this dataset. Any mention of the terms above is sufficient for an article to be included in this list. We have not performed further relevancy filtering, given that a number of research questions revolve around the way in which climatic change is increasingly being casually referenced across the narrative landscape.
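To make the inclusion rule concrete, here is a minimal sketch of an any-phrase match over translated text. It is not GDELT's actual implementation: the case-insensitive, word-boundary matching and the function name `mentions_climate_terms` are assumptions made purely for illustration.

```python
import re

# The six phrases listed in this post.
PHRASES = [
    "climate change",
    "global warming",
    "climate crisis",
    "greenhouse gas",
    "greenhouse gases",
    "carbon tax",
]

# Assumption: case-insensitive matching on word boundaries.
# The post does not say how GDELT performed the match.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in PHRASES) + r")\b",
    re.IGNORECASE,
)


def mentions_climate_terms(translated_text: str) -> bool:
    """Return True if any listed phrase occurs anywhere in the text."""
    return _PATTERN.search(translated_text) is not None
```

A single passing mention is enough to return True, which mirrors the lack of relevancy filtering described above.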

GDELT has long provided the list of all URLs it crawls each day to the Internet Archive for preservation, so URLs in this list that no longer exist can often be found in the Archive's Wayback Machine.
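As a rough sketch of how a dead link might be looked up, the function below builds a Wayback Machine address from an article URL and the timestamp GDELT recorded for it. It relies on the Wayback Machine's public `https://web.archive.org/web/<timestamp>/<url>` address form, which resolves to the capture closest to the given timestamp when one exists. The function name `wayback_url` and the example URL are hypothetical, and there is no guarantee that any particular article was captured.

```python
def wayback_url(article_url: str, seen: str) -> str:
    """Build a Wayback Machine address for the capture closest to `seen`.

    `seen` is a YYYYMMDDHHMMSS timestamp, the same format used in the
    first column of the dataset files.
    """
    if len(seen) != 14 or not (seen.isascii() and seen.isdigit()):
        raise ValueError("expected a 14-digit YYYYMMDDHHMMSS timestamp")
    return f"https://web.archive.org/web/{seen}/{article_url}"


# Hypothetical article URL, for illustration only.
print(wayback_url("http://example.com/story.html", "20190301120000"))
```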

The language breakdown is as follows:

Lang Count
afr 360
ara 207151
aze 3573
ben 9157
bos 4433
bul 27408
cat 7261
ces 33746
dan 32877
deu 526335
ell 132440
est 6858
fas 2017
fin 55386
fra 551882
glg 1961
guj 632
heb 4853
hin 19820
hrv 17059
hun 48051
hye 4273
ind 35412
isl 24932
ita 197377
jpn 9413
kan 848
kat 424
kaz 381
kor 56758
lav 7501
lit 18633
mal 412
mar 696
mkd 14044
mon 3264
msa 2487
nep 2817
nld 167557
nno 1
nor 31927
pan 417
pol 30912
por 123480
ron 70365
rus 157497
sin 464
slk 12888
slv 8863
som 140
spa 943725
sqi 7274
srp 22051
swa 683
swe 101994
tam 979
tel 92
tha 350
tur 89113
ukr 15232
urd 2184
vie 11545
zho 293594

Each yearly file consists of four columns, as seen in the header image: the date GDELT saw the article in YYYYMMDDHHMMSS format, the language code, the title of the page and the URL.
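The four-column layout can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the post does not say which delimiter the files use, so the comma default here is a guess that can be overridden, and the `Row` type, the `read_rows` function and the sample rows are all invented for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime
from typing import Iterator, NamedTuple, TextIO


class Row(NamedTuple):
    seen: datetime  # when GDELT saw the article
    lang: str       # language code, e.g. "spa"
    title: str      # page title
    url: str        # article URL


def read_rows(handle: TextIO, delimiter: str = ",") -> Iterator[Row]:
    """Yield one Row per well-formed line and skip anything else."""
    # Assumption: comma-delimited. Pass delimiter="\t" for tab-delimited files.
    for fields in csv.reader(handle, delimiter=delimiter):
        if len(fields) != 4:
            continue  # not the expected four columns
        stamp, lang, title, url = fields
        try:
            seen = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # also skips a header row, if the file has one
        yield Row(seen, lang, title, url)


# Hypothetical sample rows, invented for illustration only.
sample = io.StringIO(
    "20190301120000,spa,Un titular de ejemplo,http://example.com/a\n"
    "20190302083000,deu,Eine Beispielschlagzeile,http://example.com/b\n"
    "20190302091500,spa,Otro titular,http://example.com/c\n"
)
counts = Counter(row.lang for row in read_rows(sample))
print(counts.most_common())
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`. Counting by language this way gives a per-file version of the breakdown table above.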

Note that these files are encoded in UTF-8, but some spreadsheet software, including Microsoft Excel, can encounter problems loading multilingual files and/or may fail to display certain scripts. You may have to experiment with the best way of loading the files into your analysis software.
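One thing worth trying when a spreadsheet program garbles the text is to re-save the file with a UTF-8 byte order mark, which Excel often uses to detect the encoding. The sketch below does that with Python's built-in `utf-8-sig` codec. The function name `add_utf8_bom` is hypothetical, and this will not help with scripts that the software simply has no font for.

```python
from pathlib import Path


def add_utf8_bom(src: Path, dst: Path) -> None:
    """Copy a UTF-8 text file to `dst`, writing a byte order mark first."""
    # Reading with "utf-8-sig" drops an existing BOM, so the output never
    # ends up with two. newline="" keeps the original line endings intact.
    with open(src, "r", encoding="utf-8-sig", newline="") as fin, \
         open(dst, "w", encoding="utf-8-sig", newline="") as fout:
        for line in fin:
            fout.write(line)
```

In your own analysis code, passing `encoding="utf-8"` explicitly when opening the files avoids depending on the platform's default encoding.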