For those interested in exploring the global news narrative around climate change, we are excited to announce today a massive dataset of 4.1 million URLs and page titles from worldwide online news coverage in 63 languages monitored by GDELT from 2015 to 2020. An article is included if GDELT's machine translation of it into English contained any of the phrases "climate change", "global warming", "climate crisis", "greenhouse gas", "greenhouse gases" or "carbon tax". Machine translation is far from perfect, and its quality has improved enormously over the five-year span covered by this dataset, so you will undoubtedly find numerous errors. Any mention of the terms above is sufficient for an article to be included in this list. We have not performed further relevancy filtering, since a number of research questions revolve around the increasingly casual way in which climate change is referenced across the narrative landscape.
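To make the inclusion rule concrete, here is a minimal sketch of an "any of these phrases" test applied to an English translation. The function name and the case-insensitive substring matching are assumptions made for illustration; the exact matching GDELT used is not described here.

```python
# Phrases listed in the announcement; an article qualifies if ANY one appears.
PHRASES = (
    "climate change",
    "global warming",
    "climate crisis",
    "greenhouse gas",
    "greenhouse gases",
    "carbon tax",
)


def mentions_climate_terms(english_text):
    """Return True if the translated text contains any listed phrase.

    Assumption: case-insensitive substring matching. No relevance
    scoring is applied, so a single passing mention is enough.
    """
    text = english_text.lower()
    return any(phrase in text for phrase in PHRASES)


passing_mention = mentions_climate_terms(
    "The sports minister also briefly mentioned a Carbon Tax."
)
unrelated = mentions_climate_terms("The weather was mild today.")
```

Because one mention is enough, the first example qualifies even though the article it stands for is not primarily about climate.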
GDELT has long provided the list of all URLs it crawls each day to the Internet Archive for preservation, so URLs in this list that no longer exist can often be found in the Archive's Wayback Machine.
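As a rough sketch of how a dead link might be looked up, the snippet below builds a Wayback Machine address from a URL and the date GDELT saw the article (the first column of each file). It relies on the Wayback Machine's public `https://web.archive.org/web/<timestamp>/<url>` address pattern, which resolves to the capture closest to the given timestamp. The function name is hypothetical, and there is no guarantee that a capture exists for any particular URL.

```python
def wayback_url(original_url, seen_at):
    """Build a Wayback Machine address for the capture nearest `seen_at`.

    `seen_at` is a YYYYMMDDHHMMSS string, the same format used in the
    first column of the dataset files.
    """
    if len(seen_at) != 14 or not seen_at.isdigit():
        raise ValueError("seen_at must be a 14-digit YYYYMMDDHHMMSS string")
    return f"https://web.archive.org/web/{seen_at}/{original_url}"


# Illustrative values only, not taken from the dataset.
example = wayback_url("https://example.com/article", "20190923141500")
```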
The language breakdown is as follows:
Lang | Count |
afr | 360 |
ara | 207151 |
axe | 3573 |
ben | 9157 |
bos | 4433 |
bul | 27408 |
cat | 7261 |
ces | 33746 |
dan | 32877 |
deu | 526335 |
ell | 132440 |
est | 6858 |
fas | 2017 |
fin | 55386 |
fra | 551882 |
glg | 1961 |
guj | 632 |
heb | 4853 |
hin | 19820 |
hrv | 17059 |
hun | 48051 |
hye | 4273 |
ind | 35412 |
isl | 24932 |
ita | 197377 |
jpn | 9413 |
kan | 848 |
kat | 424 |
kaz | 381 |
kor | 56758 |
lav | 7501 |
lit | 18633 |
mal | 412 |
mar | 696 |
mkd | 14044 |
mon | 3264 |
msa | 2487 |
nep | 2817 |
nld | 167557 |
nno | 1 |
nor | 31927 |
pan | 417 |
pol | 30912 |
por | 123480 |
ron | 70365 |
rus | 157497 |
sin | 464 |
slk | 12888 |
slv | 8863 |
som | 140 |
spa | 943725 |
sqi | 7274 |
srp | 22051 |
swa | 683 |
swe | 101994 |
tam | 979 |
tel | 92 |
tha | 350 |
tur | 89113 |
ukr | 15232 |
urd | 2184 |
vie | 11545 |
zho | 293594 |
Each yearly file consists of four columns, as seen in the header image: the date GDELT saw the article in YYYYMMDDHHMMSS format, the language code, the title of the page, and the URL.
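The four-column layout can be read with a few lines of Python. This is a sketch under stated assumptions: the files are assumed to be tab-delimited with no quoting, which is not stated above, so adjust `delimiter` if your copy differs. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def read_rows(handle, delimiter="\t"):
    """Yield (seen_at, lang, title, url) tuples from one yearly file.

    Assumption: one record per line, four delimiter-separated fields.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines rather than guess their layout
        date_str, lang, title, url = row
        try:
            seen_at = datetime.strptime(date_str, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # first field is not a YYYYMMDDHHMMSS date
        yield seen_at, lang, title, url


# Invented sample rows standing in for a real file.
sample = (
    "20190923141500\tdeu\tBeispieltitel zum Klimawandel\thttps://example.com/a\n"
    "20190923143000\tspa\tTitular de ejemplo sobre el cambio climático\thttps://example.com/b\n"
    "20191002080000\tspa\tOtro titular de ejemplo\thttps://example.com/c\n"
    "not-a-date\tspa\tfila defectuosa\thttps://example.com/d\n"
)

rows = list(read_rows(io.StringIO(sample)))
lang_counts = Counter(lang for _, lang, _, _ in rows)
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object. Tallying the language column this way should give a per-language breakdown comparable to the table above.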
Note that these files are encoded in UTF-8, but some spreadsheet software, including Microsoft Excel, can encounter problems loading multilingual files and/or may fail to display certain scripts. You may have to experiment with the best way of loading the files into your analysis software.
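One workaround that often helps with Excel is to rewrite a file as comma-separated UTF-8 with a byte order mark, which Excel generally uses to detect the encoding. The sketch below does this with Python's `utf-8-sig` codec. The function name, the file names and the tab-delimited input are assumptions, and the sample row is invented. Fonts for some scripts may still be missing on a given machine.

```python
import csv
import os
import tempfile


def convert_for_excel(src_path, dst_path, delimiter="\t"):
    """Copy a UTF-8 delimited file to a CSV that starts with a UTF-8 BOM."""
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # comma-separated, quoted only where needed
        for row in reader:
            writer.writerow(row)


# Demonstration on a temporary file holding one invented row.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "sample.txt")
dst_path = os.path.join(tmp_dir, "sample_excel.csv")
with open(src_path, "w", encoding="utf-8", newline="") as f:
    f.write("20200115093000\tjpn\t気候変動のサンプル見出し\thttps://example.com/c\n")

convert_for_excel(src_path, dst_path)

with open(dst_path, "rb") as f:
    first_bytes = f.read(3)  # the BOM, if it was written
```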