A New Dataset For Exploring The Global Multilingual Covid-19 Online News Narrative

UPDATE (4/26/2020): Now updating daily!

How has Covid-19 been covered around the world, and how does that coverage compare to the baseline of viral outbreak coverage since November 2015? To explore these questions in more detail, we've compiled a dataset of the 13.46 million worldwide online news articles monitored by GDELT since November 2015, across the 65 languages it machine translates, that mention either "virus" or "Covid-19." Each row represents a distinct URL and includes the date GDELT first saw the article, the article's title, its URL and the language code of its primary publication language.
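As a rough illustration of working with rows of this shape, the sketch below counts articles per language code in a gzip-compressed CSV. The column order (date, title, URL, language code), the absence of a header row and the date format are assumptions made for the example, not details confirmed here, so check them against the actual files first. The sample rows are synthetic.

```python
import csv
import gzip
import io
from collections import Counter

# ASSUMPTION: column order and names are illustrative only; verify against the real files.
ASSUMED_COLUMNS = ["date", "title", "url", "lang"]


def count_by_language(gz_bytes, columns=ASSUMED_COLUMNS):
    """Count rows per language code in a gzip-compressed CSV held in memory."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as fh:
        # fieldnames is passed explicitly because we assume there is no header row.
        for row in csv.DictReader(fh, fieldnames=columns):
            counts[row["lang"]] += 1
    return counts


# Synthetic sample data (hypothetical dates, titles and URLs), compressed in memory.
sample_rows = [
    ["2020-04-01", "Virus update", "https://example.org/a", "en"],
    ["2020-04-01", "Noticias sobre el virus", "https://example.org/b", "es"],
    ["2020-04-02", "Covid-19 briefing", "https://example.org/c", "en"],
]
buf = io.StringIO()
csv.writer(buf).writerows(sample_rows)
sample_gz = gzip.compress(buf.getvalue().encode("utf-8"))

language_counts = count_by_language(sample_gz)
```

For a real file, the same function could be given the bytes of a downloaded file, though streaming from disk would be kinder to memory for the larger historical datasets.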

Given the rapidly evolving global nomenclature for Covid-19, and thus the difficulty of translating it correctly across GDELT's 65 live-translated languages, we chose to include all matches for "virus." This brings in articles about computer viruses and other non-medical topics, but it maximizes recall for Covid-19 coverage. While the majority of matches are from late December 2019 to present, we included matches back to November 2015 to provide a baseline for coverage of viral outbreaks, such as seasonal flu.

We're tremendously excited to see how researchers can use this data to understand how the virus has been covered globally.

There are two historical datasets, one covering 11/30/2015 to 3/29/2020 and a second covering 3/29/2020 to 4/22/2020. Each day since then is covered by a daily file, updated each morning. The daily files appear in the filelist below as the files matching "*.multilingualurls.csv.gz".
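One way to pick out the daily files from a filelist is to filter on the filename suffix. The sketch below assumes the filelist is plain text with one URL per line, which is an assumption about its format. The function name and the example.org URLs are hypothetical placeholders.

```python
# Suffix taken from the "*.multilingualurls.csv.gz" pattern described above.
SUFFIX = ".multilingualurls.csv.gz"


def select_multilingual_files(filelist_text, suffix=SUFFIX):
    """Return the entries of a one-URL-per-line filelist that end with the suffix."""
    selected = []
    for line in filelist_text.splitlines():
        entry = line.strip()
        if entry.endswith(suffix):
            selected.append(entry)
    return selected


# Hypothetical filelist contents; the real entries will differ.
sample_filelist = """\
https://example.org/data/20200423.multilingualurls.csv.gz
https://example.org/data/20200423.other.csv.gz
https://example.org/data/20200424.multilingualurls.csv.gz
"""

daily_files = select_multilingual_files(sample_filelist)
```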