Announcing A Massive New Geographic News Database Of The Locations Mentioned In Covid-19 News Coverage

Last month we created a map of the locations on earth most commonly mentioned in worldwide online news coverage of Covid-19 using GDELT's GEO 2.0 API. This is NOT an infection map, but rather a map of the locations mentioned most commonly in Covid-19 news coverage, from mentions of quarantine rules to governments announcing preparedness plans to discussions of hypothetical scenarios to actual mentions of cases, all mixed together. In short, it is a map of the locations the news media is mentioning in its coverage of Covid-19, providing a holistic look at how societies are responding to, internalizing, contextualizing, understanding and being affected by the pandemic, including dangerous behaviors like violating quarantine orders and spreading misinformation.

The GEO 2.0 API used to create that map searches only a rolling window of the last 7 days, so while it reflects the most recent locations associated with a topic, it doesn't support longitudinal assessments. It also displays only the most commonly associated locations, rather than an exhaustive list of all matching locations.

To enable far more powerful kinds of cartographic explorations of the news landscape associated with Covid-19, we've created a special extract from a portion of the dataset that powers the GEO 2.0 API. It consists of all geographic locations worldwide mentioned in English news coverage within 600 characters of "social distan*", "quarantin*", "lockdown*", "stay at home", "shelter in place", "self isolat*", "*virus*", "Covid-19" and "Sars-Cov-19". The GEO 2.0 API searches all languages, but for now we're starting this extract with English coverage only.

In short, we take a news article, identify and disambiguate all of the geographic locations it mentions (converting a mention of "Paris" to the actual latitude and longitude of the city in France) and compile a 600-character snippet surrounding each location mention that can be used to understand the context of that mention. Every location mention whose contextual snippet contains one of the keywords above is added to this dataset, which updates daily.
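The keyword filter described above can be sketched roughly as follows. This is only an illustration, not GDELT's actual matching code: it assumes that "*" stands for zero or more word characters, that matching is case-insensitive, and that keywords must start and end on word boundaries. The function names are hypothetical. Note also that the snippets published in the dataset are lower-cased with punctuation removed, so a term like "Covid-19" may look different there; this sketch matches against ordinary text.

```python
import re

# Keyword list as given in the announcement.
KEYWORDS = [
    "social distan*", "quarantin*", "lockdown*", "stay at home",
    "shelter in place", "self isolat*", "*virus*", "Covid-19", "Sars-Cov-19",
]

def keyword_to_regex(keyword: str) -> str:
    """Turn one keyword into a regex fragment.

    Assumption: '*' means "zero or more word characters".
    """
    parts = [re.escape(part) for part in keyword.split("*")]
    # Lookarounds keep the match from starting or ending mid-word.
    return r"(?<!\w)" + r"\w*".join(parts) + r"(?!\w)"

PATTERN = re.compile(
    "|".join(keyword_to_regex(k) for k in KEYWORDS),
    re.IGNORECASE,
)

def snippet_matches(snippet: str) -> bool:
    """Return True if the contextual snippet contains any of the keywords."""
    return PATTERN.search(snippet) is not None
```

For example, `snippet_matches("Paris extends lockdowns as cases rise")` is true under these assumptions, while a snippet about a Paris fashion show with none of the keywords is not.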

The final dataset consists of one record for each relevant location mention in each English-language article monitored by GDELT since the start of this year. Each record includes the date/time the article was seen, its URL, title, sharing image, language code and document-level sentiment, the country code GDELT estimated the article to be published in, the human-friendly location name, the centroid latitude/longitude, country code, adm1 and adm2 of the location, the geographic resolution of the match (1=Country, 2=State, 3=US City, 4=International City) and a lower-cased version of the 600-character contextual window with all punctuation removed.
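As a small sketch of how the geographic resolution codes might be used, the snippet below counts mentions per location while keeping only city-level matches. The resolution codes come from the description above, but the field names `location_name` and `geo_resolution` are hypothetical stand-ins, since the actual column names are not listed here; substitute the real ones from the table schema.

```python
from collections import Counter

# Resolution codes as listed in the dataset description.
GEO_RESOLUTION = {
    1: "Country",
    2: "State",
    3: "US City",
    4: "International City",
}

def top_locations(records, min_resolution=3, n=10):
    """Count mentions per location, keeping matches at or above min_resolution.

    'location_name' and 'geo_resolution' are assumed field names,
    not the dataset's documented schema.
    """
    counts = Counter(
        rec["location_name"]
        for rec in records
        if int(rec["geo_resolution"]) >= min_resolution
    )
    return counts.most_common(n)
```

With the default `min_resolution=3`, country-level and state-level matches are dropped, which avoids a country centroid outranking the individual cities within it.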

Please remember that this is NOT an infection map. It is a map of locations mentioned in Covid-19 news coverage of all kinds, which covers an incredible range of topics. Please also remember that this dataset is 100% machine generated and that there will undoubtedly be errors in the machine's identification and disambiguation of geographic locations.

The new table is available in BigQuery:

Or as gzipped JSONNL files:
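Files in this kind of format can be streamed one record at a time without decompressing them to disk first. The following is a minimal sketch, assuming "JSONNL" means newline-delimited JSON (one JSON object per line) encoded as UTF-8; the function name is hypothetical.

```python
import gzip
import json
from typing import Iterator

def read_jsonnl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped JSON-lines file.

    Assumption: each line holds exactly one complete JSON object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a large daily file can be filtered or aggregated while holding only one record in memory at a time.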

We're enormously excited to see what you're able to do with this incredibly powerful new cartographic dataset!