UPDATE (1/21/2020): The GEG neural sentiment scores are now integrated into the core GEG dataset instead of a separate dataset.
The thousands of emotional dimensions that GDELT computes today for each article it monitors through GCAM are all produced through classical presence/absence and value-based lexicons. The rise of deep learning-powered sentiment analysis over the past few years raises the question of how a state-of-the-art deep learning sentiment algorithm would "see" the world's news. Today we are excited to unveil the new GDELT Global Entity Graph (GEG) Neural Sentiment Dataset, a dataset of article-level sentiment scores for more than 103 million worldwide English-language online news articles published July 17, 2016 through October 4, 2019, annotated through Google's Cloud Natural Language API.
Each article carries three sentiment values: polarity, magnitude and score. Polarity is an earlier field that has since been deprecated, while score was added in late 2016 and is not present in the earliest records in this dataset. See the Natural Language API documentation for more detail on interpreting these values.
Given the small size of each record, the entire dataset is available as a single UTF-8 newline-delimited JSON file (5.7GB compressed / 21GB uncompressed). Each row represents a single article, with the date/time the article was seen (rounded to the nearest 15 minutes), the URL of the article, its language code and the three values above. The JSON structure of each row is as follows:
- date. The timestamp the article was seen, rounded to the nearest 15 minutes.
- url. The URL of the article.
- lang. The Google-provided language code of the article used by the API, allowing filtering by language.
- polarity. See the Natural Language API documentation. (This value has been deprecated by the API.)
- magnitude. See the Natural Language API documentation.
- score. See the Natural Language API documentation. (This value was added later and is not present in the first few months of the dataset.)
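Because polarity and score are each missing from part of the date range, code that reads a row should treat every sentiment field as optional. The following is a minimal Python sketch of that idea, not an official reader: the `parse_record` helper and the sample row are hypothetical, and the exact timestamp format of the date field is an assumption, so it is kept as an opaque string.

```python
import json
from typing import Optional, TypedDict


class SentimentRecord(TypedDict):
    date: str                   # timestamp as stored; format not assumed here
    url: str
    lang: Optional[str]
    polarity: Optional[float]   # deprecated; may be absent in newer rows
    magnitude: Optional[float]
    score: Optional[float]      # absent in the earliest rows


def parse_record(line: str) -> SentimentRecord:
    """Parse one newline-delimited JSON row, tolerating missing fields."""
    row = json.loads(line)
    return {
        "date": row["date"],
        "url": row["url"],
        # .get() returns None instead of raising when a key is absent.
        "lang": row.get("lang"),
        "polarity": row.get("polarity"),
        "magnitude": row.get("magnitude"),
        "score": row.get("score"),
    }


# Hypothetical early-style row with no "score" key (values are made up).
sample = (
    '{"date": "2016-07-17T00:00:00Z", "url": "https://example.com/a", '
    '"lang": "en", "polarity": 1, "magnitude": 2.5}'
)
record = parse_record(sample)
print(record["score"])      # None, since the key is missing from this row
print(record["magnitude"])  # 2.5
```

Returning `None` for absent fields keeps early and late rows in one shape, so downstream code can decide explicitly whether to skip them or fall back to polarity.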
You can download the full dataset from:
- MASTER.20160717-20191004.geg-gcnlapi-sentiment.json.gz (5.7GB compressed / 21GB uncompressed)
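At 21GB uncompressed, the file is easier to stream than to load whole. As a rough sketch, assuming the download has been saved locally under the filename above, Python's standard `gzip` module can decompress it line by line. The `iter_records` and `mean_score` helpers are hypothetical names for illustration, and the `"en"` language code is an assumption about how English is labeled in the lang field.

```python
import gzip
import json
from typing import Iterable, Iterator, Optional


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed row at a time from a gzipped newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def mean_score(records: Iterable[dict], lang: str = "en") -> Optional[float]:
    """Average the score field over rows in one language.

    Rows without a score (the earliest part of the dataset) are skipped.
    Returns None when no row qualifies.
    """
    total = 0.0
    count = 0
    for rec in records:
        if rec.get("lang") != lang:
            continue
        score = rec.get("score")
        if score is None:
            continue
        total += score
        count += 1
    return total / count if count else None


# Example usage (assumes the file sits in the current directory):
#   path = "MASTER.20160717-20191004.geg-gcnlapi-sentiment.json.gz"
#   print(mean_score(iter_records(path)))
```

Since `iter_records` is a generator, memory use stays flat regardless of file size; the trade-off is a single sequential pass per aggregate.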
The full dataset is also available in BigQuery:
We're excited to see what you can do with this dataset!