We are tremendously excited to announce today the debut of the GDELT Global Quotation Graph, which compiles quoted statements in worldwide news coverage. It scans each article monitored by GDELT and compiles a list of all quoted statements within, along with sufficient context to allow users in many cases to establish speaker identity. Quotations offer a powerful signal for understanding global events and narratives, from proclamations by government leaders to official statements by public health officials to the statements of ordinary citizens witnessing breaking news or commenting on the impacts of current events on their lives.
Unlike most of GDELT's other primary datasets, the Global Quotation Graph covers all 152 languages monitored by GDELT, though news in some languages may not regularly incorporate quotations or may utilize constructs the underlying models do not currently recognize. This inaugural release uses an internationalized quotation model, but it may not work for all languages, and we would love your feedback as we continue to expand and improve it!
This inaugural release covers all quotations in worldwide online news coverage monitored by GDELT since January 1, 2020 and thus covers the majority of the global spread of the Covid-19 pandemic.
Given the enormous linguistic complexity of attempting to ascribe a speaker to each quote (especially given complex formulations like "AAA was upset with BBB for saying…" versus "BBB spurred the anger of AAA for saying…"), each quoted statement is accompanied by a brief snippet of the text before and after the quotation, which frequently contains enough information to determine the speaker. In some cases, especially in longer articles, speakers may not be readily determinable from the pre/post context alone, and attributing a given quotation to a specific speaker may require further analysis.
The dataset updates every minute and is in UTF8 JSONNL format with each row representing an article containing one or more quotations:
- date. The date and time the article was seen by GDELT.
- url. The full URL of the article.
- title. The title of the article.
- lang. The human-readable name of the language the article is primarily written in. Articles with multiple languages will be listed under the primary language. Language detection is performed by CLD2 and will have a certain level of error.
- quotes. An array containing one or more quotations identified in the article. Each entry in the array has the following three fields:
- pre. A brief snippet of text preceding the quotation, up to 100 characters, to assist with speaker identification. Its length varies with the article language and linguistic cues and will typically be shorter than 100 characters.
- quote. The actual quoted statement itself.
- post. A brief snippet of text following the quotation, up to 100 characters, to assist with speaker identification. Its length varies with the article language and linguistic cues and will typically be shorter than 100 characters.
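To make the record layout above concrete, here is a minimal Python sketch that walks JSONNL lines and pairs each quotation with its parent article. The helper names `iter_quotes` and `context_window` are our own for illustration, and the sample record is invented: its date format, language label and text are assumptions, not real GDELT output.

```python
import json

def iter_quotes(lines):
    """Yield (article, quote) pairs from an iterable of JSONNL lines."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        article = json.loads(line)
        for quote in article.get("quotes", []):
            yield article, quote

def context_window(quote):
    """Join pre + quoted statement + post into one string for speaker review."""
    parts = [
        quote.get("pre", ""),
        '"' + quote.get("quote", "") + '"',
        quote.get("post", ""),
    ]
    return " ".join(p.strip() for p in parts if p.strip())

# Hypothetical record, invented purely for illustration. The exact date
# format and language label used by the real files are assumptions here.
sample_line = json.dumps({
    "date": "2020-01-01T00:01:00Z",
    "url": "https://example.com/story",
    "title": "Example story",
    "lang": "English",
    "quotes": [
        {
            "pre": "The minister said",
            "quote": "we will act quickly",
            "post": "during a press briefing.",
        }
    ],
})

for article, quote in iter_quotes([sample_line]):
    print(article["url"], "|", context_window(quote))
```

Reading pre, quote and post together in one window is usually the quickest way to check by eye whether the surrounding context names the speaker.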
We're tremendously excited to see what you're able to do with this incredible new dataset!
The dataset is available as a BigQuery table:
Or as per-minute UTF8 JSONNL files in the format:
- http://data.gdeltproject.org/gdeltv3/gqg/YYYYMMDDHHMMSS.gqg.json.gz
Thus, the file for January 1, 2020 at 00:01:00 UTC would be http://data.gdeltproject.org/gdeltv3/gqg/20200101000100.gqg.json.gz.
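The filename pattern can be generated from a timestamp. The sketch below assumes that per-minute files always carry `00` in the seconds position, as in the example above; `gqg_url` is a hypothetical helper name, not part of any GDELT tooling.

```python
from datetime import datetime, timezone

BASE_URL = "http://data.gdeltproject.org/gdeltv3/gqg/"

def gqg_url(ts):
    """Build the per-minute file URL for a datetime.

    Naive datetimes are treated as already being in UTC; aware ones are
    converted to UTC. Seconds are truncated to 00 (an assumption based on
    the example filename).
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc).replace(second=0, microsecond=0)
    return BASE_URL + ts.strftime("%Y%m%d%H%M%S") + ".gqg.json.gz"

print(gqg_url(datetime(2020, 1, 1, 0, 1, 0, tzinfo=timezone.utc)))
```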
All times are in UTC and files are typically generated about 2-5 minutes after the current minute. Thus, each minute a cron script could be used to compute the date of five minutes ago and fetch that file. The dataset is currently generated on a 15-minute heartbeat, in which a new set of articles is queued every 15 minutes, and thus you will typically see several minutes of data followed by several minutes of no available files. This is the expected behavior, but ingest pipelines should assume that over time there will be a file available each minute.
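One way to sketch the per-minute fetch described above is shown below. It makes two assumptions that are not confirmed anywhere in this post: that a minute with no file answers with HTTP 404, and that each file is a single gzip stream of UTF-8 JSONNL. The names `target_minute`, `fetch_articles` and `poll_once` are our own, and the `opener` parameter exists only so the download step can be swapped out for testing.

```python
import gzip
import json
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone

BASE_URL = "http://data.gdeltproject.org/gdeltv3/gqg/"

def target_minute(now, lag_minutes=5):
    """Return the UTC minute `lag_minutes` before `now`, with seconds zeroed."""
    lagged = now.astimezone(timezone.utc) - timedelta(minutes=lag_minutes)
    return lagged.replace(second=0, microsecond=0)

def fetch_articles(minute, opener=urllib.request.urlopen):
    """Download and parse one per-minute file.

    Returns a list of article dicts, or None when no file exists for that
    minute (assumed here to surface as HTTP 404), which is expected during
    the quiet part of the 15-minute cycle.
    """
    url = BASE_URL + minute.strftime("%Y%m%d%H%M%S") + ".gqg.json.gz"
    try:
        with opener(url, timeout=30) as resp:
            raw = resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # other HTTP errors are real failures; let the caller decide
    text = gzip.decompress(raw).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def poll_once(now=None):
    """Intended to be run once a minute, e.g. from cron."""
    now = now or datetime.now(timezone.utc)
    minute = target_minute(now)
    articles = fetch_articles(minute)
    if articles is None:
        print(minute.isoformat(), "no file for this minute")
    else:
        print(minute.isoformat(), len(articles), "articles")
    return articles
```

Treating a missing file as a normal, non-error result keeps the pipeline quiet during the gaps between batches, while still letting genuine failures (timeouts, server errors) propagate.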