This past July we debuted the Global Relationship Graph (GRG), an experimental new initiative to codify the factual claims and relationships made in the global press each day. Reading the world's news, what understanding of the world would one be expected to come away with? Most importantly, if we could codify news media with sufficient accuracy and resolution, could we autonomously identify contested narratives and tie news coverage more closely to related fact checks?
In July we released two small experimental datasets based on Google's Natural Language API, one using verb-centered ngrams and the other walking the dependency graphs of each sentence. Earlier this week we released a third small pilot dataset, created using Open IE 5.1 from the University of Washington and the Indian Institute of Technology, Delhi.
Today we are enormously excited to announce our first live GRG pilot: realtime verb-centered ngrams, updated every minute from the previous minute's monitored coverage!
Each English-language article is part-of-speech tagged (unlike the earlier pilots, here we use a classical HMM tagger for speed, so it may misclassify some edge cases). Each verb phrase in an article is then identified, and up to 6 tokens before it (the "pre" snippet) and after it (the "post" snippet) are compiled into a verb-centered clause. Clauses are truncated to fewer than 6 tokens if another verb or any form of punctuation appears first.
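To make the mechanics concrete, here is a minimal sketch of the extraction step in Python. Our production pipeline uses its own HMM tagger; the sketch substitutes NLTK's off-the-shelf tagger, so the two will disagree on some sentences, and the handling of multi-word verb phrases is simplified.

```python
# Minimal sketch of the clause extraction step. GDELT's pipeline uses a
# classical HMM tagger; here we substitute NLTK's off-the-shelf perceptron
# tagger, so the two will disagree on some sentences.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

MAX_CONTEXT = 6  # up to 6 tokens on each side of the verb phrase

def is_verb(tag):
    return tag.startswith("VB")

def is_punct(token):
    # Treat any token with no alphanumeric characters as punctuation.
    return not any(c.isalnum() for c in token)

def gather(tagged, indices):
    """Collect up to MAX_CONTEXT tokens, stopping at a verb or punctuation."""
    out = []
    for i in indices:
        token, tag = tagged[i]
        if is_verb(tag) or is_punct(token):
            break
        out.append(token)
        if len(out) == MAX_CONTEXT:
            break
    return out

def verb_centered_clauses(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    clauses, i = [], 0
    while i < len(tagged):
        if not is_verb(tagged[i][1]):
            i += 1
            continue
        # Merge runs of adjacent verb tokens into one verb phrase. (The
        # production system also spans words like "also" inside phrases
        # such as "have also affected"; this sketch does not.)
        j = i
        while j + 1 < len(tagged) and is_verb(tagged[j + 1][1]):
            j += 1
        pre = list(reversed(gather(tagged, range(i - 1, -1, -1))))
        post = gather(tagged, range(j + 1, len(tagged)))
        verb = " ".join(tok for tok, _ in tagged[i : j + 1])
        clauses.append({"pre": " ".join(pre), "verb": verb, "post": " ".join(post)})
        i = j + 1
    return clauses

print(verb_centered_clauses(
    "COVID-19 cases in the country approached 8 million this week."))
```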
The result is a list of verb phrases and their immediate context for each article, essentially forming up to a 12-gram (though typically shorter) centered on each verb phrase. Unlike traditional ngrams, these 12-grams are generated only around verbs: they are not a sliding window but a fixed context around each verb phrase, capturing the statements of action reported in the article. Each unique "pre-verb-post" clause found in articles published during a given minute becomes a record containing a list of all of the articles published that minute that contained that clause. The goal of this dataset is to enhance the traditional ngram modality by centering it exclusively on verb phrases, making it more semantically meaningful.
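Conceptually, assembling the per-minute records is just a grouping step. The sketch below shows one way it might be folded together; `articles` here is a hypothetical list of dicts, not part of the dataset itself.

```python
# Sketch of the per-minute grouping step: identical "pre-verb-post" clauses
# are folded into one record carrying every article that contained them.
# `articles` is a hypothetical list of dicts with "url", "title", and the
# "clauses" produced by an extraction step like the one sketched above.
from collections import defaultdict

def build_minute_records(date, articles):
    grouped = defaultdict(list)
    for article in articles:
        for clause in article["clauses"]:
            key = (clause["pre"], clause["verb"], clause["post"])
            grouped[key].append({"url": article["url"], "title": article["title"]})
    return [
        {"date": date, "pre": pre, "verb": verb, "post": post, "urls": urls}
        for (pre, verb, post), urls in grouped.items()
    ]
```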
For example, here are a few Covid-related examples drawn from today:
| Date | Pre | Verb | Post | URL | Title |
|------|-----|------|------|-----|-------|
| 2020-10-28 13:06:00 UTC | because of coronavirus | have also affected | the free movement of goods | https://www.bbc.co.uk/news/world-africa-54058640 | World Trade Organization: How an African head could make a difference |
| 2020-10-28 13:06:00 UTC | of Covid-19 cases in the country | are just | short of the 80 lakh-mark | http://www.daijiworld.com/news/newsDisplay.aspx?newsID=765676 | Aarogya Setu's active users plummet due to pandemic fatigue: MyGov CEO |
| 2020-10-28 13:06:00 UTC | COVID-19 cases in the country | approached | 8 million | https://www.bostonglobe.com/2020/10/28/business/global-stocks-slide-amid-worries-over-virus-cases-us-aid/ | Global stocks slide amid worries over virus cases, US aid |
| 2020-10-28 13:06:00 UTC | green economic recovery from COVID-19 and | help | the country reach its net zero | https://www.energyglobal.com/solar/28102020/edf-renewables-to-develop-uk-solar-farm/ | EDF Renewables to develop UK solar farm |
Each unique verb-centered ngram within a given minute appears on its own row, with an array of the URLs that contained that ngram, so statements that appeared across multiple articles seen by GDELT in a given minute are grouped together. This also makes it trivial to look across larger time horizons such as hours, days, and eventually weeks and months to identify claims that are widely covered and/or appear over long periods versus those that receive little coverage or only brief bursts of coverage. It likewise makes it possible to rapidly identify high-velocity claims that are quickly going viral.
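As a first pass at that kind of analysis, a sketch like the following could scan an hour's worth of locally downloaded minute files and rank clauses by total article count. The directory layout and the ranking heuristic are ours for illustration, not part of the dataset.

```python
# Sketch: scan locally downloaded per-minute GRG files and rank clauses by
# total article count, a crude first-pass proxy for claim velocity.
import gzip
import json
from collections import Counter
from pathlib import Path

def rank_claims(directory, top=20):
    counts = Counter()
    for path in sorted(Path(directory).glob("*.grg.vcn.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                key = (record["pre"], record["verb"], record["post"])
                counts[key] += len(record["urls"])
    return counts.most_common(top)
```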
The final UTF-8 newline-delimited JSON file format is as follows, with each row being a verb-centered ngram:
- date. The date GDELT saw the article, rounded to the nearest minute.
- pre. Up to 6 tokens immediately preceding the verb phrase. In practice this will typically be fewer than 6 tokens.
- verb. The verb phrase.
- post. Up to 6 tokens immediately following the verb phrase. In practice this will typically be fewer than 6 tokens.
- urls. A JSON array of all of the articles GDELT found within that given minute that contained this ngram. Each entry is an object with two fields:
  - url. The URL of the article.
  - title. The title of the article.
You can download the dataset directly as per-minute UTF-8 JSON-NL files, with "20201027110300" being the earliest file:
- http://data.gdeltproject.org/gdeltv3/grg/YYYYMMDDHHMMSS.grg.vcn.json.gz
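For example, here is a minimal standard-library sketch that fetches and parses a single minute's file using the URL template above:

```python
# Sketch: fetch and parse one per-minute file using only the standard
# library. Timestamps follow the YYYYMMDDHHMMSS pattern in the URL above.
import gzip
import json
import urllib.request

def fetch_minute(timestamp):
    url = f"http://data.gdeltproject.org/gdeltv3/grg/{timestamp}.grg.vcn.json.gz"
    with urllib.request.urlopen(url) as resp:
        raw = gzip.decompress(resp.read())
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line]

for record in fetch_minute("20201027110300")[:3]:  # the earliest file
    print(record["date"], "|", record["pre"], "|", record["verb"], "|", record["post"])
```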
Remember that since GDELT currently operates on a 15-minute heartbeat, most articles are spread over a 4-5 minute period after each quarter-hour; this will even out as GDELT 3.0 launches.
The dataset is also available as a BigQuery table.
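As a sketch, querying it from Python might look like the following. The table name below is a placeholder rather than the actual table identifier, and the schema is assumed to mirror the file format described above.

```python
# Sketch of querying from Python with the google-cloud-bigquery client.
# The table name is a placeholder (substitute the actual GRG table), and
# the schema is assumed to mirror the file format described above.
from google.cloud import bigquery

TABLE = "my-project.my_dataset.grg_vcn"  # hypothetical table name

client = bigquery.Client()
query = f"""
    SELECT pre, verb, post, COUNT(*) AS minutes_seen
    FROM `{TABLE}`
    WHERE date >= TIMESTAMP("2020-10-28")
    GROUP BY pre, verb, post
    ORDER BY minutes_seen DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(row.minutes_seen, row.pre, row.verb, row.post, sep=" | ")
```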
Note that this is a pilot dataset, meaning we may actively change it moving forward based on feedback: adding/removing fields, changing its core semantics, or switching to a different kind of relationship extraction. Please let us know how you use it so we can keep you updated and gather feedback as we evaluate potential changes.
We're tremendously excited to see what you're able to do with this incredible dataset!