This past July we debuted the Global Relationship Graph (GRG), a new experimental initiative to codify the factual claims and relationships made in the global press each day. What understanding of the world would one come away with after reading the world's news each day? Most importantly, if we could codify news media with sufficient accuracy and resolution, could we autonomously identify contested narratives and tie news coverage more closely to related fact checks?
In July we released two small experimental datasets based on Google's Natural Language API, one using verb-centered ngrams and the other walking the dependency graphs of each sentence. Today we are excited to announce a third pilot dataset, created using Open IE 5.1 by the University of Washington and Indian Institute of Technology, Delhi.
Open IE attempts to codify the relations expressed in a given text, converting a sentence into a series of claims and relationships inferred from or explicitly stated in it. To explore how it might be used to understand global news data, we processed a small random sample of around 3,000 English-language online news articles from October 4, 2020, selecting a Sunday to ensure a mixture of breaking stories and retrospective news coverage on a wide range of topics. All sentences in each article were processed, yielding a total of 201,663 relationships. The precompiled jar and model files were downloaded directly from the Open IE 5.1 GitHub page.
We've reprocessed the Open IE 5.1 output in the following ways to make it simpler to work with and to collapse claims shared across articles:
- The values for Arg1, Arg2 (Arg2s was renamed as Arg2), Rel and Context were truncated to a maximum of 8 words each. Arg1 was truncated from the left, keeping its final 8 words, since the end of that argument is the most important part, while the other fields were truncated from the right, keeping their first 8 words.
- All English-language online articles monitored by GDELT in a given minute were compiled and a random sample of 10 was selected. Since GDELT currently operates on a 15-minute heartbeat, in practice the articles monitored in a given 15 minutes are spread over the following 3 minutes of processing.
- Rather than the article-centric output of Open IE, the output here is relation-centric to make it easier to trace shared claims. Within a given minute of monitored articles, all unique relations are presented as a single row, with a list of the articles referencing that relation listed in an array at the end.
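The truncation and regrouping steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual GDELT pipeline: the input keys (`minute`, `url`, `title`), the function names, and the decision to treat two extractions as the same relation when their truncated arg1, rel, arg2 and context all match are assumptions made for the example.

```python
from collections import defaultdict

MAX_WORDS = 8  # per-field word cap described in the list above


def keep_last_words(text, limit=MAX_WORDS):
    """Left-truncate: keep only the final `limit` words (used for arg1)."""
    return " ".join(text.split()[-limit:])


def keep_first_words(text, limit=MAX_WORDS):
    """Right-truncate: keep only the first `limit` words (arg2, rel, context)."""
    return " ".join(text.split()[:limit])


def group_relations(extractions):
    """Collapse per-article extractions into one row per unique relation.

    `extractions` is an iterable of dicts with hypothetical keys:
    minute, arg1, rel, arg2, context, url, title.
    Assumption: uniqueness is decided on the truncated text fields only.
    """
    grouped = defaultdict(list)
    for ex in extractions:
        key = (
            ex["minute"],
            keep_last_words(ex["arg1"]),
            keep_first_words(ex["rel"]),
            keep_first_words(ex["arg2"]),
            keep_first_words(ex.get("context", "")),
        )
        article = {"url": ex["url"], "title": ex["title"]}
        if article not in grouped[key]:  # list each article once per relation
            grouped[key].append(article)

    rows = []
    for (minute, arg1, rel, arg2, context), articles in grouped.items():
        rows.append({
            "date": minute,
            "arg1": arg1,
            "rel": rel,
            "arg2": arg2,
            "context": context,
            "urls": articles,
        })
    return rows
```

Keying on the truncated strings means that two long arguments differing only beyond the 8-word window would collapse into one row, which is one plausible reading of how truncation helps surface claims shared across articles.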
The final file is UTF-8 newline-delimited JSON in the following format, with each row being a unique claim found in that given minute of monitoring:
- date. The date GDELT saw the article, rounded to the nearest minute.
- lang. The language of the article. For this collection all rows will have the value of "ENGLISH".
- arg1. This value is passed from Open IE unchanged apart from being left-truncated to its final 8 words.
- arg2. This value is passed from Open IE unchanged apart from being right-truncated to its first 8 words.
- rel. This value is passed from Open IE unchanged apart from being right-truncated to its first 8 words.
- context. This value is passed from Open IE unchanged apart from being right-truncated to its first 8 words.
- confidence. This value is passed as-is from Open IE and is a floating point number.
- negated. This value is taken from Open IE, converted from true/false to 1/0.
- passive. This value is taken from Open IE, converted from true/false to 1/0.
- urls. A JSON array of all of the articles GDELT found within that given minute that contained this relation. Each entry has the following fields:
  - url. The URL of the article.
  - title. The title of the article.
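As a rough illustration of working with this format, the sketch below parses newline-delimited JSON rows and lists the relations that appear in more than one article. The field names come from the list above, but the sample rows, their date format, the example.com URLs and the function names are invented for the example and are not taken from the dataset.

```python
import json


def read_relations(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def shared_relations(records, min_articles=2, min_confidence=0.0):
    """Return (arg1, rel, arg2, article_count) for relations found in
    at least `min_articles` articles with at least `min_confidence`."""
    shared = []
    for rec in records:
        articles = rec.get("urls", [])
        confidence = float(rec.get("confidence", 0.0))
        if len(articles) >= min_articles and confidence >= min_confidence:
            shared.append((rec["arg1"], rec["rel"], rec["arg2"], len(articles)))
    return shared


# Invented sample rows for illustration only; real values and the exact
# date format in the dataset may differ.
SAMPLE_LINES = [
    json.dumps({
        "date": "2020-10-04T12:01:00Z", "lang": "ENGLISH",
        "arg1": "the governor", "rel": "announced",
        "arg2": "new restrictions", "context": "",
        "confidence": 0.91, "negated": 0, "passive": 0,
        "urls": [
            {"url": "https://example.com/a", "title": "Story A"},
            {"url": "https://example.com/b", "title": "Story B"},
        ],
    }),
    json.dumps({
        "date": "2020-10-04T12:01:00Z", "lang": "ENGLISH",
        "arg1": "the team", "rel": "won",
        "arg2": "the match", "context": "",
        "confidence": 0.55, "negated": 0, "passive": 0,
        "urls": [{"url": "https://example.com/c", "title": "Story C"}],
    }),
]

records = list(read_relations(SAMPLE_LINES))
multi_article = shared_relations(records)
```

To run this against the downloaded file, the same `read_relations` generator can be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at the uncompressed dataset.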
You can download the entire dataset below:
- GRG Open IE 5.1 Pilot Dataset. (8.3MB compressed / 85MB uncompressed)
Remember that these results are 100% automated and represent machine codification of news coverage without any human intervention. As with all globally sourced news coverage, claims therein may be true, false, contested or unknown – the goal of this dataset is merely to codify the news as it stands to make such deeper analysis possible.
We're tremendously excited to see what you're able to do with this new experimental dataset!