Understanding False Positives in the New GDG

The new GDELT Global Difference Graph (GDG) announced this morning is an immensely powerful platform that we hope will profoundly reshape how we understand the fluidity of the online news sphere through which we see the world around us.

This initial GDG release is an alpha-grade system, meaning among other things that it has an elevated false positive rate when detecting textual changes in the body text of articles. One might ask why we don't simply use hand-built templates constructed for each individual news outlet we monitor. The simple answer is that at the scale at which GDELT operates, covering all countries and 65 languages, hand-built templates simply aren't scalable. Every day new outlets launch around the world and are added to GDELT's monitoring streams, while existing outlets make myriad small changes to their structure. Instead, GDELT uses a fully automated document extraction infrastructure that does its best to estimate the start and end boundaries of article text and eliminate insets and other unrelated text.

Attempting to precisely identify the body text of articles published by a large fraction of online news outlets worldwide, across every country and 65 languages, is an immensely complex and difficult task. We've been building extraction systems for 23 years, since the early days of Mosaic, which has given us many unique insights into building robust, fully automated, global-scale extraction platforms. Even with all of these lessons learned, it is still exquisitely difficult to build systems that can robustly handle the extreme diversity of the entire online landscape.

At the same time, the GDG currently recrawls an average of 2 million articles per day, meaning that even if only the smallest fraction of those have an imperfection in their extraction, the result will still be a measurable number of false positives. This is an important factor: at the sheer scale at which GDELT operates, even the smallest error rate will result in a substantial number of actual error records. As we move towards beta we're working on a number of correction factors that we hope will reduce these false positives even further.
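
To make the scale argument concrete, here is a rough back-of-the-envelope sketch in Python. The 2 million articles per day figure comes from above; the error rates are purely hypothetical values chosen for illustration and are not measured GDG error rates, and the function name is our own.

```python
# Average daily recrawl volume stated in the text above.
ARTICLES_PER_DAY = 2_000_000


def expected_false_positives(error_rate, articles_per_day=ARTICLES_PER_DAY):
    """Expected number of articles per day with a spurious reported change.

    error_rate is the fraction of recrawled articles whose extraction is
    imperfect enough to register as a change (a hypothetical input, not a
    published GDG statistic).
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return round(articles_per_day * error_rate)


# Hypothetical error rates, for illustration only.
for rate in (0.0001, 0.001, 0.01):
    count = expected_false_positives(rate)
    print(f"{rate:.2%} error rate -> {count:,} false positives per day")
```

Under these assumed rates, even an extraction pipeline that is wrong on just one article in ten thousand would still surface around 200 false positives every day.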

For high-precision tasks you should manually verify reported changes. We are actively working with organizations like the Internet Archive to make it possible to readily view and compare multiple versions of a page seamlessly right in your browser to confirm the relevance of reported changes, so stay tuned!