The GDELT Project

GDELT 3.0: GEN 3 Crawlers & Document Extraction Infrastructure

We're incredibly excited to announce today that over the past 24 hours GDELT's entire fleet of front-line article crawlers has been upgraded to a transitional version of the forthcoming GDELT 3.0 (GEN 3) crawling infrastructure. For the moment we've upgraded only the crawlers themselves, leaving the GEN 2 backend queuing, dispatching and horizon management systems in place while we complete the GEN 3 transition of GDELT's massive processing infrastructure. As a result, you will not see a dramatic change in the total daily article count until the full GEN 3 upgrade is complete. However, the new crawlers feature a parallel interconnect into a transitional version of the forthcoming GDELT GEN 3 crawling fabric, which means we are able to begin rolling out a number of new capabilities over the coming weeks in the lead-up to the complete GDELT 3.0 transition, so stay tuned!

The GEN 3 crawling infrastructure also includes a complete reimagining of our document extraction infrastructure, with some incredible new features that will begin rolling out in a few weeks. You may notice a slight decrease in the total number of daily articles processed by GDELT, because the GEN 3 extraction infrastructure is far better at reasoning about the content of a page and can therefore discard contentless pages. You may also notice an increase in content from certain parts of the world, thanks to its unique new capabilities for recovering content even from heavily corrupted HTML.

Stay tuned as GDELT 3.0 finally rolls out over the coming weeks!