How GDELT 3.0's Crawler Architecture Uses Realtime Global Network Monitoring To Learn And Adapt Its Global URL Routing

Historically, GDELT's crawlers have addressed transient errors simply by retrying a URL after a short pause or redirecting it to a nearby peer crawler. However, a rise in non-transient errors, such as the HTTP 451 geofencing of EU member countries by a growing number of smaller US news outlets, coupled with increasingly interventionist national governments across the world speaking more openly about geographic content restrictions, has prompted a rethinking of this approach.

Rather than treating all crawlers as direct peers and redirecting a failed URL to another crawler within a zone, region or metaregion while ignoring the geographic location of that crawler and its IP range, we need a new routing system that allows these factors to be considered. Under our previous crawler architectures, all crawlers worldwide were treated as equal candidates for redistributing a failed URL: if we received a transient or unexpected error from a server even after retrying it, we would redistribute the URL to any available crawler anywhere in the world, including one within the same geographic region or IP range.
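
To make the contrast concrete, here is a minimal sketch in Python of what geography- and IP-aware retry selection could look like, assuming a simple in-memory model of the fleet; the Crawler class, the pick_retry_crawler function and all field names are hypothetical illustrations rather than GDELT's actual implementation.

```python
# Hypothetical sketch: choose a retry crawler outside the failing crawler's
# country and outside any IP range implicated in the failure, instead of
# treating every crawler worldwide as an equal candidate.
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
import random

@dataclass
class Crawler:
    crawler_id: str
    country: str   # ISO country code of the crawler's datacenter (assumed field)
    ip: str        # public IP the crawler fetches from (assumed field)

def pick_retry_crawler(failed_crawler: Crawler, fleet: list[Crawler],
                       avoid_networks: list[str]) -> Crawler | None:
    """Return a candidate outside the failed crawler's country and outside
    every IP range listed in avoid_networks, or None if none qualifies."""
    blocked = [ip_network(n) for n in avoid_networks]
    candidates = [
        c for c in fleet
        if c.country != failed_crawler.country
        and not any(ip_address(c.ip) in net for net in blocked)
    ]
    return random.choice(candidates) if candidates else None

fleet = [
    Crawler("us-east-1", "US", "203.0.113.10"),
    Crawler("us-west-2", "US", "203.0.113.45"),
    Crawler("eu-frankfurt-1", "DE", "198.51.100.7"),
    Crawler("ap-tokyo-1", "JP", "192.0.2.99"),
]
retry = pick_retry_crawler(fleet[0], fleet, avoid_networks=["203.0.113.0/24"])
print(retry.crawler_id if retry else "no eligible crawler")
```

Under the previous design, the candidate list above would simply have been the entire fleet; the filtering step is the new ingredient this post describes.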

Using the Global Difference Graph as a testbed, we've been experimenting with modifications to the GDELT 3.0 crawler architecture that allow the URL routing fabric to intelligently redirect and even proactively divert URLs by learning over time about each outlet. A set of learning and discovery systems now monitors the live auditing control log flows from our worldwide crawler fleet, examining each failed request, classifying the circumstances around the failure, and developing both a short-term mitigation for that specific URL and a long-term mitigation for its site.
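
As an illustration of the kind of classification step such a system performs, the sketch below maps a single hypothetical audit-log record to a short-term mitigation for the URL and a long-term mitigation for the site; the record fields, status-code groupings and mitigation labels are assumptions made for this example, not GDELT's actual log schema.

```python
# Hypothetical sketch: classify one failed request from the audit log and
# emit a short-term (URL-level) and long-term (site-level) mitigation.
from urllib.parse import urlparse

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}   # assumed transient failures
GEOFENCE_STATUSES = {451}                        # assumed non-transient, geography-driven

def classify_failure(record: dict) -> dict:
    """record: one audit-log entry, e.g. {"url": ..., "status": ...,
    "crawler_country": ..., "crawler_network": ...} (assumed fields)."""
    site = urlparse(record["url"]).netloc
    status = record["status"]
    if status in GEOFENCE_STATUSES:
        return {
            "site": site,
            "kind": "geofence",
            "short_term": {"retry_url_outside_country": record["crawler_country"]},
            "long_term": {"exclude_country_for_site": record["crawler_country"]},
        }
    if status in TRANSIENT_STATUSES:
        return {
            "site": site,
            "kind": "transient",
            "short_term": {"retry_url_outside_network": record["crawler_network"]},
            "long_term": {"watch_site_for_repeats": True},
        }
    return {"site": site, "kind": "unexpected",
            "short_term": {"retry_url_anywhere": True},
            "long_term": {"watch_site_for_repeats": True}}

print(classify_failure({"url": "https://example-outlet.com/story.html",
                        "status": 451, "crawler_country": "DE",
                        "crawler_network": "198.51.100.0/24"}))
```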

Transient technical errors are redistributed based on the network layout of our crawler fleets at that moment, moving the URL to a crawler outside of the affected IP range entirely. Repeated transient technical errors for a given site within a particular IP range, zone, region or metaregion will result in that site being placed on an internal routing exclusion list that diverts all future requests for that site to crawlers outside of those ranges and locations. Over time, URLs from a given site are systematically randomized across countries, regions and IP ranges to determine whether there are consistent factors behind such errors, allowing us to better characterize non-obvious targeting behaviors or technical faults.
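
A minimal sketch of the exclusion-list bookkeeping described above, assuming a simple repeat-count threshold; the ExclusionList class, the threshold value and the (site, scope) encoding are hypothetical choices made only for illustration.

```python
# Hypothetical sketch: repeated transient errors for a site from the same
# IP range / zone / region place that (site, scope) pair on an exclusion
# list so future requests are routed elsewhere.
from collections import defaultdict

REPEAT_THRESHOLD = 3   # assumed number of repeats before excluding a scope

class ExclusionList:
    def __init__(self):
        self._failures = defaultdict(int)   # (site, scope) -> repeat count
        self._excluded = set()              # (site, scope) pairs to avoid

    def record_transient_failure(self, site: str, scope: str) -> None:
        """scope: the IP range, zone, region or metaregion of the failing crawler."""
        key = (site, scope)
        self._failures[key] += 1
        if self._failures[key] >= REPEAT_THRESHOLD:
            self._excluded.add(key)

    def is_excluded(self, site: str, scope: str) -> bool:
        return (site, scope) in self._excluded

exclusions = ExclusionList()
for _ in range(3):
    exclusions.record_transient_failure("example-outlet.com", "203.0.113.0/24")
print(exclusions.is_excluded("example-outlet.com", "203.0.113.0/24"))   # True
```

A production system would presumably also decay these counts over time so that a briefly flaky range is not excluded forever, and would feed the randomized country/region/IP-range probing described above from the same records.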

Similarly, non-transient errors, such as consistent HTTP 451 errors returned to crawlers in a given country or geographic region, will result in our internal routing tables being updated to route future requests for those sites exclusively to crawlers outside of that geographic area. In this way, as new sites implement geofencing or other technical interventions in the future, our routing tables will automatically update themselves in realtime without any human action required.
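
The sketch below illustrates how such a self-updating routing table might behave, assuming a per-site set of blocked areas; the GeoRoutingTable class and its methods are hypothetical and stand in for whatever the production routing fabric actually uses.

```python
# Hypothetical sketch: once a site consistently returns HTTP 451 to crawlers
# in a country or region, future requests for that site are routed only to
# crawlers outside that area, with no human action required.

class GeoRoutingTable:
    def __init__(self):
        self._blocked_areas = {}   # site -> set of country/region codes to avoid

    def record_geofence(self, site: str, area: str) -> None:
        """Called when a site returns HTTP 451 to crawlers located in `area`."""
        self._blocked_areas.setdefault(site, set()).add(area)

    def eligible_crawlers(self, site: str, fleet: list[tuple[str, str]]) -> list[str]:
        """fleet: list of (crawler_id, area) pairs. Returns the crawlers
        outside every area the site is known to geofence."""
        blocked = self._blocked_areas.get(site, set())
        return [cid for cid, area in fleet if area not in blocked]

table = GeoRoutingTable()
table.record_geofence("example-outlet.com", "EU")
fleet = [("eu-frankfurt-1", "EU"), ("us-east-1", "US"), ("ap-tokyo-1", "JP")]
print(table.eligible_crawlers("example-outlet.com", fleet))   # ['us-east-1', 'ap-tokyo-1']
```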

Preliminary tests of a prototype of this architecture are showing considerable promise and we hope to share more on these approaches shortly!