GEN4: Advanced Crawling Telemetry

In his opening keynote of the 2012 IIPC General Assembly at the Library of Congress, Kalev emphasized the importance of archiving not just content, but also the telemetry surrounding how and why that content was recorded. To this day, however, many web archives simply record content snapshots and discard the majority of the surrounding networking information describing how, where and why each snapshot was made.

Where did the URL to be archived first enter the queuing system, and what were its eligibility criteria? Which specific crawler captured the snapshot, and what were its operational details, including its physical location and IP address at the time of capture? Knowing the geographic location of the crawler that captured a given page is absolutely critical in an increasingly geofenced web. Knowing the physical colocation of crawlers can help diagnose intermittent hardware failures at the compute, storage and networking layers, while IP assignments can help identify poisoned IPs and their digital blast radius. Capturing even richer details like DNS resolution, connection latency, network speed and system status, from networking-layer issues like packet loss and queue overruns to kernel state, can both help diagnose problems and reconstruct the capture path of a given resource.
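To make these fields concrete, the sketch below shows one way such a per-fetch telemetry record might be structured so it can be preserved alongside the snapshot itself. The field names, types and dataclass layout are purely illustrative assumptions, not GDELT's actual schema.

```python
# Illustrative per-fetch telemetry record covering the fields discussed above.
# All field names and the structure itself are hypothetical, not GDELT's schema.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class FetchTelemetry:
    # Provenance: how the URL entered the queue and why it was eligible.
    url: str
    queue_source: str            # e.g. "sitemap", "frontpage-link", "rss"
    eligibility_rule: str        # which crawl policy admitted the URL

    # Crawler identity and physical context at capture time.
    crawler_id: str
    crawler_region: str          # geographic region of the VM
    crawler_ip: str              # public IP used for the fetch

    # Network-layer observations for this fetch.
    dns_resolved_ip: str
    dns_latency_ms: float
    connect_latency_ms: float
    throughput_kbps: float
    packet_loss_pct: float

    # Host/kernel health snapshot at capture time.
    kernel_load_1m: float
    socket_queue_overruns: int

    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record for storage alongside the archived snapshot."""
        return json.dumps(asdict(self), sort_keys=True)


# Example: a record that would travel with the preserved snapshot.
record = FetchTelemetry(
    url="https://example.com/article",
    queue_source="rss",
    eligibility_rule="news-frontier-v2",
    crawler_id="crawler-eu-west-042",
    crawler_region="europe-west1",
    crawler_ip="203.0.113.17",
    dns_resolved_ip="198.51.100.9",
    dns_latency_ms=12.4,
    connect_latency_ms=38.1,
    throughput_kbps=5120.0,
    packet_loss_pct=0.2,
    kernel_load_1m=0.73,
    socket_queue_overruns=0,
)
print(record.to_json())
```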

Incorporating such telemetry data into the final preserved record of a resource is essential to understanding the fidelity of the snapshot in an increasingly geofenced and dynamic web. Two crawlers fetching the same URL from two different geographic locations can receive very different content. Two crawlers running in the very same VM on the very same physical host can yield wildly different results if the target web server or its active CDN judges their respective IP addresses to have different "interests" or applies A/B testing or content rotation. A hardware fault in the underlying host environment or a colocated resource-intensive VM under the same hypervisor can each have unanticipated impacts on crawling outcomes.

GDELT's GEN1 and GEN2 architectures recorded a wealth of details about each page fetch, but treated that telemetry as passive log data. In contrast, GEN3 brought telemetry to the forefront, actively consuming the live global telemetry feeds of every GDELT crawler worldwide to automatically discover suspected geotargeting behaviors and autonomously establish dynamic routing rules for those websites.
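As a rough illustration of that idea (not GDELT's actual implementation), the sketch below compares content fingerprints reported by crawlers in different regions and, when they diverge persistently, records a routing rule pinning that site to a preferred region. The thresholds, data structures and function names are assumptions chosen for brevity.

```python
# Hypothetical, highly simplified illustration of the GEN3-style idea described
# above: compare content fingerprints reported by crawlers in different regions
# and, when they diverge persistently, emit a dynamic routing rule pinning that
# site to a preferred region. All names and thresholds are illustrative only.
from collections import defaultdict

# site -> region -> observed content fingerprints (e.g. body hashes)
observations: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
routing_rules: dict[str, str] = {}  # site -> region to route future fetches through

MIN_SAMPLES = 20             # never act on just a handful of fetches
DIVERGENCE_THRESHOLD = 0.8   # fraction of regions whose dominant content differs


def ingest(site: str, region: str, body_hash: str) -> None:
    """Consume one event from a crawler's live telemetry feed."""
    observations[site][region].append(body_hash)
    maybe_update_routing(site)


def maybe_update_routing(site: str) -> None:
    """Flag suspected geotargeting and pin the site to a single region."""
    regions = observations[site]
    if len(regions) < 2 or sum(len(v) for v in regions.values()) < MIN_SAMPLES:
        return
    # Crude divergence test: do different regions mostly see different content?
    dominant = {r: max(set(h), key=h.count) for r, h in regions.items()}
    if len(set(dominant.values())) / len(dominant) >= DIVERGENCE_THRESHOLD:
        # Anchor the site to the region with the most observations so far.
        routing_rules[site] = max(regions, key=lambda r: len(regions[r]))


# Simulated feed: two regions persistently report different content for a site.
for _ in range(15):
    ingest("example.com", "us-east", "hash-A")
    ingest("example.com", "eu-west", "hash-B")
print(routing_rules)  # {'example.com': 'us-east'} once enough samples accumulate
```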

GDELT's new GEN4 architecture leverages a wealth of powerful new capabilities and lessons learned to record a rich array of new statistics about each capture, from the host VM's kernel state and network status to vital statistics about the network interactions of the capture itself. These statistics are logged for each page in GDELT's global crawling register and, most importantly, are actively monitored by its global fleet management orchestrators to dynamically adjust how GDELT routes requests across the world.
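The hypothetical sketch below illustrates the general pattern: each capture's telemetry is appended to a register keyed to the page, and a periodic orchestration pass shifts load away from crawlers whose recent network statistics look degraded. The in-memory "register", thresholds and weighting scheme are illustrative assumptions, not GEN4's actual logic.

```python
# Hypothetical sketch of the fleet-orchestration side: each capture's telemetry
# is appended to a per-page register, and a periodic pass shifts load away from
# crawlers whose recent network statistics look degraded. The in-memory
# "register", thresholds and weighting scheme are illustrative assumptions.
from statistics import mean

crawl_register: list[dict] = []        # stand-in for a global crawling register
active_weights: dict[str, float] = {}  # crawler_id -> share of new requests


def log_capture(entry: dict) -> None:
    """Record per-capture telemetry alongside the archived page."""
    crawl_register.append(entry)
    active_weights.setdefault(entry["crawler_id"], 1.0)


def rebalance(window: int = 200) -> None:
    """Periodic orchestrator pass: downweight crawlers showing degradation."""
    recent = crawl_register[-window:]
    by_crawler: dict[str, list[dict]] = {}
    for e in recent:
        by_crawler.setdefault(e["crawler_id"], []).append(e)
    for crawler_id, entries in by_crawler.items():
        avg_loss = mean(e["packet_loss_pct"] for e in entries)
        avg_latency = mean(e["connect_latency_ms"] for e in entries)
        # Crude rule: shed load from crawlers with sustained loss or latency,
        # and slowly restore weight once their statistics recover.
        if avg_loss > 2.0 or avg_latency > 500:
            active_weights[crawler_id] = max(0.1, active_weights[crawler_id] * 0.5)
        else:
            active_weights[crawler_id] = min(1.0, active_weights[crawler_id] * 1.1)


log_capture({"crawler_id": "crawler-us-east-007", "url": "https://example.com/",
             "packet_loss_pct": 0.1, "connect_latency_ms": 42.0})
rebalance()
```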

We are exploring making some of this telemetry data available and would welcome feedback on which kinds of statistics would be most useful.