GEN4: From Aggressive Inlining To Process Isolation

GDELT's GEN1 through GEN3 architectures optimized for performance, aggressively inlining all of the processing of a given crawler into a single process image to minimize memory use and maximize computational efficiency. This extremely aggressive optimization allowed us to extract considerable execution performance from the austere hardware environment of our frontline crawlers, but it came at a dangerous cost: a single error could take down an entire crawler pipeline.

The global internet is an unpredictable and unforgiving place that can trigger edge cases in even the most robust and widely used libraries. An unusual networking response, an illegal Unicode sequence, a particular kind of image corruption, a unique change sequence in an article, or any of myriad other oddities could trigger a fault that would take down an entire pipeline, given how tightly the components of each crawler were intertwined. In turn, this could cause downstream problems that made it difficult for the node-level hypervisors to make sufficient sense of the state of a given VM to adequately recover it. In addition, a number of major libraries still rely on unsafe signaling behaviors to manage timeout and resource constraint conditions, in ways that make them prone to failure.
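To illustrate the signaling hazard, many libraries bound slow operations with SIGALRM. The following is a minimal, hypothetical Python sketch (Python is assumed here; GDELT has not published its code, and all names are illustrative). The alarm handler can fire at an arbitrary point inside non-reentrant code, and it only works in the main thread on Unix, which is part of why such timeouts are fragile:

```python
import signal

class Timeout(Exception):
    """Raised when the alarm fires before the wrapped call returns."""
    pass

def _on_alarm(signum, frame):
    # The handler may interrupt the wrapped call at ANY bytecode
    # boundary, including inside non-reentrant library code.
    raise Timeout()

def run_with_alarm(func, seconds):
    """Bound func() with SIGALRM (Unix-only, main thread only)."""
    old = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)                    # cancel any pending alarm
        signal.signal(signal.SIGALRM, old) # restore the prior handler
```

If `func` is midway through updating shared state or holding a lock when the signal arrives, the resulting partial state is exactly the kind of fault that could cascade through an inlined pipeline.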

In place of this inlining, the GEN4 architecture is built on an explicit process isolation principle: every core computational or networking task is isolated into its own standalone process, with boundaries carefully drawn to balance efficiency against isolation. While this was expected to impose a significant latency cost, the impact on node-level throughput has been surprisingly modest, since the kernel can more efficiently interleave compute- and IO-intensive tasks across the separate processes.
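GDELT has not published the GEN4 implementation, but the isolation principle can be sketched in Python (an assumed language; `run_isolated`, its parameters, and the fork-based approach are all hypothetical). Running each task in a child process means that even a hard crash in a native library is contained to that process:

```python
import multiprocessing as mp

# Use the fork context so closures can cross into the child without
# pickling (Unix-only; a real system would pin this deliberately).
_ctx = mp.get_context("fork")

def run_isolated(task, arg, timeout=30):
    """Run task(arg) in a child process; return a (status, result) pair."""
    q = _ctx.Queue()

    def _wrapper():
        try:
            q.put(("ok", task(arg)))
        except Exception as e:        # ordinary fault: report it and exit cleanly
            q.put(("error", repr(e)))

    p = _ctx.Process(target=_wrapper)
    p.start()
    p.join(timeout)
    if p.is_alive():                  # hung task: kill it; the pipeline survives
        p.terminate()
        p.join()
        return ("timeout", None)
    try:
        return q.get(timeout=1)       # brief grace period for the pipe to flush
    except Exception:
        return ("crashed", None)      # child died before reporting (e.g. a segfault)
```

The key property is that every failure mode, including an outright process death, collapses to a status value the caller can act on, rather than an exception or signal propagating through the whole crawler.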

Most importantly, it means that a fault in one of the core routines can be readily worked around by the pipeline itself or by the hypervisor as needed: the failed task is either retried or simply removed from the queue, depending on its status, yielding a far more resilient crawler architecture.
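The retry-or-remove policy might look something like the following sketch (Python assumed; `drain`, `process`, and `MAX_ATTEMPTS` are hypothetical names, not GDELT's actual API):

```python
from collections import deque

MAX_ATTEMPTS = 3  # illustrative cap before an item is dropped

def drain(queue, process):
    """Drain a work queue of (item, attempts) pairs.

    process(item) returns True on success; a False return or an
    exception counts as a failure. Failed items are re-queued until
    MAX_ATTEMPTS, then dropped so one bad document cannot wedge the
    pipeline. Returns the list of successfully processed items.
    """
    done = []
    while queue:
        item, attempts = queue.popleft()
        try:
            ok = process(item)
        except Exception:
            ok = False
        if ok:
            done.append(item)
        elif attempts + 1 < MAX_ATTEMPTS:
            queue.append((item, attempts + 1))  # transient fault: retry later
        # else: permanently failing item is dropped (a real crawler
        # would log it and surface it to the hypervisor)
    return done
```

Combined with per-task process isolation, this turns any single bad input into, at worst, a bounded number of retries followed by removal, rather than a pipeline outage.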