VGKG & GEN4: Massive Threading Versus Massive Process Parallelism

GDELT's original GEN1 crawling infrastructure relied heavily on process parallelism, with each machine running a set of processes overseen by an orchestration process and designed to minimize interprocess communication (IPC). The GEN2 architecture moved to a hybrid process-plus-threads environment, in which a small number of processes used massive numbers of threads. The GEN3 architecture eliminated multiprocess operation altogether, moving to a single process per crawler that managed hundreds or even thousands of threads per system, with the thread count adjusted in realtime based on performance.

While modern threads are exceptionally efficient, myriad libraries are still not thread-safe or use their own threading models. Even with highly efficient workarounds for these issues, GDELT's unique pipeline, which distributes computationally intensive processing tasks such as image filtering and document extraction to its crawler fleet, consistently posed challenges to this threading model. Moreover, the myriad oddities of the open web meant that previously unknown edge cases in critical third-party libraries could trigger a segfault that took down the parent process and all of its threads, making strong isolation difficult to achieve. Even with kernel adjustments and extreme code optimization, the sheer volume of data arriving at each node meant that if a single operation in a thread took longer than expected, that thread could not read data out of the kernel buffers at its expected rate. Incoming data could then be truncated, or connections could suffer unexplained networking issues in unusual places and ways that made diagnostics extremely difficult.
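The isolation problem can be reproduced in miniature. The sketch below is an illustration only, not GDELT's code: `THREADED_CRAWLER` and `run_threaded_crawler` are hypothetical names, and the segfault is simulated by having a worker thread send SIGSEGV to its own process (POSIX only). Because the fault is delivered to the process as a whole, the main thread never gets to finish its unrelated work.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a threaded crawler process: a worker thread hits
# a fault (simulated by sending SIGSEGV to the process) while the main
# thread still has work left to do.
THREADED_CRAWLER = """
import os, signal, threading, time

def faulty_worker():
    os.kill(os.getpid(), signal.SIGSEGV)  # simulated library segfault

threading.Thread(target=faulty_worker).start()
time.sleep(1.0)                 # unrelated work on the main thread
print("main thread finished")   # never reached
"""


def run_threaded_crawler():
    """Run the threaded stand-in and report how the whole process ended."""
    proc = subprocess.run(
        [sys.executable, "-c", THREADED_CRAWLER],
        capture_output=True, text=True, timeout=30,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    returncode, output = run_threaded_crawler()
    print("exit status:", returncode, "output:", repr(output))
```

On POSIX systems the reported exit status is the negated signal number, and the output is empty because the fault in one thread ended every thread in the process.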

GDELT's GEN4 architecture is therefore moving back to a process model. In place of GEN1's small number of processes, however, GEN4 uses massive process parallelism, with creative IO/CPU task interleaving and IO-oriented pipelining to mitigate the greater overhead of process-centric computation. Critical code sections are now triple-isolated, allowing pipelines to jump out of or even dynamically work around faults, while execution pipelines are designed to let the kernel schedule processes as efficiently as possible, in thread-like fashion. The end result is an extremely robust and flexible crawling architecture that can more fully leverage every available cycle on the system and work around edge cases. It can also use kernel scheduling and prioritization features to avoid the networking issues that plagued our threaded models under overload conditions, by reordering, throttling and even stalling individual tasks at a system-wide level across all tasks being performed, including networking tasks. Most intriguingly, despite the much greater overhead of process parallelism compared with thread parallelism, as the VGKG has been transitioning to the new GEN4 architecture we are seeing dramatic performance improvements on identical underlying hardware in the same geographic location, suggesting that these efficiencies more than compensate for the architectural inefficiencies of sovereign processes and IPC.
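As a rough illustration of how process isolation lets a pipeline work around a fault, the sketch below runs each task in its own interpreter process and records a crash as an ordinary result instead of letting it take down the parent. It shows a single layer of isolation only and makes no claim about how GEN4's triple isolation is built. All names (`OK_TASK`, `CRASH_TASK`, `run_isolated`, `run_pipeline`) are hypothetical, and the segfault is simulated with SIGSEGV on a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical task bodies: one succeeds, one simulates a segfault in a
# third-party library by sending SIGSEGV to itself (POSIX only).
OK_TASK = "print('parsed')"
CRASH_TASK = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_isolated(code, timeout=30.0):
    """Run `code` in a separate interpreter process.

    Returns (ok, detail): the task's stdout on success, or a description
    of the fault otherwise. A crash in the child never reaches the parent.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False, "timeout"
    if proc.returncode < 0:
        # A negative return code means the child was killed by a signal.
        return False, "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode != 0:
        return False, "exit status %d" % proc.returncode
    return True, proc.stdout.strip()


def run_pipeline(tasks):
    """Run every task in isolation; a faulting task is recorded and skipped."""
    return [run_isolated(code) for code in tasks]


if __name__ == "__main__":
    for ok, detail in run_pipeline([OK_TASK, CRASH_TASK, OK_TASK]):
        print(ok, detail)
```

The task after the crashing one still runs to completion, which is the property the threaded model could not guarantee. This sketch runs tasks one at a time for clarity; a production design would run many such processes concurrently and would pay the process startup cost far less often.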