One of the greatest complexities in building global-scale geographically distributed systems that move vast amounts of data around the world in realtime lies not in CPU, memory or disk management, but in the networking layer. Moving vast amounts of realtime data in and out of large geographically distributed crawler fleets means contending with idiosyncratic kernel behaviors, undocumented side effects and arcane edge cases that seemingly trigger at random and cannot be reproduced under lab conditions. Add to that a mixture of internal and external traffic, massive parallelism and fault tolerance, the ever-present reality of misconfigured, rogue and malicious external servers and the mitigation strategies they require, and the need for robust fault-tolerant internal communication and storage fabrics, and building such systems at scale becomes an incredibly complex and nuanced task.
Modern kernel networking layers and off-the-shelf crawling tools can yield instant crawling clusters that work well at the scale the majority of organizations are likely to confront. Scaling to the levels at which GDELT operates each day, and building systems that are fault tolerant and geographically distributed worldwide, requires entirely different kinds of architectures and can push operating systems in ways that trigger unusual edge case behaviors.
Unlike CPU-intensive applications across many-core SMP systems, which are well-managed by modern kernels, networking lies outside the physical confines of a single machine and involves a vast cast of third parties that can yield unexplained behaviors.
For example, a single-threaded synchronization server whose only job is to keep a local fleet of crawlers coordinated using small messages might peak at just 10% CPU, with 20% instantaneous CPU headroom available to it on its host VM, and require minimal memory. Yet the server might mysteriously truncate receives at random without CPU or networking saturation, with the caller successfully sending the entire stream and the kernel and application networking layers reporting no errors. Conversely, a massive multithreaded server following modern design practices and running on oversized hardware might simply take an unscheduled "timeout" with no obvious root cause, with neither the computational nor the kernel networking layers reporting any problems.
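Truncated receives like these are a reminder that, even when nothing is wrong, a single receive call is never guaranteed to return an entire message. As a minimal illustration (a generic POSIX sketch assuming a blocking connected socket, not our actual server code; the read_exact() helper name is made up here), a receive loop has to accumulate short reads until the expected byte count has arrived:

```c
/* Minimal sketch: read exactly `len` bytes from a connected socket,
 * tolerating short reads and signal interruptions.
 * Returns 0 on success, -1 on error or premature EOF. */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

static int read_exact(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
        if (n > 0) {
            got += (size_t)n;   /* partial read: keep accumulating */
        } else if (n == 0) {
            return -1;          /* peer closed before the full message arrived */
        } else if (errno == EINTR) {
            continue;           /* interrupted by a signal: retry */
        } else {
            return -1;          /* real error, or EAGAIN on a nonblocking
                                   socket (treated as failure in this sketch) */
        }
    }
    return 0;
}
```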
Even the most basic of tasks, like determining whether all data has been received from a socket, becomes considerably more complex at scale. Traditional simple approaches like select() polling that work well in small deployments fail readily at scale, reporting that sockets have no data available to read even when their buffers are saturated and unbuffered sysread() calls are able to continue reading to the end of the data. We see long pauses of multiple seconds with no data even while guard channels report that the sender is streaming steadily with no slowdown requests; writes that fail due to "filled" buffers when the buffers are only at partial capacity; and arcane architectural decisions in the hardware->kernel->library->application sequence of layers that cause strange behaviors not found in any networking manual or web search.
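It is worth remembering what select() actually promises: POSIX only guarantees that a flagged descriptor can be read without blocking, not how much data is waiting or whether a message is complete. One common defensive pattern (a hedged sketch, not our production code; handle_bytes() and the one-second poll interval are placeholders) is to treat readiness as a hint and drain a nonblocking socket until the kernel reports it would block:

```c
/* Sketch: treat select() readiness as a hint, then drain the socket until
 * EAGAIN/EWOULDBLOCK. Assumes `fd` was set O_NONBLOCK and handle_bytes()
 * is the application's own consumer. */
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

void handle_bytes(const char *buf, ssize_t n);   /* application-defined */

int drain_when_ready(int fd)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* 1s poll interval */
    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;                  /* timeout (0) or error (-1) */

    for (;;) {                         /* drain until the kernel would block */
        char buf[65536];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            handle_bytes(buf, n);
        } else if (n == 0) {
            return 0;                  /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return 1;                  /* socket drained for now */
        } else if (errno == EINTR) {
            continue;                  /* interrupted: retry the read */
        } else {
            return -1;                 /* real error */
        }
    }
}
```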
For example, one of our storage fabric interface server designs that handles a light load of small coordination messages used a classic small-message architecture of select() polling coupled with sysread() to read small streaming messages whose length was not known a priori or whose sender might perish before all data was sent. This architecture worked for years until we increased the number of connections to the server by just one extra connection per second. CPU load did not increase by any measurable amount and the various related kernel buffers showed no appreciable change in their fill/drain rates. Yet suddenly select() would report that a socket had no data available for reading, and would continue to report it as unavailable until the socket was eventually closed, even while sysread() continued to return all of the remaining data in the socket as it streamed in, including new data sent after select() began reporting the socket as unreadable. This would happen at random, even under near-zero load, whenever the number of incoming connections had reached the new level at any point since the server started. Yet this is just a simple example of the strange networking complexities encountered at these scales.
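One mitigation consistent with that failure mode (offered purely as an illustration, not necessarily the fix we deployed; consume(), drop_connection() and the 250ms cap are placeholders) is to never let select() be the sole trigger for reads: on every wakeup, including timeouts, opportunistically attempt a nonblocking read on every open connection so a descriptor the poller has stopped flagging still gets drained.

```c
/* Illustrative mitigation: poll with select(), but sweep every connection
 * with a nonblocking read regardless of what select() reported. */
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

void consume(int fd, const char *buf, ssize_t n);   /* application-defined */
void drop_connection(int fd);                       /* application-defined */

void poll_and_sweep(const int fds[], int nfds)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    int maxfd = -1;
    for (int i = 0; i < nfds; i++) {
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv = { .tv_sec = 0, .tv_usec = 250000 };  /* 250ms cap */
    (void)select(maxfd + 1, &rfds, NULL, NULL, &tv);

    /* Sweep every connection, not just the ones select() flagged. */
    for (int i = 0; i < nfds; i++) {
        char buf[16384];
        ssize_t n = read(fds[i], buf, sizeof buf);   /* fds are O_NONBLOCK */
        if (n > 0)
            consume(fds[i], buf, n);
        else if (n == 0)
            drop_connection(fds[i]);                 /* peer finished */
        else if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            drop_connection(fds[i]);                 /* hard error */
    }
}
```

The cost is one extra syscall per idle connection per sweep, which is negligible for a coordination server handling small message volumes.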
At these kinds of scales, computer networking becomes almost a lifeform of its own, exhibiting strange behaviors under precise conditions that are either impossible to recreate under lab conditions for diagnostics or occur only at random even under those conditions. Diagnosing them requires vast reams of instrumentation data tracing control flow through an application over long periods of time, in order to reconstruct internal application state and pinpoint which functions are producing impossible return values that violate their documented constraints. Thankfully, using GCS and BigQuery (not to mention GCP's extensive network diagnostic tools), such instrumentation can simply be left to generate unlimited realtime streams that are trivial to analyze in realtime. But actually understanding how those return values came to be, and what mitigation strategies will prevent the behavior, is rarely obvious, especially when both the documentation and the kernel or application layer source suggest the function cannot actually return those values under those conditions.
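As an illustration of the kind of instrumentation involved (the record shape and field names below are hypothetical, not our actual schema), each module can emit one newline-delimited JSON row per significant event to a local spool that a separate process ships onward to GCS and BigQuery for analysis:

```c
/* Hypothetical instrumentation record: one newline-delimited JSON row per
 * event, appended to a spool file that a separate shipper stages to
 * GCS / BigQuery. Field names are illustrative only. */
#include <stdio.h>
#include <time.h>

void log_event(FILE *spool, const char *module, const char *func,
               const char *event, long retcode, long errno_val)
{
    fprintf(spool,
            "{\"ts\":%ld,\"module\":\"%s\",\"func\":\"%s\","
            "\"event\":\"%s\",\"ret\":%ld,\"errno\":%ld}\n",
            (long)time(NULL), module, func, event, retcode, errno_val);
    fflush(spool);   /* keep the stream live for realtime analysis */
}
```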
Within GDELT our crawlers typically use several channels of communication: streaming connections directly to GCS to archive large amounts of data in realtime, BigQuery ingest of those GCS streams for analysis, and a fabric of coordination, tasking, shaping and sink servers that manage our fleets. Each of those servers is based on a common core server architecture tailored to its specific needs that has served us well under GDELT 2.0.
As we transition to GDELT 3.0 we are leveraging our lessons learned at these scales to develop a new generation of server architecture that is much more readily able to handle these edge cases, including new embedded guard protocols layered on top of our data, adaptive load sharing and dynamic healing, all of which have proven exceptionally capable in our pilot infrastructure to date. Most importantly, we have stripped away all of the high-level libraries and interface layers and rearchitected our servers to operate directly on the socket objects using the lowest-level C operations at all times. This ensures we control the entire pipeline directly and with complete visibility, allowing us both to shape the flow of data far more effectively and to make more informed realtime decisions about root causes so we can select the best possible mitigation strategy for each class of problem, from retrying a transmission to restarting a VM. We have also instituted permanent module-level instrumentation of all crawlers and servers that streams a live heartbeat of module-level status and health indicators into a series of internal BigQuery tables, allowing us to proactively monitor for problems and better understand macro-level behaviors over time. We're excited to see where this new server architecture takes us as we complete development and begin rolling it out!
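For readers curious what an embedded guard layer can look like in practice, here is a rough sketch (the header fields, sizes and magic value are assumptions for illustration, not the GDELT 3.0 wire format): each payload is wrapped in a small fixed header carrying a marker, the payload length and a checksum, so the receiver can independently verify that it got exactly what the sender put on the wire, regardless of what the kernel networking layer reports.

```c
/* Illustrative guard-frame header; layout is an assumption for this sketch,
 * not the actual GDELT 3.0 protocol. */
#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

struct guard_hdr {
    uint32_t magic;      /* constant marker to detect desynchronized streams */
    uint32_t length;     /* payload bytes that follow the header */
    uint32_t checksum;   /* simple additive checksum of the payload */
};

#define GUARD_MAGIC 0x47445233u   /* arbitrary marker for this sketch */

static uint32_t checksum32(const uint8_t *p, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Send one guarded frame; returns 0 on success, -1 on any failed send. */
int send_frame(int fd, const uint8_t *payload, uint32_t len)
{
    struct guard_hdr hdr = {
        .magic    = htonl(GUARD_MAGIC),
        .length   = htonl(len),
        .checksum = htonl(checksum32(payload, len)),
    };

    uint8_t buf[sizeof hdr];
    memcpy(buf, &hdr, sizeof hdr);

    /* Real code must loop over short writes; kept simple in this sketch. */
    if (send(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf)
        return -1;
    if (send(fd, payload, len, 0) != (ssize_t)len)
        return -1;
    return 0;
}
```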