One of the quickest lessons developers learn building true global-scale production applications is how quickly systems break down under extreme stress and how many of the hard rules found in kernel and library documentation break down in the real world. Linux kernels can break under stress in ways that leave a seemingly normal machine that behaves in the most unpredictable ways. Applications that push both compute and IO have long been a weak point of Linux kernels and anything that touches IO, from locally-attached disk to network traffic can become an unpredictable bottleneck even in the best of times. A key lesson we've learned over the years is the degree to which even non-blocking IO calls can fail into blocking behavior – asynchronous loops and event-based IO handlers can deadlock and even correctly implemented protective timeouts can fail to prevent the best-written application from hanging.
Fault tolerance is at the core of all large-scale applications and orchestration frameworks for at-scale computing must by necessity be designed around the universal truth that a certain percentage of worker threads, containers or VMs will simply vanish in the worst ways. At the same time, applications often have a need to interact with legacy or specialized exterior services that may have stringent access requirements and no tolerance of unexpected failures, especially services interacting with highly specialized or exotic hardware that form critical interface nodes and thus that must run licensed COTS services that have per-hour per-core charges that lock them to specific machines. It can be hard to join these two worlds.
In our own work, we've found the humble Unix "timeout" utility a tremendously powerful yet trivial to use wrapper around any CLIs and external tools we invoke that has nearly eliminated the number of zombie processes we have to deal with. Simply wrap any CLI shell invocation to give it a hard timeout.
For example, gsutil and gcloud can both hang indefinitely due to limitations of Python's IO error handling. Historically this meant that zombie processes would pile up over time on systems. Worse, some of them would hang in infinite loops that consumed from a few percent to an entire core, siphoning away precious resources. Even writing custom copying scripts in Python using the GCP-provided Python bindings would similarly hang in unexpected ways under severe system load. However, simply prefacing every CLI call with timeout eliminated 100% of these zombie processes:
timeout -s 9 5m gcloud storage cp ./localfile.txt gs://[YOURBUCKET]/remotefile.txt -q > /dev/null 2>&1
Within our own code, we typically wrap any external utility calls within timeout and incorporate multiple redundant timeout mechanisms to introduce hard timeouts at many levels. Interestingly, we've found asynchronous and event-based IO frameworks to be highly vulnerable to these kinds of deadlocks under severe system strain and thus have increasingly moved back to splitting CPU and IO-intensive tasks across isolated processes that rely on kernel scheduling for parallelism with watchdog processes enforcing hard timeouts.
Specifically – over successive generations of our portions of our ingest infrastructure we moved more and more of our core ingest code to modern async and event frameworks, only to find that those engines were far more unstable and brittle during periods of extreme system stress and would fail in highly unexpected ways that made it extremely difficult to provide robust error handling coverage as unfailable operations would fail, operations that failed would return successful response codes and so on. In our most recent GDELT 4.0 ingest engines we reverted back to the concept of serial isolated processes relying on standard kernel scheduling and actually found in many cases an exponential increase in overall performance and decrease in total resource utilization (allowing us to increase the number of processes per VM) due to the kernel's superior scheduling abilities.
Thus, while timeouts are typically overlooked in many system designs as the responsibility of underlying libraries and function calls or swept under the rug by fault tolerant designs, truly robust global architectures will want to enforce their own hard timeouts for reliant infrastructure.