GEN4: Third Party Libraries & The Strange Case Of A Hidden SIGPIPE

GDELT makes extensive use of third party libraries, but one of its core guiding development principles is to isolate those libraries to detachable tasks where the so-called "blast radius" of their failure can be limited to a single disposable task. For example, when using ImageMagick to resize images from the open web, if we encounter an image that causes a segmentation fault, its process isolation within the image crawling pipeline means that in a worst-case scenario we are simply unable to process that single image and that crawler thread continues on to the next image. If the failure is severe enough to take the entire VM with it, the zone-level fleet manager detects the failure and uses the GCE API to harvest the VM.
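The isolation pattern described above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not GDELT's actual pipeline code: the names run_isolated, task_fn, task_ok and task_segv are hypothetical, and the "risky" task simply raises SIGSEGV to stand in for a library crashing on a malformed image.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical task type: returns 0 on success, nonzero on failure. */
typedef int (*task_fn)(const char *item);

enum { TASK_ERROR = -1, TASK_OK = 0, TASK_FAILED = 1, TASK_CRASHED = 2 };

/* Runs task(item) in a forked child so that a crash inside the task
 * kills only the child. If the child dies from a signal, the signal
 * number is stored in *term_signal (when term_signal is non-NULL). */
int run_isolated(task_fn task, const char *item, int *term_signal)
{
    pid_t pid = fork();
    if (pid < 0)
        return TASK_ERROR;

    if (pid == 0) {
        /* Child: run the risky work, then exit without flushing
         * any stdio buffers inherited from the parent. */
        _exit(task(item) == 0 ? 0 : 1);
    }

    /* Parent: wait for the child, retrying if interrupted. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return TASK_ERROR;
    }

    if (WIFSIGNALED(status)) {
        if (term_signal != NULL)
            *term_signal = WTERMSIG(status);
        return TASK_CRASHED;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return TASK_OK;
    return TASK_FAILED;
}

/* Stand-in for a well-behaved library call (assumption). */
int task_ok(const char *item)
{
    (void)item;
    return 0;
}

/* Stand-in for a library call that segfaults on bad input (assumption). */
int task_segv(const char *item)
{
    (void)item;
    raise(SIGSEGV);
    return 0;
}
```

A crawler loop built on this sketch would call run_isolated once per image and, on TASK_CRASHED, record the failure and move on to the next image rather than dying with the library.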

Over the years we have found that even the most extensively tested third party libraries begin to destabilize at the scale GDELT pushes them. Most are developed and tested under laboratory conditions that fail to account for the sheer strangeness of the open web and the nearly limitless edge cases and unexpected behaviors that can materialize.

As a general principle, GDELT therefore strictly minimizes the use of external code in its most mission critical core areas, such as the core servers that power its global routing fabric. For our most critical systems, we reach back to the dawn of the web, when networks were far less stable and systems less scalable, to study how early server code handled key rare edge cases that have long since vanished for ordinary developers, but which reemerge for us given the sheer globally distributed scale at which GDELT operates. The result is that the majority of the code for our core systems is written from scratch and hardened over years of real world deployment running GDELT.

A recent case study explains why we focus so relentlessly on this.

One of our core dispatch server architectures historically made use of a widely used third party library to handle certain networking tasks. We felt that using a third party library for these specific tasks relieved us of a lot of intricate networking complexity. Given its wide use and dedicated networking developers, we also expected it to be updated regularly to handle emerging edge cases, relieving us of that workload. However, over the years, instances of this server would crash at random roughly once every year or two. A once-a-year crash is not catastrophic: the network would instantly reroute traffic to other servers and automatically restart the failed server without incident. Still, we would obviously prefer that servers never crash unexpectedly. Despite years of diagnostics and repeated walks through the code of the third party library, we were never able to locate the issue.

With the launch of GDELT's new GEN4 architecture, the capacity and capability of our global crawler fleets is increasing by many orders of magnitude beyond our previous architectures. Yet as we conducted scaling tests, we discovered that once we crossed a certain threshold of traffic, this once-a-year crashing behavior became a twice-a-day situation. This meant we now had no choice but to correct the underlying issue. At the same time, however, a crash every 12 hours rather than every 12 months made it feasible to perform verbose logging of the server's execution using tools like strace. We stood up a set of temporary VMs with large amounts of SSD disk to handle the heavy logging IO and transferred a portion of our regional traffic to these machines to observe the precise failure point and conditions of the servers.

The eventual culprit was a case study in the dangers of third party code, even libraries that are widely used. The majority of the library was doubly protected from SIGPIPE: a global-level signal handler caught any SIGPIPE that was raised, while writes that did not require transmission error handling used "send(socket, buf, len, MSG_NOSIGNAL)" to avoid raising SIGPIPE in the first place. Yet, deeply buried within the library was a decades-old routine that handled a rare edge case and that the developers had apparently left in but never modernized. As part of its execution, this routine performed a sequence of steps whose inadvertent result was the removal of the global-level SIGPIPE handler just before performing a socket write without the protection of MSG_NOSIGNAL. The resulting SIGPIPE was therefore unhandled and, exactly as the default signal disposition specifies, terminated the process.
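The two behaviors described above can be reproduced in a small C sketch. It is an illustration under assumptions, not the library's actual code: the function names are hypothetical, the "global" protection is modeled with SIG_IGN rather than a real handler, the legacy routine's side effect is modeled as a plain reset to SIG_DFL, and MSG_NOSIGNAL assumes a Linux environment. The unprotected write runs in a forked child so the fatal signal can be observed without killing the caller.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a connected socket pair and closes the peer end, so any
 * write to the returned descriptor hits a broken pipe.
 * Returns -1 on failure. */
int make_broken_socket(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    close(fds[1]);
    return fds[0];
}

/* Protected path: MSG_NOSIGNAL suppresses SIGPIPE and the failure is
 * reported through errno instead. Returns the errno from send(),
 * 0 if send() unexpectedly succeeded, or -1 on setup failure. */
int protected_send_errno(void)
{
    int fd = make_broken_socket();
    if (fd < 0)
        return -1;
    ssize_t n = send(fd, "x", 1, MSG_NOSIGNAL);
    int err = (n < 0) ? errno : 0;
    close(fd);
    return err;
}

/* Failure mode: the protection is removed and a plain write() follows.
 * Returns the signal that killed the child, 0 if the child exited
 * normally, or -1 on fork/wait failure. */
int unprotected_write_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(SIGPIPE, SIG_IGN); /* stand-in for the global protection */
        signal(SIGPIPE, SIG_DFL); /* stand-in for the legacy routine undoing it */
        int fd = make_broken_socket();
        if (fd >= 0) {
            ssize_t n = write(fd, "x", 1); /* no MSG_NOSIGNAL: raises SIGPIPE */
            (void)n;
        }
        _exit(0); /* only reached if no signal was delivered */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux, the protected path returns EPIPE to the caller, which can then close the connection and carry on, while the unprotected path ends with the child terminated by SIGPIPE. Nothing about the second write looks wrong in isolation, which helps explain why code review alone did not surface the problem.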

While this was corrected by the library developers, it led us to discover other potential edge case behaviors for which the library generated the expected general response, but for which GDELT's unique needs would benefit from an alternative approach. Thus, we rewrote this server to completely remove the third party library and take on those networking tasks ourselves. This had the added benefit that we were able to customize our approach to the GCE environment, leveraging external knowledge about GCE's networking, kernel and VM hardware environments, as well as providing closer coordination with our hypervisor managers and the underlying GCE management APIs.

This case study offers a reminder that while modern development practice is to maximize the use of third party libraries and outsource as much of the development process as possible, once you reach a certain level of architectural scaling, there can be strong benefits to reversing this tendency and focusing more on bespoke hardened code.