The GDELT Project

GEN4: Building Robust Chained Networking Pipelines

Applications that must reach beyond the confines of their own virtual machine contend with an almost limitless array of arcane and nuanced networking behaviors, some of which can even violate official library or kernel specifications, and must be able to recover gracefully from anything they encounter. Even two VMs in the same zone can experience networking issues of such a rare and arcane nature that it can take days just to track down the relevant library or kernel interaction points. In some cases resource pressure can cause kernels and libraries to exhibit behaviors that their documentation states are impossible and their codebases would appear to prevent. Even core cloud utilities can fail as the underlying Python libraries they rely upon fail in unexpected ways. Friendly servers can enter a bad state and return incorrect information, hang, or stream data at such a low rate that underlying libraries automatically close the connection.

Each of these issues has a range of potential mitigation strategies, depending on the time criticality of the request and its eligibility for asynchronous error handling: retrying, moving the request to a different VM, allocating it more time or resources, and so on.
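As a minimal sketch of the simplest of these mitigations (the function name and parameters here are illustrative, not part of GEN4), a bash retry wrapper that grants each successive attempt a larger time allocation might look like:

# Hypothetical retry helper: run a command up to $1 times, granting each
# attempt $2 more seconds than the last, pausing briefly between attempts.
retry_with_growing_timeout() {
  local max_attempts="$1" step_secs="$2"; shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    timeout "$((attempt * step_secs))" "$@" && return 0
    sleep $((attempt * 5))
    attempt=$((attempt + 1))
  done
  return 1   # all attempts exhausted
}

# Illustrative usage: three attempts, allowed 60s, 120s and 180s respectively.
retry_with_growing_timeout 3 60 curl -s -o /dev/null "https://URL"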

When ingesting large files from external sources, it is typically highly beneficial to retrieve them from VMs that are as geographically and network-proximate to the source as possible. This means we may use small single-core VMs to stream files into GCS, where they can be distributed globally for processing through APIs and large VMs. A highly efficient method for ingesting such files is a variant of:

timeout -s 9 Xm curl -m Y -s -k -H 'authorization: XYZ' "https://URL" | mbuffer -q -m 15M | gsutil -q cp - "gs://bucket/"

Here we use curl to ingest the file, pipe it through mbuffer to even out any burstiness, and write the output to GCS using gsutil in streaming write mode. The curl call is given a request-specific maximum transfer time via -m and is additionally wrapped in timeout, which hard-kills it with SIGKILL if it enters a bad state and fails to exit on its own; once curl dies, the rest of the pipeline sees EOF and winds down.

Pipelines like the one above pose a specific challenge: by default, a pipeline's exit code is that of its final command, not of the earlier stages. That means that if curl fails with a non-zero exit code but gsutil succeeds, the final exit code of the pipeline above will be zero, reflecting gsutil's success. In the case of curl specifically, we can use "--write-out" to request further information about its operations, but for many other utilities the exit code may be our only indication of completion status. Similarly, if mbuffer fails to initialize due to critical memory pressure, we have few ways of learning that beyond its exit code.
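This default is easy to demonstrate at a bash prompt:

false | true
echo $?     # prints 0: only the last command's exit code survives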

In the bash shell, the special array variable PIPESTATUS stores the exit code of each command in the most recently executed pipeline. For the sequence above, it would contain three numbers, one for each of the three commands:

echo ${PIPESTATUS[@]}
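For example, at a bash prompt:

false | true | true
echo ${PIPESTATUS[@]}    # prints: 1 0 0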

Unfortunately, when calling piped sequences from most applications, the system's default command shell is used, which on most Linux distributions today is still sh (often symlinked to dash), which does not support PIPESTATUS. Thus, we typically wrap pipe chains inside their own bash scripts, then execute the bash script and have it echo PIPESTATUS to a file on exit:

#!/bin/bash
timeout -s 9 Xm curl -m Y -s -k -H 'authorization: XYZ' "https://URL" | mbuffer -q -m 15M | gsutil -q cp - "gs://bucket/"
# PIPESTATUS must be captured immediately: any intervening command overwrites it.
echo "${PIPESTATUS[@]}" > ./pipestatus.txt
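For a hypothetical run in which the outer timeout must SIGKILL a wedged curl while mbuffer and gsutil drain the partial stream and exit cleanly, the script's own exit code would still be 0 (that of the final echo), but pipestatus.txt would record the failure (137 = 128 + 9, the status timeout reports when it has to hard-kill its command):

cat pipestatus.txt
137 0 0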

Then we execute the script above and check both its exit code and the contents of pipestatus.txt to determine whether any errors occurred. In Perl this might take the form of:

my $exitcode = $? >> 8;                   # exit code of the wrapper script itself
open(my $FH, '<', 'pipestatus.txt') or die "pipestatus.txt: $!";
my $pipestatus = do { local $/; <$FH> };  # slurp the file's contents
close($FH);
$exitcode += $_ + 0 foreach split /\s+/, $pipestatus;   # nonzero if any stage failed
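The same check from a bash caller, for comparison (the wrapper script name is illustrative):

./ingest_pipeline.sh
exitcode=$?
for code in $(cat pipestatus.txt); do exitcode=$((exitcode + code)); done
[ "$exitcode" -eq 0 ] || echo "one or more pipeline stages failed" >&2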

Simple design principles like this can go an incredibly long way towards building robust chained networking pipelines.