VGKG & GEN4: Execution Speed Versus Computational Overhead

Building massively scalable infrastructure entails myriad technical tradeoffs. One of the most basic is the balance between raw execution speed and computational overhead. For example, a highly optimized loop might achieve nearly complete utilization of the CPU and finish its task in 1 second. A second version, written without any consideration for optimization, might take 10 seconds to complete. If the unoptimized loop requires 100% CPU utilization for those 10 seconds, it has few advantages over the optimized version, assuming memory consumption and other requirements remain the same. On the other hand, if it requires just 10% CPU utilization for those 10 seconds, then the question of which is better comes down to the precise needs of the underlying application. A high-intensity loop that consumes the entirety of the available CPU can cause cascading problems if it displaces the supporting processes providing its input and output streams, especially if those streams are latency sensitive. Conversely, a lengthier execution time could cause downstream tasks to overrun their timeout conditions.
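
As a rough illustration of how this distinction can be quantified, the short Python sketch below reports wall-clock time, CPU time, and the implied utilization for two stand-in workloads; the workloads themselves are placeholders rather than any particular crawler task.

    import time

    def measure(label, fn):
        # Report wall-clock seconds, CPU seconds, and the implied utilization for fn().
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        fn()
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        print(f"{label}: wall={wall:.2f}s cpu={cpu:.2f}s utilization={100 * cpu / wall:.1f}%")

    # A tight compute loop: short wall time, but it holds a core the entire time.
    measure("optimized", lambda: sum(i * i for i in range(10_000_000)))

    # A loop that mostly waits: far longer wall time, near-zero CPU utilization.
    measure("waiting", lambda: [time.sleep(0.05) for _ in range(100)])

Which of the two profiles is preferable depends, as above, on whether wall-clock latency or spare CPU capacity is the scarcer resource for the surrounding application.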

The myriad interrelated components that make up the full execution lifecycle of a crawler make such decisions difficult, as each has different needs that must be carefully interwoven. For example, CPU-intensive tasks are best intermixed with network-intensive tasks that spend most of their time waiting, but care must be taken to ensure that a high-intensity burst does not introduce so much latency into a networking task that the remote end closes the connection due to excessive throttling or buffer overruns.
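
One common way to interleave the two kinds of work is to keep latency-sensitive network tasks on an event loop while shunting CPU bursts to a separate worker pool. The following is a minimal sketch of that pattern using Python's asyncio; the fetch and processing steps are hypothetical placeholders, not actual crawler components.

    import asyncio
    import concurrent.futures
    import hashlib

    def cpu_heavy(payload: bytes) -> str:
        # Placeholder for a CPU-intensive step such as parsing or fingerprinting a page.
        for _ in range(200_000):
            payload = hashlib.sha256(payload).digest()
        return payload.hex()

    async def fetch(url: str) -> bytes:
        # Placeholder for a latency-sensitive network fetch; a real crawler would use an
        # HTTP client here, and it must keep being scheduled promptly so the remote end
        # does not time out or close the connection.
        await asyncio.sleep(0.2)
        return url.encode()

    async def handle(url: str, pool: concurrent.futures.Executor) -> str:
        page = await fetch(url)  # waiting here yields the event loop to other tasks
        loop = asyncio.get_running_loop()
        # The CPU burst runs in a separate process, so it cannot stall the event loop
        # that is servicing the other network connections.
        return await loop.run_in_executor(pool, cpu_heavy, page)

    async def main() -> None:
        with concurrent.futures.ProcessPoolExecutor() as pool:
            results = await asyncio.gather(
                *(handle(f"https://example.com/{i}", pool) for i in range(8))
            )
            print(f"processed {len(results)} pages")

    if __name__ == "__main__":
        asyncio.run(main())

A thread pool can serve the same purpose for bindings that release the GIL; the point is simply that CPU work is confined to its own workers rather than being allowed to starve the loop that keeps the network connections serviced.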

One of the most surprising aspects of this tradeoff is that the standalone CLI and library versions of commonly used tools can sit an order of magnitude or more apart along this spectrum.

For example, one major toolkit commonly used in web processing produces two wildly different results on a particular benchmark, depending on whether its CLI or library version is used. The library version, when implemented with all recommended settings and fully optimized, completes the benchmark in just 3 seconds at 99% CPU utilization. The CLI version, on the other hand, takes twice as long, completing its task in 6 seconds. At first glance, this might suggest the library version is the better solution. However, a look at CPU utilization tells a very different story: the CLI version averages less than 0.5% CPU over each of those 6 seconds and consumes just a fraction of the memory. In terms of total work, that is roughly 3 CPU-seconds for the library version versus a few hundredths of a CPU-second for the CLI.

The reason for the discrepancy in this particular case is that the library binding used in many programming languages relies on a polling model to ensure thread-safe and signal-less execution, to simplify the movement of data between the library and user code, and to provide a more generalized callback and injection workflow. The CLI, on the other hand, uses blocking IO and signals to minimize its CPU overhead as far as possible. The end result is that the CLI typically takes around twice as long to complete the task, but does so with just 0.5% of the CPU consumption of the library version. This is despite the fact that the CLI must also suffer the overhead of process creation and malloc()ing all of its memory on each and every call (in this particular case each loop iteration invoked the CLI separately, which makes the differential even more impressive).
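
The contrast between the two IO models can be sketched in a few lines. The following is a generic illustration of blocking reads versus timer-driven polling against a child process's output stream; it is not the internals of any particular toolkit, and the command and handle() step are placeholders.

    import os
    import select
    import subprocess
    import time

    CMD = ["cat", "/path/to/streamed_input"]  # placeholder: substitute the real streaming command

    def handle(chunk: bytes) -> None:
        pass  # placeholder for real processing

    def read_blocking(proc: subprocess.Popen) -> None:
        # Blocking model: os.read() puts the caller to sleep in the kernel until bytes
        # arrive, so time spent waiting costs essentially zero CPU.
        fd = proc.stdout.fileno()
        while chunk := os.read(fd, 65536):
            handle(chunk)

    def read_polling(proc: subprocess.Popen, interval: float = 0.001) -> None:
        # Polling model: wake on a timer, check for readiness, read if anything is there.
        # Every wakeup costs CPU even when no data is ready, and shrinking the interval
        # to reduce latency multiplies that overhead.
        fd = proc.stdout.fileno()
        while True:
            ready, _, _ = select.select([fd], [], [], 0)
            if ready:
                chunk = os.read(fd, 65536)
                if not chunk:
                    break
                handle(chunk)
            elif proc.poll() is not None:
                break
            else:
                time.sleep(interval)

    proc = subprocess.Popen(CMD, stdout=subprocess.PIPE)
    read_blocking(proc)  # or read_polling(proc) to compare the CPU cost of the two models
    proc.wait()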

Such oddities abound in common toolkits, so it is worth extensively benchmarking CLI versus library versions rather than assuming that a library interface will be substantially more efficient, even if the CLI interface requires a separate invocation for each and every loop iteration.
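
A minimal benchmarking harness for such a comparison only needs to capture wall-clock time alongside the CPU time of both the parent process and any spawned children. The sketch below is Unix-only (it relies on Python's resource module) and assumes a hypothetical per-item library call and a hypothetical CLI command line, both of which would be swapped for the real tool under test.

    import resource
    import subprocess
    import time

    def bench_library(items, process_item):
        # Call the library binding in-process once per item.
        wall0 = time.perf_counter()
        cpu0 = resource.getrusage(resource.RUSAGE_SELF)
        for item in items:
            process_item(item)  # hypothetical per-item library call
        cpu1 = resource.getrusage(resource.RUSAGE_SELF)
        return time.perf_counter() - wall0, _cpu_delta(cpu0, cpu1)

    def bench_cli(items, build_cmd):
        # Invoke the CLI once per item; child CPU time accumulates under
        # RUSAGE_CHILDREN as each child process is reaped.
        wall0 = time.perf_counter()
        cpu0 = resource.getrusage(resource.RUSAGE_CHILDREN)
        for item in items:
            subprocess.run(build_cmd(item), check=True, capture_output=True)
        cpu1 = resource.getrusage(resource.RUSAGE_CHILDREN)
        return time.perf_counter() - wall0, _cpu_delta(cpu0, cpu1)

    def _cpu_delta(before, after):
        # User plus system CPU seconds consumed between the two snapshots.
        return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    def report(label, wall, cpu):
        print(f"{label}: wall={wall:.2f}s cpu={cpu:.2f}s utilization={100 * cpu / wall:.1f}%")

    # Example usage with hypothetical names:
    # report("library", *bench_library(urls, toolkit.process))
    # report("cli",     *bench_cli(urls, lambda u: ["toolkit-cli", "--process", u]))

The same getrusage snapshots also expose peak resident memory via ru_maxrss, which is worth recording alongside CPU time given how differently the two invocation styles allocate.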