A Case Example Of Performance Tuning: Pure C Versus PERL + Inline C Versus Pure PERL

Given the scale at which GDELT operates, we spend considerable time developing highly efficient architectures and relentlessly tuning the performance of our pipelines, balancing the need for raw execution speed with geographically distributed fault tolerance, mixing traditional tightly coupled HPC practices at the VM level with cloud practices at the interconnect level. Given the enormous breadth of GDELT's analytics and its reach across so many different research domains, its underlying codebase spans a large number of languages glued together into complex pipelines that leverage the unique capabilities of each language and toolkit.

For a forthcoming analytic tool, we needed to analyze vast sparse matrices, performing a number of common operations on them. Traditionally, sparse matrix analysis is the domain of Fortran or C and makes use of libraries like BLAS and LAPACK, often with tuning to the specific hardware architecture on which it runs. In our case, certain unique characteristics of our data suggested an alternative architecture that traded memory for execution time, allowing us to use a complex nested data structure with random memory access patterns that was particularly well-suited for PERL's ability to represent complex data structures with high performance using hashes. We built and benchmarked three versions of this code to determine whether the core high-intensity portions of the algorithm could benefit from being written in C.

The benchmark version of the code was written in PERL, using its support for large, dynamically expanding multidimensional hash tables to trade traditional highly optimized matrix operations for large, randomly scattered access patterns. By shifting the high-intensity execution areas away from numeric operations (which are slower in PERL than in C) and toward memory operations, the code was optimized to make the most efficient use of PERL's strengths.
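To make the idea of replacing dense matrix operations with scattered hash lookups concrete, here is a minimal sketch in C of a sparse matrix stored as a hash table keyed on (row, column). It is not the code used in this experiment: the names SparseMatrix, sm_new, sm_add, sm_get and sm_free are hypothetical, and the open-addressing layout with a 64-bit mixing function is simply one common way to build such a table. It only illustrates the general pattern of spending memory on a hash structure so that each cell update becomes a single probe rather than a search through the matrix.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sparse matrix: an open-addressing hash table keyed on (row, col). */
typedef struct {
    uint64_t key;    /* row in the high 32 bits, col in the low 32 bits */
    double   value;
    int      used;   /* 0 = empty slot, 1 = occupied */
} Cell;

typedef struct {
    Cell  *cells;
    size_t capacity; /* always a power of two */
    size_t count;    /* number of occupied slots */
} SparseMatrix;

static uint64_t make_key(uint32_t row, uint32_t col) {
    return ((uint64_t)row << 32) | col;
}

/* 64-bit mixing function (splitmix64-style finalizer) to scatter keys. */
static uint64_t mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Linear probing: return the slot holding key, or the empty slot where it belongs. */
static Cell *sm_find_slot(Cell *cells, size_t capacity, uint64_t key) {
    size_t i = (size_t)(mix(key) & (capacity - 1));
    while (cells[i].used && cells[i].key != key)
        i = (i + 1) & (capacity - 1);
    return &cells[i];
}

SparseMatrix *sm_new(size_t capacity_pow2) {
    /* capacity must be a nonzero power of two so that masking works */
    assert(capacity_pow2 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    SparseMatrix *m = malloc(sizeof *m);
    assert(m != NULL);
    m->cells = calloc(capacity_pow2, sizeof(Cell));
    assert(m->cells != NULL);
    m->capacity = capacity_pow2;
    m->count = 0;
    return m;
}

/* Double the table and reinsert every occupied cell. */
static void sm_grow(SparseMatrix *m) {
    size_t new_cap = m->capacity * 2;
    Cell *new_cells = calloc(new_cap, sizeof(Cell));
    assert(new_cells != NULL);
    for (size_t i = 0; i < m->capacity; i++) {
        if (m->cells[i].used)
            *sm_find_slot(new_cells, new_cap, m->cells[i].key) = m->cells[i];
    }
    free(m->cells);
    m->cells = new_cells;
    m->capacity = new_cap;
}

/* Add delta to cell (row, col), creating the cell on first touch. */
void sm_add(SparseMatrix *m, uint32_t row, uint32_t col, double delta) {
    if ((m->count + 1) * 2 > m->capacity)  /* keep the load factor at or below 0.5 */
        sm_grow(m);
    uint64_t key = make_key(row, col);
    Cell *slot = sm_find_slot(m->cells, m->capacity, key);
    if (!slot->used) {
        slot->used = 1;
        slot->key = key;
        slot->value = 0.0;
        m->count++;
    }
    slot->value += delta;
}

/* Cells that were never touched read as zero, as in any sparse matrix. */
double sm_get(const SparseMatrix *m, uint32_t row, uint32_t col) {
    const Cell *slot = sm_find_slot(m->cells, m->capacity, make_key(row, col));
    return slot->used ? slot->value : 0.0;
}

void sm_free(SparseMatrix *m) {
    free(m->cells);
    free(m);
}
```

A caller would create a table with sm_new(1024), accumulate values with sm_add(m, row, col, delta), and read them back with sm_get. Each update lands at an essentially random position in the table, which is exactly the kind of access pattern that shifts the bottleneck from arithmetic to memory.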

The second version of the code used PERL's Inline::C module to rewrite the entire engine core of the benchmark PERL script in C. The resulting PERL code simply created the hash structures in PERL and then invoked the C code to perform the analysis. By executing the entire core of the engine in C, the script was able to operate on PERL's internals at the lowest available level and potentially force the PERL interpreter into a more optimized execution sequence. For many codes, such as mathematics-intensive calculations, this can lead to dramatic speedups, but in this case, apart from the additional overhead of compiling the C code, there was no difference in runtime between the Inline::C version of the code and the original pure PERL version. Execution was largely memory constrained, meaning that even creative ordering of PERL's low-level operations could do little to improve the speed. This offers a critical reminder that modern PERL is already so efficient out of the box that simply recoding a given module into C may not yield a speedup, even for high-intensity nested loops and data structures.

The third version was built in pure C, using all of the usual performance optimizations and a standard optimizing compiler. On simple brute-force striding through a multidimensional array using toy data and linear access on a bare metal system, the code is unsurprisingly several orders of magnitude faster than PERL and exhibits very high efficiency. At the same time, it is important to remember that in the cloud, vCPUs are actually execution threads on the underlying physical cores, meaning the same code running on the same processor in the cloud will typically take at least twice as long as on bare metal. The cache management, architectural features, pipelining and memory access strategies possible on bare metal are largely circumvented in the shared environment of virtual machines, negating many of the advantages of such highly optimized code (though most cloud providers now offer bare metal or more direct hardware access on select compute types). Thus, while the C code was still considerably faster than the corresponding PERL code on toy data, the speedup was considerably muted in the cloud compared with the bare metal environment. When applied to the production dataset, however, with its large resident dataset and random memory accesses, the performance differential all but disappeared and the C and PERL code executed at exactly the same speed in the cloud. In fact, when run on a shared, heavily loaded 8-core VM, the PERL code actually ran considerably faster, since its memory-centric model allowed it to be interleaved among the CPU-intensive tasks on the VM, while the more CPU-dependent C code competed more heavily for core time. Of course, in reality the C code likely could be optimized even further, but at the end of the day the code is largely memory bandwidth-limited, meaning PERL is already efficient enough to match C for codes with certain kinds of access patterns.

Alternative hashing libraries could be used in both the standalone C and Inline::C versions, though it remains to be seen whether they could provide a meaningful speed improvement given the unique nature of this specific use case.
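The gap between linear striding over toy data and random access over a large resident dataset can be explored with a small micro-benchmark. The sketch below is an illustration only, not the benchmark used here: the helper names make_permutation, sum_linear, sum_scattered and cpu_seconds are hypothetical. It sums the same array twice, once in order and once through a shuffled index, so that both passes do identical arithmetic and differ only in their memory access pattern.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Fill idx with 0..n-1, then shuffle it (Fisher-Yates) with a small
 * xorshift64 generator. The modulo introduces a slight bias, which is
 * acceptable for a benchmark sketch. */
void make_permutation(size_t *idx, size_t n, uint64_t seed) {
    uint64_t s = seed ? seed : 1;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    for (size_t i = n; i > 1; i--) {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        size_t j = (size_t)(s % i);
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Sequential pass: friendly to caches and hardware prefetchers. */
uint64_t sum_linear(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Scattered pass: same additions, but each load goes to an unpredictable
 * address, so on large arrays most loads miss the cache. */
uint64_t sum_scattered(const uint32_t *data, const size_t *idx, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[idx[i]];
    return total;
}

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}
```

To use it, allocate an array much larger than the last-level cache, build the index with make_permutation, and bracket each sum with calls to cpu_seconds. Both sums must return the same total, which is a useful sanity check, and the returned totals should be printed or otherwise used so that an optimizing compiler does not discard the loops. The measured ratio will depend heavily on the hardware and on whether the run is on bare metal or a shared VM.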

In the end, this experiment offers a reminder that while interpreted languages are traditionally viewed as too slow for high-intensity computation compared with C, PERL's highly optimized building blocks allow it to match standalone C implementations for many tasks. This holds in cases where the underlying algorithm can be rearchitected to trade memory for computational efficiency, particularly through the use of complex hash structures and regular expressions, and especially in the shared environment of the cloud.