A Trivial Few-Liner Key-Value Store Benchmark Experiment For Digital Twin Storage

As we explore porting some of our previous queue-based architectures over to centralized key-value stores that provide longitudinal history and better visibility into error states, we've been benchmarking a range of architectures, from trivial few-liner key-value stores that enable bespoke inline management logic and processing alongside the data, through hyperscale solutions like BigTable, with an eye towards their very different use cases. One area of particular interest to us is the landscape of services where we need complex atomic logic capable of running advanced models directly on-server or, of especial interest, where the total key-value size is extremely small and trivially sharded by date. For those services, BigTable or even far more modest existing solutions would be overkill, while self-managed solutions would bring with them additional complexity and fragility.

Here we benchmark four trivial key-value stores built using a few lines of Perl to examine how a bespoke server core might perform. We test two different days of the Visual Explorer in a scenario in which the key-value store is being used to track the status and progress of various processing tasks. In this case there are 7 different tasks that must be performed on each video and thus 7 status values that must be recorded in the store for each broadcast. Since the Visual Explorer is organized by day, we can trivially shard our store by date. We thus test two dates, one containing 1,839 broadcasts (1,839 * 7 = 12,873 total values) and one containing 3,942 broadcasts (3,942 * 7 = 27,594 total values). We then test four different persistent key-value store mechanisms (we exclude memory-resident scenarios here due to the need for immediate persistence and thus test only disk-based models). In the JSON-NL model, we use Perl's C-based JSON::XS module to record the database as newline-delimited JSON, while the TSV model is a simple split() over tab-separated lines. The 1DHASH is a standard Perl associative hash that is read from and written to disk using Storable. In the first three cases the primary and secondary keys are concatenated together like $HASH{"$SHOWID.$TASKTYPE"}. The 2DHASH test is the same workflow as the 1DHASH but uses a 2D hash instead of concatenating the keys together (a la $HASH{$SHOWID}{$TASKTYPE}) in order to leverage Perl's shared-key optimizations. Each approach was run 1,000 times in a loop in a single process.
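For concreteness, a minimal sketch of what the full load-update-persist cycle of the 2DHASH variant could look like is shown below; the file name, show ID and task label are illustrative placeholders rather than our production values:

#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(retrieve nstore);

# Illustrative sketch of the Storable-based 2DHASH cycle: load the per-day
# store from disk, update one status value and persist the store back.
my $storefile = './kvstore-20240101.dat';    # hypothetical per-day shard

my %store = -e $storefile ? %{ retrieve($storefile) } : ();

# Record a task status under a two-level ($SHOWID, $TASKTYPE) key.
my ($showid, $tasktype) = ('SHOW12345', 'OCR');   # illustrative values
$store{$showid}{$tasktype} = 'COMPLETE';

# Persist the entire store back to disk after the update.
nstore(\%store, $storefile);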

As expected, the hash-based solutions offer vastly higher scalability than the JSON-NL and TSV versions, with runtimes across 1,000 iterations that barely increased from the 1,839-record day to the 3,942-record day (6.2s to 6.4s), while the JSON-NL and TSV runtimes roughly doubled. Overall, these benchmarks demonstrate that for specific use cases even trivial few-liner code can offer substantial performance with robust persistent storage. The hash solutions offer the highest performance, while the 2D hash offers considerable storage reduction as well. The use of a native hash table opens the door to extremely performant complex logic operations that can be performed entirely in-server per-request as needed, meaning specific unique use cases with relatively small numbers of managed key-value pairs, low total key and value space and complex logic needs may benefit from this approach.
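As a hypothetical illustration of the kind of in-server logic a native hash makes cheap, the sketch below scans the in-memory 2D hash from the example above to find all broadcasts whose tasks have completed, something a downstream finalization step could key off of; the task names are invented for illustration:

# Hypothetical in-server logic against the in-memory 2D hash: collect all
# broadcasts for the day whose seven tasks have all reached COMPLETE.
my @tasks = qw(DOWNLOAD TRANSCODE OCR ASR ANNOTATE INDEX PUBLISH);  # illustrative
my @done;
foreach my $showid (keys %store) {
    push @done, $showid
        if 7 == grep { ($store{$showid}{$_} // '') eq 'COMPLETE' } @tasks;
}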

JSON-NL: use JSON::XS: while(<FILE>) { decode_json };
TSV: while(<FILE>) { split() }
1DHASH: use Storable: $HASH{"$SHOWID.$TASKTYPE"}
2DHASH: use Storable: $HASH{$SHOWID}{$TASKTYPE}
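
To make the file-based variants more concrete, the read loops for the JSON-NL and TSV formats might look like the sketch below, assuming one record per line carrying show ID, task type and status (the field names and file names are illustrative):

use strict;
use warnings;
use JSON::XS qw(decode_json);

my %store;

# JSON-NL: one JSON object per line, e.g. {"show":"...","task":"...","status":"..."}.
open my $fh, '<', 'store.json' or die $!;    # hypothetical file name
while (my $line = <$fh>) {
    my $rec = decode_json($line);
    $store{ $rec->{show} . '.' . $rec->{task} } = $rec->{status};
}
close $fh;

# TSV: tab-separated show, task and status on each line.
open $fh, '<', 'store.tsv' or die $!;        # hypothetical file name
while (my $line = <$fh>) {
    chomp $line;
    my ($show, $task, $status) = split /\t/, $line;
    $store{"$show.$task"} = $status;
}
close $fh;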

Benchmarks for 1,839 record day:

1,839 records * 7 statuses per record * 1,000 iterations (on-disk size : total runtime)
JSON-NL: 942KB: 10.5s
TSV: 700KB: 7.9s
1DHASH: 749KB: 6.2s
2DHASH: 391KB: 6.2s

Benchmarks for 3,942 record day:

3,942 records * 7 statuses per record * 1,000 iterations (on-disk size : total runtime)
JSON-NL: 1.90MB: 22.3s
TSV: 1.38MB: 16.6s
1DHASH: 1.48MB: 6.4s
2DHASH: 819KB: 6.4s