Digital Twin Experiments: Benchmarking Bigtable Batch vs. Serial Reads

As we continue our digital twin infrastructure initiative on GCP's Bigtable, how important is it to use Bigtable's native batching support rather than issuing serial requests? Is there really that much of a performance improvement? As an example, consider a process that needs to check the processing status of 20,000 broadcasts from a particular channel to determine whether any have failed processing steps. Issuing the read requests serially, one by one, from a VM to a single-node SSD Bigtable instance in the same zone takes 1m17.5s, while issuing them in batches of 10,000 takes just 1.7s: a speedup of roughly 45x. There is of course an upper bound to this improvement: the gains taper off sharply past a batch size of 100, with only a modest gain from 100 to 1,000 records, an almost negligible one from 1,000 to 10,000, and an outright error at 20,000, where the request exceeds Bigtable's message size limit for this use case. Overall, the importance of batching read requests is abundantly clear, as is just how large an improvement can be achieved even in a trivial use case.
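For illustration, here is a minimal sketch of the two access patterns using the Python client (google-cloud-bigtable). The project, instance, and table identifiers, the row key scheme, and the "meta"/"processing_status" column are placeholder assumptions, not the actual benchmark code:

    from google.cloud import bigtable
    from google.cloud.bigtable.row_set import RowSet

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("broadcasts")
    # Hypothetical key scheme: one row per broadcast, prefixed by channel.
    row_keys = [f"channel-42#broadcast-{i:05d}" for i in range(20_000)]

    # Serial: one ReadRows RPC per key (BATCHSIZE=1).
    statuses = {}
    for key in row_keys:
        row = table.read_row(key)
        if row is not None:
            statuses[key] = row.cell_value("meta", b"processing_status")

    # Batched: a single ReadRows RPC covering many keys at once
    # (BATCHSIZE=10000 in the timings below).
    row_set = RowSet()
    for key in row_keys[:10_000]:
        row_set.add_row_key(key)
    for row in table.read_rows(row_set=row_set):
        statuses[row.row_key.decode()] = row.cell_value("meta", b"processing_status")

The serial loop pays one network round trip per row; the batched call amortizes that cost across the whole key set, which is where the speedup below comes from.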

BATCHSIZE=1       1m17.505s
BATCHSIZE=2       0m37.118s
BATCHSIZE=5       0m15.562s
BATCHSIZE=10      0m9.140s
BATCHSIZE=100     0m2.633s
BATCHSIZE=1000    0m1.806s
BATCHSIZE=10000   0m1.732s
BATCHSIZE=20000   ERROR:root:Error querying Bigtable: 400 Received ReadRowsRequest message too large
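The error at 20,000 keys is a request size limit rather than a cap on how many rows can be read: every explicitly listed key grows the ReadRowsRequest message. One workaround is to chunk the key list across several batched calls, which keeps most of the benefit. A sketch, reusing the names from the example above (the 10,000-key chunk size is an assumption that worked here; the safe value depends on key length):

    from google.cloud.bigtable.row_set import RowSet

    def read_rows_chunked(table, row_keys, chunk_size=10_000):
        # Issue several ReadRows RPCs, each small enough to stay
        # under the request message size limit.
        for i in range(0, len(row_keys), chunk_size):
            row_set = RowSet()
            for key in row_keys[i:i + chunk_size]:
                row_set.add_row_key(key)
            yield from table.read_rows(row_set=row_set)

    # The b"failed" status value is a placeholder for whatever the
    # pipeline actually writes.
    failed = [
        row.row_key.decode()
        for row in read_rows_chunked(table, row_keys)
        if row.cell_value("meta", b"processing_status") == b"failed"
    ]

Alternatively, if the broadcast keys share a common prefix, a single row range (RowSet.add_row_range_from_keys) avoids listing every key in the request and sidesteps the size limit entirely.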