As we continue our digital twin infrastructure initiative using GCP's Bigtable, how important is it to use Bigtable's native batching support rather than issuing serial requests? Is there really that much of a performance improvement? As an example, consider a process that needs to verify the processing status of 20,000 broadcasts from a particular channel to determine whether any have failed processing steps. Issuing read requests from a VM to a single-node SSD Bigtable instance in the same zone serially, one-by-one, takes 1m17.5s, while issuing them in batches of 10,000 takes just 1.7s: a speedup of roughly 46x. There is of course an upper bound to this improvement: the gains narrow sharply between batch sizes of 100 and 1,000, are minimal from 1,000 to 10,000, and at 20,000 the request fails outright, since every row key in the batch is serialized into a single ReadRowsRequest, which at that size exceeds the server's request-size limit. Overall, the importance of batching read requests is abundantly clear, as is just how much of a performance improvement can be achieved even in a trivial use case.
BATCHSIZE=1      1m17.505s
BATCHSIZE=2      0m37.118s
BATCHSIZE=5      0m15.562s
BATCHSIZE=10     0m9.140s
BATCHSIZE=100    0m2.633s
BATCHSIZE=1000   0m1.806s
BATCHSIZE=10000  0m1.732s
BATCHSIZE=20000  ERROR:root:Error querying Bigtable: 400 Received ReadRowsRequest message too large
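For reference, here is a minimal sketch of the two access patterns using the google-cloud-bigtable Python client. The project, instance, and table IDs and the row-key format below are hypothetical placeholders, not the actual benchmark configuration.

# Serial vs. batched reads with the google-cloud-bigtable Python client.
# IDs and key format are placeholders for illustration only.
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("broadcast-status")

# Hypothetical keys for the 20,000 broadcasts of one channel.
row_keys = [f"channel-42#broadcast-{i:05d}".encode() for i in range(20_000)]

def read_serial(table, row_keys):
    """One ReadRows RPC per key (BATCHSIZE=1)."""
    for key in row_keys:
        row = table.read_row(key)
        if row is not None:
            yield row

def read_batched(table, row_keys, batch_size=10_000):
    """One ReadRows RPC per chunk of keys (BATCHSIZE=batch_size)."""
    for i in range(0, len(row_keys), batch_size):
        # Collect the chunk's keys into a RowSet so a single streaming
        # call returns every matching row in one round trip.
        row_set = RowSet()
        for key in row_keys[i:i + batch_size]:
            row_set.add_row_key(key)
        for row in table.read_rows(row_set=row_set):
            yield row

Note that every key added to the RowSet grows the serialized ReadRowsRequest, which is why chunking matters: a batch_size small enough to keep each request under the server's limit avoids the 400 "message too large" error seen at 20,000 while preserving nearly all of the speedup.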