Stream Processing In The Cloud At Google Scale

Google's BigQuery platform is capable of brute-forcing its way through datasets with phenomenal speed. Performing a simple scan of four LIKE clauses OR'd together over a 631GB field spread across 1.1 billion records, while summing the total size of its three paired columns (330GB, 9.7GB and 13GB), takes just 23 seconds as a case-sensitive search or just 29 seconds as a case-insensitive search.

That works out to around 21.7GB/s to 27GB/s of searched data and 217GB/s of processed data with a single SQL query for a completely unoptimized, unpartitioned table.
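To make the shape of that query concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The table, field and column names are hypothetical stand-ins rather than the actual dataset used for the timings above; the case-insensitive variant simply wraps the searched field and terms in LOWER().

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table/column names: `textfield` is the large searched
    # column, colA/colB/colC are the paired columns whose sizes are summed.
    sql = """
    SELECT
      COUNT(*) AS matches,
      SUM(BYTE_LENGTH(colA) + BYTE_LENGTH(colB) + BYTE_LENGTH(colC)) AS total_bytes
    FROM `myproject.mydataset.mytable`
    WHERE textfield LIKE '%term1%'
       OR textfield LIKE '%term2%'
       OR textfield LIKE '%term3%'
       OR textfield LIKE '%term4%'
    """

    job = client.query(sql)          # runs the brute-force full-table scan
    for row in job.result():
        print(row.matches, row.total_bytes)
    print("Bytes processed:", job.total_bytes_processed)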

What if you have more complex stream processing needs, such as running an arbitrary third-party executable over a vast dataset stored in Google's Cloud Storage?

In our own work, a standard 4-core 16GB N2 or C2 VM on GCP's Compute Engine platform, streaming from Google's Cloud Storage into low-CPU, highly optimized stream processing code, typically achieves sustained read rates of around 1.87-1.9GB/s from GCS per VM, which would allow it to perform the same 631GB scan as above in just 5.5 minutes with a single VM. Scale that across multiple VMs using best practices for object name sharding and it's not hard to achieve some pretty breathtaking stream processing rates.
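A minimal sketch of that pattern, assuming the google-cloud-storage Python client and hypothetical bucket, packfile and executable names, streams a large GCS object in large chunks directly into a third-party tool's stdin so the data never touches local disk:

    import subprocess
    from google.cloud import storage

    CHUNK = 16 * 1024 * 1024  # read in large chunks to keep throughput high

    client = storage.Client()
    blob = client.bucket("my-archive-bucket").blob("packfiles/shard-0001.pack")

    # Hypothetical third-party executable that consumes the stream on stdin.
    proc = subprocess.Popen(["./third_party_scanner", "--stdin"],
                            stdin=subprocess.PIPE)

    with blob.open("rb") as gcs_stream:      # streaming read from GCS
        while True:
            chunk = gcs_stream.read(CHUNK)
            if not chunk:
                break
            proc.stdin.write(chunk)

    proc.stdin.close()
    proc.wait()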

Using Cloud Storage as a storage fabric for stream processing, rather than local disk, allows us to scale effectively without limit, both in the total data size processed and in the number of cores brought to bear on a given computation, which scales nearly linearly. It also makes it easier to centralize datasets that are compiled in a decentralized fashion, with regional ingest points all streaming their results into a single central archive.
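The name sharding mentioned above can be as simple as hashing each object name and assigning it to one of N worker VMs, so every VM independently lists the same prefix and processes a disjoint slice of the archive with no central coordinator. A minimal sketch, with hypothetical bucket, prefix and cluster size:

    import hashlib
    from google.cloud import storage

    def objects_for_this_vm(bucket_name, prefix, vm_index, num_vms):
        """Yield the object names this VM is responsible for.

        Every VM runs the same listing and keeps only the names whose
        hash falls in its slot, giving a roughly even, disjoint split.
        """
        client = storage.Client()
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            slot = int(hashlib.md5(blob.name.encode()).hexdigest(), 16) % num_vms
            if slot == vm_index:
                yield blob.name

    # Example: VM 3 of a 64-VM cluster working through the packfile archive.
    for name in objects_for_this_vm("my-archive-bucket", "packfiles/", 3, 64):
        print(name)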

We typically repack datasets from large numbers of small files into single large packfiles for our large streaming pipelines. Like most filesystems and object stores, Google's Cloud Storage achieves its fastest performance when streaming a small number of large files rather than accessing many small files. In our own extensive benchmarking, when streaming a small number of large files to N2 and C2 VMs within the United States, we see statistically insignificant differences between regional, dual-region and multiregion storage localities. This allows us to use the US multiregion exclusively and distribute our processing clusters nationally, both for resilience and to take advantage of localized services in each region that connect to some of our external data providers' POPs, which transit the commodity internet and thus require geographic localization. In short, Google's internal cloud backbone is so high-bandwidth that geographic location is no longer relevant for large-file stream processing in the US multiregion.
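One way to perform that repacking without ever pulling the data down to a VM is GCS's server-side compose operation, which concatenates up to 32 objects per call and can be chained. The sketch below uses hypothetical bucket, prefix and packfile names, and assumes all source objects share a storage class, as compose requires:

    from google.cloud import storage

    def repack_prefix(bucket_name, prefix, dest_name):
        """Fold all objects under `prefix` into a single packfile."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        sources = [b for b in client.list_blobs(bucket_name, prefix=prefix)
                   if b.name != dest_name]
        if not sources:
            return
        dest = bucket.blob(dest_name)
        # Compose accepts at most 32 components per call, so create the
        # packfile from the first batch, then keep appending by reusing
        # the growing destination as the first component of each call.
        dest.compose(sources[:32])
        for i in range(32, len(sources), 31):
            dest.compose([dest] + sources[i:i + 31])
        # Note: compose does not delete the source objects.

    repack_prefix("my-archive-bucket", "ingest/raw/", "packfiles/ingest-raw.pack")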