Mass Ingest Workflows With Google's Cloud Storage Streaming Transfers

Beneath all massive data-driven analyses lie vast archives of data that in many cases originate outside the cloud. Google's Cloud Storage (GCS) makes working with even petascale datasets trivial, but getting the data into cloud storage from external data warehouses can often be difficult, especially if it involves shipping hundreds of terabytes of medium-sized files from an external warehouse into cloud storage. Historically, one approach to performing such massive network-based transfers was an iterative process in which Compute Engine VMs were created with large scratch disks to stage incoming files in batches, ship each batch to GCS, clear the local cache and repeat.

It turns out that Google Cloud Storage has a powerful trick up its sleeve to support such mass ingest workflows without the need for local disk: streaming transfers. Using streaming transfers, a GCE instance can simply pipe the streaming output of curl or wget directly into gsutil, downloading the remote file as a stream and writing that stream directly to an object in GCS, with the instance essentially acting as a relay without any data ever touching local disk.
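At its simplest, a streaming transfer is just a single pipe in which gsutil reads the object's contents from standard input (the URL, bucket and [FILENAME] placeholder below are illustrative, following the same conventions as the full command later in this post):

wget -q -O - https://someremotesite/[FILENAME] | gsutil cp - gs://yourpathhere/[FILENAME]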

For example, using a 4-core VM with 50GB of RAM and just a 10GB root disk, wget and GNU parallel can be combined with gsutil in a single shell command to live-stream an essentially limitless volume of remote files into GCS. In the command below, FILES.TXT contains the list of files on the remote website, the header contains the authentication tokens for the requests, mbuffer smooths the connection between input and output and gsutil streams the results out to GCS as they arrive. Adjust the number of parallel tasks based on the capabilities of the remote server and the desired transfer rate. In our experience, streaming rates of 400MB/s per machine from the remote server to GCS (800MB/s of total bandwidth, with 400MB/s incoming from the remote server and 400MB/s outgoing to GCS) are routine even with a small number of parallel threads and significant bandwidth limitations on the remote server. Of course, given GCS' linear scaling, as many relay VMs as the remote server can handle can be used to maximize the overall transfer rate, with the file list split across them as sketched after the command below.

time cat FILES.TXT | parallel -j [NUMPARALLELFETCHES] --eta "wget -q -O - --header '[AUTHORIZATIONTOKENSHERE]' https://someremotesite/{} | mbuffer -q -m 50M | gsutil -q cp - gs://yourpathhere/{}"
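To scale out across multiple relay VMs, one simple approach (a sketch; the CHUNK naming is arbitrary) is to use GNU split to divide FILES.TXT into one chunk per VM without breaking lines, here four chunks with numeric suffixes:

split -n l/4 -d FILES.TXT CHUNK.

Each relay VM then runs the command above with its own chunk (CHUNK.00, CHUNK.01 and so on) in place of FILES.TXT.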

Note that to maximize speed even further, the ingest VMs should be placed in the region geographically closest to the remote server. For example, when ingesting from a data center in California into GCS, the fastest ingest speed will be seen by VMs placed in us-west2 (Los Angeles). Placing those VMs in us-west1 (Oregon) typically reduces speed by around 30%, while placing the ingest VMs in us-central1 (Iowa) yields speeds 2-3 times slower in our experience. Given Google Cloud's global footprint, it is typically easy to find a GCP data center close to the data archive to be ingested. Most importantly, Google's global internal networking means that once the data enters Google's network, the speed at which it can be written to GCS is typically fast enough that the limiting factor is the remote data archive, so the VMs' physical location need only be chosen with the remote archive in mind, rather than any GCS considerations.
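As a sketch (the instance name, zone and machine type here are illustrative rather than a recommendation), a relay VM for a California-based archive could be created in us-west2 with nothing more than a small root disk, picking a machine type with enough RAM to hold the mbuffer buffers of all parallel tasks:

gcloud compute instances create ingest-relay-1 --zone=us-west2-b --machine-type=e2-highmem-4 --boot-disk-size=10GB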

In the end, once again GCS makes it trivial to ingest and manage the vast datasets that define today's largest analyses.