Behind The Scenes: The Power Of Simple Command Line Tools At Cloud Scale

Most discussions of cloud-scale infrastructure focus on massive exotic architectures that rely on highly specialized tools. It is worth pointing out, however, just how powerful simple everyday command line tools can be in the cloud.

For example, you'll notice that many of the examples in our blog posts use GNU parallel. For simple explorations we typically use a cluster of 64-core GCE VMs and shard the work across them by listing all of the commands in a text file, then using "shuf" to randomize them and "split" to divide them into one file per node. The task files are copied to each VM via GCS, and each node runs its file through parallel, with results being written back to GCS.
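As a minimal sketch of the sharding step described above, the following builds a toy task list, shuffles it, and splits it into one file per node. The node count, the file names (TASKS, TASKS.SHUF, TASKS.00 and so on), and the "echo" tasks are all hypothetical stand-ins rather than our actual setup, and the upload and parallel steps are left as comments because they need a real bucket and a real VM.

```shell
#!/usr/bin/env bash
set -euo pipefail

NODES=4                  # hypothetical number of VMs
WORKDIR=$(mktemp -d)     # scratch directory for this demo
cd "$WORKDIR"

# Build a toy task list: one shell command per line.
for i in $(seq 1 20); do
  echo "echo processing item $i"
done > TASKS

# Randomize the order so that slow tasks are spread across nodes,
# then shard on line boundaries into one file per node.
shuf TASKS > TASKS.SHUF
split -n l/"$NODES" --numeric-suffixes TASKS.SHUF TASKS.

ls TASKS.[0-9]*

# Upload step (needs a real bucket, so it is left commented out):
# gsutil -m cp TASKS.[0-9]* gs://[BUCKET]/

# On node 00, each line of its shard would then be run as a command:
# parallel --joblog ./PAR.LOG < TASKS.00
```

Note the "l/" prefix passed to "split": it keeps whole lines together, so no command is cut in half at a shard boundary.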

In fact, this trivial workflow forms the backbone of much of our initial exploratory work. Importantly, we can distribute the compute nodes across all of GCP's global data centers to take advantage of unique hardware availability in different regions, such as specific GPUs or CPUs, or, where nodes must access external datasets, to place each node closest to the data center housing the data it is processing. Using GCS as the common backbone means both compute and data share a seamless global namespace.

The same kind of command line chaining used by system administrators the world over is just as powerful when applied to cloud-scale workflows. For example, we recently needed to take a massive list of MP4 video files named in the format "CHANNEL_YYYYMMDD_…", extract the date of each, collapse them into a single master list of all unique dates for which there were videos, then shard that list across a set of 64-core VMs, with the results being written to GCS directly from each. The result is the following:

time gsutil -q ls gs://[BUCKET]/*.mp4 > LIST
wc -l LIST
perl -ne 'print "$1\n" if /_(\d\d\d\d\d\d\d\d)_/' LIST | sort | uniq > DATES
wc -l DATES
split -n l/20 --numeric-suffixes DATES CMDS.
gsutil -m cp CMDS.* gs://[BUCKET]/

mkdir /dev/shm/script/
mkdir /dev/shm/script/tmp/
gsutil -m cp gs://[BUCKET]/CMDS.* .
cat CMDS.x | parallel --tmpdir /dev/shm/script/tmp/ --resume --joblog ./PAR.LOG --eta './ {}'

We use gsutil to compile a single master list of all of the MP4 files in the bucket. A Perl one-liner then extracts the date from each filename and prints it to stdout, which is piped through "sort" and "uniq" and written to DATES. Next we shard DATES into 20 roughly equal batches using "split" and write the per-node command files to GCS. Finally, on each 64-core VM we run the script over that node's portion of the total, using a RAM disk as scratch space, with data being read from and written to GCS.

While this workflow has a number of limitations, including the inability to rebalance work when some nodes finish sooner than others, it has the advantage of being trivial to set up and run with just a few lines of code, making it ideal for quick "what if" experimentation.