There are three primary ways to access files in GCS: using the native GCLOUD CLI (which replaces GSUTIL), using the official GCS client library for supported programming languages, and using the native GCS JSON API directly. Which of these offers the highest performance for listing files matching a specific wildcard in a medium-sized bucket (half a billion files) in the US multiregion when accessed from both the US and Western Europe?
Unsurprisingly, the GCLOUD CLI is the slowest, with the Python library offering a 2-4x speedup. Using CURL to access the GCS JSON API directly yields blazingly fast performance, with European VM accesses that are 8.2x faster than GCLOUD and 4.6x faster than Python and US VM accesses that are 16.2x faster than GCLOUD and 4.3x faster than Python. From a US-WEST1 VM, we can perform a listing operation on a folder prefix containing 10 million files in a bucket containing half a billion files and return the 150 matching results in around 0.072 seconds, while even a EUROPE-WEST2 VM takes just 0.160 seconds – making it almost fast enough to operate as a primitive queuing system.
For the purposes of this brief experiment, we'll use a bucket that contains around half a billion files and a "folder" prefix under that bucket that contains around 10 million files. We will simply perform a file listing asking for all files matching a specific wildcard.
First, let's test GCLOUD STORAGE. The old "gsutil" utility is officially deprecated and gcloud storage offers performance gains over gsutil. We'll report the median of 50 sequential runs from a 64-core N1 VM in US-WEST1 (Oregon) and EUROPE-WEST2 (London) against a bucket with 500 million files in the US-MULTI region. In this case, "gs://bucket/folder/" contains roughly 10 million files and our specific wildcard below matches 150 files. Here we can see that despite nearly half a billion files in the bucket, our listing returns in around 1.2-1.3 seconds. Notably, there is only a slight difference in timing between the two VMs:
time gcloud storage ls gs://bucket/folder/ABC*.zip | wc -l
#EUROPE-WEST2: 0m1.314s
#US-WEST1: 0m1.169s
While sufficient for many tasks, this isn't performant enough for more demanding applications. Let's try GCP's native Python library. Note that GCLOUD itself is written in Python, but here we'll test how GCS integration performs in an end-user application that is much simpler and more streamlined.
First we'll install the library:
pip install google-cloud-storage
Save the following as "gcstest.py":
from google.cloud import storage
import fnmatch

def list_blobs_with_prefix(bucket_name, prefix, delimiter=None):
    """Lists all blobs in the bucket that begin with the prefix and end in .zip."""
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)
    return [blob.name for blob in blobs if fnmatch.fnmatch(blob.name, "*.zip")]

# Replace 'bucket' with your bucket name and 'folder/ABC' with your prefix
# (the prefix is relative to the bucket and should not include the bucket name)
bucket_name = 'bucket'
prefix = 'folder/ABC'

matching_files = list_blobs_with_prefix(bucket_name, prefix)
print("Files matching the pattern:", matching_files)
Now we'll run the script. Note that the library must first obtain an access token before performing the listing, and the timing includes the substantial startup cost of the full Python interpreter. In a production application, these costs would be partially ameliorated by performing them once at startup, though a long-running script would still require periodic token refresh; here we are testing full cold-start performance, replicating a cronjob or similar process. The performance is nearly twice as fast as gcloud on the European VM and almost four times as fast on the US VM. Clearly, the native Python library is considerably faster, and even applications that must wrap a workflow in a CLI utility (such as supporting a COTS application) will be better served performance-wise by writing their own Python script for key tasks than by shelling out to gcloud:
time python3 ./gcstest.py
#EUROPE-WEST2: 0m0.739s
#US-WEST1: 0m0.313s
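To illustrate how a long-running process can amortize those startup costs, here is a minimal Python sketch, reusing the placeholder bucket and prefix names from above (replace them with your own), that constructs the storage client once and reuses it across repeated listings, leaving token refresh to the client library:

from google.cloud import storage
import fnmatch
import time

# Create the client once at startup; the library handles caching and
# refreshing the OAuth token across subsequent requests.
storage_client = storage.Client()

def list_matching(bucket_name, prefix, pattern="*.zip"):
    """Reuse the shared client to list blobs under a prefix matching a pattern."""
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix)
    return [blob.name for blob in blobs if fnmatch.fnmatch(blob.name, pattern)]

# Placeholder names: replace 'bucket' and 'folder/ABC' with your own.
start = time.time()
print(len(list_matching('bucket', 'folder/ABC')), "matching files")
print("Warm listing took", round(time.time() - start, 3), "seconds")

After the first call warms up the client and its token, subsequent calls pay only the listing latency itself rather than the interpreter and authentication overhead measured above.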
Finally, what about the native GCS JSON API? Presumably this should offer the fastest performance, since it has the lowest possible overhead. As a bonus, when running on a GCE VM, we have instant access to an access token for authentication via the metadata server. For an even greater speedup, we'll use the "fields" parameter to ask the API to return only the filenames rather than the full JSON record for each file – this yields considerable performance gains. As expected, this delivers the best performance by a considerable margin: European accesses are 8.2x faster than GCLOUD and 4.6x faster than Python, while US accesses are 16.2x faster than GCLOUD and 4.3x faster than Python. Note that the API returns a maximum of 1,000 results per page by default (this can be raised using maxResults, but that limit is not always fully honored by the API) and can return fewer, so ingesting all results requires parsing the pagination token and making subsequent requests; the command below is therefore a simplified workflow designed for use cases with only a few matching results.
time curl -X GET \
  -H "Authorization: Bearer $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq -r '.access_token')" \
  "https://storage.googleapis.com/storage/v1/b/bucket/o?prefix=folder/ABC&fields=items(name)&matchGlob=folder/ABC*zip"
#EUROPE-WEST2: 0m0.160s
#US-WEST1: 0m0.072s
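If a listing can return more than a single page of results, the pagination token must be followed. Below is a minimal Python sketch, again using the placeholder bucket and prefix from above and assuming it runs on a GCE VM with the requests library installed, that fetches an access token from the metadata server and loops over nextPageToken until all matching names have been collected:

import requests

# Fetch an access token from the GCE metadata server (only available on a GCE VM).
token = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
    headers={"Metadata-Flavor": "Google"}).json()["access_token"]

# Placeholder bucket and prefix: replace with your own values.
url = "https://storage.googleapis.com/storage/v1/b/bucket/o"
params = {"prefix": "folder/ABC", "matchGlob": "folder/ABC*zip",
          "fields": "items(name),nextPageToken", "maxResults": 1000}

names = []
while True:
    resp = requests.get(url, params=params,
                        headers={"Authorization": "Bearer " + token}).json()
    names.extend(item["name"] for item in resp.get("items", []))
    # Stop once the API no longer returns a pagination token.
    if "nextPageToken" not in resp:
        break
    params["pageToken"] = resp["nextPageToken"]

print(len(names), "matching files")

Note that nextPageToken must be requested explicitly in the "fields" parameter, since restricting the response to items(name) alone would strip it out and end the loop after the first page.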