Scaling In The Cloud: Storing Billions Of Files Totaling Petabytes In GCS

One of the most remarkable aspects of working at "cloud scale" is the sheer scalability of the modern public cloud. Nowhere is this more apparent than in the object storage systems that power its global storage fabrics, such as Google Cloud Storage (GCS). Just how scalable is GCS? We have single buckets that contain billions of files totaling multiple petabytes. Moreover, these buckets are accessible by our entire global GCE VM fleet, meaning any of our VMs anywhere in the world can read and write to this central storage fabric.

Private Google Access support on our VPCs means that even VMs that are entirely isolated from the outside world, without any external IP address, can be configured with GCS access. The use of service accounts and IAM means we can enforce strict, fine-grained access controls over every single one of those billions of objects, configuring each individual file with unique access controls that govern its specific lifecycle needs.

Best of all, GCS handles this scale effortlessly, with entire fleets of VMs across the world reading and writing at full speed. Publicly shared computed metadata files, such as extracted entities, category labels, ngrams and sentiment scores, can be made available via dedicated public-access buckets that handle vast global internet traffic reading those files even as multiple global VM fleets write them behind the scenes.

The end result is an incredibly scalable global storage fabric that "just works" at a scale that opens the door to entirely new kinds of architectures. GDELT 5.0 makes full use of these architectures, and we will be unveiling more of them over the coming months on this blog, so stay tuned!
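To make the global read/write pattern above concrete, here is a minimal sketch of how a worker VM in the fleet might write a computed-metadata file into a shared bucket and how any other VM, anywhere in the world, could read it back. It uses the google-cloud-storage Python client; the bucket and object names are hypothetical placeholders rather than GDELT's actual layout, and on a GCE VM the client authenticates through the VM's attached service account, so no credentials appear in the code.

```python
from google.cloud import storage

# On a GCE VM the client authenticates via the VM's attached service
# account through the metadata server; no key file is needed. Private
# Google Access lets VMs with no external IP reach the GCS API.
client = storage.Client()

# Hypothetical bucket and object names, used purely for illustration.
bucket = client.bucket("example-gdelt-processing")
object_name = "entities/20240101/article-000001.entities.json"

# A worker VM writes one computed-metadata record (e.g. extracted entities).
bucket.blob(object_name).upload_from_string(
    '{"entities": ["example entity"]}',
    content_type="application/json",
)

# Any other VM in the global fleet sees the same bucket namespace and can
# read the object back, regardless of which region it runs in.
text = bucket.blob(object_name).download_as_text()
print(text)
```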
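In the same spirit, the sketch below shows one way a dedicated public-access bucket for computed metadata could be set up with the same client library: granting allUsers the storage.objectViewer role at the bucket level makes every uploaded object readable over plain HTTPS by global internet traffic, even as the VM fleets keep writing new files. The bucket name and IAM setup are assumptions for illustration, not GDELT's actual configuration.

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical dedicated public-access bucket for computed metadata.
public_bucket = client.bucket("example-gdelt-public-metadata")

# Grant anonymous read access at the bucket level, so every object
# uploaded here becomes world-readable over HTTPS.
policy = public_bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"allUsers"}}
)
public_bucket.set_iam_policy(policy)

# A worker VM publishes a computed-metadata file; anyone can now fetch it
# at https://storage.googleapis.com/example-gdelt-public-metadata/<name>.
blob = public_bucket.blob("ngrams/20240101/ngrams.json")
blob.upload_from_string('{"ngrams": []}', content_type="application/json")
```

Where uniform bucket-level access is not enabled, individual objects can instead be made world-readable one at a time (for example with the client's blob.make_public() helper), which lines up with the per-object access-control model described above.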