The GDELT Project

The Benefits Of Bigtable As A Digital Twin Over GCS

Given that our past benchmarks have demonstrated the extreme scalability and speed of GCS's native prefix-based search capabilities, why use Bigtable as a digital twin over our GCS archives? The short answer is that while GCS offers extremely fast prefix-based object listings and existence checks, actually returning object content is vastly slower, especially for large volumes of random reads. Thus, storing key summary and status metadata as a JSON blob in a Bigtable index over that GCS archive lets us perform the same prefix-based searches and existence checks, while also retrieving the metadata itself at scale through batch reads.

For example, for our multi-petabyte archive of tens of millions of minutes of video spanning millions of discrete MP4 and MPG files, we store the complete video files and their associated XML and JSON metadata files in GCS and store all of the key metadata about each broadcast (ranging from technical format details to sourcing information) as a JSON blob in our Bigtable digital twin. This allows us to instantly retrieve our complete holdings for a given day using a date prefix, whether to compile a list of GCS objects that need additional processing, to find the objects that match a given set of advanced search criteria, or simply to inventory our holdings at scale.
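As a rough illustration of that date-prefix workflow, here is a minimal Python sketch. It is not our production code: the row key layout (a YYYYMMDD date followed by other identifiers), the "meta" column family, the "json" qualifier and the "gcs_uri" and "format" fields are all hypothetical names chosen for the example. The read_day function assumes the google-cloud-bigtable client library and an already opened Table object, while the two helper functions use only the standard library.

```python
import json
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical schema (an assumption for this sketch): row keys begin with a
# YYYYMMDD date, and each row stores one JSON blob in family "meta", qualifier "json".
FAMILY = "meta"
QUALIFIER = b"json"


def prefix_range(prefix: str) -> Tuple[bytes, bytes]:
    """Return [start, end) row-key bounds covering every key that begins with prefix."""
    start = prefix.encode("utf-8")
    # Drop trailing 0xff bytes, then increment the last remaining byte to get
    # the smallest key that sorts after every key sharing the prefix.
    trimmed = start.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("prefix must contain at least one byte below 0xff")
    end = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return start, end


def select_gcs_objects(
    rows: Iterable[Tuple[bytes, bytes]],
    matches: Callable[[dict], bool],
) -> Iterator[str]:
    """Yield the GCS URI of every (row_key, json_blob) pair whose metadata passes matches."""
    for _row_key, blob in rows:
        meta = json.loads(blob)
        if matches(meta):
            yield meta["gcs_uri"]  # hypothetical field pointing back at the GCS object


def read_day(table, yyyymmdd: str) -> Iterator[Tuple[bytes, bytes]]:
    """Scan one day's rows from a google-cloud-bigtable Table as (row_key, json_blob) pairs."""
    # Imported lazily so the helpers above work without the client library installed.
    from google.cloud.bigtable.row_set import RowSet

    start, end = prefix_range(yyyymmdd)
    row_set = RowSet()
    row_set.add_row_range_from_keys(start_key=start, end_key=end)
    for row in table.read_rows(row_set=row_set):
        # Cells are ordered newest first; take the latest version of the blob.
        yield row.row_key, row.cells[FAMILY][QUALIFIER][0].value
```

With a Table object in hand, something like select_gcs_objects(read_day(table, "20240115"), lambda m: m.get("format") == "mp4") would stream the GCS paths of that day's MP4 files without reading a single object from GCS itself. The filter runs client-side here purely for simplicity.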

Using GCS as the actual object store and Bigtable as the metadata store gives us an exceptionally performant and scalable digital twin over our GCS holdings.