Like other object-based storage systems, GCS has immensely powerful qualities, including effectively infinite capacity, strong resilience and considerable performance. Any authorized GDELT VM, serverless or managed process in any GCP datacenter anywhere on the planet can read and write to a set of core GCS buckets, creating a single planetary-scale storage fabric that we leverage to its full potential in our distributed architectures. We make especially heavy use of these capabilities when processing video: they allow us to globally distribute the processing of petabytes of video and their associated metadata and annotation datasets, load balancing across managed APIs and GPU and other accelerator-based clusters spanning GCP's global footprint, and shifting traffic from one datacenter to another as a given service becomes overloaded or to take advantage of unique hardware available only in a specific region.
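As a minimal illustration of that fabric, the sketch below shows a worker in one region writing an annotation to a core bucket and a worker anywhere else reading it straight back, using the standard google-cloud-storage Python client; the bucket and object names are hypothetical placeholders, not our actual layout.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-video-archive")  # hypothetical core bucket name

# A worker in any GCP region can write an annotation result...
bucket.blob("annotations/video123/asr.json").upload_from_string(
    '{"transcript": "..."}', content_type="application/json")

# ...and a downstream worker in any other region can read it straight back.
text = bucket.blob("annotations/video123/asr.json").download_as_text()
print(text)
```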
At the same time, like all object stores (and in fact all traditional storage systems), GCS is optimized for object-centric operations rather than inventory-centric operations. When tens or even hundreds of different microservices are all performing various actions on different objects in different states across an archive of hundreds of millions of files spanning petabytes, with all the attendant unexpected errors, temporary failures, network glitches and other challenges, it can be difficult to get a global sense of the exact state of the entire archive at any given moment.
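To make that distinction concrete, the only primitive the object store itself offers for answering a "global state" question is paging through object listings one entry at a time. The sketch below (with a hypothetical bucket layout) answers a single inventory question that way, which at the scale of hundreds of millions of objects means a full enumeration on every query.

```python
from google.cloud import storage

client = storage.Client()

def stem(name: str) -> str:
    # Reduce "videos/video123.mp4" or "annotations/ocr/video123.json" to "video123".
    return name.rsplit("/", 1)[-1].rsplit(".", 1)[0]

# The only way to ask the object store an inventory question is to walk the
# listings object by object -- a full enumeration at archive scale.
videos = {stem(b.name) for b in client.list_blobs("example-video-archive", prefix="videos/")}
ocr_done = {stem(b.name) for b in client.list_blobs("example-video-archive", prefix="annotations/ocr/")}

missing_ocr = videos - ocr_done
print(f"{len(missing_ocr)} videos still awaiting OCR")
```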
Historically, we managed the state of all of these operations through resilient global queue-based systems in which each microservice was responsible for its own operations and managed its own queues. Processing chains dependent on the results of other services would monitor those queues, leveraging all of the distributed queueing capabilities of the modern hyperscale cloud. At the same time, this approach has created growing challenges as we make ever-heavier use of advanced AI systems for ASR, OCR, object and activity detection and other advanced narrative analysis. Such systems are far more unstable and unpredictable than traditional software and even classical AI systems. GPU-equipped VMs are also less robust than CPU-only systems, and the tooling for transparent automatic migration is not yet mature, exposing workflows to the harsh realities of the underlying hardware.
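The historical pattern looked roughly like the sketch below, using Cloud Pub/Sub as a stand-in for the queueing layer; the project, topic and message fields are hypothetical and our real systems are considerably more involved.

```python
import json
from google.cloud import pubsub_v1

PROJECT = "example-project"  # hypothetical project ID

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

def announce_asr_done(video_id: str) -> None:
    # The ASR service announces each completed video on its own topic.
    topic = publisher.topic_path(PROJECT, "asr-completed")
    publisher.publish(topic, json.dumps({"video_id": video_id}).encode("utf-8"))

def on_asr_done(message: pubsub_v1.subscriber.message.Message) -> None:
    # A downstream service watches that queue and enqueues its own work
    # whenever an upstream result appears.
    video_id = json.loads(message.data)["video_id"]
    print(f"queueing entity extraction for {video_id}")
    message.ack()

subscription = subscriber.subscription_path(PROJECT, "entity-extraction-on-asr")
future = subscriber.subscribe(subscription, callback=on_asr_done)
# future.result() would block here to keep the worker pulling messages.
```

Each service owns its own topics and state, which is precisely what makes any cross-cutting view of the whole pipeline so hard to assemble.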
Most concerning of all, modern models can fail in ways that are extremely difficult to catch through simple isolated checks. Detecting these failures can require aggregating and comparing multiple outputs of multiple models over multiple passes in dedicated downstream failure detection processes, which must then be able to requeue jobs to multiple services (often with their own downstream dependencies) in complex workflows.
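A stripped-down example of that kind of detector is sketched below: it compares the transcripts produced by two independent ASR passes and requeues the video when they diverge. The similarity metric, threshold and requeue topic are illustrative placeholders rather than our actual logic.

```python
import difflib
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
REQUEUE_TOPIC = publisher.topic_path("example-project", "asr-requeue")  # hypothetical

def transcripts_agree(a: str, b: str, threshold: float = 0.85) -> bool:
    # Crude stand-in for a real agreement metric between two model outputs.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def check_video(video_id: str, transcript_a: str, transcript_b: str) -> None:
    if not transcripts_agree(transcript_a, transcript_b):
        # Disagreement suggests one (or both) models failed silently, so the
        # job is requeued and its downstream dependents will rerun as well.
        payload = json.dumps({"video_id": video_id, "reason": "asr_disagreement"})
        publisher.publish(REQUEUE_TOPIC, payload.encode("utf-8"))
```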
In addition, foundation models are constantly updated, meaning new kinds of global oversight workflows are required to monitor the overall performance of entire complex analytic systems and evaluate whether underlying models are changing in ways that might break downstream assumptions or require adjustments to workflows, such as prompt changes or preprocessing tuning.
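One hypothetical form such an oversight check might take is sketched below: replay a fixed reference set through the current model, compare an aggregate score against the last accepted baseline, and flag drift beyond a tolerance for human review. The scoring function, tolerance and model call are placeholders for whatever each workflow actually uses.

```python
from statistics import mean

def run_model(item):
    # Placeholder for the actual call to the (updated) foundation model.
    ...

def score(output, expected) -> float:
    # Placeholder for a task-specific quality metric.
    ...

def drift_check(reference_set, baseline_score: float, tolerance: float = 0.02) -> bool:
    # Replay a fixed reference set and compare the aggregate score against
    # the last accepted baseline; a drop beyond the tolerance flags that the
    # model may have changed in ways that break downstream assumptions.
    current = mean(score(run_model(item), expected) for item, expected in reference_set)
    return (baseline_score - current) <= tolerance
```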
All of these needs are difficult to meet in a traditional object store + queue model, in which it is extremely costly and time-consuming to get a single global view of the state of the entire system and of every single object across tens to hundreds of millions of objects spanning petabytes.
Thus, over the coming months you'll be reading more about our launch of a set of new digital twin systems that act as mirrors over our entire global video processing infrastructure, accurately reflecting the state of every single object and task and enabling whole-of-system visibility for the new generation of oversight and self-healing infrastructure we are bringing online.
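We haven't yet published the schema or storage backend behind these twins, but conceptually each object in the archive gets a mirror record that every microservice updates as it works, so whole-of-system questions become simple queries rather than exhaustive scans. The sketch below is purely illustrative of that idea, with assumed fields and an in-memory store standing in for the real system.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectTwin:
    object_name: str                                     # GCS object this twin mirrors
    tasks: dict[str, str] = field(default_factory=dict)  # e.g. {"asr": "done", "ocr": "failed"}
    last_updated: str = ""                               # timestamp of the latest state change

twins: dict[str, ObjectTwin] = {}  # in-memory stand-in for the real twin store

def record_task_state(object_name: str, task: str, state: str, when: str) -> None:
    # Every microservice reports its state transitions here, so a single
    # query can answer whole-of-system questions without scanning GCS.
    twin = twins.setdefault(object_name, ObjectTwin(object_name))
    twin.tasks[task] = state
    twin.last_updated = when

# e.g. "which objects currently have a failed task anywhere in the pipeline?"
failed = [t.object_name for t in twins.values() if "failed" in t.tasks.values()]
```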
Stay tuned for more to come on this exciting new evolution.