
Migrating a new idea from proof-of-concept one-off experiments powered by a hodgepodge of scripts, notebooks and cobbled-together tools into a robust at-scale production workflow is often one of the most technically complex tasks in analyzing large datasets. It requires not only scaling up the underlying code and making difficult judgement calls around accuracy/resolution/tractability tradeoffs, but also building pipelines that are automatically scalable and fault tolerant. Yet perhaps the most complex workflows of all involve scaling those production pipelines even further to process entire archives within short time frames, pushing the very limits of even the public hyperscale cloud.
For example, many of GDELT's societal-scale research initiatives involve processing immense petascale archives at full resolution and full scale under very short deadlines through various AI tools and APIs. While API quotas can be increased, archive-scale analyses require throughputs far in excess of what any shared public API can support. They therefore require a delicate, carefully tuned orchestration of processes running across massive clusters of machines geographically distributed across the world, maintaining a steady-state throughput of requests to each API that saturates it to just under its maximum quota.
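To make the arithmetic behind "just under the maximum quota" concrete, here is a minimal sketch of dividing a shared API quota across a fleet of submission workers with a small safety margin. The quota value, headroom factor and worker count below are purely hypothetical illustrations, not GDELT's actual numbers.

```python
# Hypothetical illustration: dividing a shared API quota across a fleet of
# submission workers so their combined rate stays just under the limit.
# All constants here are made up for the example.

GLOBAL_QUOTA_PER_MINUTE = 60_000   # assumed API quota: requests/minute
SAFETY_HEADROOM = 0.95             # target 95% of quota to absorb jitter
NUM_SUBMISSION_WORKERS = 500       # assumed fleet size

def per_worker_rate() -> float:
    """Requests per second each worker may issue."""
    target_rps = (GLOBAL_QUOTA_PER_MINUTE / 60.0) * SAFETY_HEADROOM
    return target_rps / NUM_SUBMISSION_WORKERS

def per_worker_interval_us() -> int:
    """Microseconds each worker must wait between successive API calls."""
    return int(1_000_000 / per_worker_rate())

if __name__ == "__main__":
    print(f"{per_worker_rate():.2f} requests/sec per worker")
    print(f"{per_worker_interval_us()} microseconds between calls")
```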
Hyperscale workloads at the upper bound of what is possible today often must distribute their load across all available physical regions in which an API or GPU/accelerator is offered in order to maintain maximum throughput. There are only so many hardware accelerators available in any given region, both for customer workloads and for hosted APIs. For particularly latency-sensitive workloads, GDELT maintains internal monitoring tables that record the median and quartile latencies of API submissions across each GCP region around the world and automatically balances our workloads across the world accordingly, rebalancing each region's share every few seconds.
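As a rough sketch of what this kind of latency-aware rebalancing might look like (the `fetch_latencies` and `apply_weights` hooks below are hypothetical stand-ins for the monitoring tables and load balancer configuration), each region can be weighted inversely to its median API latency and the traffic shares recomputed every few seconds:

```python
import time
from typing import Dict

def compute_region_weights(median_latency_ms: Dict[str, float]) -> Dict[str, float]:
    """Weight each region inversely to its median API latency and normalize
    so the weights sum to 1.0, giving each region's share of traffic."""
    inverse = {region: 1.0 / ms for region, ms in median_latency_ms.items() if ms > 0}
    total = sum(inverse.values())
    return {region: w / total for region, w in inverse.items()}

def rebalance_loop(fetch_latencies, apply_weights, interval_s: float = 5.0):
    """Periodically refresh traffic shares from the latest latency readings.
    fetch_latencies() and apply_weights() are hypothetical hooks, not real APIs."""
    while True:
        weights = compute_region_weights(fetch_latencies())
        apply_weights(weights)
        time.sleep(interval_s)

if __name__ == "__main__":
    # Example with made-up latency readings:
    sample = {"us-central1": 120.0, "europe-west4": 95.0, "asia-northeast1": 180.0}
    print(compute_region_weights(sample))
```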
One of the most complex orchestration tasks involves maintaining API submission rates just below the maximum quota of each API across thousands or tens of thousands of processing engines. Within each multicore VM we use standard kernel-provided mechanisms for coordinating and rate-limiting across all processes running under that SSI. Rather than count the number of submissions per second, which results in bunching at the start of each second, we use a lock-protected microsecond-resolution timer that schedules each API call so that submissions are distributed precisely evenly across each second.
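A minimal sketch of this even-spacing approach is shown below, assuming a single shared limiter per VM. For simplicity it uses a process-local thread lock rather than the kernel-level cross-process coordination the production system relies on: instead of counting calls per second, each caller claims the next slot on a fixed-interval timeline, so submissions spread evenly across the second rather than bunching at its start.

```python
import threading
import time

class EvenSpacingLimiter:
    """Schedules each call at a fixed interval so requests are spread evenly
    across each second instead of bunching at second boundaries. This sketch
    uses a thread lock; cross-process coordination would require
    kernel-provided primitives as described above."""

    def __init__(self, calls_per_second: float):
        self._interval = 1.0 / calls_per_second   # seconds between calls
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Claim the next free slot on the timeline, never one in the past.
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        # Sleep outside the lock so other callers can claim later slots.
        delay = slot - time.monotonic()
        if delay > 0:
            time.sleep(delay)

# Usage: pace hypothetical API submissions at 20 evenly spaced calls/second.
limiter = EvenSpacingLimiter(calls_per_second=20)

def submit(request):
    limiter.acquire()
    # ... issue the actual API call here ...
```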
At the cluster level, the biggest challenge is the extreme unpredictability of processing a given file. For television news broadcasts, two one-hour HD-resolution broadcasts with the same filesize and overall metadata characteristics might take wildly different times to process, making submission-time scheduling unworkable. Instead, we use a vast globally distributed fleet of machines to perform the computationally expensive processing stages such as transcoding into MP4 or MP3 formats or extracting frames and forming them into montages for OCR. A smaller set of dedicated regional clusters then handles only the API submission process, allowing them to maintain a fixed steady throughput with minimal variability. In short, the unpredictable high-compute stages are distributed across a very large computational fabric, with the final, more predictable API interaction stages run on much smaller dedicated clusters. Though, as we have noted in earlier discussions, even these submission clusters cannot be too small given the severe impact of network variability at extreme scale.
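A rough sketch of this two-tier split is below, with an in-memory queue standing in for whatever transport actually connects the two tiers, and `transcode_and_montage` and `submit_to_api` as hypothetical placeholders: the large fleet absorbs the unpredictable compute and simply enqueues finished artifacts, while the small submission cluster drains the queue at a steady, paced rate.

```python
import queue
import time

ready_for_api = queue.Queue()   # stands in for the real cross-cluster transport

def transcode_and_montage(broadcast):
    ...   # placeholder for the compute-heavy, unpredictable stage

def submit_to_api(artifact):
    ...   # placeholder for the quota-limited API submission

def heavy_worker(broadcast):
    """Runs in the large distributed fleet: duration is highly unpredictable."""
    artifact = transcode_and_montage(broadcast)
    ready_for_api.put(artifact)

def submission_worker(calls_per_second: float = 10.0):
    """Runs in the small dedicated cluster: drains the queue at a fixed pace,
    so API throughput stays steady regardless of upstream variability."""
    interval = 1.0 / calls_per_second
    while True:
        artifact = ready_for_api.get()
        submit_to_api(artifact)
        time.sleep(interval)
```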
In short, the journey from a haphazard experimental assortment of scripts capable of OCR'ing a single television news broadcast to a robust, production-grade, fault-tolerant hyperscale processing pipeline involves vastly more than merely hardening and scaling up the original system: it often requires entirely new designs and innovative, often extremely exotic architectures that bring to bear decades of lessons learned.
We're excited to continue sharing more of this journey soon, especially some of the architectural lessons we've learned and the unique ways we're deploying GCP infrastructure to meet these needs.