Using Digital Twins To Power Self-Monitoring & Self-Healing Infrastructure

Running massive global production systems and pushing the very boundaries of the possible requires bringing together myriad immense, complex and highly interdependent systems into cooperative, collaborative and highly resilient infrastructure systems. In the lab, a single Jupyter notebook or a few Python scripts are often all that's needed to complete an analysis. Transcribing 2.5 million hours of global video from 50 countries in 159 languages spanning petabytes in just 7 days using a SOTA LSM requires a very different kind of approach. GDELT makes use of hundreds, if not thousands, of different libraries, models, services and infrastructure components across its global footprint. As a slowly growing portion of this infrastructure requires specialized hardware, specialized managed services or is too large to self-host and uses managed APIs to access foundational models that evolve hourly, the complex interdependencies in these systems can become ever-harder to manage. A downed node or even a downed cluster can be trivially routed around in a modern cloud environment. A sudden undocumented or inadvertent behavioral change to a managed service or an out-of-the-blue foundational model update are far more problematic in that they introduce complexity outside the control of the developer.

One of the tremendous benefits of a live realtime infrastructure digital twin is that it allows the creation of new kinds of holistic monitoring services (rules-based, statistical and even LLM and LMM-based) that can observe the subtle behavioral adjustments of these changes and their downstream impacts on infrastructure in realtime, often at their first glimmers and long before their impacts have rippled out far enough to have significant impact. For certain kinds of impacts, self-healing can even become possible, such as autonomously switching to an older known model version, invoking a compatibility endpoint or temporarily scaling emergency self-hosted infrastructure. For other changes, such as a foundation model tweak that inadvertently elevates its guardrail false positive rate or introduces LLM infinite looping, this at the very least allows for early warning alerts and detailed diagnostics that can be used to help resolve the situation with the model team.

We are already starting to see early glimmers of how these new digital twins are allowing us to observe early complex interaction challenges in realtime and are excited to unveil over the coming months how we are using these new technologies for automated alerting and self-healing, including experimental systems that autonomously tune their own LLM and LMM prompts in reaction to breaking changes in the underlying foundational models.