Behind The Scenes: Launching A New Autohealing Unified Health Infrastructure

We are excited to announce today the launch of the new unified health infrastructure, with automatic monitoring and healing, that underpins GDELT 5.0. GDELT's footprint spans isolated few-threaded VM processes, large MMP SSI systems, and regional and even globally distributed infrastructure components, often with bleeding-edge compute, networking, storage, accelerator and other needs. Across that footprint, GDELT has long deployed bespoke, component-specific health infrastructure: every production component has had its own purpose-built autohealing system monitoring and repairing it, with repairs ranging from VM-local fixes to GCP control plane interaction.

This produced an exceptionally complex technical infrastructure that often caused conflicts among components. As GDELT has more heavily leveraged external libraries, from multimedia tooling for television news to API and infrastructure tooling, the landscape of failure points and failure modes has broadened to the point that it has become increasingly difficult to transfer new findings from one health system to others needing similar capabilities. As but one example, one widely-deployed multimedia library occasionally violates its hard resource limits in certain edge cases, in ways that can leave the underlying system in an undefined state that no longer responds according to specification, while Linux kernels can become unstable under extreme combined CPU+IO load. As we've uncovered these edge cases, it has become increasingly difficult to roll out mitigation strategies across such a diverse landscape of bespoke health systems. Thus, today we are unveiling our new unified infrastructure, in which each production process needing health monitoring advertises itself and its health parameters (TTL, responsiveness, monitoring endpoints, etc.) and provides detailed instructions on how to repair it under different conditions. This allows a single health infrastructure to take responsibility for each entire system and mediate among all of its myriad competing components, including prioritizing certain components and repairs.
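To make the advertise-and-repair pattern described above a little more concrete, here is a minimal Python sketch. It is purely illustrative and is not GDELT's actual implementation: the class names (`HealthAd`, `HealthRegistry`), the field names, the `"ttl_expired"` condition label and the "lower number means higher priority" rule are all assumptions made for the example. Each process advertises a TTL, a priority and a table of repair actions keyed by condition, and a single registry decides which stale components to repair and in what order.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class HealthAd:
    """What a process advertises about itself (all names here are hypothetical)."""
    name: str
    ttl_seconds: float                      # max allowed gap between heartbeats
    priority: int                           # assumption: lower number = repaired first
    repairs: Dict[str, Callable[[], str]]   # condition label -> repair action
    last_heartbeat: float = 0.0


class HealthRegistry:
    """A single monitor that mediates repairs across all advertised components."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._ads: Dict[str, HealthAd] = {}

    def advertise(self, ad: HealthAd) -> None:
        # Registering counts as the first heartbeat.
        ad.last_heartbeat = self._clock()
        self._ads[ad.name] = ad

    def heartbeat(self, name: str) -> None:
        self._ads[name].last_heartbeat = self._clock()

    def expired(self) -> List[HealthAd]:
        # Components whose TTL has lapsed, highest priority first.
        now = self._clock()
        stale = [ad for ad in self._ads.values()
                 if now - ad.last_heartbeat > ad.ttl_seconds]
        return sorted(stale, key=lambda ad: ad.priority)

    def heal(self, condition: str = "ttl_expired") -> List[Tuple[str, str]]:
        # Run the advertised repair for each stale component, in priority order.
        results = []
        for ad in self.expired():
            action = ad.repairs.get(condition)
            if action is None:
                results.append((ad.name, "no repair advertised"))
                continue
            results.append((ad.name, action()))
            ad.last_heartbeat = self._clock()  # treat a completed repair as a fresh start
        return results


if __name__ == "__main__":
    registry = HealthRegistry()
    registry.advertise(HealthAd(
        name="transcoder",
        ttl_seconds=0.0,  # expires almost immediately, just for the demo
        priority=1,
        repairs={"ttl_expired": lambda: "restarted transcoder"},
    ))
    time.sleep(0.01)
    print(registry.heal())
```

Because the repair logic travels with each component's advertisement, a mitigation learned for one component only needs to be expressed as a new entry in that table rather than reimplemented in a separate health system. A real deployment would presumably also need persistence, probing of monitoring endpoints and conflict resolution between repairs, none of which this sketch attempts.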

We'll be talking more about this exciting new infrastructure in the months to come, so stay tuned!