The GDELT Project

GEN4: GCE, Jupiter & Andromeda

GDELT has been running on GCP's Compute Engine platform since GCE's earliest prerelease days. Over the years, GDELT's internal architecture has evolved in step with GCE's spectacular growth in capability and performance.

When GDELT first launched with its GEN1 architecture, it ran in what was then GCE's only region, in the US, meaning its crawler and processor fleets were colocated in a single data center with all of the underlying GCP systems they used.

As GCE's global footprint rapidly expanded, GDELT GEN2 introduced GDELT's global fleet architecture. Instead of a single fleet in a single physical data center, GDELT now ran a globally distributed network of crawler fleets and leveraged their physical geography to overcome the increasingly fragmented and geotargeted modern web. This relied heavily on GCP's global networking architecture and the ability of any crawler anywhere in the world to access the totality of GCP's platform without regard to physical proximity. GCS' extreme performance and flexibility allowed GDELT to adopt it as a global storage fabric, with every node worldwide able to share streaming data of unlimited size with every other node through a common namespace.

GDELT GEN3 advanced this model even further, bringing a hyperscale design to GDELT's fleet architecture and connecting its formerly independent fleets into a larger many-to-many fabric capable of global-scale coordination and dynamic, self-learning, content-aware routing. It also began the process of leveraging GCP's hardware geography, moving key compute systems to different regions based on the unique CPU and accelerator platforms available in each. For example, one compute task might require an A100 accelerator available in only one region, while another might require a preview environment available only in a different region. Thanks to GCP's immense networking performance, each task could be placed in its respective region and the two could communicate as if they were operating in the same data center, complete with a shared storage namespace courtesy of GCS, with slightly increased latency as the only sign of their physical distance. Importantly, this removal of geographical constraints meant GDELT was now able to distribute compute tasks globally, including at the GCP platform level.
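To make the idea of hardware geography a little more concrete, here is a minimal sketch of how a task might be matched to a region by the hardware it needs while still writing to a region-independent storage path. This is purely illustrative and is not GDELT's actual scheduler: the region names, the hardware catalog, the `Task` class, the `place` and `shared_path` helpers, and the bucket name are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical catalog mapping region names to the hardware offered there.
# Neither the region names nor the hardware labels are real GCP identifiers.
REGION_HARDWARE = {
    "region-a": {"a100", "general-cpu"},
    "region-b": {"preview-env", "general-cpu"},
    "region-c": {"general-cpu"},
}


@dataclass(frozen=True)
class Task:
    name: str
    requires: str  # the single hardware capability this task needs


def place(task, catalog=REGION_HARDWARE):
    """Return the first region (in sorted order) that offers the required hardware."""
    candidates = sorted(
        region for region, hardware in catalog.items() if task.requires in hardware
    )
    if not candidates:
        raise LookupError(f"no region offers {task.requires!r}")
    return candidates[0]


def shared_path(bucket, task):
    """Build a storage URI that does not depend on where the task runs."""
    return f"gs://{bucket}/tasks/{task.name}/output"


gpu_task = Task(name="embed", requires="a100")
preview_task = Task(name="experiment", requires="preview-env")

placements = {t.name: place(t) for t in (gpu_task, preview_task)}
paths = {t.name: shared_path("example-shared-bucket", t) for t in (gpu_task, preview_task)}
```

In this sketch the two tasks land in different regions, yet both read and write under the same bucket prefix, which is the property the shared GCS namespace provides in the description above.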

In the few years since GEN3 launched, GCP's Jupiter and Andromeda networking have made enormous performance gains, with 100 Gbps connectivity now generally available. This has been coupled with major advances in GCP's VPC networking that have made incredible new architectures possible. GDELT GEN4 leverages all of these capabilities through a set of radical new global architectures that have opened the door to entirely new capabilities. Over the coming months we'll be writing at length on this blog about the new GEN4 architecture, so stay tuned!