
Leveraging Bigtable For Highly Scalable Digital Twin Architectures

As we continue to load our entire historical GCS archive into our Bigtable digital twin, Bigtable's remarkable scalability has allowed us to achieve considerable parallelism in our ingest workflows. For example, while populating our Visual Explorer display metadata into the digital twin, we ran more than 500 parallel processes on a small cluster of 128 C2 and C2D VMs. Each process downloads a set of files from GCS, performs a series of analytic tasks to generate a final digital twin record for the broadcast, and writes that record to Bigtable.

Given their balance of downloading, processing, and Bigtable work, these 500+ processes yield around 300 updates per second to our Bigtable cluster, on top of the load from ongoing realtime processing of new broadcasts and various background analytic tasks. Yet despite this load, our Bigtable cluster's CPU holds steady at around 5% utilization with no measurable error rate. As a fully managed service, there are no servers or processes to monitor or maintain, and new clusters can be seamlessly added over time as needed.

The ability of our Bigtable-based digital twin to absorb a steady 300+ updates per second, on top of realtime updates and background analytic tasks, all without exceeding 5% CPU load, demonstrates the incredible ability of modern cloud infrastructure to massively accelerate application development.
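To make the shape of each worker's loop concrete, below is a minimal Python sketch of one such ingest process, using the public google-cloud-storage and google-cloud-bigtable client libraries. The project, instance, table, column family, and analyze() names here are illustrative assumptions, not our actual pipeline code.

```python
# A minimal sketch of one ingest worker under the architecture described
# above. All identifiers below are illustrative placeholders.
from google.cloud import bigtable, storage

PROJECT_ID = "my-project"          # assumption: placeholder GCP project
INSTANCE_ID = "digital-twin"       # assumption: placeholder Bigtable instance
TABLE_ID = "broadcasts"            # assumption: placeholder table name
COLUMN_FAMILY = "metadata"         # assumption: placeholder column family

storage_client = storage.Client()
table = bigtable.Client(project=PROJECT_ID).instance(INSTANCE_ID).table(TABLE_ID)

def analyze(payloads: list[bytes]) -> dict[str, bytes]:
    """Stand-in for the analysis tasks; maps column names to cell values."""
    return {"file_count": str(len(payloads)).encode("utf-8")}

def ingest_broadcast(bucket_name: str, blob_names: list[str], row_key: str) -> None:
    """Download one broadcast's files from GCS, analyze them, and write
    the resulting digital twin record to Bigtable as a single row."""
    bucket = storage_client.bucket(bucket_name)
    payloads = [bucket.blob(name).download_as_bytes() for name in blob_names]

    record = analyze(payloads)

    row = table.direct_row(row_key.encode("utf-8"))
    for column, value in record.items():
        row.set_cell(COLUMN_FAMILY, column.encode("utf-8"), value)
    row.commit()  # one commit per broadcast: a single atomic row mutation
```

Because each broadcast maps to its own row and Bigtable mutations are atomic at the row level, hundreds of such workers can run in parallel without any cross-process coordination, which is what makes this degree of ingest parallelism straightforward to scale.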