Always remember that once data is ingested into the cloud, it can be made available anywhere at any scale. Thus, ingest should almost always be architecturally decoupled from processing when constructing cloud-scale infrastructure (absent cost and regulatory considerations). For GDELT itself, we ingest streams from our globally distributed crawler fleets and partners across the world via the data center nearest to them to maximize ingest rates, then process the data globally, taking advantage of the specific hardware platforms and APIs available in each region, completely decoupling ingest from processing.
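As a minimal sketch of this pattern, the snippet below decouples the two paths with an in-memory archive and queue standing in for what would really be an object store (such as GCS) and a managed messaging service (such as Pub/Sub); all names here are hypothetical, and the "processing" step is just a placeholder transform.

```python
import queue
import threading

# Hypothetical stand-ins: in a real deployment the archive would be an
# object store (e.g. GCS) and the queue a managed service (e.g. Pub/Sub).
archive = []                 # durable raw archive written at ingest time
work_queue = queue.Queue()   # decouples ingest rate from processing rate

def ingest(record):
    """Ingest path: archive the raw record, then hand off a pointer."""
    archive.append(record)            # 100% of incoming data is archived
    work_queue.put(len(archive) - 1)  # processing sees only a reference

def process_all(results):
    """Processing path: drains the queue at its own pace."""
    while True:
        idx = work_queue.get()
        if idx is None:               # sentinel: ingest stream closed
            break
        results.append(archive[idx].upper())  # stand-in for real work

results = []
worker = threading.Thread(target=process_all, args=(results,))
worker.start()
for rec in ["post-1", "post-2", "post-3"]:
    ingest(rec)
work_queue.put(None)
worker.join()
```

Because the two sides share only the queue, either can be scaled, relocated, or paused without touching the other.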
This has a number of benefits:
- Maximal Ingest Speed. Imagine a large dataset that must be ingested into the cloud from a data center in London, where the data will be processed using an A100 accelerator. The nearest A100-equipped region to London is the Netherlands. Thus, it might initially be tempting to run the ingest VMs in the Netherlands, since that is where the data will be processed. However, this actually yields far slower ingest rates than ingesting the data in europe-west2 (London) into GCS and then processing in europe-west4 (Eemshaven), which allows the data to enter GCP in London and transit GCP's internal network to Eemshaven for processing.
- Redundancy & Scaling. Decoupled architectures mean ingest and processing workloads can automatically scale independently of one another. Ingest capacity can scale automatically to absorb a sudden burst of incoming data (such as from a social media firehose), while processing scales separately within its typically more limiting constraints or switches to alternative behaviors like sampling. For example, imagine a social media stream processing workload built around neural models running on A100 accelerators. Under ordinary load, there might be a one-to-one mapping of ingest to processing, where each record is processed on the A100. However, during sudden massive bursts of new posts (such as global reaction to a breaking news event), the customer might not want to scale their A100 usage to maintain this one-to-one mapping due to cost considerations. In this case, their workflow might automatically switch to random sampling to maintain constant A100 load. However, they still want to archive 100% of the incoming posts for future historical processing. Thus, decoupling the two allows ingest to scale to ensure all data is archived, even as processing switches to sampling.
- Different Processing Needs. Decoupling ingest and processing allows workloads to make use of the unique resources available in each GCP region. For example, a downstream process might generate output files many times the size of the input data, consumed primarily by users in Australia, from data hosted in New York, suggesting an optimal design might involve ingest VMs in a us-east region and processing VMs in the Australian GCP region. Alternatively, different processes might make use of different accelerators and APIs available in different regions.
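The burst-sampling behavior described above can be sketched in a few lines. This is an illustrative toy, not GDELT's actual pipeline: `ACCEL_CAPACITY` and `plan_batch` are invented names, and the capacity figure is arbitrary.

```python
import random

ACCEL_CAPACITY = 100  # hypothetical records per batch an A100 pool can absorb

def plan_batch(incoming, capacity=ACCEL_CAPACITY, rng=random):
    """Archive everything; under burst load, randomly sample what the
    accelerators actually process so their load stays roughly constant."""
    archived = list(incoming)          # ingest always keeps 100% of records
    if len(archived) <= capacity:
        to_process = archived          # one-to-one mapping under normal load
    else:
        # Burst: sample down to capacity rather than scaling up A100 usage.
        to_process = rng.sample(archived, capacity)
    return archived, to_process

# Normal load: every record goes to the accelerators.
archived, to_process = plan_batch(range(50))

# Burst: all 100,000 records are archived, but only `capacity` are processed.
archived_burst, to_process_burst = plan_batch(range(100_000))
```

The key property is that the archive size tracks ingest while the processing batch size is capped, so cost stays bounded and no data is lost for later historical reprocessing.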