Behind The Scenes: Lessons Learned From Ingesting Large Datasets Into The Cloud

Given GDELT's immense global scale and a footprint that extends across nearly every GCP data center worldwide, we are frequently asked about the lessons we've learned building truly "cloud scale" infrastructure that leverages the unique capabilities of the modern cloud. One of the most common questions that begins these conversations revolves around how to get large datasets from local infrastructure into the cloud in the first place. We've touched on this a bit in previous posts, including ingest versus transformative ingest, streaming transfers, and ramdisk transcoding configurations. Often, though, the problem most organizations moving to the cloud face is far simpler: how to ship terabytes to petabytes of data from their local systems into the cloud environment. Here are a few of our most common recommendations.

To summarize, use the physical transfer appliance for the largest datasets, where local networking is insufficient to support the transfer. Otherwise, spin up a set of dedicated ingest VMs in the GCP region closest to the source data center and stream ingest all files using CURL->MBUFFER->GSUTIL. Typically a cluster of single-core N1 VMs is more than enough, but some workloads benefit from quad-core C2 VMs (benchmark to determine which works best for a given workload). No data should touch local disk: each file should flow directly from the remote server to GCS via streaming writes. When using shell scripts for transfers, check PIPESTATUS so that a failure in any stage of the pipeline is caught, and add redundant hard timeouts and other safety features. Write all streams to a temporary path in the destination bucket, verify their checksums, and only then move them to their final destination path.
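
To make that pattern concrete, here is a minimal bash sketch of what a single transfer in such a pipeline might look like. The URL, bucket, paths, buffer size, timeout values, and expected checksum below are all hypothetical placeholders rather than our actual configuration, and a real deployment would wrap this in retry logic, logging, and per-workload tuning.

    #!/bin/bash
    # Minimal sketch of the stream-ingest pattern described above. All names
    # (SRC_URL, BUCKET, FILE, EXPECTED_MD5_B64) are hypothetical placeholders,
    # and the buffer size and timeout values are illustrative, not recommendations.

    SRC_URL="https://source.example.com/archive/file.bin"   # remote file to ingest (placeholder)
    BUCKET="gs://example-ingest-bucket"                      # destination bucket (placeholder)
    FILE="file.bin"
    EXPECTED_MD5_B64="..."   # base64-encoded MD5 published by the data provider (placeholder)

    # Stream from the remote server through mbuffer into GCS as a streaming write
    # ("-" as the gsutil source), so no bytes ever touch local disk. Hard timeouts
    # act as a redundant safety net against hung transfers.
    curl -sf --max-time 3600 "$SRC_URL" \
      | mbuffer -q -m 256M \
      | timeout 3700 gsutil -q cp - "$BUCKET/tmp/$FILE"

    # PIPESTATUS records the exit code of every stage of the pipeline, not just the last.
    STATUS=("${PIPESTATUS[@]}")
    if [ "${STATUS[0]}" -ne 0 ] || [ "${STATUS[1]}" -ne 0 ] || [ "${STATUS[2]}" -ne 0 ]; then
      echo "Transfer failed: curl=${STATUS[0]} mbuffer=${STATUS[1]} gsutil=${STATUS[2]}" >&2
      exit 1
    fi

    # Verify the uploaded object's checksum before promoting it. gsutil stat reports
    # the object's MD5 base64-encoded; here it is compared against the provider's value.
    UPLOADED_MD5=$(gsutil stat "$BUCKET/tmp/$FILE" | awk '/Hash \(md5\)/{print $3}')
    if [ "$UPLOADED_MD5" != "$EXPECTED_MD5_B64" ]; then
      echo "Checksum mismatch for $FILE" >&2
      exit 1
    fi

    # Only after the checksum verifies is the object moved to its final destination path.
    gsutil -q mv "$BUCKET/tmp/$FILE" "$BUCKET/final/$FILE"

The key points this sketch illustrates are that mbuffer decouples the download and upload rates, PIPESTATUS catches a failure in any stage of the pipeline rather than only the last, and an object is only promoted out of the temporary path once its checksum has been verified.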