Behind The Scenes: Lessons Learned From Ingesting Large Datasets Into The Cloud

Given GDELT's immense global scale and a footprint that extends across nearly every GCP data center worldwide, we are frequently asked about the lessons we've learned building truly "cloud scale" infrastructure that leverages the unique capabilities of the modern cloud. One of the most common questions that begins these conversations revolves around how to get large datasets from local infrastructure into the cloud in the first place. We've touched on this a bit in previous posts, including ingest versus transformative ingest, streaming transfers, and ramdisk transcoding configurations. Often, though, the problem most organizations moving to the cloud face is far simpler: how to ship terabytes to petabytes of data from their local systems into the cloud environment. Here are a few of our most common recommendations.

To summarize, use the physical transfer appliance for the largest datasets, where local networking is insufficient to support the transfer. Otherwise, spin up a set of dedicated ingest VMs in the GCP region closest to the source data center and stream ingest all files using CURL->MBUFFER->GSUTIL. Typically a cluster of single-core N1 VMs is more than enough, but some workloads benefit from quad-core C2 VMs (benchmark to determine which works best for a given workload). No data should touch local disk: each file should flow directly from the remote server to GCS via streaming writes. When using shell scripts for transfers, check PIPESTATUS so that a failure in any stage of the pipeline is caught, and add redundant hard timeouts and other safety features. Write all streams to a temporary path in the destination bucket, verify their checksums, and only then move them to their final destination path.
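
To make that pattern concrete, here is a minimal bash sketch of what a single transfer in such a pipeline might look like. The URL, bucket, paths, buffer size, timeout values, and expected checksum below are all hypothetical placeholders rather than our actual configuration, and a real deployment would wrap this in retry logic, logging, and per-workload tuning.

    #!/bin/bash
    # Minimal sketch of the stream-ingest pattern described above. All names
    # (SRC_URL, BUCKET, FILE, EXPECTED_MD5_B64) are hypothetical placeholders,
    # and the buffer size and timeout values are illustrative, not recommendations.

    SRC_URL="https://source.example.com/archive/file.bin"   # remote file to ingest (placeholder)
    BUCKET="gs://example-ingest-bucket"                      # destination bucket (placeholder)
    FILE="file.bin"
    EXPECTED_MD5_B64="..."   # base64-encoded MD5 published by the data provider (placeholder)

    # Stream from the remote server through mbuffer into GCS as a streaming write
    # ("-" as the gsutil source), so no bytes ever touch local disk. Hard timeouts
    # act as a redundant safety net against hung transfers.
    curl -sf --max-time 3600 "$SRC_URL" \
      | mbuffer -q -m 256M \
      | timeout 3700 gsutil -q cp - "$BUCKET/tmp/$FILE"

    # PIPESTATUS records the exit code of every stage of the pipeline, not just the last.
    STATUS=("${PIPESTATUS[@]}")
    if [ "${STATUS[0]}" -ne 0 ] || [ "${STATUS[1]}" -ne 0 ] || [ "${STATUS[2]}" -ne 0 ]; then
      echo "Transfer failed: curl=${STATUS[0]} mbuffer=${STATUS[1]} gsutil=${STATUS[2]}" >&2
      exit 1
    fi

    # Verify the uploaded object's checksum before promoting it. gsutil stat reports
    # the object's MD5 base64-encoded; here it is compared against the provider's value.
    UPLOADED_MD5=$(gsutil stat "$BUCKET/tmp/$FILE" | awk '/Hash \(md5\)/{print $3}')
    if [ "$UPLOADED_MD5" != "$EXPECTED_MD5_B64" ]; then
      echo "Checksum mismatch for $FILE" >&2
      exit 1
    fi

    # Only after the checksum verifies is the object moved to its final destination path.
    gsutil -q mv "$BUCKET/tmp/$FILE" "$BUCKET/final/$FILE"

The key points this sketch illustrates are that mbuffer decouples the download and upload rates, PIPESTATUS catches a failure in any stage of the pipeline rather than only the last, and an object is only promoted out of the temporary path once its checksum has been verified.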