The GDELT Project

A Glimpse Behind The Scenes At How We Perform Mass Ingest And Transformation Workflows In The Cloud

How do we orchestrate our mass ingest and transformation workflows when performing some of our largest analyses? Once data is inside Google Cloud Platform (GCP) in the right format, the combination of tools like Google Cloud Storage (GCS), Compute Engine and BigQuery makes even the largest data processing workflows fairly straightforward and almost infinitely parallelizable. The real bottleneck typically lies in ingesting an external dataset into GCP, in transforming a dataset from one format into another, or in both. In the past we've talked about how we use streaming relay transfers for mass ingest into GCS and large Local SSD clusters for high-IO transformations. Here's a list of the primary architectures we've found most effective in the cloud for ingest and transformation workloads:

We hope these workflows are helpful and offer a glimpse of how to perform optimized large-scale ingest and transformation workflows in GCP!
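
To make the streaming relay idea mentioned above a little more concrete, here is a minimal sketch in Python. It is not our production code: the `relay` function, its parameters and the 8 MiB default chunk size are hypothetical names and values chosen purely for illustration. The sketch assumes the source and the sink are both file-like binary streams (in practice the source might be a remote download and the sink a writable handle to a GCS object) and shows only the core pattern, in which bytes pass through memory in fixed-size chunks and never land on local disk.

```python
import io
from typing import BinaryIO


def relay(source: BinaryIO, sink: BinaryIO, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Copy source to sink in fixed-size chunks and return the bytes relayed.

    Hypothetical helper for illustration: only one chunk is held in memory
    at a time, so the full dataset never needs to fit in RAM or on disk.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for a remote source and a cloud storage sink.
    src = io.BytesIO(b"example record\n" * 1000)
    dst = io.BytesIO()
    copied = relay(src, dst, chunk_size=4096)
    print(f"relayed {copied} bytes")
```

Because the relay node only ever holds one chunk at a time, many such transfers can run side by side on a single small VM, which is what makes this pattern attractive for mass ingest.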