Historically, GDELT ingested television news feeds through a number of workflows running on an assortment of machines across GCP's US data centers. Each workflow was responsible for ingesting a specific set of feeds and codified the unique needs of the downstream processes it supported. However, this yielded an ever-growing patchwork of ingestion pipelines across an ever-growing number of machines, creating significant maintainability difficulties and presenting unique challenges when multiple workflows required the same streams on different schedules or with different configurations. As the underlying technical characteristics and delivery mechanisms of the streams constantly changed, maintainability became an ever-greater challenge. Worse, as delivery stability has worsened for some feeds, with growing network, format and delivery errors, attempting to address these issues infrastructure-wide was becoming ever more futile.
Thus, GDELT's GEN4 architecture includes a single unified video ingest point responsible for ingesting all 33 HD television news channels we process. It receives a list of new content every 10 minutes and is responsible for downloading the broadcast files, which are a mixture of MPEG2 and MPEG4 content, along with their associated metadata and structural files. It oversees the entire process, retrying all failed ingests every few hours and using realtime network and delivery information to adjust its interaction with the underlying source servers to maximize throughput while minimizing both transient and permanent errors.
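To make the retry-and-adjust behavior concrete, here is a minimal sketch of how such an ingest queue might look. It is not GDELT's actual implementation: the class name `IngestQueue`, the three-hour retry interval (one reading of "every few hours"), the concurrency limits, and the additive-increase/halve-on-failure rule are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field

# Assumption: "every few hours" is modeled here as a fixed three-hour delay.
RETRY_INTERVAL_S = 3 * 3600


@dataclass
class IngestQueue:
    """Hypothetical queue that tracks new, failed, and retried downloads."""
    max_parallel: int = 8   # hypothetical ceiling on simultaneous downloads
    parallel: int = 4       # hypothetical starting concurrency
    pending: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # url -> earliest retry time

    def add_new(self, urls):
        """Queue newly announced files, skipping ones already tracked."""
        for url in urls:
            if url not in self.pending and url not in self.failed:
                self.pending.append(url)

    def next_batch(self, now):
        """Return up to `parallel` URLs: due retries first, then new files."""
        due = [u for u, t in self.failed.items() if t <= now]
        batch = (due + self.pending)[: self.parallel]
        for url in batch:
            self.failed.pop(url, None)
        self.pending = [u for u in self.pending if u not in batch]
        return batch

    def record(self, url, ok, now):
        """Feed a download outcome back into the concurrency setting."""
        if ok:
            # Success: probe for more throughput, one slot at a time.
            self.parallel = min(self.max_parallel, self.parallel + 1)
        else:
            # Failure: back off sharply and schedule a later retry.
            self.failed[url] = now + RETRY_INTERVAL_S
            self.parallel = max(1, self.parallel // 2)
```

Each 10-minute cycle would call `add_new` with the freshly received list, download whatever `next_batch` returns, and report each outcome through `record`. Halving concurrency on failure is just one simple way to ease pressure on a struggling source server; the real system may weigh network and delivery signals quite differently.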
Videos are verified upon receipt before being released for processing, centralizing the ever-growing verification workflows, which range from channel-specific tooling such as A/V sync mismatch detection to generalized checks for file corruption.
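One way to picture a centralized verification step is a registry that runs general checks on every file and extra checks only for particular channels. The sketch below is illustrative only: the decorator names, the `info` dictionary keys, the channel name `EXAMPLE_CHANNEL`, and the half-second sync tolerance are all hypothetical, and the measurements are assumed to have been extracted from the file by some earlier probing step.

```python
GENERAL_CHECKS = []    # checks applied to every received file
CHANNEL_CHECKS = {}    # channel name -> checks applied only to that channel


def general(fn):
    """Register a check that runs for all channels."""
    GENERAL_CHECKS.append(fn)
    return fn


def for_channel(name):
    """Register a check that runs only for the named channel."""
    def decorator(fn):
        CHANNEL_CHECKS.setdefault(name, []).append(fn)
        return fn
    return decorator


@general
def not_truncated(info):
    # Hypothetical corruption check: received size matches the expected size.
    return info["bytes_received"] == info["bytes_expected"]


@for_channel("EXAMPLE_CHANNEL")
def av_in_sync(info):
    # Hypothetical A/V sync check with an assumed 0.5 second tolerance.
    return abs(info["audio_start_s"] - info["video_start_s"]) <= 0.5


def verify(channel, info):
    """Return the names of failed checks; an empty list means release the file."""
    checks = GENERAL_CHECKS + CHANNEL_CHECKS.get(channel, [])
    return [check.__name__ for check in checks if not check(info)]
```

Keeping every check in one registry means a new channel-specific problem can be handled by adding a single function, rather than by patching several independent pipelines.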
A single 1-core VM is able to perform all of this through the power of cloud networking. A series of cronjobs and daemon processes manage all of the workflows, with videos streamed from the remote source servers directly to GCS using a workflow of CURL->MBUFFER->GSUTIL STREAMING. Because videos never need to be staged locally, the VM needs only a 10GB local OS disk. Files are written to a temporary path in the production GCS bucket, then verified and transferred to their final path and announced to all downstream processes.
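The CURL->MBUFFER->GSUTIL STREAMING chain can be sketched roughly as follows. This is an approximation under stated assumptions, not the production workflow: the specific flags, the 64M buffer size, and the function names are guesses for illustration, while `gsutil cp - gs://...` is gsutil's standard way of uploading from standard input.

```python
import subprocess


def build_pipeline(source_url, gcs_tmp_path, buffer_size="64M"):
    """Return argv lists for the curl -> mbuffer -> gsutil streaming upload."""
    # Flags are assumptions: quiet output, fail on HTTP errors, follow redirects.
    curl = ["curl", "--silent", "--show-error", "--fail", "--location", source_url]
    # mbuffer smooths bursts between download and upload; the size is a guess.
    mbuffer = ["mbuffer", "-q", "-m", buffer_size]
    # "-" tells gsutil to read the object body from stdin (streaming upload).
    gsutil = ["gsutil", "cp", "-", gcs_tmp_path]
    return [curl, mbuffer, gsutil]


def promote_command(gcs_tmp_path, gcs_final_path):
    """Return the argv that moves a verified file to its final GCS path."""
    return ["gsutil", "mv", gcs_tmp_path, gcs_final_path]


def run_pipeline(stages):
    """Chain the stages with pipes; True only if every stage exits with 0."""
    procs = []
    upstream = None
    for i, argv in enumerate(stages):
        is_last = i == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=upstream,
            stdout=None if is_last else subprocess.PIPE,
        )
        if upstream is not None:
            # Close our copy so an early downstream exit reaches the upstream stage.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    codes = [p.wait() for p in procs]
    return all(code == 0 for code in codes)
```

A caller would run `run_pipeline(build_pipeline(url, tmp_path))`, verify the uploaded object, and only then execute `promote_command(tmp_path, final_path)`. Since every byte flows through pipes, nothing touches the local disk, which is consistent with the small OS disk described above.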
Despite ingesting 33 HD video channels around the clock, you can see from the CPU graph at the top of this post that the machine typically runs at only 30-50% CPU usage, briefly peaking at 100% every few hours, during which time the onboard GDELT hypervisor prioritizes networking tasks over verification and other processes to ensure uninterrupted ingest and prevent buffer overflow.
The end result is that cloud networking is so powerful today that a single 1-core VM can ingest 33 HD video channels around the clock and barely exceed a third of its CPU capacity in a typical day!