A Behind-The-Scenes Look At Our Petascale Video Processing Architecture For Cloud Video AI-Powered Annotation

How do we process all of the video we analyze each day through Google's Cloud Video API? What does a cloud-native architecture look like that can scale to petabytes of video and provide an end-to-end pipeline from videos to a rich array of analytic datasets, including AI-powered annotation? What does it look like to process video at scale? For those interested in building their own video analysis pipelines, below is a high-level overview of the architecture we use.

Unlike our global web crawling infrastructures, our video processing infrastructure is relatively simplistic, based around a set of scheduled cronjobs, shell commands and scripts. We're able to use such a straightforward and basic architecture because all of the complex parts of video annotation such as ingest, a distributed storage fabric, AI annotation and analysis are all basic building blocks of GCP that can be simply connected together.


All of our video analysis involves prerecorded h.264-encoded videos between 30 minutes and 6 hours in length stored as MP4 files. Those analyzing live streaming sources will need a different architecture, but Google's Cloud Video API supports both prerecorded streaming and live streaming analysis via a relatively straightforward architecture. The instructions on this page focus on GDELT's architecture built around prerecorded MP4 files.

The first step is simply ingesting the videos for analysis from the remote video archive into the cloud environment. Here we have a simple script that runs via cronjob every 30 minutes (at :00 and :30 after the hour) on a small shared 8-core VM (the VM could actually be as small as a single core, but here we run the ingest on a shared machine that does many other tasks). It calls a CLI utility provided by the remote video archive that returns a list of all available videos from the past 7 days (on rare occasions a video might take up to a week to become available) that is written as a text file on disk. The script then compares the list of available videos against the list of videos it has already processed and compiles a list of remaining videos that have yet to be ingested.

Of the videos remaining to be downloaded it estimates how many it will likely be able to download in the next 25 minutes (to give it a safety margin of 5 minutes for stragglers to finish downloading) (these estimates are just rough estimates based on how long it has taken to download past videos recently) and hands the list to GNU parallel to run X downloader scripts in parallel. The downloader scripts simply stream a video in via wget and write to gsutil using direct-to-GCS streaming with mbuffer to even out any bandwidth inconsistencies:

wget -q -O – –header '[AUTHORIZATIONTOKENSHERE]' https://someremotesite/[FILENAME] | mbuffer -q -m 50M | gsutil -q cp – gs://yourpathhere/[FILENAME]

In this way the videos are downloaded securely using an authentication token and streamed directly to GCS, with no data touching local disk. GCE VMs have such high ingest bandwidth and the downloading takes so little CPU that we actually run these downloading scripts on a shared machine responsible for many other high-CPU tasks and the downloaders have negligible impact. In practice, you could run the entire downloading process from a single-core VM.

After the video finishes downloading to GCS, we request the file's MD5 checksum using gsutil and compare it with the original checksum reported by the video archive (to catch any errors during the download):

gsutil stat gs://yourpathhere/[FILENAME]

If there is a mismatch, the file is deleted from GCS, to be tried again in 30 minutes. Otherwise the script writes to disk a log entry that the video was successfully downloaded. Note that gsutil returns the MD5 as BASE64 that will need to be converted to hex in most cases. We use this direct-to-GCS streaming architecture because it means no data has to touch local disk. This means you can use a single-core VM with a 10GB SSD disk and ingest videos at speeds effectively constrained only by the speed of the remote data archive.

To keep track of what files have been downloaded, the script writes a set of daily inventory files in the format YYYYMMDD.txt that lists the IDs of all successfully downloaded videos that were published that day. These files are rewritten to GCS every 30 minutes when the latest batch of videos are downloaded.

To recap, every 30 minutes a cronjob runs a simple script on a GCE VM that runs a video archive-provided CLI to get a list of available videos from the last few days. It compares the list of available videos against the list of videos it has successfully downloaded and downloads any new ones by streaming them directly into GCS and verifying their MD5 checksums after, then writing their IDs to a set of daily inventory files uploaded to GCS.


Once videos have been ingested into GCS it is time to process them through the Video API to generate their annotations. The Video API is entirely self-contained with its own computational resources (ie serverless), so you don't actually use any of your own CPU/GPU/TPU quota for it, you simply hand the API the GCS URL of a video file in your GCS bucket, tell it what annotations you want and it writes back to GCS the annotations when it finishes. That's it!

In our case, every 30 minutes (at :15 and :45 after the hour), another cronjob runs a script that checks the list of successfully downloaded videos up to that point (via the YYYMMDD.txt daily inventory files) and checks to see if they have been annotated through the Video API. Any videos that have not yet been annotated it submits via a simple curl call up to the maximum rate permitted under the project's quota:

curl -s -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \ --data "{
'features': [
'location_id': 'us-east1',
'videoContext': {
'labelDetectionConfig': {
'labelDetectionMode': 'SHOT_AND_FRAME_MODE',
'stationaryCamera': false,
'model': 'builtin/latest'
'speechTranscriptionConfig': {
'languageCode': 'en-US',
'enableWordConfidence': true,
'enableAutomaticPunctuation': true,
'maxAlternatives': 15,
'enableSpeakerDiarization': true,
'diarizationSpeakerCount': 6
'shotChangeDetectionConfig': {
'model': 'builtin/latest'
outputUri: 'gs://yourpathhere/[OUTPUTFILENAME].json'
}" "https://videointelligence.googleapis.com/v1/videos:annotate"

The curl command above is entirely self-contained and uses "gcloud auth application-default print-access-token" to retrieve the necessary authentication credentials during each call so it can simply be invoked on any GCE VM either manually from the shell to test things or, in our case, by simply executing via script. No extra libraries or tools have to be installed (all GCE VM's have gcloud preinstalled). Note that in a production environment it isn't recommended to use "gcloud auth application-default print-access-token" in this fashion, but in practice this approach works well and means it requires just a simple curl call to invoke the API.

In the call above we are requesting label detection (objects and activities depicted in the video), shot change detection, OCR and speech recognition (ASR) using the latest models. For label detection we tell the API that the camera is not stationary and may move around (studio cameras for newsreaders are typically largely stationary, but field reporting often makes heavy use of panning and handheld shots) and to report results as both shots and frames. For speech recognition we tell it to expect English language speech and to enable recognition confidence scores, diarization, punctuation and provide up to 15 alternatives for each speech group. For your own use case you would enable whichever features you need and configure them appropriately to your needs.

In our case each time we submit a video we randomize across the four processing regions, replacing "us-east1" above with one of the below:


In more complicated scenarios in which you need to charge the processing costs to a different project/account (ie you are running in a VM in one project that is charged to one account but need to charge the video processing costs to a different division in your organization or perhaps multiple different divisions depending on the video), you can change the first line of the curl command above to the following to execute the submission on a VM in one project with one set of billing information while using a service account and project and billing parameters from another project that bills the actual cost of the Video API to the other project (you must have previously created a service account in the other project for this purpose and granted this project access to it and you must have the necessary permissions in both projects):

curl -s -X POST \
-H "Authorization: Bearer "$(gcloud --account=[BILLINGSERVICEACCOUNT].iam.gserviceaccount.com --billing-
project=[BILLINGPROJECTNAME] --project=[BILLINGPROJECTNAME] auth print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \ --data "{

The call above completes within a second or two and returns a URL that includes the zone the job is executing on (if you don't manually specify "location_id" the API automatically distributes jobs across the nearest zones) and the Operation ID. These are necessary if you want to request additional status information from the API for debugging, but in the general case are never needed. We parse them via a simple regular expression and store in a logfile:

$zone = ''; $jobid = 0; ($zone, $jobid) = $results=~/"name": "projects\/[YOURPROJECTIDHERE]\/locations\/(.*?)\/operations\/(\d+)"/;

At this point there is nothing further to do. The Video API will read the MP4 file from GCS, perform the requested annotations and write the results back to the specified GCS filename, taking anywhere from a few minutes for a few small files to an hour or longer for larger files, especially if you are submitting multiple videos in parallel and have a smaller quota. Most importantly, the API will only write results once it has finished the entire file (it won't write partial results), so once the output file appears in GCS you know the video has completed processing.

Your Video API quotas determine how fast you can submit videos and how many can run in parallel at the same time.


At this point the Video API will run on its own using its own hardware, there is nothing further you need to do! Once the API completes, it will write its results to the specified output file. In our case, we have yet another cronjob that runs every 30 minutes (:00 and :30 after the hour) and runs a script that reads a list of currently in-progress videos from a file, and uses GNU parallel to check if the specified output file exists yet in GCS. Any videos where the output file doesn't yet exist are still being processed by the Video API, while any with the output file in GCS have completed.

In other words, we do something like this:

cat LISTOFINPROGRESSVIDEOSOUTPUTFILENAMES.TXT | parallel 'gsutil ls gs://yourpathhere/{}.json | scriptthatchecksstatus'

For each completed file, we stream it into memory (avoiding disk) using "gsutil cat" and parse it using a JSON library to analyze the results:

gsutil -q cat gs://yourpathhere/[OUTPUTFILENAME].json | analysisscript

For very simplistic analyses you could pipe the results above directly to jq or similar CLI JSON parser to extract key values.

At this point, our script removes the ASR data and writes a new copy of the Video API JSON to GCS for public download. It simultaneously collapses the annotations from frame-level resolution to 1s resolution, combines the results together into a summary file where each row is 1s of airtime and the columns contain the various annotations for all frames in that second of airtime, and writes a second summary file to GCS. This summary file is also loaded into BigQuery using the BigQuery CLI to load the file from GCS into a table.

Additional Annotations

In addition to the AI annotations from the Video API, we also process each video file through ccextractor to extract its captioning mode data and write the resulting TTXT file back to GCS:

gsutil cp gs://yourpathhere/[INPUTFILENAME].mp4 .
ccextractor -quiet -out=ttxt -bi ./INPUTFILENAME.mp4 -o ./INPUTFILENAME.ttxt
gsutil cp ./INPUTFILENAME.ttxt gs://yourpathhere/[INPUTFILENAME].ttxt

For MP4 files, ccextractor requires that the file be resident on local disk so we have to copy it out of GCS to local disk. For MPEG2 files we can simply stream them directly from GCS without the data having to touch disk:

gsutil -q cat gs://yourpathhere/[INPUTFILENAME].mpg | mbuffer -q -m 10M | ccextractor -out=ttxt -bi - -o ./INPUTFILENAME.TTXT

The resulting TTXT captioning files are extremely small ASCII files (around 100-200K per hour-long broadcast) so require little disk space, but if desired you can stream the TTXT results back to GCS directly without anything ever touching local disk you can use:

gsutil -q cat gs://yourpathhere/[INPUTFILENAME].mpg | mbuffer -q -m 10M | ccextractor -out=ttxt -bi - -stdout | gsutil -q cp - gs://yourpathhere/[INPUTFILENAME].ttxt

We write the TTXT files back to GCS for long-term storage and also use a script to parse them down to several analytic files that are written to GCS for public download and mirrored into BigQuery.


That's really all there is to it! You can see how extremely simple this workflow is, simply tying together basic cloud building blocks, with all of the underlying complexity of a global storage fabric, TPUs, AI, etc being magically abstracted away.

Most importantly, this architecture scales to petascale-sized archives with no modifications! And in fact, for the ccextractor TTXT annotations, we've already processed more than 1.5PB of videos in just the last few weeks!

In terms of AI-powered video annotation, to date we've annotated 350 million seconds of video through the Video API!

In the end, a single VM and a couple of cronjob'd scripts is all that's required to build a production 24/7 AI-powered video analysis platform that can scale to petabyte-sized archives!