Behind The Scenes: Using The GCS Byte Range Selector To Reduce 500TB Of Network Bandwidth To 700GB

We maintain an archive of around 500TB of TAR files containing 1fps frame samples of all 7 million television news broadcasts, which we use for OCR, multimodal embeddings, object and activity detection, and other visual analyses. We need to reinventory this entire collection directly from the files themselves to validate our metadata about each file: counting the number of frames in each TAR file, computing the pixel dimensions of one image from each TAR file (since all frames in a broadcast are the same size), and performing several resizing operations to record the dimensions each yields for that broadcast.

Traditionally this would require downloading each TAR file in full to a VM and unpacking it into a RAMDISK, necessitating more than 500TB of network transfers. Instead, using GCS' built-in byte range selector, it is possible to request only a specific byte range of a given object. Below is a sample gcloud command (in production we call the JSON API directly) that requests only the first 100K of each TAR file and has tar unpack as many complete images as it can from the stream (skipping the final image, which is truncated at the 100K mark), saving them to a RAMDISK-backed temp directory. Remarkably, since these TAR files store their images in reverse order, the highest-numbered frame appears at the front of the file: we can simply unpack the first few images from each TAR, find the filename with the highest frameID, use it as the frame count, and perform our various calculations on that single image. Instead of downloading 500TB of data from GCS, we download less than 700GB!

timeout -s 9 15m gcloud storage cat -r 0-100000 gs://[BUCKET]/[BROADCAST].frames.tar | tar --ignore-zeros -xf - -C /dev/shm/TMPDIR/
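Since the production path calls the JSON API directly rather than shelling out to gcloud, the same range request can be sketched in Python. This is a minimal illustration under assumptions, not our production code: `fetch_range` assumes an OAuth bearer token is already available, and `complete_members` mirrors tar's behavior of stopping at the first member cut off by the range boundary.

```python
import io
import tarfile
import urllib.parse
import urllib.request

def fetch_range(bucket, name, nbytes, token):
    """Download only the first `nbytes` of a GCS object via the JSON API's
    media endpoint, using a standard HTTP Range header (returns HTTP 206)."""
    url = ("https://storage.googleapis.com/storage/v1/b/%s/o/%s?alt=media"
           % (bucket, urllib.parse.quote(name, safe="")))
    req = urllib.request.Request(url, headers={
        "Authorization": "Bearer " + token,
        "Range": "bytes=0-%d" % (nbytes - 1),
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def complete_members(data):
    """Return (name, payload) for every TAR member fully contained in
    `data`, stopping at the first member truncated by the range boundary."""
    members = []
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            for info in tf:
                f = tf.extractfile(info)
                if f is None:  # not a regular file
                    continue
                payload = f.read()
                if len(payload) < info.size:
                    break  # this member was cut off at the range boundary
                members.append((info.name, payload))
    except tarfile.ReadError:
        # The final header or payload was itself truncated; keep what we have
        # (some Python versions raise ReadError on a short member read).
        pass
    return members
```

`complete_members(fetch_range(bucket, name, 100_000, token))` then yields the handful of complete frames at the front of the archive, ready for the counting and measurement steps.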
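The reverse-ordering trick then reduces frame counting to a string scan over the handful of filenames that were unpacked. The zero-padded `-NNNNNN.jpg` suffix below is an assumed naming convention for illustration; the real frameID pattern may differ.

```python
import re

# Hypothetical naming convention for illustration: each member's name ends
# in a zero-padded frame number, e.g. "BROADCAST-000412.jpg".
FRAME_ID = re.compile(r"(\d+)\.jpg$")

def frame_count(filenames):
    """Because the TARs store frames in reverse order, the first few members
    carry the highest frameIDs, so the maximum ID seen equals the total
    number of frames in the broadcast."""
    ids = [int(m.group(1)) for m in map(FRAME_ID.search, filenames) if m]
    return max(ids) if ids else 0
```

For example, `frame_count(["B-000412.jpg", "B-000411.jpg"])` returns `412`, the broadcast's frame count, without reading anything beyond the front of the TAR.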