
A key part of our ASR pipeline involves determining which of the in-flight broadcasts have completed processing through GCP Chirp. Since we use dynamic batching, we submit videos to the API as they arrive and scan GCS to see which of those broadcasts have completed. Given their different lengths, languages and complexities, results typically return well outside of submission order. On top of that, due to technical nuances of the upstream ingest system, broadcasts from multiple days ago can suddenly become available for ASR, which makes prefix-based schemes difficult to use.

Instead, we scan a single GCS path where the API writes all of its output files. Originally we scanned via a wildcard like "gcloud storage ls gs://bucket/path/*", but with a path containing 312 files in 104 directories this took 12 seconds, and with tens of thousands of files it could take minutes to complete, making it difficult to deploy in production. In reality, we only need the list of directory names, since each directory represents one broadcast. The actual list of result files can be left to the downstream ASR processing engine to sort out.

By simply dropping the wildcard and replacing "gcloud storage ls gs://bucket/path/*" with "gcloud storage ls gs://bucket/path/", the time drops from 12 seconds to just 1.4 seconds! Of course, this makes sense, given that GCS paths are merely prefix abstractions layered on top of an object store: the wildcard form expands every matching directory and lists its contents, while the bare path returns only the first-level entries. Still, it offers a reminder that even very small changes can have outsized impacts on performance and stability.
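As a rough illustration of the directory-only scan, the following Python sketch runs the non-wildcard listing and reduces it to one name per broadcast. It is not our production code: the function names `broadcast_ids` and `list_completed` are hypothetical, "gs://bucket/path/" is the placeholder path from above, and the parsing assumes that directory entries are printed one per line as full URLs ending in "/".

```python
import subprocess


def broadcast_ids(listing_output, parent):
    """Extract first-level directory names from listing output.

    Assumes each directory is printed as a full URL ending in "/",
    e.g. "gs://bucket/path/some-broadcast/". Other lines are ignored.
    """
    ids = []
    for line in listing_output.splitlines():
        line = line.strip()
        # Keep only directory entries that sit under the parent path.
        if line.startswith(parent) and line.endswith("/") and line != parent:
            ids.append(line[len(parent):].rstrip("/"))
    return ids


def list_completed(parent):
    """Run the listing without a trailing wildcard and parse it.

    Hypothetical helper: requires the gcloud CLI to be installed
    and authenticated.
    """
    result = subprocess.run(
        ["gcloud", "storage", "ls", parent],  # note: no "*" at the end
        capture_output=True,
        text=True,
        check=True,
    )
    return broadcast_ids(result.stdout, parent)
```

Calling `list_completed("gs://bucket/path/")` would then return the directory names, one per broadcast, without enumerating the files inside each directory.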