
Our longstanding ASR workflow was built on a submit-once model: each broadcast needing transcription was submitted a single time to the ASR API, and the API took on the job of automated retries and guaranteed completion. The transition of modern SOTA ASR to LSMs has changed these assumptions, with users now responsible for retrying failed submissions. Our new ASR infrastructure handles this automatically, repeatedly resubmitting a broadcast with different API parameters and configurations as needed.

One challenge that emerged during this transition is that some broadcasts have no audio stream at all, meaning our self-healing infrastructure entered what amounted to an infinite loop, cycling through an exhaustive array of configurations trying to transcribe a file that simply could not be transcribed because it contained no audio of any kind. Unfortunately, the metadata for these broadcasts often claims that they do have audio streams and frequently even specifies bitrates, sampling rates and other details for an audio stream that does not exist. To address this, any broadcast whose audio stream surrogate construction fails is now scanned with ffprobe for the existence of an audio channel using the code below. Performing this check only for failed transcodings minimizes transfer bandwidth. While a trivial example, this edge case offers yet another reminder of the immense complexity of understanding and transcribing vast archive-scale collections spanning the globe over decades.
timeout -s 9 130m gcloud -q --no-user-output-enabled storage cat gs://bucket/video.mp[4g] | ffprobe -v error -i - -select_streams a -show_entries stream=index > streaminfo.txt
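As a sketch of how the captured output can then drive the retry decision (with a simulated streaminfo.txt standing in for real ffprobe output): with "-select_streams a -show_entries stream=index", ffprobe writes one "index=N" line per audio stream found, so an empty result means the broadcast has no audio and can be flagged as untranscribable rather than endlessly retried.

```shell
# Hypothetical follow-up to the ffprobe scan above. ffprobe emits one
# "index=N" line (inside [STREAM]...[/STREAM] wrappers) per matching
# audio stream, so an empty capture file means no audio stream exists.

# Simulate the ffprobe output for a broadcast with no audio streams:
: > streaminfo.txt

if grep -q '^index=' streaminfo.txt; then
    echo "HAS_AUDIO"     # audio present: continue normal ASR retry logic
else
    echo "NO_AUDIO"      # no audio: flag broadcast and exit the retry loop
fi
```

Any stream-level field would serve equally well here; index is simply the cheapest entry to request when only existence matters.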