Behind The Scenes: The Strange Case Of An Unhappy Service Account & The Unique Complexities Of AI Infrastructure

Modern cloud-based computing infrastructure has grown so reliable over the years that even large-scale workflows rarely encounter error rates severe or problematic enough to require fundamentally redesigning core customer infrastructure. In general, the cloud "just works." The AI revolution, with its immense and often exotic hardware environments and its massive, brittle models and supporting systems, therefore represents a uniquely new and challenging environment for infrastructure developers. AI systems can fail in the most unusual and difficult-to-diagnose ways, with unexpected bursts of theoretically impossible error states that can require new approaches to error recovery and health management.

Take, for example, the following error that we started seeing one recent afternoon from the STT API. On the surface, this is a relatively easy-to-understand error: the service agent used by the API to write to the customer's GCS bucket lacked the correct IAM permissions. The problem is that this service account actually possessed the necessary permissions and, critically, handled hundreds of thousands of requests that day, with only a random selection of files failing with this error. Worse, the API actually generated the correct results and successfully wrote them to GCS, only to delete one or more of the files moments later. Downstream processing pipelines received notification of a broadcast ready to process, along with a list of available files, only to find those output files vanishing as they prepared to read them. Given that this specific error state is technically impossible according to the documentation, the downstream systems began failing without an obvious recovery strategy other than retrying the API submission from scratch – which sometimes worked and sometimes yielded a second error, either similar or entirely different.
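
As a rough illustration of how this surfaces in practice, the sketch below (a hypothetical helper, assuming the operation's JSON response has already been fetched as a string, as in the excerpt at the end of this post) checks each per-file result for an embedded error even though the operation itself reports done: true – the condition the original logic never anticipated.

import json

def find_failed_files(operation_json: str):
    """Scan a completed BatchRecognize operation response for per-file errors.

    The operation can report done: true while individual files still carry an
    embedded error object, so "done" alone is not a safe success signal.
    """
    failed = []
    op = json.loads(operation_json)
    if not op.get("done"):
        return failed  # still running; nothing to inspect yet
    results = op.get("response", {}).get("results", {})
    for source_uri, file_result in results.items():
        err = file_result.get("error")
        if err:
            # code 7 is PERMISSION_DENIED in the Google RPC status codes
            failed.append((source_uri, err.get("code"), err.get("message")))
    return failed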

Such is the brave new world of AI-centric processing. In this case, the original logic presumed that once the API signaled successful output and a GCS directory listing showed the output files had been written, all of the listed files were available for reading and any read errors were simply transient GCS errors that would self-resolve with a timeout-and-retry strategy. After discovering this new behavior that violates the constraints in the documentation, that logic was altered to accept that API output files can be deleted out from under the process reading them and that, under such circumstances, the entire broadcast should simply be retried after a period of time. While a relatively simple fix, the idea that an AI API could delete output files at random after successful execution requires a counterintuitive understanding of possible system end states and represents just one of the myriad oddities that are becoming more and more common as the analytics world increasingly migrates to AI-first infrastructures.
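
A minimal sketch of that revised strategy, using the google-cloud-storage client to read the output files and assuming a hypothetical submit_batch() callable that resubmits the broadcast to the STT API and returns a fresh list of output object paths, might look like the following. The key change is that a missing object is no longer treated as a transient read error to be retried in place, but as a signal to resubmit the whole job after a delay.

import time
from google.cloud import storage
from google.api_core import exceptions

storage_client = storage.Client()

def read_outputs_or_resubmit(bucket_name, output_paths, submit_batch, max_attempts=3):
    """Read all output files for a broadcast; if any have vanished, retry the whole job."""
    bucket = storage_client.bucket(bucket_name)
    for attempt in range(max_attempts):
        try:
            # Attempt to read every output file the API reported as written.
            return [bucket.blob(path).download_as_bytes() for path in output_paths]
        except exceptions.NotFound:
            # An output file that was listed as available has been deleted out
            # from under us: don't retry the read, resubmit the broadcast.
            time.sleep(60 * (attempt + 1))
            output_paths = submit_batch()
    raise RuntimeError("Broadcast still failing after repeated resubmissions")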

...
    "progressPercent": 100
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v2.BatchRecognizeResponse",
    "results": {
      "...": {
        "uri": "....json",
        "error": {
          "code": 7,
          "message": "Service account `XYZ@gcp-sa-speech.iam.gserviceaccount.com` does not have write permissions to bucket `...`. Source error: ... does not have storage.objects.delete access to the Google Cloud Storage object. XYZ@gcp-sa-speech.iam.gserviceaccount.com does not have storage.objects.delete access to xxx.srt."
        },
        "cloudStorageResult": {
          "uri": "xxx.json",
          "srtFormatUri": "xxx.srt"
        }
      }
    },
    "totalBilledDuration": "0s"
...