Agentic AI At Scale: Observations & Challenges From Agentic AI For SQL Tuning & Optimizing Massive Databases

As AI and especially agentic AI capabilities have continued to improve, we've been increasingly incorporating autonomous code and infrastructure generation, optimization, observation and recommendation agents into our development workflows. It was agentic Gemini that developed our original schema design for our migration to Spanner, built the infrastructure daemon proxies for our VMs, optimized our GCE CPU platforms and tuned our kernels. However, as we've been working over the past two weeks to optimize our new Spanner deployment, we've run up against two critical limits of current agentic AI: the conflicting, sparsely documented and undocumented "art" of tuning massive-scale databases and the difficulty and cost of agentically or manually A/B testing and tuning massive distributed infrastructure like large-scale databases. To be fair, Gemini was excellent at taking natural language query descriptions and turning them into functional correct SQL that yielded the correct results. The problem lay in optimizing those queries and the underlying tables when working with large scale databases to improve performance and how current agentic developer workflows rely more on brute force and large-scale iteration rather than the small number of highly targeted and intelligent interventions required when each idea can take hours of costly resource offlining to test. In the end it was a set of recommendations from the Spanner team that massively improved the performance of our queries, rather than autonomous recommendations from Gemini, pointing to current limits in agentic SQL optimization. We also tested several competing models and found the same results, indicating this is a broader challenge of current agentic workflows, rather than specific to any particular model.

Tuning large databases to achieve maximum performance has long been one of the most sought-after skills of experienced database developers. When creating the schemas for tables and their indexes and the queries that use them, there are myriad factors that must be considered and tradeoffs, from the expected query load to balancing latency vs throughput to the specific strengths and design characteristics of a given database platform.

In theory, designing an optimal database schema is where agentic AI should particularly shine. After all, it can pour through every table, index, query and SQL fragment ever published on the entire internet, pour through every tiny bit of documentation minutia and look across millions of real-world developer experiences and benchmarks on the web cross-referenced with thousands of pages of official documentation to find the optimal design. In practice, however, we found several key limitations to this model.

The first is that while SQL is a universal database language, databases implement different features in different ways and have different strengths. Gemini (we tested both 3.1 Pro and 3.5 Flash, both with High Thinking and both with Search Grounding enabled) kept recommending features and design patterns from other databases without realizing that they were not applicable to Spanner and missing key Spanner-specific capabilities, features, hints and design patterns, despite the prompts all requesting solutions for Spanner specifically. While it often correctly adapted SQL syntax to Spanner's needs (though in at least 30% of runs it yielded SQL that failed to execute with Spanner), it failed to correctly reason architecturally about the mapping of those design patterns onto Spanner. For example, both Gemini models repeatedly falsely asserted that the entirety of all stored fields in a covering index were stored and replicated fully inside the fulltext search inverted index token listing such that every single word appearance resulted in the full duplication of the entire covering index. This resulted in the agentic workflows initially consistently attempting to minimize the use of covering indexes or recommending the use of many small indexes that would result in low cache hit rates. Strangely, while Gemini correctly saw that columnar engine options are not available for fulltext indexes, it consistently recommended schemas that used them even while stating in its narrative that the SQL would not work. These were addressable through manual intervention and prompt modification, but it is surprising that the autonomous workflows were not able to catch this on their own, especially with Search Grounding enabled.

Given that models, even with grounding enabled, are heavily influenced by the biases of their training data, given that SQL and databases are well represented on the web and given that Spanner is a more specialized database designed for massive scale deployments rather than running a 1MB database on a tiny VM and thus less represented in online code samples compared to databases like MySQL and Postgres, we found that Gemini consistently adopted traditional database design practices that do not map well to Spanner. More interestingly, Gemini consistently lapsed into outdated and suboptimal practices, such as recommending that every single field have its own separate index to minimize memory use, despite prompting telling it that queries would range across all of the fields and thus this would result in far more cache thrashing than a single shared index.

Strangely, when presented repeatedly with the table and index schemas and the SQL for a given query and asked to optimize it to make it run faster, Gemini would consistently recommend very poor schema changes that actually vastly decreased performance. These changes ranged from the intuitive but wrong designs beginning database students make (break every single field into its own separate index and make an index for every single crosswise permutation of any set of fields) to the just plain bizarre, with many of Gemini's solutions failing to execute and requiring numerous iterations to yield correct SQL. Gemini primarily focused on schema changes to the table and its indexes like a beginning CS student, rather than starting with query-level changes and hints like a professional database developer would, especially given the size of the data and the long latency in rebuilding tables of hundreds of GB's for A/B testing.

Interestingly, the silver bullet that changed everything was adding "OPTIONS (sort_order_sharding = true)" to the search index definition. Out of two weeks of iterations with Gemini, this option never came up, with Gemini consistently asserting that the presence of an "order by" in the index definition also meant that the index shards themselves were date sorted or alternatively asserting convincingly that this would destroy parallelism and undermine Spanner's parallelism. Only after manually poking through the Spanner documentation ourselves did we come across this option and begin pushing Gemini on the specifics of index sorting, which caused it to finally change its recommendation and suggest using this option in one of its iterations. Indeed, the Spanner team also readily recommended this option.

The addition of this single schema change exponentially improved query performance and improved retrieval queries to ~10-20ms on our large text-intensive table. Timeline aggregations were also considerably improved, but still not fast enough. Days of iterating with Gemini 3.1 and 3.5, getting its suggestions, running them, returning the query explanations, letting Gemini think on things and recommend new ideas and testing those, yielded no improvements. Performance was fundamentally stuck. Once again the Spanner team came to the rescue, recommending a single change: changing "GROUP BY" to "GROUP @{GROUP_METHOD=HASH_GROUP} BY". This yielded another step change improvement in performance, dropping all but the most expensive queries down to interactive level latencies. Yet, despite pressing Gemini in more than 20 iterations to recommend query hints to accelerate the query to see if it would recommend this improvement, it mentioned this option just once, combined with a flurry of other hints that, despite Gemini taking numerous refinement turns, never resulted in a parseable SQL statement.

Amazingly, Gemini never once, in two weeks of agentic optimization, recommended the ANALYZE command, despite being told the data had just been loaded. Another recommendation from the Spanner team, this eliminated the need for the GROUP_METHOD hint, with Spanner now having the necessary statistics to figure that out for itself.

Thus, it was three human recommendations that led to the major performance breakthroughs here, rather than Gemini, which seemed caught in a mobius loop of randomly trying everything imaginable rather than attempting to actually understand what the SQL was doing and target optimizations for the parts taking the longest. Gemini could read a Spanner query plan and explain what happened and recommend myriad possible improvements, but seemed incapable of making the leap to understand what schema and query changes would have the greatest impact on targeting the high-latency stages. Instead, Gemini seemed to just throw everything imaginable against the wall to see what would stick.

Perhaps the major issue here is that current agentic workflows depend on the idea that models can rapid-fire test large numbers of potential solutions, the majority of which are low-quality, and just brute-force their way to a solution that works, rather than intelligently refining a set of ideas based on extensive research and understanding. Asked to write a Python script to batch load a CSV file into Spanner, Gemini took multiple tries to keep iteratively debugging its code, but was eventually able to write a serviceable script simply by writing, testing, reading the errors and trying again, in a largely brute force manner.

The problem with large-scale database optimization is that the cost of A/B testing is so high. Testing a Python script takes milliseconds. Reloading or reindexing a multi-TB table to test an index tweak can take hours and requires the cluster to be taken offline to precisely isolate the benchmark from other system activity. It is entirely tractable for a model to just brute force its way through writing and changing random sequences of local Python code until something works, since the only real "cost" was tokens that were historically quite cheap (though this is changing). On the other hand, when testing each database change requires hours of dedicated server time and keeping expensive specialized servers offline to isolate the query load, this requires a very different approach to development that focuses on testing only the most promising approaches, rather than iteratively testing every possible permutation of table, index and query design and brute forcing hundreds or even thousands of different design possibilities, of which just a few are actually viable.

In the end, it was human experts that provided the breakthroughs for us, but the reasons that the agentic development workflows struggled offering a critical telling glimpse into the strengths and weaknesses of current AI-powered agentic code development.