The Hidden Dangers Of Generative Coding (CodeGen) Guardrails

One of the most intriguing findings from our generative code modernization experiment earlier today was the degree to which the values-based guardrails developed for textual LLMs carry over into code generation (codegen) models, in ways that can render them unusable for enterprise applications. Just as textual LLMs refuse to address certain topics or modify their outputs to reflect specific values, codegen tools encode similar values-based guardrails. And like their textual counterparts, these codegen guardrails are designed for consumer-facing applications, with little consideration for the needs of enterprise applications.

For example, in our code modernization experiment, we took a BigQuery Legacy SQL query written eight years ago and used Duet AI and Gemini Pro to modernize it to contemporary Standard SQL. In several of our attempts, Gemini invoked its guardrails and refused to modernize the code. In one case it falsely claimed that it was unable to generate code at all, a reflection of the strange ways guardrails can manifest themselves that makes them difficult to debug. Critically, Gemini also rewrote the code to remove core functionality based on its wrongful application of its guardrails, triggered by everything from the query's topic (monitoring global wildlife crime) to its methodology (the code output a list of public news outlet URLs, which Gemini's guardrails treated as a form of PII).

It is critical that codegen systems not be misused to automatically generate phishing or other malicious cyber code. At the same time, it is equally important that a bank modernizing its fraud detection code not have the model misclassify that code as a request to commit financial crimes rather than detect them. Current textual LLMs frequently fail at exactly this kind of distinction: they are often unable to tell apart a request to summarize an accredited news agency's coverage of a conflict and a malicious user's request to fabricate falsehoods about that conflict from whole cloth.

These guardrail false positives are especially insidious in codegen systems because they can cause the generated code to be subtly rewritten to remove or modify core functionality in ways that a more junior developer might not even catch. Worse, as some companies increasingly deploy codegen systems in fully automated or minimally reviewed workflows, against all best-practice recommendations, these changes may pass unnoticed into production.

Further research is required to understand how guardrails can be applied to codegen systems in a way that prevents malicious misuse while still allowing benign enterprise applications to proceed.