The Challenges Of LLM-Powered Code Generators: Early Observations From The Trenches

A growing body of research is beginning to call into question the quality of LLM-generated code, even as the tech industry rushes forward with an ever-expanding array of tools designed to augment or even replace human coders. As we continue to explore and evaluate a growing number of these tools, here are some of our early observations.

Across the board, we've found the utility of code generators to be limited to basic tasks in a small number of languages, especially Python and HTML + JavaScript. Surprisingly, despite the ubiquity of basic shell scripting in nearly every real-world workflow, code generators are often extremely poor at anything beyond trivial shell tasks. Most shockingly, however, the coding assistants offered by major tech companies frequently border on useless when it comes to those companies' own software packages, unable to output code capable of even the most basic tasks. Asked to generate a trivial demo of a major library developed by its parent company, one code generator continually output random permutations of parameters and pipelines that had nothing to do with the library, demonstrating the critical importance of RLHF and other forms of training and tuning for these models. Contrary to much of the punditry around their use, these models cannot simply be handed the user manual for a given library and be expected to generate working code against it a majority of the time.

Ask a code generator to produce a Python script that reads a CSV file with a given set of headers into a Pandas dataframe, and any of the major tools can generate perfectly passable code. Ask for the equivalent in a less common language and the results will typically border on gibberish. Common Python-adjacent software typically has strong support, but even tier-one software that is limited to specific fields tends to have poor support. For example, while ImageMagick and ffmpeg are the de facto gold standards for image and video manipulation, support for both is spotty.

For example, one major code generator, when asked to "Change the temp directory for ImageMagick's 'convert' utility to /dev/shm", yielded the following results across four separate runs. While even a basic web search returns the correct answer, the generator instead simply tried random combinations of seemingly reasonable but entirely wrong parameters. While in some cases such wrong answers can still serve as starting points, these hold no value in locating the correct one:

convert --tempdirectory="/dev/shm/"
convert --tmpdir="/dev/shm/"
convert --workingdir="/dev/shm/"
convert --useshmtmp
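
For reference, none of those flags exist. ImageMagick takes its temporary directory from an environment variable or a -define setting rather than a dedicated convert flag. A minimal sketch of both documented approaches (the input and output filenames are illustrative):

# Set the temp directory via an environment variable for one invocation:
MAGICK_TMPDIR=/dev/shm convert input.jpg output.png

# Or pass it as a -define setting on the command line itself:
convert -define registry:temporary-path=/dev/shm input.jpg output.png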

What about the noise prefix "```json" that often precedes JSON output from some LLMs? Let's ask one major generator to "Remove the '```json' prefix from a JSON string":

jq -r . <string>

Unfortunately, that answer is not merely unhelpful, it cannot work: jq will fail to parse any input that still carries the non-JSON prefix. The failure is also surprising given JSON's centrality to modern data interchange and how routine stripping prefixes and suffixes is in shell scripting.
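
For comparison, a working answer requires nothing more exotic than the shell's own parameter expansion or sed. A minimal sketch, assuming the raw LLM output is held in a variable named raw:

# Remove the literal ```json prefix with POSIX parameter expansion:
cleaned=${raw#'```json'}

# Or delete the opening ```json fence line and the closing ``` line with sed:
cleaned=$(printf '%s\n' "$raw" | sed -e '/^```json$/d' -e '/^```$/d')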

What about another common JSON-related task? For our GCP Imagen image captioning demo, we needed to encode a Base64 version of the image with each request, so we used the prompt "Using a shell command, Base64 encode the image './RUSSIA1_20230215_083000_60_minut-000014.jpg' and set the field 'instances[0].image.bytesBase64Encoded' in the JSON file 'input.json' to the encoded image." Run repeatedly, this yielded the following results. It is immediately clear that the LLM was essentially trying random permutations of command line options rather than proceeding from an actual understanding of jq's documentation:

jq -n -e '.instances[0].image.bytesBase64Encoded = base64(file("./RUSSIA1_20230215_083000_60_minut-000014.jpg"))' input.json > output.json 
jq -n -a '.instances[0].image.bytesBase64Encoded = base64file("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json 
jq -n -u '.instances[0].image.bytesBase64Encoded = base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json 
jq -n -r .instances[0].image.bytesBase64Encoded --base64 "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json 
jq -n -b .instances[0].image.bytesBase64Encoded --encode --format base64 --inputfile "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json 
jq -n .instances[0].image.bytesBase64Encoded --arg b64 "./RUSSIA1_20230215_083000_60_minut-000014.jpg" input.json > output.json 
jq -n --arg b64: base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg") .instances[0].image.bytesBase64Encoded input.json > output.json 
jq -n --arg b64:.instances[0].image.bytesBase64Encoded base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg") input.json > output.json 
jq '.instances[0].image.bytesBase64Encoded |= @base64("./RUSSIA1_20230215_083000_60_minut-000014.jpg")' input.json > output.json

We even provided one code generator with the relevant portions of the jq user manual as contextual framing for the prompt, to no avail.

This captures a critical difference between human coders and LLM-powered code generators: the ability to read documentation. A human coder would consult jq's documentation page for a description of each of its parameters, determining whether it supports built-in Base64 encoding and, if so, how to use it. At the same time, advanced use cases like this often aren't covered in the documentation, so a human coder would cross-reference examples from the web against the documentation and extrapolate to their needs. The code generator here, in contrast, appears simply to be trying random permutations of commands it has seen work in other contexts and gluing them together, without understanding what they actually do within jq or whether they even exist.
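
For reference, that documentation-guided route leads to a short, working answer: jq operates on JSON text rather than binary files, so the encoding has to happen outside jq, with the result passed in via --arg. A minimal sketch (base64 -w0 disables line wrapping and is GNU coreutils syntax; macOS users would omit the -w0):

# Base64-encode the image outside of jq, then splice the result into the JSON:
b64=$(base64 -w0 ./RUSSIA1_20230215_083000_60_minut-000014.jpg)
jq --arg b64 "$b64" '.instances[0].image.bytesBase64Encoded = $b64' input.json > output.json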

Critically for production environments like ours, code generated by these tools is typically optimized for understandability and readability rather than security and safety. For example, one major generator, when asked to "Write a PHP script that accepts a 'keyword' field and uses it to keyword search a MySQL database", generated the following code. A coder new to SQL would be forgiven for missing the glaring SQL injection vulnerability in this code:

$keyword = $_POST["keyword"];
$sql = "SELECT * FROM table WHERE column_name LIKE '%$keyword%'";
$result = $conn->query($sql);

What if we add the word "secure" to our prompt, making it "Write a secure PHP script that accepts a 'keyword' field and uses it to keyword search a MySQL database"? At first glance the revised code seems entirely reasonable, until we reach the last line. For unknown reasons, the generator produced a proper prepared statement right up to the point where it should have called execute(), then reverted to its original insecure code. This example is especially insidious. A novice programmer would wrongly assume that something in the preceding lines had globally altered PHP's default behavior, while an experienced coder in a rush would likely skim the first few lines, see that they looked reasonable, and never scroll down (a large amount of commentary and other code separated the two) to notice that the expected execute() call had been replaced with query():

$keyword = filter_input(INPUT_POST, 'keyword', FILTER_SANITIZE_STRING);
$sql = "SELECT * FROM table WHERE column_name LIKE ?";
$hand = $conn->prepare($sql);
$parm = "%$keyword%";
$hand->bind_param("s", $parm);
...
$result = $conn->query("SELECT * FROM table WHERE column_name LIKE '%" . $_POST["keyword"] . "%'");

Ironically, when the same code was copy-pasted with the prompt "Does this code have any vulnerabilities", the generator responded that "This PHP code is technically accurate but is vulnerable to SQL injection and uses the dangerous 'filter_input' function". It is unclear what it found to be dangerous about filter_input, but this time it correctly recommended using "$hand->execute()".

What about a commonly overlooked vulnerability: hardcoded passwords? Many developers resort to saving passwords in quick test scripts that are subsequently uploaded to shared public environments like GitHub. Let's ask one code generator "How can I securely store a password in my Python script and prevent it from being read?" Amazingly, it offered the following:

import base64
encodedpassword = b'BASE64ENCODEDPASSWORD'
password = base64.b64decode(encodedpassword).decode("utf-8")
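
Base64 is an encoding, not encryption: anyone who can read the script can recover the password with a single shell command, no key or secret required. Using the placeholder value from the snippet above:

# Reverses the "protection" in one step (substitute the real encoded value):
echo 'BASE64ENCODEDPASSWORD' | base64 -d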

A second generator offered the far more reasonable solution of storing the password in an environment variable read by the script, but incorrectly explained that "Environment variables offer a safe way of making passwords accessible to scripts. They can only be read by the designated script and cannot be accessed by any other user or script on the system." That could be roughly true if the script were run from a wrapper shell script that set the variable just before execution, but it is false whenever the script runs as part of a sequence of scripts launched from the same shell wrapper, since every one of them inherits the variable. A programmer unfamiliar with Unix environment variables would not necessarily grasp the unstated caveats in the generator's explanation:

import os
password = os.environ.get("PASSWORD")
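
To illustrate the caveat, the narrow case in which the generator's claim roughly holds is a wrapper that sets the variable only for the single process being launched. A hypothetical sketch in shell (the secret's path and the script name are illustrative):

#!/bin/sh
# Hypothetical wrapper: the VAR=value cmd form exports PASSWORD to this
# one python3 invocation only, not to the surrounding shell session.
PASSWORD="$(cat /path/to/secret)" python3 script.py

Even then, any process that script.py itself spawns inherits the variable, which is precisely the leakage the generator's explanation glosses over.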

Overall, our early work with code generators suggests they can be a powerful assistant for programmers needing quick template code for common tasks in a handful of languages. But when it comes to the more complex tasks and challenges that send experienced programmers to StackOverflow and other sites for guidance, code generators do little more than operate as the proverbial monkeys at keyboards. On occasion one of the generated permutations, while wrong, may be sufficiently close to give a programmer the pointer they need to find the correct answer, but overall their utility in real-world production code design is far more limited than often portrayed. While they will continue to improve like all LLMs, the fundamental limitations of their underlying architectures suggest significant caution for companies evaluating their potential.