As we continue our evaluations of advanced LLM-driven generative AI coding copilots, we've found them useful for generating quick snippets of code in common languages like Python for simple tasks that are already heavily represented on coding websites and thus just a web search away. In essence, in place of a quick web search, copilots can regurgitate that code and customize it to the user's needs, making them ideal tools for novice and beginning programmers. At the same time, we've found that for more advanced coding demands, their dependence on statistical prominence across the open web, and their tendency to merely recombine existing content from their training data rather than create genuinely new solutions, means they generate code that requires so much human correction that it is often faster for the programmer to skip the copilot entirely.
Recently, we tested two advanced post-GPT-4 coding copilot models that were presented as representing the potential of GPT-5-level coding competency. Putting them through their paces on simple code generation like basic Python data science, both generated standard best-practice code. Yet neither's code was obviously different from, or superior to, that generated by existing copilots or even the more generalized GPT-4. More intriguingly, both tools and existing copilots all generated code that followed the same basic conventions and exhibited the same nuances and stylistic artifacts, offering a reminder of just how interchangeable these models are and how similar their training data and RLHF nudging have become.
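To give a sense of what that convergence looks like in practice, the sketch below illustrates the kind of boilerplate exploratory data-science pattern all of the copilots we tested tended to produce. It is a hedged, reconstructed illustration rather than any model's verbatim output, and the file name and column names are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (hypothetical file and column names, purely for illustration)
df = pd.read_csv("sales_data.csv")

# The standard exploratory steps every copilot produced in some form
print(df.head())
print(df.describe())
print(df.isnull().sum())

# Drop missing rows and compute a simple group-by summary
df = df.dropna()
summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Plot the result
summary.plot(kind="bar", title="Revenue by Region")
plt.tight_layout()
plt.show()
```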
The real challenge came when we gave them both a real-world task that eludes most current copilots: writing a simple but robust RESTful networking server without using third-party libraries.
As expected, both models (as well as existing GPT-4-level copilots) initially generated code that used external third-party networking and server libraries. For novice programmers and many basic tasks, this is often ideal, since it abstracts away all of the complexities of networking and provides standard error and edge-case handling. At the same time, we've found that when running globally distributed, high-performance infrastructure at public cloud scale, existing libraries are unfortunately insufficient. Over the years we've had to build numerous bespoke networking and server libraries to handle the myriad intricate and highly specialized edge cases that occur when kernels and cloud infrastructure are pushed past the breaking point: RFCs are violated, hardware and kernels behave in ways their documentation says they cannot, networking stacks collapse in the most bizarre ways, processors and switches misbehave, and things happen that simply violate the rules of the digital world, all while reporting that nothing is amiss. Such is, unfortunately, the real world of cloud-scale infrastructure.
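For context, the sketch below is representative of the kind of initial answer every copilot produced: a minimal REST server built on the external Flask package, which is exactly the class of third-party dependency the task excluded. It is a hedged reconstruction from memory, not a verbatim quote of any model's output.

```python
# Representative of the copilots' first attempts: a minimal REST server built
# on the third-party Flask library, which hides all of the socket handling.
from flask import Flask, jsonify, request

app = Flask(__name__)
items = {}

@app.route("/items/<name>", methods=["GET"])
def get_item(name):
    if name not in items:
        return jsonify({"error": "not found"}), 404
    return jsonify({name: items[name]})

@app.route("/items/<name>", methods=["PUT"])
def put_item(name):
    items[name] = request.get_json(silent=True)
    return jsonify({name: items[name]}), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

For a novice exposing a quick internal service this is perfectly reasonable code; it simply is not what the task asked for.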
At the same time, even when code performs correctly, the hard-won lessons of a more limited era of hardware and of the HPC world have long been forgotten: modern server libraries are all too often written for readability and compliance with modern coding trends rather than optimized for absolute performance. Swapping out off-the-shelf popular server libraries for bespoke optimized code has in some cases allowed us to reduce the underlying hardware footprint by more than 90%, in one case replacing more than 100 large VMs with a single medium-sized one.
Thus, internally we make use of many bespoke and extremely optimized infrastructure systems written to address all of the edge cases and abnormalities we've observed over more than 25 years of running in colocated, HPC and globally distributed public cloud environments. Yet writing advanced networking code is extremely complex, requiring deep familiarity with kernel, protocol and networking underpinnings and the intricate interdependencies among their many moving parts. Could post-GPT-4 copilots help?
One copilot was highly responsive when asked to exclude external libraries, while the other required repeated coaxing and even then consistently reverted to including outside libraries, exhibiting low prompt coherence. Eventually both copilots were able to generate server templates that relied only on native code. However, neither initially produced code that included timeouts on the receiving or sending sockets. When prompted to include timeouts, one added elaborate and highly fragile signal-based SIGALRM code that wrapped the receiving code in multiple layers of signal installation, handling and removal. Prompted to explicitly add socket-based timeouts, it instead switched to nonblocking spins and counted down the timeout itself, which works, but is far less ideal than a simple socket timeout. Both made a wealth of naive assumptions about buffer sizes, non-adversarial clients and a total absence of errors or problems on the networking side. Despite repeated prompting to add error handling, deal with partial reads, cap read sizes, guard against buffer overflows and the like, both models resisted writing even basic best-practice server template code. Despite myriad prompt rephrasings, it was difficult to get either model to produce a robust socket read loop that handled even basic error conditions, let alone more complex edge cases. Intriguingly, a simple search on Stack Overflow produced higher-quality networking code for the read loop and better socket construction than either model managed despite nearly a day of extensive prompt engineering.
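For readers unfamiliar with what we were asking for, the sketch below illustrates the general shape of the read loop we would consider a bare minimum: a plain socket timeout via settimeout() rather than SIGALRM gymnastics, a hard cap on total bytes buffered, and explicit handling of partial reads, stalled clients and early disconnects. It is a simplified, hedged illustration assuming an HTTP-style request terminated by a blank line, not our production code and not any model's actual output.

```python
import socket

MAX_REQUEST_BYTES = 64 * 1024   # hard cap so a single client cannot exhaust memory
RECV_CHUNK = 4096
SOCKET_TIMEOUT = 10.0           # seconds; a plain socket timeout, no signal handlers

def read_request(conn):
    """Read an HTTP-style header block (terminated by a blank line), handling
    partial reads, stalled clients and oversized requests."""
    conn.settimeout(SOCKET_TIMEOUT)         # the simple timeout neither model reached for
    buf = b""
    while b"\r\n\r\n" not in buf:
        if len(buf) >= MAX_REQUEST_BYTES:
            raise ValueError("request too large")
        try:
            chunk = conn.recv(RECV_CHUNK)   # recv() may return fewer bytes than requested
        except socket.timeout:
            raise TimeoutError("client stalled mid-request")
        if not chunk:                       # peer closed before sending a full request
            raise ConnectionError("client closed connection early")
        buf += chunk
    return buf

def serve(host="0.0.0.0", port=8080):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(128)
        while True:
            conn, _addr = srv.accept()
            with conn:
                try:
                    read_request(conn)
                    body = b"ok\n"
                    header = ("HTTP/1.1 200 OK\r\nContent-Length: %d\r\n"
                              "Connection: close\r\n\r\n" % len(body)).encode()
                    conn.sendall(header + body)
                except (TimeoutError, ValueError, OSError):
                    pass                    # real code would log and return a 4xx/5xx where possible
```

Even this sketch omits concerns our production servers must handle, such as slow byte-at-a-time clients, request body limits and backpressure, yet it captures the baseline robustness that neither model would reliably produce.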
The reason for all of this? The most likely explanation is simply that server and networking code is a more niche and complex area of programming that is poorly represented in the "in the wild" web-scale-scraped training data used by many of these copilots. There are even fewer core libraries that incorporate the kind of production-grade best practices expected of real-world networking code. In fact, a simple web search and a scan of code repositories like GitHub turn up top results that are overwhelmingly simple servers designed as programming tutorials or as naive wrappers exposing backend processes for testing or use in closed environments, where the focus is on the service being exposed rather than the networking code itself.
In the end, for many core tasks, these copilots are merely repackaging existing human-produced code for novice programmers, rather than acting as advanced, competent programming assistants that can help build the next generation of stable, solid code.