As we continue to evaluate the capabilities of advanced Generative AI coding copilots, we find that they perform reasonably on the basic, entry-level snippets a beginning programmer would write, but fail catastrophically on more advanced tasks. Recently, while debugging a server application, we used ChatGPT 4.0, Gemini Ultra's coding copilot, and several dedicated coding copilots to diagnose the problem and generate template code for common sections. As usual, we ran into many strange issues.
Upgrading a legacy application from blocking to non-blocking sockets, we couldn't recall offhand whether "SO_RCVTIMEO" and "SO_SNDTIMEO" apply in any way to non-blocking sockets ("O_NONBLOCK") and, if so, at what point their timers begin (such as the initial connect()). Intuitively, it seemed they might still impose an overarching timeout starting from accept(), but we wanted to verify this behavior. Unfortunately, we received conflicting results: across the different models, multiple runs variously suggested that the options do apply, don't apply, apply under some circumstances, or will cause the code to fail outright. Worse, running the same model multiple times often yielded alternating answers. The actual answer is that these timeouts are not applicable to non-blocking sockets unless the socket is switched back to blocking mode later in the code (a pattern seen in some older applications to simplify selected sections at the expense of robustness).
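For the record, here is a minimal sketch of the actual behavior, assuming a connected TCP socket fd; the demo function is our own and error handling is abbreviated for clarity:

    /* SO_RCVTIMEO vs. O_NONBLOCK on a connected socket fd.
       Illustrative sketch only; error handling is abbreviated. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    static void timeout_demo(int fd)
    {
        char buf[256];

        /* Install a 5-second receive timeout. */
        struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        /* Switch the socket to non-blocking mode. */
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        /* With O_NONBLOCK set, recv() returns immediately with
           EAGAIN/EWOULDBLOCK when no data is queued; the 5-second
           SO_RCVTIMEO never comes into play. */
        if (recv(fd, buf, sizeof(buf), 0) < 0 &&
            (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("non-blocking: returned at once, timeout ignored\n");

        /* Restore blocking mode, the legacy pattern noted above. */
        fcntl(fd, F_SETFL, flags);

        /* Now recv() blocks and SO_RCVTIMEO applies: if no data arrives
           within 5 seconds it fails with EAGAIN/EWOULDBLOCK. */
        if (recv(fd, buf, sizeof(buf), 0) < 0 &&
            (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("blocking: recv() timed out after 5 seconds\n");
    }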
Despite blocking and non-blocking I/O forming a central tenet of modern networking and code parallelism, two of the models repeatedly and definitively responded that non-blocking sockets block until they are ready for input or output. As one advanced code generation model phrased it, "O_NONBLOCK indicates a non-blocking socket, which waits indefinitely for data to become available to read. Timeouts do not apply to non-blocking sockets, since they do not return until data is ready."
In a strange twist, all of the models evaluated struggled with the humble "select()" call, long a standard mechanism for implementing timeout-protected readiness waits on non-blocking sockets. Asked to wrap a write() call with "select()" to implement a timeout-protected blocking wait, each of the tested models consistently output code that placed the fd_set in the "readfds" parameter rather than the "writefds" parameter. Despite repeated prompting, none of the models placed the fd_set in the writefds parameter even after six iterations, meaning the select() code would test only for read readiness, not write readiness. Worse, since in the structured confines of a dev/test environment most sockets will likely be ready for reading at the same moment they are ready for writing, the code will perform as expected, only to fail in the most unpredictable and unusual ways in production, leaving an extremely difficult-to-identify latent failure point in the generated code.
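For reference, here is a minimal sketch of the pattern the models failed to produce, assuming a connected non-blocking socket fd; the helper name is our own and error handling is abbreviated:

    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Returns bytes written, 0 on timeout, -1 on error. */
    static ssize_t write_with_timeout(int fd, const void *buf, size_t len,
                                      int timeout_sec)
    {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);

        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

        /* The fd_set belongs in the writefds (third) argument, not the
           readfds (second) argument: we are waiting for write readiness. */
        int rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc < 0)
            return -1;  /* select() itself failed */
        if (rc == 0)
            return 0;   /* timed out; the socket never became writable */

        return write(fd, buf, len);
    }

The readfds variant the models insisted on would pass &wfds as the second argument instead of the third, blocking until inbound data happens to arrive rather than until the socket can accept outbound data.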
The models' failure to generate correct select() code is intriguing, as select()-based non-blocking code is so ubiquitous across the web that any search for socket code is likely to return at least a few examples. This suggests that mere statistical saturation of a given function in training data is not, by itself, sufficient to yield strong code-generation results.