How Evolving Guardrails & RLHF Are Creating False Confidence In LLM Safety & Bias Issues

One of the more interesting aspects of safety and bias issues in current-generation LLMs is the rate at which companies are modifying their guardrails and using RLHF and other tuning and filtering mechanisms to evolve their responses in near-realtime. Often, when we locate a specific systematic bias that a model reproduces consistently over a period of days to weeks, the company fixes it relatively quickly after we notify them, sometimes within minutes to hours of our report.

In cases of particularly egregious potential harm to vulnerable communities, we agree to temporarily delay public disclosure until the model or its interface wrapper has been adjusted to mitigate the bias issue. In many ways, this mirrors the responsible-disclosure best practices of the cybersecurity realm. Notably, our uniquely global perspective means we are often more successful at finding innate biases in models than companies' own red teams.

More broadly, many commercial LLM providers employ large staffs that constantly review transcripts of user engagement with their models for harmful interactions and use that feedback in a similarly corrective fashion. A user who manages to provoke a model into a particularly toxic exchange may find that the same interaction no longer works, sometimes just hours later, due to this continual improvement process.

What is so remarkable is that, despite LLMs' enormous size and complexity, companies are often able to institute corrections within a surprisingly short amount of time, though in some cases this is done through broad-brush prefilter bans that simply exclude entire categories of content until the model itself can be refined.
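To make that distinction concrete, a coarse prefilter of this kind can sit entirely outside the model, intercepting prompts before they are ever sent to it. The sketch below is purely illustrative: the category patterns, the prefilter and answer helpers, and the canned refusal text are all hypothetical rather than any vendor's actual pipeline, and real systems rely on trained classifiers rather than keyword lists.

```python
# Purely illustrative sketch of a broad-brush category prefilter (all names
# hypothetical; real vendors use trained classifiers, not keyword lists).
import re
from typing import Callable, Optional

# Hypothetical banned topic categories, each represented here by a crude
# keyword pattern purely for illustration.
BLOCKED_CATEGORIES = {
    "example_sensitive_topic": re.compile(r"\b(example_keyword_a|example_keyword_b)\b", re.I),
}

REFUSAL_MESSAGE = "I'm sorry, I can't help with that topic right now."


def prefilter(prompt: str) -> Optional[str]:
    """Return a canned refusal if the prompt matches a banned category,
    otherwise None so the prompt passes through to the model."""
    for pattern in BLOCKED_CATEGORIES.values():
        if pattern.search(prompt):
            return REFUSAL_MESSAGE
    return None


def answer(prompt: str, call_model: Callable[[str], str]) -> str:
    """Wrap the underlying model call with the prefilter: prompts in blocked
    categories never reach the model at all."""
    refusal = prefilter(prompt)
    return refusal if refusal is not None else call_model(prompt)
```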

On a positive note, this means each harmful discovery can yield meaningful mitigation of whole classes of future harms as companies continually evolve their models to fix the biases and harms users uncover. On the downside, it can lead to false confidence in the safety of models when users are unable to replicate the biases and harms they see reported in the community and conclude that those biases were merely random one-off responses, rather than understanding the full extent of the systematic biases encoded in today's models.

Most dangerously of all, current approaches to correcting LLM biases operate more like patchwork whack-a-mole than broad corrective shifts in how models behave, much as individual software vulnerabilities are often patched as reported without a corresponding systematic evaluation of the class of vulnerability they indicate. For example, when we report a particular issue, we'll often find that the company corrects that precise issue shortly afterward. But simply changing a few words to test nearly the same issue with different wording, or an extremely similar issue, will frequently recover the same bias, as the sketch below illustrates.
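One way to see how shallow such point fixes are is to mechanically re-test the same underlying question under light rewording. The sketch below assumes a hypothetical query_model function for whatever API is under test and an equally hypothetical shows_bias check; it simply fans the originally reported prompt out into paraphrased variants and records which ones still elicit the problematic behavior.

```python
# Sketch of paraphrase-based re-testing of a "fixed" bias issue.
# query_model and shows_bias are hypothetical stand-ins: the first would call
# whatever LLM API is under test, the second encodes the specific bias check
# (string match, classifier score, human label, etc.).
from typing import Callable


def probe_variants(base_prompt: str,
                   paraphrases: list[str],
                   query_model: Callable[[str], str],
                   shows_bias: Callable[[str], bool]) -> dict[str, bool]:
    """Send the originally reported prompt plus lightly reworded variants and
    record which ones still elicit the biased behavior."""
    results = {}
    for prompt in [base_prompt, *paraphrases]:
        response = query_model(prompt)
        results[prompt] = shows_bias(response)
    return results
```

If the vendor has patched only the exact reported wording, the original prompt comes back clean while several of the paraphrases do not.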

In theory, this should lead to models becoming less and less biased over time, in much the same way that a software package typically becomes more secure over time as vulnerabilities are discovered and patched. Instead, what we often find with LLMs is that fixing one issue undoes fixes for other issues (much as a poorly executed software patch can introduce new vulnerabilities) and that regular model updates tend to undo whole swaths of bias mitigations. A model might exhibit less and less gender bias over a series of mitigations until, a few weeks later and without warning, all of those biases suddenly rematerialize worse than ever.

This instability leads to false confidence, with companies testing a single snapshot of a model in time, reporting any biases they see, confirming they have been addressed, and then deploying that model without regularly reevaluating it to check whether those biases have crept back in. Companies should instead run continual bias regression tests on a daily or weekly basis and steadily expand those tests over time with their own red-teaming work, rather than leave bias detection and mitigation entirely to LLM vendors.
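A minimal version of such a regression harness, assuming a hypothetical query_model call and a hand-maintained suite of prompt/check pairs, might look like the sketch below: the point is simply to re-run the same checks on a schedule and log the results, so that biases that creep back in after a model update surface as new failures rather than going unnoticed.

```python
# Minimal sketch of a recurring bias regression suite (all names hypothetical).
# Each test pairs a prompt with a predicate over the model's response; run_suite
# would be invoked on a schedule (e.g. daily via cron) against the live model.
import csv
from datetime import datetime, timezone
from typing import Callable, NamedTuple


class BiasTest(NamedTuple):
    name: str
    prompt: str
    passes: Callable[[str], bool]  # True if the response is acceptable


def run_suite(tests: list[BiasTest],
              query_model: Callable[[str], str],
              log_path: str = "bias_regression_log.csv") -> None:
    """Run every test against the current model and append the results to a
    log so that drift after a model update shows up as new failures."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        for test in tests:
            response = query_model(test.prompt)
            writer.writerow([timestamp, test.name, test.passes(response)])
```

Growing this suite with each newly discovered issue, rather than retiring tests once a fix is confirmed, is what turns one-off bug reports into durable regression coverage.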