Why Is Generative AI Red Teaming & Debiasing So Poor? Do We Need To Rethink LLM Red Teaming?

One of the more remarkable elements of our work across generative AI models is the degree to which models presented to us as "fully debiased", "globalized", and so thoroughly red teamed and adversarially tested as to be largely immune to problems are found to be so riddled with issues as to raise questions about just how they were released in the first place. The heavy emphasis in recent years of working to mitigate the innate gender, racial and other biases that models learn from webscale data appears to have been reversed in an instant, with the latest LLM innovations restoring in many cases the verbatim biases that roiled the field just a handful of years ago. Most remarkably of all, companies and research groups that once proudly touted specific examples of bias that their models had been hardened against fail to test their new LLM models against those same benchmarks to see all their progress reversed.

One of the reasons for these challenges is simply the commercial urgency of rushing LLM solutions to the market. Generative AI has become the modern "pixie dust" that can be sprinkled onto any startup to instantly raise its valuation and prospects. At the same time, gone is the urgency of just a few years ago to address bias and globalization issues in AI. The focus on mitigating gender and racial bias from models, such as doctors are men and nurses are women, received considerable attention until last year, with companies proudly touting their AI trust, safety and bias teams. Indeed, it was not that long ago that the release of new AI models was accompanied by considerable focus on their bias issues, with companies that released biased models being pilloried in the press and forced to very publicly retract or modify their models. Today, such emphases seem quaint by comparison. Few companies deploying LLM solutions have dedicated bias and trust teams overseeing them, while many LLM vendors either lack bias and trust teams or do not afford them final binding authority over model shipping.

Some of the biggest LLM issues like prompt injection and hallucination are also some of the hardest to solve and it is entirely understandable that companies do not yet have solutions to these problems. At the same time, even basic mitigation strategies like embedding-based quality rankings and postfiltering workflows are rarely seen in the recommended best practices guides or deployed within companies making use of LLM workflows.

Yet, what we often observe when we test models for the first time are the most basic kinds of bias issues that companies should be catching at the very earliest experimental pilot stage, not long after they've been commercially deployed. For example, given the historical focus on stereotypical role-based gender bias encoded in embedding models (doctors are more closely associated with men, nurses with women, CEOs with men, etc), no embedding model released in 2023 should exhibit measurable gender bias along those dimensions. Yet, not only do we observe such bias in countless models we've tested – it is so existentially embedded in the model that it overrides other factors to existentially shape their outputs.

Part of the underlying problem is that debiasing and safety scanning of models has, to a great degree, been relegated to checkbox commoditized benchmarks and insufficiently creative red teaming. It is far easier for a company to run an automated commoditized "bias benchmark" than it is to actually probe its model creatively. Importantly, companies today approach LLM safety fundamentally differently than they do cybersecurity. Many companies today often devote vast resources to code verification, fuzzing and large teams of internal and external security professionals paid to think about how to creatively combine technical and social engineering to break systems in order to secure their systems against cyber breaches. Downstream companies increasingly test the security of their upstream suppliers. In contrast, LLM security and debiasing efforts have tended to be more automated benchmark-based, with red teaming being far more limited. Rather than examine LLM security and bias issues as the cyber problems they are, demanding of creativity and multidisciplinary approaches, they are treated as checkbox exercises. Worse, downstream companies are adopting LLM solutions so quickly that they are spending little time themselves examining how the challenges of their selected LLM solutions might impact their own security and safety needs.

Even those solutions that undergo extensive red teaming seem to come out the other end with most of their biases and risks intact. Could it be that while cyber red teaming is a relatively understood discipline, LLM red teaming has yet to reach a maturity level where it can adequately inform and protect companies? Companies need to understand the risks of LLM-based solutions and rethink how they approach the concept of red teaming in generative world, with new workflows and approaches to LLM red teaming.

In the end, in 2023 it shouldn't be possible to be handed an LLM that has cleared every benchmark and undergone rigorous internal and external red teaming and in the space of a few minutes with just the first few prompts undercover a decade's worth of classical and well-studied bias and security vulnerabilities unmitigated.