A Vision For The Future Of LLM Trust & Safety: From Consumer Toy To Behavioral Enterprise Guardrails

Amongst the myriad challenges involved in deploying LLMs in the enterprise, perhaps the least appreciated is the impact of consumer-centric safety filtering on enterprise-focused deployments. Today's commercial hosted LLMs utilize safety filters designed to block them from producing "harmful" output. In a typical execution, the LLM produces its output, which is then passed through a classification stage that scores its "harmfulness" along a set of predefined metrics; any output exceeding the predetermined thresholds is blocked and the API returns a "blocked content" error.
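To make that workflow concrete, here is a minimal sketch of such a per-response pipeline. The category names, thresholds and callables are illustrative assumptions, not any vendor's actual taxonomy or API:

```python
# A minimal sketch of per-response safety filtering as described above.
# The threshold values and category names are illustrative assumptions.
from typing import Callable, Dict

HARM_THRESHOLDS = {"hate": 0.5, "violence": 0.5, "sexual": 0.5, "self_harm": 0.5}

class BlockedContentError(Exception):
    """Raised when any category score exceeds its threshold."""

def filtered_completion(
    prompt: str,
    generate: Callable[[str], str],            # the underlying LLM call
    score: Callable[[str], Dict[str, float]],  # the post-hoc harm classifier
) -> str:
    output = generate(prompt)
    scores = score(output)                     # e.g. {"hate": 0.12, ...}
    violations = [c for c, s in scores.items() if s >= HARM_THRESHOLDS.get(c, 1.0)]
    if violations:
        # The caller sees only a generic error, with no regard for how
        # the request fits into the customer's actual workload.
        raise BlockedContentError(f"blocked categories: {sorted(violations)}")
    return output
```

Note that every response is judged in isolation: the filter has no notion of who the customer is or what they normally do.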

Such workflows are adequate for consumer-facing toy demonstrations like public chatbots, designed to generate public interest in LLM technology while minimizing mediagenic problematic outputs. Given the high reputational and potential legal costs of harmful outputs and the near-zero cost of a refusal, vendors have increasingly erred on the side of refusing to respond even to minimally sensitive topics as public scrutiny of their tools has increased.

As LLMs have begun transitioning to the enterprise, this consumer-centric safety filtering has proven incompatible with the real-world needs of production commercial applications. The problem: its extremely high false positive rate. In our own work over the past year evaluating myriad commercial and research-grade LLMs, we have found safety filters to be simply too Western-centric and too deeply biased against underrepresented communities to be usable. When we attempt to use commercially hosted LLMs to summarize, classify or otherwise process news coverage from around the world, we find that many companies' safety filters enforce what amounts to a Western utopia. Articles aligned with the worldview of "America the great" are processed without question, but news coverage that documents discrimination, covers stories relating to underrepresented or vulnerable communities like the LGBTQ+ community, involves religions other than Christianity or calls into question America's moral superiority to the rest of the world is immediately blocked. In the worldview implicit in many LLM vendors' filters, America is the model utopia for the planet, in which discrimination, bias and harm were eradicated long ago and the world is perfect. Worse, these filtering biases change over time.

A news organization generating topical keywords and summaries for its articles can't afford for its LLM to refuse to process entire swaths of its coverage, such as deciding one day that coverage of the LGBTQ+ community is harmful and toxic to society. A legal firm using LLMs to scan document archives during discovery can't afford for entire portions of those documents to be silently dropped from consideration because they document discrimination against a minority group, a topic the LLM vendor has deemed harmful to discuss. And so on. Yet this is exactly the unintended consequence of applying consumer-focused safety filters to enterprise-focused workflows.

When it comes to enterprise deployments, LLM vendors should move away from isolated per-response filtering and towards the behavioral analysis that forms the gold standard of cybersecurity today. A cluster of VMs that suddenly experiences a surge in network traffic could represent compromised machines hemorrhaging critically sensitive corporate secrets and requiring immediate quarantine. Or that same surge could be organic traffic to the company's ecommerce site after one of its products was spotted on a celebrity during an evening out and instantly went viral. No modern company uses hard firewall rules that blindly terminate any VM that experiences an increase in traffic. Instead, they look at the "pattern of life" of that VM and of the company's network as a whole. A normally staid and firewalled HR database in Chicago that is suddenly streaming data to an unknown set of IP addresses in Russia would warrant immediate quarantine, while a cluster of web servers experiencing a surge in regional traffic from a similar cross-section of ISPs as usual would warrant monitoring but no automatic action beyond autoscaling and SRE alerting.
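As a toy illustration of that "pattern of life" logic, the sketch below scores a VM's egress traffic against its own rolling baseline. The deviation cutoff and action names are assumptions for illustration, not how any particular cloud provider implements this:

```python
# A toy "pattern of life" check for a VM's egress traffic, assuming a rolling
# baseline of hourly volumes and a set of familiar destination networks.
from statistics import mean, stdev

def traffic_action(history_gb: list[float], current_gb: float,
                   destinations: set[str], known_destinations: set[str]) -> str:
    baseline = mean(history_gb)
    spread = stdev(history_gb) or 1.0        # avoid dividing by zero
    deviation = (current_gb - baseline) / spread
    unfamiliar = destinations - known_destinations

    if deviation > 4 and unfamiliar:
        return "quarantine"    # staid HR database streaming to unknown IPs
    if deviation > 4:
        return "monitor"       # organic viral surge: autoscale and page SRE
    return "ok"                # within the normal pattern of life
```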

What if LLM vendors adopted a similar model? Instead of filtering every single output for harmfulness, vendors would instead examine the "pattern of life" of each customer. In much the same way that cloud vendors apply behavioral filtering to detect malicious new customers and compromises of existing ones, LLM outputs could be examined through the lens of that customer's usual patterns of use. A newsroom using an LLM to summarize its coverage that generates a 10% violation rate across a cross-section of sensitive topics might reasonably be allowed to keep using the LLM despite that elevated rate, but a sudden increase in its violation rate could trigger human review (perhaps it is due to a breaking story about the execution of an LGBTQ+ activist that yields a flurry of contextualizing coverage). On the other hand, a new customer might reasonably be subjected to a much lower risk tolerance, with nearly all violating responses blocked until its reputational score increases over time.

Switching from isolated per-response filtering to behavioral analysis would also help companies identify misuse of their LLM solutions, as well as actionable trends. An LLM-powered customer service app that typically sees a 2% violation rate and experiences a slow, measurable increase in generalized abuse could reflect growing customer frustration with the brand, while a sudden spike in violations could represent a real-world event (a power company with an active outage receiving a surge in customer outrage) or a new form of misuse (a newly discovered prompt injection attack being used to coax its chat feature into generating toxic content). In the first two cases, it could be reputationally and economically harmful for a brand to suddenly terminate its customer support services and sever its ability to communicate with its customers just because they are being rude to its chatbot, while in the last case the company would want to take immediate action. Behavioral filtering allows all of these cases to be handled transparently.
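One way to picture such a policy is as a per-customer decision function that compares a rolling violation rate against that customer's own baseline and tenure. The thresholds, field names and reputation scale below are illustrative assumptions, not a proposal for specific values:

```python
# A hedged sketch of per-customer behavioral policy, assuming each customer
# carries a rolling baseline violation rate and a reputation score that grows
# with tenure. All numbers here are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class CustomerProfile:
    baseline_violation_rate: float  # e.g. 0.10 for the newsroom example
    recent_violation_rate: float    # measured over a sliding window
    reputation: float               # 0.0 (brand new) .. 1.0 (long-standing)

def policy_for_response(profile: CustomerProfile, response_violates: bool) -> str:
    # New, unproven customers get little tolerance: block violating output.
    if profile.reputation < 0.2:
        return "block" if response_violates else "allow"

    # Established customers are judged against their own pattern of life.
    drift = profile.recent_violation_rate - profile.baseline_violation_rate
    if drift > 0.15:
        return "escalate_to_human_review"  # sharp spike: breaking story, prompt injection
    if drift > 0.05:
        return "flag_for_monitoring"       # slow creep: e.g. rising customer frustration
    return "allow"                         # within this customer's normal envelope
```

The point of the sketch is the shape of the decision, not the numbers: an individual response is judged in the context of who produced it and how far it departs from their established behavior.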

In the end, moving beyond consumer-focused mediagenic toy demos to real-world production applications in the enterprise requires a fundamentally new approach to LLM trust and safety that adopts the behavioral "pattern of life" analysis of cybersecurity, rather than the isolated per-response blind filtering of today.