The Unintended Consequences & Harms Of Multimodal LLM Debiasing: Detection Vs Generation

Multimodal LLMs represent uncharted territory in the push to "debias" and "globalize" computer vision models. Past generations of object recognition systems could focus on improving the recognition quality of a narrow selection of categorical labels. In contrast, the human-like fluency of textual LLMs has raised expectations for multimodal LLMs, priming customers to expect rich, vivid descriptions akin to "A golden retriever stands in an open field of frost-kissed fall grass, facing to the right and staring off into the distance, its silken white fur glistening golden in the sunrise, blowing gently in the morning breeze," rather than the "dog, golden retriever, grass, frost" keyword descriptions of the past. These richer descriptions pose a vastly greater challenge for debiasing and globalization efforts, as they endeavor not merely to string semantically neutral connectives between keywords, but rather to impart significantly greater visual and contextual knowledge of image contents – details which often intersect with foundational bias issues. As model vendors ramp up their debiasing efforts surrounding some of the world's most divisive topics, these efforts are in some cases having severe unintended consequences that could themselves cause significant societal harm as these models are deployed into real-world use cases.

Take for example our recent evaluations of several state-of-the-art multimodal LLMs. Images depicting protests that intersect with liberal Western causes, ranging from abortion to climate change to LGBTQ+, racial, and women's rights, to name just a few, tend to be uniformly described as supporting that cause even when the image actually depicts a protest against it. For example, with several models we have systematically observed all anti-LGBTQ+ protest images described as pro-LGBTQ+ protests, sometimes accompanied by the jarring hallucinated commentary that the protesters (who are actually protesting against a new bill granting additional rights to the LGBTQ+ community) are there because their government is attempting to restrict LGBTQ+ rights. Some models have gone so far as to assert that no members of society anywhere in the world still hold negative views towards the LGBTQ+ community, while in a particularly jarring example of bias, one model labeled an anti-LGBTQ+ protest a "textbook example of cheering LGBTQ+ supporters celebrating that there is no discrimination against the LGBTQ+ community in the world today". Abortion rallies are typically labeled by models as "peaceful" or as "clashing" with aggressive police or opponents, even if the image actually depicts an anti-abortion rally whose members are attacking a pro-abortion group. Interestingly, a growing trend is for the presence of the Russian flag to cause some globalized models to automatically label an image as a pro-Ukraine rally, which itself seems contradictory, since pro-Ukraine rallies typically feature the Ukrainian flag, rather than the Russian flag.

In other words, debiasing and globalization efforts that push models to uniformly interpret a topic with a specific framing or salience risk the inadvertent side effect of blinding those models to the opposing viewpoint.

Why does this matter?

From a consumer standpoint, imagine an anti-harassment filter that uses a multimodal LLM to block imagery that dehumanizes a community from being uploaded to that community's discussion channels. An LLM that has been debiased to the point of interpreting images that attack that community as warmly supportive of it will falsely allow extremely harmful images to pass through. Or imagine an archive of protests in support of Ukraine from across the world: several of these LLMs group all pro-Russian and Russian-adjacent protests (such as rallies supporting anti-LGBTQ+ laws) as pro-Ukraine, yielding a photostream that would include considerable material antithetical to Ukraine.
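To make this failure mode concrete, the following is a minimal sketch of such an upload filter, written in Python; classify_image_stance() is a hypothetical stand-in for whatever multimodal LLM API a platform might call, and the stance labels are illustrative assumptions rather than any vendor's actual interface.

```python
# Minimal sketch of an upload filter built around a multimodal LLM.
# classify_image_stance() is a hypothetical stand-in for a real multimodal
# LLM API call; the stance labels below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ModerationDecision:
    allow: bool
    reason: str

def classify_image_stance(image_bytes: bytes, community: str) -> str:
    """Hypothetical multimodal LLM call returning 'supportive', 'neutral',
    or 'attacking' with respect to the given community."""
    raise NotImplementedError("replace with a real multimodal LLM API call")

def moderate_upload(image_bytes: bytes, community: str) -> ModerationDecision:
    stance = classify_image_stance(image_bytes, community)
    if stance == "attacking":
        return ModerationDecision(False, f"image attacks the {community} community")
    # The failure mode described above: a model debiased into reading attacks
    # as support returns "supportive" here, so dehumanizing imagery is allowed.
    return ModerationDecision(True, f"image judged {stance} toward {community}")
```

The entire safety property of this gate rests on the accuracy of that single stance label: if debiasing causes the model to return "supportive" for imagery that attacks the community, the filter silently waves the most harmful uploads through.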

From a research standpoint, it means that harassment and hate speech researchers, journalists, policymakers and others will come to incorrect conclusions about critical societal topics. One LLM labeled 100% of LGBTQ+-related protests across Europe as supporting the community, yielding the conclusion that anti-LGBTQ+ sentiment was no longer being publicly expressed on the continent. Such conclusions, when presented as "data-driven findings", can lead policymakers, press and the public astray. Similarly, another model labeled footage of government forces violently suppressing women's rights protests in Iran as "supporting women's rights", meaning that attempts to visually triage imagery emerging from the country to identify government crackdowns in real time yielded no results.
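As a sketch of how such mislabeling propagates into headline "data-driven findings", consider a simple research pipeline that tallies stance labels over an image archive; label_protest_stance() below is a hypothetical multimodal LLM call and the label set is an illustrative assumption.

```python
# Sketch of a research pipeline that tallies protest stances over an archive.
# label_protest_stance() is a hypothetical multimodal LLM call; the labels
# ('pro', 'anti', 'unclear') are illustrative assumptions.

from collections import Counter
from typing import Iterable

def label_protest_stance(image_bytes: bytes, topic: str) -> str:
    """Hypothetical multimodal LLM call returning 'pro', 'anti', or 'unclear'."""
    raise NotImplementedError("replace with a real multimodal LLM API call")

def stance_breakdown(images: Iterable[bytes], topic: str) -> dict[str, float]:
    """Fraction of archive images labeled with each stance toward the topic."""
    counts = Counter(label_protest_stance(img, topic) for img in images)
    total = sum(counts.values()) or 1
    return {label: count / total for label, count in counts.items()}

# A model debiased into labeling every LGBTQ+-related protest as 'pro' makes
# this report 0% 'anti', feeding the conclusion that anti-LGBTQ+ sentiment is
# no longer publicly expressed, regardless of what the images show.
```

If the model never emits "anti" for a given topic, the breakdown reports zero public opposition no matter what the underlying images actually depict, which is precisely the kind of erroneous continent-wide conclusion described above.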

At first glance, one possible explanation is that these biases simply reflect the topical distribution of global imagery available in digital form today. Perhaps the only images that exist online are those of pro-LGBTQ+ rallies, and the models are organically learning these biases. Looking through our own catalogs of global news media imagery, we don't find this to be supported: anti-LGBTQ+ rallies often receive greater news coverage precisely because of their opposition to Western norms. Instead, additional probing of many of these models uncovers strongly influential guardrails and the contradictory pressures (typically manifested as elevated model instability) that tend to occur at the points of greatest disconnect between model encoding and RLHF debiasing. Most conclusive, however, is that several models originally yielded accurate results on these tests, but as they have been subjected to continued debiasing efforts, they have become less and less accurate at detecting visual rhetoric that opposes sensitive topics, lending further evidence that these failures are a product of the debiasing workflow itself.

One strategy some vendors are adopting, in an attempt to avoid these issues altogether, is simply to reject as policy violations all images containing the most sensitive categories, such as images of people, protests, signs covering selected topics, or even all news media imagery.

The underlying issue is a failure to separate detection from generation. There are considerable societal justifications for building models that are incapable of generating harmful content in the hands of the general public, to reduce the risk of everything from harassment to disinformation. On the other hand, there are far fewer justifications for preventing models from detecting such harms in content they analyze. It is difficult to conceive of a use case where it would be societally harmful to have a model that can be used as a safety filter to identify images of violence against the LGBTQ+ community or women's rights protests, for example. Instead, multimodal LLM vendors should bifurcate their debiasing and globalization efforts: maximizing accuracy for detection, while erring on the side of safety for generation.
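One way to picture this bifurcation is as two separately tuned request paths: a detection path evaluated purely on accuracy, and a generation path governed by strict safety policy. The routing sketch below is purely illustrative; every function and topic list in it is a hypothetical placeholder rather than any vendor's actual architecture.

```python
# Illustrative sketch of bifurcating detection from generation: the two
# request types carry separately tuned safety policies. All functions and
# topic lists here are hypothetical placeholders, not any vendor's API.

SENSITIVE_GENERATION_TOPICS = {"protest", "violence", "political_figures"}

def detect(image_bytes: bytes, question: str) -> str:
    """Hypothetical detection backend tuned for maximal accuracy: it should
    report what the image actually depicts, including harmful rhetoric."""
    raise NotImplementedError("replace with a real multimodal LLM call")

def generate(prompt: str) -> str:
    """Hypothetical generation backend tuned conservatively for safety."""
    raise NotImplementedError("replace with a real generation call")

def handle_request(kind: str, payload: dict) -> str:
    if kind == "detect":
        # Detection path: never soften or invert what the image shows;
        # safety filters, archives and researchers depend on accuracy here.
        return detect(payload["image"], payload["question"])
    if kind == "generate":
        # Generation path: err on the side of safety before producing content.
        if SENSITIVE_GENERATION_TOPICS & set(payload.get("topics", [])):
            return "Request declined under generation safety policy."
        return generate(payload["prompt"])
    raise ValueError(f"unknown request kind: {kind}")
```

The key design choice is that safety constraints are applied per path: the generation path can refuse aggressively without ever degrading the detection path's ability to recognize what an image actually shows.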