The GDELT Project

Frontier AI Grand Challenge Problems: Grounding Vs Recency In The Hallucination Fight

As the existential challenges of AI hallucination have become ever more apparent, model vendors have increasingly moved to offer "grounding" services through RAG and open web search offerings. These are touted as a way to reduce model fabrications by providing authoritative, up-to-date information in their context windows, under the assumption that hallucination is largely a fault of holes or datedness in a model's knowledgebase (its weights) rather than of how it generates its outputs (inference). Unfortunately, as we showcased at last year's Web Summit, news coverage creates a unique fracturing point in current large AI models that provokes an intensity of hallucination unlike any other form of content for one critical reason: news is about what's different from the past, while AI models autocomplete based on the very patterns of the past that news deviates from. The result is an existential incompatibility between news and large models that isn't about outdated model weights, but rather is endemic to how the models function at inference time. Even when handed accurate information about an event in their context window through RAG or search grounding, if an event is "different enough" from the past, large models will still fabricate their outputs.

For example, when martial law was declared in South Korea in December, several SOTA models given the latest updates about the events through RAG hallucinated that the events were occurring in a different country, frequently "correcting" the country or head of state to those involved in past declarations of martial law, coups, attempted coups and other similar kinds of state breakdowns. Even given the actual answer in their context windows via RAG "grounding", the models still failed because of an existential statistical incompatibility between the accurate information in their context windows and the statistical knowledge encoded in their model weights.

How is this possible? Sadly, there is a widespread belief amongst AI developers (and, unfortunately, even some of the research community) that hallucination cannot occur if the correct information is provided within the context window and that model fabrication is exclusively the fault of outdated model weights (the so-called "knowledge cutoff"). In reality, no matter what is provided in a model's context window, it is still processed through those same model weights at inference time. If the knowledge encoded in a model's weights is statistically incompatible with the information provided in its context window, the model will "correct" that information to match the historical data it was trained on. In other words, it doesn't matter where the input knowledge arrives from (context window vs weights): that information will still be processed through the model's weights, and the more it deviates from the past, the more likely the model is to hallucinate.

The reason is that current approaches to "grounding" aren't actually performing what we might think of as grounding: they don't forcibly connect the model's operation to an authoritative source; they merely provide updated information to the model. In other words, most current "grounding" approaches are really just "recency" approaches: they supply the model with recent information from web search or RAG in its context window and do not otherwise alter the inference process.
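To make the distinction concrete, the sketch below shows what a typical "recency" pipeline actually does. The retrieve_passages() and generate() callables are hypothetical placeholders standing in for any vendor's search/RAG and LLM APIs, not real library functions: the retrieved text is simply concatenated into the prompt and the decoding process itself is left entirely untouched.

# Minimal sketch of a typical "recency"-style RAG pipeline. retrieve_passages()
# and generate() are hypothetical placeholders, not any particular vendor's API.
def recency_grounding(question: str, retrieve_passages, generate) -> str:
    # Fetch recent passages and simply prepend them to the prompt. Nothing here
    # constrains decoding: the retrieved text merely sits in the context window
    # and is processed through the same model weights as any other input.
    passages = retrieve_passages(question, top_k=5)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # The model remains free to "correct" the context toward its training priors.
    return generate(prompt)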

In contrast, true "grounding" methods impose an evaluation and cost strategy on the inference process that does not allow outputs to be generated that conflict with the inputs by more than a certain threshold. For example, our early work on summarization evaluation was a form of true grounding in that we would run a model multiple times under the hood, compute embeddings over all of the outputs and the original text, and select the output with the lowest semantic difference from the original, refusing to produce any output if no summary was within the required semantic distance. We've also explored an array of other forms of grounding, from entity-based (entity extraction is performed on the original text and each model output to ensure the core entities are the same), numeric (numbers must match), geographic (geocoders are run on both texts and weighted bounding boxes must align), tone-based (multiple dimensions such as affect, imagery level and concreteness must align), and statistical linguistic (alignment of statistically improbable phrases and their equivalents), to clausal/fact/relationship-based (individual "factoids" and relationships are compared across the two texts).
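As a concrete illustration, here is a minimal sketch of the embedding-based selection loop described above. It assumes a sentence-transformers embedding model; generate_summary() is a hypothetical stand-in for the summarization model call, and the distance threshold is purely illustrative rather than the value used in our pipelines.

# Minimal sketch of embedding-based "post-grounding": generate several candidate
# summaries, embed them alongside the source text, and return only a candidate
# whose semantic distance from the source falls under a threshold, refusing to
# answer otherwise. generate_summary() is a hypothetical LLM call and the 0.15
# threshold is illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def grounded_summary(source_text, generate_summary, n_candidates=5, max_distance=0.15):
    candidates = [generate_summary(source_text) for _ in range(n_candidates)]
    source_vec = embedder.encode(source_text, convert_to_tensor=True)
    cand_vecs = embedder.encode(candidates, convert_to_tensor=True)
    # Cosine distance between the original text and each candidate summary.
    distances = 1.0 - util.cos_sim(source_vec, cand_vecs)[0]
    best_idx = int(distances.argmin())
    if float(distances[best_idx]) > max_distance:
        return None  # Refuse to produce output: no candidate is close enough to the source.
    return candidates[best_idx]

The same select-or-refuse pattern generalizes to the other checks above: swap the embedding distance for entity overlap, numeric agreement, geocoded bounding-box alignment and so on.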

Each has its strengths and weaknesses, but such approaches represent true "post-grounding" in that the output is connected back to the input. What is truly needed, however, is "inference time grounding", in which this process occurs inside the model itself as it performs inference.
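Purely as an illustration of where such a mechanism would have to live, the sketch below shows one crude way a grounding constraint could act inside the decoding loop rather than after it, using a Hugging Face LogitsProcessor that penalizes tokens absent from the source text at every step. This is not our approach or a production technique, and token overlap is a far weaker signal than the semantic, entity and geographic constraints described above, but it illustrates the difference between shaping generation as it happens and filtering outputs after the fact.

# Illustrative only: a Hugging Face LogitsProcessor that subtracts a fixed
# penalty from every token that never appears in the grounding text, nudging
# each decoding step back toward the source. Real inference-time grounding
# would require far richer constraints than raw token overlap.
import torch
from transformers import LogitsProcessor

class SourceVocabularyBias(LogitsProcessor):
    def __init__(self, source_token_ids, vocab_size, penalty=5.0):
        self.allowed = torch.zeros(vocab_size, dtype=torch.bool)
        self.allowed[list(set(source_token_ids))] = True
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        # scores: (batch, vocab_size) logits for the next token at this step.
        off_source = (~self.allowed).to(scores.device, scores.dtype)
        return scores - self.penalty * off_source

Such a processor would be passed to generate() inside a LogitsProcessorList, so the constraint runs at every decoding step rather than as a post-hoc filter on the finished output.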