As the enterprise world has begun to aggressively adopt LLMs into real-world workflows, the once-obscure and quickly dismissed challenge of LLM hallucination has increasingly moved front-and-center in enterprise discussions of LLMs. While an errant response to a consumer generative search query might be easily dismissed and is at worst a minor public relations nuisance, a hallucinating LLM's outputs in the enterprise could lead to significant legal and reputational jeopardy. As LLM vendors attempt to reassure their corporate customers, one line of argument that has emerged in some sectors is the claim that coupling vector databases with LLMs eliminates or significantly minimizes hallucination. Unfortunately, this is not the case; the claim stems from a combination of a nascent field trying to expand at all costs and a fundamental misunderstanding of where and how hallucination originates in the LLM pipeline.
At its root, LLM hallucination stems from a mismatch between what a prompt asks a model to produce and the model's underlying training data. A model trained on data from across the open web, whose only knowledge of incursions onto the American homeland from other nations in modern times is the theoretical risk of a nuclear weapons strike or a space-based weapon, will readily hallucinate that a Chinese spy balloon is a ballistic nuclear missile strike or a falling satellite or the like. Similarly, a model whose primary training corpus encodes Russian flags as indicative of protests relating to Russia's invasion of Ukraine will readily translate any appearance of that flag into a hallucinated description of an anti-war protest. Yet even in cases where training data is well balanced and well aligned with a prompt, the randomness that underlies how LLMs select each sequential token can cause them to select the least accurate or relevant token sequence from amongst all possibilities. In other words, even when a model has ample training data about a vast range of incursions onto the American homeland by an adversary, there is still a random chance in any given run that it will select the ballistic missile launch as its output.
Put differently, LLM generation is much like gambling in a casino: a truly winning answer is rare and typically the result of random chance; you might end up with a truly catastrophic result, but you're more likely to end up with any number of suboptimal responses.
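To make that roll of the dice concrete, the sketch below samples a next token from a temperature-scaled softmax over a handful of invented logits. The tokens and values are purely illustrative and are not drawn from any real model, but the mechanics mirror how an LLM picks each token in sequence.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Sample one next token from a softmax over the given logits.

    Illustrative only: a real LLM's vocabulary holds tens of thousands of
    tokens and its logits come from the model's forward pass.
    """
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # random.choices draws proportionally to the probabilities, so even a
    # low-probability (factually wrong) continuation is sometimes chosen.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical next-token logits for "The object over Montana was a ..."
logits = {"balloon": 2.5, "satellite": 0.8, "missile": 0.2}
print(sample_next_token(logits, temperature=1.0))
```

Run it a few times and the low-probability "missile" continuation will occasionally win the draw, which is the entire problem in miniature.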
It is this mismatch of prompt and training data that leads to the false hope of vector databases as a solution to hallucination. In this scenario, rather than relying on the LLM to divine an answer from its own knowledge base, it is merely asked to summarize and distill a response from a small collection of highly relevant text prefiltered via embedding similarity scoring from an external database, a form of "external memory." After all, if the model can be hand-fed the answer and need only rephrase it, rather than look to its own knowledge stores, hallucination should be entirely avoidable. Right?
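For readers unfamiliar with the pipeline, the sketch below shows the retrieval half of that arrangement under simple assumptions: the passages have already been embedded with the same model that embeds the query, and brute-force cosine similarity stands in for the approximate nearest-neighbor index a production vector database would use. The function names are illustrative, not any particular vendor's API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, docs: list[str],
             doc_vecs: list[np.ndarray], k: int = 3) -> list[str]:
    """Return the k passages whose embeddings sit closest to the query."""
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question: str, passages: list[str]) -> str:
    """Wrap the retrieved passages as 'external memory' for the LLM."""
    context = "\n\n".join(passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")
```

Everything downstream of build_prompt is still ordinary LLM generation, which is exactly where the trouble resurfaces.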
The problem is that the LLM still summarizes the text returned by the vector search through that same underlying training data. While reducing the input prompt to only narrowly focused, highly relevant results can help by concentrating the model's entire available attention on relevant content, it is still entirely possible for the model to draw from a mismatched portion of its training data as it summarizes the text from the vector database, transforming it into yet another hallucination.
Reducing the temperature of an LLM inference can significantly reduce hallucination, but at the cost of vastly increased plagiarization and stilted prose, in which, instead of distilling the search results, the LLM merely regurgitates them as-is with minimal grammatical blending. Worse, even with a temperature setting of zero, small mismatches between training data and the requested output can still yield hallucinations amidst the plagiarization.
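In sampling terms, temperature zero is simply greedy decoding: the softmax sketch above collapses to an argmax, as in the snippet below, so the randomness disappears but the next-token preferences learned from training data do not.

```python
def greedy_next_token(logits: dict[str, float]) -> str:
    """Temperature-zero decoding: always take the single highest-logit token.

    Removing the randomness removes the dice roll, not the bias: the
    highest-probability continuation is whatever the training data made
    most likely, which may still contradict the retrieved passage.
    """
    return max(logits, key=logits.get)
```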
Embedding databases can be and are widely deployed to solve aging issues by providing models with more recent information than what is encoded in their weights, but this in turn can actually yield greater hallucination as the newer information conflicts ever more strongly with the token sequences learned from their training data.
In other words, even when LLMs are asked merely to summarize text from an embedding search, rather than answer directly from their training data, they are still, at the end of the day, using that training data as a "lens" through which to interpret the embedding results, and thus remain entirely dependent on the degree of prompt-training match to control hallucination.
In short, there are no quick fixes to LLM hallucination. The only approach we have found to date that works in real-world deployments is to compute the embedding similarity between the source text and multiple candidate summarizations and select the one closest to the original text.
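A minimal sketch of that selection step is below, assuming the source text and each candidate summary have already been embedded with the same embedding model; the helper names are illustrative, not a library API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_most_faithful(source_vec: np.ndarray, candidates: list[str],
                       candidate_vecs: list[np.ndarray]) -> str:
    """Return the candidate summary whose embedding lies closest to the
    source text, on the premise that a summary that drifts far from the
    source in embedding space is more likely to contain hallucinated content."""
    scores = [cosine_similarity(source_vec, vec) for vec in candidate_vecs]
    return candidates[int(np.argmax(scores))]
```

The extra inference calls needed to generate several candidate summaries are the price of this reranking step.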