There is a lot of confusion and misunderstanding about what "hallucination" is in large language models (LLMs), how it can impact downstream applications, and how it might be mitigated. Today we'll take a very brief look at the basics.
At its most basic level, hallucination in LLMs refers to the fact that, for any given input, a model can output falsehoods that were not present in its input or its training data.
It is important to draw a distinction between hallucination and bias or learned falsehoods. Ask many LLMs today to tell you a story about a doctor and a nurse, and the story will typically describe a highly skilled male doctor and a sweet and compassionate but less competent female nurse who falls in love with the doctor. This is not true hallucination, in that many of these models have been trained on datasets whose outsized historical representations reinforce these stereotypes. Similarly, ask an LLM to tell you about Covid-19 vaccines. Some will include elements of various conspiracy stories that proliferated during the pandemic and were present in their training data. This too is not hallucination, but rather an artifact of the simple fact that many LLM vendors trained their models by ingesting vast web-scale crawls of the internet that included large quantities of highly questionable material.
Hallucination is also not the same as going beyond the constraints of the input even when instructed not to. Given a prompt such as "Summarize this government press release without using any information not present in the article" and a release that references the war in Ukraine, the resulting summary might include additional details: that Ukraine was invaded by Russia in February 2022 at the direction of President Putin, and that US President Biden has directed large volumes of weaponry and funding to Ukraine to help it. None of these details might be present in the article itself, yet the LLM, despite being told to restrict itself to only what it finds in the article, may still include them. This is not strictly hallucination either, but rather a form of "constraint failure" in which the LLM fails to observe the constraints specified in its prompt and incorporates additional world knowledge from its underlying knowledge store.
Rather, hallucination is a model fabricating information that it has neither seen nor been given, or rearranging real information into a false context. For example, asked to summarize a US Government press release about American military support for Ukraine, it might summarize that the US military is providing large volumes of weapons and support to Russia. Asked to transcribe a NATO policy statement condemning Russia's invasion, it might instead transcribe it as lauding Russia's invasion and offering Putin NATO's complete and unwavering support. Asked to summarize a news broadcast about a murder, it might substitute different people's names as the suspect and victim and change the location and details. Asked to provide a biography of a person, it might falsely claim they are accused of crimes or ascribe titles, degrees and other false biographical details to them. This is hallucination.
What causes hallucination? Despite popular anthropomorphization of LLMs as human-like entities capable of advanced reasoning, thought and beliefs, LLMs are in reality simply autocompletion engines: nothing more, nothing less. This means they don't actually "understand" or "reason" about their inputs and training data; they merely exploit statistical correlations. When presented with the phrase "the miner found a golden…" an LLM uses the statistical probabilities induced from its vast training dataset to determine that the most likely word to complete that sentence is "nugget" rather than "retriever". In contrast, "the golden _ chased the squirrel through the yard" is much more likely to involve a "retriever" than a "nugget". This is how LLMs today "reason": by using word probabilities and writing their outputs word by word in linear fashion.
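To make this concrete, here is a minimal toy sketch of greedy next-word selection. The probability table is entirely invented for illustration; a real LLM induces these statistics over an enormous vocabulary from web-scale training data and operates on tokens rather than whole words.

```python
# Toy illustration of next-word prediction: the "model" is just a hand-built
# table of invented conditional probabilities, standing in for the statistics
# a real LLM induces from its training data.
NEXT_WORD_PROBS = {
    "the miner found a golden": {"nugget": 0.83, "ticket": 0.10, "ring": 0.05, "retriever": 0.02},
    "the golden":               {"retriever": 0.61, "gate": 0.18, "ratio": 0.12, "nugget": 0.09},
}

def complete_greedily(context: str) -> str:
    """Return the single most probable next word for a known context."""
    probs = NEXT_WORD_PROBS[context]
    return max(probs, key=probs.get)

print(complete_greedily("the miner found a golden"))  # -> nugget
print(complete_greedily("the golden"))                # -> retriever
```

An actual model repeats this selection step once per output token, feeding each chosen token back in as context for the next.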
For those interested in a simplified guide to the underlying mechanics of how these probabilities work and a glimpse at just how quickly even toy models trained on miniature datasets can produce fluent results, Stephen Wolfram's introduction is a great start.
How then can we mitigate hallucinations? Unfortunately, this is where there is a huge amount of false information circulating.
Most LLMs have several parameters that can be adjusted to control their outputs. Typically these knobs are not exposed in their public-facing consumer-friendly websites, but they are available in various forms in their commercial APIs. The first is "temperature", which can be thought of as the "creativity" setting of the model. When selecting each output token, the temperature controls whether the model always picks the most probable next token or how randomly it samples among the candidates. A temperature of 0 means it will always select the most probable token, while higher temperatures give it greater degrees of randomness.
Here are the definitions used by GCP's Bison (PaLM 2) for temperature and token selection:
- Temperature. The temperature is used for sampling during the response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic: the highest probability response is always selected. For most use cases, try starting with a temperature of 0.2.
- Top K. Top-k changes how the model selects tokens for output. A top-k of 1 means the selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-k of 3 means that the next token is selected from among the 3 most probable tokens (using temperature). For each token selection step, the top K tokens with the highest probabilities are sampled. Then tokens are further filtered based on topP with the final token selected using temperature sampling. Specify a lower value for less random responses and a higher value for more random responses.
- Top P. Top-p changes how the model selects tokens for output. Tokens are selected from the most probable (see the topK parameter) to the least probable until the sum of their probabilities equals the top-p value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the top-p value is 0.5, then the model will select either A or B as the next token (using temperature) and doesn't consider C. The default top-p value is 0.95. Specify a lower value for less random responses and a higher value for more random responses.
Unfortunately, many LLM guides will falsely claim that setting temperature to 0 will eliminate hallucination, under the incorrect assumption that hallucination stems from the intensity of randomness or "creativity" of the model. In fact, setting temperature to 0 often increases hallucination by removing the model's flexibility to escape high-probability but low-relevance phrasal assemblies. The reality is that temperature only controls how deterministic the model's output is. A run with a temperature of 0.0 will produce the exact same results every single time, while a temperature of 0.99 will typically yield wildly different results each run.
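To make these definitions concrete, here is a minimal sketch of the top-k, then top-p, then temperature selection pipeline described above, using an invented five-word vocabulary and made-up probabilities (a real model applies this over a vocabulary of tens of thousands of tokens):

```python
import numpy as np

def select_next_token(probs: dict, top_k: int, top_p: float, temperature: float,
                      rng: np.random.Generator) -> str:
    """Select one token from a next-token distribution using the
    top-k -> top-p -> temperature pipeline described in the definitions above.
    `probs` maps candidate token -> model probability."""
    # 1. Top-k: keep only the top_k most probable candidates.
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # 2. Top-p: keep the smallest prefix of that list whose cumulative
    #    probability reaches top_p (nucleus filtering).
    kept, cumulative = [], 0.0
    for token, p in items:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens = [t for t, _ in kept]
    weights = np.array([w for _, w in kept])

    # 3. Temperature: 0 is greedy/deterministic (always the most probable
    #    survivor); higher values flatten the distribution, adding randomness.
    if temperature == 0:
        return tokens[int(np.argmax(weights))]
    scaled = np.log(weights) / temperature
    scaled = np.exp(scaled - scaled.max())
    scaled /= scaled.sum()
    return tokens[rng.choice(len(tokens), p=scaled)]

rng = np.random.default_rng(0)
next_token_probs = {"nugget": 0.55, "ticket": 0.20, "ring": 0.15,
                    "retriever": 0.07, "goose": 0.03}   # invented numbers
print(select_next_token(next_token_probs, top_k=3, top_p=0.9, temperature=0.0, rng=rng))
print(select_next_token(next_token_probs, top_k=3, top_p=0.9, temperature=0.8, rng=rng))
```

Run with temperature=0.0 the sketch returns "nugget" on every call; with temperature=0.8 the selection varies from run to run. Nothing in this pipeline consults facts, which is why determinism and factual accuracy are orthogonal.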
Another common falsehood is that using embedding-based semantic search as an external LLM memory can eliminate hallucination. The idea here is that instead of asking the LLM to answer a question from its own training data, it is provided input text containing the answer and asked to draw its response from that input alone. In other words, when asked "Who is the current president of the US?", the model must ordinarily answer from its own training data, which may be outdated and yield the wrong response. Instead, the question is converted to an embedding and a vector search (typically ANN for performance reasons) is used to identify passages from a live authoritative dataset, which are then concatenated together and given to the LLM as input, and the model is told to answer the question only from the provided text. The theory is that by asking the LLM merely to summarize the provided text, rather than answer from its own embedded knowledge, it will no longer hallucinate. Unfortunately, this workflow still requires the LLM to rely upon its learned probabilities for summarization/distillation, meaning it will still readily hallucinate. Given a prompt "Who is the current US president? Use only the text provided below." and input "US President Joe Biden yesterday announced a new trade policy", there is still a non-zero chance that the LLM will respond with something else, such as "Donald Trump" or even "Putin".
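Here is a minimal sketch of that retrieval-augmented workflow. The embed function is a toy hashing placeholder standing in for whatever real embedding model or API is used, the document set is invented, and the brute-force cosine search stands in for the ANN index a production system would use:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy placeholder embedding (hashed bag of words); a real system would
    call an actual embedding model or API here."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Invented stand-in for a live authoritative dataset of candidate passages.
documents = [
    "US President Joe Biden yesterday announced a new trade policy.",
    "NATO issued a statement on enlargement following its latest summit.",
    "The Federal Reserve left interest rates unchanged this quarter.",
]

question = "Who is the current president of the US?"

# Brute-force nearest-neighbor search by cosine similarity over the corpus.
doc_vectors = np.stack([embed(d) for d in documents])
best_passage = documents[int(np.argmax(doc_vectors @ embed(question)))]

# The retrieved passage is concatenated into the prompt and the LLM is told
# to answer only from it; the generation step itself can still hallucinate.
prompt = (
    "Answer the question using ONLY the text provided below.\n\n"
    f"TEXT: {best_passage}\n\nQUESTION: {question}"
)
print(prompt)
```

The retrieval step is a straightforward nearest-neighbor lookup; it is the final generation step, where the LLM summarizes or restates the retrieved text, that reintroduces the learned probabilities and with them the possibility of hallucination.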
Strangely, we've observed even top consulting firms falsely claim to their clients that setting temperature to 0.0 and using embedding-based memories can entirely eliminate hallucination, showing just how prevalent misinformation is in the LLM space right now and the lack of at-scale real-world experience with LLMs that many AI consultants have today.
Careful prompt engineering can decrease hallucination in some cases in theory, but in practice we find that LLMs readily escape their constraints. For example, an LLM instructed to use "only the information provided below" will readily ignore that instruction. Model-specific workarounds typically break over time as models are constantly updated.
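For illustration, a constraint-style prompt of the kind described above might be assembled like this; the wording is hypothetical, and the key point is that nothing in it mechanically prevents the model from drawing on outside knowledge:

```python
# Hypothetical constraint-style prompt; the article text is a placeholder.
article = "..."  # the source text the model is told to restrict itself to

prompt = (
    "Summarize the article below. Use ONLY the information provided in the "
    "article and do not add any outside knowledge or additional details.\n\n"
    f"ARTICLE:\n{article}"
)

# In practice, models frequently violate instructions like this and blend in
# world knowledge from their training data, so outputs still need review.
```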
Unfortunately, there is no way to completely eliminate hallucination from current generation LLMs. One approach that can reduce its impact is to run each prompt multiple times and use embedding models to rank the results, choosing the one scored most similar to the prompt (see the sketch after this paragraph). The problem is that embedding models themselves encode the biases of their training datasets, with several major commercial models encoding that white males are vastly more relevant to the concept of "CEO" than African American females and so on, meaning that such scoring workflows will still encode significant biases. Worse, even purpose-built models designed for specific use cases may not yield the desired output. For example, "bitext retrieval" models are embedding models designed for a single task: given an input passage in one language, find an exact translation match in another language. Unfortunately, these models were often trained on neural machine translation (NMT) output, either purposely (to expand their training datasets) or inadvertently (due to NMT pollution of the open web), and thus will systematically score stilted literal NMT translations as significantly higher quality than more fluent LLM translations or even professional human translations. Thus, embedding-based quality ranking is not without its own limitations and biases.
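As a sketch of the run-multiple-times-and-rank approach described at the start of the paragraph above: the embed function below is again a toy placeholder for a real embedding model, the candidate completions are invented, and, as noted, any biases in the real embedding model carry directly into the scores.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy placeholder embedding (hashed bag of words); a real system would
    call an actual embedding model here, inheriting its biases."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def rank_completions(prompt: str, completions: list[str]) -> list[tuple[float, str]]:
    """Score each candidate completion by cosine similarity to the prompt and
    return them best-first; the top-ranked completion is the one kept."""
    p = embed(prompt)
    scored = [(float(embed(c) @ p), c) for c in completions]
    return sorted(scored, reverse=True)

prompt = "Summarize the press release about US military aid to Ukraine."
completions = [
    "The US is sending additional weapons and funding to Ukraine.",  # e.g. run 1
    "The US is sending additional weapons and funding to Russia.",   # e.g. run 2
]
for score, text in rank_completions(prompt, completions):
    print(f"{score:.3f}  {text}")
```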
In the end, there is no way to eliminate hallucination, and as our own stream of demos here throughout this year has demonstrated, even under the most ideal of scenarios hallucinations can manifest at random, with existential impacts on downstream applications.
For a deeper dive on hallucinations, their impacts and mitigation strategies, see some of our past entries in our LLM series or reach out. We have a number of avenues through which we can work with your company to help understand both the promise and pitfalls of LLMs for your business and industry.