We recently shared a sneak peek at RAG-Verification (RAG-V), the system we’ve built to detect and correct hallucinations from large language models (LLMs). We’ve been quietly working on it for six months, and it wasn’t clear that our goal of reducing the rate of hallucination by 100x was even achievable. But we pulled it off.
Today I’d like to share some of the nitty-gritty data science details. First up: What counts as a hallucination?
The term “hallucination” originally described what happened when an LLM was asked a question without being grounded in retrieved data: the model would creatively invent facts out of thin air. Most of the solution to that problem is to first retrieve data relevant to the user’s question, then explicitly instruct the model to generate an answer based only on that data. This paradigm has come to be called Retrieval-Augmented Generation (RAG).
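In code, the pattern is simple. Here’s a minimal sketch of the idea, not our production pipeline; `search_index` and `llm_complete` are placeholders for whatever retriever and model client you actually use:

```python
# A minimal sketch of the RAG pattern. `search_index` and `llm_complete` are
# placeholders for a real retriever and model client.

def answer_with_rag(question: str, search_index, llm_complete, k: int = 5) -> str:
    # 1. Retrieve passages relevant to the user's question.
    passages = search_index(question, top_k=k)

    # 2. Number the passages so the model can cite them as [1], [2], ...
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))

    # 3. Instruct the model to answer only from the retrieved passages.
    prompt = (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```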
But even with RAG, today’s LLMs make factual errors. (We continue to call these errors “hallucinations” or “confabulations”, even if the incorrect information is drawn from the retrieved data.)
How bad is the error rate? It surprises even most AI practitioners: state-of-the-art LLMs such as GPT-4 and Sonnet 3.5 have an error rate between 5% and 10%, depending on the data and the questions. That means a user must remain hyper-vigilant, scrutinizing every generated answer and doing their own fact-checking in real time.
A 10% error rate might be acceptable for casual users of products like ChatGPT, but not for defense or intelligence analysts. They need reliable AI that can detect and correct its own errors. That’s why we set out to build the RAG-Verification pipeline.
So what do these errors actually look like?
The first thing we learned on this journey is that the line between “True” and “False” is fuzzy. Here’s a real example:
question: Are UK special forces operating in Ukraine?
retrieved sources:
[1] Leaked US military documents indicate that the UK has deployed as many as 50 special forces to Ukraine. The documents suggest that more than half of the western special forces personnel present in Ukraine between February and March this year may have been British. It is unclear what activities the special forces may have been engaged in or whether the numbers of personnel have been maintained at this level. (11 April 2023)
[2] The FSB alleges it thwarted an attempt by British special forces to facilitate a landing of Ukrainian sabotage troops on the Tendrov Split, a strategic sandbar located in the Black Sea. The agency claims to have apprehended a senior figure within the Ukrainian naval special forces, providing his name and date of birth as proof of the operation’s existence and the SBS’s direct involvement. Britain’s Defence Ministry has yet to respond to these allegations. (12 April 2024)
This is a correct answer, fully grounded in the retrieved data:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that British special forces assisted a failed operation by Ukrainian special forces, however Britain’s Defence Ministry has not responded.[2]
And here is an obviously incorrect answer in which the entities have been scrambled, a typical RAG failure mode:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that Ukrainian special forces assisted a failed operation by British special forces, however Britain’s Defence Ministry has not responded.[2]
We call this a “hard fail” because it changes what is being claimed in the answer, introducing a factual inconsistency with the retrieved data.
But there is another type of error that we call a “soft fail”:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that British special forces assisted a failed operation by the SBS, Ukraine’s special forces, however Britain’s Defence Ministry has not responded.[2]
The RAG-V pipeline is strict. It’s looking for any factual inconsistencies between the generated answer and the cited source data. In this case, the model correctly infers that the SBS is indeed a name for the Ukrainian special forces. But RAG-V catches that the source data does not explicitly spell this out.
Are soft fails a problem then? Not in this case, but they can be a problem on customer data far outside the domain of the internet data that LLMs were trained on. Jargon in military documents or biomedical reports can easily confuse a human expert, let alone an LLM. We prefer our RAG system to be conservative, sticking to the available information without assumptions.
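To make the hard/soft distinction concrete, here’s a simplified sketch of what a strict, claim-level check looks like. This is illustrative only, not the actual RAG-V pipeline: the claim splitting is deliberately naive, and `llm_judge` stands in for whatever judge model does the comparison.

```python
# Illustrative sketch of strict, claim-level verification (not the actual
# RAG-V pipeline). Each cited claim in the answer is checked against the
# source it cites; anything not explicitly supported is flagged.

import re
from dataclasses import dataclass


@dataclass
class Finding:
    claim: str
    source_id: int
    verdict: str  # "supported", "contradicted" (hard fail), or "unsupported" (soft fail)


def verify_answer(answer: str, sources: dict[int, str], llm_judge) -> list[Finding]:
    """Check every cited claim in `answer` against the source it cites."""
    findings = []
    # Naive claim splitting: treat each sentence ending in a citation like "[2]"
    # as one claim. A real pipeline would decompose claims far more carefully.
    for sentence in re.findall(r"[^.]+\.\[\d+\]", answer):
        source_id = int(re.search(r"\[(\d+)\]$", sentence).group(1))
        claim = re.sub(r"\.\[\d+\]$", ".", sentence).strip()
        # `llm_judge` is a stand-in for a judge model that answers whether the
        # cited source *explicitly* states the claim.
        verdict = llm_judge(claim=claim, source=sources[source_id])
        findings.append(Finding(claim=claim, source_id=source_id, verdict=verdict))
    return findings
```

In this framing, a “contradicted” verdict is a hard fail, while “unsupported” is a soft fail: the claim may well be true, but the cited source never explicitly states it.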
As we’ve built RAG-V, we’ve needed a north star that tells us whether any change to the system affects not only the rate of error correction but also the ratio of hard to soft fails. So we built a comprehensive benchmark dataset for RAG error detection, with hand-curated errors drawn from a fixed ontology of error types. We’ll be sharing more about that very soon. Stay tuned!
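In the meantime, to give a rough flavor, a benchmark record might look something like this. The error types shown here are illustrative examples, not the final taxonomy:

```python
# Illustrative shape of a benchmark record for RAG error detection. The error
# types below are examples only, not the fixed ontology we actually use.

from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    NONE = "none"                                     # fully grounded answer
    ENTITY_SWAP = "entity_swap"                       # hard fail: entities scrambled
    CONTRADICTION = "contradiction"                   # hard fail: claim conflicts with a source
    UNSUPPORTED_INFERENCE = "unsupported_inference"   # soft fail: plausible but never stated


@dataclass
class BenchmarkExample:
    question: str
    sources: list[str]
    answer: str          # answer with a hand-curated error inserted (or none)
    error_type: ErrorType
    is_hard_fail: bool
```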
* In fact, we inserted a factual inconsistency above. Can you spot it?