We recently shared a sneak peek at RAG-Verification (RAG-V), the system we’ve built to detect and correct hallucinations from large language models (LLMs). We’ve been quietly working on it for six months, and it wasn’t clear that our goal of reducing the rate of hallucination by 100x was even achievable. But we pulled it off.
Today I’d like to share some of the nitty-gritty data science details. First up: What counts as a hallucination?
The term “hallucination” originally described what happened when an LLM was asked a question without being grounded in retrieved data: the model would creatively invent facts out of thin air. Most of the solution to that problem is to first retrieve data relevant to the user’s question, then explicitly instruct the model to generate an answer based only on that data. This paradigm has come to be called Retrieval-Augmented Generation (RAG).
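In code, the pattern is simple. Here’s a minimal sketch of the idea, not our production pipeline; `search_index` and `llm_complete` are placeholders for whatever retriever and model client you actually use:

```python
# A minimal sketch of the RAG pattern. `search_index` and `llm_complete` are
# placeholders for a real retriever and model client.

def answer_with_rag(question: str, search_index, llm_complete, k: int = 5) -> str:
    # 1. Retrieve passages relevant to the user's question.
    passages = search_index(question, top_k=k)

    # 2. Number the passages so the model can cite them as [1], [2], ...
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))

    # 3. Instruct the model to answer only from the retrieved passages.
    prompt = (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```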
But even with RAG, today’s LLMs make factual errors. (We continue to call these errors “hallucinations” or “confabulations”, even if the incorrect information is drawn from the retrieved data.)
How bad is the error rate? It surprises even most AI practitioners: state-of-the-art LLMs such as GPT-4 and Sonnet 3.5 have an error rate between 5% and 10%, depending on the data and the questions. That means a user must remain hyper-vigilant, scrutinizing every generated answer and doing their own fact-checking in real time.
A 10% error rate might be acceptable for casual users of products like ChatGPT, but not for defense or intelligence analysts. They need reliable AI that can detect and correct its own errors. That’s why we set out to build the RAG-Verification pipeline.
So what do these errors actually look like?
The first thing we learned on this journey is that the line between “True” and “False” is fuzzy. Here’s a real example:
question: Are UK special forces operating in Ukraine?
retrieved sources:
[1] Leaked US military documents indicate that the UK has deployed as many as 50 special forces to Ukraine. The documents suggest that more than half of the western special forces personnel present in Ukraine between February and March this year may have been British. It is unclear what activities the special forces may have been engaged in or whether the numbers of personnel have been maintained at this level. (11 April 2023)
[2] The FSB alleges it thwarted an attempt by British special forces to facilitate a landing of Ukrainian sabotage troops on the Tendrov Split, a strategic sandbar located in the Black Sea. The agency claims to have apprehended a senior figure within the Ukrainian naval special forces, providing his name and date of birth as proof of the operation’s existence and the SBS’s direct involvement. Britain’s Defence Ministry has yet to respond to these allegations. (12 April 2024)
This is a correct answer, fully grounded in the retrieved data:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that British special forces assisted a failed operation by Ukrainian special forces, however Britain’s Defence Ministry has not responded.[2]
And here is an obviously incorrect answer in which the entities have been scrambled, a typical RAG failure mode:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that Ukrainian special forces assisted a failed operation by British special forces, however Britain’s Defence Ministry has not responded.[2]
We call this a “hard fail” because it changes what is being claimed in the answer, introducing a factual inconsistency with the retrieved data.
But there is another type of error that we call a “soft fail”:
Leaked US military documents indicated the possible presence of up to 50 UK special forces personnel in Ukraine between February and March of 2023.[1] The FSB alleged that British special forces assisted a failed operation by the SBS, Ukraine’s special forces, however Britain’s Defence Ministry has not responded.[2]
The RAG-V pipeline is strict. It’s looking for any factual inconsistencies between the generated answer and the cited source data. In this case, the model correctly infers that the SBS is indeed a name for the Ukrainian special forces. But RAG-V catches that the source data does not explicitly spell this out.
Are soft fails a problem then? Not in this case, but they can be a problem on customer data far outside the domain of the internet data that LLMs were trained on. Jargon in military documents or biomedical reports can easily confuse a human expert, let alone an LLM. We prefer our RAG system to be conservative, sticking to the available information without assumptions.
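To make the hard/soft distinction concrete, here’s a simplified sketch of what a strict, claim-level check looks like. This is illustrative only, not the actual RAG-V pipeline: the claim splitting is deliberately naive, and `llm_judge` stands in for whatever judge model does the comparison.

```python
# Illustrative sketch of strict, claim-level verification (not the actual
# RAG-V pipeline). Each cited claim in the answer is checked against the
# source it cites; anything not explicitly supported is flagged.

import re
from dataclasses import dataclass


@dataclass
class Finding:
    claim: str
    source_id: int
    verdict: str  # "supported", "contradicted" (hard fail), or "unsupported" (soft fail)


def verify_answer(answer: str, sources: dict[int, str], llm_judge) -> list[Finding]:
    """Check every cited claim in `answer` against the source it cites."""
    findings = []
    # Naive claim splitting: treat each sentence ending in a citation like "[2]"
    # as one claim. A real pipeline would decompose claims far more carefully.
    for sentence in re.findall(r"[^.]+\.\[\d+\]", answer):
        source_id = int(re.search(r"\[(\d+)\]$", sentence).group(1))
        claim = re.sub(r"\.\[\d+\]$", ".", sentence).strip()
        # `llm_judge` is a stand-in for a judge model that answers whether the
        # cited source *explicitly* states the claim.
        verdict = llm_judge(claim=claim, source=sources[source_id])
        findings.append(Finding(claim=claim, source_id=source_id, verdict=verdict))
    return findings
```

In this framing, a “contradicted” verdict is a hard fail, while “unsupported” is a soft fail: the claim may well be true, but the cited source never explicitly states it.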
As we’ve built RAG-V, we’ve needed a north star that tells us whether any change to the system affects not only the rate of error correction but also the ratio of hard to soft fails. So we built a comprehensive benchmark dataset for RAG error detection, with hand-curated errors drawn from a fixed ontology of error types. We’ll be sharing more about that very soon. Stay tuned!
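In the meantime, to give a rough flavor, a benchmark record might look something like this. The error types shown here are illustrative examples, not the final taxonomy:

```python
# Illustrative shape of a benchmark record for RAG error detection. The error
# types below are examples only, not the fixed ontology we actually use.

from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    NONE = "none"                                     # fully grounded answer
    ENTITY_SWAP = "entity_swap"                       # hard fail: entities scrambled
    CONTRADICTION = "contradiction"                   # hard fail: claim conflicts with a source
    UNSUPPORTED_INFERENCE = "unsupported_inference"   # soft fail: plausible but never stated


@dataclass
class BenchmarkExample:
    question: str
    sources: list[str]
    answer: str          # answer with a hand-curated error inserted (or none)
    error_type: ErrorType
    is_hard_fail: bool
```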
* In fact, we inserted a factual inconsistency above. Can you spot it?