RAG-V: the future of trustworthy AI

Envision tapping into your data’s full potential with AI advanced enough to transform how you operate. Deploying LLMs alongside Retrieval-Augmented Generation (RAG) isn’t just a step forward; it’s a game-changer. At Primer, we’ve experienced just how transformative Generative AI can be when paired with proprietary, internal data.

We’ve seen this in our own day-to-day work. Beyond using Copilot to jumpstart code, we’ve integrated LLMs into Slack channels and Confluence documentation so we can find, understand, and share what’s most critical more quickly.

But here’s the catch. While the initial results can seem astonishing, there’s an underlying complexity that most teams overlook: model drift, variance between models, and the challenge of maintaining accuracy over time. Our encounters with these issues in our own data have driven us to pioneer solutions that combine the power of probabilistic AI with the precision of traditional engineering.

The LLM as a modern VM: a technical analogy

Our libraries allow seamless shifts between commercial model APIs, and the qualitative differences across parameter sizes and vendors are significant and worth noting.
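
Below is a minimal sketch of what that kind of abstraction can look like, assuming the official openai (v1+) and anthropic Python SDKs with API keys set in the environment; the unified complete() wrapper is illustrative, not our production library interface.

```python
# Simplified illustration of swapping between commercial model APIs.
# Assumes the official `openai` (v1+) and `anthropic` Python SDKs and
# API keys in the environment. The `complete()` wrapper is a sketch,
# not our production interface.
from dataclasses import dataclass

@dataclass
class ChatResult:
    text: str
    model: str

def complete(prompt: str, provider: str, model: str) -> ChatResult:
    """Send one prompt to the chosen vendor behind a single interface."""
    if provider == "openai":
        from openai import OpenAI
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return ChatResult(resp.choices[0].message.content, model)
    if provider == "anthropic":
        import anthropic
        resp = anthropic.Anthropic().messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return ChatResult(resp.content[0].text, model)
    raise ValueError(f"unknown provider: {provider!r}")
```

Swapping vendors or parameter sizes then becomes a one-argument change, which makes those qualitative differences easy to observe side by side.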

To draw a simple analogy, an LLM functions much like a modern virtual machine (VM). But instead of Java running on a JVM, think of English being processed by GPT-4. While switching JVM versions is somewhat reliable, it’s not without risks. The key difference is that today’s LLM ‘VM’ is built from data, with decision-making paths shaped by the inputs and outputs it’s trained on. These models are subject to data shifts over time, effectively creating a new ‘VM’ with each release. 
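
One practical consequence: treat every model release as a new runtime and regression-test it before adopting it. Here is a hedged sketch, reusing the hypothetical complete() wrapper above, with OpenAI’s dated snapshot names standing in for “VM versions.”

```python
# Sketch of a drift regression test across pinned model snapshots.
# Exact string comparison is deliberately crude; in practice a semantic
# similarity or grading step would replace it.
GOLDEN_PROMPTS = [
    "In one sentence, what does the CAP theorem state?",
    "What year did the Apollo 11 mission land on the Moon?",
]

def detect_drift(old_model: str, new_model: str) -> list[str]:
    """Return the prompts whose answers changed between two releases."""
    drifted = []
    for prompt in GOLDEN_PROMPTS:
        old = complete(prompt, "openai", old_model).text.strip()
        new = complete(prompt, "openai", new_model).text.strip()
        if old != new:
            drifted.append(prompt)
    return drifted

# e.g. detect_drift("gpt-4-0613", "gpt-4-1106-preview")
```

Anything that drifts gets reviewed before the new snapshot is promoted.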

The U.S. Army has confronted this challenge head-on, weighing deterministic knowledge graphs against potentially compromised datasets. Rather than relying solely on machine learning to create a trustworthy “VM,” there’s a need for verified, curated knowledge graphs, with Overton windows closely monitored for shifts.

Merging engineering rigor with AI power

We believe there’s a middle ground: one that combines the rigor of traditional engineering with the power of probabilistic techniques. By investing in robust evaluation, validation, and correction algorithms, you can ensure your LLM and RAG pipeline are deployment-ready, with upstream versions locked down to prevent drift.
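
In practice, “locked down” can be as simple as pinning a dated model snapshot in configuration and gating every release on an evaluation run. A sketch follows; the accuracy threshold and the run_eval_suite hook are placeholders rather than our actual tooling.

```python
# Illustrative deployment gate: the model version is pinned so the
# upstream "VM" cannot silently change, and a failing evaluation run
# blocks the release. The threshold and eval hook are placeholders.
PINNED_MODEL = {"provider": "openai", "model": "gpt-4-0613"}  # dated snapshot
MIN_ACCURACY = 0.95  # illustrative bar; tune to your risk tolerance

def gate_release(run_eval_suite) -> None:
    accuracy = run_eval_suite(**PINNED_MODEL)
    if accuracy < MIN_ACCURACY:
        raise SystemExit(
            f"Eval accuracy {accuracy:.1%} is below {MIN_ACCURACY:.0%}; blocking release."
        )
    print(f"Eval accuracy {accuracy:.1%}; release approved.")
```

Pinning a dated snapshot turns silent upstream changes into explicit, reviewable upgrades.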

This goes beyond standard LLM evaluation techniques such as MMLU, EleutherAI’s LM Evaluation Harness, or adversarial chatbot arenas. These tests must run on your RAG pipeline and your proprietary data, against known truths. And yes, having a verified, curated knowledge graph is crucial here.
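
To make that concrete, here is a toy version of testing against known truths: a golden question set whose answers were verified against a curated knowledge graph, scored against whatever answering function your RAG pipeline exposes. Every name and entry below is an invented placeholder.

```python
from typing import Callable

# Toy golden set: questions paired with answers verified against the
# curated knowledge graph. These entries are invented placeholders.
GOLDEN_QA = [
    ("Which subsidiary filed the 2023 audit?", "Acme Analytics"),
    ("Who chairs the risk committee?", "D. Okafor"),
]

def eval_rag(answer_fn: Callable[[str], str]) -> float:
    """Fraction of golden questions answered correctly (lenient match)."""
    correct = 0
    for question, truth in GOLDEN_QA:
        correct += int(truth.lower() in answer_fn(question).lower())
    return correct / len(GOLDEN_QA)
```

The point is that the benchmark is built from your corpus and your verified answers, not from a public leaderboard.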

Introducing RAG-V: the future of trustworthy AI

We’re calling this concept RAG-V. We can’t improve what we can’t measure, so we’ve engineered a new scoring system for RAG+LLM accuracy and rigorously tested our software against it. We’ve also explored methods to correct inaccuracies, resulting in reliable insights, complete with references and explanations you can trust.
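
The details of our scoring system are outside the scope of this post, but the general shape of claim-level verification can be sketched: split an answer into claims, then score it by the fraction of claims that at least one retrieved reference supports. The is_supported entailment check below is a placeholder, not our implementation.

```python
# Illustrative only: this is NOT the RAG-V scoring system, just a toy
# claim-support score. An answer is split into sentence-level claims,
# and each claim must be backed by at least one retrieved reference via
# a placeholder `is_supported` entailment check.
from typing import Callable

def claim_support_score(
    answer: str,
    references: list[str],
    is_supported: Callable[[str, str], bool],
) -> float:
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 0.0
    backed = sum(
        any(is_supported(claim, ref) for ref in references) for claim in claims
    )
    return backed / len(claims)
```

A correction pass can then target exactly the unsupported claims, rather than regenerating the whole answer.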

Stay tuned for our presentation at the INSA Summit and our upcoming arxiv.org paper. We’ll also be unveiling new functionality in Q4 that balances deterministic and probabilistic approaches, delivering trusted and reliable AI.