While working on architectures to deliver AI-powered features, I have had many conversations about RAG. It is often treated as a mystical, magical creature that can suddenly make a Large Language Model (LLM) safer and smarter. There are many misconceptions and misplaced expectations around it, often coupled with the equally magical vector search.
RAG stands for Retrieval-Augmented Generation. In layman's terms, it's variable expansion: you add variables to a prompt and populate them (the Augmented in RAG) with data that you retrieved (the Retrieval in RAG) to lead the LLM to a better answer (the Generation in RAG). That's it. By populating variables in a prompt, you are creating context for the model, in addition to the expected user input. And yes, if we had just named it variable expansion, it would seem a lot less intimidating.
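Here is a minimal sketch of that "variable expansion" view of RAG. The data sources and the LLM call are stubbed out with hypothetical helpers (`fetch_customer_profile`, `fetch_order_history`, `call_llm`); in a real system they would be actual lookups and an actual model client.

```python
# A minimal sketch of RAG as "variable expansion".
# The data sources and the LLM call are stubs, not a specific library's API.

PROMPT_TEMPLATE = """You are a support assistant.

Customer profile:
{customer_profile}

Recent orders:
{order_history}

Customer question:
{question}
"""

def fetch_customer_profile(customer_id: str) -> str:
    # Stub standing in for a real CRM or database lookup.
    return f"id={customer_id}, name=Jane Doe, plan=premium"

def fetch_order_history(customer_id: str) -> str:
    # Stub standing in for a real order-history query.
    return "2024-03-01 running shoes; 2024-04-12 rain jacket"

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"(model answer based on a {len(prompt)}-character prompt)"

def answer(question: str, customer_id: str) -> str:
    # Retrieval: pull the data that will populate the prompt variables.
    profile = fetch_customer_profile(customer_id)
    orders = fetch_order_history(customer_id)

    # Augmentation: expand the variables to build the final prompt.
    prompt = PROMPT_TEMPLATE.format(
        customer_profile=profile,
        order_history=orders,
        question=question,
    )

    # Generation: the model answers with the retrieved context in front of it.
    return call_llm(prompt)

print(answer("Can I return the jacket?", "cust-42"))
```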
We need RAG because every call to an LLM is stateless: the model has no memory of what happened before, and we don't know how to create precise memory for an LLM. We can train it to get good at something, but we don't have a way to make it remember something without training it, and even with training, we're not sure it will remember what we want it to remember. So we add context to the prompt to simulate memory or knowledge, giving it a better basis to reason and to answer more accurately. In a sense, you have to generate and pass its memory every time you call it.
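A small sketch of what "passing its memory every time" looks like in practice: the conversation history is rebuilt into the prompt on every turn. `call_llm` is again a stub for a real model client.

```python
# The model keeps no state between calls, so the conversation history
# is rebuilt into the prompt on every turn.

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"(reply generated from a {len(prompt)}-character prompt)"

history: list[str] = []

def chat(user_message: str) -> str:
    history.append(f"User: {user_message}")

    # The whole conversation so far is passed in again; this is the model's "memory".
    prompt = "\n".join(history) + "\nAssistant:"
    reply = call_llm(prompt)

    history.append(f"Assistant: {reply}")
    return reply

chat("My order arrived damaged.")
chat("What are my options?")  # only "remembered" because we resend the first turn
```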
For example, you can use it to add the customer's buying history, the customer profile, reference documentation about the domain, past conversations, or the questions and answers of a test you want the model to grade. RAG is, simply put, how we give the model memory and knowledge so it can use them in its generation.
An LLM creates a rich, high-dimensional representation of a piece of content. That representation is what it uses to process, reason, and answer. It captures, in LLM terms, the "meaning" of the content, and it is stored as an array of numbers. A clever way to use this representation is to index it and compare it with the representations of other content.
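To make the "array of numbers" concrete, here is a toy illustration of comparing such representations with cosine similarity. The three-element vectors are made up for readability; a real embedding model would produce hundreds to thousands of dimensions.

```python
import numpy as np

# Toy illustration: imagine these small arrays came out of an embedding model
# (real models produce hundreds to thousands of dimensions, not three).
password_reset   = np.array([0.9, 0.1, 0.0])  # "How do I reset my password?"
lost_credentials = np.array([0.8, 0.2, 0.1])  # "I forgot my login credentials"
sales_report     = np.array([0.0, 0.1, 0.9])  # "The quarterly sales report is due Friday"

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity(password_reset, lost_credentials))  # close to 1: similar meaning
print(cosine_similarity(password_reset, sales_report))      # close to 0: unrelated content
```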
The result of this (called vector search) is highly accurate similarity based on meaning, which we could call semantic search. Think full-text search on steroids: where full-text search relies on a handful of signals per token, an LLM-derived embedding encodes meaning in hundreds, sometimes more than a thousand, dimensions. So it's a very good semantic search, in the true sense of the word. But it's only suitable for this precise use case: finding similar content. It was initially used a lot to index content in chunks to accommodate the small input windows of LLMs. Now that LLMs have dramatically expanded their input windows, the strategy is different.
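The mechanics of that chunk-based vector search fit in a few lines. In this sketch, `embed()` is a trivial bag-of-words stand-in so the example runs on its own; a real system would call an embedding model and use a vector index rather than a brute-force scan.

```python
import numpy as np

# Sketch of vector search over document chunks: embed everything once, then
# rank chunks by cosine similarity to the query.

VOCAB = ["refund", "warranty", "shipping", "days", "defect", "europe"]

def embed(text: str) -> np.ndarray:
    # Trivial stand-in for a real embedding model.
    words = text.lower().split()
    return np.array([float(sum(w.startswith(v) for w in words)) for v in VOCAB])

chunks = [
    "Refunds are processed within 5 business days.",
    "Our warranty covers manufacturing defects for two years.",
    "Shipping to Europe takes 3 to 7 days.",
]
chunk_matrix = np.array([embed(c) for c in chunks])

def top_k(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Normalize so the dot product becomes cosine similarity.
    norms = np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(q)
    scores = chunk_matrix @ q / np.where(norms == 0, 1.0, norms)
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

print(top_k("How long do refunds take?"))  # the refund chunk ranks first
```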
When implementing RAG, it’s important to go back to the business need and desired outcome: What do I want the model to know when performing this task or answering that question to make it more accurate? Based on this, what data do I have available, in what format, and how do I retrieve it based on the context of the task?
The key to efficient RAG is a good definition of the context, and then a way to build that context so the LLM is highly relevant. But it's not all vector search; I would even argue that vector search is only a small part of the retrieval done for RAG. For example, to create a customer context you might need a query in Salesforce and a query in your support system, then a textual representation of the results that you feed to the LLM alongside your query. Vector search is only useful when you need powerful similarity search; in many cases (customerId, documentId, material used, drug prescribed, customer age, and pretty much anything you put in a database) you absolutely don't want similarity, you want a precise match.
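A sketch of that kind of retrieval, with no vector search involved: exact-match lookups against business systems, flattened into text the model can use. The two query helpers here are stubs standing in for real Salesforce and support-system calls; the field names are illustrative, not an actual schema.

```python
# Retrieval without vector search: precise-match queries by customer ID,
# turned into a textual context for the prompt.

def salesforce_get_account(customer_id: str) -> dict:
    # Stub for a real CRM query.
    return {"name": "Acme Corp", "tier": "enterprise", "renewal_date": "2025-01-31"}

def support_get_tickets(customer_id: str, limit: int = 5) -> list[dict]:
    # Stub for a real support-system query.
    return [
        {"status": "open", "subject": "SSO login fails for new users"},
        {"status": "closed", "subject": "Invoice PDF missing line items"},
    ]

def build_customer_context(customer_id: str) -> str:
    account = salesforce_get_account(customer_id)        # precise match by ID, not similarity
    tickets = support_get_tickets(customer_id, limit=5)  # precise match by ID, not similarity

    lines = [
        f"Customer: {account['name']} (tier: {account['tier']})",
        f"Contract renewal date: {account['renewal_date']}",
        "Recent support tickets:",
    ]
    lines += [f"- [{t['status']}] {t['subject']}" for t in tickets]
    return "\n".join(lines)

prompt = f"""{build_customer_context('cust-42')}

Using the context above, draft a reply to the customer's latest message.
"""
print(prompt)
```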
So I think when defining a RAG strategy, it is really important not to focus on the tools (which vector database do I use?) but on the context for each task, how to retrieve it, and how to represent that context in text so the LLM can use it efficiently and get the knowledge it needs to answer the query.
RAG is a great approach to make LLMs more relevant, and every time you add variables to your prompt, you're doing RAG! 🙂
PS: There are many other aspects to creating and deploying an efficient RAG strategy using vector search, namely how to chunk (or not to chunk), how to apply and maintain security, how to ensure consistency across requests, etc. We'll cover some of these in future blog posts.