With insights from:
Silvan Melchior, Lead Data Scientist, silvan.melchior@zuehlke.com
Dr. Gabriel Krummenacher, Head of Data Science, gabriel.krummenacher@zuehlke.com

What is RAG in AI?

Retrieval-augmented generation (RAG) is the enabling technique behind Q&A chatbot systems, one of the leading use cases for GenAI across a host of industries.

The RAG technique expands the capabilities of a large language model (LLM) by combining it with a retrieval system. This enables the model to fetch external knowledge beyond its training data. In this way, RAG enables an LLM to provide more informed, accurate, and up-to-date responses to user queries, expanding the model’s knowledge without the need for fine-tuning.

A key ingredient in many LLM-powered products like Microsoft Copilot, RAG brings a number of benefits to LLMs:

- Improves the accuracy and factual correctness of responses
- Reduces hallucinations (where LLMs make up information they lack knowledge of)
- Keeps information current without fine-tuning or retraining the model
- Provides context for the outputs of your LLM
- Improves transparency and trust by enabling source citation

It’s no surprise, then, that we’ve seen huge appetite for RAG-based GenAI applications across a host of industries and use cases, from streamlining internal processes to enhancing customer service and customer satisfaction.

Here at Zühlke, we’ve used RAG to help a manufacturer save thousands of hours of manual work. We helped an insurer halve the time it takes to retrieve tariff information to support customer queries. We helped a bank improve response accuracy with an LLM-powered tool. And we enabled another to develop, test, and validate RAG-enabled solutions before they’re scaled across the business.

Figure: An illustration of RAG system use cases across different industries

Of course, the quality of outputs you get from these RAG AI models depends on the information they can access.
While most early GenAI experiments focused on using external public data, businesses are recognising that giving models increased access to proprietary and sensitive data can unlock more complex, value-driving use cases.

The limitations of RAG in off-the-shelf products

While off-the-shelf products can provide value for companies in some scenarios, they can fall short on a few fronts:

- An off-the-shelf solution may produce unsatisfactory outcomes, including incorrect responses, particularly if the application involves more complex requirements such as domain-specific data or inherent structures within the data.
- The data may reside in a location that’s inaccessible to standard, off-the-shelf tools due to technical or legal constraints, such as proprietary software environments or on-premises infrastructure setups.
- The required data may be heterogeneous (think databases, diagrams, or forms), which could prevent an off-the-shelf tool from using it.
- The licensing scheme and cost structure may not align with the requirements of the organisation.

If you’re looking to address these constraints and advance value-driving use cases in your business, you’ll need to develop your own custom solution. This could mean extending an existing solution, such as the OpenAI Assistants API, or building something more bespoke, for example with open-source LLMs deployed on infrastructure of your choice.

Customising LLMs and combining them with a RAG system is no walk in the park, however. And to navigate implementation challenges effectively, we first need to understand the basic principles of RAG.

How does GenAI RAG work?

RAG combines an LLM with an information retrieval system. For every user question, the retrieval system is first used to find information that might answer the question. This information is then fed into the LLM, together with the user question, so the model can generate an informed response.
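The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the function names (`retrieve`, `build_prompt`, `answer`) and the prompt wording are hypothetical, the retrieval step uses simple word overlap instead of a real search index, and `llm` stands for any function that sends a prompt to a model and returns its reply.

```python
from typing import Callable, List


def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval: rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(document: str) -> int:
        return len(question_words & set(document.lower().split()))

    ranked = sorted(documents, key=overlap, reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine the retrieved passages and the user question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


def answer(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Retrieve first, then let the model generate from the augmented prompt."""
    passages = retrieve(question, documents)
    return llm(build_prompt(question, passages))


if __name__ == "__main__":
    docs = [
        "The premium tariff covers dental treatment abroad.",
        "Our office is closed on public holidays.",
        "The basic tariff excludes dental treatment.",
    ]
    # Stand-in for a real model call: it simply returns the prompt it received.
    print(answer("Which tariff covers dental treatment?", docs, lambda prompt: prompt))
```

Passing the model in as a plain function keeps the sketch independent of any particular LLM provider; in a real system, `retrieve` would query a search index rather than loop over every document.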
Figure: A retrieval-augmented generation diagram illustrating a basic RAG system

The retrieval system is usually built using embeddings. An embedding takes a snippet of text and maps it to a point in a mathematical vector space, in such a way that texts on related topics end up close together. By embedding the user query and the source texts in the same space, the system can search for and easily locate the information that’s likely to be most relevant to the query.

One important ingredient of this approach is so-called chunking. Here, we split our texts into smaller parts (chunks) of, for example, a few hundred words each. These chunks are then embedded separately, so we can search not just for a whole text but for the relevant parts of it.

The limitations of a basic RAG system

While this simple setup works surprisingly well in many situations, it falls short in others. Common challenges include:

- Embedding-based search often fails: While embeddings are great at capturing the meaning of synonyms and the like, they’re not perfect. For certain types of data, such as legal text or company names, some embedding models perform quite poorly. And as your information base grows, the likelihood of finding the correct chunks decreases.
- Chunks miss context: Even if the correct chunk is found, it is still only a small part of a wider text. The missing context around it might lead the LLM to interpret the content incorrectly.
- A one-shot approach prevents proper search: This kind of retrieval system has one chance of finding the correct information. If the user frames the query in an unusual manner, the system might fail to deliver a valuable output. And if the retrieved information raises a follow-up question, the system will likely struggle to answer it.
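To make the chunking and embedding search described in this section more concrete, here is a small, self-contained sketch. All names (`chunk_words`, `toy_embed`, `cosine`, `search`) are hypothetical, and the word-count vector is only a stand-in for a real embedding model: unlike a trained embedding, it cannot place synonyms close together. The chunk size and overlap defaults are illustrative assumptions, not recommendations.

```python
import math
from collections import Counter
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a plain word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: List[str], top_k: int = 1) -> List[str]:
    """Return the `top_k` chunks whose vectors lie closest to the query vector."""
    query_vector = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vector, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]
```

In practice, the chunk vectors would be computed once and stored in a vector index, so that only the query needs to be embedded at question time.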
How to improve the performance of a GenAI RAG system

In the cases mentioned above, the model either does not provide an answer at all or, worse, provides the wrong one. Thankfully, there are multiple extensions and adjustments you can make to the basic RAG setup which, depending on the problem, can help.

Figure: A diagram showing how you can optimise a RAG system with improved search, better chunking, and the right model

Adopt the right model for your specific use case

Unsurprisingly, your model selection heavily impacts the final performance of your RAG system. LLMs are limited by the amount of information they can process at once, so you can only provide a model with a certain number of chunks to help answer a question.

An easy way to address this and improve your RAG system is to use a model that supports a larger amount of contextual information, because including more chunks decreases the likelihood of missing the relevant ones.

Luckily, newer GenAI models have drastically increased the context length. GPT-4, for example, was updated in 2023 (as GPT-4 Turbo) to support 128,000 tokens, which corresponds to roughly 200 pages of text. And in early 2024, Google announced a breakthrough in its Gemini series, with support for up to one million tokens.

But there’s a caveat to this solution. The longer the context, the harder it becomes for a model to find the relevant information. The location of information within the context impacts performance too: information in the middle of a long context, for example, tends to receive less weight. What’s more, processing very long contexts requires a lot of computing resources. So you need to carefully evaluate the trade-off between resource use and accuracy.

Context length is not the only selection criterion for your model. If you have highly specific data, for example medical documents, a model trained on such data might outperform a more standard one, even though it might have a smaller context length.
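The context limit discussed above can be treated as a token budget that the retrieved chunks have to fit into. The sketch below shows one simple way to do this; it is an illustration under stated assumptions rather than a reference implementation. The names `estimate_tokens` and `select_chunks` are hypothetical, and the token estimate relies on a common rule of thumb (about four tokens for every three English words); a real system would count tokens with the tokenizer of the chosen model.

```python
import math
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough assumption: about 4 tokens for every 3 words of English text."""
    return math.ceil(len(text.split()) * 4 / 3)


def select_chunks(
    ranked_chunks: List[str],
    context_limit: int,
    reserved_tokens: int = 0,
) -> List[str]:
    """Keep the highest-ranked chunks that fit into the remaining token budget.

    `ranked_chunks` is assumed to be sorted best-first. `reserved_tokens` sets
    aside room for the instructions, the user question, and the model's answer.
    """
    budget = context_limit - reserved_tokens
    selected = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        budget -= cost
    return selected
```

Stopping at the first chunk that does not fit keeps the selection strictly in ranking order. Skipping that chunk and trying smaller, lower-ranked ones instead would fill the budget more completely, at the cost of including less relevant text.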
Optimise your LLM chunking

Chunking, the process of cutting information into small, searchable parts, heavily impacts retrieval performance. If a chunk is too small, the LLM has a hard time interpreting it because the surrounding context from the original text is missing. If a chunk is too large, its embedding vector becomes very general because the text starts to cover different topics, and so search performance decreases.

Again, you need to carefully evaluate your use cases to find the optimal chunk size and overlap (how much text neighbouring chunks share). Adaptive chunking is a useful approach because it supports chunks of varying sizes. This technique usually considers document structure (e.g. paragraphs) and might use embeddings to measure how similar the topics covered in different passages of the text are.

Implement hybrid search and complementary LLMs

Embedding-based search has its limits, as we explored earlier. A common way to improve it is to combine it with more classical, keyword-based search. This combination, often called hybrid search, usually outperforms pure embedding-based search.

Since there’s a limit to the number of chunks you can feed into an LLM, a re-ranking step often takes place. Here, you retrieve more chunks than would fit into the LLM context, and a separate machine learning model then identifies the most fitting ones. This model is too expensive to consider all chunks in the data, but still cheap enough to analyse more chunks than would fit into the LLM context.

Improving the search query is another way to improve outcomes. By default, the user’s query is used as is. But a great option can be to ask another LLM to reformulate the original question into a more fitting search term. This could be anything from identifying specialised keywords to generating several alternative questions that might lead to the answer.

Empower your model to use external tools

Your retrieval system has one chance of finding the right information.
One way to streamline this process and get the most accurate responses is to put the LLM in charge of the entire information retrieval process. This is done with so-called agentic AI, where a large language model not only talks to the user, but can also opt to use a tool, like a search engine or a database with structured information. The model can decide to actively search for information, look at the results, and then search again using different words or for another topic as needed.

This paradigm can be very powerful in certain scenarios, but it usually only works well if the large language model has been trained to work effectively with external tools.

Figure: An example of an agent-based RAG system

LLMs and proprietary data: a powerful combination

We’re confident that the combination of LLMs with proprietary information is critical to making these models more effective and efficient. Giving GenAI models access to tools like search engines and databases with structured information, together with the capability to trigger actions like sending emails or adding information to a spreadsheet, unlocks a wide range of use cases that were not previously possible.

We’re entering a new era of automation, augmentation, and user interaction. And getting ahead of this curve is essential for rewiring your business for long-term growth and competitive advantage.

Here at Zühlke, we’re helping businesses across complex and regulated ecosystems to turn AI opportunities into value-driving use cases and scalable solutions. Talk to us today about how we can help you ideate, build, and scale bespoke products and solutions while mitigating risk, adopting responsible practices, and ensuring early and ongoing value.