The rapidly evolving field of generative AI is focused on building neural networks that can create realistic content such as text, images, audio, and synthetic data. Generative AI is revolutionizing multiple industries by enabling rapid creation of content, powering intelligent knowledge assistants, augmenting software development with coding co-pilots, and automating complex tasks across various domains.
Yet generative AI models are trained on finite data and only know as much as that data can tell them. In practice, this makes it difficult to ensure the accuracy and relevance of generated content when the relevant information isn’t in the model’s training data. These “knowledge gaps” can be temporal (all training data predates April 2023, for example) or domain-specific (none of your proprietary data was used to train the model, for example).
This is where retrieval-augmented generation (RAG) comes into play. RAG enhances generative models such as large language models (LLMs) with external information sources to effectively provide the model with “facts” it doesn’t know. This results in more accurate and contextually relevant responses, depending on the quality of that external information and the quality of the model.
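To make that concrete, here’s a minimal sketch of the core idea: facts retrieved from an external source are injected into the prompt before the model generates an answer. The retrieval step is stubbed out here, and the example facts are invented purely for illustration.

```python
# Minimal sketch of the core RAG idea: retrieved facts are injected into the
# prompt so the model can answer with information it was never trained on.
# The retrieval step is stubbed out; a real application would query a vector
# database or search index.

def build_augmented_prompt(question: str, retrieved_facts: list[str]) -> str:
    """Combine externally retrieved facts with the user question."""
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example: a fact the base model could not know (hypothetical data).
facts = ["The APAC region had a 40-minute service outage on 2024-06-03."]
print(build_augmented_prompt("Were there any recent outages?", facts))
```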
Specific examples of how RAG can be used include:
Up-to-date chatbots: RAG-enabled chatbots can fetch real-time information from a company’s database to respond accurately to customer queries, such as recent outages the model wouldn’t otherwise know about.
Context-aware content creation: RAG applications can inject retrieved data to generate long-form content that reflects events that occurred after the model was trained.
Search-based content summarization: Adding internet search and document retrieval capabilities turns RAG applications into dynamic content generators that use traditional search to find data, which is then fed into the standard RAG pipeline.
This post explains two approaches to RAG applications, local and hybrid, and how NVIDIA AI Workbench makes it easy to get started with an example hybrid RAG project from GitHub.
Compute location
The flexibility and power of RAG don’t come for free. RAG applications have a lot of moving parts, most of which involve some form of computation. For example, an end-to-end RAG application must ingest your data into a vector database, retrieve relevant information from that database based on your input, and then push the retrieved data through the LLM to generate an appropriate answer. The last step is especially computationally demanding, and is where GPUs offer significant benefits.
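As a rough sketch of those three stages, the snippet below embeds a handful of documents, stores the vectors in an index, and retrieves the closest match for a query. It uses sentence-transformers and FAISS purely as illustrative stand-ins (neither is prescribed here), and the GPU-heavy generation step is left as a placeholder prompt.

```python
# Sketch of the three RAG stages: ingest, retrieve, generate.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "AI Workbench runs projects locally or on remote GPU systems.",
    "Hybrid RAG splits retrieval and inference across machines.",
]

# 1. Ingest: embed documents and store the vectors in an index.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 2. Retrieve: embed the query and look up the closest document.
query = "Where can Workbench projects run?"
query_vector = embedder.encode([query])
_, hits = index.search(query_vector, 1)
context = documents[hits[0][0]]

# 3. Generate: pass query + context to an LLM (the GPU-heavy step, not shown).
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```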
Finally, the quality of the results depends heavily on the size of the model, measured in billions of parameters. All else being equal, model quality increases with model size: a model with tens of billions of parameters generally outperforms one with a few billion. This is because the number of parameters correlates with the amount of data a model can be trained on; more parameters mean more data and thus richer, more complex responses.
As larger models require larger GPUs with more video memory (VRAM), users increasingly need to figure out how to work with different GPUs in their RAG applications. This raises two questions (a rough sizing sketch follows them):
How must a RAG application and the desired model be configured to work with a given GPU?
Where is the GPU running?
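For the first question, a rough rule of thumb is that model weights alone need about (parameter count × bytes per parameter) of VRAM, before the KV cache, activations, and framework overhead are counted. The figures below are coarse estimates, not hardware requirements.

```python
# Rough VRAM needed just for model weights: params x bytes per parameter.
# This ignores the KV cache, activations, and runtime overhead, so treat the
# numbers as coarse lower bounds.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for params in (7, 13, 70):
    fp16 = weight_vram_gb(params, 2.0)   # 16-bit weights
    int4 = weight_vram_gb(params, 0.5)   # 4-bit quantized weights
    print(f"{params}B parameters: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at 4-bit")
```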
Local RAG
Running self-hosted RAG applications is popular right now. In this scenario, a user or organization has everything set up on resources they control, including the GPUs. While this approach is technically more complex than using a service like OpenAI, it enables customization, privacy, and cost controls that aren’t possible with a third-party provider.
For some users, self-hosting even means running everything on an AI workstation, keeping data, queries, and computations completely private and self-contained on a machine with one or more NVIDIA RTX 6000 Ada Generation GPUs. While building such an application from scratch isn’t realistic for most individuals, there are high-quality RAG applications that run locally, such as LM Studio, AnythingLLM, and NVIDIA ChatRTX. These applications are user-friendly and take a plug-and-play approach, so users can work with different models and vector databases.
However, what happens if you want to do something that outstrips your local GPU?
One answer is hybrid RAG.
Hybrid RAG
Hybrid RAG is an approach that uses computational resources running in different locations to provide the necessary compute power. It’s called hybrid because it combines different but complementary resources. Most commonly, this means running the embedding and retrieval steps on a local AI workstation or PC, and using a remote GPU resource for the LLM inference computation. This combination helps yield the best performance, whether you’re working on small projects or handling large amounts of data.
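The sketch below illustrates that split: retrieval stays on the local machine (as in the earlier ingest-and-retrieve example), and only the LLM call goes to whichever endpoint is selected. The endpoint URLs, API key, and model name are placeholders for illustration, not values from any real deployment.

```python
# Sketch of the hybrid split: retrieval runs locally; only the LLM inference
# call is sent to a GPU elsewhere. URLs and model name are placeholders.
from openai import OpenAI

INFERENCE_ENDPOINTS = {
    "local": "http://localhost:8000/v1",            # inference server on this machine
    "remote": "https://gpu-server.example.com/v1",  # remote cloud or data center GPU
}

def generate(question: str, retrieved_context: str, where: str = "remote") -> str:
    """Send the retrieval-augmented prompt to the selected inference endpoint."""
    client = OpenAI(base_url=INFERENCE_ENDPOINTS[where], api_key="placeholder-key")
    response = client.chat.completions.create(
        model="example-llm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Context: {retrieved_context}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content
```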
While this approach may sound appealing, it’s not easy to build a hybrid RAG application yourself. You need to know the basics of LangChain, data storage and management, information retrieval, text generation, and building a user interface. Then there are the technical decisions about matching model sizes, precisions, and token limits to the targeted hardware. Finally, you need to know how to wire it all up so you can choose where each compute component runs.
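As one small example of those decisions, retrieved chunks have to fit within the model’s token limit. The helper below uses a crude four-characters-per-token approximation; a real application would use the model’s own tokenizer.

```python
# Trim retrieved chunks to an approximate token budget before building the
# prompt. The ~4 characters-per-token heuristic is a rough approximation.
def trim_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep retrieved chunks, in order, until the approximate budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        approx_tokens = len(chunk) // 4 + 1
        if used + approx_tokens > max_tokens:
            break
        kept.append(chunk)
        used += approx_tokens
    return kept

chunks = ["First retrieved passage...", "Second retrieved passage...", "Third..."]
print(trim_to_budget(chunks, max_tokens=12))
```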
There’s a lot to figure out, and that’s where NVIDIA AI Workbench comes in. NVIDIA AI Workbench is a free solution that enables you to develop, test, and prototype generative AI applications on the system of your choice: PC, workstation, data center, or cloud. AI Workbench supports both development and training, and you can create and share development environments across people and systems.
Install AI Workbench to get started in minutes on your local or remote workstation or PC. It supports Windows 11, Ubuntu 22.04 and 24.04, as well as macOS. Once installed, you can start a new project or clone one from the AI Workbench examples on GitHub. Everything works through your choice of GitHub or GitLab, so you can work as usual but still share your work easily. Learn more about getting started with AI Workbench.
Hybrid RAG with AI Workbench
For an example of running a custom text-based RAG web application with your own documents on your local or remote system, check out the NVIDIA Hybrid RAG Project. This Workbench Project runs a containerized RAG server with a Gradio chat UI front end on the host machine. It also provides a JupyterLab development environment, with an option to integrate with Visual Studio Code. You can use a wide variety of models and choose where you want to run inference.
Figure 1. Develop a hybrid RAG project with a customizable Gradio chat UI
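For a sense of how little code a chat front end like this needs, here is an illustrative Gradio sketch. It is not the project’s actual UI code; the response function just echoes a placeholder where the real project would call its RAG pipeline.

```python
# Illustrative Gradio chat front end. The response function is a placeholder;
# a real RAG application would route the message through retrieval + inference.
import gradio as gr

def chat(message, history):
    # In the real application this would call the RAG pipeline.
    return f"(placeholder) You asked: {message}"

gr.ChatInterface(chat, title="Hybrid RAG demo").launch()
```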
The Hybrid RAG Project is fairly simple, and because it’s also a development environment, you can open it up and dig into the details of how hybrid RAG works. It’s completely modifiable, making it a good environment for learning how to set up hybrid applications. Use it to learn, or fork it and adapt it into your own application.
This project seamlessly integrates with the NVIDIA API Catalog, enabling developers to leverage cloud-based inference endpoints for rapid prototyping and development. Additionally, you can download NVIDIA NIMs to local workstations or servers for on-premises deployment and customization. The NVIDIA Developer Program offers free access to experiment with NIMs.
Importantly, these same NIMs can be transitioned into production environments with an NVIDIA AI Enterprise license, ensuring a smooth path from development to deployment. With this flexibility, you can choose the most suitable inference method for your needs, whether it’s cloud-based, local, or a hybrid approach, while maintaining consistency across different environments.
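Because both the API Catalog endpoints and locally deployed NIM microservices expose OpenAI-compatible APIs, switching between them in a sketch like the one below is largely a matter of changing the base URL. The URLs, API key placeholder, and model identifier are assumptions for illustration, not configuration taken from the Hybrid RAG Project.

```python
# The same OpenAI-compatible call can target a cloud endpoint or a local NIM
# by changing only the base URL. URLs and model name are example assumptions.
from openai import OpenAI

# Cloud-hosted endpoint from the NVIDIA API Catalog (example URL).
cloud = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
               api_key="YOUR_NVIDIA_API_KEY")

# The same model pulled down as a NIM and running locally (example URL/port).
local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

def ask(client: OpenAI, prompt: str) -> str:
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # example model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```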
Get started
The hybrid aspect of the Hybrid RAG Project extends beyond the choice of inference mode. AI Workbench itself has a simple hybrid deployment model: you can install it locally on RTX PCs and workstations, as well as on remote resources in the cloud or data center, and then run this project locally or scale it up to a remote cloud server or data center. The ability to run projects on systems of your choice, without the overhead of manually setting up the infrastructure, extends to all Workbench Projects. You can customize the AI Workbench examples on GitHub for domain-specific purposes. To create your own Workbench Project, follow the steps in the User Guide.