
Understanding Retrieval-Augmented Generation (RAG) on GCP

Retrieval-Augmented Generation (RAG) is a key technique for making large language models (LLMs) more broadly applicable and trustworthy. By integrating external knowledge sources, RAG addresses limitations of LLMs such as outdated knowledge and hallucinated responses. In this webinar, Jeroen Overschie, a Machine Learning Engineer at Xebia Data, explains how RAG works through four levels of complexity and walks you through RAG applications and their implementations on Google Cloud Platform (GCP). 

Jeroen shared, “RAG enables LLMs to escape the box of static data internal to the model. It allows you to insert data unseen to the model before.” Let’s dive deeper into how RAG can be a practical tool for data-driven systems.

Watch the webinar now!

RAG on GCP Webinar

Key Takeaways 

  • RAG in levels of complexity: RAG systems can get complicated. At the basic level, vector search is often the key component powering a RAG setup. In this talk, Jeroen takes you through four levels, all the way to multimodal LLMs powering your RAG setup.
  • Scalability: GCP tools offer a cohesive platform to build, manage, and scale RAG systems.
  • Practical use for decision making: RAG enhances the practical application of AI by delivering accurate, real-time, and actionable insights.

Why Use RAG? 

Without RAG, interactions with LLMs are limited by the scope of their training data. For example, when asked about the results of a European Cup match in 2024, Google’s Gemini model, trained on data up to September 2021, responded that no such event existed, which is obviously not the case. This highlights some crucial limitations:

  • Outdated Knowledge: Models rely solely on their training data, leaving them unaware of events or updates post-training.
  • No Internal Data Access: Without access to proprietary datasets or internal company documents, responses are generic and lack personalization.
  • Hallucinations: When a model doesn’t have the correct answer, it fabricates plausible-sounding but inaccurate responses.

  

Benefits of RAG 

RAG transforms this static interaction by integrating external sources in real time, ensuring more accurate and relevant responses. Key advantages include:

  • Up-to-date Knowledge: Access to real-time data provides current and reliable information.
  • Factual Answers: By grounding responses in verifiable data, RAG minimizes hallucinations.
  • Internal Data Access: Integration with proprietary company data ensures tailored and contextually relevant outputs.

This ensures users get answers that are not only accurate but also actionable, enhancing trust and usability.

Levels of RAG Implementation on GCP 

Level 1: Basic RAG 

This level involves document retrieval using vector search and answer generation via a prompt containing the retrieved context. Core concepts like embeddings, vector databases, and prompt engineering are introduced. Chunking text into smaller units before embedding improves performance and enables citations.
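The basic flow above can be sketched end to end in a few lines. This is a minimal, self-contained illustration, not a production setup: `embed` is a toy bag-of-letters stand-in for a real embedding model (such as one served on Vertex AI), and `chunk` splits on a fixed character count rather than on sentences or tokens.

```python
from math import sqrt

def chunk(text, size=40):
    """Split text into fixed-size character chunks (a toy stand-in for
    sentence- or token-aware chunking)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    """Toy bag-of-letters embedding; a real system would call an
    embedding model instead."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Step 1, retrieval: rank chunks by similarity to the query embedding."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = chunk("The Falcon 9 rocket lifts payloads to orbit. "
             "Our refund policy allows returns within 30 days.")
context = retrieve("rocket payload", docs)

# Step 2, generation: the retrieved context is inserted into the prompt
# that would be sent to the LLM.
prompt = ("Answer using only this context:\n" + "\n".join(context) +
          "\n\nQuestion: rocket payload")
```

Because each answer is grounded in specific retrieved chunks, those chunks can also be surfaced back to the user as citations.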

Level 2: Hybrid Search 

This level combines vector search with keyword search (TF-IDF or BM25) using Reciprocal Rank Fusion. This approach improves retrieval accuracy, especially when specific keywords are crucial.
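Reciprocal Rank Fusion itself is only a few lines. The sketch below assumes each retriever returns an ordered list of document IDs; the constant `k=60` is the commonly used default from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists with Reciprocal Rank Fusion:
    each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by multiple retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. semantic (vector) ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. BM25 keyword ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_b wins: it ranks high in both lists.
```

Because RRF only needs ranks, not scores, it can merge rankings from search engines whose scoring scales are not comparable.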

Level 3: Advanced Data Formats 

This level focuses on handling complex formats like PDF, HTML, and Word. It emphasizes specialized parsing, particularly for tables, often involving computer vision techniques and markdown conversion for better LLM understanding.
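To illustrate the table treatment, here is a minimal sketch that renders an already-extracted table as Markdown. In practice the header and rows would come from a document parser (Document AI offers table extraction, for example); that extraction step is not shown here.

```python
def table_to_markdown(header, rows):
    """Render an extracted table as Markdown so the LLM can read its
    structure instead of a jumble of PDF text fragments."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

md = table_to_markdown(["Vehicle", "Payload to LEO"],
                       [["Falcon 9", "22,800 kg"]])
```

The resulting Markdown string can then be placed into the prompt alongside the surrounding text of the document.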

Level 4: Multimodal Models 

This level leverages multimodal models like GPT-4o, capable of processing images and text. Using PDF pages as images directly in prompts can improve accuracy for content difficult to represent textually, but at a higher cost.

  

Building RAG Systems with GCP 

RAG implementations vary based on flexibility and management requirements:

  • Flexible Approach - Combine individual tools like Document AI, Vertex AI Vector Search, and Gemini for full control and customization.
  • Managed Approach - Use integrated services like Vertex AI Search, which handles retrieval and answer generation, simplifying system architecture.

GCP Tools for Building a RAG System 

To build an efficient and scalable Retrieval-Augmented Generation (RAG) system, Google Cloud Platform (GCP) provides several powerful tools that can be seamlessly integrated. These services simplify development while maintaining high performance, security, and flexibility.

Cloud Run: API Hosting 

Cloud Run is a fully managed compute platform designed to run containerized applications at scale. It allows developers to host APIs, such as the endpoint for querying a RAG system, with minimal overhead.

Key Features: 

  • Serverless Execution: Automatically scales up during high traffic and scales down to zero during idle periods, reducing costs.
  • Language-Agnostic: Supports any programming language or framework, making it versatile for diverse applications.
  • Integrated Security: Built-in HTTPS support and Identity and Access Management (IAM) for secure API access.
  • Application: In a RAG system, Cloud Run can host the query-handling API that receives user inputs, processes them, and returns augmented responses.

Example: The API hosted on Cloud Run serves as the gateway to the RAG system, receiving user questions and managing the interaction with document retrieval and answer generation components.
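The core logic such an endpoint would wrap can be sketched framework-free. Here `retriever` and `generator` are hypothetical callables standing in for the Vertex AI retrieval and generation clients; in a real Cloud Run service this function would sit behind an HTTP route.

```python
def handle_query(question, retriever, generator):
    """Gateway logic: retrieve context, build a grounded prompt, and
    return the generated answer together with its sources."""
    context = retriever(question)
    prompt = ("Answer using only the context below.\n\n"
              "Context:\n" + "\n".join(context) +
              "\n\nQuestion: " + question)
    return {"answer": generator(prompt), "sources": context}

# Stub dependencies, just to show the call shape.
result = handle_query(
    "What is Falcon 9's payload capacity?",
    retriever=lambda q: ["Falcon 9 lifts up to 22,800 kg to LEO."],
    generator=lambda p: "Up to 22,800 kg to low Earth orbit.",
)
```

Returning the sources alongside the answer lets the client render citations next to the response.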

 

Vertex AI: Embedding and Vector Search Functionality 

Vertex AI is GCP’s unified platform for building, deploying, and managing machine learning models. It plays a pivotal role in embedding creation and vector search in RAG systems.

Key Features: 

  • Embeddings: Converts text data (questions and documents) into high-dimensional vectors that capture semantic meaning.
  • Vector Similarity Search: Enables efficient retrieval of relevant documents by comparing embeddings.
  • Customizable Models: Supports pre-trained embedding models like Text Embedding Gecko and custom models tailored to specific datasets.
  • Scalability: Handles large-scale datasets and complex search queries with low latency.
  • Application: In a RAG system, Vertex AI is used to embed user queries and documents into a numerical format and perform similarity searches to retrieve the most relevant documents.

Example: If a user asks, “What is the payload capacity of the Falcon 9 rocket to Mars?” Vertex AI matches the question’s embedding with embeddings of document fragments to find the most relevant document.

Cloud Storage: Securely Storing Documents 

Cloud Storage provides a scalable and secure solution for storing documents and data used in a RAG system. It supports a variety of file formats, making it ideal for managing unstructured and semi-structured data.

Key Features: 

  • High Durability: Ensures 99.999999999% (11 nines) of durability for stored data.
  • Access Control: Fine-grained IAM policies to manage who can view or modify stored content.
  • Global Availability: Allows data to be stored in multiple regions for low-latency access worldwide.
  • Integration with Other GCP Services: Works seamlessly with tools like Vertex AI and Document AI for downstream processing.
  • Application: In a RAG system, Cloud Storage is used to house the documents, datasets, or PDFs that are queried and retrieved for context augmentation.

Concluding 

We have explored RAG in four levels of complexity, going from our first basic RAG to one that leverages multimodal models to answer questions based on complex documents. Each level introduces new complexity, each justified in its own way. Summarising, the Levels of RAG are: 

  • Level 1 - Basic RAG - RAG’s two main steps are 1) retrieval and 2) generation. The key components are embeddings, vector search using a vector database, chunking, and a Large Language Model (LLM).
  • Level 2 - Hybrid Search - Combining vector search and keyword search can improve retrieval performance. Sparse text search can be done using TF-IDF or BM25, and Reciprocal Rank Fusion can be used to merge the two search engine rankings.
  • Level 3 - Advanced data formats - Support formats like HTML, Word and PDF. PDFs can contain images and graphs, but also tables. Tables need separate treatment, for example with computer vision, so that the table can be exposed to the LLM as Markdown.
  • Level 4 - Multimodal - Multimodal models can reason across audio, images and even video. Such models can help process complex data formats, for example by exposing PDFs as images to the model. Provided the extra cost is worth the benefit, such models can be incredibly powerful.

RAG is a powerful technique that can open up many new possibilities for companies. The Levels of RAG help you reason about the complexity of your setup and understand what is easy and what is difficult to do with RAG. So: what is your level? 🫵 

We wish you all the best with building your own RAG 👏🏻. 
