ArchBits

Beyond Basic RAG: Agents, Graphs, and the Multimodal Future

RAG evolves beyond simple retrieval—explore agent patterns where models plan and act, graph retrieval for relationship queries, multimodal search across images and video, and emerging technologies reshaping production systems.

When Simple Retrieval Isn't Enough

Basic RAG follows a predictable pattern: user asks a question, system retrieves relevant documents, model generates an answer. This works well for straightforward queries where the answer exists in a single source. But real-world problems are messier.

Consider a project manager asking: "Compare our current project timeline against industry benchmarks for similar commercial developments, then estimate whether we're at risk of delay." This isn't a simple retrieval task. It requires pulling internal project data, finding external benchmarks, analyzing the comparison, and making a judgment call. The question demands planning, multiple information sources, and reasoning across different types of data.

Or imagine an architect querying: "Show me curtain wall systems from our past projects that match this sketch, then pull the cost data and supplier information." Now you're dealing with visual similarity search, structured database queries, and document retrieval—all for a single question.

This is where RAG evolves beyond simple retrieval patterns. Modern production systems treat RAG as one capability within a broader toolkit. The model becomes an agent that plans, acts, and reflects rather than just retrieving and generating.

RAG Meets Agents: Planning and Tool Use

In agentic systems, the language model doesn't just respond to prompts—it analyzes tasks, breaks them into steps, decides which tools to use, and evaluates its own outputs. RAG becomes one tool in the agent's arsenal, alongside web search, database queries, code execution, and API calls.

Think of it like a senior engineer solving a complex problem. They don't rely on a single approach. They might check internal documentation (RAG), search for recent solutions online (web search), run calculations (code execution), and verify data in databases (SQL queries). They plan their approach, try different methods if the first doesn't work, and verify their conclusions before presenting results.


ReAct Pattern: Reasoning + Acting. The agent alternates between reasoning steps ("I need to check X because Y") and action steps (calling tools to get information). This structured approach improves accuracy on complex tasks by 20-30% compared to simple prompting.

The agent workflow looks like this: When a query arrives, the planning component analyzes what information is needed and outlines an approach. It might determine that answering requires internal documentation (triggering RAG), current market data (triggering web search), and some calculations (triggering code execution). The agent executes these steps, collects results, then evaluates whether it has enough information to answer confidently. If not, it plans the next retrieval step. This is multi-hop reasoning—each retrieval informs the next.
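The loop above can be sketched in a few lines. This is a minimal, heavily simplified illustration: the tool functions and the hard-coded plan are stand-ins for what a real agent would produce by calling an LLM to reason about each step.

```python
# Minimal sketch of a ReAct-style agent loop with stubbed tools.
# search_docs/search_web and the fixed plan are hypothetical stand-ins;
# a real system would have the LLM choose the next tool at each step.

def search_docs(query):
    # Stub for internal RAG retrieval.
    return "internal API latency averages 120ms"

def search_web(query):
    # Stub for a web search tool.
    return "AWS Lambda cold starts range from 200-600ms"

TOOLS = {"rag": search_docs, "web": search_web}

def run_agent(question, plan):
    """Alternate reasoning (choosing a tool) with acting (calling it)."""
    observations = []
    for tool_name, sub_query in plan:
        result = TOOLS[tool_name](sub_query)
        observations.append(f"{tool_name}: {result}")
    # A real agent would synthesize with an LLM; here we just join evidence.
    return " | ".join(observations)

answer = run_agent(
    "How does our API latency compare to Lambda cold starts?",
    plan=[("rag", "API latency"), ("web", "Lambda cold start benchmarks")],
)
```

Each observation feeds the next reasoning step in a real implementation, which is what makes the retrieval multi-hop rather than one-shot.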

For example, asking "How does our API latency compare to AWS Lambda cold start times?" requires multiple steps. First, retrieve internal monitoring data showing your average API latency is 120ms. Then search the web for AWS Lambda benchmark data, finding cold starts range from 200-600ms. Finally, synthesize both data points into a comparison. No single retrieval could answer this—it requires iterative information gathering guided by reasoning.

Self-correction adds another layer of sophistication. After generating an answer, the agent can evaluate it: "Is this response actually supported by the retrieved context? Did I answer the right question?" If something seems off, it can retry with different documents or reformulate the search query. This Self-RAG pattern significantly reduces hallucinations because the model checks its own work.
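A toy version of that check-and-retry loop might look like the following. The `is_supported` grounding check here is a naive stand-in (a real Self-RAG system would use an LLM or a trained critic to judge support):

```python
# Sketch of the Self-RAG idea: generate, check the answer against the
# retrieved context, and retry with a reformulated query if unsupported.
# is_supported() is a naive stand-in for an LLM-based consistency check.

def is_supported(answer, context):
    # Toy grounding check: every claimed figure must appear in the context.
    return all(tok in context for tok in answer.split() if tok.endswith("ms"))

def generate_with_check(query, retrieve, generate, max_tries=2):
    for _ in range(max_tries):
        context = retrieve(query)
        answer = generate(query, context)
        if is_supported(answer, context):
            return answer
        query = query + " (reformulated)"  # retry with a modified query
    return "I could not find a well-supported answer."
```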

In practice, self-correction typically improves answer accuracy by 12-15%.

A more advanced variant called Corrective RAG (CRAG) grades the relevance of each retrieved document before using it. If the retrieved documents score poorly on relevance, the system automatically falls back to web search or tries a different retrieval strategy. This prevents garbage-in-garbage-out scenarios where irrelevant retrieval leads to poor generation.
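The grade-then-fallback flow can be sketched as follows, with keyword overlap standing in for the learned relevance grader a real CRAG system would use:

```python
# Sketch of the Corrective RAG (CRAG) flow: score each retrieved document
# for relevance and fall back to web search when everything scores low.
# grade() is a word-overlap stand-in for an LLM or cross-encoder grader.

def grade(query, doc):
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def corrective_retrieve(query, docs, web_search, threshold=0.3):
    graded = [(grade(query, d), d) for d in docs]
    relevant = [d for score, d in graded if score >= threshold]
    if relevant:
        return relevant
    return web_search(query)  # fallback path when retrieval quality is poor
```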

The practical question is when to use agentic patterns versus simple RAG. For straightforward queries with clear answers in your knowledge base, basic RAG is faster and cheaper. For queries requiring multiple information sources, complex reasoning, or iterative refinement, agents justify their additional complexity and cost. The cost difference matters—agent workflows might make 3-5 LLM calls instead of 1, directly impacting your API bills.

Graph RAG: When Relationships Matter

Vector similarity search captures semantic meaning remarkably well. "Concrete strength requirements" and "compressive capacity specifications" have similar embeddings even though they use different terminology. But vector search has a fundamental limitation: it doesn't understand explicit relationships between entities.

Consider the query: "Who are the structural engineers that worked on projects with sustainability certifications above LEED Gold?" Vector search might retrieve documents mentioning engineers, sustainability, and LEED certifications. But it can't trace the actual relationships: Engineer A worked on Project X, which has LEED Platinum certification. These are connections between entities, not semantic similarities between text.

This is where graph-based retrieval comes in. Knowledge graphs represent information as nodes (entities like people, projects, companies) and edges (relationships like "worked_on," "located_in," "certified_as"). Graph RAG combines vector search's semantic understanding with graph traversal's ability to follow explicit connections.


The retrieval process becomes hybrid: first use vector search to find relevant entities (find engineers specializing in sustainable design), then use graph traversal to find connected entities (which projects did they work on, what certifications do those projects have). This combination answers complex questions that neither approach handles alone.
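The two-stage flow can be illustrated over a toy in-memory graph. Everything here (the graph, the stubbed vector search) is fabricated for illustration; a production system would run the first stage against real embeddings and the second against Neo4j or Neptune:

```python
# Sketch of hybrid graph retrieval: vector search finds seed entities,
# then graph traversal follows explicit relationships from those seeds.

GRAPH = {
    "Engineer A": [("worked_on", "Project X")],
    "Project X": [("certified_as", "LEED Platinum")],
}

def vector_search(query):
    # Stub: pretend the embedding search surfaced this entity.
    return ["Engineer A"]

def traverse(entity, depth=2):
    """Follow outgoing edges up to `depth` hops, collecting fact triples."""
    facts, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for relation, target in GRAPH.get(node, []):
                facts.append((node, relation, target))
                next_frontier.append(target)
        frontier = next_frontier
    return facts

seeds = vector_search("engineers on certified sustainable projects")
facts = [f for seed in seeds for f in traverse(seed)]
# facts now connects Engineer A -> Project X -> LEED Platinum
```

Note that the traversal recovers the Engineer A / LEED Platinum connection that pure vector search cannot express.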

Microsoft's GraphRAG research introduced a sophisticated enhancement: community detection. The idea is that large knowledge graphs naturally form clusters of related entities. Projects in the same region form one community, projects using similar materials form another, projects by the same contractor form yet another. By detecting these communities and pre-generating summaries for each cluster, you can answer both local questions (specific facts about entities) and global questions (patterns across the entire knowledge base).

When to Use Graph RAG: Your domain has clear entities and relationships worth modeling explicitly—organizational hierarchies, project dependencies, supplier networks, regulatory requirements with citations. For pure semantic search without explicit relationships, vector RAG is simpler and sufficient.

Implementation requires both a graph database and vector embeddings. Neo4j is the most mature option with strong visualization tools and the Cypher query language for graph traversal. AWS Neptune offers managed graph databases integrated with the AWS ecosystem. For self-hosting, NebulaGraph scales to billions of nodes and edges. The architecture stores vector embeddings as node properties, allowing you to combine similarity search (find similar projects) with relationship traversal (find engineers who worked on those projects).

The added complexity matters. Graph databases require data modeling upfront—you need to define what entities and relationships matter for your domain. Maintaining graph structure as data changes requires additional engineering. For many applications, the simpler vector-only approach suffices. But when your queries naturally involve traversing relationships—finding connections, checking hierarchies, analyzing networks—graph RAG becomes worth the investment.

Beyond Text: Multimodal Retrieval

Most RAG systems assume text-only retrieval. But real-world knowledge exists in multiple formats—architectural drawings, site photos, video walkthroughs, audio recordings from meetings. Multimodal RAG extends retrieval across these different data types.

The breakthrough enabling multimodal retrieval is models like CLIP (Contrastive Language-Image Pre-training) that create joint embeddings for text and images. Text and images map to the same vector space, meaning you can search images using text queries or find similar images using image queries. "Modern glass facade" as a text query can retrieve images of buildings with glass facades, even if those images have no text labels.

CLIP Embeddings: A neural network model that learns to match images with their text descriptions. It creates embeddings where semantically related text and images are close together in vector space, enabling cross-modal search.

The practical workflow for image retrieval: index your image collection by encoding each image with CLIP's image encoder. When a user queries with text like "buildings with curved rooflines," encode that text with CLIP's text encoder. Vector search then finds images whose embeddings are most similar to the query embedding. The same approach works in reverse—query with an image to find similar images or related text descriptions.
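The workflow reduces to nearest-neighbor search in the shared space. The sketch below uses fabricated 3-dimensional vectors and a stubbed text encoder purely to show the mechanics; real CLIP embeddings (via open_clip or sentence-transformers, for example) have hundreds of dimensions:

```python
# Toy illustration of cross-modal search in a shared embedding space.
# The vectors and text_encode() are fabricated; only the cosine-ranking
# mechanics mirror a real CLIP-based image search.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend these came from CLIP's image encoder at indexing time.
image_index = {
    "glass_tower.jpg": [0.9, 0.1, 0.0],
    "brick_house.jpg": [0.1, 0.9, 0.0],
}

def text_encode(query):
    # Stub for CLIP's text encoder: "glass" queries land near glass images.
    return [0.8, 0.2, 0.0] if "glass" in query else [0.2, 0.8, 0.0]

def image_search(query, k=1):
    q = text_encode(query)
    ranked = sorted(image_index,
                    key=lambda name: cosine(q, image_index[name]),
                    reverse=True)
    return ranked[:k]
```

Reverse search (image-to-image, or image-to-text) works identically: encode the query with the image encoder instead and rank against the same index.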

For document-heavy workflows like reviewing PDFs with mixed content (text, diagrams, tables, photos), multimodal understanding becomes critical. Traditional text extraction misses visual information—charts convey data relationships, diagrams show spatial configurations, photos document site conditions. Modern document parsing services like Unstructured.io extract these components separately, preserving the visual information alongside text.

The retrieval strategy for complex documents involves multiple parallel indexes. Text chunks go into one index using text embeddings. Extracted images go into another index using CLIP embeddings. Tables might be converted to text or embedded separately. During retrieval, you search across all indexes and combine results, ensuring visual information isn't lost just because it's not textual.

Video and audio present additional challenges. For video, you can sample frames and embed them with CLIP for visual search, or transcribe the audio track and embed the transcript for semantic search. The combination enables queries like "Find the segment where the presenter discusses cost estimation" by matching against both visual content and spoken words. For pure audio (podcasts, meetings), transcription with tools like Whisper followed by text embedding often suffices. More sophisticated systems embed audio directly, but transcription-based search is simpler and works well for most use cases.

The cost-benefit calculus for multimodal RAG differs from text-only. Processing images and videos requires more compute. Storing multiple embedding types increases database costs. But for domains where visual information is critical—architecture, manufacturing, healthcare—the capability to retrieve and reason over images and videos is essential. You're not limited to finding text descriptions of things; you can find the actual visual artifacts.

Structured Data and Database Integration

Not all organizational knowledge lives in documents. Databases hold structured information—project timelines, cost data, material specifications, contractor records. Integrating RAG with structured data sources unlocks richer capabilities than either approach alone.

The common pattern is text-to-SQL: convert natural language queries into SQL statements, execute them, and return results. "Show me the top 10 projects by budget overrun in the last year" becomes: SELECT project_name, (actual_cost - budgeted_cost) AS overrun FROM projects WHERE year = 2025 ORDER BY overrun DESC LIMIT 10. The challenge is that models need to understand your database schema—table names, column definitions, relationships between tables.

For small databases with a handful of tables, you can include the entire schema in the prompt. But this doesn't scale. A production database with 100+ tables and thousands of columns exceeds context limits and overwhelms the model with irrelevant schema information. This is where RAG enters: retrieve only the relevant table schemas based on the query.

The retrieval strategy uses semantic understanding of table descriptions. You maintain natural language descriptions of each table ("contains project timelines and milestone dates") and each important column. When a query arrives, embed it and retrieve the most relevant table schemas. The model then generates SQL using only those tables, dramatically improving accuracy and reducing errors.

More sophisticated systems include example queries in the retrieval. Store pairs of natural language questions and their corresponding SQL statements. When a new query comes in, retrieve similar examples and use them as few-shot demonstrations. The model learns the SQL patterns specific to your schema by seeing examples, rather than reasoning from scratch.
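Putting the two ideas together, a prompt builder might look like this. The table descriptions, example pairs, and the word-overlap `relevance` function are all illustrative stand-ins (a real system would use embedding similarity):

```python
# Sketch of schema retrieval for text-to-SQL: select only the relevant
# table descriptions, then assemble a few-shot prompt for the model.
# relevance() is a word-overlap stand-in for embedding similarity.

TABLE_DOCS = {
    "projects":   "project names, budget, actual cost, completion year",
    "suppliers":  "supplier contacts, materials supplied, pricing",
    "milestones": "project timelines and milestone dates",
}

EXAMPLES = [
    ("Which projects finished in 2024?",
     "SELECT project_name FROM projects WHERE year = 2024;"),
]

def relevance(query, description):
    return len(set(query.lower().split()) & set(description.lower().split()))

def build_prompt(question, top_k=1):
    tables = sorted(TABLE_DOCS,
                    key=lambda t: relevance(question, TABLE_DOCS[t]),
                    reverse=True)[:top_k]
    schema = "\n".join(f"{t}: {TABLE_DOCS[t]}" for t in tables)
    shots = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in EXAMPLES)
    return f"Schema:\n{schema}\n\n{shots}\n\nQ: {question}\nSQL:"

prompt = build_prompt("Show top projects by budget overrun")
```

The model now sees only the one relevant schema plus a worked example, instead of a hundred irrelevant tables.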

The real power emerges when combining structured and unstructured retrieval. Consider a customer support scenario: "Why was my last order delayed?" The agent first queries the database to get order details (SQL), then retrieves shipping policies and known issues from documentation (RAG), and finally synthesizes both into a personalized response that combines factual order data with explanatory context.

For teams working with data analysis, integrating code execution with RAG creates powerful workflows. Retrieve relevant datasets, generate pandas or numpy code to analyze them, execute the code in a sandbox environment, and use the results to generate insights. "Plot our monthly project costs and identify outliers" becomes: retrieve cost data → generate matplotlib code → execute → return chart. The model performs precise numerical analysis rather than approximating from text.

The New RAG Platform Landscape

The RAG ecosystem is consolidating around managed platforms that simplify implementation at the cost of control. Understanding these options helps you choose the right level of abstraction for your needs.

Google Vertex AI Search (formerly Enterprise Search) offers RAG-as-a-service with minimal setup. Upload documents, and Google handles chunking, embedding, indexing, and retrieval automatically. Query the API, get results. For teams without ML expertise or infrastructure, this removes substantial complexity. The tradeoff: you can't tune chunking strategies, swap embedding models, or access the underlying infrastructure. It's a black box optimized for ease over control.

OpenAI's File Search (part of the Assistants API) provides similar simplicity within OpenAI's ecosystem. Upload up to 10,000 files per assistant, and OpenAI automatically chunks, embeds, and indexes them. When the assistant generates responses, it can call the file_search tool to retrieve relevant context and cite sources. For applications already committed to OpenAI models, this eliminates the need for separate vector database infrastructure. The limitation: you're locked into OpenAI's chunking approach, can't use hybrid search, and can't export embeddings for use elsewhere.


Anthropic's prompt caching takes a different approach to cost optimization. Instead of managing retrieval infrastructure, it caches large portions of prompts (like retrieved documents) across requests. Mark your retrieved context as cacheable, and subsequent queries pay only 10% of the normal cost for cached tokens. For systems that repeatedly retrieve the same documents—like internal knowledge bases where popular documents appear frequently—this yields 90% cost reduction on input tokens.

The architectural implication: you can afford to include more context per query. Instead of optimizing for minimal context (fewer chunks retrieved), you can retrieve more comprehensively and rely on caching to control costs. The limitation is cache duration (currently 5 minutes) and it only works with Claude models, but the cost savings are substantial for the right use patterns.
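A back-of-envelope calculation shows why this changes the economics. The prices below are hypothetical; only the 10%-of-normal-rate multiplier for cached tokens comes from the pattern described above:

```python
# Illustrative cost comparison for prompt caching, assuming cached input
# tokens bill at 10% of the normal rate. The $3/1M rate is hypothetical.

def input_cost(tokens, rate_per_m, cached_fraction=0.0):
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh * rate_per_m + cached * rate_per_m * 0.10) / 1_000_000

# 50k tokens of retrieved context per query:
without_cache = input_cost(50_000, 3.0)                    # 0.15 per query
with_cache = input_cost(50_000, 3.0, cached_fraction=0.9)  # 0.0285 per query
```

With 90% of the context cached, per-query input cost drops by roughly 80%, which is what makes "retrieve comprehensively, cache aggressively" viable.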

Platform Selection Guide: Use managed platforms (Vertex AI Search, OpenAI File Search) for MVPs, internal tools, and teams without ML infrastructure expertise. Build custom RAG when you need fine-grained control, want to optimize costs at scale, or require specific retrieval algorithms not supported by platforms.

Open-source embedding models continue improving. BGE (from the Beijing Academy of Artificial Intelligence) now outperforms OpenAI on standard benchmarks while being free to self-host. Microsoft's E5 models offer strong performance at multiple size tiers. The tradeoff remains the same: open-source models are free per query but require GPU infrastructure, while paid APIs charge per token but eliminate infrastructure management. At sufficient scale, self-hosting becomes economically attractive.

Cohere's Embed v3 introduced interesting capabilities for production systems. The compression feature reduces embedding dimensions from 1024 to 256 with minimal accuracy loss—a 4× reduction in storage costs. The multilingual support covers 100+ languages in a single model, eliminating the need for language-specific embeddings. These optimizations matter when you're storing millions of vectors and serving queries globally.
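The storage math behind dimension compression is straightforward; the sketch below assumes float32 vectors and an illustrative corpus of 10 million embeddings:

```python
# Storage footprint of float32 embeddings at 1024 vs 256 dimensions,
# for an illustrative corpus of 10 million stored vectors.

def storage_gb(n_vectors, dims, bytes_per_value=4):
    return n_vectors * dims * bytes_per_value / 1e9

full = storage_gb(10_000_000, 1024)        # ~41 GB
compressed = storage_gb(10_000_000, 256)   # ~10 GB, a 4x reduction
```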

Production Patterns: Routing, Guardrails, and Resilience

Beyond core retrieval, production systems need sophisticated patterns for quality, safety, and reliability. These patterns don't change what RAG does but dramatically affect how well it works in practice.

Query routing classifies incoming queries and sends them down optimal paths. Not every query needs expensive retrieval and GPT-4 generation. "What's 2+2?" doesn't need to search your knowledge base—route it directly to a cheap model. "Explain our company's sustainability policy" clearly needs RAG and careful generation—route to comprehensive retrieval plus GPT-4.

The routing logic can be simple (rule-based keywords) or sophisticated (semantic classification using embeddings or an LLM). The cost savings add up: routing 60% of queries to GPT-3.5 instead of GPT-4 cuts costs substantially while maintaining quality for simple questions. More advanced routers consider multiple dimensions—query complexity, required accuracy, latency tolerance, user tier.
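A rule-based router is only a few lines. The keyword heuristics and model names below are placeholders; a production router would typically classify with embeddings or a small LLM:

```python
# Rule-based router sketch: cheap heuristics decide which model to call
# and whether retrieval is needed. Model names are illustrative placeholders.

def route(query):
    q = query.lower()
    needs_rag = any(w in q for w in ("our", "policy", "internal", "project"))
    complex_query = len(q.split()) > 12 or "compare" in q
    model = "large-model" if (needs_rag or complex_query) else "small-model"
    return {"model": model, "use_rag": needs_rag}

route("What's 2+2?")                                  # cheap path, no retrieval
route("Explain our company's sustainability policy")  # RAG + larger model
```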

Guardrails add safety and quality checks at different pipeline stages. Input guardrails scan queries before processing—detecting personally identifiable information (PII), blocking prompt injection attempts, enforcing rate limits, and filtering inappropriate content. Output guardrails verify responses after generation—checking factual consistency against retrieved context, detecting hallucinations, filtering toxic outputs, and redacting sensitive information.

Tools like NeMo Guardrails (NVIDIA) and Guardrails AI provide programmable frameworks for these checks. Llama Guard (Meta) offers a safety model specifically trained to detect harmful content. The tradeoff is latency—each guardrail adds 50-200ms. Critical applications like healthcare or finance justify comprehensive guardrails despite the latency impact. Casual chatbots might skip some checks for speed.

Semantic caching goes beyond exact query matching. Instead of caching only identical prompts, embed incoming queries and search the cache for similar queries (cosine similarity > 0.95). If you find a match, return the cached result without retrieval or generation. For applications with common question patterns—customer support, documentation Q&A—this achieves 60-80% cache hit rates, dramatically reducing costs.

The implementation stores query embeddings in Redis with vector search capabilities, or uses managed solutions like GPTCache or Helicone. The key insight: users often ask the same question phrased differently. "How do I reset my password?" and "What's the process for password recovery?" are semantically similar enough to share cached results.
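The cache logic itself is simple. In the sketch below, `embed()` is a toy bag-of-characters stand-in purely to make the example self-contained; a real deployment would use genuine embeddings stored in Redis or a tool like GPTCache:

```python
# Semantic cache sketch: store (query embedding, response) pairs and serve
# cached answers when a new query is close enough. embed() is a toy stand-in.
import math

def embed(text):
    # Toy bag-of-characters embedding, purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.entries = []  # list of (embedding, response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # near-duplicate query: reuse the answer
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("How do I reset my password?", "Use the account settings page.")
hit = cache.get("How do I reset my password")  # near-identical phrasing
```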

Resilience patterns handle infrastructure failures gracefully. Your vector database will occasionally have issues. Web search APIs sometimes time out. Design multi-layer fallbacks: if the primary vector search fails, fall back to BM25 search in Elasticsearch (different infrastructure). If that fails, try web search. If everything fails, acknowledge limitations but still attempt to help based on the model's training data.

Circuit breakers prevent cascading failures. Track error rates for each component. If vector search starts failing at high rates, immediately skip to the fallback without waiting for timeouts on every request. This keeps your system responsive even when individual components degrade.
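The breaker-plus-fallback pattern can be sketched as follows; the failing vector search and BM25 fallback are stubs standing in for real infrastructure calls:

```python
# Circuit-breaker sketch: after repeated failures, skip the flaky
# component and go straight to the fallback instead of waiting on timeouts.

class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.failures = 0
        self.max_failures = max_failures

    def call(self, primary, fallback, *args):
        if self.failures >= self.max_failures:
            return fallback(*args)  # circuit open: skip the primary entirely
        try:
            result = primary(*args)
            self.failures = 0       # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            return fallback(*args)

def flaky_vector_search(query):
    raise TimeoutError("vector DB unavailable")

def bm25_search(query):
    return f"bm25 results for {query!r}"

breaker = CircuitBreaker()
for _ in range(4):
    answer = breaker.call(flaky_vector_search, bm25_search, "site photos")
```

After the third failure the breaker stops calling the vector database at all, so the fourth request returns from the fallback without paying the timeout.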

The Future: Longer Contexts, Reasoning Models, and Autonomous Agents

The RAG landscape is evolving rapidly. Understanding likely directions helps build systems that remain relevant as capabilities advance.

Longer context windows continue expanding. 200K tokens today, 1M tokens soon, potentially 10M+ tokens within a few years. This shifts the threshold for when RAG becomes necessary, but it doesn't eliminate RAG's value. Three factors ensure RAG remains relevant: cost scales with context length (10M tokens is expensive even if possible), retrieval targets information efficiently rather than loading everything, and real-time or frequently-updated data requires retrieval regardless of context size.

The likely pattern: hybrid approaches where frequently-accessed documents live in context while the long tail of rarely-needed information is retrieved on demand. Think of it like memory hierarchies in computer systems—hot data in fast cache (context), cold data in slower storage (retrieved).

Reasoning models like OpenAI's o1 series change how models interact with retrieved information. These models think step-by-step before answering, explicitly reasoning through problems. Applied to RAG, this enables more sophisticated multi-document reasoning—synthesizing information across sources, identifying contradictions, and making inferences that require combining multiple pieces of evidence.

Reasoning models show potential improvements of 30-40% on complex RAG tasks.

The combination unlocks new capabilities. Instead of simple question answering, imagine: "Analyze these architectural specifications, identify conflicts with building codes, and propose modifications that maintain design intent while achieving compliance." This requires understanding specifications, retrieving relevant code sections, reasoning about conflicts, and generating creative solutions—all tasks that benefit from explicit reasoning.

Agentic RAG becomes standard as costs decrease and reliability improves. The future isn't static retrieval pipelines but agents that autonomously decide when to retrieve, what to retrieve, and how to combine information. Models plan multi-step workflows, invoke tools as needed (RAG being one tool among many), self-correct when outputs are inconsistent, and iterate until they generate high-quality results.

This shift is already visible in research systems and early products. The barrier has been cost—agentic workflows make multiple LLM calls instead of one, multiplying expenses. But as inference becomes cheaper and models become more efficient, the economic constraints ease. A year from now, single-pass RAG might be considered the legacy approach.

Multimodal-first retrieval becomes the norm rather than the exception. Text-only RAG is already limiting for many domains. Future systems natively handle mixed queries: search with an image, retrieve related documents and videos. Query with voice, retrieve visual diagrams. The unified multimodal embeddings currently emerging from research labs will become production-ready, making cross-modal search as simple as text search is today.

For industries like architecture and construction where visual information is critical—drawings, site photos, material samples, finished installations—this evolution is particularly important. The ability to query "Show me curtain wall installations similar to this photo, along with their specifications and cost data" becomes straightforward when multimodal retrieval is standard infrastructure.

Fine-tuned retrievers will move from research to production. Currently, most systems use off-the-shelf embeddings from OpenAI or Cohere. These general-purpose embeddings work well, but domain-specific retrievers trained on your data can improve precision significantly. The barriers are falling: easier training frameworks (sentence-transformers), synthetic data generation for creating training examples, and transfer learning from large models reduce the expertise and data required.

The feedback loop becomes practical: log what users click, what they reformulate, what they mark as helpful. Use this implicit feedback to fine-tune embeddings over time, continuously improving retrieval quality. This is reinforcement learning from human feedback (RLHF) applied to retrieval rather than generation.

Building for an Evolving Landscape

Given rapid changes in capabilities and costs, how do you build systems that remain effective as the landscape shifts?

Design for modularity. Abstract interfaces between components so you can swap embedding models, switch vector databases, or update LLM providers without full rewrites. Today's optimal choices won't be optimal in six months. Systems designed with hard dependencies on specific vendors or models become expensive to migrate.

Invest in evaluation infrastructure before optimization. Continuous evaluation pipelines, A/B testing frameworks, and metrics dashboards let you measure impact when you change components. Without measurement, you're guessing whether the new embedding model actually improved results. With it, you make data-driven decisions about where to invest engineering time.

Track costs at component level. Instrument your system to log costs for embedding generation, vector storage, query operations, and LLM generation separately. Analyze monthly spending by feature. This visibility enables optimization targeting actual costs rather than assumed bottlenecks. You might discover that 90% of costs come from a single feature used by 5% of users—a clear candidate for optimization or pricing changes.

Design for graceful degradation. Production systems fail in unexpected ways. Vector databases go down. API rate limits hit. External services time out. Build fallback mechanisms at each failure point. Multiple retrieval methods (vector + BM25 + web search). Cached responses for common queries. Clear error messages when degraded mode is active. Users tolerate temporary reduced capability better than complete failures.

Capture user feedback systematically. Both explicit signals (thumbs up/down, ratings) and implicit signals (query reformulation, stop rate, what users click) provide ground truth for system quality. Design feedback loops where this data flows back to training datasets and evaluation metrics. Systems that learn from production usage improve faster than those relying only on offline evaluation.

The future of RAG isn't replacement by longer contexts or more capable models. It's evolution toward more sophisticated systems that combine retrieval, reasoning, and action. Foundation models handle general knowledge. Retrieval provides specialized and current information. Humans verify critical decisions. Each layer optimizes for what it does best, creating systems more capable than any component alone.