Building Production-Grade RAG Systems: Understanding the Problem Space

I've been quiet on this blog for a while now. Truth is, I lost my appetite for writing these past months. Between traveling to conferences, delivering talks, and shipping some cool features at work, the keyboard just didn't feel the same. There was also this nagging voice in my head: AI content has taken over the world, so why bother writing another blog post when an LLM-generated version will probably be better anyway?

But here's the thing: as I kept hacking away on projects, I stumbled across posts that made me pause. Posts that weren't just technically correct; they had personality and insights born from battle scars, the kind of stuff you can't prompt-engineer. And I realised: maybe that's exactly what's missing. Stories from developers doing real work. So this is my modest attempt at bringing this blog back to life.

This post kicks off a three-part series where I dive deep into something I've been building over the past few weeks: production-grade RAG applications. The kind that (hopefully) survives production traffic, handles failures gracefully, and doesn't bankrupt us on LLM costs. Along the way, I'll share the lessons I learned (some the hard way).

When we think about building a Retrieval Augmented Generation (RAG) system, the first instinct is often to grab a vector database, throw in some embeddings, connect an LLM, and call it a day. I've been there. But production RAG systems are an entirely different beast. The gap between a proof-of-concept and a system that can handle real user traffic, maintain acceptable latency, and provide reliable answers is wider than I initially (naively) thought.

Throughout the series, I'll walk you through building a production-inspired RAG pipeline using Java 25, Spring Boot 3.5.7 (with Spring AI), and Kubernetes. We'll explore not just the happy path, but the real challenges: graceful degradation, semantic caching, hybrid retrieval strategies, observability, and intelligent autoscaling.

The RAG Promise and Reality

Retrieval Augmented Generation fundamentally solves a critical problem with Large Language Models: hallucinations and knowledge staleness. Instead of relying solely on the model's knowledge, RAG systems ground responses in retrieved documents. The architecture is conceptually simple:

  1. User asks a question
  2. System retrieves relevant documents from a knowledge base
  3. Documents are injected into the LLM prompt as context
  4. LLM generates an answer grounded in the retrieved facts
  5. Response is returned with citations
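
In code, that happy path fits in a handful of lines. Here's a minimal sketch assuming Spring AI's fluent ChatClient API; the Retriever interface and the Doc and Answer records are illustrative placeholders, not types you'll find in the actual repo:

    import java.util.List;

    import org.springframework.ai.chat.client.ChatClient;

    // Minimal happy-path RAG flow. Retriever, Doc and Answer are illustrative
    // placeholders, not the real types used later in this series.
    class NaiveRagService {

        interface Retriever { List<Doc> retrieve(String query); }
        record Doc(String id, String text) {}
        record Answer(String text, List<String> citations) {}

        private final Retriever retriever;
        private final ChatClient chatClient;

        NaiveRagService(Retriever retriever, ChatClient chatClient) {
            this.retriever = retriever;
            this.chatClient = chatClient;
        }

        Answer ask(String question) {
            List<Doc> docs = retriever.retrieve(question);                   // 2. retrieve relevant documents
            String context = String.join("\n---\n",
                    docs.stream().map(Doc::text).toList());
            String prompt = "Answer using only this context:\n" + context    // 3. inject documents as context
                    + "\n\nQuestion: " + question;
            String text = chatClient.prompt().user(prompt).call().content(); // 4. grounded generation
            return new Answer(text, docs.stream().map(Doc::id).toList());    // 5. response with citations
        }
    }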

But here's where things get interesting. This simple flow hides a multitude of production concerns that can make or break a system.

What Goes Wrong in Production RAG Systems?

There are a few things that will wreck your sleep and get you out of bed at 3am to an alert:

The Latency Problem

It obviously depends on your SLO, but traditional RAG systems often stack latencies in series: retrieval time + embedding time + LLM inference time + response streaming. Each step adds milliseconds (or worse, seconds) to your user's waiting time. When your vector database takes 300ms to return results, your LLM takes 2 seconds for first token, and you're processing embeddings for cache lookups, you're looking at 3+ seconds before users see anything.

Users might expect sub-second responses. Anything beyond 2 seconds feels broken. Again, it largely depends on your SLOs.

The Reliability Problem

Vector databases can timeout. LLMs can be overloaded. Networks can fail. In a traditional RAG pipeline, any single component failure means the entire request fails. That's unacceptable in production.

What happens when Weaviate is under load and the semantic search times out? Do we return an error? Or do we have a fallback strategy that still delivers value to our users?

The Cost Problem

Every LLM call costs money. Every embedding calculation burns CPU cycles. When users ask the same question five different ways ("How do I deploy to K8s?", "What's the Kubernetes deployment process?", "K8s deployment steps?"), we're essentially paying for the same answer multiple times.

Even worse, we're making our users wait for responses we've already computed.

The Quality Problem

Vector similarity alone isn't always enough. Sometimes lexical matching (good old BM25) finds documents that semantic search misses—especially for exact terms, acronyms, or technical identifiers. Relying solely on embeddings can leave quality on the table.

The Observability Problem

When the RAG pipeline misbehaves—returning poor answers, experiencing high latency, or burning through our LLM budget—how do we debug it? Traditional application monitoring doesn't capture the nuances of retrieval quality, cache hit rates, or generation costs.

We need visibility into every stage of the pipeline, from retrieval to generation, with metrics that actually matter for RAG workloads.

The Blueprint (?)

I've been digging through different patterns and best practices for building a RAG system that addresses each of these concerns. The target architecture for me wasn't just about making things work; it's about making them work reliably, cost-effectively, and observably at scale.

Here's the high-level blueprint:

Let's break down how this architecture solves each problem:

Solving Latency: Semantic Caching

Before doing any expensive operations, we check Redis for semantically similar queries. The cache doesn't just match exact strings—it computes cosine similarity between query embeddings. If a user asks "How does autoscaling work?" and we've previously answered "Explain the autoscaling mechanism", we detect that similarity (with a threshold of 0.90, for example) and return the cached response immediately.

This short-circuits the entire pipeline. No retrieval. No LLM call. Sub-100ms response times.

The cache stores:

  • Normalized query text (with PII redaction)
  • Deterministic query embedding (8-dimensional for demo purposes)
  • Complete generated answer
  • Citation list
  • Retrieved document IDs
  • Timestamp for observability

Cache entries expire after 10 minutes by default, keeping answers fresh as documentation evolves.
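
Conceptually, the lookup is just "compare the incoming query's embedding against cached entries and reuse anything above the threshold". A minimal sketch of that check, where the CachedAnswer record and the way candidates are fetched are assumptions (the real implementation would pull candidates out of Redis):

    import java.util.List;
    import java.util.Optional;

    // Sketch of a semantic cache check. CachedAnswer and the candidate-fetching
    // strategy are illustrative; candidates would come from Redis in practice.
    class SemanticCache {

        record CachedAnswer(float[] embedding, String answer, List<String> citations) {}

        private static final double SIMILARITY_THRESHOLD = 0.90;

        Optional<CachedAnswer> lookup(float[] queryEmbedding, List<CachedAnswer> candidates) {
            return candidates.stream()
                    .filter(c -> cosine(queryEmbedding, c.embedding()) >= SIMILARITY_THRESHOLD)
                    .findFirst(); // good enough for a sketch; a real impl would keep the best match
        }

        static double cosine(float[] a, float[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }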

Solving Reliability: Layered Fallbacks

The system implements graceful degradation at every level:

Retriever Fallback: Weaviate has a strict timeout budget (250ms). If it doesn't respond in time, the retriever automatically falls back to OpenSearch for lexical BM25 search. The user still gets an answer... maybe not the semantically perfect one, but a relevant one based on keyword matching.
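
Stripped of the actual client code, the timeout-then-fallback logic looks roughly like this; VectorClient and LexicalClient are hypothetical stand-ins for the Weaviate and OpenSearch clients:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    // Retrieval with a strict vector-search budget and a lexical fallback.
    // VectorClient and LexicalClient are placeholders for the real clients.
    class FallbackRetriever {

        interface VectorClient { List<String> semanticSearch(String query); }
        interface LexicalClient { List<String> bm25Search(String query); }

        private final VectorClient weaviate;
        private final LexicalClient openSearch;

        FallbackRetriever(VectorClient weaviate, LexicalClient openSearch) {
            this.weaviate = weaviate;
            this.openSearch = openSearch;
        }

        List<String> retrieve(String query) {
            return CompletableFuture.supplyAsync(() -> weaviate.semanticSearch(query))
                    .orTimeout(250, TimeUnit.MILLISECONDS)             // strict vector-search budget
                    .exceptionally(ex -> openSearch.bm25Search(query)) // degrade to lexical BM25
                    .join();
        }
    }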

Generator Fallback: If the LLM endpoint times out or returns an error, the orchestrator doesn't fail. Instead, it synthesizes a deterministic answer by summarizing the top retrieved chunks, clearly marking it as partial and including citations. Users get actionable information even when the model is unavailable.
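
A rough sketch of that degraded path, with the chunk, answer, and LLM types again being placeholders rather than the project's real classes:

    import java.util.List;
    import java.util.stream.Collectors;

    // Degraded generation: if the LLM call fails, summarize the top retrieved
    // chunks instead and flag the answer as partial. Types are placeholders.
    class ResilientGenerator {

        record Chunk(String docId, String text) {}
        record Answer(String text, List<String> citations, boolean partial) {}
        interface Llm { String generate(String prompt); }

        private final Llm llm;

        ResilientGenerator(Llm llm) { this.llm = llm; }

        Answer generate(String prompt, List<Chunk> topChunks) {
            List<String> citations = topChunks.stream().map(Chunk::docId).toList();
            try {
                return new Answer(llm.generate(prompt), citations, false);
            } catch (RuntimeException llmUnavailable) {
                // Deterministic fallback: lead sentences of the top chunks, marked partial.
                String bullets = topChunks.stream()
                        .limit(3)
                        .map(c -> "- " + c.text().split("\\.")[0] + ".")
                        .collect(Collectors.joining("\n"));
                String summary = "The model is unavailable; here is what the documentation says:\n" + bullets;
                return new Answer(summary, citations, true);
            }
        }
    }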

Streaming Resilience: Server-Sent Events (SSE) provide progressive rendering. Users see tokens as they're generated, and the final event includes citations and a partial flag indicating any degradation.
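
With Spring WebFlux, that contract can be expressed as a Flux of Server-Sent Events whose terminal event carries the metadata. A sketch, where the event names and the Completion payload are my own conventions for illustration:

    import java.util.List;

    import org.springframework.http.codec.ServerSentEvent;
    import reactor.core.publisher.Flux;

    // Progressive rendering over SSE: stream tokens as they arrive, then close
    // with a final event carrying citations and the partial/degraded flag.
    class StreamingResponses {

        record Completion(List<String> citations, boolean partial) {}

        Flux<ServerSentEvent<Object>> stream(Flux<String> tokens, Completion completion) {
            Flux<ServerSentEvent<Object>> tokenEvents = tokens
                    .map(t -> ServerSentEvent.builder((Object) t).event("token").build());
            Flux<ServerSentEvent<Object>> done = Flux.just(
                    ServerSentEvent.builder((Object) completion).event("done").build());
            return Flux.concat(tokenEvents, done);
        }
    }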

Every fallback event is instrumented, emitting OpenTelemetry spans with attributes like rag.fallback.reason=weaviate-timeout so we can measure how often each degradation path triggers.
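
Recording that attribute is a one-liner against the OpenTelemetry API; something along these lines, with the values hard-coded here purely for illustration:

    import io.opentelemetry.api.trace.Span;

    // Tag the active span so fallback frequency shows up directly in traces.
    class FallbackInstrumentation {

        void recordRetrievalFallback(String reason) {
            Span span = Span.current();
            span.setAttribute("rag.fallback.reason", reason);   // e.g. "weaviate-timeout"
            span.setAttribute("rag.retrieval.source", "opensearch");
        }
    }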

Solving Cost: Intelligent Caching and Deduplication

Semantic caching isn't just about latency—it's about cost. LLM calls are expensive. With a well-tuned cache, we can reduce redundant generation by 40-60% depending on the query distribution.

The cache uses deterministic embeddings for the demo (SHA-256 based hashing producing 8-dimensional vectors), but in production we'd use proper sentence embeddings. The key insight is that cosine similarity > 0.90 means "close enough" to reuse the answer.
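
For the curious, here's one way such a deterministic 8-dimensional vector could be derived from a SHA-256 digest; the repo may slice the hash differently, so treat this as the general idea rather than the exact implementation:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Demo-only "embedding": hash the normalized query and fold the digest into
    // 8 floats. Deterministic and dependency-free; production would swap in
    // real sentence embeddings.
    class DeterministicEmbedder {

        float[] embed(String normalizedQuery) {
            try {
                byte[] digest = MessageDigest.getInstance("SHA-256")
                        .digest(normalizedQuery.getBytes(StandardCharsets.UTF_8));
                float[] vector = new float[8];
                for (int i = 0; i < 8; i++) {
                    // Fold 4 bytes of the digest into each dimension, scaled to roughly [-1, 1].
                    int chunk = ((digest[i * 4] & 0xFF) << 24) | ((digest[i * 4 + 1] & 0xFF) << 16)
                            | ((digest[i * 4 + 2] & 0xFF) << 8) | (digest[i * 4 + 3] & 0xFF);
                    vector[i] = chunk / (float) Integer.MAX_VALUE;
                }
                return vector;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("SHA-256 not available", e);
            }
        }
    }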

Beyond caching, we instrument every request with estimated cost metrics. At $0.002 per 1K tokens (approximate Gemma-2 pricing), a Grafana dashboard visualizes cost-per-request trends, helping you optimize both caching and prompt engineering.
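
The cost accounting itself can be as simple as a Micrometer counter fed from the generated token count; a sketch, using the approximate price quoted above:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;

    // Rough per-request cost tracking: tokens in, estimated dollars out.
    class CostMetrics {

        private static final double USD_PER_1K_TOKENS = 0.002; // approximate figure from above

        private final Counter costCounter;
        private final Counter tokenCounter;

        CostMetrics(MeterRegistry registry) {
            this.costCounter = Counter.builder("rag_cost_usd_total")
                    .description("Estimated LLM spend in USD").register(registry);
            this.tokenCounter = Counter.builder("rag_tokens_generated_total")
                    .description("Generated tokens").register(registry);
        }

        void record(long generatedTokens) {
            tokenCounter.increment(generatedTokens);
            costCounter.increment(generatedTokens / 1000.0 * USD_PER_1K_TOKENS);
        }
    }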

Solving Quality: Hybrid Retrieval

Vector search excels at semantic similarity but can miss exact matches for technical terms, version numbers, or acronyms. OpenSearch provides lexical BM25 ranking that captures these cases.

The retriever service prioritizes vector search (Weaviate) but automatically falls back to lexical search (OpenSearch) when vector queries timeout. This hybrid approach ensures we get the best of both worlds: semantic understanding when available, lexical precision when needed.

In future iterations, we could combine both signals using a re-ranking model, but for many workloads, the fallback strategy alone provides sufficient quality.

Solving Observability: OpenTelemetry + Prometheus + Grafana

Every request flows through instrumented code paths. The observability stack captures:

Traces (OpenTelemetry + Tempo): End-to-end request traces showing retrieval time, document count, cache decisions, LLM first-token latency, and total tokens generated. Custom span attributes include:

  • rag.cache.hit: boolean
  • rag.retrieval.count: number of documents retrieved
  • rag.retrieval.source: "weaviate" or "opensearch"
  • rag.fallback.reason: why degradation occurred
  • rag.model.name: which LLM was used
  • rag.tokens.total: generated token count
  • rag.ttft_ms: time to first token

Metrics (Prometheus + Grafana): Counters and histograms for:

  • rag_orchestrator_latency: p50/p95/p99 end-to-end latency
  • rag_cache_hit_total / rag_cache_miss_total: cache efficiency
  • rag_retrieval_fallback_total: how often fallback triggered
  • rag_tokens_generated_total: token consumption trends
  • rag_cost_usd_total: estimated spend per request

Grafana dashboards visualize these metrics alongside autoscaling replica counts, giving operators a complete view of system behavior under load.

Final thoughts

Whether you're building internal documentation search, customer support automation, or code assistance tools, you'll probably face the same tradeoffs around latency, reliability, cost, and quality.

The architecture I'm suggesting here worked for me, and might be useful for someone on the internet facing the same challenges. It's built to embrace failure.

The complete code is available at github.com/aboullaite/rag-java-k8s. You can run the entire stack locally with make dev-up or deploy to GKE with make gke-cluster.

Stay tuned. The next post will get hands-on with code and architectural patterns.