Building Production-Grade RAG Systems: Architecture Deep Dive
In the first part, we explored the production challenges of RAG systems: latency, reliability, cost, quality, and observability. Now let's get our hands dirty with the actual architecture and implementation.
The codebase uses Java 25, Spring Boot 3.5.7, reactive programming with WebFlux, and follows production patterns you'd see in enterprise systems. Every design decision has a reason, and I'll explain the tradeoffs as we go.
Service Boundaries: Why Separation Matters
The system is split into three main modules:
common/ # Shared DTOs and tracing helpers
retriever/ # Reactive Weaviate/OpenSearch retriever service
orchestrator/ # Orchestration, caching, LLM routing, SSE
This separation isn't arbitrary. Each service has distinct scaling characteristics and failure modes:
- Retriever is CPU and I/O intensive, scales horizontally, and needs aggressive timeouts
- Orchestrator manages state (semantic cache), handles user connections (SSE), and coordinates the pipeline
- Common provides shared contracts (DTOs) and telemetry utilities that both services use
By isolating these responsibilities, we can scale the retriever independently during traffic spikes while keeping the orchestrator stable. If the retriever service crashes, the orchestrator can still serve cached responses.
The Retriever Service: Hybrid Search with Fallbacks
Let's start with document retrieval. The retriever service exposes a single endpoint:
POST /v1/retrieve
Request:
{
"text": "How does autoscaling work?",
"filters": {"section": "infrastructure"},
"topK": 5
}
Response:
[
{
"id": "doc-03-autoscaling",
"chunk": "Autoscaling combines HPA and KEDA...",
"score": 0.87,
"metadata": {
"source": "doc-03-autoscaling.md",
"section": "infrastructure"
}
}
]
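Those payloads map onto shared records in the common/ module. The exact definitions live in the repo; a minimal sketch consistent with the JSON above (field names inferred from the example, each record in its own file) might look like:
// Hypothetical sketch of the shared DTOs in common/ (field names inferred from the JSON above)
import java.util.Map;

public record Query(String text, Map<String, String> filters, int topK) {}

public record DocMetadata(String source, String section) {}

public record RetrievedDoc(String id, String chunk, double score, DocMetadata metadata) {}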
The implementation looks deceptively simple, but it is designed primarily for resilience:
// retriever/src/main/java/me/aboullaite/rag/retriever/service/RetrieverService.java
@Service
public class RetrieverService {
    private static final Logger log = LoggerFactory.getLogger(RetrieverService.class);
    private final WeaviateGateway weaviateGateway;
    private final OpenSearchGateway openSearchGateway;
    private final RetrieverProperties properties;
    private final MeterRegistry meterRegistry;
    private final Timer retrievalLatency;
    private final Counter fallbackCounter;
    private final Tracer tracer;
public Mono<List<RetrievedDoc>> retrieve(Query query) {
int topK = query.topK() > 0 ? query.topK() : properties.getTopKDefault();
return Mono.defer(() -> executeRetrieval(query, topK));
}
private Mono<List<RetrievedDoc>> executeRetrieval(Query query, int topK) {
Span span = tracer.spanBuilder("rag.retrieve")
.setAttribute("rag.request.topK", topK)
.startSpan();
Timer.Sample sample = Timer.start(meterRegistry);
return weaviateGateway.search(query, topK)
.timeout(Duration.ofMillis(properties.getTimeoutMs()))
.onErrorResume(throwable -> fallback(query, topK, span, throwable))
.doOnNext(docs -> TracingUtils.recordRetrievedDocs(span, docs))
.doOnError(span::recordException)
.doFinally(signalType -> {
sample.stop(retrievalLatency);
span.end();
});
}
private Mono<List<RetrievedDoc>> fallback(Query query, int topK, Span parentSpan, Throwable throwable) {
boolean timeout = throwable instanceof TimeoutException;
log.warn("Primary vector search failed (timeout={}): {}", timeout, throwable.getMessage());
fallbackCounter.increment();
TracingUtils.recordFallback(parentSpan, timeout ? "weaviate-timeout" : throwable.getClass().getSimpleName());
if (!openSearchGateway.isEnabled()) {
return Mono.just(List.of());
}
return openSearchGateway.search(query, topK)
.doOnNext(docs -> TracingUtils.recordRetrievedDocs(parentSpan, docs));
}
}
Key Design Decisions
1. Reactive Streams with Project Reactor
Notice the return type: Mono<List<RetrievedDoc>>. This is Project Reactor's reactive type for 0-1 values. By using reactive programming:
- We avoid blocking threads during I/O
- Timeouts are first-class citizens (.timeout(Duration.ofMillis(250)))
- Error handling composes naturally (.onErrorResume())
- Observability hooks integrate seamlessly (.doOnNext(), .doFinally())
Spring Boot's WebFlux framework handles request threads efficiently, allowing the retriever to handle hundreds of concurrent requests without thread pool exhaustion.
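To make the wiring concrete, here's a minimal sketch of a WebFlux controller exposing the service over POST /v1/retrieve. The class name and annotations are my assumptions, since the actual controller isn't shown here:
// Hypothetical controller sketch for the POST /v1/retrieve endpoint
@RestController
public class RetrieverController {

    private final RetrieverService retrieverService;

    public RetrieverController(RetrieverService retrieverService) {
        this.retrieverService = retrieverService;
    }

    @PostMapping("/v1/retrieve")
    public Mono<List<RetrievedDoc>> retrieve(@RequestBody Query query) {
        // Delegates to the reactive service; WebFlux never blocks a request thread here
        return retrieverService.retrieve(query);
    }
}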
2. Aggressive Timeouts
The default timeout is 250ms. That's intentionally tight. Why?
- Users expect sub-second responses
- Vector databases can have occasional slow queries (large result sets, index rebuilds, etc.)
- We'd rather fall back to lexical search than make users wait. That is, of course, debatable and depends on what we're optimizing for!
In load testing, this timeout triggers fallback ~5-15% of the time under heavy load, which is acceptable given the graceful degradation.
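Both the timeout and the default topK come from RetrieverProperties rather than being hard-coded. A minimal sketch of what that configuration class could look like (the property prefix and defaults are assumptions based on the values discussed above):
// Hypothetical configuration-properties sketch backing the timeout and topK defaults
@ConfigurationProperties(prefix = "retriever")
public class RetrieverProperties {

    private int timeoutMs = 250;   // aggressive primary-search timeout
    private int topKDefault = 5;   // documents returned when the caller doesn't specify topK

    public int getTimeoutMs() { return timeoutMs; }
    public void setTimeoutMs(int timeoutMs) { this.timeoutMs = timeoutMs; }

    public int getTopKDefault() { return topKDefault; }
    public void setTopKDefault(int topKDefault) { this.topKDefault = topKDefault; }
}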
3. Observability at Every Step
Every retrieval is instrumented:
- OpenTelemetry Span: captures timing, document count, and fallback reasons
- Prometheus Timer: records latency histogram for p95/p99 analysis
- Prometheus Counter: tracks fallback frequency
When debugging production issues, we can simply filter Tempo traces by rag.fallback.reason=weaviate-timeout to see exactly which requests degraded.
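The retrievalLatency timer and fallbackCounter above are ordinary Micrometer instruments. A minimal sketch of how they might be registered in the service constructor (the metric names are my assumption, not necessarily what the repo uses):
// Hypothetical instrument registration inside the RetrieverService constructor (metric names assumed)
this.retrievalLatency = Timer.builder("rag.retrieval.latency")
        .description("End-to-end document retrieval latency")
        .publishPercentileHistogram()       // lets Prometheus compute p95/p99 from histogram buckets
        .register(meterRegistry);

this.fallbackCounter = Counter.builder("rag.retrieval.fallback")
        .description("How often lexical fallback replaced the primary vector search")
        .register(meterRegistry);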
4. Lexical Fallback via OpenSearch
Weaviate is great for semantic search, but sometimes you need exact term matching. OpenSearch provides BM25 ranking, which excels at:
- Acronyms (e.g., "HPA", "KEDA", "SSE")
- Version numbers (e.g., "Java 25", "Spring Boot 3.5.7")
- Exact phrases (e.g., "Server-Sent Events")
The fallback is a deliberate, lightweight form of hybrid retrieval. Some RAG systems run both searches and combine vector and lexical signals with a re-ranker; here, lexical search kicks in only when vector search fails, which keeps the pipeline simple while preserving quality.
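The OpenSearchGateway itself isn't shown above; conceptually it just issues a BM25 match query. A hedged sketch using OpenSearch's plain REST _search API over WebClient (the index name and field name are assumptions):
// Hypothetical lexical-fallback sketch: a BM25 match query against OpenSearch's REST _search API
public Mono<Map<String, Object>> lexicalSearch(WebClient openSearchClient, String text, int topK) {
    Map<String, Object> body = Map.of(
            "size", topK,
            "query", Map.of("match", Map.of("chunk", text))  // BM25-ranked keyword match; field name assumed
    );
    return openSearchClient.post()
            .uri("/rag-docs/_search")                         // index name assumed
            .bodyValue(body)
            .retrieve()
            // mapping the hits.hits array into RetrievedDoc instances is omitted for brevity
            .bodyToMono(new ParameterizedTypeReference<Map<String, Object>>() {});
}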
The Orchestrator Service: Coordination and Caching
The orchestrator is the brain of the system. It coordinates caching, retrieval, prompt assembly, generation, and streaming. Let's walk through the request flow.
Request Flow Diagram
(Diagram: incoming request → semantic cache lookup → retriever call → prompt assembly → LLM generation → SSE streaming back to the client.)
Semantic Cache Implementation
The semantic cache is the secret weapon for both latency and cost optimization. Here's the code:
// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/cache/SemanticCacheService.java
@Service
public class SemanticCacheService {
private static final String CACHE_INDEX = "rag:cache:index";
private static final String CACHE_KEY_PREFIX = "rag:cache:";
private static final double SIMILARITY_THRESHOLD = 0.90;
private static final Duration CACHE_TTL = Duration.ofMinutes(10);
private final RedisTemplate<String, String> redisTemplate;
public Mono<CacheHit> lookup(String normalizedQuery, double[] embedding) {
return Mono.fromCallable(() -> {
Set<String> keys = redisTemplate.opsForSet().members(CACHE_INDEX);
if (keys == null || keys.isEmpty()) {
return null;
}
double maxSimilarity = 0.0;
CacheEntry bestMatch = null;
for (String key : keys) {
String json = redisTemplate.opsForValue().get(key);
if (json == null) continue;
CacheEntry entry = deserialize(json);
double similarity = SimilarityUtils.cosineSimilarity(embedding, entry.embedding());
if (similarity > maxSimilarity && similarity >= SIMILARITY_THRESHOLD) {
maxSimilarity = similarity;
bestMatch = entry;
}
}
return bestMatch != null ? new CacheHit(bestMatch, maxSimilarity) : null;
}).subscribeOn(Schedulers.boundedElastic());
}
public Mono<Void> put(String normalizedQuery, double[] embedding,
GenerationResponse response, List<RetrievedDoc> docs) {
return Mono.fromRunnable(() -> {
String key = CACHE_KEY_PREFIX + UUID.randomUUID();
CacheEntry entry = new CacheEntry(
normalizedQuery,
embedding,
response.answer(),
response.citations(),
docs.stream().map(RetrievedDoc::id).toList(),
System.currentTimeMillis()
);
String json = serialize(entry);
redisTemplate.opsForValue().set(key, json, CACHE_TTL);
redisTemplate.opsForSet().add(CACHE_INDEX, key);
}).subscribeOn(Schedulers.boundedElastic()).then();
}
}
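The SimilarityUtils.cosineSimilarity call used in the lookup above is the standard dot-product-over-norms formula; a minimal sketch (assuming equal-length, non-zero vectors) looks like this:
// Minimal cosine-similarity sketch backing the cache lookup above
public static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}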
Why Cosine Similarity Threshold 0.90?
This threshold balances precision and recall:
- Too low (e.g., 0.70): You'd match dissimilar queries, returning wrong cached answers
- Too high (e.g., 0.98): You'd miss legitimate matches, reducing cache hit rate
At 0.90, queries like:
- "How does autoscaling work?"
- "Explain the autoscaling mechanism"
- "What is the autoscaling strategy?"
...all match and reuse the same cached answer. But unrelated queries like "How do I ingest data?" won't match.
In load testing with realistic query distributions, 0.90 yields ~45% cache hit rate, cutting LLM costs nearly in half.
These numbers come from my tests and my use cases; don't rely on them blindly. Run your own load tests, as your results may well differ.
Deterministic Embeddings
For this demo, I'm using deterministic 8-dimensional embeddings generated via SHA-256 hashing:
// common/src/main/java/me/aboullaite/rag/common/embedding/DeterministicEmbedding.java
public class DeterministicEmbedding {
    public static double[] embed(String text) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            double[] embedding = new double[8];
            for (int i = 0; i < 8; i++) {
                embedding[i] = (hash[i] & 0xFF) / 255.0; // map each byte to [0, 1]
            }
            return normalize(embedding);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is guaranteed by the JCA spec, so this should never happen
            throw new IllegalStateException(e);
        }
    }
}
Why deterministic embeddings?
- No external embedding service dependency for the demo
- Reproducible cache behavior in tests
- Instant embedding computation (no API latency)
In production, we'd need to use proper sentence embeddings (e.g., all-MiniLM-L6-v2 via Hugging Face or OpenAI text-embedding-3-small). The cache logic remains identical; we just swap the embedding function.
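Because both the cache and the retriever only ever see a double[] per query, the swap is contained to one class. A hedged sketch of a production embedding client against an OpenAI-compatible /v1/embeddings endpoint (the class name, model name, and response handling are assumptions; it also returns a Mono, whereas the demo's embed() is synchronous):
// Hypothetical production embedding client; request/response shapes follow the OpenAI-compatible /v1/embeddings API
@Component
public class RemoteEmbeddingService {

    private final WebClient webClient; // pre-configured with the embedding provider's base URL

    public RemoteEmbeddingService(WebClient embeddingWebClient) {
        this.webClient = embeddingWebClient;
    }

    public Mono<double[]> embed(String text) {
        Map<String, Object> request = Map.of(
                "model", "text-embedding-3-small",  // model name assumed
                "input", text
        );
        return webClient.post()
                .uri("/v1/embeddings")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(JsonNode.class)
                .map(json -> {
                    // response shape: { "data": [ { "embedding": [ ... ] } ] }
                    JsonNode vector = json.path("data").path(0).path("embedding");
                    double[] embedding = new double[vector.size()];
                    for (int i = 0; i < vector.size(); i++) {
                        embedding[i] = vector.get(i).asDouble();
                    }
                    return embedding;
                });
    }
}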
Prompt Assembly and Citation Tracking
Once documents are retrieved, we need to construct a prompt that:
- Provides clear instructions to the LLM
- Injects retrieved context
- Enforces citation requirements
- Handles edge cases (no documents, partial results, etc.)
// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/prompt/PromptAssembler.java
@Component
public class PromptAssembler {
public PromptBundle assemble(String userPrompt, List<RetrievedDoc> docs) {
if (docs.isEmpty()) {
return new PromptBundle(
noContextPrompt(userPrompt),
List.of(),
List.of()
);
}
StringBuilder prompt = new StringBuilder();
prompt.append("You are a helpful assistant. Answer the question based ONLY on the provided documents.\n\n");
prompt.append("Documents:\n");
List<String> citations = new ArrayList<>();
List<CitationInfo> citationDetails = new ArrayList<>();
for (int i = 0; i < docs.size(); i++) {
RetrievedDoc doc = docs.get(i);
String citationId = doc.id();
citations.add(citationId);
citationDetails.add(new CitationInfo(
citationId,
doc.metadata().source(),
doc.metadata().section()
));
prompt.append(String.format("[%s] %s\n\n", citationId, doc.chunk()));
}
prompt.append("Question: ").append(userPrompt).append("\n\n");
prompt.append("Instructions:\n");
prompt.append("- Answer ONLY using information from the provided documents\n");
prompt.append("- Cite sources using [doc-id] notation\n");
prompt.append("- If the documents don't contain enough information, say 'I don't know'\n");
prompt.append("- Be concise and accurate\n\n");
prompt.append("Answer:");
return new PromptBundle(prompt.toString(), citations, citationDetails);
}
private String noContextPrompt(String userPrompt) {
return "You are a helpful assistant. The user asked: " + userPrompt +
"\n\nNo relevant documents were found. Please respond with: I don't know.";
}
}
Citation Enforcement
Notice the explicit instructions:
- "Answer ONLY using information from the provided documents"
- "Cite sources using [doc-id] notation"
- "If the documents don't contain enough information, say 'I don't know'"
LLMs are surprisingly good at following these instructions when they're clear and emphatic. In testing with Gemma-2-2B, citation compliance is >85% for well-formed prompts.
The PromptBundle record encapsulates:
public record PromptBundle(
String prompt, // Full prompt sent to LLM
List<String> citations, // [doc-03-autoscaling, doc-09-infrastructure]
List<CitationInfo> citationDetails // Full metadata for UI rendering
) {}
This separation allows the orchestrator to:
- Send a clean prompt to the LLM
- Return structured citations to the client
- Track which documents contributed to each answer (for cache invalidation, analytics, etc.)
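Because citations follow the [doc-id] notation, compliance can also be checked mechanically. This helper isn't part of the repo as shown; it's a small sketch that verifies a generated answer only cites documents that were actually in the prompt bundle:
// Hypothetical citation-compliance check: every [doc-id] in the answer must come from the prompt bundle
private static final Pattern CITATION_PATTERN = Pattern.compile("\\[([\\w-]+)]");

public static boolean citationsAreValid(String answer, List<String> allowedCitations) {
    Matcher matcher = CITATION_PATTERN.matcher(answer);
    while (matcher.find()) {
        if (!allowedCitations.contains(matcher.group(1))) {
            return false;  // the model cited a document it was never given
        }
    }
    return true;
}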
LLM Integration: KServe + vLLM
The LLM layer uses KServe (Kubernetes serving framework) with vLLM runtime to host Gemma-2-2B-it (instruction-tuned).
Why KServe?
KServe provides:
- Autoscaling: Scale-to-zero when idle, scale-up on demand
- GPU management: Automatic GPU resource allocation
- Inference optimization: vLLM uses PagedAttention for efficient memory usage
- Standardized API: OpenAI-compatible /v1/chat/completions endpoint
The InferenceService definition:
# deploy/kserve-vllm.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: rag-llm
namespace: rag
spec:
predictor:
minReplicas: 0 # Scale to zero when idle
maxReplicas: 1
scaleTarget: 1
scaleMetric: concurrency
model:
runtime: vllm-runtime
modelFormat:
name: huggingface
args:
- --model
- google/gemma-2-2b-it
- --dtype
- auto
- --max-model-len
- "4096"
resources:
limits:
nvidia.com/gpu: 1
cpu: "3"
memory: 12Gi
requests:
cpu: "2"
memory: 8Gi
vLLM Runtime Configuration
vLLM is a high-performance inference engine optimized for LLMs. Key features:
- PagedAttention: Reduces memory fragmentation, increases throughput
- Continuous batching: Processes multiple requests efficiently
- Quantization support: --dtype auto enables FP16/BF16 for faster inference
With Gemma-2-2B on an L4 GPU (24GB), vLLM achieves:
- Time-to-first-token (TTFT): ~100-300ms
- Throughput: ~50-80 tokens/sec
- Concurrent requests: 4-8 (depending on sequence length)
LLM Client Implementation
The orchestrator calls KServe via a reactive client:
// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/client/LlmClient.java
@Component
public class LlmClient {
private final WebClient webClient;
private final OrchestratorProperties properties;
public Mono<LlmResponse> generate(String prompt) {
Map<String, Object> request = Map.of(
"model", properties.getModelName(),
"messages", List.of(
Map.of("role", "user", "content", prompt)
),
"max_tokens", properties.getMaxTokens(),
"temperature", properties.getTemperature()
);
long startNano = System.nanoTime();
AtomicLong ttftNano = new AtomicLong(0);
return webClient.post()
.uri("/v1/chat/completions")
.bodyValue(request)
.retrieve()
.bodyToMono(Map.class)
            .map(response -> {
                // bodyToMono delivers the whole body at once, so for this non-streaming call
                // the "first token" time is effectively the total response time
                if (ttftNano.get() == 0) {
                    ttftNano.set(System.nanoTime() - startNano);
                }
String content = extractContent(response);
int tokens = estimateTokens(content);
long ttftMillis = ttftNano.get() / 1_000_000;
return new LlmResponse(content, tokens, ttftMillis);
})
.timeout(Duration.ofSeconds(properties.getGenerationTimeoutSeconds()));
}
}
Time-to-First-Token (TTFT) is a critical metric for user experience. Measuring it accurately requires:
- Start timer when request begins
- Capture timestamp on first response byte
- Calculate delta in milliseconds
This metric appears in OpenTelemetry spans as rag.ttft_ms, allowing us to track degradation trends in Grafana.
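The non-streaming call above can only approximate TTFT, because bodyToMono delivers the whole body at once. Measuring it precisely requires the streaming endpoint. Here's a hedged sketch of a streaming variant (the "stream": true field follows the OpenAI-compatible convention that vLLM supports; the method name and span parameter are my assumptions):
// Hypothetical streaming variant with true TTFT measurement ("stream": true, OpenAI-compatible endpoint)
public Flux<String> generateStreaming(String prompt, Span span) {
    Map<String, Object> request = Map.of(
            "model", properties.getModelName(),
            "messages", List.of(Map.of("role", "user", "content", prompt)),
            "stream", true,                            // ask vLLM for incremental SSE chunks
            "max_tokens", properties.getMaxTokens()
    );
    long startNano = System.nanoTime();
    AtomicLong ttftNano = new AtomicLong(0);
    return webClient.post()
            .uri("/v1/chat/completions")
            .bodyValue(request)
            .retrieve()
            .bodyToFlux(String.class)                  // each element is the data portion of one SSE event
            .doOnNext(chunk -> {
                // record time-to-first-token exactly once, when the first chunk arrives
                if (ttftNano.compareAndSet(0, System.nanoTime() - startNano)) {
                    span.setAttribute("rag.ttft_ms", ttftNano.get() / 1_000_000);
                }
            });
}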
Streaming Responses with Server-Sent Events
One of the best UX improvements in modern LLM applications is streaming. Instead of waiting 3+ seconds for the complete answer, users see tokens as they're generated.
SSE Endpoint
// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/web/AskController.java
@GetMapping(value = "/v1/ask/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ServerSentEvent<String>> askStream(
@RequestParam String prompt,
@RequestParam(required = false) Map<String, String> filters,
@RequestParam(required = false, defaultValue = "5") Integer topK) {
return askService.askStreaming(prompt, filters, topK)
.map(chunk -> {
if (chunk.isComplete()) {
return ServerSentEvent.<String>builder()
.event("complete")
.data(toJson(chunk))
.build();
} else {
return ServerSentEvent.<String>builder()
.event("token")
.data(chunk.token())
.build();
}
});
}
Event Types:
- token: Individual generated tokens (streamed progressively)
- complete: Final event containing citations and metadata
Client-side consumption (JavaScript):
const eventSource = new EventSource('/v1/ask/stream?prompt=How+does+caching+work');
eventSource.addEventListener('token', (event) => {
document.getElementById('answer').textContent += event.data;
});
eventSource.addEventListener('complete', (event) => {
const result = JSON.parse(event.data);
displayCitations(result.citations);
eventSource.close();
});
Progressive rendering dramatically improves perceived performance. Users engage with partial responses while generation continues, reducing perceived wait time by 50-70%.
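The askService.askStreaming method referenced in the controller isn't shown above. Conceptually it reuses the same retrieve-and-assemble pipeline and forwards LLM chunks as they arrive; a hedged sketch (record and method names assumed, caching and error fallbacks omitted) might look like:
// Hypothetical streaming chunk type and pipeline tail (names assumed; the real askStreaming lives in AskService)
public record AnswerChunk(String token, boolean isComplete, List<String> citations) {}

public Flux<AnswerChunk> askStreaming(String prompt, Map<String, String> filters, Integer topK) {
    Span span = tracer.spanBuilder("rag.ask.stream").startSpan();   // span name assumed
    return retrieverClient.retrieve(new Query(redact(prompt), filters, topK))
            .flatMapMany(docs -> {
                PromptBundle bundle = promptAssembler.assemble(prompt, docs);
                return llmClient.generateStreaming(bundle.prompt(), span)        // streaming variant sketched earlier
                        .map(token -> new AnswerChunk(token, false, List.of()))  // emitted as SSE "token" events
                        .concatWith(Mono.just(new AnswerChunk("", true, bundle.citations()))); // final "complete" event
            })
            .doFinally(signal -> span.end());
}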
Request Orchestration: Putting It All Together
Here's the core orchestration logic that ties everything together:
// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/service/AskService.java
public Mono<GenerationResponse> ask(String prompt, Map<String, String> filters, Integer topK) {
String sanitizedPrompt = redact(prompt); // PII redaction
double[] embedding = embeddingService.embed(sanitizedPrompt);
Span span = tracer.spanBuilder("rag.ask").startSpan();
Timer.Sample sample = Timer.start(meterRegistry);
return cacheService.lookup(sanitizedPrompt, embedding)
.flatMap(hit -> onCacheHit(hit, span))
.switchIfEmpty(Mono.defer(() -> {
cacheMissCounter.increment();
TracingUtils.recordCacheHit(span, false);
return generateWithRetrieval(sanitizedPrompt, filters, topK, embedding, span);
}))
.doOnError(span::recordException)
.doFinally(signalType -> {
sample.stop(askLatency);
span.end();
});
}
private Mono<GenerationResponse> generateWithRetrieval(
String sanitizedPrompt,
Map<String, String> filters,
Integer topK,
double[] embedding,
Span parentSpan) {
Query query = new Query(sanitizedPrompt, filters, topK);
return retrieverClient.retrieve(query)
.flatMap(docs -> produceAnswer(sanitizedPrompt, docs, embedding, parentSpan))
.switchIfEmpty(Mono.defer(() -> produceAnswer(sanitizedPrompt, List.of(), embedding, parentSpan)));
}
private Mono<GenerationResponse> produceAnswer(
String sanitizedPrompt,
List<RetrievedDoc> docs,
double[] embedding,
Span parentSpan) {
PromptBundle promptBundle = promptAssembler.assemble(sanitizedPrompt, docs);
return llmClient.generate(promptBundle.prompt())
.map(response -> toGenerationResponse(response, promptBundle, false, parentSpan))
.flatMap(response -> cacheService.put(sanitizedPrompt, embedding, response, docs)
.thenReturn(response))
.onErrorResume(ex -> {
log.warn("LLM call failed, using fallback: {}", ex.getMessage());
fallbackCounter.increment();
TracingUtils.recordFallback(parentSpan, ex.getClass().getSimpleName());
GenerationResponse fallback = fallbackResponse(docs, promptBundle.citationDetails());
return cacheService.put(sanitizedPrompt, embedding, fallback, docs)
.onErrorResume(e -> Mono.empty())
.thenReturn(fallback);
});
}
Reactive Composition Explained
The flow uses reactive operators to compose asynchronous operations:
1. cacheService.lookup(): Check cache (non-blocking I/O to Redis)
2. .flatMap(hit -> onCacheHit()): If cache hit, return immediately
3. .switchIfEmpty(Mono.defer(() -> ...)): If cache miss, proceed to retrieval
4. retrieverClient.retrieve(): Call retriever service (HTTP call)
5. .flatMap(docs -> produceAnswer()): Generate answer with retrieved docs
6. llmClient.generate(): Call LLM (HTTP streaming)
7. .flatMap(response -> cacheService.put()): Cache the result
8. .onErrorResume(ex -> fallbackResponse()): Graceful degradation on error
9. .doFinally(): Stop timer and close span (always executes)
This composition is non-blocking. No threads wait on I/O. Spring WebFlux dispatches work efficiently across an event loop, enabling high concurrency with minimal thread overhead.
Component Summary
Let's recap the key components and their roles:
| Component | Responsibility | Technology | Scaling Strategy |
|---|---|---|---|
| Orchestrator | Request coordination, caching, streaming | Spring WebFlux, Redis | Horizontal (stateless except cache) |
| Retriever | Hybrid search (vector + lexical) | Spring WebFlux, Weaviate, OpenSearch | HPA (CPU) + KEDA (RPS) |
| Semantic Cache | Similarity-based response caching | Redis, cosine similarity | Vertical (single instance for consistency) |
| Vector Store | Semantic document search | Weaviate | Managed/external service |
| Lexical Store | Fallback keyword search | OpenSearch | Managed/external service |
| LLM Serving | Model inference with GPU | KServe, vLLM, Gemma-2-2B | KServe autoscaling (scale-to-zero) |
| Observability | Metrics, traces, dashboards | Prometheus, Tempo, Grafana, OTEL | N/A (infrastructure) |
Why This Architecture Scales
The components here aren't just for demos; they're a good starting point for building production systems:
- Service Isolation: Retriever and orchestrator scale independently
- Reactive Programming: Non-blocking I/O maximizes throughput
- Timeouts Everywhere: Aggressive timeouts prevent cascading failures
- Graceful Degradation: Fallbacks at every layer (cache → retrieval → generation)
- Observability-First: Traces and metrics built into every code path
- Cost Awareness: Semantic caching reduces LLM spend by ~40-60%
When traffic spikes, the retriever scales out (2→30 replicas). When traffic drops, KServe scales the LLM to zero. When Weaviate slows down, OpenSearch takes over. When the LLM fails, deterministic fallbacks keep users informed.
Every point of failure has a fallback, keeping the experience resilient for users.
The complete code is available at github.com/aboullaite/rag-java-k8s.
Stay tuned for the final part, a Kubernetes deep dive.