Building RAG Update: Hybrid Search, Reranking & Production Hardening

When I published the original series in November 2025, I was happy with where the system landed. It had semantic caching, fallback strategies, distributed tracing, autoscaling, and solid production patterns throughout. But as I kept working with it and preparing a talk around the same material, I kept spotting areas where the system could go further.

Four months later, I finally made those improvements. This post covers what changed, why, and what I learned along the way.

Where We're Taking Things Further

The original system was functional and resilient, but there were natural next steps I'd been thinking about since day one:

  1. From fallback to fusion. The system had both Weaviate (vector) and OpenSearch (lexical), but OpenSearch only kicked in when Weaviate failed. The natural evolution: combine their results to get the best of both worlds, all the time.

  2. Adding a reranking step. Whatever the vector DB returned as top-5, that's what the LLM saw. Adding a second pass to pick the best candidates from a broader pool is a well-known quality boost.

  3. Making LLM parameters configurable. Temperature and max tokens were baked into the Java code. For experimentation and tuning, these should live in configuration.

  4. Upgrading the embedding model. all-MiniLM-L6-v2 served us well, but it's from 2021 and the field has moved fast. Time for a newer model.

  5. Tuning for real-world load. Some of the original timeout and caching values were optimized for local development. Under sustained traffic, they needed adjustment.

Let's walk through each one.

Change 1: Hybrid Search with Reciprocal Rank Fusion

This was the most impactful change. The insight is simple: vector search and keyword search fail in complementary ways.

Vector search excels at "what does this mean?" but struggles with exact terms. Ask about "SLA for the premium tier" and vector search finds documents about service guarantees and uptime commitments. That's conceptually right, but it might miss the document that literally contains the acronym "SLA."

Keyword search (BM25) does the opposite. It finds exact term matches but misses semantic connections.

The solution: run both in parallel and merge the results.

Implementation

The RetrieverService now runs Weaviate and OpenSearch concurrently using Mono.zip(), each with independent 500ms timeouts:

private Mono<List<RetrievedDoc>> executeHybridRetrieval(Query query, int topK, Span span) {
    hybridCounter.increment();

    Mono<List<RetrievedDoc>> vectorMono = weaviateGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -> {
                log.warn("Vector search failed in hybrid mode: {}", ex.getMessage());
                return Mono.just(List.of());
            });

    Mono<List<RetrievedDoc>> lexicalMono = openSearchGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -> {
                log.warn("Lexical search failed in hybrid mode: {}", ex.getMessage());
                return Mono.just(List.of());
            });

    return Mono.zip(vectorMono, lexicalMono)
            .map(tuple -> mergeWithRRF(tuple.getT1(), tuple.getT2(), topK));
}

A few things to note:

  • Both searches are independent. If one fails, the other still returns results. This is strictly better than the old fallback-only approach: we get hybrid quality when both work, and graceful degradation when one doesn't.
  • 500ms timeout each, not combined. Since they run in parallel, the total retrieval time is max(vector, lexical), not vector + lexical.
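The max-not-sum property is worth internalizing, so here's a minimal stdlib sketch of the same idea using CompletableFuture instead of Reactor (class and method names are mine, purely illustrative; the real service uses Mono.zip() as shown above):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelDemo {
    // Simulate a retriever call that takes roughly `ms` milliseconds.
    static String slowCall(String name, long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return name;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        long start = System.nanoTime();

        // Both calls are submitted before either is joined, so they run concurrently.
        CompletableFuture<String> vector = CompletableFuture.supplyAsync(() -> slowCall("vector", 300), pool);
        CompletableFuture<String> lexical = CompletableFuture.supplyAsync(() -> slowCall("lexical", 200), pool);

        // Joining waits for the slower of the two, not their sum.
        List<String> results = List.of(vector.join(), lexical.join());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(results + " after ~" + elapsedMs + "ms"); // roughly max(300, 200), not 500
        pool.shutdown();
    }
}
```

The same reasoning explains why each branch gets its own 500ms timeout: capping each leg independently caps the whole retrieval at ~500ms.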

Reciprocal Rank Fusion (RRF)

The merging algorithm is RRF, which is the industry standard for combining ranked lists from different sources:

private List<RetrievedDoc> mergeWithRRF(List<RetrievedDoc> vectorResults,
                                         List<RetrievedDoc> lexicalResults, int topK) {
    Map<String, Double> scores = new HashMap<>();
    Map<String, RetrievedDoc> docsByKey = new HashMap<>();

    for (int i = 0; i < vectorResults.size(); i++) {
        RetrievedDoc doc = vectorResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    for (int i = 0; i < lexicalResults.size(); i++) {
        RetrievedDoc doc = lexicalResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(entry -> {
                RetrievedDoc original = docsByKey.get(entry.getKey());
                return new RetrievedDoc(original.id(), original.chunk(),
                                        entry.getValue(), original.meta());
            })
            .collect(Collectors.toList());
}

RRF is elegant because it's rank-based, not score-based. We don't need to normalize scores across different systems (Weaviate's cosine distance and OpenSearch's BM25 scores live on completely different scales). The k=60 constant is standard and works well in practice.

Documents that appear high in both lists get the highest combined score. A document ranked #1 in vector and #3 in lexical will outscore one ranked #1 in vector but absent from lexical results, which is exactly what we want.
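To make the arithmetic concrete, here's a standalone sketch of the fusion step with hypothetical doc IDs (same 1/(k + rank) formula as mergeWithRRF above, with 0-indexed ranks; the class and method names are mine):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfDemo {
    static final int K = 60; // the standard RRF constant

    // Fuse two ranked doc-ID lists into RRF scores (i = 0 for the top hit).
    static Map<String, Double> fuse(List<String> vector, List<String> lexical) {
        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i < vector.size(); i++)
            scores.merge(vector.get(i), 1.0 / (K + i), Double::sum);
        for (int i = 0; i < lexical.size(); i++)
            scores.merge(lexical.get(i), 1.0 / (K + i), Double::sum);
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = fuse(
                List.of("docA", "docB", "docC"),   // vector ranking
                List.of("docC", "docD", "docA"));  // lexical ranking

        // docA (vector #1, lexical #3): 1/60 + 1/62 ≈ 0.0328 — in both lists, wins
        // docB (vector #2 only):        1/61        ≈ 0.0164
        scores.forEach((id, s) -> System.out.printf("%s = %.4f%n", id, s));
    }
}
```

Note how a document present in both lists roughly doubles its score relative to a same-ranked document present in only one.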

Feature flag

Hybrid search is togglable via configuration:

retriever:
  hybrid-enabled: ${HYBRID_ENABLED:true}

When disabled, the system falls back to the original behavior: Weaviate primary, OpenSearch on failure only. This was useful for A/B comparison during development.

Change 2: Reranking

Hybrid search gives us better candidates. Reranking picks the best candidates from that list.

The pattern is straightforward: retrieve broadly (top-20), rerank precisely (top-5), send only the best to the LLM. Initial retrieval is optimized for recall (don't miss relevant docs). Reranking is optimized for precision (only keep the most relevant).

In production, we'd use a cross-encoder model like BAAI/bge-reranker-v2-m3 or the Cohere Rerank API. For this demo, I implemented a lightweight reranker using cosine similarity between deterministic embeddings of the query and each chunk:

@Component
public class Reranker {

    private static final Logger log = LoggerFactory.getLogger(Reranker.class);

    private final Timer rerankLatency;

    public Reranker(MeterRegistry meterRegistry) {
        this.rerankLatency = Timer.builder("rag_rerank_latency")
                .description("Time spent reranking retrieved documents")
                .register(meterRegistry);
    }

    public Mono<List<RetrievedDoc>> rerank(String query, List<RetrievedDoc> candidates, int topK) {
        return Mono.fromCallable(() -> {
            Timer.Sample sample = Timer.start();
            try {
                double[] queryEmbedding = DeterministicEmbedding.embed(query);

                List<RetrievedDoc> reranked = candidates.stream()
                        .map(doc -> {
                            double[] chunkEmbedding = DeterministicEmbedding.embed(doc.chunk());
                            double similarity = cosineSimilarity(queryEmbedding, chunkEmbedding);
                            return new RetrievedDoc(doc.id(), doc.chunk(), similarity, doc.meta());
                        })
                        .sorted(Comparator.comparingDouble(RetrievedDoc::score).reversed())
                        .limit(topK)
                        .collect(Collectors.toList());

                log.debug("Reranked {} candidates down to {}", candidates.size(), reranked.size());
                return reranked;
            } finally {
                sample.stop(rerankLatency);
            }
        });
    }

    private static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Small epsilon guards against division by zero on degenerate vectors
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-10);
    }
}

The integration into RetrieverService is clean; reranking simply wraps the retrieval result:

int fetchK = properties.isRerankEnabled() ? properties.getRetrieveK() : topK;

Mono<List<RetrievedDoc>> retrieval;
if (properties.isHybridEnabled() && openSearchGateway.isEnabled()) {
    retrieval = executeHybridRetrieval(query, fetchK, span);
} else {
    retrieval = executeSingleSourceRetrieval(query, fetchK, span);
}

if (properties.isRerankEnabled()) {
    retrieval = retrieval.flatMap(docs -> reranker.rerank(query.text(), docs, topK));
}

When reranking is enabled, we fetch retrieveK (default 20) candidates instead of the final topK (default 5), then let the reranker narrow down. This gives the reranker a wider pool to work with.

Like hybrid search, reranking is feature-flagged via rerank-enabled in the config.

Change 3: Embedding Model Upgrade

all-MiniLM-L6-v2 has been a workhorse since 2021. It scores ~63 on the MTEB benchmark. Its bigger sibling, all-MiniLM-L12-v2, scores higher while keeping the same 384 dimensions, making it a drop-in upgrade.

The change is a single line in deploy/weaviate.yaml:

# Before
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2

# After
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L12-v2

Memory limit bumped from 2Gi to 3Gi to accommodate the larger model. Because the dimensionality is unchanged, the Weaviate schema stays the same, but we do need to re-ingest all documents since the embeddings themselves will differ (make ingest).
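One cheap safety net worth adding around a model swap is asserting the embedding dimensionality before writing to the vector store. A hypothetical guard (not in the repo, and assuming embeddings arrive as double[]):

```java
public class EmbeddingGuard {
    // Both all-MiniLM-L6-v2 and all-MiniLM-L12-v2 emit 384-dim vectors.
    static final int EXPECTED_DIMS = 384;

    // Fail fast if the vectorizer was swapped for a model with a different width.
    static double[] validated(double[] embedding) {
        if (embedding.length != EXPECTED_DIMS) {
            throw new IllegalStateException(
                    "Embedding has " + embedding.length + " dims, expected " + EXPECTED_DIMS
                    + " - did the vectorizer model change?");
        }
        return embedding;
    }

    public static void main(String[] args) {
        System.out.println(validated(new double[384]).length); // prints 384
    }
}
```

A mismatched model (say, a 1024-dim e5-large-v2) then fails loudly at ingest time instead of silently corrupting the index.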

For production systems, I'd recommend going further: intfloat/e5-large-v2 (1024 dims) or BAAI/bge-large-en-v1.5 score 75-76 on MTEB. But those require schema changes, more memory, and larger storage. The L6→L12 swap was the highest ROI for this demo.

Change 4: LLM Client Improvements

Two small but useful refinements:

Configurable temperature and max tokens

Previously these were hardcoded in the Java source:

// Before
.put("temperature", 0.7)
.put("max_tokens", 512);

// After
.put("temperature", properties.getTemperature())
.put("max_tokens", properties.getMaxTokens());

Now driven by application.yaml:

rag:
  temperature: ${LLM_TEMPERATURE:0.7}
  max-tokens: ${LLM_MAX_TOKENS:512}

Small change, but it means we can tune generation behavior via ConfigMap without redeploying. Handy for experimenting with different temperature values across environments.

Better token counting

The original code counted tokens by splitting on whitespace:

// Before: counts words, not tokens
int tokens = Math.max(1, text.split("\\s+").length);

// After: rough approximation (1 token ≈ 4 characters)
int tokens = Math.max(1, text.length() / 4);

Neither is perfect without a proper tokenizer, but length / 4 is much closer to reality for English text. This feeds into the cost estimation metrics on the observability dashboard, so getting it roughly right matters.
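For a side-by-side feel of the two heuristics, here's a small standalone sketch (class and method names are mine, not the repo's):

```java
public class TokenEstimate {
    // Old heuristic: count whitespace-delimited words.
    static int byWords(String text) {
        return Math.max(1, text.split("\\s+").length);
    }

    // New heuristic: roughly 4 characters per token for English text.
    static int byChars(String text) {
        return Math.max(1, text.length() / 4);
    }

    public static void main(String[] args) {
        String text = "Reciprocal Rank Fusion combines rankings from heterogeneous retrievers.";
        System.out.println("byWords: " + byWords(text)); // 8
        System.out.println("byChars: " + byChars(text)); // 17
    }
}
```

Long technical words like "heterogeneous" split into several subword tokens, which is exactly where the word-count heuristic undercounts and the character heuristic lands closer.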

Change 5: Production Tuning

Timeouts and thresholds

The original values were tuned for local development. Under sustained load, some of them needed breathing room:

Setting                     Before          After            Why
Retrieval timeout           250ms           500ms            Reduced unnecessary fallbacks under load
LLM generation timeout      1,800ms         5,000ms          Cold models and complex prompts need headroom
Cache similarity threshold  0.90            0.87             More cache hits, still precise enough
Cache TTL                   600s (10 min)   3,600s (1 hour)  RAG docs don't change that often

The retrieval timeout change alone reduced the fallback rate from ~15% under load to ~3%. That's a meaningful quality improvement: every unnecessary fallback means the user gets lexical-only results instead of hybrid.

The Updated Retrieval Flow

Here's how the retrieval pipeline looks now, end to end:

User query arrives at Retriever
    │
    ├── Hybrid enabled?
    │     YES → Run Weaviate + OpenSearch in parallel (500ms each)
    │           → Merge results with RRF (k=60)
    │     NO  → Run Weaviate only (500ms timeout)
    │           → On failure, fallback to OpenSearch
    │
    ├── Rerank enabled?
    │     YES → Take top-20 candidates
    │           → Rerank by cosine similarity
    │           → Return top-5
    │     NO  → Return top-5 directly
    │
    └── Return to Orchestrator

Every step is independently toggleable via configuration, instrumented with Prometheus metrics (rag_retrieval_hybrid_total, rag_rerank_latency), and traced with OpenTelemetry spans.

What's Next

These changes addressed the most impactful improvements. The system is meaningfully better, but there's more on the roadmap:

  • A proper cross-encoder reranker. The cosine similarity reranker is a stand-in. A real cross-encoder (bge-reranker-v2-m3) would give much better precision, at the cost of ~80ms latency and an additional inference sidecar.

  • Query routing. Not every question needs RAG. The next architectural evolution is a router agent that decides per query whether to use the cache, call a tool, run the RAG pipeline, or just let the LLM answer from its training data.

  • Better embedding model. all-MiniLM-L12-v2 is better than L6, but models like intfloat/e5-large-v2 or BAAI/bge-large-en-v1.5 would be a step change in retrieval quality.

  • Contextual retrieval. Anthropic's technique of prepending chunk-specific context before embedding (e.g., "This chunk is from the autoscaling documentation") reduces retrieval failures by up to 67%. That's a significant number worth exploring.
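To make the contextual retrieval idea concrete, here's a hedged sketch of what the ingest-time step could look like (the helper and its signature are hypothetical, not code from the repo):

```java
public class ContextualChunk {
    // Prepend a short, chunk-specific context line before embedding, so the
    // resulting vector also captures where the chunk came from.
    static String contextualize(String docTitle, String section, String chunk) {
        return "This chunk is from the " + section + " section of \"" + docTitle + "\".\n" + chunk;
    }

    public static void main(String[] args) {
        String enriched = contextualize("RAG on Kubernetes", "autoscaling",
                "The HPA scales the retriever deployment on CPU and request latency.");
        System.out.println(enriched);
    }
}
```

In a real pipeline the context line would typically be generated by an LLM per chunk rather than templated, but the embedding step is the same: embed the enriched text, store the original chunk.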

The full code is at github.com/aboullaite/rag-java-k8s. Deploy locally with make dev-up && make build && make deploy && make ingest and try it yourself.