<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Laytoun' thoughts!]]></title><description><![CDATA[Failure sucks but instructs!]]></description><link>https://aboullaite.me/</link><image><url>https://aboullaite.me/favicon.png</url><title>Laytoun&apos; thoughts!</title><link>https://aboullaite.me/</link></image><generator>Ghost 5.47</generator><lastBuildDate>Mon, 06 Apr 2026 10:35:17 GMT</lastBuildDate><atom:link href="https://aboullaite.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Building RAG Update: Hybrid Search, Reranking & Production Hardening]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When I published the original series in November 2025, I was happy with where the system landed. It had semantic caching, fallback strategies, distributed tracing, autoscaling, solid production patterns throughout. But as I kept working with it and preparing a talk around the same material, I kept spotting areas where</p>]]></description><link>https://aboullaite.me/rag-revisited-2026/</link><guid isPermaLink="false">69bfad8096cd710001aeb6e5</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:14:17 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2026/03/blog-cover-rag-update.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2026/03/blog-cover-rag-update.jpg" alt="Building RAG Update: Hybrid Search, Reranking &amp; Production Hardening"><p>When I published the original series in November 2025, I was happy with where the system landed. It had semantic caching, fallback strategies, distributed tracing, autoscaling, solid production patterns throughout. 
But as I kept working with it and preparing a talk around the same material, I kept spotting areas where the system could go further.</p>
<p>Four months later, I finally made those improvements. This post covers what changed, why, and what I learned along the way.</p>
<h2 id="where-were-taking-things-further">Where We&apos;re Taking Things Further</h2>
<p>The original system was functional and resilient, but there were natural next steps I&apos;d been thinking about since day one:</p>
<ol>
<li>
<p><strong>From fallback to fusion.</strong> The system had both Weaviate (vector) and OpenSearch (lexical), but OpenSearch only kicked in when Weaviate <em>failed</em>. The natural evolution: combine their results to get the best of both worlds, all the time.</p>
</li>
<li>
<p><strong>Adding a reranking step.</strong> Whatever the vector DB returned as top-5, that&apos;s what the LLM saw. Adding a second pass to pick the <em>best</em> candidates from a broader pool is a well-known quality boost.</p>
</li>
<li>
<p><strong>Making LLM parameters configurable.</strong> Temperature and max tokens were baked into the Java code. For experimentation and tuning, these should live in configuration.</p>
</li>
<li>
<p><strong>Upgrading the embedding model.</strong> <code>all-MiniLM-L6-v2</code> served us well, but it&apos;s from 2021 and the field has moved fast. Time for a newer model.</p>
</li>
<li>
<p><strong>Tuning for real-world load.</strong> Some of the original timeout and caching values were optimized for local development. Under sustained traffic, they needed adjustment.</p>
</li>
</ol>
<p>Let&apos;s walk through each one.</p>
<h2 id="change-1-hybrid-search-with-reciprocal-rank-fusion">Change 1: Hybrid Search with Reciprocal Rank Fusion</h2>
<p>This was the most impactful change. The insight is simple: vector search and keyword search fail in complementary ways.</p>
<p><strong>Vector search</strong> excels at &quot;what does this mean?&quot; but struggles with exact terms. Ask about &quot;SLA for the premium tier&quot; and vector search finds documents about service guarantees and uptime commitments. That&apos;s conceptually right, but it might miss the document that literally contains the acronym &quot;SLA.&quot;</p>
<p><strong>Keyword search (BM25)</strong> does the opposite. It finds exact term matches but misses semantic connections.</p>
<p>The solution: run both in parallel and merge the results.</p>
<h3 id="implementation">Implementation</h3>
<p>The <code>RetrieverService</code> now runs Weaviate and OpenSearch concurrently using <code>Mono.zip()</code>, each with independent 500ms timeouts:</p>
<pre><code class="language-java">private Mono&lt;List&lt;RetrievedDoc&gt;&gt; executeHybridRetrieval(Query query, int topK, Span span) {
    hybridCounter.increment();

    Mono&lt;List&lt;RetrievedDoc&gt;&gt; vectorMono = weaviateGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;Vector search failed in hybrid mode: {}&quot;, ex.getMessage());
                return Mono.just(List.of());
            });

    Mono&lt;List&lt;RetrievedDoc&gt;&gt; lexicalMono = openSearchGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;Lexical search failed in hybrid mode: {}&quot;, ex.getMessage());
                return Mono.just(List.of());
            });

    return Mono.zip(vectorMono, lexicalMono)
            .map(tuple -&gt; mergeWithRRF(tuple.getT1(), tuple.getT2(), topK));
}
</code></pre>
<p>A few things to note:</p>
<ul>
<li><strong>Both searches are independent.</strong> If one fails, the other still returns results. This is strictly better than the old fallback-only approach: we get hybrid quality when both work, and graceful degradation when one doesn&apos;t.</li>
<li><strong>500ms timeout each, not combined.</strong> Since they run in parallel, the total retrieval time is <code>max(vector, lexical)</code>, not <code>vector + lexical</code>.</li>
</ul>
<h3 id="reciprocal-rank-fusion-rrf">Reciprocal Rank Fusion (RRF)</h3>
<p>The merging algorithm is RRF, which is the industry standard for combining ranked lists from different sources:</p>
<pre><code class="language-java">private List&lt;RetrievedDoc&gt; mergeWithRRF(List&lt;RetrievedDoc&gt; vectorResults,
                                         List&lt;RetrievedDoc&gt; lexicalResults, int topK) {
    Map&lt;String, Double&gt; scores = new HashMap&lt;&gt;();
    Map&lt;String, RetrievedDoc&gt; docsByKey = new HashMap&lt;&gt;();

    for (int i = 0; i &lt; vectorResults.size(); i++) {
        RetrievedDoc doc = vectorResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    for (int i = 0; i &lt; lexicalResults.size(); i++) {
        RetrievedDoc doc = lexicalResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    return scores.entrySet().stream()
            .sorted(Map.Entry.&lt;String, Double&gt;comparingByValue().reversed())
            .limit(topK)
            .map(entry -&gt; {
                RetrievedDoc original = docsByKey.get(entry.getKey());
                return new RetrievedDoc(original.id(), original.chunk(),
                                        entry.getValue(), original.meta());
            })
            .collect(Collectors.toList());
}
</code></pre>
<p>RRF is elegant because it&apos;s rank-based, not score-based. We don&apos;t need to normalize scores across different systems (Weaviate&apos;s cosine distance and OpenSearch&apos;s BM25 scores live on completely different scales). The <code>k=60</code> constant is standard and works well in practice.</p>
<p>Documents that appear high in <em>both</em> lists get the highest combined score. A document ranked #1 in vector and #3 in lexical will outscore one ranked #1 in vector but absent from lexical results, which is exactly what we want.</p>
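<p>To make the fusion arithmetic concrete, here&apos;s a minimal standalone sketch of the RRF formula with <code>k=60</code> (class and variable names are illustrative, not from the service code):</p>
<pre><code class="language-java">import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfDemo {

    static final int RRF_K = 60;

    // Fuse two ranked lists of document keys: rank i (0-based) contributes 1 / (k + i)
    static Map&lt;String, Double&gt; rrf(List&lt;String&gt; vectorRanked, List&lt;String&gt; lexicalRanked) {
        Map&lt;String, Double&gt; scores = new HashMap&lt;&gt;();
        for (int i = 0; i &lt; vectorRanked.size(); i++) {
            scores.merge(vectorRanked.get(i), 1.0 / (RRF_K + i), Double::sum);
        }
        for (int i = 0; i &lt; lexicalRanked.size(); i++) {
            scores.merge(lexicalRanked.get(i), 1.0 / (RRF_K + i), Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        // docA is #1 in vector and #3 in lexical; docB is #1 in lexical only
        Map&lt;String, Double&gt; scores = rrf(List.of(&quot;docA&quot;), List.of(&quot;docB&quot;, &quot;docC&quot;, &quot;docA&quot;));
        // docA: 1/60 + 1/62, docB: 1/60, so docA wins -- prints true
        System.out.println(scores.get(&quot;docA&quot;) &gt; scores.get(&quot;docB&quot;));
    }
}
</code></pre>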
<h3 id="feature-flag">Feature flag</h3>
<p>Hybrid search is togglable via configuration:</p>
<pre><code class="language-yaml">retriever:
  hybrid-enabled: ${HYBRID_ENABLED:true}
</code></pre>
<p>When disabled, the system falls back to the original behavior: Weaviate primary, OpenSearch on failure only. This was useful for A/B comparison during development.</p>
<h2 id="change-2-reranking">Change 2: Reranking</h2>
<p>Hybrid search gives us better <em>candidates</em>. Reranking picks the <em>best</em> candidates from that list.</p>
<p>The pattern is straightforward: retrieve broadly (top-20), rerank precisely (top-5), send only the best to the LLM. Initial retrieval is optimized for recall (don&apos;t miss relevant docs). Reranking is optimized for precision (only keep the most relevant).</p>
<p>In production, we&apos;d use a cross-encoder model like <code>BAAI/bge-reranker-v2-m3</code> or the Cohere Rerank API. For this demo, I implemented a lightweight reranker using cosine similarity between deterministic embeddings of the query and each chunk:</p>
<pre><code class="language-java">@Component
public class Reranker {

    private static final Logger log = LoggerFactory.getLogger(Reranker.class);
    private final Timer rerankLatency;

    public Reranker(MeterRegistry meterRegistry) {
        this.rerankLatency = Timer.builder(&quot;rag_rerank_latency&quot;)
                .description(&quot;Time spent reranking retrieved documents&quot;)
                .register(meterRegistry);
    }

    public Mono&lt;List&lt;RetrievedDoc&gt;&gt; rerank(String query, List&lt;RetrievedDoc&gt; candidates, int topK) {
        return Mono.fromCallable(() -&gt; {
            Timer.Sample sample = Timer.start();
            try {
                double[] queryEmbedding = DeterministicEmbedding.embed(query);

                List&lt;RetrievedDoc&gt; reranked = candidates.stream()
                        .map(doc -&gt; {
                            double[] chunkEmbedding = DeterministicEmbedding.embed(doc.chunk());
                            double similarity = cosineSimilarity(queryEmbedding, chunkEmbedding);
                            return new RetrievedDoc(doc.id(), doc.chunk(), similarity, doc.meta());
                        })
                        .sorted(Comparator.comparingDouble(RetrievedDoc::score).reversed())
                        .limit(topK)
                        .collect(Collectors.toList());

                log.debug(&quot;Reranked {} candidates down to {}&quot;, candidates.size(), reranked.size());
                return reranked;
            } finally {
                sample.stop(rerankLatency);
            }
        });
    }
}
</code></pre>
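<p>The <code>cosineSimilarity</code> helper isn&apos;t shown above; a straightforward standalone version (my sketch, the repo&apos;s implementation may differ) looks like this:</p>
<pre><code class="language-java">public class CosineSimilarity {

    // dot(a, b) / (|a| * |b|); assumes equal-length vectors
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i &lt; a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // guard against zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        double[] zero = {0.0, 0.0, 0.0};
        System.out.println(cosineSimilarity(v, v));    // ~1.0 (identical direction)
        System.out.println(cosineSimilarity(v, zero)); // 0.0 (guard case)
    }
}
</code></pre>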
<p>The integration into <code>RetrieverService</code> is clean; reranking simply wraps the retrieval result:</p>
<pre><code class="language-java">int fetchK = properties.isRerankEnabled() ? properties.getRetrieveK() : topK;

Mono&lt;List&lt;RetrievedDoc&gt;&gt; retrieval;
if (properties.isHybridEnabled() &amp;&amp; openSearchGateway.isEnabled()) {
    retrieval = executeHybridRetrieval(query, fetchK, span);
} else {
    retrieval = executeSingleSourceRetrieval(query, fetchK, span);
}

if (properties.isRerankEnabled()) {
    retrieval = retrieval.flatMap(docs -&gt; reranker.rerank(query.text(), docs, topK));
}
</code></pre>
<p>When reranking is enabled, we fetch <code>retrieveK</code> (default 20) candidates instead of the final <code>topK</code> (default 5), then let the reranker narrow down. This gives the reranker a wider pool to work with.</p>
<p>Like hybrid search, reranking is feature-flagged via <code>rerank-enabled</code> in the config.</p>
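<p>Put together, the retrieval-related knobs look roughly like this (the property names for the rerank pool are my guess from the getters above; check the repo&apos;s <code>application.yaml</code> for the exact keys):</p>
<pre><code class="language-yaml">retriever:
  hybrid-enabled: ${HYBRID_ENABLED:true}
  rerank-enabled: ${RERANK_ENABLED:true}
  retrieve-k: ${RETRIEVE_K:20}   # candidate pool fetched when reranking
  top-k: ${TOP_K:5}              # final documents sent to the LLM
</code></pre>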
<h2 id="change-3-embedding-model-upgrade">Change 3: Embedding Model Upgrade</h2>
<p><code>all-MiniLM-L6-v2</code> has been a workhorse since 2021. It scores ~63 on the MTEB benchmark. Its bigger sibling, <code>all-MiniLM-L12-v2</code>, scores higher while keeping the same 384 dimensions, making it a drop-in upgrade.</p>
<p>The change is a single line in <code>deploy/weaviate.yaml</code>:</p>
<pre><code class="language-yaml"># Before
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2

# After
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L12-v2
</code></pre>
<p>Memory limit bumped from 2Gi to 3Gi to accommodate the larger model. Same dimensions means the Weaviate schema doesn&apos;t change, but we do need to re-ingest all documents since the embeddings will be different (<code>make ingest</code>).</p>
<p>For production systems, I&apos;d recommend going further: <code>intfloat/e5-large-v2</code> (1024 dims) or <code>BAAI/bge-large-en-v1.5</code> score 75-76 on MTEB. But those require schema changes, more memory, and larger storage. The L6&#x2192;L12 swap was the highest ROI for this demo.</p>
<h2 id="change-4-llm-client-improvements">Change 4: LLM Client Improvements</h2>
<p>Two small but useful refinements:</p>
<h3 id="configurable-temperature-and-max-tokens">Configurable temperature and max tokens</h3>
<p>Previously these were hardcoded in the Java source:</p>
<pre><code class="language-java">// Before
.put(&quot;temperature&quot;, 0.7)
.put(&quot;max_tokens&quot;, 512);

// After
.put(&quot;temperature&quot;, properties.getTemperature())
.put(&quot;max_tokens&quot;, properties.getMaxTokens());
</code></pre>
<p>Now driven by <code>application.yaml</code>:</p>
<pre><code class="language-yaml">rag:
  temperature: ${LLM_TEMPERATURE:0.7}
  max-tokens: ${LLM_MAX_TOKENS:512}
</code></pre>
<p>Small change, but it means we can tune generation behavior via ConfigMap without redeploying. Handy for experimenting with different temperature values across environments.</p>
<h3 id="better-token-counting">Better token counting</h3>
<p>The original code counted tokens by splitting on whitespace:</p>
<pre><code class="language-java">// Before: counts words, not tokens
int tokens = Math.max(1, text.split(&quot;\\s+&quot;).length);

// After: rough approximation: 1 token &#x2248; 4 characters
int tokens = Math.max(1, text.length() / 4);
</code></pre>
<p>Neither is perfect without a proper tokenizer, but <code>length / 4</code> is much closer to reality for English text. This feeds into the cost estimation metrics on the observability dashboard, so getting it roughly right matters.</p>
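<p>A quick standalone comparison of the two heuristics on a short English sentence (illustrative only; neither matches a real BPE tokenizer exactly):</p>
<pre><code class="language-java">public class TokenCountDemo {

    public static void main(String[] args) {
        String text = &quot;Reciprocal rank fusion merges ranked lists without score normalization.&quot;;

        // Old heuristic: whitespace-separated words
        int byWords = Math.max(1, text.split(&quot;\\s+&quot;).length);
        // New heuristic: ~4 characters per token for English text
        int byChars = Math.max(1, text.length() / 4);

        System.out.println(byWords + &quot; vs &quot; + byChars); // prints &quot;9 vs 17&quot;
    }
}
</code></pre>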
<h2 id="change-5-production-tuning">Change 5: Production Tuning</h2>
<h3 id="timeouts-and-thresholds">Timeouts and thresholds</h3>
<p>The original values were tuned for local development. Under sustained load, some of them needed breathing room:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Before</th>
<th>After</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Retrieval timeout</td>
<td>250ms</td>
<td>500ms</td>
<td>Reduced unnecessary fallbacks under load</td>
</tr>
<tr>
<td>LLM generation timeout</td>
<td>1,800ms</td>
<td>5,000ms</td>
<td>Cold models and complex prompts need headroom</td>
</tr>
<tr>
<td>Cache similarity threshold</td>
<td>0.90</td>
<td>0.87</td>
<td>More cache hits, still precise enough</td>
</tr>
<tr>
<td>Cache TTL</td>
<td>600s (10 min)</td>
<td>3,600s (1 hour)</td>
<td>RAG docs don&apos;t change that often</td>
</tr>
</tbody>
</table>
<p>The retrieval timeout change alone reduced the fallback rate from ~15% under load to ~3%. That&apos;s a meaningful quality improvement: every unnecessary fallback means the user gets lexical-only results instead of hybrid ones.</p>
<h2 id="the-updated-retrieval-flow">The Updated Retrieval Flow</h2>
<p>Here&apos;s how the retrieval pipeline looks now, end to end:</p>
<pre><code>User query arrives at Retriever
    &#x2502;
    &#x251C;&#x2500;&#x2500; Hybrid enabled?
    &#x2502;     YES &#x2192; Run Weaviate + OpenSearch in parallel (500ms each)
    &#x2502;           &#x2192; Merge results with RRF (k=60)
    &#x2502;     NO  &#x2192; Run Weaviate only (500ms timeout)
    &#x2502;           &#x2192; On failure, fallback to OpenSearch
    &#x2502;
    &#x251C;&#x2500;&#x2500; Rerank enabled?
    &#x2502;     YES &#x2192; Take top-20 candidates
    &#x2502;           &#x2192; Rerank by cosine similarity
    &#x2502;           &#x2192; Return top-5
    &#x2502;     NO  &#x2192; Return top-5 directly
    &#x2502;
    &#x2514;&#x2500;&#x2500; Return to Orchestrator
</code></pre>
<p>Every step is independently toggleable via configuration, instrumented with Prometheus metrics (<code>rag_retrieval_hybrid_total</code>, <code>rag_rerank_latency</code>), and traced with OpenTelemetry spans.</p>
<h2 id="whats-next">What&apos;s Next</h2>
<p>These changes addressed the most impactful improvements. The system is meaningfully better, but there&apos;s more on the roadmap:</p>
<ul>
<li>
<p><strong>A proper cross-encoder reranker.</strong> The cosine similarity reranker is a stand-in. A real cross-encoder (<code>bge-reranker-v2-m3</code>) would give much better precision, at the cost of ~80ms latency and an additional inference sidecar.</p>
</li>
<li>
<p><strong>Query routing.</strong> Not every question needs RAG. A router agent that decides per query whether to use the cache, call a tool, run the RAG pipeline, or just let the LLM answer from its training data is the next architectural evolution.</p>
</li>
<li>
<p><strong>Better embedding model.</strong> <code>all-MiniLM-L12-v2</code> is better than L6, but models like <code>intfloat/e5-large-v2</code> or <code>BAAI/bge-large-en-v1.5</code> would be a step change in retrieval quality.</p>
</li>
<li>
<p><strong>Contextual retrieval.</strong> Anthropic&apos;s technique of prepending chunk-specific context before embedding (e.g., &quot;This chunk is from the autoscaling documentation&quot;) reduces retrieval failures by up to 67%. That&apos;s a significant number worth exploring.</p>
</li>
</ul>
<p>The full code is at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>. Deploy locally with <code>make dev-up &amp;&amp; make build &amp;&amp; make deploy &amp;&amp; make ingest</code> and try it yourself.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Kubernetes, Autoscaling & LLMs]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We finally got the first snowfall of the season this week in Stockholm. The weather is getting colder and the days shorter. That only motivates me to continue writing my third and final post in the RAG series.<br>
In <a href="https://aboullaite.me/production-rag-java-k8s-part1/">part one</a>, we explored the production challenges of RAG systems. In <a href="https://aboullaite.me/production-rag-java-k8s-part2/">part two</a>, we</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part3/</link><guid isPermaLink="false">6921f4d296cd710001aeb664</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[LLM]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sat, 22 Nov 2025 18:13:19 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_glsj04glsj04glsj.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_glsj04glsj04glsj.png" alt="Building Production-Grade RAG Systems: Kubernetes, Autoscaling &amp; LLMs"><p>We finally got the first snowfall of the season this week in Stockholm. The weather is getting colder and the days shorter. That only motivates me to continue writing my third and final post in the RAG series.<br>
In <a href="https://aboullaite.me/production-rag-java-k8s-part1/">part one</a>, we explored the production challenges of RAG systems. In <a href="https://aboullaite.me/production-rag-java-k8s-part2/">part two</a>, we dove deep into the architecture and component design. Now let&apos;s talk about the elephant in the room: <strong>why Kubernetes?</strong></p>
<h2 id="the-llm-infrastructure-problem">The LLM Infrastructure Problem</h2>
<p>LLM applications have unique operational requirements that traditional web applications don&apos;t face:</p>
<h3 id="1-gpu-resource-management">1. GPU Resource Management</h3>
<p>Running models like Gemma-2-2B requires GPUs. Not just any GPUs&#x2014;specific GPU types (L4, A100, H100) with minimum VRAM requirements. You need:</p>
<ul>
<li><strong>Dynamic allocation</strong>: Spin up GPU nodes when needed, tear down when idle</li>
<li><strong>Multi-tenancy</strong>: Share expensive GPUs across multiple services (when possible)</li>
<li><strong>Isolation</strong>: Ensure one model&apos;s OOM crash doesn&apos;t kill other workloads</li>
<li><strong>Scheduling</strong>: Route inference requests to GPU-backed pods automatically</li>
</ul>
<p>Kubernetes ticks all the boxes.</p>
<h3 id="2-heterogeneous-scaling">2. Heterogeneous Scaling</h3>
<p>Our RAG pipeline has components with radically different scaling profiles:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Scaling Trigger</th>
<th>Resource Type</th>
<th>Scale Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Retriever</td>
<td>CPU + Request rate</td>
<td>CPU-intensive</td>
<td>2-30 replicas</td>
</tr>
<tr>
<td>Orchestrator</td>
<td>Connection count</td>
<td>I/O-bound</td>
<td>2-10 replicas</td>
</tr>
<tr>
<td>LLM (KServe)</td>
<td>Inference load</td>
<td>GPU-bound</td>
<td>0-3 replicas (scale-to-zero)</td>
</tr>
<tr>
<td>Vector DB</td>
<td>Query volume</td>
<td>Memory + I/O</td>
<td>Managed/external</td>
</tr>
<tr>
<td>Cache (Redis)</td>
<td>Memory usage</td>
<td>Memory-bound</td>
<td>1 replica (stateful)</td>
</tr>
</tbody>
</table>
<p>Each component needs independent scaling logic. The retriever might scale out to 30 replicas during a traffic spike while the orchestrator stays at 2 replicas. The LLM should scale to zero when idle (saving $$$ on GPU costs) but warm up quickly when requests arrive.</p>
<p>Kubernetes HPA, KEDA, and KServe give you fine-grained control over each layer.</p>
<h3 id="3-network-complexity">3. Network Complexity</h3>
<p>Our RAG pipelines involve complex service-to-service communication:</p>
<ul>
<li>Orchestrator &#x2192; Retriever (HTTP)</li>
<li>Retriever &#x2192; Weaviate (gRPC)</li>
<li>Retriever &#x2192; OpenSearch (REST)</li>
<li>Orchestrator &#x2192; KServe (HTTP with streaming)</li>
<li>Orchestrator &#x2192; Redis (TCP)</li>
<li>All services &#x2192; OTEL Collector (gRPC)</li>
<li>All services &#x2192; Prometheus (HTTP scraping)</li>
</ul>
<p>We need:</p>
<ul>
<li>Service discovery (how does the orchestrator find the retriever?)</li>
<li>Load balancing (distribute requests across retriever replicas)</li>
<li>Retry logic (retry failed requests with backoff)</li>
<li>Circuit breaking (stop calling unhealthy services)</li>
<li>Observability (trace requests across service boundaries)</li>
</ul>
<p>Kubernetes Services, Ingress, and service meshes (Istio, Linkerd) handle this out of the box.</p>
<h3 id="4-deployment-complexity">4. Deployment Complexity</h3>
<p>Production deployments require:</p>
<ul>
<li><strong>Canary releases</strong>: Route 10% of traffic to new versions, monitor metrics, rollback if needed</li>
<li><strong>Blue-green deployments</strong>: Swap entire environments atomically</li>
<li><strong>Rolling updates</strong>: Replace pods gradually without downtime</li>
<li><strong>Rollback</strong>: Revert to previous versions quickly</li>
<li><strong>Health checks</strong>: Readiness and liveness probes to avoid routing to broken pods</li>
<li><strong>Resource limits</strong>: Prevent resource exhaustion and noisy neighbor problems</li>
</ul>
<p>Kubernetes Deployments, StatefulSets, and Rollouts (via Argo) provide these primitives.</p>
<h3 id="5-observability-at-scale">5. Observability at Scale</h3>
<p>When you have 30 retriever pods, 10 orchestrator pods, and multiple LLM replicas, you need:</p>
<ul>
<li><strong>Distributed tracing</strong>: See request flows across services</li>
<li><strong>Metrics aggregation</strong>: Scrape metrics from all pods automatically</li>
<li><strong>Log aggregation</strong>: Centralized logging with correlation IDs</li>
<li><strong>Dashboarding</strong>: Real-time visibility into system health</li>
</ul>
<p>Kubernetes + Prometheus + OpenTelemetry + Grafana + Tempo is the standard stack.</p>
<h2 id="why-not-serverless">Why Not Serverless?</h2>
<p>Serverless functions (Lambda, Cloud Functions, Cloud Run) work for stateless HTTP APIs. But LLM workloads break the serverless model:</p>
<p><strong>Cold starts</strong>: LLMs take 10-60 seconds to load into GPU memory. Cold starts are unacceptable.</p>
<p><strong>Execution time limits</strong>: Serverless has timeouts (AWS Lambda: 15 minutes, Cloud Functions: 60 minutes). Long-running inference or batch processing exceeds these limits.</p>
<p><strong>GPU support</strong>: Limited or expensive. AWS Lambda doesn&apos;t support GPUs. Cloud Run supports GPUs but at premium pricing without scale-to-zero.</p>
<p><strong>Stateful caching</strong>: Semantic caching requires shared state (Redis). Serverless architectures push state to external services, adding latency.</p>
<p><strong>Cost</strong>: Serverless pricing is optimized for bursty, short-lived workloads. LLM inference is compute-intensive and benefits from sustained usage discounts on VMs/GPUs.</p>
<p>Each of these constraints can be mitigated, but Kubernetes gives us serverless-like abstractions (KServe scale-to-zero) while keeping control over GPU resources, state management, and cost.</p>
<h2 id="the-kubernetes-deployment-strategy">The Kubernetes Deployment Strategy</h2>
<p>Let&apos;s walk through deploying this RAG system to Kubernetes, both locally (KinD) and in production (GKE).</p>
<h3 id="production-deployment-on-gke">Production Deployment on GKE</h3>
<p>Google Kubernetes Engine (GKE) provides managed Kubernetes with:</p>
<ul>
<li><strong>Autopilot mode</strong>: Google manages nodes, scaling, security patches</li>
<li><strong>GPU node pools</strong>: L4, A100, H100 GPU support</li>
<li><strong>Regional clusters</strong>: High availability across zones</li>
<li><strong>Integrated logging</strong>: Stackdriver integration</li>
<li><strong>VPC-native networking</strong>: Secure service-to-service communication</li>
</ul>
<p><strong>Creating the cluster</strong>:</p>
<pre><code class="language-bash"># Configure environment
vim .env  # Set GCP_PROJECT, GKE_REGION, REGISTRY

# Create GKE cluster
make gke-cluster
</code></pre>
<p>This provisions:</p>
<ul>
<li><strong>Node pool</strong>: 3 nodes, <code>e2-standard-4</code> (4 vCPU, 16GB RAM)</li>
<li><strong>Autopilot</strong>: Optional (use <code>make gke-autopilot</code> for fully managed)</li>
<li><strong>Region</strong>: <code>europe-west4</code> (Netherlands, low latency to EU users)</li>
<li><strong>Network</strong>: VPC-native with private IPs</li>
</ul>
<p><strong>Installing KServe</strong>:</p>
<p>KServe requires manual installation (for now). Run these commands:</p>
<pre><code class="language-bash"># Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager-webhook -n cert-manager

# Install Knative Serving
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-core.yaml

# Install Kourier networking layer
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.20.0/kourier.yaml
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch &apos;{&quot;data&quot;:{&quot;ingress-class&quot;:&quot;kourier.ingress.networking.knative.dev&quot;}}&apos;

# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve.yaml
kubectl wait --for=condition=available --timeout=300s deployment/kserve-controller-manager -n kserve
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve-cluster-resources.yaml

# Configure raw deployment mode (no Knative autoscaling)
kubectl patch configmap/inferenceservice-config -n kserve --type merge \
  --patch &apos;{&quot;data&quot;:{&quot;deploy&quot;:&quot;{\&quot;defaultDeploymentMode\&quot;:\&quot;RawDeployment\&quot;}&quot;}}&apos;
</code></pre>
<p><strong>Create GPU node pool</strong>:</p>
<pre><code class="language-bash">make gke-gpu
</code></pre>
<p>This creates a separate node pool with:</p>
<ul>
<li><strong>GPU type</strong>: NVIDIA L4 (24GB VRAM, cost-effective for inference)</li>
<li><strong>Machine type</strong>: <code>g2-standard-4</code> (4 vCPU, 16GB RAM, 1x L4 GPU)</li>
<li><strong>Nodes</strong>: 1 node (autoscales 0-3 based on demand)</li>
<li><strong>Taints</strong>: <code>nvidia.com/gpu=present:NoSchedule</code> (only GPU workloads land here)</li>
</ul>
<p><strong>Deploy the stack</strong>:</p>
<pre><code class="language-bash"># Build and push images to Artifact Registry
gcloud auth configure-docker europe-north1-docker.pkg.dev
make build

# Deploy all services + LoadBalancer
make gke-deploy

# Ingest sample data
make ingest

# Get external IP
kubectl get svc orchestrator-public -n rag
</code></pre>
<p>Your production RAG system is now live at <code>http://&lt;EXTERNAL_IP&gt;</code>.</p>
<h2 id="autoscaling">Autoscaling</h2>
<p>Autoscaling is where Kubernetes shines. Let&apos;s break down each layer.</p>
<h3 id="horizontal-pod-autoscaler-hpa-cpu-based-scaling">Horizontal Pod Autoscaler (HPA): CPU-Based Scaling</h3>
<p>HPA scales pods based on resource metrics (CPU, memory). For the retriever service:</p>
<pre><code class="language-yaml"># deploy/hpa-retriever.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retriever-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retriever
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
</code></pre>
<p><strong>How it works</strong>:</p>
<ol>
<li>HPA queries Kubernetes Metrics Server for pod CPU usage</li>
<li>If average CPU &gt; 70%, scale up</li>
<li>If average CPU &lt; 70%, scale down (after stabilization window)</li>
<li>Scale-up is aggressive (100% increase or +4 pods, whichever is greater)</li>
<li>Scale-down is gradual (50% decrease every 60 seconds, after 5-minute stabilization)</li>
</ol>
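<p>Concretely, the controller derives the target count from the ratio of observed to desired utilization (this is the formula from the Kubernetes HPA documentation, sketched here with illustrative numbers):</p>
<pre><code class="language-java">public class HpaMath {

    // desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
    static int desiredReplicas(int currentReplicas, double currentUtilization, double targetUtilization) {
        return (int) Math.ceil(currentReplicas * currentUtilization / targetUtilization);
    }

    public static void main(String[] args) {
        // 4 retriever pods averaging 90% CPU against the 70% target -&gt; scale to 6
        System.out.println(desiredReplicas(4, 90, 70)); // 6
    }
}
</code></pre>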
<p><strong>Why these settings?</strong></p>
<ul>
<li><strong>minReplicas: 2</strong>: Ensures redundancy (if one pod crashes, traffic routes to the other)</li>
<li><strong>maxReplicas: 30</strong>: Handles extreme traffic spikes without unbounded cost</li>
<li><strong>averageUtilization: 70</strong>: Headroom for bursts without constant scaling oscillation</li>
<li><strong>scaleUp aggressive, scaleDown gradual</strong>: Prefer over-provisioning during spikes, slow drain during cooldowns</li>
</ul>
<h3 id="keda-event-driven-autoscaling">KEDA: Event-Driven Autoscaling</h3>
<p>HPA is great for CPU/memory scaling, but what about custom metrics? Enter KEDA (Kubernetes Event-Driven Autoscaling).</p>
<p>KEDA scales based on external metrics like:</p>
<ul>
<li>Prometheus queries</li>
<li>Kafka message lag</li>
<li>AWS SQS queue depth</li>
<li>Custom HTTP endpoints</li>
</ul>
<p>For the retriever, we scale based on <strong>requests per second</strong>:</p>
<pre><code class="language-yaml"># deploy/keda-retriever.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: retriever-keda
  namespace: rag
spec:
  scaleTargetRef:
    name: retriever
  minReplicaCount: 2
  maxReplicaCount: 30
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.observability.svc.cluster.local:9090
        query: |
          sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;,namespace=&quot;rag&quot;}[1m]))
        threshold: &quot;15&quot;
</code></pre>
<p><strong>How it works</strong>:</p>
<ol>
<li>KEDA queries Prometheus every 30 seconds</li>
<li>Executes the PromQL query: <code>sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;}[1m]))</code></li>
<li>Replicas are sized so each handles at most 15 RPS: <code>desiredReplicas = ceil(totalRPS / 15)</code>, so 100 RPS yields 7 replicas</li>
<li>When traffic drops, the replica count shrinks accordingly (after the 120-second cooldown)</li>
</ol>
<p><strong>Why RPS-based scaling?</strong></p>
<p>CPU utilization is a lagging indicator. By the time CPU hits 70%, users are already experiencing latency. RPS is a leading indicator&#x2014;traffic increases before CPU saturates.</p>
<p>With 15 RPS threshold:</p>
<ul>
<li>2 replicas handle 30 RPS</li>
<li>10 replicas handle 150 RPS</li>
<li>30 replicas handle 450 RPS</li>
</ul>
<p>In load testing, this scales the retriever from 2&#x2192;20 replicas within 90 seconds when traffic ramps from 10&#x2192;200 RPS.</p>
<h3 id="kserve-scale-to-zero-for-llms">KServe: Scale-to-Zero for LLMs</h3>
<p>GPUs are expensive. Leaving an L4 GPU idle costs ~$0.60/hour (~$430/month). Scale-to-zero is critical.</p>
<p>KServe (via Knative Serving) provides:</p>
<ul>
<li><strong>Scale to zero</strong>: Terminate pods when idle for 60 seconds</li>
<li><strong>Warm-up on demand</strong>: Spin up pods on first request</li>
<li><strong>Concurrency-based scaling</strong>: Scale based on in-flight requests</li>
</ul>
<pre><code class="language-yaml"># deploy/kserve-vllm.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: rag-llm
  namespace: rag
spec:
  predictor:
    minReplicas: 0        # Scale to zero when idle
    maxReplicas: 3        # Cap GPU-backed replicas
    scaleTarget: 20       # Target 20 concurrent requests per replica
    scaleMetric: concurrency
    model:
      runtime: vllm-runtime
      modelFormat:
        name: huggingface
      args:
        - --model
        - google/gemma-2-2b-it
        - --dtype
        - auto
        - --max-model-len
        - &quot;4096&quot;
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: &quot;3&quot;
          memory: 12Gi
</code></pre>
<p><strong>Cold start latency</strong>:</p>
<ul>
<li>Model download: ~10-20 seconds (cached after first run)</li>
<li>Model load to GPU: ~15-30 seconds</li>
<li>First inference: ~1-3 seconds</li>
<li><strong>Total cold start</strong>: ~30-50 seconds</li>
</ul>
<p><strong>Mitigation strategies</strong>:</p>
<ol>
<li><strong>Keep-alive</strong>: Background service pings the model every 45 seconds to prevent scale-down</li>
<li><strong>Fallback</strong>: Orchestrator uses deterministic fallback during cold starts</li>
<li><strong>Pre-warming</strong>: Scale to 1 replica before traffic spikes (e.g., before scheduled events)</li>
</ol>
<p>For production workloads with consistent traffic, set <code>minReplicas: 1</code> to avoid cold starts.</p>
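<p>The keep-alive mitigation can be as simple as a scheduled probe. A minimal sketch, pinging once a minute so Knative never observes 60 seconds of idle time (the predictor URL here is an assumption, not the repo&apos;s actual manifest):</p>
<pre><code class="language-yaml"># Illustrative keep-alive pinger for the KServe predictor.
# Adjust the URL to your actual InferenceService endpoint.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rag-llm-keepalive
  namespace: rag
spec:
  schedule: &quot;* * * * *&quot;    # every minute (CronJob minimum granularity)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ping
              image: curlimages/curl:8.5.0
              args:
                - -sf
                - http://rag-llm-predictor.rag.svc.cluster.local/v1/models/rag-llm
</code></pre>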
<h3 id="combining-hpa-keda-kserve">Combining HPA + KEDA + KServe</h3>
<p>Why use both HPA and KEDA for the retriever?</p>
<p>They complement each other:</p>
<ul>
<li><strong>HPA</strong>: Reacts to CPU saturation (protects against resource exhaustion)</li>
<li><strong>KEDA</strong>: Reacts to request rate (proactive scaling before CPU saturates)</li>
</ul>
<p>Their recommendations are evaluated together, and the highest replica count wins. During traffic spikes:</p>
<ol>
<li>KEDA detects rising RPS and scales to 3 replicas</li>
<li>CPU usage remains &lt;70% (HPA doesn&apos;t trigger)</li>
<li>If CPU later spikes above 70% (e.g., slow queries), the CPU signal takes over and pushes the count higher, say to 15 replicas</li>
<li>When traffic drops, KEDA scales down after cooldown</li>
</ol>
<p>This multi-signal approach prevents both under-provisioning (latency spikes) and over-provisioning (wasted cost).</p>
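<p>In practice, the cleanest way to combine the two signals is a single ScaledObject carrying both triggers: KEDA generates the underlying HPA itself, so keeping everything in one object avoids two controllers reconciling the same Deployment. A sketch:</p>
<pre><code class="language-yaml"># Illustrative: one ScaledObject with both the CPU and RPS signals.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: retriever-keda
  namespace: rag
spec:
  scaleTargetRef:
    name: retriever
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: cpu                  # reactive: CPU saturation
      metricType: Utilization
      metadata:
        value: &quot;70&quot;
    - type: prometheus           # proactive: request rate
      metadata:
        serverAddress: http://prometheus.observability.svc.cluster.local:9090
        query: sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;,namespace=&quot;rag&quot;}[1m]))
        threshold: &quot;15&quot;
</code></pre>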
<h2 id="observability-stack-seeing-whats-happening">Observability Stack: Seeing What&apos;s Happening</h2>
<p>You can&apos;t optimize what you can&apos;t measure. The observability stack provides end-to-end visibility.</p>
<h3 id="architecture-overview">Architecture Overview</h3>
<p><img src="https://aboullaite.me/content/images/2025/11/Untitled-diagram-2025-11-22-175452.png" alt="Building Production-Grade RAG Systems: Kubernetes, Autoscaling &amp; LLMs" loading="lazy"></p>
<h3 id="opentelemetry-tracing">OpenTelemetry Tracing</h3>
<p>Every request generates a distributed trace spanning multiple services. The OpenTelemetry Java agent instruments Spring Boot automatically:</p>
<pre><code class="language-yaml"># deploy/retriever.yaml

initContainers:
  - name: otel-agent-downloader
    image: busybox:1.36
    command:
      - sh
      - -c
      - &gt;
        wget -q -O /otel/javaagent.jar
        https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.21.0/opentelemetry-javaagent.jar
    volumeMounts:
      - name: otel-agent
        mountPath: /otel

containers:
  - name: retriever
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector.observability.svc.cluster.local:4317
      - name: JAVA_TOOL_OPTIONS
        value: &quot;-javaagent:/otel/javaagent.jar&quot;
    volumeMounts:
      - name: otel-agent
        mountPath: /otel
</code></pre>
<p><strong>Custom span attributes</strong> provide RAG-specific context:</p>
<pre><code class="language-java">// common/src/main/java/me/aboullaite/rag/common/tracing/TracingUtils.java

public class TracingUtils {
    public static void recordCacheHit(Span span, boolean hit) {
        span.setAttribute(&quot;rag.cache.hit&quot;, hit);
    }

    public static void recordRetrievedDocs(Span span, List&lt;RetrievedDoc&gt; docs) {
        span.setAttribute(&quot;rag.retrieval.count&quot;, docs.size());
        if (!docs.isEmpty()) {
            span.setAttribute(&quot;rag.retrieval.source&quot;,
                docs.get(0).metadata().source());
        }
    }

    public static void recordFallback(Span span, String reason) {
        span.setAttribute(&quot;rag.fallback.reason&quot;, reason);
    }

    public static void recordModelUsage(Span span, String model, long ttftMs, int tokens) {
        span.setAttribute(&quot;rag.model.name&quot;, model);
        span.setAttribute(&quot;rag.ttft_ms&quot;, ttftMs);
        span.setAttribute(&quot;rag.tokens.total&quot;, tokens);
    }
}
</code></pre>
<p>In Grafana Tempo, you can:</p>
<ul>
<li>Filter traces by <code>rag.cache.hit=false</code> (cache misses)</li>
<li>Find slow requests by <code>rag.ttft_ms &gt; 1000</code> (first token &gt; 1 second)</li>
<li>Identify fallback triggers by <code>rag.fallback.reason=weaviate-timeout</code></li>
</ul>
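<p>In Tempo&apos;s TraceQL, these filters look like this (attribute names match the <code>TracingUtils</code> helpers above):</p>
<pre><code>{ span.rag.cache.hit = false }
{ span.rag.ttft_ms &gt; 1000 }
{ span.rag.fallback.reason = &quot;weaviate-timeout&quot; }
</code></pre>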
<h3 id="prometheus-metrics">Prometheus Metrics</h3>
<p>Spring Boot Actuator exposes Prometheus metrics at <code>/actuator/prometheus</code>. The retriever and orchestrator emit custom metrics:</p>
<pre><code class="language-java">// Orchestrator metrics
this.askLatency = Timer.builder(&quot;rag_orchestrator_latency&quot;)
        .description(&quot;End-to-end /v1/ask latency&quot;)
        .register(meterRegistry);

this.cacheHitCounter = Counter.builder(&quot;rag_cache_hit_total&quot;)
        .description(&quot;Semantic cache hits&quot;)
        .register(meterRegistry);

this.tokensCounter = Counter.builder(&quot;rag_tokens_generated_total&quot;)
        .description(&quot;Total tokens generated by model responses&quot;)
        .register(meterRegistry);

this.costSummary = DistributionSummary.builder(&quot;rag_cost_usd_total&quot;)
        .description(&quot;Approximate request cost in USD&quot;)
        .register(meterRegistry);
</code></pre>
<p>Prometheus scrapes these metrics every 15 seconds:</p>
<pre><code class="language-yaml"># deploy/prometheus.yaml

scrape_configs:
  - job_name: &apos;retriever&apos;
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - rag
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
</code></pre>
<h3 id="grafana-dashboards">Grafana Dashboards</h3>
<p>The Grafana dashboard visualizes key RAG metrics:</p>
<pre><code>&#x250C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2510;
&#x2502; RAG Pipeline Dashboard                                      &#x2502;
&#x251C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2524;
&#x2502; Cache Hit Rate:  47.3%  &#x2502;  Avg Latency:     847ms           &#x2502;
&#x2502; Total Requests:  12.4k  &#x2502;  p95 Latency:    1.32s            &#x2502;
&#x2502; Fallback Rate:    8.2%  &#x2502;  p99 Latency:    2.15s            &#x2502;
&#x2502; Estimated Cost: $14.32  &#x2502;  Tokens/sec:      142             &#x2502;
&#x251C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2524;
&#x2502; [Line chart: Request rate over time (RPS)]                  &#x2502;
&#x2502; [Line chart: Cache hit ratio (percentage)]                  &#x2502;
&#x2502; [Line chart: Latency percentiles (p50/p95/p99)]             &#x2502;
&#x2502; [Bar chart: Replica count by service]                       &#x2502;
&#x2502; [Line chart: Token throughput (tokens/sec)]                 &#x2502;
&#x2502; [Line chart: Estimated cost per request]                    &#x2502;
&#x2514;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2518;
</code></pre>
<p><strong>Key metrics to watch</strong>:</p>
<ol>
<li>
<p><strong>Cache hit ratio</strong>: Should be &gt;40% for typical workloads. If &lt;20%, investigate query distribution or lower similarity threshold.</p>
</li>
<li>
<p><strong>p95 latency</strong>: Should be &lt;2 seconds. If higher, check:</p>
<ul>
<li>Retrieval timeout settings (too aggressive?)</li>
<li>LLM TTFT (model overloaded?)</li>
<li>Network latency (cross-region calls?)</li>
</ul>
</li>
<li>
<p><strong>Fallback rate</strong>: Should be &lt;10%. If higher, investigate:</p>
<ul>
<li>Weaviate performance (slow queries, resource exhaustion)</li>
<li>Timeout settings (too strict?)</li>
</ul>
</li>
<li>
<p><strong>Token throughput</strong>: Tracks LLM utilization. If low despite high traffic, you might need more GPU instances.</p>
</li>
<li>
<p><strong>Cost per request</strong>: Average should be ~$0.001-$0.005 depending on token generation. Spikes indicate cache misses or long responses.</p>
</li>
</ol>
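<p>Several of these thresholds translate directly into alert rules. A sketch (the histogram query assumes the latency timer publishes percentile histograms; the fallback counter name is illustrative):</p>
<pre><code class="language-yaml"># Illustrative Prometheus alerting rules for the SLOs above
groups:
  - name: rag-slos
    rules:
      - alert: RagP95LatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(rag_orchestrator_latency_seconds_bucket[5m]))) &gt; 2
        for: 10m
      - alert: RagFallbackRateHigh
        expr: |
          sum(rate(rag_retriever_fallback_total[5m]))
            / sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;}[5m])) &gt; 0.10
        for: 10m
      - alert: RagCacheHitRateLow
        expr: |
          sum(rate(rag_cache_hit_total[15m]))
            / sum(rate(rag_orchestrator_latency_seconds_count[15m])) &lt; 0.20
        for: 30m
</code></pre>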
<h2 id="load-testing-and-performance-tuning">Load Testing and Performance Tuning</h2>
<p>Load testing validates autoscaling and identifies bottlenecks. The repo includes a k6 script:</p>
<pre><code class="language-javascript">// scripts/loadtest-k6.js

import http from &apos;k6/http&apos;;
import { check, sleep } from &apos;k6&apos;;

export let options = {
  stages: [
    { duration: &apos;1m&apos;, target: 10 },   // Ramp to 10 RPS
    { duration: &apos;3m&apos;, target: 30 },   // Sustain 30 RPS
    { duration: &apos;1m&apos;, target: 50 },   // Spike to 50 RPS
    { duration: &apos;2m&apos;, target: 10 },   // Cool down to 10 RPS
    { duration: &apos;1m&apos;, target: 0 },    // Drain
  ],
};

export default function () {
  const queries = [
    &apos;How does autoscaling work?&apos;,
    &apos;Explain the caching mechanism&apos;,
    &apos;What is the fallback strategy?&apos;,
    &apos;How do I deploy to Kubernetes?&apos;,
    &apos;What observability tools are used?&apos;,
  ];

  const prompt = queries[Math.floor(Math.random() * queries.length)];
  const payload = JSON.stringify({ prompt, topK: 5 });

  const res = http.post(&apos;http://localhost:8080/v1/ask&apos;, payload, {
    headers: { &apos;Content-Type&apos;: &apos;application/json&apos; },
  });

  check(res, {
    &apos;status is 200&apos;: (r) =&gt; r.status === 200,
    &apos;response time &lt; 3s&apos;: (r) =&gt; r.timings.duration &lt; 3000,
    &apos;has answer&apos;: (r) =&gt; JSON.parse(r.body).answer.length &gt; 0,
  });

  sleep(1);
}
</code></pre>
<p><strong>Run the test</strong>:</p>
<pre><code class="language-bash">make port-forward  # In one terminal
k6 run scripts/loadtest-k6.js  # In another terminal
</code></pre>
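<p>While k6 ramps, you can watch the autoscalers react from a third terminal (the <code>app=retriever</code> label is an assumption from the deployment manifests):</p>
<pre><code class="language-bash">kubectl -n rag get hpa,scaledobject -w        # replica recommendations
kubectl -n rag get pods -l app=retriever -w   # pods coming and going
</code></pre>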
<p><strong>Watch in Grafana</strong>:</p>
<ul>
<li>Retriever replicas scale from 2&#x2192;10&#x2192;20 as RPS increases</li>
<li>Cache hit ratio stabilizes around 45% after warm-up</li>
<li>p95 latency stays &lt;2s despite traffic spike</li>
<li>Fallback rate increases slightly during peak (10-15%)</li>
</ul>
<p><strong>Tuning recommendations</strong>:</p>
<ol>
<li><strong>If p95 latency &gt; 2s</strong>: Increase retriever <code>maxReplicas</code> or lower HPA CPU threshold to 60%</li>
<li><strong>If fallback rate &gt; 15%</strong>: Increase Weaviate timeout from 250ms to 500ms</li>
<li><strong>If cache hit ratio &lt; 30%</strong>: Lower similarity threshold from 0.90 to 0.85 (validate answer quality)</li>
<li><strong>If cost per request &gt; $0.01</strong>: Reduce <code>max_tokens</code> in LLM config or improve prompt efficiency</li>
</ol>
<h2 id="cost-optimization-strategies">Cost Optimization Strategies</h2>
<p>Running LLMs in production is expensive. Here&apos;s how to minimize cost without sacrificing quality:</p>
<h3 id="1-semantic-caching">1. Semantic Caching</h3>
<p>A cache hit ratio of 45% means you&apos;re avoiding 45% of LLM calls. At ~$0.002 per request, each cache hit saves the full ~$0.002 LLM call. For 100k requests/day:</p>
<ul>
<li>Without cache: $200/day</li>
<li>With 45% cache hit: $110/day</li>
<li><strong>Savings: $90/day = $32,850/year</strong></li>
</ul>
<h3 id="2-kserve-scale-to-zero">2. KServe Scale-to-Zero</h3>
<p>If your traffic has idle periods (e.g., nights, weekends), scale-to-zero saves GPU costs:</p>
<ul>
<li>L4 GPU: $0.60/hour = $14.40/day</li>
<li>If idle 50% of the time: <strong>$7.20/day savings = $2,628/year</strong></li>
</ul>
<h3 id="3-spotpreemptible-instances">3. Spot/Preemptible Instances</h3>
<p>GKE supports spot instances (70% discount on compute):</p>
<ul>
<li>Standard <code>g2-standard-4</code>: $0.35/hour</li>
<li>Spot <code>g2-standard-4</code>: $0.10/hour</li>
<li><strong>Savings: $0.25/hour = $6/day = $2,190/year</strong></li>
</ul>
<p>Caveat: Spot instances can be preempted. Use them for stateless services (retriever, orchestrator) but not for stateful stores (Redis, databases).</p>
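<p>On GKE, steering a stateless Deployment onto spot capacity is a scheduling concern. A minimal sketch using GKE&apos;s spot node label and taint:</p>
<pre><code class="language-yaml"># Illustrative: pin the retriever to spot nodes (GKE label/taint)
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: &quot;true&quot;
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: &quot;true&quot;
          effect: NoSchedule
</code></pre>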
<h3 id="4-right-sizing-resources">4. Right-Sizing Resources</h3>
<p>Monitor actual resource usage in Grafana:</p>
<ul>
<li>If retriever CPU consistently &lt;50%, reduce <code>requests.cpu</code> from <code>200m</code> to <code>100m</code></li>
<li>If orchestrator memory consistently &lt;200MB, reduce <code>requests.memory</code> from <code>256Mi</code> to <code>128Mi</code></li>
</ul>
<p>Over-provisioning wastes money. Under-provisioning causes OOM kills. Find the sweet spot.</p>
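<p>Right-sizing ends up as a small diff in the deployment manifest. For the retriever, applying the guidance above might look like this (the limits here are illustrative, not the repo&apos;s values):</p>
<pre><code class="language-yaml">resources:
  requests:
    cpu: 100m        # was 200m; observed usage stays under 50%
    memory: 256Mi
  limits:
    cpu: &quot;1&quot;
    memory: 512Mi
</code></pre>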
<h2 id="why-kubernetes-can-be-an-option-for-running-llm-apps">Why Kubernetes can be an option for running LLM apps</h2>
<p>Let&apos;s revisit the original question: <strong>why Kubernetes for LLM workloads?</strong></p>
<p>Because Kubernetes provides:</p>
<ol>
<li><strong>GPU orchestration</strong>: Dynamic allocation, multi-tenancy, isolation</li>
<li><strong>Heterogeneous scaling</strong>: Independent scaling per component (HPA, KEDA, KServe)</li>
<li><strong>Service mesh</strong>: Discovery, load balancing, retries, circuit breaking</li>
<li><strong>Deployment primitives</strong>: Canary, blue-green, rolling updates, rollbacks</li>
<li><strong>Observability integration</strong>: Prometheus, Tempo, Grafana, OpenTelemetry</li>
<li><strong>Cost optimization</strong>: Spot instances, autoscaling, scale-to-zero</li>
<li><strong>Portability</strong>: Run in GCP (GKE), AWS (EKS), Azure (AKS), on-prem</li>
</ol>
<p>LLM applications are distributed systems. They have complex dependencies, heterogeneous resource requirements, and demanding operational SLOs. Kubernetes is purpose-built for this.</p>
<p>Yes, there&apos;s a learning curve. Yes, it&apos;s more complex. But the operational benefits&#x2014;reliability, scalability, observability, cost efficiency&#x2014;are undeniable.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>We&apos;ve built a production-grade RAG system from the ground up:</p>
<p><strong>Part 1</strong>: We discussed the core challenges (latency, reliability, cost, quality, observability)</p>
<p><strong>Part 2</strong>: We designed a resilient architecture with semantic caching, hybrid retrieval, graceful degradation, and comprehensive instrumentation</p>
<p><strong>Part 3</strong>: We outlined the K8S deployment with intelligent autoscaling (HPA, KEDA, KServe), full observability (Prometheus, Tempo, Grafana), and cost optimization</p>
<p>The patterns here (service isolation, reactive programming, timeouts, fallbacks, tracing, metrics, autoscaling) are how you build systems that serve millions of users.</p>
<p>The code is open source: <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a></p>
<p>Thanks for following along this series! If you have questions or want to discuss RAG architectures, Kubernetes patterns, or LLM infrastructure, find me on <a href="https://twitter.com/laytoun?ref=aboullaite.me">Twitter</a> or <a href="https://linkedin.com/in/aboullaite?ref=aboullaite.me">LinkedIn</a>.</p>
<p>Now go build something great.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Architecture Deep Dive]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the <a href="https://aboullaite.me/production-rag-java-k8s-part1/">first part</a>, we explored the production challenges of RAG systems: latency, reliability, cost, quality, and observability. Now let&apos;s get our hands dirty with the actual architecture and implementation.</p>
<p>The codebase uses Java 25, Spring Boot 3.5.7, reactive programming with WebFlux, and follows production patterns</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part2/</link><guid isPermaLink="false">69143a65cda49600011ec317</guid><category><![CDATA[Java]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[artificial intelligence]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 16 Nov 2025 13:32:49 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_ldtvwvldtvwvldtv.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_ldtvwvldtvwvldtv.jpeg" alt="Building Production-Grade RAG Systems: Architecture Deep Dive"><p>In the <a href="https://aboullaite.me/production-rag-java-k8s-part1/">first part</a>, we explored the production challenges of RAG systems: latency, reliability, cost, quality, and observability. Now let&apos;s get our hands dirty with the actual architecture and implementation.</p>
<p>The codebase uses Java 25, Spring Boot 3.5.7, reactive programming with WebFlux, and follows production patterns you&apos;d see in enterprise systems. Every design decision has a reason, and I&apos;ll explain the tradeoffs as we go.</p>
<h2 id="service-boundaries-why-separation-matters">Service Boundaries: Why Separation Matters</h2>
<p>The system is split into three main modules:</p>
<pre><code>common/                # Shared DTOs and tracing helpers
retriever/             # Reactive Weaviate/OpenSearch retriever service
orchestrator/          # Orchestration, caching, LLM routing, SSE
</code></pre>
<p>This separation isn&apos;t arbitrary. Each service has distinct scaling characteristics and failure modes:</p>
<ul>
<li><strong>Retriever</strong> is CPU and I/O intensive, scales horizontally, and needs aggressive timeouts</li>
<li><strong>Orchestrator</strong> manages state (semantic cache), handles user connections (SSE), and coordinates the pipeline</li>
<li><strong>Common</strong> provides shared contracts (DTOs) and telemetry utilities that both services use</li>
</ul>
<p>By isolating these responsibilities, we can scale the retriever independently during traffic spikes while keeping the orchestrator stable. If the retriever service crashes, the orchestrator can still serve cached responses.</p>
<h2 id="the-retriever-service-hybrid-search-with-fallbacks">The Retriever Service: Hybrid Search with Fallbacks</h2>
<p>Let&apos;s start with document retrieval. The retriever service exposes a single endpoint:</p>
<pre><code class="language-java">POST /v1/retrieve
</code></pre>
<p><strong>Request</strong>:</p>
<pre><code class="language-json">{
  &quot;text&quot;: &quot;How does autoscaling work?&quot;,
  &quot;filters&quot;: {&quot;section&quot;: &quot;infrastructure&quot;},
  &quot;topK&quot;: 5
}
</code></pre>
<p><strong>Response</strong>:</p>
<pre><code class="language-json">[
  {
    &quot;id&quot;: &quot;doc-03-autoscaling&quot;,
    &quot;chunk&quot;: &quot;Autoscaling combines HPA and KEDA...&quot;,
    &quot;score&quot;: 0.87,
    &quot;metadata&quot;: {
      &quot;source&quot;: &quot;doc-03-autoscaling.md&quot;,
      &quot;section&quot;: &quot;infrastructure&quot;
    }
  }
]
</code></pre>
<p>The implementation looks deceptively simple, but it is designed first and foremost for resilience:</p>
<pre><code class="language-java">// retriever/src/main/java/me/aboullaite/rag/retriever/service/RetrieverService.java

@Service
public class RetrieverService {

    private final WeaviateGateway weaviateGateway;
    private final OpenSearchGateway openSearchGateway;
    private final RetrieverProperties properties;
    private final Timer retrievalLatency;
    private final Counter fallbackCounter;
    private final Tracer tracer;

    public Mono&lt;List&lt;RetrievedDoc&gt;&gt; retrieve(Query query) {
        int topK = query.topK() &gt; 0 ? query.topK() : properties.getTopKDefault();
        return Mono.defer(() -&gt; executeRetrieval(query, topK));
    }

    private Mono&lt;List&lt;RetrievedDoc&gt;&gt; executeRetrieval(Query query, int topK) {
        Span span = tracer.spanBuilder(&quot;rag.retrieve&quot;)
                .setAttribute(&quot;rag.request.topK&quot;, topK)
                .startSpan();
        Timer.Sample sample = Timer.start(meterRegistry);

        return weaviateGateway.search(query, topK)
                .timeout(Duration.ofMillis(properties.getTimeoutMs()))
                .onErrorResume(throwable -&gt; fallback(query, topK, span, throwable))
                .doOnNext(docs -&gt; TracingUtils.recordRetrievedDocs(span, docs))
                .doOnError(span::recordException)
                .doFinally(signalType -&gt; {
                    sample.stop(retrievalLatency);
                    span.end();
                });
    }

    private Mono&lt;List&lt;RetrievedDoc&gt;&gt; fallback(Query query, int topK, Span parentSpan, Throwable throwable) {
        boolean timeout = throwable instanceof TimeoutException;
        log.warn(&quot;Primary vector search failed (timeout={}): {}&quot;, timeout, throwable.getMessage());
        fallbackCounter.increment();
        TracingUtils.recordFallback(parentSpan, timeout ? &quot;weaviate-timeout&quot; : throwable.getClass().getSimpleName());

        if (!openSearchGateway.isEnabled()) {
            return Mono.just(List.of());
        }

        return openSearchGateway.search(query, topK)
                .doOnNext(docs -&gt; TracingUtils.recordRetrievedDocs(parentSpan, docs));
    }
}
</code></pre>
<h3 id="key-design-decisions">Key Design Decisions</h3>
<p><strong>1. Reactive Streams with Project Reactor</strong></p>
<p>Notice the return type: <code>Mono&lt;List&lt;RetrievedDoc&gt;&gt;</code>. This is Project Reactor&apos;s reactive type for 0-1 values. By using reactive programming:</p>
<ul>
<li>We avoid blocking threads during I/O</li>
<li>Timeouts are first-class citizens (<code>.timeout(Duration.ofMillis(250))</code>)</li>
<li>Error handling composes naturally (<code>.onErrorResume()</code>)</li>
<li>Observability hooks integrate seamlessly (<code>.doOnNext()</code>, <code>.doFinally()</code>)</li>
</ul>
<p>Spring Boot&apos;s WebFlux framework handles request threads efficiently, allowing the retriever to handle hundreds of concurrent requests without thread pool exhaustion.</p>
<p><strong>2. Aggressive Timeouts</strong></p>
<p>The default timeout is <strong>250ms</strong>. That&apos;s intentionally tight. Why?</p>
<ul>
<li>Users expect sub-second responses</li>
<li>Vector databases can have occasional slow queries (large result sets, index rebuilds, etc.)</li>
<li>We&apos;d rather fall back to lexical search than make users wait. That is of course debatable and depends on what we are optimizing for!</li>
</ul>
<p>In load testing, this timeout triggers fallback ~5-15% of the time under heavy load, which is acceptable given the graceful degradation.</p>
<p><strong>3. Observability at Every Step</strong></p>
<p>Every retrieval is instrumented:</p>
<ul>
<li><strong>OpenTelemetry Span</strong>: captures timing, document count, and fallback reasons</li>
<li><strong>Prometheus Timer</strong>: records latency histogram for p95/p99 analysis</li>
<li><strong>Prometheus Counter</strong>: tracks fallback frequency</li>
</ul>
<p>When debugging production issues, we can simply filter Tempo traces by <code>rag.fallback.reason=weaviate-timeout</code> to see exactly which requests degraded.</p>
<p><strong>4. Lexical Fallback via OpenSearch</strong></p>
<p>Weaviate is great for semantic search, but sometimes you need exact term matching. OpenSearch provides BM25 ranking, which excels at:</p>
<ul>
<li>Acronyms (e.g., &quot;HPA&quot;, &quot;KEDA&quot;, &quot;SSE&quot;)</li>
<li>Version numbers (e.g., &quot;Java 25&quot;, &quot;Spring Boot 3.5.7&quot;)</li>
<li>Exact phrases (e.g., &quot;Server-Sent Events&quot;)</li>
</ul>
<p>The fallback amounts to a lightweight hybrid retrieval strategy. Some RAG systems use re-rankers to combine vector and lexical signals; here, we fall back from one to the other for simplicity while maintaining quality.</p>
<h2 id="the-orchestrator-service-coordination-and-caching">The Orchestrator Service: Coordination and Caching</h2>
<p>The orchestrator is the brain of the system. It coordinates caching, retrieval, prompt assembly, generation, and streaming. Let&apos;s walk through the request flow.</p>
<h3 id="request-flow-diagram">Request Flow Diagram</h3>
<p><img src="https://aboullaite.me/content/images/2025/11/Personal-2025-11-12-075246.png" alt="Building Production-Grade RAG Systems: Architecture Deep Dive" loading="lazy"></p>
<h3 id="semantic-cache-implementation">Semantic Cache Implementation</h3>
<p>The semantic cache is the secret weapon for both latency and cost optimization. Here&apos;s the code:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/cache/SemanticCacheService.java

@Service
public class SemanticCacheService {

    private static final String CACHE_INDEX = &quot;rag:cache:index&quot;;
    private static final String CACHE_KEY_PREFIX = &quot;rag:cache:&quot;;
    private static final double SIMILARITY_THRESHOLD = 0.90;
    private static final Duration CACHE_TTL = Duration.ofMinutes(10);

    private final RedisTemplate&lt;String, String&gt; redisTemplate;

    public Mono&lt;CacheHit&gt; lookup(String normalizedQuery, double[] embedding) {
        return Mono.fromCallable(() -&gt; {
            Set&lt;String&gt; keys = redisTemplate.opsForSet().members(CACHE_INDEX);
            if (keys == null || keys.isEmpty()) {
                return null;
            }

            double maxSimilarity = 0.0;
            CacheEntry bestMatch = null;

            for (String key : keys) {
                String json = redisTemplate.opsForValue().get(key);
                if (json == null) continue;

                CacheEntry entry = deserialize(json);
                double similarity = SimilarityUtils.cosineSimilarity(embedding, entry.embedding());

                if (similarity &gt; maxSimilarity &amp;&amp; similarity &gt;= SIMILARITY_THRESHOLD) {
                    maxSimilarity = similarity;
                    bestMatch = entry;
                }
            }

            return bestMatch != null ? new CacheHit(bestMatch, maxSimilarity) : null;
        }).subscribeOn(Schedulers.boundedElastic());
    }

    public Mono&lt;Void&gt; put(String normalizedQuery, double[] embedding,
                          GenerationResponse response, List&lt;RetrievedDoc&gt; docs) {
        return Mono.fromRunnable(() -&gt; {
            String key = CACHE_KEY_PREFIX + UUID.randomUUID();
            CacheEntry entry = new CacheEntry(
                normalizedQuery,
                embedding,
                response.answer(),
                response.citations(),
                docs.stream().map(RetrievedDoc::id).toList(),
                System.currentTimeMillis()
            );

            String json = serialize(entry);
            redisTemplate.opsForValue().set(key, json, CACHE_TTL);
            redisTemplate.opsForSet().add(CACHE_INDEX, key);
        }).subscribeOn(Schedulers.boundedElastic()).then();
    }
}
</code></pre>
<h3 id="why-cosine-similarity-threshold-090">Why Cosine Similarity Threshold 0.90?</h3>
<p>This threshold balances precision and recall:</p>
<ul>
<li><strong>Too low (e.g., 0.70)</strong>: You&apos;d match dissimilar queries, returning wrong cached answers</li>
<li><strong>Too high (e.g., 0.98)</strong>: You&apos;d miss legitimate matches, reducing cache hit rate</li>
</ul>
<p>At 0.90, queries like:</p>
<ul>
<li>&quot;How does autoscaling work?&quot;</li>
<li>&quot;Explain the autoscaling mechanism&quot;</li>
<li>&quot;What is the autoscaling strategy?&quot;</li>
</ul>
<p>...all match and reuse the same cached answer. But unrelated queries like &quot;How do I ingest data?&quot; won&apos;t match.</p>
<p>In load testing with realistic query distributions, 0.90 yields ~45% cache hit rate, cutting LLM costs nearly in half.</p>
<p>That said, these figures come from my own tests and use cases; don&apos;t rely on them blindly. Run your own benchmarks, as your numbers may differ.</p>
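<p>For completeness, here is what the <code>SimilarityUtils.cosineSimilarity</code> helper used in the cache lookup can look like. This is a minimal sketch; the repo&apos;s actual implementation may differ:</p>

```java
// Minimal cosine similarity: dot(a, b) / (|a| * |b|). For the non-negative
// hash-based demo vectors in this series, the result lies in [0, 1].
public final class SimilarityUtils {
    private SimilarityUtils() {}

    public static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat similarity with a zero vector as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

<p>The same function works unchanged whether the vectors are the 8-dimensional demo embeddings or 384/1536-dimensional production embeddings.</p>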
<h3 id="deterministic-embeddings">Deterministic Embeddings</h3>
<p>For this demo, I&apos;m using deterministic 8-dimensional embeddings generated via SHA-256 hashing:</p>
<pre><code class="language-java">// common/src/main/java/me/aboullaite/rag/common/embedding/DeterministicEmbedding.java

public class DeterministicEmbedding {
    public static double[] embed(String text) {
        try {
            byte[] hash = MessageDigest.getInstance(&quot;SHA-256&quot;)
                    .digest(text.getBytes(StandardCharsets.UTF_8));  // fixed charset keeps it deterministic
            double[] embedding = new double[8];
            for (int i = 0; i &lt; 8; i++) {
                embedding[i] = (hash[i] &amp; 0xFF) / 255.0;
            }
            // normalize(...) scales the vector to unit length (definition omitted here)
            return normalize(embedding);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(&quot;SHA-256 must be available&quot;, e);
        }
    }
}
</code></pre>
<p><strong>Why deterministic embeddings?</strong></p>
<ul>
<li>No external embedding service dependency for the demo</li>
<li>Reproducible cache behavior in tests</li>
<li>Instant embedding computation (no API latency)</li>
</ul>
<p>In production, we&apos;d use proper sentence embeddings (e.g., <code>all-MiniLM-L6-v2</code> via Hugging Face or OpenAI&apos;s <code>text-embedding-3-small</code>). The cache logic remains identical; we just swap the embedding function.</p>
<h2 id="prompt-assembly-and-citation-tracking">Prompt Assembly and Citation Tracking</h2>
<p>Once documents are retrieved, we need to construct a prompt that:</p>
<ol>
<li>Provides clear instructions to the LLM</li>
<li>Injects retrieved context</li>
<li>Enforces citation requirements</li>
<li>Handles edge cases (no documents, partial results, etc.)</li>
</ol>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/prompt/PromptAssembler.java

@Component
public class PromptAssembler {

    public PromptBundle assemble(String userPrompt, List&lt;RetrievedDoc&gt; docs) {
        if (docs.isEmpty()) {
            return new PromptBundle(
                noContextPrompt(userPrompt),
                List.of(),
                List.of()
            );
        }

        StringBuilder prompt = new StringBuilder();
        prompt.append(&quot;You are a helpful assistant. Answer the question based ONLY on the provided documents.\n\n&quot;);
        prompt.append(&quot;Documents:\n&quot;);

        List&lt;String&gt; citations = new ArrayList&lt;&gt;();
        List&lt;CitationInfo&gt; citationDetails = new ArrayList&lt;&gt;();

        for (int i = 0; i &lt; docs.size(); i++) {
            RetrievedDoc doc = docs.get(i);
            String citationId = doc.id();
            citations.add(citationId);
            citationDetails.add(new CitationInfo(
                citationId,
                doc.metadata().source(),
                doc.metadata().section()
            ));

            prompt.append(String.format(&quot;[%s] %s\n\n&quot;, citationId, doc.chunk()));
        }

        prompt.append(&quot;Question: &quot;).append(userPrompt).append(&quot;\n\n&quot;);
        prompt.append(&quot;Instructions:\n&quot;);
        prompt.append(&quot;- Answer ONLY using information from the provided documents\n&quot;);
        prompt.append(&quot;- Cite sources using [doc-id] notation\n&quot;);
        prompt.append(&quot;- If the documents don&apos;t contain enough information, say &apos;I don&apos;t know&apos;\n&quot;);
        prompt.append(&quot;- Be concise and accurate\n\n&quot;);
        prompt.append(&quot;Answer:&quot;);

        return new PromptBundle(prompt.toString(), citations, citationDetails);
    }

    private String noContextPrompt(String userPrompt) {
        return &quot;You are a helpful assistant. The user asked: &quot; + userPrompt +
               &quot;\n\nNo relevant documents were found. Please respond with: I don&apos;t know.&quot;;
    }
}
</code></pre>
<h3 id="citation-enforcement">Citation Enforcement</h3>
<p>Notice the explicit instructions:</p>
<ul>
<li>&quot;Answer ONLY using information from the provided documents&quot;</li>
<li>&quot;Cite sources using [doc-id] notation&quot;</li>
<li>&quot;If the documents don&apos;t contain enough information, say &apos;I don&apos;t know&apos;&quot;</li>
</ul>
<p>LLMs are surprisingly good at following these instructions when they&apos;re clear and emphatic. In testing with Gemma-2-2B, citation compliance is &gt;85% for well-formed prompts.</p>
<p>The <code>PromptBundle</code> record encapsulates:</p>
<pre><code class="language-java">public record PromptBundle(
    String prompt,              // Full prompt sent to LLM
    List&lt;String&gt; citations,     // [doc-03-autoscaling, doc-09-infrastructure]
    List&lt;CitationInfo&gt; citationDetails  // Full metadata for UI rendering
) {}
</code></pre>
<p>This separation allows the orchestrator to:</p>
<ul>
<li>Send a clean prompt to the LLM</li>
<li>Return structured citations to the client</li>
<li>Track which documents contributed to each answer (for cache invalidation, analytics, etc.)</li>
</ul>
<h2 id="llm-integration-kserve-vllm">LLM Integration: KServe + vLLM</h2>
<p>The LLM layer uses <strong>KServe</strong> (Kubernetes serving framework) with <strong>vLLM</strong> runtime to host <strong>Gemma-2-2B-it</strong> (instruction-tuned).</p>
<h3 id="why-kserve">Why KServe?</h3>
<p>KServe provides:</p>
<ul>
<li><strong>Autoscaling</strong>: Scale-to-zero when idle, scale-up on demand</li>
<li><strong>GPU management</strong>: Automatic GPU resource allocation</li>
<li><strong>Inference optimization</strong>: vLLM uses PagedAttention for efficient memory usage</li>
<li><strong>Standardized API</strong>: OpenAI-compatible <code>/v1/chat/completions</code> endpoint</li>
</ul>
<p>The InferenceService definition:</p>
<pre><code class="language-yaml"># deploy/kserve-vllm.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: rag-llm
  namespace: rag
spec:
  predictor:
    minReplicas: 0        # Scale to zero when idle
    maxReplicas: 1
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      runtime: vllm-runtime
      modelFormat:
        name: huggingface
      args:
        - --model
        - google/gemma-2-2b-it
        - --dtype
        - auto
        - --max-model-len
        - &quot;4096&quot;
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: &quot;3&quot;
          memory: 12Gi
        requests:
          cpu: &quot;2&quot;
          memory: 8Gi
</code></pre>
<h3 id="vllm-runtime-configuration">vLLM Runtime Configuration</h3>
<p>vLLM is a high-performance inference engine optimized for LLMs. Key features:</p>
<ul>
<li><strong>PagedAttention</strong>: Reduces memory fragmentation, increases throughput</li>
<li><strong>Continuous batching</strong>: Processes multiple requests efficiently</li>
<li><strong>Quantization support</strong>: <code>--dtype auto</code> enables FP16/BF16 for faster inference</li>
</ul>
<p>With Gemma-2-2B on an L4 GPU (24GB), vLLM achieves:</p>
<ul>
<li><strong>Time-to-first-token (TTFT)</strong>: ~100-300ms</li>
<li><strong>Throughput</strong>: ~50-80 tokens/sec</li>
<li><strong>Concurrent requests</strong>: 4-8 (depending on sequence length)</li>
</ul>
<h3 id="llm-client-implementation">LLM Client Implementation</h3>
<p>The orchestrator calls KServe via a reactive client:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/client/LlmClient.java

@Component
public class LlmClient {

    private final WebClient webClient;
    private final OrchestratorProperties properties;

    public Mono&lt;LlmResponse&gt; generate(String prompt) {
        Map&lt;String, Object&gt; request = Map.of(
            &quot;model&quot;, properties.getModelName(),
            &quot;messages&quot;, List.of(
                Map.of(&quot;role&quot;, &quot;user&quot;, &quot;content&quot;, prompt)
            ),
            &quot;max_tokens&quot;, properties.getMaxTokens(),
            &quot;temperature&quot;, properties.getTemperature()
        );

        long startNano = System.nanoTime();
        AtomicLong ttftNano = new AtomicLong(0);

        return webClient.post()
                .uri(&quot;/v1/chat/completions&quot;)
                .bodyValue(request)
                .retrieve()
                .bodyToMono(Map.class)
                .map(response -&gt; {
                    // Non-streaming call: the whole body arrives at once, so this
                    // effectively records total response time (an upper bound on TTFT).
                    if (ttftNano.get() == 0) {
                        ttftNano.set(System.nanoTime() - startNano);
                    }

                    String content = extractContent(response);
                    int tokens = estimateTokens(content);
                    long ttftMillis = ttftNano.get() / 1_000_000;

                    return new LlmResponse(content, tokens, ttftMillis);
                })
                .timeout(Duration.ofSeconds(properties.getGenerationTimeoutSeconds()));
    }
}
</code></pre>
<p><strong>Time-to-First-Token (TTFT)</strong> is a critical metric for user experience. Measuring it accurately requires:</p>
<ol>
<li>Start timer when request begins</li>
<li>Capture timestamp on first response byte</li>
<li>Calculate delta in milliseconds</li>
</ol>
<p>This metric appears in OpenTelemetry spans as <code>rag.ttft_ms</code>, allowing us to track degradation trends in Grafana.</p>
<h2 id="streaming-responses-with-server-sent-events">Streaming Responses with Server-Sent Events</h2>
<p>One of the best UX improvements in modern LLM applications is streaming. Instead of waiting 3+ seconds for the complete answer, users see tokens as they&apos;re generated.</p>
<h3 id="sse-endpoint">SSE Endpoint</h3>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/web/AskController.java

@GetMapping(value = &quot;/v1/ask/stream&quot;, produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux&lt;ServerSentEvent&lt;String&gt;&gt; askStream(
        @RequestParam String prompt,
        @RequestParam(required = false) Map&lt;String, String&gt; filters,
        @RequestParam(required = false, defaultValue = &quot;5&quot;) Integer topK) {

    return askService.askStreaming(prompt, filters, topK)
            .map(chunk -&gt; {
                if (chunk.isComplete()) {
                    return ServerSentEvent.&lt;String&gt;builder()
                            .event(&quot;complete&quot;)
                            .data(toJson(chunk))
                            .build();
                } else {
                    return ServerSentEvent.&lt;String&gt;builder()
                            .event(&quot;token&quot;)
                            .data(chunk.token())
                            .build();
                }
            });
}
</code></pre>
<p><strong>Event Types</strong>:</p>
<ul>
<li><code>token</code>: Individual generated tokens (streamed progressively)</li>
<li><code>complete</code>: Final event containing citations and metadata</li>
</ul>
<p><strong>Client-side consumption</strong> (JavaScript):</p>
<pre><code class="language-javascript">const eventSource = new EventSource(&apos;/v1/ask/stream?prompt=How+does+caching+work&apos;);

eventSource.addEventListener(&apos;token&apos;, (event) =&gt; {
    document.getElementById(&apos;answer&apos;).textContent += event.data;
});

eventSource.addEventListener(&apos;complete&apos;, (event) =&gt; {
    const result = JSON.parse(event.data);
    displayCitations(result.citations);
    eventSource.close();
});
</code></pre>
<p>Progressive rendering dramatically improves perceived performance. Users engage with partial responses while generation continues, reducing perceived wait time by 50-70%.</p>
<h2 id="request-orchestration-putting-it-all-together">Request Orchestration: Putting It All Together</h2>
<p>Here&apos;s the core orchestration logic that ties everything together:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/service/AskService.java

public Mono&lt;GenerationResponse&gt; ask(String prompt, Map&lt;String, String&gt; filters, Integer topK) {
    String sanitizedPrompt = redact(prompt);  // PII redaction
    double[] embedding = embeddingService.embed(sanitizedPrompt);
    Span span = tracer.spanBuilder(&quot;rag.ask&quot;).startSpan();
    Timer.Sample sample = Timer.start(meterRegistry);

    return cacheService.lookup(sanitizedPrompt, embedding)
            .flatMap(hit -&gt; onCacheHit(hit, span))
            .switchIfEmpty(Mono.defer(() -&gt; {
                cacheMissCounter.increment();
                TracingUtils.recordCacheHit(span, false);
                return generateWithRetrieval(sanitizedPrompt, filters, topK, embedding, span);
            }))
            .doOnError(span::recordException)
            .doFinally(signalType -&gt; {
                sample.stop(askLatency);
                span.end();
            });
}

private Mono&lt;GenerationResponse&gt; generateWithRetrieval(
        String sanitizedPrompt,
        Map&lt;String, String&gt; filters,
        Integer topK,
        double[] embedding,
        Span parentSpan) {
    Query query = new Query(sanitizedPrompt, filters, topK);

    return retrieverClient.retrieve(query)
            .flatMap(docs -&gt; produceAnswer(sanitizedPrompt, docs, embedding, parentSpan))
            .switchIfEmpty(Mono.defer(() -&gt; produceAnswer(sanitizedPrompt, List.of(), embedding, parentSpan)));
}

private Mono&lt;GenerationResponse&gt; produceAnswer(
        String sanitizedPrompt,
        List&lt;RetrievedDoc&gt; docs,
        double[] embedding,
        Span parentSpan) {
    PromptBundle promptBundle = promptAssembler.assemble(sanitizedPrompt, docs);

    return llmClient.generate(promptBundle.prompt())
            .map(response -&gt; toGenerationResponse(response, promptBundle, false, parentSpan))
            .flatMap(response -&gt; cacheService.put(sanitizedPrompt, embedding, response, docs)
                    .thenReturn(response))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;LLM call failed, using fallback: {}&quot;, ex.getMessage());
                fallbackCounter.increment();
                TracingUtils.recordFallback(parentSpan, ex.getClass().getSimpleName());
                GenerationResponse fallback = fallbackResponse(docs, promptBundle.citationDetails());
                return cacheService.put(sanitizedPrompt, embedding, fallback, docs)
                        .onErrorResume(e -&gt; Mono.empty())
                        .thenReturn(fallback);
            });
}
</code></pre>
<h3 id="reactive-composition-explained">Reactive Composition Explained</h3>
<p>The flow uses reactive operators to compose asynchronous operations:</p>
<ol>
<li><strong><code>cacheService.lookup()</code></strong>: Check cache (non-blocking I/O to Redis)</li>
<li><strong><code>.flatMap(hit -&gt; onCacheHit())</code></strong>: If cache hit, return immediately</li>
<li><strong><code>.switchIfEmpty(Mono.defer(() -&gt; ...))</code></strong>: If cache miss, proceed to retrieval</li>
<li><strong><code>retrieverClient.retrieve()</code></strong>: Call retriever service (HTTP call)</li>
<li><strong><code>.flatMap(docs -&gt; produceAnswer())</code></strong>: Generate answer with retrieved docs</li>
<li><strong><code>llmClient.generate()</code></strong>: Call LLM (HTTP streaming)</li>
<li><strong><code>.flatMap(response -&gt; cacheService.put())</code></strong>: Cache the result</li>
<li><strong><code>.onErrorResume(ex -&gt; fallbackResponse())</code></strong>: Graceful degradation on error</li>
<li><strong><code>.doFinally()</code></strong>: Stop timer and close span (always executes)</li>
</ol>
<p>This composition is <strong>non-blocking</strong>. No threads wait on I/O. Spring WebFlux dispatches work efficiently across an event loop, enabling high concurrency with minimal thread overhead.</p>
<h2 id="component-summary">Component Summary</h2>
<p>Let&apos;s recap the key components and their roles:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Responsibility</th>
<th>Technology</th>
<th>Scaling Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Orchestrator</strong></td>
<td>Request coordination, caching, streaming</td>
<td>Spring WebFlux, Redis</td>
<td>Horizontal (stateless except cache)</td>
</tr>
<tr>
<td><strong>Retriever</strong></td>
<td>Hybrid search (vector + lexical)</td>
<td>Spring WebFlux, Weaviate, OpenSearch</td>
<td>HPA (CPU) + KEDA (RPS)</td>
</tr>
<tr>
<td><strong>Semantic Cache</strong></td>
<td>Similarity-based response caching</td>
<td>Redis, cosine similarity</td>
<td>Vertical (single instance for consistency)</td>
</tr>
<tr>
<td><strong>Vector Store</strong></td>
<td>Semantic document search</td>
<td>Weaviate</td>
<td>Managed/external service</td>
</tr>
<tr>
<td><strong>Lexical Store</strong></td>
<td>Fallback keyword search</td>
<td>OpenSearch</td>
<td>Managed/external service</td>
</tr>
<tr>
<td><strong>LLM Serving</strong></td>
<td>Model inference with GPU</td>
<td>KServe, vLLM, Gemma-2-2B</td>
<td>KServe autoscaling (scale-to-zero)</td>
</tr>
<tr>
<td><strong>Observability</strong></td>
<td>Metrics, traces, dashboards</td>
<td>Prometheus, Tempo, Grafana, OTEL</td>
<td>N/A (infrastructure)</td>
</tr>
</tbody>
</table>
<h2 id="why-this-architecture-scales">Why This Architecture Scales</h2>
<p>The components here aren&apos;t just for demos; they&apos;re a solid starting point for building production systems:</p>
<ol>
<li><strong>Service Isolation</strong>: Retriever and orchestrator scale independently</li>
<li><strong>Reactive Programming</strong>: Non-blocking I/O maximizes throughput</li>
<li><strong>Timeouts Everywhere</strong>: Aggressive timeouts prevent cascading failures</li>
<li><strong>Graceful Degradation</strong>: Fallbacks at every layer (cache &#x2192; retrieval &#x2192; generation)</li>
<li><strong>Observability-First</strong>: Traces and metrics built into every code path</li>
<li><strong>Cost Awareness</strong>: Semantic caching reduces LLM spend by ~40-60%</li>
</ol>
<p>When traffic spikes, the retriever scales out (2&#x2192;30 replicas). When traffic drops, KServe scales the LLM to zero. When Weaviate slows down, OpenSearch takes over. When the LLM fails, deterministic fallbacks keep users informed.</p>
<p>Every point of failure has a fallback, keeping the experience resilient and useful for users.</p>
<p>The complete code is available at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>.</p>
<p>Stay tuned for the final part: a Kubernetes deep dive.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Understanding the Problem Space]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I&apos;ve been quiet on this blog for a while now. Truth is, I lost my appetite for writing these past months. Between traveling to conferences, delivering talks, and shipping some cool features at work, the keyboard just didn&apos;t feel the same. There was also this nagging</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part1/</link><guid isPermaLink="false">6911ffc7cda49600011ec284</guid><category><![CDATA[Java]]></category><category><![CDATA[artificial intelligence]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 10 Nov 2025 16:00:00 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/unnamed.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/unnamed.jpg" alt="Building Production-Grade RAG Systems: Understanding the Problem Space"><p>I&apos;ve been quiet on this blog for a while now. Truth is, I lost my appetite for writing these past months. Between traveling to conferences, delivering talks, and shipping some cool features at work, the keyboard just didn&apos;t feel the same. There was also this nagging voice in my head: AI content has taken over the world: why bother writing another blog post when an LLM-generated version will probably be better anyway?</p>
<p>But here&apos;s the thing: as I kept hacking away on projects, I stumbled across posts that made me pause. Posts that weren&apos;t just technically correct; they had personality and insights born from battle scars, the kind of stuff one can&apos;t prompt-engineer. And I realised: maybe that&apos;s exactly what&apos;s missing. Stories from developers doing real work. So this is my modest attempt at bringing this blog back to life.</p>
<p>This post kicks off a three-part series where I dive deep into something I&apos;ve been building over the past few weeks: production-grade RAG applications. The kind that survives (hopefully) production traffic, handles failures gracefully, and doesn&apos;t bankrupt us on LLM costs. Along the way, I&apos;ll share the lessons I learned (some the hard way).</p>
<p>When we think about building a Retrieval Augmented Generation (RAG) system, the first instinct is often to grab a vector database, throw in some embeddings, connect an LLM, and call it a day. I&apos;ve been there. But production RAG systems are an entirely different beast. The gap between a proof-of-concept and a system that can handle real user traffic, maintain acceptable latency, and provide reliable answers is wider than what I initially (naively) thought.</p>
<p>This is the first post in a three-part series where I&apos;ll walk you through building a production-inspired RAG pipeline using Java 25, Spring Boot 3.5.7 (Spring AI), and Kubernetes. We&apos;ll explore not just the happy path, but the real challenges: graceful degradation, semantic caching, hybrid retrieval strategies, observability, and intelligent autoscaling.</p>
<h2 id="the-rag-promise-and-reality">The RAG Promise and Reality</h2>
<p>Retrieval Augmented Generation fundamentally solves a critical problem with Large Language Models: hallucinations and knowledge staleness. Instead of relying solely on the model&apos;s knowledge, RAG systems ground responses in retrieved documents. The architecture is conceptually simple:</p>
<ol>
<li>User asks a question</li>
<li>System retrieves relevant documents from a knowledge base</li>
<li>Documents are injected into the LLM prompt as context</li>
<li>LLM generates an answer grounded in the retrieved facts</li>
<li>Response is returned with citations</li>
</ol>
<p>But here&apos;s where things get interesting. This simple flow hides a multitude of production concerns that can make or break a system.</p>
<h2 id="what-goes-wrong-in-production-rag-systems">What Goes Wrong in Production RAG Systems?</h2>
<p>There are a few things that ruin your sleep and wake you up at 3am with an alert:</p>
<h3 id="the-latency-problem">The Latency Problem</h3>
<p>It obviously depends on your SLO, but traditional RAG systems often stack latencies in series: retrieval time + embedding time + LLM inference time + response streaming. Each step adds milliseconds (or worse, seconds) to your user&apos;s waiting time. When your vector database takes 300ms to return results, your LLM takes 2 seconds for first token, and you&apos;re processing embeddings for cache lookups, you&apos;re looking at 3+ seconds before users see anything.</p>
<p>Users might expect sub-second responses. Anything beyond 2 seconds feels broken. Again, it largely depends on your SLOs.</p>
<h3 id="the-reliability-problem">The Reliability Problem</h3>
<p>Vector databases can timeout. LLMs can be overloaded. Networks can fail. In a traditional RAG pipeline, any single component failure means the entire request fails. That&apos;s unacceptable in production.</p>
<p>What happens when Weaviate is under load and the semantic search times out? Do we return an error? Or do we have a fallback strategy that still delivers value to our users?</p>
<h3 id="the-cost-problem">The Cost Problem</h3>
<p>Every LLM call costs money. Every embedding calculation burns CPU cycles. When users ask the same question five different ways (&quot;How do I deploy to K8s?&quot;, &quot;What&apos;s the Kubernetes deployment process?&quot;, &quot;K8s deployment steps?&quot;), we&apos;re essentially paying for the same answer multiple times.</p>
<p>Even worse, we&apos;re making our users wait for responses we&apos;ve already computed.</p>
<h3 id="the-quality-problem">The Quality Problem</h3>
<p>Vector similarity alone isn&apos;t always enough. Sometimes lexical matching (good old BM25) finds documents that semantic search misses&#x2014;especially for exact terms, acronyms, or technical identifiers. Relying solely on embeddings can leave quality on the table.</p>
<h3 id="the-observability-problem">The Observability Problem</h3>
<p>When the RAG pipeline misbehaves&#x2014;returning poor answers, experiencing high latency, or burning through our LLM budget&#x2014;how do we debug it? Traditional application monitoring doesn&apos;t capture the nuances of retrieval quality, cache hit rates, or generation costs.</p>
<p>We need visibility into every stage of the pipeline, from retrieval to generation, with metrics that actually matter for RAG workloads.</p>
<h2 id="the-blueprint">The Blueprint (?)</h2>
<p>I have been searching for patterns and best practices on how to build a RAG system that addresses each of these concerns. For me, the target architecture wasn&apos;t just about making things work! It&apos;s about making them work reliably, cost-effectively, and observably at scale.</p>
<p>Here&apos;s the high-level blueprint:</p>
<p><img src="https://aboullaite.me/content/images/2025/11/Personal-2025-11-10-150053.png" alt="Building Production-Grade RAG Systems: Understanding the Problem Space" loading="lazy"></p>
<p>Let&apos;s break down how this architecture solves each problem:</p>
<h3 id="solving-latency-semantic-caching">Solving Latency: Semantic Caching</h3>
<p>Before doing any expensive operations, we check Redis for semantically similar queries. The cache doesn&apos;t just match exact strings&#x2014;it computes cosine similarity between query embeddings. If a user asks &quot;How does autoscaling work?&quot; and we&apos;ve previously answered &quot;Explain the autoscaling mechanism&quot;, we detect that similarity (threshold <code>0.90</code> for example ) and return the cached response immediately.</p>
<p>This short-circuits the entire pipeline. No retrieval. No LLM call. Sub-100ms response times.</p>
<p>The cache stores:</p>
<ul>
<li>Normalized query text (with PII redaction)</li>
<li>Deterministic query embedding (8-dimensional for demo purposes)</li>
<li>Complete generated answer</li>
<li>Citation list</li>
<li>Retrieved document IDs</li>
<li>Timestamp for observability</li>
</ul>
<p>Cache entries expire after 10 minutes by default, keeping answers fresh as documentation evolves.</p>
<h3 id="solving-reliability-layered-fallbacks">Solving Reliability: Layered Fallbacks</h3>
<p>The system implements graceful degradation at every level:</p>
<p><strong>Retriever Fallback</strong>: Weaviate has a strict timeout budget (<code>250ms</code>). If it doesn&apos;t respond in time, the retriever automatically falls back to OpenSearch for lexical BM25 search. The user still gets an answer... maybe not the semantically perfect one, but a relevant one based on keyword matching.</p>
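<p>The pattern itself is small: run the primary search under a strict deadline, and switch to the lexical store on timeout or error. A standalone sketch (hypothetical names; the real code uses a reactive pipeline rather than executors):</p>

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical standalone sketch of "primary search with a strict timeout
// budget, fall back to lexical search on timeout or error".
public class FallbackSearch {

    public static List<String> search(Callable<List<String>> vectorSearch,
                                      Callable<List<String>> lexicalSearch,
                                      long budgetMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<List<String>> primary = pool.submit(vectorSearch);
            try {
                return primary.get(budgetMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                primary.cancel(true);        // stop waiting on the slow vector store
                return lexicalSearch.call(); // degrade to BM25 keyword results
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

<p>In Reactor, the same shape is a <code>timeout(Duration.ofMillis(250))</code> followed by <code>onErrorResume</code> to the lexical client.</p>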
<p><strong>Generator Fallback</strong>: If the LLM endpoint times out or returns an error, the orchestrator doesn&apos;t fail. Instead, it synthesizes a deterministic answer by summarizing the top retrieved chunks, clearly marking it as partial and including citations. Users get actionable information even when the model is unavailable.</p>
<p><strong>Streaming Resilience</strong>: Server-Sent Events (SSE) provide progressive rendering. Users see tokens as they&apos;re generated, and the final event includes citations and a <code>partial</code> flag indicating any degradation.</p>
<p>Every fallback event is instrumented: emitting OpenTelemetry spans with attributes like <code>rag.fallback.reason=weaviate-timeout</code> so we can measure how often each degradation path triggers.</p>
<h3 id="solving-cost-intelligent-caching-and-deduplication">Solving Cost: Intelligent Caching and Deduplication</h3>
<p>Semantic caching isn&apos;t just about latency&#x2014;it&apos;s about cost. LLM calls are expensive. With a well-tuned cache, we can reduce redundant generation by 40-60% depending on the query distribution.</p>
<p>The cache uses deterministic embeddings for the demo (SHA-256 based hashing producing 8-dimensional vectors), but in production we&apos;d use proper sentence embeddings. The key insight is that cosine similarity &gt; <code>0.90</code> means &quot;close enough&quot; to reuse the answer.</p>
<p>Beyond caching, we instrument every request with estimated cost metrics. At $0.002 per 1K tokens (approximate Gemma-2 pricing), a Grafana dashboard visualizes cost-per-request trends, helping you optimize both caching and prompt engineering.</p>
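<p>The per-request estimate itself is simple arithmetic. A sketch using the approximate rate above (the constant is an assumed figure for illustration, not real pricing):</p>

```java
// Rough cost estimate at a flat per-1K-token rate (assumed figure for
// illustration; substitute your provider's actual pricing).
public class CostEstimator {
    private static final double USD_PER_1K_TOKENS = 0.002;

    public static double estimateUsd(long totalTokens) {
        return (totalTokens / 1000.0) * USD_PER_1K_TOKENS;
    }
}
```

<p>Recording this value on a counter per request is what feeds the cost-per-request panels in Grafana.</p>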
<h3 id="solving-quality-hybrid-retrieval">Solving Quality: Hybrid Retrieval</h3>
<p>Vector search excels at semantic similarity but can miss exact matches for technical terms, version numbers, or acronyms. OpenSearch provides lexical BM25 ranking that captures these cases.</p>
<p>The retriever service prioritizes vector search (Weaviate) but automatically falls back to lexical search (OpenSearch) when vector queries timeout. This hybrid approach ensures we get the best of both worlds: semantic understanding when available, lexical precision when needed.</p>
<p>In future iterations, we could combine both signals using a re-ranking model, but for many workloads, the fallback strategy alone provides sufficient quality.</p>
<h3 id="solving-observability-opentelemetry-prometheus-grafana">Solving Observability: OpenTelemetry + Prometheus + Grafana</h3>
<p>Every request flows through instrumented code paths. The observability stack captures:</p>
<p><strong>Traces (OpenTelemetry + Tempo)</strong>: End-to-end request traces showing retrieval time, document count, cache decisions, LLM first-token latency, and total tokens generated. Custom span attributes include:</p>
<ul>
<li><code>rag.cache.hit</code>: boolean</li>
<li><code>rag.retrieval.count</code>: number of documents retrieved</li>
<li><code>rag.retrieval.source</code>: &quot;weaviate&quot; or &quot;opensearch&quot;</li>
<li><code>rag.fallback.reason</code>: why degradation occurred</li>
<li><code>rag.model.name</code>: which LLM was used</li>
<li><code>rag.tokens.total</code>: generated token count</li>
<li><code>rag.ttft_ms</code>: time to first token</li>
</ul>
<p><strong>Metrics (Prometheus + Grafana)</strong>: Counters and histograms for:</p>
<ul>
<li><code>rag_orchestrator_latency</code>: p50/p95/p99 end-to-end latency</li>
<li><code>rag_cache_hit_total</code> / <code>rag_cache_miss_total</code>: cache efficiency</li>
<li><code>rag_retrieval_fallback_total</code>: how often fallback triggered</li>
<li><code>rag_tokens_generated_total</code>: token consumption trends</li>
<li><code>rag_cost_usd_total</code>: estimated spend per request</li>
</ul>
<p>Grafana dashboards visualize these metrics alongside autoscaling replica counts, giving operators a complete view of system behavior under load.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Whether you&apos;re building internal documentation search, customer support automation, or code assistance tools, you&apos;ll probably face the same tradeoffs around latency, reliability, cost, and quality.</p>
<p>The architecture I&apos;m suggesting here worked for me, and might be useful for someone on the internet facing the same challenges. It&apos;s built for <strong>failure</strong>.</p>
<p>The complete code is available at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>. You can run the entire stack locally with <code>make dev-up</code> or deploy to GKE with <code>make gke-cluster</code>.</p>
<p>Stay tuned. The next post will get hands-on with code and architectural patterns.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A look into Deep Java Library!]]></title><description><![CDATA[<p>When you think about building machine learning apps, Java is not the first language that comes to mind, probably not even in the top 3 or 5! But Java has proved time and again that it is capable of modernising itself, and even if it&apos;s not the first</p>]]></description><link>https://aboullaite.me/djl-ml-java/</link><guid isPermaLink="false">64862b43cda49600011ec11d</guid><category><![CDATA[Java]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[artificial intelligence]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 12 Jun 2023 18:10:42 GMT</pubDate><content:encoded><![CDATA[<p>When you think about building machine learning apps, Java is not the first language that comes to mind, probably not even in the top 3 or 5! But Java has proved time and again that it is capable of modernising itself, and even if it&apos;s not the first choice for many use cases, it offers a choice for the 10 million developers that are using it.</p><p>A few weeks back I started exploring a new Java library called <a href="https://djl.ai">DJL</a>, an open source, engine-agnostic Java framework for deep learning.
In this post we&apos;re going to explore some of DJL&apos;s capabilities by building a speech recognition application.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Crooked Lake, IN - 7/4/17" loading="lazy" width="5472" height="3648" srcset="https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@hharritt?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Hunter Harritt</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="deep-java-library">Deep Java Library</h2><p>DJL was first released in 2019 by Amazon Web Services, aiming to offer a simple-to-use, easy-to-get-started machine learning framework for Java developers. It offers multiple Java APIs that simplify training, testing, deploying, analysing, and predicting with deep-learning models.</p><p>DJL APIs abstract away the complexity involved in developing deep learning models, making them easy to learn and easy to apply. With the bundled set of pre-trained models in the <a href="https://github.com/deepjavalibrary/djl/blob/master/docs/model-zoo.md?ref=aboullaite.me">model-zoo</a>, users can immediately start integrating deep learning into their Java applications.</p><h2 id="showtime">Showtime</h2><p>As I mentioned earlier, we&apos;re building a small speech recognition application. The backend is built using Java 17 and Spring Boot 3.1. The frontend is built with React 18.2. The full application code is shared in <a href="https://github.com/aboullaite/djl-demo?ref=aboullaite.me">this repo</a>.</p><!--kg-card-begin: html--><div style="width:100%;height:0;padding-bottom:61%;position:relative;"><iframe src="https://giphy.com/embed/Z5s6b6dwbVo3sdtYj6" width="100%" height="100%" style="position:absolute" frameborder="0" class="giphy-embed" allowfullscreen></iframe></div><p><a href="https://giphy.com/gifs/Z5s6b6dwbVo3sdtYj6?ref=aboullaite.me">via GIPHY</a></p><!--kg-card-end: html--><h2 id="backend-configuration">Backend configuration</h2><!--kg-card-begin: markdown--><p>First of all, we need to add the necessary DJL dependencies. I am using DJL version <code>0.22.1</code>, the latest release as of this writing. We need two specific DJL dependencies for this application:</p>
<ul>
<li><code>djl-api</code>: the DJL core API.</li>
<li><code>pytorch-engine</code>: the DJL implementation of the PyTorch engine, enabling us to load and run PyTorch-built models.</li>
</ul>
<pre><code class="language-xml">    &lt;dependency&gt;
      &lt;groupId&gt;ai.djl&lt;/groupId&gt;
      &lt;artifactId&gt;api&lt;/artifactId&gt;
      &lt;version&gt;${djl.version}&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;ai.djl.pytorch&lt;/groupId&gt;
      &lt;artifactId&gt;pytorch-engine&lt;/artifactId&gt;
      &lt;version&gt;${djl.version}&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<p>Next, we need to configure DJL and specify which model we want to use for inference (prediction).<br>
The <code>loadModel</code> method builds a <code>Criteria</code> object to locate the model we want to use. In the Criteria we specified:</p>
<ul>
<li>Engine: which engine the model should be loaded with, <code>PyTorch</code> in our case</li>
<li>Input/output data types: the desired input type (<code>Audio</code> in our example) and output type (a <code>String</code> transcription)</li>
<li>Model URL: where the model is located</li>
<li>Translator: custom pre- and post-processing logic applied by the <code>ZooModel</code></li>
</ul>
<p>We then load the pre-trained model through the <a href="https://javadoc.io/doc/ai.djl/api/latest/ai/djl/repository/zoo/ModelZoo.html?ref=aboullaite.me">ModelZoo</a> directly from a URL for convenience. The model we&apos;ll be using is <a href="https://arxiv.org/abs/2006.11477?ref=aboullaite.me">wav2vec 2.0</a>, a speech model that accepts a float array corresponding to the raw waveform of the speech signal.</p>
<pre><code class="language-java">@Configuration
public class ModelConfiguration {

  private static final Logger LOG = LoggerFactory.getLogger(ModelConfiguration.class);
  
  @Bean
  public ZooModel&lt;Audio, String&gt; loadModel() throws IOException, ModelException, TranslateException {
    // Load model.
    String url = &quot;https://resources.djl.ai/test-models/pytorch/wav2vec2.zip&quot;;
    Criteria&lt;Audio, String&gt; criteria =
        Criteria.builder()
            .setTypes(Audio.class, String.class)
            .optModelUrls(url)
            .optTranslatorFactory(new SpeechRecognitionTranslatorFactory())
            .optModelName(&quot;wav2vec2.ptl&quot;)
            .optEngine(&quot;PyTorch&quot;)
            .build();

    return criteria.loadModel();
  }

  @Bean
  public Supplier&lt;Predictor&lt;Audio, String&gt;&gt; predictorProvider(ZooModel&lt;Audio, String&gt; model) {
    return model::newPredictor;
  }

}
</code></pre>
<p>That&apos;s pretty much all the configuration we need in order to start using our model. The service class simply calls the <code>predictor</code> for inference.</p>
<pre><code class="language-java">  @Resource
  private Supplier&lt;Predictor&lt;Audio, String&gt;&gt; predictorProvider;

  public String predict(InputStream stream) throws IOException, ModelException, TranslateException {
    Audio audio = AudioFactory.newInstance().fromInputStream(stream);

    try (var predictor = predictorProvider.get()) {
      return predictor.predict(audio);
    }
  }
</code></pre>
<p>The rest is pretty much standard Spring Boot configuration.</p>
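<p>For completeness, the HTTP layer can be a single endpoint that hands the uploaded audio to the service above. This is a sketch only: the controller name, route, request field and <code>SpeechService</code> wrapper are illustrative, not taken from the repo.</p>
<pre><code class="language-java">@RestController
public class TranscriptionController {

  // Wraps the predict(InputStream) method shown above
  @Resource
  private SpeechService speechService;

  @PostMapping(&quot;/api/transcribe&quot;)
  public String transcribe(@RequestParam(&quot;audio&quot;) MultipartFile file) throws Exception {
    // Forward the uploaded wav file to the DJL predictor
    return speechService.predict(file.getInputStream());
  }
}
</code></pre>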
<!--kg-card-end: markdown--><p><strong>Frontend Configuration</strong></p><p>The frontend makes use of the amazing <a href="https://github.com/jiwenjiang/react-audio-analyser?ref=aboullaite.me">react-audio-analyser library</a>, offering the possibility to record audio from the browser and convert it to <em>wav</em> format. The rest is pretty much straightforward, only making a REST call to the transcription endpoint and showing the result in the browser.</p>]]></content:encoded></item><item><title><![CDATA[Pixie, the missing developer observability tool!]]></title><description><![CDATA[<p>Needless to say how important monitoring and observability is, especially in a cloud native, distributed world! No system should go to production without having monitoring tools in place.<br>On the other hand, the devops movement and cloud native era introduced a plethora of tools to run, deploy and monitor our</p>]]></description><link>https://aboullaite.me/pixie-observability/</link><guid isPermaLink="false">647342fecda49600011ebf64</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 28 May 2023 16:27:47 GMT</pubDate><content:encoded><![CDATA[<p>Needless to say how important monitoring and observability is, especially in a cloud native, distributed world! No system should go to production without having monitoring tools in place.<br>On the other hand, the devops movement and cloud native era introduced a plethora of tools to run, deploy and monitor our applications, which drastically increased their complexity. 
</p><p>With the increased number of tools and the complexity of our architectures, developers face an ever-growing challenge to debug their systems, spot bottlenecks, identify hotspots, and improve system performance.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Ready for Takeoff" loading="lazy" width="4928" height="3264" srcset="https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 1200px) 1200px"><figcaption>Photo by <a href="https://unsplash.com/@valeon?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Mitchel Boot</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="enter-pixie">Enter
Pixie!</h2><p>I recently stumbled upon a new CNCF tool called <a href="https://www.cncf.io/projects/pixie/?ref=aboullaite.me">Pixie</a>, an open source observability tool for Kubernetes applications. Pixie was contributed by New Relic to the CNCF in 2021.</p><p>What triggered my interest in Pixie is, unlike other observability tools (at least that I know of), its focus on developers and DX (developer experience). Pixie offers both a high-level overview of the k8s cluster, as well as a drill-down, more granular and detailed view of the health and performance of your app.</p><p>Pixie uses <a href="https://docs.px.dev/about-pixie/pixie-ebpf?ref=aboullaite.me">eBPF</a> to collect metrics and events, without the need for manual instrumentation (code changes, redeploys ...). Pixie also stores and computes telemetry data in-memory within the cluster. Collected data is retained for up to 24h, with the possibility to export it in the <a href="https://opentelemetry.io/?ref=aboullaite.me">OpenTelemetry</a> format to your favorite monitoring tool for long-term retention.</p><p>The heavy-lifting-done-locally approach that Pixie takes comes with tradeoffs nevertheless. It has the advantage of ensuring better security (no data needs to leave your cluster) and scalability. However, the performance overhead is between 2-5% of node CPU usage, as Pixie claims, and at least 1GiB of memory is required per node. </p><h2 id="pixie-in-action">Pixie in action!</h2><p>For the demo, I created a standard k8s cluster in <a href="https://cloud.google.com/kubernetes-engine?ref=aboullaite.me">GKE</a>, since Autopilot mode is still <a href="https://github.com/pixie-io/pixie/issues/278?ref=aboullaite.me">not supported</a> in Pixie. </p><!--kg-card-begin: markdown--><p>Installing Pixie is pretty straightforward, just run:</p>
<pre><code class="language-bash">$ bash -c &quot;$(curl -fsSL https://withpixie.ai/install.sh)&quot;
</code></pre>
<p>A prompt will appear asking you to sign in or register for a Pixie account.<br>
Once authenticated, we can deploy Pixie on our GKE cluster using:</p>
<pre><code class="language-bash">$ px deploy
</code></pre>
<p>This installs, among other things, <a href="https://docs.px.dev/reference/architecture?ref=aboullaite.me#vizier">Vizier</a>, Pixie&apos;s data plane, responsible for collecting and processing data within the cluster being monitored.</p>
<!--kg-card-end: markdown--><p>For convenience, I reused the manifests from <a href="https://github.com/aboullaite/service-mesh/tree/master/1-deploy-app?ref=aboullaite.me">my service mesh demo</a>, based on the <a href="https://github.com/microservices-demo/microservices-demo?ref=aboullaite.me">sock-shop microservices app</a> from Weaveworks.</p><p>Pixie supports three ways of interacting with the platform: the <a href="https://docs.px.dev/using-pixie/using-cli/?ref=aboullaite.me">CLI</a>, the web-based <a href="https://docs.px.dev/using-pixie/using-live-ui?ref=aboullaite.me">Live UI</a> or the <a href="https://docs.px.dev/using-pixie/api-quick-start?ref=aboullaite.me">API</a>. Unsurprisingly, the web UI is the easiest and most intuitive way to interact with Pixie and check your data, especially if you are new to it.</p><p>Once connected to the <a href="https://work.withpixie.ai/auth/login?ref=aboullaite.me" rel="noopener noreferrer">Pixie Console UI</a>, you&apos;d need to select which cluster to interact with and which script to execute. PxL scripts use the <a href="https://docs.px.dev/reference/pxl?ref=aboullaite.me">Pixie Language</a> (PxL), a DSL to query cluster data and transform/visualize metrics. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.00.38.png" class="kg-image" alt loading="lazy" width="1325" height="643" srcset="https://aboullaite.me/content/images/size/w600/2023/05/Screenshot-2023-05-28-at-18.00.38.png 600w, https://aboullaite.me/content/images/size/w1000/2023/05/Screenshot-2023-05-28-at-18.00.38.png 1000w, https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.00.38.png 1325w" sizes="(min-width: 720px) 720px"><figcaption>Pixie dashboard</figcaption></figure><!--kg-card-begin: markdown--><p>The Pixie CLI is as fun to play with as the web UI; it is rich and interactive. 
You can use <code>px help</code> to list all Pixie CLI options, and <code>px scripts list</code> to list all built-in scripts. Below is an image of running the <code>px live px/http_data</code> script, which shows a sample of the HTTP/2 traffic flowing through your cluster. Notice the link at the top that sends you to the web UI, which makes it very convenient to go back and forth.<br>
<img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.11.14.png" alt="Screenshot-2023-05-28-at-18.11.14" loading="lazy"></p>
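<p>To give a flavour of PxL, here is a minimal script (a sketch; the table and column names follow the px documentation and may differ in newer releases) that samples the last five minutes of HTTP events and shows them per pod:</p>
<pre><code class="language-python">import px

# Pull recent HTTP events captured by Pixie's eBPF probes
df = px.DataFrame(table='http_events', start_time='-5m')
df.pod = df.ctx['pod']
df = df[['pod', 'req_path', 'resp_status', 'latency']]
px.display(df)
</code></pre>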
<p>A great example of Pixie usage is application profiling, to detect hotspots and analyse CPU spikes. Pixie&apos;s <code>px/pod</code> script gives an overview of high-level application metrics (latency, errors, throughput ...) and resource utilization for the selected pod. What excited me is the <em>Pod Performance Flamegraph</em> at the end of the page, which is greatly useful for identifying performance issues. Below you can see a CPU spike at the start of the Java orders app while the JVM warms up and the <a href="https://aboullaite.me/understanding-jit-compiler-just-in-time-compiler/">JIT</a> compiler runs, slowly cooling down as compilation finishes.<br>
<img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-27-at-18.55.14.png" alt="Screenshot-2023-05-27-at-18.55.14" loading="lazy"></p>
<!--kg-card-end: markdown--><p>Those are just a few of the many features and options that Pixie offers (which I am still uncovering myself). Head over to <a href="https://docs.px.dev/?ref=aboullaite.me">the documentation page</a> to read more about it!</p>]]></content:encoded></item><item><title><![CDATA[What the CRaC ?!]]></title><description><![CDATA[<p>If you&apos;ve been following the news lately in the Java ecosystem (aside from Java&apos;s 28th anniversary), you should&apos;ve heard of <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">CRaC</a>. Two big announcements were revealed this week:</p><ul><li>Azul announced earlier this week the general availability of and commercial support for <a href="https://www.azul.com/products/components/crac/?ref=aboullaite.me">Azul Zulu Builds of OpenJDK</a></li></ul>]]></description><link>https://aboullaite.me/what-the-crac/</link><guid isPermaLink="false">6467a076cda49600011ebca1</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sat, 20 May 2023 13:34:26 GMT</pubDate><content:encoded><![CDATA[<p>If you&apos;ve been following the news lately in the Java ecosystem (aside from Java&apos;s 28th anniversary), you should&apos;ve heard of <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">CRaC</a>. Two big announcements were revealed this week:</p><ul><li>Azul announced earlier this week the general availability of and commercial support for <a href="https://www.azul.com/products/components/crac/?ref=aboullaite.me">Azul Zulu Builds of OpenJDK for Java 17 including CRaC</a> functionality.</li><li>The next release of the Spring framework, 6.1, will add support for CRaC. 
</li></ul><p>If you are wondering what CRaC is all about, I got you covered, read on :)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Flying through the water!" loading="lazy" width="3072" height="2048" srcset="https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@joshcala?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Josh Calabrese</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="what-is-crac">What is CRaC?!</h2><h3 id="explain-it-like-i-am-6-years-old-sort-of">Explain it like I am 6 years old (Sort of!)</h3><p>In a world where streaming services are omnipresent, we can stop watching a video whenever we want, and we
expect to resume from (almost) the same position where we left off, even on another device. Imagine if we could apply the same analogy to our running applications: take a snapshot (pause) of the running state, and restore (resume) it on another server.</p><h3 id="in-more-technical-terms">In more technical terms </h3><p>CRaC stands for Coordinated Restore at Checkpoint. It is an <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">OpenJDK project</a>, developed by Azul Systems, with the aim of speeding up JVM startup time by capturing/freezing its running state after all the heavy lifting is performed (loading classes, JIT compilation, code optimizations...), serializing that state to disk (Checkpoint), and resuming it later from that state (Restore) to run exactly as it was at the time of the freeze.</p><p>CRaC uses <a href="https://criu.org/?ref=aboullaite.me">CRIU</a> technology under the hood to perform its magic. CRIU is a C library facilitating the implementation of checkpoint/restore functionalities for Linux, and the maintainers <a href="https://criu.org/Comparison_to_other_CR_projects?ref=aboullaite.me">claim</a> it is the most feature-rich and up-to-date with the kernel for implementing CR in Linux.</p><!--kg-card-begin: markdown--><p>CRIU is also the technology behind the <code>docker checkpoint</code> <a href="https://docs.docker.com/engine/reference/commandline/checkpoint/?ref=aboullaite.me">experimental command</a>, allowing you to take serializable snapshots of a running container and recreate them later (even on another host). Podman has a <a href="https://docs.podman.io/en/latest/markdown/podman-container-checkpoint.1.html?ref=aboullaite.me">similar feature</a> with <code>podman container checkpoint</code>. Similarly, CRIU has support for <a href="https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/?ref=aboullaite.me">Kubernetes</a> and <a href="https://criu.org/LXC?ref=aboullaite.me">LXC/LXD</a> as well.</p>
<p>In the Java space, CRIU is also used in <a href="https://www.eclipse.org/openj9/?ref=aboullaite.me">OpenJ9</a> to improve JVM startup time, and empower <a href="https://openliberty.io/blog/2022/09/29/instant-on-beta.html?ref=aboullaite.me">InstantOn</a> Project from <a href="https://openliberty.io/?ref=aboullaite.me">Open Liberty</a></p>
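<p>To make the container analogy concrete, a <code>docker checkpoint</code> session looks like the following (a sketch: it requires a Linux host with CRIU installed and the Docker experimental flag enabled, and the container name is illustrative):</p>
<pre><code class="language-bash">$ docker run -d --name counter busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'
$ docker checkpoint create counter cp1
$ docker start --checkpoint cp1 counter
</code></pre>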
<!--kg-card-end: markdown--><h2 id="showtime">Showtime</h2><p>CRaC is only available on Linux, so in order to run this demo you&apos;d need a Linux machine. I tried to use Docker on Mac but had little success and stumbled upon many issues.</p><p>I am also using this simple Spring Boot <a href="https://github.com/sdeleuze/spring-boot-crac-demo?ref=aboullaite.me">code</a> showcasing the upcoming support for CRaC in Spring framework 6.1. Kudos to the Spring team and <a href="https://twitter.com/sdeleuze?ref=aboullaite.me" rel="nofollow me">@sdeleuze</a> for the amazing work.</p><!--kg-card-begin: markdown--><p>First, you&apos;d need to install the recently available Zulu JDK with CRaC support. You can either install it manually or via sdkman:</p>
<pre><code class="language-bash">$ sdk install java 17.0.7.crac-zulu
</code></pre>
<p>Next, we need to build the project by running the command below. This assumes you have already cloned the project and that it is your current directory.</p>
<pre><code class="language-bash">$ mvn clean verify
</code></pre>
<p>Once it finishes building, we can run our app using</p>
<pre><code class="language-bash">$ java -XX:CRaCCheckpointTo=./crac-files -jar target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
</code></pre>
<p>Notice anything new? The Java argument <code>-XX:CRaCCheckpointTo=path</code> tells the JVM to enable CRaC and defines the path where the checkpoint image will be stored.<br>
If everything goes as expected, the app should be running after a few seconds. Make sure to hit it with a few requests to warm up the application:</p>
<pre><code class="language-bash">$ curl localhost:8080
Greetings from Spring Boot!
</code></pre>
<p>Now leave your app running (or run it in the background), and in another terminal, use the <code>jcmd</code> command to trigger a checkpoint:</p>
<pre><code class="language-bash">$ jcmd target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar JDK.checkpoint
201931:
CR: Checkpoint ...
</code></pre>
<p><code>201931</code> is the PID of our running Spring Boot app, which should now be stopped, as indicated in the logs:</p>
<pre><code class="language-log">2023-05-20T12:02:06.610Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping Spring-managed lifecycle beans before JVM checkpoint
2023-05-20T12:02:06.615Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 2147482623
2023-05-20T12:02:06.617Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;webServerGracefulShutdown&apos; completed its stop procedure
2023-05-20T12:02:06.617Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 2147481599
2023-05-20T12:02:06.618Z  INFO 201931 --- [Attach Listener] org.eclipse.jetty.server.Server          : Stopped Server@53f3bdbd{STOPPING}[11.0.15,sto=0]
2023-05-20T12:02:06.624Z  INFO 201931 --- [Attach Listener] o.e.jetty.server.AbstractConnector       : Stopped ServerConnector@1a4927d6{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2023-05-20T12:02:06.629Z  INFO 201931 --- [Attach Listener] o.e.j.s.h.ContextHandler.application     : Destroying Spring FrameworkServlet &apos;dispatcherServlet&apos;
2023-05-20T12:02:06.632Z  INFO 201931 --- [Attach Listener] o.e.jetty.server.handler.ContextHandler  : Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@35399441{application,/,[file:///tmp/jetty-docbase.8080.3095195653033098747/],STOPPED}
2023-05-20T12:02:06.638Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;webServerStartStop&apos; completed its stop procedure
2023-05-20T12:02:06.639Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase -2147483647
2023-05-20T12:02:06.640Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;springBootLoggingLifecycle&apos; completed its stop procedure
Killed
</code></pre>
<p>Inspecting your files now, you should see the <code>crac-files</code> folder created with different <code>.img</code> files. Those are all the images generated by the checkpoint operation, and they can be inspected using the <a href="https://criu.org/CRIT?ref=aboullaite.me">crit</a> tool. If you are using Ubuntu, you can install the <code>crit</code> command-line tool as part of the <code>criu</code> package using <code>apt-get install criu</code>.</p>
<p><code>crit</code> is pretty handy for checking the contents of the images folder. We can, for example, check which process we checkpointed:</p>
<pre><code class="language-bash">$ crit x crac-files ps
    PID   PGID    SID   COMM
 201931 201931 201381   java
</code></pre>
<p>We can inspect checkpoint files descriptors:</p>
<pre><code class="language-bash">$ crit x crac-files fds
          201931
	      0: TTY.36
	      1: TTY.36
	      2: TTY.36
	      3: /root/.sdkman/candidates/java/17.0.7.crac-zulu/lib/modules
	      4: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      5: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      6: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      7: /home/maboullaite/spring-boot-crac-demo/crac-files/perfdata
	      8: /dev/random
	      9: /dev/urandom
	    cwd: /home/maboullaite/spring-boot-crac-demo
	   root: /
</code></pre>
<p>We can even dump the contents of a single image using <code>crit show</code>:</p>
<pre><code class="language-bash">$ crit show crac-files/core-201931.img
{
    &quot;magic&quot;: &quot;CORE&quot;,
    &quot;entries&quot;: [
        {
            &quot;mtype&quot;: &quot;X86_64&quot;,
            &quot;thread_info&quot;: {
                &quot;clear_tid_addr&quot;: &quot;0x7f19b70d5550&quot;,
                &quot;gpregs&quot;: {
                ...}
         &quot;thread_core&quot;: {
                &quot;futex_rla&quot;: 139748422014304,
                &quot;futex_rla_len&quot;: 24,
                &quot;sched_nice&quot;: 0,
                &quot;sched_policy&quot;: 0,
                &quot;sas&quot;: {
                    &quot;ss_sp&quot;: 0,
                    &quot;ss_size&quot;: 0,
                    &quot;ss_flags&quot;: 2
                },
                &quot;signals_p&quot;: {},
                &quot;creds&quot;: {
                    &quot;uid&quot;: 0,
                    &quot;gid&quot;: 0,
                    &quot;euid&quot;: 0,
                    &quot;egid&quot;: 0,
                    &quot;suid&quot;: 0,
                    &quot;sgid&quot;: 0,
                    &quot;fsuid&quot;: 0,
                    &quot;fsgid&quot;: 0,
                    &quot;cap_inh&quot;: [
                        0,
                        0
                    ],
                    &quot;cap_prm&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;cap_eff&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;cap_bnd&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;secbits&quot;: 0,
                    &quot;groups&quot;: [
                        0
                    ]
                },
                &quot;comm&quot;: &quot;java&quot;
            }
        }
    ]
}
</code></pre>
<p>The <code>crac-files</code> directory also contains log files, which are pretty handy when debugging issues.<br>
To restore our image and run the app from its saved state, we simply run:</p>
<pre><code class="language-bash">$ java -XX:CRaCRestoreFrom=./crac-files
</code></pre>
<p>This results in a lightning-fast start compared to the initial run.<br>
<img src="https://aboullaite.me/content/images/2023/05/IMG_AB599F58F78D-1.jpeg" alt="IMG_AB599F58F78D-1" loading="lazy"></p>
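<p>The &quot;Coordinated&quot; part of CRaC is worth a mention here: applications can participate in the checkpoint through the <code>org.crac</code> API, typically to close sockets or files before the snapshot and reopen them on restore. A minimal sketch (assuming the <code>org.crac</code> dependency is on the classpath; the class name and method bodies are illustrative):</p>
<pre><code class="language-java">import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class ConnectionLifecycle implements Resource {

    @Override
    public void beforeCheckpoint(Context&lt;? extends Resource&gt; context) throws Exception {
        // Close live connections so no open socket ends up in the image
        closeConnections();
    }

    @Override
    public void afterRestore(Context&lt;? extends Resource&gt; context) throws Exception {
        // Re-establish connections once the process is restored
        openConnections();
    }

    private void closeConnections() { /* illustrative */ }

    private void openConnections() { /* illustrative */ }

    public void register() {
        // The global context holds a weak reference, so keep this object reachable
        Core.getGlobalContext().register(this);
    }
}
</code></pre>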
<!--kg-card-end: markdown--><h2 id="but-what-about-aot-and-graal-native-images">But what about AoT and Graal Native images?</h2><p>Well, it is quite different. While native images achieve very fast startup times and a very small memory footprint, they aren&apos;t the cure to all problems. Native image generation requires that each class you need at runtime be made available at build time for the compilation to succeed, which might represent some challenges for Java developers. Debugging is another aspect where native images fall short.<br><br>CRaC (and similar tools) allows us to keep the JVM capabilities we&apos;re familiar with while benefiting from the fast startup needed for many cloud-native workloads. On the other hand, as <a href="https://twitter.com/thomaswue?s=21&amp;t=oF9ZqYERUY0-bcmBvB8dNA&amp;ref=aboullaite.me">Thomas</a> brought to my attention, the size of the snapshot is orders of magnitude bigger than the size of the native image.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Maybe you can also mention the size of the snapshot as another disadvantage compared to native image. Furthermore, in scenarios like serverless, there is no possibility to debug with regular Java mechanisms in production on many systems anyway.</p>&#x2014; Thomas Wuerthinger (thomaswue.dev) &#x1F499; (@thomaswue) <a href="https://twitter.com/thomaswue/status/1660030978734145537?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">May 20, 2023</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</figure><h2 id="final-thoughts">Final Thoughts</h2><p>The general availability of CRaC would help boost the adoption of CR technology in the Java space, making the Java language even more modern and more suitable for the cloud-native world. Exciting times!<br>Finally, it is worth mentioning that CR technology is not new; <a href="https://www.linuxplumbersconf.org/event/2/contributions/69/attachments/205/374/Task_Migration_at_Scale_Using_CRIU_-_LPC_2018.pdf?ref=aboullaite.me">Google uses it</a> to migrate batch jobs in Borg.</p><h3 id="ressources">Resources</h3><ul><li><a href="https://blog.openj9.org/2022/09/26/fast-jvm-startup-with-openj9-criu-support/?ref=aboullaite.me">https://blog.openj9.org/2022/09/26/fast-jvm-startup-with-openj9-criu-support/</a></li><li><a href="https://github.com/CRaC/docs/blob/master/STEP-BY-STEP.md?ref=aboullaite.me">https://github.com/CRaC/docs/blob/master/STEP-BY-STEP.md</a></li><li><a href="https://www.youtube.com/watch?v=bWmuqh6wHgE&amp;ref=aboullaite.me">https://www.youtube.com/watch?v=bWmuqh6wHgE</a></li></ul>]]></content:encoded></item><item><title><![CDATA[My home office setup!]]></title><description><![CDATA[<p>Hello dear reader &#x1F44B;<br>Let me set some context first before diving into how I set up my home office. I am a software engineer, a meticulous one, you could say! I am sharing my own setup because many friends asked me to do so (and I</p>]]></description><link>https://aboullaite.me/my-home-office-setup/</link><guid isPermaLink="false">6460c00ecda49600011ebb34</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Tue, 31 Aug 2021 13:50:16 GMT</pubDate><content:encoded><![CDATA[<p>Hello dear reader &#x1F44B;<br>Let me set some context first before diving into how I set up my home office. I am a software engineer, a meticulous one, you could say! 
I am sharing my own setup because many friends asked me to do so (and I truly thank them for the kind words and encouragement). The items you&apos;re going to see below are my own preferences, based on my own research. So don&apos;t take everything you read here as a recommendation; do your own research and comparisons &#x270C;&#xFE0F; </p><p>With that being said, let&apos;s get back to business! I relocated to Stockholm a couple of months ago, as I joined Spotify. Needless to say, remote/home work is nowadays the new normal. Worse, it doesn&apos;t look like it&apos;s going to change anytime soon. At first, building my home workspace wasn&apos;t really a priority, but as I started to have more back pain, a drop in productivity and a bit of anxiety, it became a necessity.</p><p>After 2 months of gathering gadgets and items from here and there, here&apos;s the final look:</p><figure class="kg-card kg-image-card"><img src="https://aboullaite.me/content/images/2021/08/IMG_1212.png" class="kg-image" alt loading="lazy"></figure><h2 id="list-of-equipment-and-details">List of Equipment and details</h2><ul><li><a href="https://www.dell.com/en-us/work/shop/dell-ultrasharp-34-curved-usb-c-hub-monitor-u3421we/apd/210-axqs/monitors-monitor-accessories?ref=aboullaite.me">Dell UltraSharp 34 Curved USB-C Hub Monitor (U3421WE)</a>: I like the &quot;hub&quot; aspect of this monitor. It&apos;s actually very useful, especially for a MacBook Pro user like me. Everything plugs into the monitor itself&#x2014;all your USB gear (mouse, keyboard, backup drive, microphone, whatever you&apos;ve got)&#x2014;even ethernet! 
And then ONE cable connects to your laptop, and even charges it.</li><li><a href="https://www.amazon.com/Aluminum-Wireless-Charging-Transfer-Computer/dp/B089VY18WH?th=1&amp;ref=aboullaite.me">Vaydeer 2 Tiers Aluminum Monitor Stand with Wireless Charging</a>: What I like about this stand is the modern look, solidity and compactness. It also has a middle tray for notes and other stuff, and the best part is the four-port USB hub plus wireless phone charging!</li><li><a href="https://www.raindesigninc.com/mstand.html?ref=aboullaite.me">Rain Design mStand Laptop Stand</a>: Very solid. I used it with both the 15&quot; and 13&quot; MacBook Pro. It fits really well, and its aluminum material is a nice match.</li><li><a href="https://www.amazon.com/Neewer-Ring-Light-Kit-Self-Portrait/dp/B01LXDNNBW?ref=aboullaite.me">Neewer Ring Light Kit: 18&quot;/48cm Outer 55W 5500K Dimmable LED Ring Light</a>: This one is very popular among beauty bloggers &#x1F605; It&apos;s affordable compared to other alternatives, with a (very) powerful light. Easy to set up and use.</li><li><a href="https://en.yeelight.com/product/1512.html?ref=aboullaite.me">Yeelight LED Screen Light Bar Pro</a>: This one surprised me; I wasn&apos;t expecting it to be that good. It comes with a nice round remote controller, or you can use the Yeelight app for more light themes.</li><li>I put more lighting under my desk using a <a href="https://eu.govee.com/products/rgbic-smart-led-strip-lights?utm_campaign=govee&amp;utm_source=google&amp;utm_medium=cpc&amp;gclid=Cj0KCQjwpreJBhDvARIsAF1_BU2Nj6sWlvCL99rzp1yjBRI_PCJ-yfMcoOW2miY-5oS09U_diovh1W0aApEKEALw_wcB">Govee WiFi LED Strip</a>. 
It even has a music mode, and it&apos;s CRAZY!</li><li><a href="https://www.amazon.com/Tablet-Stand-Adjustable-Lamicall-Reader/dp/B01DBV1OKY?ref=aboullaite.me">Lamicall Tablet Stand</a> for iPad Pro: Robust, feels nice and does its job very well!</li><li><a href="https://www.amazon.com/Sony-Noise-Cancelling-Headphones-WH1000XM3/dp/B07G4MNFS1?ref=aboullaite.me">Sony WH-1000XM3 Noise Cancelling Wireless Headphones</a>: Although I usually use AirPods Pro for almost all my meetings, I have to admit I like these headphones a lot. The noise canceling is amazing. The sound quality is excellent, and with the Headphones app you can change the bass and other EQ settings.</li><li><a href="https://www.amazon.com/Headphone-New-Earphone-Supporting-Headphones/dp/B01GJQ7N94?ref=aboullaite.me">New Bee Headphones Stand</a>: It looks nice, fits well and does the job!</li><li><a href="https://www.logitech.com/en-us/products/webcams/c925e-business-webcam.960-001075.html?ref=aboullaite.me">Logitech C925e webcam</a>: Nothing much to say here; it&apos;s a webcam. Not very happy with it though, I may change it later tbh.</li><li>Large <a href="https://www.amazon.com/AmazonBasics-Large-Extended-Gaming-Computer/dp/B06X19FLTC?ref=aboullaite.me">Amazon Basics</a> mouse pad. Not fancy, but quite practical.</li><li>Apple MacBook Pro 13&quot;, AirPods Pro, iPad Pro 12.5&quot;, Magic Mouse and Magic Keyboard! 
You can call me an Apple fanboy &#x1F34E;</li><li>Herman Miller <a href="https://www.hermanmiller.com/en_lac/products/seating/office-chairs/aeron-chairs/?ref=aboullaite.me">chair</a> (Aeron) and <a href="https://www.hermanmiller.com/en_lac/products/tables/sit-to-stand-tables/nevi-sit-to-stand-tables/?ref=aboullaite.me">adjustable desk</a> (Nevi desk).</li></ul>]]></content:encoded></item><item><title><![CDATA[Building Native Covid19 Tracker CLI using Java, PicoCLI &amp; GraalVM]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When it comes to building CLI apps, Java is not the first choice that comes to mind (not even among the top three). However, one of the amazing things about Java is its ecosystem and vibrant community, which means you can find tools and libraries for (nearly) everything.</p>
<p><a href="https://golang.org/?ref=aboullaite.me">Golang</a></p>]]></description><link>https://aboullaite.me/java-covid19-cli-picocli-graalvm/</link><guid isPermaLink="false">6460c00ecda49600011ebb33</guid><category><![CDATA[Java]]></category><category><![CDATA[GraalVM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 11 May 2020 22:40:09 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>When it comes to building CLI apps, Java is not the first choice that comes to mind (not even among the top three). However, one of the amazing things about Java is its ecosystem and vibrant community, which means you can find tools and libraries for (nearly) everything.</p>
<p><a href="https://golang.org/?ref=aboullaite.me">Golang</a> particularly excels in this area for several reasons, but one aspect where Go shines is its ability to compile a program into a single, small, native executable that runs fast and is much easier to distribute. Java apps, however, have traditionally been hard to distribute since they require the JVM to be installed on the target machine.</p>
<p>In this post, I describe my experience building a small CLI app to track covid19, using <a href="https://picocli.info/?ref=aboullaite.me">picocli</a> and turning it into a lightweight, standalone binary that is easy to use and distribute, using <a href="https://www.graalvm.org/?ref=aboullaite.me">Graal VM</a>.</p>
<p>The complete source code for this application can be found in <a href="https://github.com/aboullaite/covid-19-picocli?ref=aboullaite.me">this Github repo</a>.</p>
<p><a href="https://asciinema.org/a/GZgh2sqHtTab8j6NXRGGnplnD?ref=aboullaite.me" target="_blank"><img src="https://asciinema.org/a/GZgh2sqHtTab8j6NXRGGnplnD.svg"></a></p>
<h3 id="picocli">PicoCLI</h3>
<p><a href="https://picocli.info/?ref=aboullaite.me">Picocli</a> is a modern library for building command line applications on the JVM.<br>
Picocli aims to be <em>the easiest way to create rich command line applications that can run on and off the JVM</em>. It offers <a href="https://picocli.info/?ref=aboullaite.me#_ansi_colors_and_styles">colored output</a>, <a href="https://picocli.info/autocomplete.html?ref=aboullaite.me">TAB autocompletion</a> and nested subcommands, and comes with a couple of great features compared to other JVM CLI libraries, such as <a href="https://picocli.info/?ref=aboullaite.me#_negatable_options">negatable options</a>, <a href="https://picocli.info/?ref=aboullaite.me#_repeating_composite_argument_groups">repeating composite argument groups</a>, <a href="https://picocli.info/?ref=aboullaite.me#_repeatable_subcommands">repeatable subcommands</a> and <a href="https://picocli.info/?ref=aboullaite.me#_custom_parameter_processing">custom parameter processing</a>.</p>
<p>Picocli-based applications can also easily be integrated with dependency injection containers. Picocli ships with a <a href="https://github.com/remkop/picocli/tree/master/picocli-spring-boot-starter?ref=aboullaite.me"><code>picocli-spring-boot-starter</code> module</a> that includes a <code>PicocliSpringFactory</code> and Spring Boot auto-configuration to use Spring dependency injection in your picocli command line application.</p>
<p>The <a href="https://micronaut.io/?ref=aboullaite.me">Micronaut</a> microservices framework has <a href="https://docs.micronaut.io/latest/guide/index.html?ref=aboullaite.me#commandLineApps">built-in support</a> for picocli.</p>
<h3 id="covid19trackerapp">Covid-19 Tracker app</h3>
<h4 id="covid19data">Covid-19 Data</h4>
<p>The CLI app gets data from the <a href="https://corona.lmao.ninja/?ref=aboullaite.me">Novel COVID API</a>, a free and easy-to-use API that gathers data from multiple sources (Johns Hopkins University, the New York Times, Worldometers and Apple reports).</p>
<h4 id="dependencies">Dependencies</h4>
<p>There are a couple of libraries that I used to build this app. First and foremost, <code>picocli</code> as the heart of the CLI app. I opted for <a href="https://eclipse-ee4j.github.io/jersey/?ref=aboullaite.me">Jersey Client</a> to handle HTTP communication with the REST server and collect data, as well as <a href="https://github.com/FasterXML/jackson?ref=aboullaite.me">Jackson</a>, the well-known Java JSON library.</p>
<p>The hard part was finding ASCII-based tables and graphs, and honestly my choices were very limited. I ended up using <a href="https://github.com/freva/ascii-table?ref=aboullaite.me">ascii-table</a> to create and customize ASCII tables and <a href="https://github.com/MitchTalmadge/ASCII-Data?ref=aboullaite.me">ascii-data</a> to generate some nice-looking text-based line graphs.</p>
<p>This is what my pom file&apos;s dependencies section contains:</p>
<pre><code class="language-xml">...
    &lt;dependency&gt;
        &lt;groupId&gt;info.picocli&lt;/groupId&gt;
        &lt;artifactId&gt;picocli&lt;/artifactId&gt;
        &lt;version&gt;4.2.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.core&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-client&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.media&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-media-json-jackson&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.github.freva&lt;/groupId&gt;
        &lt;artifactId&gt;ascii-table&lt;/artifactId&gt;
        &lt;version&gt;1.1.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.mitchtalmadge&lt;/groupId&gt;
        &lt;artifactId&gt;ascii-data&lt;/artifactId&gt;
        &lt;version&gt;1.4.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.inject&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-hk2&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<h4 id="showmethecode">Show me the code</h4>
<p>Now that I have everything the app needs, let&apos;s have a look at the code. Below is the main class:</p>
<pre><code class="language-java">@Command(description = &quot;Track covid-19 from your command line&quot;,
        name = &quot;cov19&quot;, mixinStandardHelpOptions = true, version = &quot;cov19 1.0&quot;)
public class Covid19Cli implements Callable&lt;Integer&gt; {
    @Option(names = {&quot;-c&quot;, &quot;--country&quot;}, description = &quot;Country to display data for&quot;, defaultValue = &quot;all&quot;)
    String country;
    @Option(names = {&quot;-g&quot;, &quot;--graph&quot;}, description = &quot;show data as graph history of last 30 days&quot;)
    boolean graph;
    @Option(names = {&quot;-a&quot;, &quot;--all&quot;}, description = &quot;show data for all affected countries&quot;)
    boolean all;

    CovidAPI covidAPI = new CovidAPI();

    public static void main(String[] args) {
        int exitCode = new CommandLine(new Covid19Cli()).execute(args);
        System.exit(exitCode);
    }

    public Integer call() throws Exception {
        if (this.all &amp;&amp; !this.country.equals(&quot;all&quot;)){
            System.out.println(Ansi.AUTO.string(&quot;@|bold,red, ****** Cannot combine global (`-a`) and country (`-c`) options ****** |@\n&quot;));
            return 1;
        }

        this.colorise(this.country);
        if(this.graph){
            PrintUtils.printGrapgh(covidAPI.history(this.country));
            return 0;
        }
        if (this.all){
            PrintUtils.printCountryStatTable(covidAPI.allCountryStats());
            return 0;
        }
        if(this.country.equals(&quot;all&quot;)) {
            PrintUtils.printGlobalTable(Arrays.asList(covidAPI.globalStats()));
            return 0;
        }
        PrintUtils.printCountryStatTable(Arrays.asList(covidAPI.countryStats(this.country)));
        return 0;
    }
}
</code></pre>
<p>A couple of interesting things here:</p>
<ul>
<li>The <code>@Command</code> annotation from picocli enables us to define the general information about the command.</li>
<li>The <code>mixinStandardHelpOptions</code> attribute magically adds <code>--help</code> and <code>--version</code> flags to the CLI.</li>
<li>The class implements <code>Callable&lt;Integer&gt;</code>, as <code>picocli</code> needs a predictable way of executing the command, parsing parameters and options, and returning an exit code.</li>
<li>The <code>execute</code> method shows the usage help or version information if requested by the user.</li>
<li>Invalid user input will result in a helpful error message. If the user input was valid, the business logic, present in <code>call</code> method, is invoked.</li>
<li>Finally, the <code>execute</code> method returns an exit status code that can be used to call <code>System.exit</code> if desired. By default, the <code>execute</code> method returns <code>CommandLine.ExitCode.OK (0)</code> on success, <code>CommandLine.ExitCode.SOFTWARE (1)</code> when an exception occurred in the Runnable, Callable or command method, and <code>CommandLine.ExitCode.USAGE (2)</code> for invalid input.</li>
<li>The fields of the class are annotated with <code>@Option</code> to declare what options the application expects. Picocli initializes these fields based on the command line arguments, which commonly start with <code>-</code> or <code>--</code>.</li>
<li>Note that options can have one name or several; here each option has a short (<code>-c</code>) and a long (<code>--country</code>) form.</li>
<li>Options can have default values using the <code>defaultValue</code> annotation attribute.</li>
</ul>
<h4 id="buildingandtestingtheapp">Building and testing the app</h4>
<p>As with any Maven-based Java app, we run <code>mvn clean package</code> to compile the app and generate the <code>jar</code> file. I used the <a href="https://maven.apache.org/plugins/maven-shade-plugin/index.html?ref=aboullaite.me"><code>maven shade plugin</code></a> to package the artifact as an uber-jar (including all its dependencies).<br>
Now, we can verify that our CLI is working using:</p>
<pre><code class="language-bash">$ java -jar covid-java-cli-1.0-SNAPSHOT.jar --help                                                                                                   

Usage: cov19 [-aghV] [-c=&lt;country&gt;]
Track covid-19 from your command line
  -a, --all                 show data for all affected countries
  -c, --country=&lt;country&gt;   Country to display data for
  -g, --graph               show data as graph history of last 30 days
  -h, --help                Show this help message and exit.
  -V, --version             Print version information and exit.
</code></pre>
<p>So far, the application is working, but it doesn&apos;t feel much like an actual CLI. Ideally, we should aim for a more native experience and simply run <code>./mycli</code> instead of calling <code>java -jar</code> each time!<br>
This is what we will try to accomplish in the next section with GraalVM.</p>
<h3 id="graalvmbuildinganativeimage">GraalVM, Building a native image</h3>
<p>This was the hardest part of working on this app, for the simple reason that the GraalVM native image compiler&apos;s support for reflection is <a href="https://github.com/oracle/graal/blob/master/substratevm/CONFIGURE.md?ref=aboullaite.me">partial and requires additional configuration</a>.<br>
This impacts my application in two ways:</p>
<ul>
<li>Picocli uses reflection to discover classes and methods annotated with <code>@Command</code>, and fields, methods or method parameters annotated with <code>@Option</code>.</li>
<li>Jersey Client uses reflection as well.</li>
</ul>
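<p>To make the problem concrete, here is a small, self-contained sketch (my own illustration, not picocli&apos;s actual code) of the kind of reflective annotation lookup picocli performs at runtime; the annotation and command class below are hypothetical. A closed-world, ahead-of-time compiler cannot discover these accesses through static analysis alone, which is why the metadata must be declared in configuration files:</p>

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class ReflectionDemo {
    // Hypothetical stand-in for picocli's @Option annotation
    @Retention(RetentionPolicy.RUNTIME)
    @interface Option { String name(); }

    // Hypothetical command class, in the style of the main class above
    static class GreetCommand {
        @Option(name = "--name")
        String name = "world";
    }

    public static void main(String[] args) throws Exception {
        Object cmd = new GreetCommand();
        // Walk the annotated fields at runtime, as picocli does for
        // @Option; native-image must retain this field and annotation
        // metadata in the binary for the loop to find anything:
        for (Field f : cmd.getClass().getDeclaredFields()) {
            Option opt = f.getAnnotation(Option.class);
            if (opt != null) {
                System.out.println(opt.name() + " -> " + f.get(cmd));
            }
        }
    }
}
```

<p>Without reflection configuration, a native image may simply drop the annotation metadata, and a loop like this would silently find nothing.</p>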
<p>Picocli includes a <a href="https://github.com/remkop/picocli/tree/master/picocli-codegen?ref=aboullaite.me"><code>picocli-codegen</code> module</a>, which contains an annotation processor that generates GraalVM configuration files at compile time rather than at runtime. So the first issue was easy to fix by adding the config below to my <code>pom.xml</code> file:</p>
<pre><code class="language-xml">             &lt;plugin&gt;
                &lt;groupId&gt;org.apache.maven.plugins&lt;/groupId&gt;
                &lt;artifactId&gt;maven-compiler-plugin&lt;/artifactId&gt;
                &lt;version&gt;3.8.1&lt;/version&gt;
                &lt;configuration&gt;
                    &lt;annotationProcessorPaths&gt;
                        &lt;path&gt;
                            &lt;groupId&gt;info.picocli&lt;/groupId&gt;
                            &lt;artifactId&gt;picocli-codegen&lt;/artifactId&gt;
                            &lt;version&gt;4.2.0&lt;/version&gt;
                        &lt;/path&gt;
                    &lt;/annotationProcessorPaths&gt;
                &lt;/configuration&gt;
            &lt;/plugin&gt;
</code></pre>
<p>It generates configuration files for reflection, resources and dynamic proxies:</p>
<pre><code class="language-text">target
&#x251C;&#x2500;&#x2500; classes
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; META-INF
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; native-image
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; picocli-generated
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x251C;&#x2500;&#x2500; proxy-config.json
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x251C;&#x2500;&#x2500; reflect-config.json
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x2514;&#x2500;&#x2500; resource-config.json

</code></pre>
<p>As for Jersey, I had to do some testing and debugging to generate the <code>reflection.json</code> that makes our application Graal-enabled! Below is a snippet from it:</p>
<pre><code class="language-json">...
  {
    &quot;name&quot; : &quot;org.glassfish.jersey.internal.config.ExternalPropertiesConfigurationFeature&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
  {
    &quot;name&quot; : &quot;org.glassfish.jersey.message.internal.MessageBodyFactory&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
  {
    &quot;name&quot; : &quot;com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
</code></pre>
<p>Now that our reflection config is in place, we are pretty much done with the application. The next natural step is to compile it ahead of time and generate the native binary.</p>
<p>First off, we need to install the GraalVM <a href="https://www.graalvm.org/docs/reference-manual/aot-compilation/?ref=aboullaite.me">native-image tool</a> and call it manually. However, recent GraalVM releases added the possibility to build native images right out of Maven, without running the <code>native-image</code> tool as a separate step after building the <code>uber-jar</code>. For the plugin to run, it expects <code>JAVA_HOME</code> to point to a GraalVM installation; it will not work otherwise.</p>
<pre><code class="language-xml">          &lt;plugin&gt;
                &lt;groupId&gt;org.graalvm.nativeimage&lt;/groupId&gt;
                &lt;artifactId&gt;native-image-maven-plugin&lt;/artifactId&gt;
                &lt;version&gt;20.0.0&lt;/version&gt;
                &lt;configuration&gt;
                    &lt;mainClass&gt;me.aboullaite.Covid19Cli&lt;/mainClass&gt;
                    &lt;imageName&gt;cov19-cli&lt;/imageName&gt;
                    &lt;buildArgs&gt;
                        --no-fallback
                        --report-unsupported-elements-at-runtime
                        --allow-incomplete-classpath
                        -H:ReflectionConfigurationFiles=classes/reflection.json
                        -H:+ReportExceptionStackTraces
                        -H:EnableURLProtocols=https
                    &lt;/buildArgs&gt;
                    &lt;skip&gt;false&lt;/skip&gt;
                &lt;/configuration&gt;
                &lt;executions&gt;
                    &lt;execution&gt;
                        &lt;goals&gt;
                            &lt;goal&gt;native-image&lt;/goal&gt;
                        &lt;/goals&gt;
                        &lt;phase&gt;verify&lt;/phase&gt;
                    &lt;/execution&gt;
                &lt;/executions&gt;
            &lt;/plugin&gt;
</code></pre>
<p>Everything is ready. Now we can generate a native image by running <code>mvn clean verify</code>, which will trigger native image compilation. The process will take about a minute to complete.<br>
At the end, we have a native executable under <code>target/cov19-cli</code>.</p>
<pre><code class="language-bash">$ ./target/cov19-cli --help 
                                                                                                                               
Usage: cov19 [-aghV] [-c=&lt;country&gt;]
Track covid-19 from your command line
  -a, --all                 show data for all affected countries
  -c, --country=&lt;country&gt;   Country to display data for
  -g, --graph               show data as graph history of last 30 days
  -h, --help                Show this help message and exit.
  -V, --version             Print version information and exit.

</code></pre>
<h4 id="comparingstartuptime">Comparing startup time</h4>
<p>I couldn&apos;t resist the thought of comparing the startup time of the application on a normal JIT-based JVM to that of the native image. Below are the results I got on my machine:</p>
<pre><code class="language-bash">$ gtime -p java -jar covid-java-cli-1.0-SNAPSHOT.jar --help                                                                                        
real 0.32
user 0.67
sys 0.09
                                                                          
$ gtime -p ./cov19-cli --help                                                                                                                    
real 0.01
user 0.00
sys 0.00
</code></pre>
<h3 id="finalswords">Final words</h3>
<p>Building Java-based native CLI tools is becoming possible nowadays with Picocli and GraalVM. Of course, there are several limitations in the native-image compiler, mainly the partial reflection support. Nevertheless, the combination of both tools to create CLI tools without JVM overhead looks promising.</p>
<h5 id="ressources">Resources:</h5>
<ul>
<li><a href="https://picocli.info/?ref=aboullaite.me#_introduction">https://picocli.info/#_introduction</a></li>
<li><a href="https://medium.com/graalvm/simplifying-native-image-generation-with-maven-plugin-and-embeddable-configuration-d5b283b92f57?ref=aboullaite.me">https://medium.com/graalvm/simplifying-native-image-generation-with-maven-plugin-and-embeddable-configuration-d5b283b92f57</a></li>
<li><a href="https://www.infoq.com/articles/java-native-cli-graalvm-picocli/?ref=aboullaite.me">https://www.infoq.com/articles/java-native-cli-graalvm-picocli/</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 features: Text Blocks &amp; Foreign-Memory Access API]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the fourth and last post in the blog post series I wrote covering the features that have been added to Java 14, released just a couple of days ago.</p>
<blockquote class="twitter-tweet"><p lang="in" dir="ltr">Java 14 / JDK 14: General Availability: <a href="https://t.co/THxJ9llBpj?ref=aboullaite.me">https://t.co/THxJ9llBpj</a> <a href="https://twitter.com/hashtag/jdk14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#jdk14</a> <a href="https://twitter.com/hashtag/java14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java14</a> <a href="https://twitter.com/hashtag/openjdk?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#openjdk</a> <a href="https://twitter.com/hashtag/java?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java</a></p>&#x2014; Mark Reinhold (@mreinhold) <a href="https://twitter.com/mreinhold/status/1239969686449606658?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">March</a></blockquote>]]></description><link>https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/</link><guid isPermaLink="false">6460c00ecda49600011ebb32</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 22 Mar 2020 11:46:32 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the fourth and last post in the blog post series I wrote covering the features that have been added to Java 14, released just a couple of days ago.</p>
<blockquote class="twitter-tweet"><p lang="in" dir="ltr">Java 14 / JDK 14: General Availability: <a href="https://t.co/THxJ9llBpj?ref=aboullaite.me">https://t.co/THxJ9llBpj</a> <a href="https://twitter.com/hashtag/jdk14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#jdk14</a> <a href="https://twitter.com/hashtag/java14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java14</a> <a href="https://twitter.com/hashtag/openjdk?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#openjdk</a> <a href="https://twitter.com/hashtag/java?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java</a></p>&#x2014; Mark Reinhold (@mreinhold) <a href="https://twitter.com/mreinhold/status/1239969686449606658?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">March 17, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>In this post, I will cover 2 more features: <code>Text Blocks</code> (Second Preview) and <code>Foreign-Memory Access API</code> (Incubator).</p>
<h5 id="java14rewfeaturesarticles">Java 14 new features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-records/">Records</a></li>
<li><a href="https://aboullaite.me/java-14-se-jfrs">Switch Expressions, JFR Event Streaming and more</a></li>
</ul>
<h3 id="jep368textblockssecondpreview">JEP 368: Text Blocks (Second Preview)</h3>
<p>The first preview of Text Blocks was introduced in <a href="https://openjdk.java.net/jeps/355?ref=aboullaite.me">Java 13</a> as a new, more concrete and concise vision of how <a href="https://openjdk.java.net/jeps/326?ref=aboullaite.me">Raw String Literals</a> should work in Java. You can read more about the withdrawal of JEP 326 <a href="http://mail.openjdk.java.net/pipermail/jdk-dev/2018-December/002402.html?ref=aboullaite.me">here</a>.</p>
<p>A <code>text block</code> is a multi-line string literal that avoids the need for most escape sequences, automatically formats the string in a predictable way, makes inline multi-line strings more readable and gives the developer control over the format when desired.</p>
<h4 id="usage">Usage</h4>
<p>A text block starts with an opening delimiter of three double-quote characters (<code>&quot;&quot;&quot;</code>), followed by zero or more space, tab and form feed characters, and a line terminator. It ends with a closing delimiter, which is another sequence of three double-quote characters (<code>&quot;&quot;&quot;</code>).</p>
<pre><code>&quot;&quot;&quot;
Text
Block
Example
&quot;&quot;&quot;
</code></pre>
<p>Optionally, the closing delimiter can be placed directly at the end of the last line of content:</p>
<pre><code>&quot;&quot;&quot;
Text
Block
Example&quot;&quot;&quot;
</code></pre>
<p>Note that the result type of a text block is still a <code>String</code>; text blocks just give us another way to write string literals in our source code.</p>
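<p>A quick, compilable check of that point (my own example): the text block below produces exactly the same <code>String</code> as the escaped, concatenated literal.</p>

```java
public class TextBlockDemo {
    public static void main(String[] args) {
        String concat = "Text\nBlock\nExample\n";
        // Same value, written as a text block:
        String block = """
                Text
                Block
                Example
                """;
        System.out.println(block.equals(concat)); // true
    }
}
```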
<h4 id="indentation">Indentation</h4>
<p>Text blocks make it a bit easier to indent our code properly. To calculate how many whitespace characters should be removed from every line, the compiler determines the line with the least leading whitespace and shifts the complete text block to the left by that amount, differentiating incidental whitespace from essential whitespace.</p>
<p>For example, including the trailing blank line with the closing delimiter, the common white space prefix is 11, so eleven white spaces are removed from the start of each line.</p>
<pre><code class="language-java">// spaces (dots) will be removed
        String text= &quot;&quot;&quot;
...........     some text
...........     having fun
...........     with Text Blocks
...........&quot;&quot;&quot;;
</code></pre>
<p>Now, suppose the closing delimiter is moved slightly to the right of the content; in this case, 16 white spaces are removed from the start of each line:</p>
<pre><code class="language-java">// spaces (dots) will be removed
        String text= &quot;&quot;&quot;
................some text
................having fun
................with Text Blocks
................   &quot;&quot;&quot;;
</code></pre>
<p>The spaces visualized with dots are considered to be incidental and hence will be removed.</p>
<h4 id="escaping">Escaping</h4>
<p>The use of the escape sequences <code>\&quot;</code> and <code>\n</code> is permitted in a text block, but is neither necessary nor recommended. However, representing the sequence <code>&quot;&quot;&quot;</code> in a text block requires escaping at least one <code>&quot;</code> character, to avoid mimicking the closing delimiter.</p>
<pre><code class="language-java">String code =
    &quot;&quot;&quot;
    String text = \&quot;&quot;&quot;
        This is a Text Block inside a Text Block
    \&quot;&quot;&quot;;
    &quot;&quot;&quot;;
</code></pre>
<p>The string represented by a text block is not the literal sequence of characters in the content. Instead, the string represented by a text block is the result of applying the following transformations to the content, in order:</p>
<ul>
<li>Line terminators are normalized to the ASCII LF character, as follows:
<ul>
<li>An ASCII CR (Carriage Return) character followed by an ASCII LF (Line Feed) character is translated to an ASCII LF character.</li>
<li>An ASCII CR character is translated to an ASCII LF character.</li>
</ul>
</li>
<li>Incidental white space is removed, as if by execution of <code>String::stripIndent</code> on the characters resulting from step 1.</li>
<li>Escape sequences are interpreted, as if by execution of <code>String::translateEscapes</code> on the characters resulting from step 2.</li>
</ul>
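<p>Steps 2 and 3 are also exposed directly as <code>String</code> methods (finalized in later JDK releases), so their effect can be observed on ordinary strings; a minimal sketch:</p>

```java
public class TextBlockTransforms {

    public static void main(String[] args) {
        // Step 2: incidental white space removal (String::stripIndent);
        // the common 4-space prefix is stripped from both lines
        String indented = "    line one\n    line two";
        System.out.println(indented.stripIndent().equals("line one\nline two")); // true

        // Step 3: escape-sequence interpretation (String::translateEscapes);
        // the two characters \ and t become a single tab character
        String escaped = "a\\tb";
        System.out.println(escaped.translateEscapes().equals("a\tb")); // true
    }
}
```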
<h5 id="newescapesequences">New escape sequences</h5>
<p>With Java 14, text blocks gained two new escape sequences:</p>
<ul>
<li>The <code>\&lt;line-terminator&gt;</code> escape sequence explicitly suppresses the insertion of a newline character. This is very useful when you have long lines of text in the source code that you want to format in a readable way.</li>
<li>The new <code>\s</code> escape sequence simply translates to a single space (<code>\u0020</code>). It essentially tells the compiler to preserve any spaces in front of this escaped space, instead of stripping them (the default behaviour).</li>
</ul>
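<p>Both escape sequences are easy to see in action; in the sketch below (field names mine), the first text block collapses to a single line, and the second keeps its trailing spaces so both lines end up six characters wide:</p>

```java
public class NewEscapes {

    // \ at the end of a line suppresses the newline, joining the lines
    static final String JOINED = """
            Lorem ipsum \
            dolor sit amet""";

    // \s translates to a space and protects the spaces before it
    // from the default trailing-white-space stripping
    static final String PADDED = """
            red  \s
            green\s""";

    public static void main(String[] args) {
        System.out.println(JOINED); // Lorem ipsum dolor sit amet
    }
}
```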
<h4 id="methods">Methods</h4>
<p>To support the new features of Text Blocks, a couple of methods have been introduced (some of them already mentioned above):</p>
<ul>
<li><code>String::stripIndent()</code>: used to strip away incidental white space from the text block content</li>
<li><code>String::translateEscapes()</code>: used to translate escape sequences</li>
<li><code>String::formatted(Object... args)</code>: simplify value substitution in the text block</li>
</ul>
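<p>For example, <code>formatted</code> makes value substitution inside a text block read naturally; the JSON template below is just an illustration, and behaves exactly like <code>String.format</code>:</p>

```java
public class FormattedExample {

    // %s and %d placeholders are filled in by formatted(),
    // keeping the multi-line template readable inline
    static String userJson(String name, int age) {
        return """
               {
                 "name": "%s",
                 "age": %d
               }
               """.formatted(name, age);
    }

    public static void main(String[] args) {
        System.out.print(userJson("Duke", 25));
    }
}
```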
<h3 id="jep370foreignmemoryaccessapiincubator">JEP 370: Foreign-Memory Access API (Incubator)</h3>
<p>This incubating feature enables efficient, safe and deterministic access to native memory segments outside the JVM heap (off-heap). <a href="https://openjdk.java.net/jeps/370?ref=aboullaite.me">The JEP</a> also states that this foreign-memory API is intended as an alternative to the currently used approaches (<code>java.nio.ByteBuffer</code>, introduced in 2002 with <code>Java 1.4</code>, and <code>sun.misc.Unsafe</code>, long before that).</p>
<p>The foreign-memory access API, part of <a href="https://openjdk.java.net/projects/panama/?ref=aboullaite.me">Project Panama</a>, introduces three main abstractions:</p>
<ul>
<li><code>MemorySegment</code>: is used to model a contiguous memory region with given spatial and temporal bounds.</li>
<li><code>MemoryAddress</code>: can be thought of as an offset within a segment.</li>
<li><code>MemoryLayout</code>: a way to define the layout of a memory segment in a language neutral fashion.</li>
</ul>
<p>To start playing with this API, you first need to add the <code>jdk.incubator.foreign</code> module manually (for example with <code>--add-modules jdk.incubator.foreign</code>).<br>
The simple example below allocates 4 bytes of memory outside the JVM heap space and prints its base address.</p>
<pre><code class="language-java">import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemorySegment;

public class FmaExample {
    public static void main(String[] args) {

        MemoryAddress address = MemorySegment.allocateNative(4).baseAddress();
        System.out.print(address);
    }
}

// Prints
WARNING: Using incubator modules: jdk.incubator.foreign
MemoryAddress{ region: MemorySegment{ id=0x1406e03c limit: 4 } offset=0x0 }  
</code></pre>
<p>In the above, we are using the overloaded <code>allocateNative()</code> which takes a <code>long</code> value of the size in bytes and creates a new native <code>memory segment</code> that models a newly allocated block of off-heap memory. There are two other versions of this method: one accepts a <code>MemoryLayout</code>, and one accepts a size in bytes together with the byte alignment.</p>
<p>In order to use the memory segment from the example above, a <code>memory-access var handle</code> should be used. These are obtained using factory methods in the <code>MemoryHandles</code> class. The example below allocates a 10-byte segment and stores the int value 10 at its base address:</p>
<pre><code class="language-java">import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemoryHandles;
import jdk.incubator.foreign.MemorySegment;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class FmaExample {
    public static void main(String[] args) {

        MemoryAddress address = MemorySegment.allocateNative(10).baseAddress();
        VarHandle handle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
        handle.set(address, 10);

        System.out.println(&quot;Memory Value: &quot; + handle.get(address));
    }
}

// Prints
WARNING: Using incubator modules: jdk.incubator.foreign
Memory Value: 10
</code></pre>
<hr>
<h4 id="ressourcesandfurtherreading">Resources and further reading:</h4>
<ul>
<li><a href="https://docs.oracle.com/javase/specs/jls/se14/preview/specs/text-blocks-jls.html?ref=aboullaite.me">https://docs.oracle.com/javase/specs/jls/se14/preview/specs/text-blocks-jls.html</a></li>
<li><a href="https://www.baeldung.com/java-text-blocks?ref=aboullaite.me">https://www.baeldung.com/java-text-blocks</a></li>
<li><a href="https://www.jrebel.com/blog/using-text-blocks-in-java-13?ref=aboullaite.me">https://www.jrebel.com/blog/using-text-blocks-in-java-13</a></li>
<li><a href="https://medium.com/@youngty1997/jdk-14-foreign-memory-access-api-overview-70951fe221c9?ref=aboullaite.me">https://medium.com/@youngty1997/jdk-14-foreign-memory-access-api-overview-70951fe221c9</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 features: Switch Expressions, JFR Event Streaming and more]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the third post in a series of blog posts highlighting some features and improvements that will be introduced in Java 14, expected to go GA in a couple of days.<br>
In this post, We will have a look into Switch expression, JFR streaming, as well as some various</p>]]></description><link>https://aboullaite.me/java-14-se-jfrs/</link><guid isPermaLink="false">6460c00ecda49600011ebb31</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 11 Mar 2020 15:05:13 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the third post in a series of blog posts highlighting some features and improvements that will be introduced in Java 14, expected to go GA in a couple of days.<br>
In this post, we will have a look at switch expressions, JFR event streaming, as well as various minor improvements.</p>
<h5 id="java14rewfeaturesarticles">Java 14 new features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-records/">Records</a></li>
<li><a href="https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/">Text Blocks &amp; Foreign-Memory Access API</a></li>
</ul>
<h3 id="switchexpressionstandardjep361">Switch Expression (Standard): JEP 361</h3>
<p>Switch expressions were first introduced in JDK 12 as a preview feature, then refined in JDK 13, and they are made final and permanent in JDK 14.</p>
<h4 id="alookintoswitchexpression">A look into Switch expression</h4>
<p>Often, a switch statement produces a value in each of its case blocks. Switch expressions enable a more concise syntax: fewer repetitive <code>case</code> and <code>break</code> keywords, and less error-prone code.<br>
Consider the following example:</p>
<pre><code class="language-java">        WeekDay day = WeekDay.FRIDAY;
        String dayType;
        switch (day) {
            case MONDAY:
            case TUESDAY:
            case WEDNESDAY:
            case THURSDAY:
            case FRIDAY:
                dayType = &quot;Weekday&quot;;
                break;
            case SATURDAY:
            case SUNDAY:
                dayType = &quot;Weekend&quot;;
                break;

            default:
                throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        }
</code></pre>
<p>That&apos;s how we check whether a specific day is a weekday, using our good old switch statement. It would be better if we could <strong>return</strong> this information without having to store it in the variable <code>dayType</code>; we can do this with a switch expression, which is both clearer and safer:</p>
<pre><code class="language-java"> String dayType = switch (day){
            case MONDAY, THURSDAY, WEDNESDAY, TUESDAY, FRIDAY -&gt; &quot;Weekday&quot;;
            case SATURDAY, SUNDAY -&gt; &quot;Weekend&quot;;
            default -&gt; throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        };
</code></pre>
<p>As you can see, instead of having to break out of the different cases, we used the new lambda-style switch syntax, which executes the expression on the right when the label matches. This is a more straightforward control flow, free of fall-through (no need for <code>break</code> statements).<br>
Furthermore, the example above used &quot;<em>arrow case</em>&quot; labels, with the arrow between the label and the execution. We could instead use &quot;<em>colon case</em>&quot; labels:</p>
<pre><code class="language-java">        String dayType = switch (day){
            case MONDAY, THURSDAY, WEDNESDAY, TUESDAY, FRIDAY:
                yield &quot;Weekday&quot;;
            case SATURDAY, SUNDAY:
                yield &quot;Weekend&quot;;
            default:
                throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        };
</code></pre>
<p>But what is <code>yield</code>? The <code>yield</code> statement was introduced in <a href="https://openjdk.java.net/jeps/354?ref=aboullaite.me">JDK 13</a>! It takes one argument: the value that the case label produces in a switch expression. Its presence is also an easy rule of thumb to differentiate a switch expression from a switch statement.</p>
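<p>For completeness, here is the weekday example as a small self-contained class (the enum and class names are mine). Note that because the switch covers every enum constant, the compiler accepts it without a <code>default</code> branch:</p>

```java
public class SwitchExpressions {

    enum WeekDay { MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY }

    static String dayType(WeekDay day) {
        // arrow labels: no fall-through, each case produces a value
        return switch (day) {
            case SATURDAY, SUNDAY -> "Weekend";
            case MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY -> "Weekday";
        };
    }

    public static void main(String[] args) {
        System.out.println(dayType(WeekDay.FRIDAY)); // Weekday
    }
}
```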
<h3 id="jfreventstreamingjep349">JFR Event Streaming: JEP 349</h3>
<p><code>Java Flight Recorder</code> has a long history. It was first part of the BEA JRockit JVM. Then, after Oracle acquired BEA, it became a commercial feature of the Oracle JDK, before finally being open sourced with the release of OpenJDK 11 (JEP 328); it is also <a href="https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-January/011063.html?ref=aboullaite.me">in the process of being backported to 8</a>.<br>
The arrival of JDK 14 introduces a new feature to JFR: the ability for JFR to produce a continuous stream of events.</p>
<h4 id="whatisjfr">What is JFR?</h4>
<p>JFR is basically a monitoring tool that collects information about the events in a Java Virtual Machine (JVM) during the execution of a Java application. It is designed to affect the performance of a running application as little as possible.</p>
<h4 id="jfrinjdk14">JFR in JDK 14</h4>
<p>With <a href="https://openjdk.java.net/jeps/349?ref=aboullaite.me">JEP 349</a>, a new usage mode for JFR becomes available: JFR Event Streaming. This API provides a way for programs to receive callbacks when JFR events occur and respond to them immediately, for both in-process and out-of-process applications. The same set of events can be recorded as in the non-streaming mode, and streaming can take place at the same time as a regular recording.<br>
Check out the following example:</p>
<pre><code class="language-java">        Configuration config = Configuration.getConfiguration(&quot;default&quot;);
        try (var es = new RecordingStream(config)) {
            es.onEvent(&quot;jdk.GarbageCollection&quot;, System.out::println);
            es.onEvent(&quot;jdk.CPULoad&quot;, System.out::println);
            es.onEvent(&quot;jdk.JVMInformation&quot;, System.out::println);
            es.setMaxAge(Duration.ofSeconds(10));
            es.start();
        }
</code></pre>
<p>This snippet starts JFR on the local JVM using the default recorder settings and prints the <code>Garbage Collection</code>, <code>CPU Load</code> and <code>JVM Information</code> events to standard output:</p>
<pre><code>jdk.JVMInformation {
  startTime = 12:13:28.724
  jvmName = &quot;OpenJDK 64-Bit Server VM&quot;
  jvmVersion = &quot;OpenJDK 64-Bit Server VM (14+36-1461) for bsd-amd64 JRE (14+36-1461), built on Feb  6 2020 19:03:05 by &quot;mach5one&quot; with clang 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)&quot;
  jvmArguments = N/A
  jvmFlags = N/A
  javaArguments = &quot;me.aboullaite.JFRStreamTest&quot;
  jvmStartTime = 12:13:28.415
  pid = 72713
}
jdk.CPULoad {
  startTime = 12:13:30.682
  jvmUser = 0.98%
  jvmSystem = 0.08%
  machineTotal = 2.86%
}
</code></pre>
<h3 id="numaawarememoryallocationforg1jep345">NUMA-Aware Memory Allocation for G1: JEP 345</h3>
<p><a href="https://queue.acm.org/detail.cfm?id=2513149&amp;ref=aboullaite.me">NUMA (Non-uniform memory access)</a> is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the ability of the system to be expanded.</p>
<p>This JEP aims to improve G1 performance on large machines by implementing NUMA-aware memory allocation. G1&apos;s heap is organized as a collection of fixed-size regions. A region is typically a set of physical pages, although when using large pages (via <code>-XX:+UseLargePages</code>) several regions may make up a single physical page. If the <code>-XX:+UseNUMA</code> option is specified then, when the JVM is initialized, the regions will be evenly spread across the total number of available NUMA nodes.</p>
<h3 id="nonvolatilemappedbytebuffersjep352">Non-Volatile Mapped Byte Buffers: JEP 352</h3>
<p>This JEP improves the <code>FileChannel</code> API to support creating mapped byte buffers on non-volatile memory (persistent memory). The only API change required is a new enumeration employed by <code>FileChannel</code> clients to request mapping of a file located on an NVM-backed file system rather than a conventional file storage system. The new enumeration values are used when calling the <code>FileChannel::map</code> method to create, respectively, a read-only or read-write MappedByteBuffer mapped over an NVM device file. This feature is only supported on the Linux/x64 and Linux/AArch64 platforms.</p>
<h3 id="deprecatethesolarisandsparcportsjep362">Deprecate the Solaris and SPARC Ports: JEP 362</h3>
<p>The Solaris/SPARC, Solaris/x64, and Linux/SPARC ports are deprecated and will be removed in a future release. The main motivation is to enable OpenJDK Community contributors to accelerate the development of new features that move the platform forward.</p>
<h3 id="removetheconcurrentmarksweepcmsgarbagecollectorjep363">Remove the Concurrent Mark Sweep (CMS) Garbage Collector: JEP 363</h3>
<p>The CMS garbage collector was deprecated in <a href="https://openjdk.java.net/jeps/291?ref=aboullaite.me">Java 9</a>, and it is removed in Java 14.</p>
<h3 id="zgconmacosjep364andwindowsjep365">ZGC on macOS (JEP 364) and Windows (JEP 365)</h3>
<p>ZGC was introduced in Java 11, but it was only supported on Linux. Now it is also available on macOS and Windows. On Windows, ZGC is not supported on Windows 10 and Windows Server releases older than version 1803, since older versions lack the required API for placeholder memory reservations.</p>
<h3 id="deprecatetheparallelscavengeserialoldgccombinationjep366">Deprecate the ParallelScavenge + SerialOld GC Combination: JEP 366</h3>
<p>The Parallel Scavenge young and Serial Old garbage collector combination is deprecated due to little use and the significant amount of maintenance effort it requires.</p>
<h3 id="removethepack200toolsandapijep367">Remove the Pack200 Tools and API: JEP 367</h3>
<p>The Pack200 tools and API were deprecated in <a href="https://openjdk.java.net/jeps/336?ref=aboullaite.me">Java 11</a>, and they are removed in Java 14.</p>
<hr>
<h4 id="resourcesandfurtherreading">Resources and further reading:</h4>
<ul>
<li><a href="https://openjdk.java.net/projects/jdk/14/?ref=aboullaite.me">https://openjdk.java.net/projects/jdk/14/</a></li>
<li><a href="https://blog.codefx.org/java/switch-expressions/?ref=aboullaite.me#No-Fall-Through">https://blog.codefx.org/java/switch-expressions/#No-Fall-Through</a></li>
<li><a href="https://docs.oracle.com/en/java/javase/13/language/switch-expressions.html?ref=aboullaite.me">https://docs.oracle.com/en/java/javase/13/language/switch-expressions.html</a></li>
<li><a href="https://blogs.oracle.com/javamagazine/java-flight-recorder-and-jfr-event-streaming-in-java-14?ref=aboullaite.me">https://blogs.oracle.com/javamagazine/java-flight-recorder-and-jfr-event-streaming-in-java-14</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Skaffold, OKE & OCIR!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>If you&apos;re working on cloud-native apps and containers, you probably already noticed that, amid all features that containers offer, they somehow added a new layer of complexity to the development workflow! We spend a great amount of time building container images, pushing them across registries, updating Kubernetes manifests,</p>]]></description><link>https://aboullaite.me/skaffold-oke-ocir/</link><guid isPermaLink="false">6460c00ecda49600011ebb30</guid><category><![CDATA[Docker]]></category><category><![CDATA[Devops]]></category><category><![CDATA[kubernetes]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Fri, 06 Mar 2020 03:31:16 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>If you&apos;re working on cloud-native apps and containers, you probably already noticed that, amid all features that containers offer, they somehow added a new layer of complexity to the development workflow! We spend a great amount of time building container images, pushing them across registries, updating Kubernetes manifests, redeploying the application and checking if everything works as intended... even for the smallest changes. The feedback loop gets bigger and bigger!</p>
<p>One of the open source tools that helps to solve this issue, especially while working with kubernetes, is <a href="https://skaffold.dev/?ref=aboullaite.me">Skaffold</a>! Skaffold is a command line tool by Google, that facilitates continuous development for Kubernetes applications. The goal is to help developers to focus on writing and maintaining code rather than managing the repetitive steps required during the edit-debug-deploy inner loop.</p>
<p>In this post, I describe the steps to continuously deploy your cloud-native apps, focus on coding and boost productivity, using Skaffold and <a href="https://cloud.oracle.com/?ref=aboullaite.me">Oracle Cloud</a>, mainly OKE and OCIR.</p>
<h3 id="prerequisites">Prerequisites!</h3>
<p>Make sure that you have Docker installed on your machine. If not, you can either install Docker Desktop for Mac and Windows, or Docker Engine for Linux. This <a href="https://docs.docker.com/install/?ref=aboullaite.me">link</a> describes the necessary steps to guide you through.</p>
<p>Additionally, since we&apos;ll be interacting with K8S, we need the Kubernetes command-line tool: <code>kubectl</code>. The complete guide on how to install and configure <code>kubectl</code> can be found <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/?ref=aboullaite.me">here</a>.</p>
<h3 id="installingskaffold">Installing Skaffold</h3>
<p>Installing Skaffold is pretty straightforward. Below the details to configure  Skaffold on Mac, Windows and Linux:</p>
<h4 id="mac">Mac</h4>
<p>If you&apos;re familiar with <a href="https://brew.sh/?ref=aboullaite.me">Homebrew</a>, just run <code>brew install skaffold</code> to set up Skaffold on your machine. Otherwise, run the commands below in your terminal, which download the binary and place it in the <code>/usr/local/bin</code> folder:</p>
<pre><code>curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-darwin-amd64
chmod +x skaffold
sudo mv skaffold /usr/local/bin
</code></pre>
<h4 id="linux">Linux</h4>
<p>Linux users can run the following commands to install and configure Skaffold:</p>
<pre><code>curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64
chmod +x skaffold
sudo mv skaffold /usr/local/bin
</code></pre>
<h4 id="windows">Windows</h4>
<p>If you&apos;re using Windows, you need to download the <code>.exe</code> file from <a href="https://storage.googleapis.com/skaffold/releases/latest/skaffold-windows-amd64.exe?ref=aboullaite.me">here</a> and place it under your <code>PATH</code> folder.</p>
<p>More details can be found on <a href="https://skaffold.dev/docs/install/?ref=aboullaite.me"> Skaffold&apos;s Getting Started Guide page</a>.</p>
<h3 id="oraclecloudconfiguration">Oracle Cloud configuration</h3>
<p>Since you&apos;re reading this, I suppose you already have an Oracle Cloud account! If not, head over to the <a href="https://www.oracle.com/cloud/free/?ref=aboullaite.me">Always Free Services</a> page and create one. Yes, it&apos;s free... forever (at least for now)!<br>
Once done, we need to set up a Kubernetes cluster and a container registry. This can easily be done by accessing <strong>Developer services</strong> from the side menu, under <strong>Solutions and Platform</strong>, where you can create and configure your OKE cluster and private OCIR! Detailed step-by-step guides with an in-depth description of the process can be found <a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/oci/oke-full/index.html?ref=aboullaite.me">here</a> and <a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/oci/registry/index.html?ref=aboullaite.me">here</a>.</p>
<h4 id="okeconfig">OKE config</h4>
<p>Like any cloud service provider, Oracle Cloud has its own command-line tool to work with and manage Oracle Cloud services. Make sure to install and configure it by following this <a href="https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm?ref=aboullaite.me">link</a>.</p>
<p><img src="https://aboullaite.me/content/images/2020/03/Screen-Shot-2020-03-06-at-12.33.22-AM.jpg" alt loading="lazy"><br>
Once the OCI CLI setup is complete, go to your OKE cluster page and hit the <strong>Access Kubeconfig</strong> button at the top of the page. Following the instructions will help you create the correct <code>kubectl</code> configuration. It is worth mentioning that the OCI CLI works with multiple contexts, which means it will keep your previous <code>kubeconfig</code> intact while adding/merging the new config into it. This can easily be verified by running <code>kubectl config view</code> to check the <code>kubeconfig</code> settings, or <code>kubectl config get-contexts</code> to list all your contexts.</p>
<p>The last step is to set the default context to your OKE cluster by running: <code>kubectl config use-context &lt;oke-cluster-id&gt;</code></p>
<h4 id="ocirconfig">OCIR config</h4>
<p>By now, you should have created your private container registry in Oracle Cloud. Make sure it&apos;s private; even if not mandatory, that is how things should be from a security and enterprise perspective.<br>
<strong>OCIR</strong> stands for Oracle Cloud Infrastructure Registry. It&apos;s basically an Oracle-managed registry for your Docker container images. You can read more about Docker registries <a href="https://docs.docker.com/registry/introduction/?ref=aboullaite.me">here</a>.</p>
<p>Since our OCIR is private, we need to configure a token to access it for both pushing and pulling our containers images. Head over to your Oracle Cloud console page, click <strong>User Settings</strong> under your profile image, hit the <strong>Auth Tokens</strong> page and then click the <strong>Generate Token</strong> button. Carefully note down the generated token as we will need it in the next steps.</p>
<p>Afterwards, we need to make sure that we can access our registry with the generated token. For that, we log in to OCIR from the Docker CLI by typing the following command in your terminal:<br>
<code>docker login &lt;region-key&gt;.ocir.io</code></p>
<p>where <code>&lt;region-key&gt;</code> is the key for the Oracle Cloud Infrastructure Registry region you&apos;re using. This <a href="https://docs.cloud.oracle.com/en-us/iaas/Content/General/Concepts/regions.htm?ref=aboullaite.me">link</a> contains a list of oracle cloud region keys.</p>
<p>You will be prompted to provide a username and password! The username follows the format: <code>&lt;tenancy-namespace&gt;/&lt;username&gt;</code>. If your tenancy is federated with Oracle Identity Cloud Service, use the format <code>&lt;tenancy-namespace&gt;/oracleidentitycloudservice/&lt;username&gt;</code>.  Note that <code>tenancy-namespace</code> is the auto-generated Object Storage namespace string of the tenancy containing the repository from which the application is to pull the image. The password is the <strong>auth token</strong> you copied earlier.<br>
If everything is fine, you should get a <code>Login Succeeded</code> message. If the login fails, try to verify and repeat the step above.</p>
<p>Now that we&apos;re sure that the registry is accessible, we create a <code>Secret</code> that will be used in our K8S manifests to pull the image from it! This can be achieved by running:</p>
<pre><code class="language-shell">## An email address is required, but it doesn&apos;t matter what you specify
$ kubectl create secret docker-registry &lt;secret-name&gt; --docker-server=&lt;region-key&gt;.ocir.io --docker-username=&apos;&lt;tenancy-namespace&gt;/&lt;oci-username&gt;&apos; --docker-password=&apos;&lt;oci-auth-token&gt;&apos; --docker-email=&apos;&lt;email-address&gt;&apos;
</code></pre>
<h3 id="helloworld">Hello World!</h3>
<p>To put everything together, we&apos;ll use an example from the Skaffold samples to check our setup. The example can be found <a href="https://github.com/GoogleContainerTools/skaffold/tree/master/examples/getting-started?ref=aboullaite.me">here</a>. The folder contains a single-file Go application that prints <code>Hello World!</code> every second. To containerize the app, the Dockerfile uses the multi-stage build feature: the app is built in the first stage (builder), and the generated binary is copied and run in the second/production stage.</p>
<p>The example also provides a simple <code>k8s-pod.yaml</code> to run the app in the K8S cluster. This file needs to be updated to specify the Docker secret created to access OCIR, using <code>imagePullSecrets</code>. Below is the updated file:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: getting-started
spec:
  containers:
  - name: getting-started
    image: skaffold-example
  imagePullSecrets:
  - name: ocirsecret
</code></pre>
<p>Finally, you can either change the <code>skaffold.yaml</code> file to match the new registry, or use the <code>--default-repo</code> flag to prefix the image name with the OCIR registry, with no manual YAML editing! The Skaffold config file contains multiple stages specifying the steps to build and deploy your application. More details can be found <a href="https://skaffold.dev/docs/pipeline-stages/?ref=aboullaite.me">here</a>.</p>
<p>Now, you can continuously develop, deploy and test your changes using:</p>
<pre><code>$ skaffold dev --default-repo=&lt;region-key&gt;.ocir.io/tenancy-namespace&gt;/&lt;project-id&gt;
</code></pre>
<p>You can make changes to the <code>main.go</code> file, and Skaffold will build a new image, push it to OCIR, deploy it on OKE and print the logs for you!<br>
<img src="https://aboullaite.me/content/images/2020/03/Screen-Shot-2020-03-06-at-4.24.43-AM.png" alt loading="lazy"></p>
<hr>
<p>Resources:</p>
<ul>
<li><a href="https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga?ref=aboullaite.me">https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga</a></li>
<li><a href="https://docs.cloud.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypullingimagesfromocir.htm?ref=aboullaite.me">https://docs.cloud.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypullingimagesfromocir.htm</a></li>
<li><a href="https://www.ateam-oracle.com/continuous-deployments-with-skaffold-on-oracle-cloud-infrastructure-container-engine-for-kubernetes-oke?ref=aboullaite.me">https://www.ateam-oracle.com/continuous-deployments-with-skaffold-on-oracle-cloud-infrastructure-container-engine-for-kubernetes-oke</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[5 reasons to attend DevNexus 2020]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We are writing these lines from  2 different beautiful cities; Brussels for me where I&apos;m attending, for the first time the amazing <a href="https://fosdem.org/2020/?ref=aboullaite.me">FOSDEM</a>; and Copenhagen for my dear friend, java champion and boss <a href="https://twitter.com/badrelhouari?ref=aboullaite.me">Badr El Houari</a> after participating in <a href="https://jspirit.org/?ref=aboullaite.me">jspirit unconference</a>, as he likes to describe himself lately:</p>]]></description><link>https://aboullaite.me/5-reasons-to-attend-devnexus-2020/</link><guid isPermaLink="false">6460c00ecda49600011ebb2f</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 02 Feb 2020 18:19:05 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>We are writing these lines from  2 different beautiful cities; Brussels for me where I&apos;m attending, for the first time the amazing <a href="https://fosdem.org/2020/?ref=aboullaite.me">FOSDEM</a>; and Copenhagen for my dear friend, java champion and boss <a href="https://twitter.com/badrelhouari?ref=aboullaite.me">Badr El Houari</a> after participating in <a href="https://jspirit.org/?ref=aboullaite.me">jspirit unconference</a>, as he likes to describe himself lately: unconference advocate!</p>
<p>Our next conference this year will be <a href="https://www.jfokus.se/?ref=aboullaite.me">Jfokus</a> in Stockholm, then I&apos;ll fly to Atlanta to attend a conference I&apos;ve wanted to join for a while: <a href="https://devnexus.com/?ref=aboullaite.me">DevNexus</a>.</p>
<p>In this post, we share with you the top 5 reasons why I&apos;m attending, and why you should join me at this year&apos;s Devnexus.</p>
<h3 id="jugleaderssummit">JUG Leaders Summit</h3>
<p>I am attending DevNexus not as a speaker (unfortunately), but as a JUG leader representing MoroccoJUG (a great honor). This year, the DevNexus team is organizing a GLOBAL JUG LEADERS SUMMIT!</p>
<p>MoroccoJUG is the only active JUG in Morocco, a previous member of the JCP, and has been at the forefront of Adopt-a-JSR from the very start. In fact, the JUG was recognized as an &quot;Outstanding Adopt-a-JSR Participant&quot; for its contributions to Java EE 7.</p>
<p>The JUG leaders summit is an amazing opportunity to meet fellow JUG leaders, discuss common challenges, exchange ideas, give feedback, and pick up tips on how to build an engaged community and run successful events.</p>
<p>The JUG leaders summit is organized during the first day of the conference, Feb 19, and I am already super excited to be part of it :)</p>
<h3 id="byjavacommunityforjavacommunity">By Java Community for Java community</h3>
<p>Devnexus is the largest independent Java platform conference in the USA, run by the <a href="https://ajug.org/?ref=aboullaite.me">Atlanta JUG</a>. Devnexus has grown to an annual attendance of over 2,000 software developers and has become one of the leading technology events held annually around the globe.</p>
<p>I heard a lot of cool things about the conference and how the organizers aim to connect developers from all over the world and promote open-source values. Besides the technical talks, there will be many opportunities at the conference to meet with the community and to network.</p>
<h3 id="meetingtheusualsuspect">Meeting the usual suspect</h3>
<p>Usual suspect are everywhere! That&apos;s a fact. But Devnexus is one of the annual rendezvous for many usual suspect to meet, hang out, share knowledge, learn from each other, and have fun. Never underestimate the power of a little fun mixed with some interesting people!</p>
<p><img src="https://media.giphy.com/media/b2omCv2khTGiA/giphy.gif" alt loading="lazy"></p>
<p>I typically spend as much time talking to people in the hallways as I do attending talks. It&apos;s a great way to build new relationships and make connections with attendees from diverse backgrounds who have a lot to share.</p>
<h3 id="greatspeakerlineup">Great speaker lineup</h3>
<p>With many rock-star <a href="https://devnexus.com/speakers/?ref=aboullaite.me">speakers</a>, 14 concurrent tracks and <a href="https://devnexus.com/schedule?ref=aboullaite.me">150+ individual sessions</a>, Devnexus brings participants unparalleled opportunities both to learn about the latest technology trends and to dive deep into the technologies that interest them.</p>
<p>The sheer amount of content at Devnexus is nothing less than astounding! With so many diverse sessions happening simultaneously from early morning until late evening, covering a wide range of technology trends, there&#x2019;ll be something you will learn and take away from this conference no matter what.</p>
<h3 id="greatlocation">Great Location</h3>
<p>Atlanta is the No. 1 filming location for movies and TV shows in the world, according to <a href="https://www.filmla.com/?ref=aboullaite.me">FilmL.A</a>. It&apos;s a city that has many people buzzing: millions swing by Georgia&apos;s capital every year to feel its historical significance and get a taste of its vibrant culture.</p>
<p>Devnexus is an opportunity for me to visit Atlanta and discover its southern charm, dynamic culture and rich history.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 new features: Records]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the second article in the blog post series discussing the new features introduced in java 14. Today&apos;s article is focused on <code>Records</code> that aims to provide a compact &amp; concise way for declaring data classes.</p>
<h5 id="java14rewfeaturesarticles">Java 14 rew features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp;</a></li></ul>]]></description><link>https://aboullaite.me/java-14-records/</link><guid isPermaLink="false">6460c00ecda49600011ebb2e</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 29 Jan 2020 15:56:04 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the second article in the blog post series discussing the new features introduced in java 14. Today&apos;s article is focused on <code>Records</code> that aims to provide a compact &amp; concise way for declaring data classes.</p>
<h5 id="java14rewfeaturesarticles">Java 14 rew features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-se-jfrs">Switch Expressions, JFR Event Streaming and more</a></li>
<li><a href="https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/">Text Blocks &amp; Foreign-Memory Access API</a></li>
</ul>
<h3 id="why">Why ?</h3>
<blockquote>
<p>Java is verbose!</p>
</blockquote>
<p>You&apos;ve surely heard this statement before: from your colleagues, at a conference, probably in meetups, or on Twitter or Reddit!<br>
<a href="https://twitter.com/BrianGoetz?ref=aboullaite.me">Brian Goetz</a>, Java Language Architect at Oracle, wrote a detailed post on the matter, stating that, for example, developers who want to create simple data-carrier classes that are easy to understand have to write a lot of low-value, repetitive, error-prone code: <code>constructors</code>, <code>accessors</code>, <code>equals()</code>, <code>hashCode()</code>, <code>toString()</code>...<br>
To avoid the frustration, some rely on IDE capabilities to do the legwork of writing the boilerplate, but fail to consider much beyond the functionality of the code itself to help the reader distill the design intent. Others use libraries such as <a href="https://projectlombok.org/?ref=aboullaite.me">Lombok</a>, while the lazy ones simply omit those methods, leading to surprising behavior and poor debuggability.</p>
<h3 id="anewtypedeclarationrecord">A new type declaration: Record!</h3>
<p>Records are a special kind of lightweight class in Java, intended to be simple data carriers, similar to what exists in other languages (such as <code>case</code> classes in Scala, <code>data</code> classes in Kotlin and <code>record</code> classes in C#). The aim is to extend the Java language syntax and create a way to say that a type represents only data. By making this statement, we&apos;re telling the compiler to do all the work for us and generate the methods without any extra effort on our part.</p>
<h4 id="showmethecode">Show me the code</h4>
<p>Let&apos;s start with the following <code>Person</code> record:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){}
</code></pre>
<p>The record class is an immutable, transparent carrier for a fixed set of fields known as the record <code>components</code>, which provide a <code>state</code> description for the record. Each component gives rise to a <code>final</code> field that holds the provided value and an <code>accessor</code> method to retrieve it. The field name and the accessor name match the name of the component.</p>
<p>Let&apos;s now try to compile the <code>Person</code> class. Since records are still a <code>preview language feature</code>, we need to enable the preview flag:</p>
<pre><code>javac --enable-preview -source 14 Person.java
</code></pre>
<p>Now if we examine the class file with <code>javap</code>, we can see that the compiler has <strong>autogenerated</strong> a bunch of boilerplate code:</p>
<pre><code>$ javap Person
Compiled from &quot;Person.java&quot;
public final class Person extends java.lang.Record {
  public Person(java.lang.String, java.lang.String, int, java.lang.String, java.util.Date);
  public java.lang.String toString();
  public final int hashCode();
  public final boolean equals(java.lang.Object);
  public java.lang.String firstName();
  public java.lang.String lastName();
  public int age();
  public java.lang.String address();
  public java.util.Date birthday();
}
</code></pre>
<p>Notice a few things here:</p>
<ul>
<li>a private <code>final</code> field, with the same name and type, for each component in the state description;</li>
<li>a public read <code>accessor</code> method, with the same name and type, for each component in the state description;</li>
<li>a public <code>constructor</code>, whose signature is the same as the state description, which initializes each field from the corresponding argument;</li>
<li>implementations of <code>equals</code> and <code>hashCode</code> that say two records are equal if they are of the same type and contain the same state;</li>
<li>implementation of <code>toString</code> that includes all the components, with their names.</li>
</ul>
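<p>To see these generated members in action, here is a minimal, self-contained sketch. It uses a simplified two-component record (a hypothetical <code>Point</code>, not the article&apos;s <code>Person</code>) just to keep the output short:</p>
<pre><code class="language-java">public class RecordDemo {
    // A simplified record with two components
    record Point(int x, int y) {}

    public static void main(String[] args) {
        Point p1 = new Point(3, 4);
        Point p2 = new Point(3, 4);

        System.out.println(p1.x());        // generated accessor: 3
        System.out.println(p1);            // generated toString: Point[x=3, y=4]
        System.out.println(p1.equals(p2)); // generated state-based equals: true
    }
}
</code></pre>
<p>On JDK 14 this compiles with <code>--enable-preview</code>; on recent JDKs records are a standard feature and no flag is needed.</p>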
<p>Looking further and examining the bytecode, we notice that <code>hashCode</code>, <code>equals</code> and <code>toString</code> all rely on <code>invokedynamic</code> to dynamically invoke the appropriate method containing the implicit implementation.</p>
<pre><code> public java.lang.String toString();
    descriptor: ()Ljava/lang/String;
    flags: (0x0001) ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokedynamic #32,  0             // InvokeDynamic #0:toString:(LPerson;)Ljava/lang/String;
         6: areturn
      LineNumberTable:
        line 2: 0

  public final int hashCode();
    descriptor: ()I
    flags: (0x0011) ACC_PUBLIC, ACC_FINAL
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokedynamic #36,  0             // InvokeDynamic #0:hashCode:(LPerson;)I
         6: ireturn
      LineNumberTable:
        line 2: 0

  public final boolean equals(java.lang.Object);
    descriptor: (Ljava/lang/Object;)Z
    flags: (0x0011) ACC_PUBLIC, ACC_FINAL
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: aload_1
         2: invokedynamic #40,  0             // InvokeDynamic #0:equals:(LPerson;Ljava/lang/Object;)Z
         7: ireturn
      LineNumberTable:
        line 2: 0
</code></pre>
<h3 id="canidefineadditionalmethodsfields">Can I define additional methods, fields...</h3>
<p>The short answer to this question is yes, you can add static fields and methods! The real question, however, is: should you?<br>
Keep in mind that the goal behind records is to let developers group related fields together as a single immutable data item without having to write verbose code. So whenever you feel the temptation to add more fields or methods to your <code>record</code>, consider whether a full class makes more sense and should be used instead.<br>
For example, we can define a method that returns a <code>Person</code>&apos;s full name:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){
    public String fullName(){
        return firstName + &quot; &quot; + lastName;
    }
}
</code></pre>
<h4 id="compactconstructor">Compact constructor</h4>
<p>Additionally, records introduce the <code>compact constructor</code>, where only validation and/or normalization code needs to be given in the constructor body. The remaining initialization code is supplied by the compiler.<br>
For example, if we want to validate a <code>Person</code>&apos;s age to make sure it&apos;s not negative, the code would look similar to:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){
    public Person{
        if (age &lt; 0) {
            throw new IllegalArgumentException(&quot;Age must be greater than 0!&quot;);
        }
    }
}
</code></pre>
<p>Notice that no explicit parameter list is given for the compact constructor; it is derived from the record component list.</p>
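<p>As a quick sanity check, a hypothetical caller passing a negative age now fails fast, because the validation runs before any field is assigned:</p>
<pre><code class="language-java">import java.util.Date;

public class CompactConstructorDemo {
    // Same Person record as above, repeated here so the sketch is self-contained
    record Person(String firstName, String lastName, int age,
                  String address, Date birthday) {
        Person {
            if (age &lt; 0) {
                throw new IllegalArgumentException(&quot;Age must be greater than 0!&quot;);
            }
        }
    }

    public static void main(String[] args) {
        try {
            new Person(&quot;Jane&quot;, &quot;Doe&quot;, -1, &quot;Atlanta&quot;, new Date());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Age must be greater than 0!
        }
    }
}
</code></pre>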
<h3 id="finalword">Final word</h3>
<p>Records address a common issue with using classes as wrappers for data. Plain data classes are significantly reduced from several lines of code to a one-liner.<br>
Keep in mind that Records are a preview language feature, which means that, although it is fully implemented, it is not yet standardized in the JDK.</p>
<hr>
<h5 id="ressources">Ressources:</h5>
<ul>
<li><a href="https://openjdk.java.net/jeps/359?ref=aboullaite.me">https://openjdk.java.net/jeps/359</a></li>
<li><a href="https://cr.openjdk.java.net/~briangoetz/amber/datum.html?ref=aboullaite.me">https://cr.openjdk.java.net/~briangoetz/amber/datum.html</a></li>
<li><a href="http://cr.openjdk.java.net/~gbierman/jep359/jep359-20191125/specs/records-jls.html?ref=aboullaite.me#jls-8.10.5">http://cr.openjdk.java.net/~gbierman/jep359/jep359-20191125/specs/records-jls.html#jls-8.10.5</a></li>
<li><a href="https://blogs.oracle.com/javamagazine/records-come-to-java?ref=aboullaite.me">https://blogs.oracle.com/javamagazine/records-come-to-java</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Tweets Sentiment Analysis using Stanford CoreNLP]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We&apos;re living in an era where <a href="https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data?ref=aboullaite.me">data become the most valuable resource</a>! Nearly every app in the market now, tries to understand its users, their behaviours, preferences, reactions and words! How many times, just after mentioning a watch &#x231A; in a private conversation with your friend on messenger,</p>]]></description><link>https://aboullaite.me/stanford-corenlp-java/</link><guid isPermaLink="false">6460c00ecda49600011ebb2d</guid><category><![CDATA[Java]]></category><category><![CDATA[Spring Boot]]></category><category><![CDATA[NLP]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 08 Jan 2020 13:04:30 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>We&apos;re living in an era where <a href="https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data?ref=aboullaite.me">data become the most valuable resource</a>! Nearly every app in the market now, tries to understand its users, their behaviours, preferences, reactions and words! How many times, just after mentioning a watch &#x231A; in a private conversation with your friend on messenger, your Facebook feed starts popping up ads about watches from different vendors?! It Happens EVERY single time!</p>
<p>Understanding this kind of data, classifying it and representing it is the challenge that Natural Language Processing (NLP) tries to solve.<br>
In this article, I describe how I built a small application to perform sentiment analysis on tweets, using <a href="https://stanfordnlp.github.io/CoreNLP?ref=aboullaite.me">Stanford CoreNLP library</a>, <a href="http://twitter4j.org/?ref=aboullaite.me">Twitter4J</a>,  <a href="https://spring.io/projects/spring-boot?ref=aboullaite.me">Spring Boot</a> and <a href="https://reactjs.org/?ref=aboullaite.me">ReactJs</a>! The code is available on <a href="https://github.com/aboullaite/sentiment-analysis?ref=aboullaite.me">GitHub</a>.<br>
<img src="https://aboullaite.me/content/images/2020/01/sentiment-analysys-twitter-1.gif" alt loading="lazy"></p>
<h3 id="application">Application</h3>
<p>For everything related to machine learning, Java is generally not a popular choice. However, given the language&apos;s popularity, there are libraries and frameworks for pretty much everything!<br>
The application uses the Stanford CoreNLP library&apos;s Java API to analyse tweets extracted via the <a href="http://twitter4j.org/?ref=aboullaite.me">Twitter4J</a> library. The backend server is developed using Spring Boot, and the frontend is built using ReactJS.<br>
As its main functionality, the application lets you, based on a keyword, either analyse and classify live Twitter stream data, or perform a search and post-analyse the tweets. The default behaviour is streaming mode, but you can easily switch to search mode with the click of a button!</p>
<h4 id="stanfordcorenlp">Stanford CoreNLP</h4>
<p>The Stanford CoreNLP is a Java natural language analysis library that provides statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, and can be incorporated into applications with human language technology needs.</p>
<p>Stanford CoreNLP integrates many NLP tools, including the <code>part-of-speech</code> (POS) tagger, the <code>named entity recognizer</code> (NER), the <code>parser</code>, the <code>coreference resolution</code> system and the <code>sentiment analysis</code> tools, and provides model files for analysing multiple languages.</p>
<p>The snippet below shows <code>analyse(String tweet)</code> method from <code>SentimentAnalyzerService</code> class which runs sentiment analysis on a single tweet, scores it from 0 to 4 based on whether the analysis comes back with <code>Very Negative</code>, <code>Negative</code>, <code>Neutral</code>, <code>Positive</code> or <code>Very Positive</code> respectively.</p>
<pre><code class="language-java">public int analyse(String tweet) {

        Properties props = new Properties();
        props.setProperty(&quot;annotators&quot;, &quot;tokenize, ssplit, pos, parse, sentiment&quot;);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = pipeline.process(tweet);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            return RNNCoreAnnotations.getPredictedClass(tree);
        }
        return 0;
    }
</code></pre>
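<p>The integer returned by <code>analyse</code> corresponds directly to CoreNLP&apos;s five sentiment classes. A small hypothetical helper (not part of the project&apos;s code) makes the mapping explicit, using a Java 14 switch expression:</p>
<pre><code class="language-java">public class SentimentLabel {
    // Maps the class predicted by the model (0 to 4) to its sentiment label
    static String label(int score) {
        return switch (score) {
            case 0 -&gt; &quot;Very Negative&quot;;
            case 1 -&gt; &quot;Negative&quot;;
            case 2 -&gt; &quot;Neutral&quot;;
            case 3 -&gt; &quot;Positive&quot;;
            case 4 -&gt; &quot;Very Positive&quot;;
            default -&gt; throw new IllegalArgumentException(&quot;Unknown class: &quot; + score);
        };
    }

    public static void main(String[] args) {
        System.out.println(label(3)); // Positive
    }
}
</code></pre>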
<h4 id="fetchingtweets">Fetching Tweets</h4>
<p>I made use of the popular open source java library Twitter4J to fetch tweets. It provides a convenient API for accessing the <a href="https://developer.twitter.com/en/docs?ref=aboullaite.me">Twitter API</a>.<br>
The <code>TwitterService</code> class contains the main methods interacting with Twitter API to search for tweets based on keywords:</p>
<ul>
<li><code>fetchTweets</code> builds a <code>Query</code> to search for tweets containing a specific keyword. It has a second parameter, <code>count</code>, which specifies the number of tweets to return per page, up to a maximum of 100. I also filter the search results to make sure no retweets or replies are returned.</li>
</ul>
<pre><code class="language-java">public Flux&lt;TwitterStatus&gt; fetchTweets(String keyword, int count) throws TwitterException {
        Twitter twitter = this.config.twitter(this.config.twitterFactory());
        Query query = new Query(keyword.concat(&quot; -filter:retweets -filter:replies&quot;));
        query.setCount(count);
        query.setLocale(&quot;en&quot;);
        query.setLang(&quot;en&quot;);
        return Flux.fromStream( twitter.search(query).getTweets().stream()).map(status -&gt; this.cleanTweets(status));

    }
</code></pre>
<ul>
<li><code>streamTweets</code> collects live tweets matching a specific keyword.</li>
</ul>
<pre><code class="language-java">    public Flux&lt;TwitterStatus&gt; streamTweets(String keyword){
        TwitterStream stream = config.twitterStream();
        FilterQuery tweetFilterQuery = new FilterQuery();
        tweetFilterQuery.track(new String[]{keyword});
        tweetFilterQuery.language(new String[]{&quot;en&quot;});
        return Flux.create(sink -&gt; {
            stream.onStatus(status -&gt; sink.next(this.cleanTweets(status)));
            stream.onException(sink::error);
            stream.filter(tweetFilterQuery);
            sink.onCancel(stream::shutdown);
        });
    }
</code></pre>
<p>Both methods fetch only tweets in English and return a <a href="https://github.com/reactor/reactor-core?ref=aboullaite.me">reactor</a> <code>Flux</code>, capable of emitting a stream of 0 or more items and then optionally either completing or erroring.</p>
<p>You may have noticed the call to <code>cleanTweets</code> before passing the tweets to the analyzer service. This method performs some cleanup on the tweet text, removing unneeded elements like links, hashtags, usernames ...</p>
<pre><code class="language-java">    private TwitterStatus cleanTweets(Status status){
        TwitterStatus twitterStatus = new TwitterStatus(status.getCreatedAt(), status.getId(), status.getText(), null, status.getUser().getName(), status.getUser().getScreenName(), status.getUser().getProfileImageURL());
        // Clean up tweets
        String text = status.getText().trim()
                // remove links
                .replaceAll(&quot;http.*?[\\S]+&quot;, &quot;&quot;)
                // remove usernames
                .replaceAll(&quot;@[\\S]+&quot;, &quot;&quot;)
                // replace hashtags by just words
                .replaceAll(&quot;#&quot;, &quot;&quot;)
                // correct all multiple white spaces to a single white space
                .replaceAll(&quot;[\\s]+&quot;, &quot; &quot;);
        twitterStatus.setText(text);
        twitterStatus.setSentimentType(analyzerService.analyse(text));
        return twitterStatus;
    }
</code></pre>
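<p>To illustrate, the same regex chain can be run standalone on a made-up sample tweet (the sample text and class name are just for the example):</p>
<pre><code class="language-java">public class TweetCleanupDemo {
    // The cleanup chain from cleanTweets, extracted into a standalone helper
    static String clean(String text) {
        return text.trim()
                .replaceAll(&quot;http.*?[\\S]+&quot;, &quot;&quot;) // remove links
                .replaceAll(&quot;@[\\S]+&quot;, &quot;&quot;)       // remove usernames
                .replaceAll(&quot;#&quot;, &quot;&quot;)             // keep hashtag words, drop the &apos;#&apos;
                .replaceAll(&quot;[\\s]+&quot;, &quot; &quot;);      // collapse multiple white spaces
    }

    public static void main(String[] args) {
        // Link, username and hashtag marker are stripped from the sample
        System.out.println(clean(&quot;Loving #Java at @devnexus https://devnexus.com&quot;));
    }
}
</code></pre>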
<h4 id="showingtheanalyzeddata">Showing the analyzed data</h4>
<p>Now that we have our backend service ready, the final step is to consume our resources. Both endpoints implement SSE (Server-Sent Events), an HTTP standard that allows a web application to handle a unidirectional event stream and receive updates whenever the server emits data.</p>
<p>I used ReactJs with Typescript to build the Web UI components and consume the exposed REST endpoints. The main component is <code>TweetList</code>, which handles the calls and shares data with the other components.</p>
<p>Once loaded, the component opens an event stream with the server, calling the <code>/stream</code> endpoint, looking for all tweets containing the <code>Java</code> keyword and saving them into an array. It runs the effect and cleans it up only once.</p>
<pre><code class="language-javascript">React.useEffect(() =&gt; {
    const eventSource = new EventSource(
      state.API_URL + &quot;stream/&quot; + state.hashtag
    );
    eventSource.onmessage = (event: any) =&gt; {
      const tweet = JSON.parse(event.data);
      let tweets = [...state.tweets, tweet];
      setState({ ...state, tweets: tweets });
    };
    eventSource.onerror = (event: any) =&gt; eventSource.close();
    setState({ ...state, eventSource: eventSource });
    return () =&gt; eventSource.close();
  }, []);
</code></pre>
<p>It keeps adding tweets to the array whenever a message is received from the server. This effect runs whenever <code>tweets</code>, <code>eventSource</code> or <code>hashtag</code> change.</p>
<pre><code class="language-javascript">  React.useEffect(() =&gt; {
    if (state.eventSource) {
      state.eventSource.onmessage = (event: any) =&gt; {
        const tweet = JSON.parse(event.data);
        let tweets = [...state.tweets, tweet];
        setState({ ...state, tweets: tweets });
      };
    }
  }, [state.tweets, state.eventSource, state.hashtag]);
</code></pre>
<p>Finally, the render function looks like below:</p>
<pre><code class="language-javascript">return (
    &lt;Row&gt;
      &lt;Col xs={12} md={8}&gt;
        &lt;Col md={10}&gt;
          &lt;h2&gt;
            Tracked Keyword:
            &lt;Badge variant=&quot;secondary&quot;&gt;{state.hashtag}&lt;/Badge&gt;
          &lt;/h2&gt;
        &lt;/Col&gt;
        &lt;Col md={2}&gt;
          &lt;Spinner animation=&quot;grow&quot; variant=&quot;primary&quot; /&gt;
        &lt;/Col&gt;
        &lt;form
          onSubmit={e =&gt; {
            e.preventDefault();
          }}
        &gt;
          &lt;div className=&quot;input-group mb-3&quot;&gt;
            &lt;input
              type=&quot;text&quot;
              name=&quot;hashtag&quot;
              value={state.hashtag}
              onChange={e =&gt; setState({ ...state, hashtag: e.target.value })}
              className=&quot;form-control&quot;
              placeholder={state.hashtag}
              aria-label={state.hashtag}
              aria-describedby=&quot;basic-addon2&quot;
            /&gt;
            &lt;div className=&quot;input-group-append&quot;&gt;
              &lt;Button
                variant=&quot;outline-primary&quot;
                type=&quot;submit&quot;
                onClick={() =&gt; {
                  setState({
                    ...state,
                    eventSource: newSearch(true, state, setState),
                    tweets: []
                  });
                }}
              &gt;
                Stream
              &lt;/Button&gt;
              &lt;Button
                variant=&quot;primary&quot;
                type=&quot;submit&quot;
                onClick={() =&gt; {
                  setState({
                    ...state,
                    eventSource: newSearch(false, state, setState),
                    tweets: []
                  });
                }}
              &gt;
                Search
              &lt;/Button&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/form&gt;
        &lt;div id=&quot;tweets&quot;&gt;
          {tweets
            .filter(tweet =&gt; tweet !== undefined)
            .reverse()
            .slice(0, 49)
            .map((tweet: Tweet) =&gt; (
              &lt;Alert
                key={tweet.id}
                variant={sentiment[tweet.sentimentType] as &quot;success&quot;}
              &gt;
                &lt;Alert.Heading&gt;
                  &lt;img src={tweet.profileImageUrl} /&gt;
                  &lt;a
                    href={&quot;https://twitter.com/&quot; + tweet.screenName}
                    className=&quot;text-muted&quot;
                  &gt;
                    {tweet.userName}
                  &lt;/a&gt;
                &lt;/Alert.Heading&gt;
                {tweet.originalText}
                &lt;hr /&gt;
                &lt;p className=&quot;mb-0&quot;&gt;
                  &lt;Moment fromNow&gt;{tweet.createdAt}&lt;/Moment&gt;
                &lt;/p&gt;
              &lt;/Alert&gt;
            ))}
        &lt;/div&gt;
      &lt;/Col&gt;
      &lt;Col xs={4} md={4}&gt;
        &lt;Desc tweets={tweets.length} /&gt;
        &lt;Doughnut tweets={tweets} /&gt;
        &lt;Color /&gt;
      &lt;/Col&gt;
    &lt;/Row&gt;
  );
</code></pre>
<h4 id="runningtheapp">Running the app</h4>
<p>Now, before running the app, make sure to update the <code>application.yaml</code> file with the required authentication keys that allow you to authenticate correctly when calling the Twitter API to retrieve tweets. You probably need to create a <a href="https://developer.twitter.com/?ref=aboullaite.me">Twitter developer account</a> and create an application.<br>
Afterward, start the backend server using <code>mvn spring-boot:run</code> and the frontend with <code>npm start</code>.</p>
<p>That&apos;s it folks! If you have any remarks or suggestions, leave them in the comments below or file a <a href="https://github.com/aboullaite/sentiment-analysis?ref=aboullaite.me">GitHub issue</a>.</p>
<hr>
<h4 id="ressource">Ressource:</h4>
<ul>
<li><a href="https://www.quora.com/How-does-the-sentiment-analysis-in-Stanford-NLP-work-Is-there-a-way-for-Stanford-NLP-to-take-the-overall-sentiment-of-multiple-sentences?ref=aboullaite.me">https://www.quora.com/How-does-the-sentiment-analysis-in-Stanford-NLP-work-Is-there-a-way-for-Stanford-NLP-to-take-the-overall-sentiment-of-multiple-sentences</a></li>
<li><a href="https://blog.openshift.com/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java/?ref=aboullaite.me">https://blog.openshift.com/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java/</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>