<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Laytoun' thoughts!]]></title><description><![CDATA[Failure sucks but instructs!]]></description><link>https://aboullaite.me/</link><image><url>https://aboullaite.me/favicon.png</url><title>Laytoun&apos; thoughts!</title><link>https://aboullaite.me/</link></image><generator>Ghost 5.47</generator><lastBuildDate>Mon, 06 Apr 2026 10:35:17 GMT</lastBuildDate><atom:link href="https://aboullaite.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Building RAG Update: Hybrid Search, Reranking & Production Hardening]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When I published the original series in November 2025, I was happy with where the system landed. It had semantic caching, fallback strategies, distributed tracing, autoscaling, solid production patterns throughout. But as I kept working with it and preparing a talk around the same material, I kept spotting areas where</p>]]></description><link>https://aboullaite.me/rag-revisited-2026/</link><guid isPermaLink="false">69bfad8096cd710001aeb6e5</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:14:17 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2026/03/blog-cover-rag-update.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2026/03/blog-cover-rag-update.jpg" alt="Building RAG Update: Hybrid Search, Reranking &amp; Production Hardening"><p>When I published the original series in November 2025, I was happy with where the system landed. It had semantic caching, fallback strategies, distributed tracing, autoscaling, solid production patterns throughout. 
But as I kept working with it and preparing a talk around the same material, I kept spotting areas where the system could go further.</p>
<p>Four months later, I finally made those improvements. This post covers what changed, why, and what I learned along the way.</p>
<h2 id="where-were-taking-things-further">Where We&apos;re Taking Things Further</h2>
<p>The original system was functional and resilient, but there were natural next steps I&apos;d been thinking about since day one:</p>
<ol>
<li>
<p><strong>From fallback to fusion.</strong> The system had both Weaviate (vector) and OpenSearch (lexical), but OpenSearch only kicked in when Weaviate <em>failed</em>. The natural evolution: combine their results to get the best of both worlds, all the time.</p>
</li>
<li>
<p><strong>Adding a reranking step.</strong> Whatever the vector DB returned as top-5, that&apos;s what the LLM saw. Adding a second pass to pick the <em>best</em> candidates from a broader pool is a well-known quality boost.</p>
</li>
<li>
<p><strong>Making LLM parameters configurable.</strong> Temperature and max tokens were baked into the Java code. For experimentation and tuning, these should live in configuration.</p>
</li>
<li>
<p><strong>Upgrading the embedding model.</strong> <code>all-MiniLM-L6-v2</code> served us well, but it&apos;s from 2021 and the field has moved fast. Time for a newer model.</p>
</li>
<li>
<p><strong>Tuning for real-world load.</strong> Some of the original timeout and caching values were optimized for local development. Under sustained traffic, they needed adjustment.</p>
</li>
</ol>
<p>Let&apos;s walk through each one.</p>
<h2 id="change-1-hybrid-search-with-reciprocal-rank-fusion">Change 1: Hybrid Search with Reciprocal Rank Fusion</h2>
<p>This was the most impactful change. The insight is simple: vector search and keyword search fail in complementary ways.</p>
<p><strong>Vector search</strong> excels at &quot;what does this mean?&quot; but struggles with exact terms. Ask about &quot;SLA for the premium tier&quot; and vector search finds documents about service guarantees and uptime commitments. That&apos;s conceptually right, but it might miss the document that literally contains the acronym &quot;SLA.&quot;</p>
<p><strong>Keyword search (BM25)</strong> does the opposite. It finds exact term matches but misses semantic connections.</p>
<p>The solution: run both in parallel and merge the results.</p>
<h3 id="implementation">Implementation</h3>
<p>The <code>RetrieverService</code> now runs Weaviate and OpenSearch concurrently using <code>Mono.zip()</code>, each with independent 500ms timeouts:</p>
<pre><code class="language-java">private Mono&lt;List&lt;RetrievedDoc&gt;&gt; executeHybridRetrieval(Query query, int topK, Span span) {
    hybridCounter.increment();

    Mono&lt;List&lt;RetrievedDoc&gt;&gt; vectorMono = weaviateGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;Vector search failed in hybrid mode: {}&quot;, ex.getMessage());
                return Mono.just(List.of());
            });

    Mono&lt;List&lt;RetrievedDoc&gt;&gt; lexicalMono = openSearchGateway.search(query, topK)
            .timeout(Duration.ofMillis(500))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;Lexical search failed in hybrid mode: {}&quot;, ex.getMessage());
                return Mono.just(List.of());
            });

    return Mono.zip(vectorMono, lexicalMono)
            .map(tuple -&gt; mergeWithRRF(tuple.getT1(), tuple.getT2(), topK));
}
</code></pre>
<p>A few things to note:</p>
<ul>
<li><strong>Both searches are independent.</strong> If one fails, the other still returns results. This is strictly better than the old fallback-only approach: we get hybrid quality when both work, and graceful degradation when one doesn&apos;t.</li>
<li><strong>500ms timeout each, not combined.</strong> Since they run in parallel, the total retrieval time is <code>max(vector, lexical)</code>, not <code>vector + lexical</code>.</li>
</ul>
<h3 id="reciprocal-rank-fusion-rrf">Reciprocal Rank Fusion (RRF)</h3>
<p>The merging algorithm is RRF, which is the industry standard for combining ranked lists from different sources:</p>
<pre><code class="language-java">private List&lt;RetrievedDoc&gt; mergeWithRRF(List&lt;RetrievedDoc&gt; vectorResults,
                                         List&lt;RetrievedDoc&gt; lexicalResults, int topK) {
    Map&lt;String, Double&gt; scores = new HashMap&lt;&gt;();
    Map&lt;String, RetrievedDoc&gt; docsByKey = new HashMap&lt;&gt;();

    for (int i = 0; i &lt; vectorResults.size(); i++) {
        RetrievedDoc doc = vectorResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    for (int i = 0; i &lt; lexicalResults.size(); i++) {
        RetrievedDoc doc = lexicalResults.get(i);
        String key = doc.chunk();
        scores.merge(key, 1.0 / (RRF_K + i), Double::sum);
        docsByKey.putIfAbsent(key, doc);
    }

    return scores.entrySet().stream()
            .sorted(Map.Entry.&lt;String, Double&gt;comparingByValue().reversed())
            .limit(topK)
            .map(entry -&gt; {
                RetrievedDoc original = docsByKey.get(entry.getKey());
                return new RetrievedDoc(original.id(), original.chunk(),
                                        entry.getValue(), original.meta());
            })
            .collect(Collectors.toList());
}
</code></pre>
<p>RRF is elegant because it&apos;s rank-based, not score-based. We don&apos;t need to normalize scores across different systems (Weaviate&apos;s cosine distance and OpenSearch&apos;s BM25 scores live on completely different scales). The <code>k=60</code> constant is standard and works well in practice.</p>
<p>Documents that appear high in <em>both</em> lists get the highest combined score. A document ranked #1 in vector and #3 in lexical will outscore one ranked #1 in vector but absent from lexical results, which is exactly what we want.</p>
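<p>To make the fusion arithmetic concrete, here&apos;s a minimal standalone sketch of the RRF formula with <code>k=60</code> (class and variable names are illustrative, not from the service code):</p>
<pre><code class="language-java">import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfDemo {

    static final int RRF_K = 60;

    // Fuse two ranked lists of document keys: rank i (0-based) contributes 1 / (k + i)
    static Map&lt;String, Double&gt; rrf(List&lt;String&gt; vectorRanked, List&lt;String&gt; lexicalRanked) {
        Map&lt;String, Double&gt; scores = new HashMap&lt;&gt;();
        for (int i = 0; i &lt; vectorRanked.size(); i++) {
            scores.merge(vectorRanked.get(i), 1.0 / (RRF_K + i), Double::sum);
        }
        for (int i = 0; i &lt; lexicalRanked.size(); i++) {
            scores.merge(lexicalRanked.get(i), 1.0 / (RRF_K + i), Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        // docA is #1 in vector and #3 in lexical; docB is #1 in lexical only
        Map&lt;String, Double&gt; scores = rrf(List.of(&quot;docA&quot;), List.of(&quot;docB&quot;, &quot;docC&quot;, &quot;docA&quot;));
        // docA: 1/60 + 1/62, docB: 1/60, so docA wins -- prints true
        System.out.println(scores.get(&quot;docA&quot;) &gt; scores.get(&quot;docB&quot;));
    }
}
</code></pre>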
<h3 id="feature-flag">Feature flag</h3>
<p>Hybrid search is togglable via configuration:</p>
<pre><code class="language-yaml">retriever:
  hybrid-enabled: ${HYBRID_ENABLED:true}
</code></pre>
<p>When disabled, the system falls back to the original behavior: Weaviate primary, OpenSearch on failure only. This was useful for A/B comparison during development.</p>
<h2 id="change-2-reranking">Change 2: Reranking</h2>
<p>Hybrid search gives us better <em>candidates</em>. Reranking picks the <em>best</em> candidates from that list.</p>
<p>The pattern is straightforward: retrieve broadly (top-20), rerank precisely (top-5), send only the best to the LLM. Initial retrieval is optimized for recall (don&apos;t miss relevant docs). Reranking is optimized for precision (only keep the most relevant).</p>
<p>In production, we&apos;d use a cross-encoder model like <code>BAAI/bge-reranker-v2-m3</code> or the Cohere Rerank API. For this demo, I implemented a lightweight reranker using cosine similarity between deterministic embeddings of the query and each chunk:</p>
<pre><code class="language-java">@Component
public class Reranker {

    private static final Logger log = LoggerFactory.getLogger(Reranker.class);
    private final Timer rerankLatency;

    public Reranker(MeterRegistry meterRegistry) {
        this.rerankLatency = Timer.builder(&quot;rag_rerank_latency&quot;)
                .description(&quot;Time spent reranking retrieved documents&quot;)
                .register(meterRegistry);
    }

    public Mono&lt;List&lt;RetrievedDoc&gt;&gt; rerank(String query, List&lt;RetrievedDoc&gt; candidates, int topK) {
        return Mono.fromCallable(() -&gt; {
            Timer.Sample sample = Timer.start();
            try {
                double[] queryEmbedding = DeterministicEmbedding.embed(query);

                List&lt;RetrievedDoc&gt; reranked = candidates.stream()
                        .map(doc -&gt; {
                            double[] chunkEmbedding = DeterministicEmbedding.embed(doc.chunk());
                            double similarity = cosineSimilarity(queryEmbedding, chunkEmbedding);
                            return new RetrievedDoc(doc.id(), doc.chunk(), similarity, doc.meta());
                        })
                        .sorted(Comparator.comparingDouble(RetrievedDoc::score).reversed())
                        .limit(topK)
                        .collect(Collectors.toList());

                log.debug(&quot;Reranked {} candidates down to {}&quot;, candidates.size(), reranked.size());
                return reranked;
            } finally {
                sample.stop(rerankLatency);
            }
        });
    }
}
</code></pre>
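<p>The <code>cosineSimilarity</code> helper isn&apos;t shown above; a straightforward standalone version (my sketch, the repo&apos;s implementation may differ) looks like this:</p>
<pre><code class="language-java">public class CosineSimilarity {

    // dot(a, b) / (|a| * |b|); assumes equal-length vectors
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i &lt; a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // guard against zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        double[] zero = {0.0, 0.0, 0.0};
        System.out.println(cosineSimilarity(v, v));    // ~1.0 (identical direction)
        System.out.println(cosineSimilarity(v, zero)); // 0.0 (guard case)
    }
}
</code></pre>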
<p>The integration into <code>RetrieverService</code> is clean; reranking simply wraps the retrieval result:</p>
<pre><code class="language-java">int fetchK = properties.isRerankEnabled() ? properties.getRetrieveK() : topK;

Mono&lt;List&lt;RetrievedDoc&gt;&gt; retrieval;
if (properties.isHybridEnabled() &amp;&amp; openSearchGateway.isEnabled()) {
    retrieval = executeHybridRetrieval(query, fetchK, span);
} else {
    retrieval = executeSingleSourceRetrieval(query, fetchK, span);
}

if (properties.isRerankEnabled()) {
    retrieval = retrieval.flatMap(docs -&gt; reranker.rerank(query.text(), docs, topK));
}
</code></pre>
<p>When reranking is enabled, we fetch <code>retrieveK</code> (default 20) candidates instead of the final <code>topK</code> (default 5), then let the reranker narrow down. This gives the reranker a wider pool to work with.</p>
<p>Like hybrid search, reranking is feature-flagged via <code>rerank-enabled</code> in the config.</p>
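<p>Put together, the retrieval-related knobs look roughly like this (the property names for the rerank pool are my guess from the getters above; check the repo&apos;s <code>application.yaml</code> for the exact keys):</p>
<pre><code class="language-yaml">retriever:
  hybrid-enabled: ${HYBRID_ENABLED:true}
  rerank-enabled: ${RERANK_ENABLED:true}
  retrieve-k: ${RETRIEVE_K:20}   # candidate pool fetched when reranking
  top-k: ${TOP_K:5}              # final documents sent to the LLM
</code></pre>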
<h2 id="change-3-embedding-model-upgrade">Change 3: Embedding Model Upgrade</h2>
<p><code>all-MiniLM-L6-v2</code> has been a workhorse since 2021. It scores ~63 on the MTEB benchmark. Its bigger sibling, <code>all-MiniLM-L12-v2</code>, scores higher while keeping the same 384 dimensions, making it a drop-in upgrade.</p>
<p>The change is a single line in <code>deploy/weaviate.yaml</code>:</p>
<pre><code class="language-yaml"># Before
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2

# After
- name: text2vec
  image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L12-v2
</code></pre>
<p>Memory limit bumped from 2Gi to 3Gi to accommodate the larger model. Same dimensions means the Weaviate schema doesn&apos;t change, but we do need to re-ingest all documents since the embeddings will be different (<code>make ingest</code>).</p>
<p>For production systems, I&apos;d recommend going further: <code>intfloat/e5-large-v2</code> (1024 dims) or <code>BAAI/bge-large-en-v1.5</code> score 75-76 on MTEB. But those require schema changes, more memory, and larger storage. The L6&#x2192;L12 swap was the highest ROI for this demo.</p>
<h2 id="change-4-llm-client-improvements">Change 4: LLM Client Improvements</h2>
<p>Two small but useful refinements:</p>
<h3 id="configurable-temperature-and-max-tokens">Configurable temperature and max tokens</h3>
<p>Previously these were hardcoded in the Java source:</p>
<pre><code class="language-java">// Before
.put(&quot;temperature&quot;, 0.7)
.put(&quot;max_tokens&quot;, 512);

// After
.put(&quot;temperature&quot;, properties.getTemperature())
.put(&quot;max_tokens&quot;, properties.getMaxTokens());
</code></pre>
<p>Now driven by <code>application.yaml</code>:</p>
<pre><code class="language-yaml">rag:
  temperature: ${LLM_TEMPERATURE:0.7}
  max-tokens: ${LLM_MAX_TOKENS:512}
</code></pre>
<p>Small change, but it means we can tune generation behavior via ConfigMap without redeploying. Handy for experimenting with different temperature values across environments.</p>
<h3 id="better-token-counting">Better token counting</h3>
<p>The original code counted tokens by splitting on whitespace:</p>
<pre><code class="language-java">// Before: counts words, not tokens
int tokens = Math.max(1, text.split(&quot;\\s+&quot;).length);

// After: rough approximation: 1 token &#x2248; 4 characters
int tokens = Math.max(1, text.length() / 4);
</code></pre>
<p>Neither is perfect without a proper tokenizer, but <code>length / 4</code> is much closer to reality for English text. This feeds into the cost estimation metrics on the observability dashboard, so getting it roughly right matters.</p>
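<p>A quick standalone comparison of the two heuristics on a short English sentence (illustrative only; neither matches a real BPE tokenizer exactly):</p>
<pre><code class="language-java">public class TokenCountDemo {

    public static void main(String[] args) {
        String text = &quot;Reciprocal rank fusion merges ranked lists without score normalization.&quot;;

        // Old heuristic: whitespace-separated words
        int byWords = Math.max(1, text.split(&quot;\\s+&quot;).length);
        // New heuristic: ~4 characters per token for English text
        int byChars = Math.max(1, text.length() / 4);

        System.out.println(byWords + &quot; vs &quot; + byChars); // prints &quot;9 vs 17&quot;
    }
}
</code></pre>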
<h2 id="change-5-production-tuning">Change 5: Production Tuning</h2>
<h3 id="timeouts-and-thresholds">Timeouts and thresholds</h3>
<p>The original values were tuned for local development. Under sustained load, some of them needed breathing room:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Before</th>
<th>After</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Retrieval timeout</td>
<td>250ms</td>
<td>500ms</td>
<td>Reduced unnecessary fallbacks under load</td>
</tr>
<tr>
<td>LLM generation timeout</td>
<td>1,800ms</td>
<td>5,000ms</td>
<td>Cold models and complex prompts need headroom</td>
</tr>
<tr>
<td>Cache similarity threshold</td>
<td>0.90</td>
<td>0.87</td>
<td>More cache hits, still precise enough</td>
</tr>
<tr>
<td>Cache TTL</td>
<td>600s (10 min)</td>
<td>3,600s (1 hour)</td>
<td>RAG docs don&apos;t change that often</td>
</tr>
</tbody>
</table>
<p>The retrieval timeout change alone reduced the fallback rate from ~15% under load to ~3%. That&apos;s a meaningful quality improvement: every unnecessary fallback means the user gets lexical-only results instead of hybrid ones.</p>
<h2 id="the-updated-retrieval-flow">The Updated Retrieval Flow</h2>
<p>Here&apos;s how the retrieval pipeline looks now, end to end:</p>
<pre><code>User query arrives at Retriever
    &#x2502;
    &#x251C;&#x2500;&#x2500; Hybrid enabled?
    &#x2502;     YES &#x2192; Run Weaviate + OpenSearch in parallel (500ms each)
    &#x2502;           &#x2192; Merge results with RRF (k=60)
    &#x2502;     NO  &#x2192; Run Weaviate only (500ms timeout)
    &#x2502;           &#x2192; On failure, fallback to OpenSearch
    &#x2502;
    &#x251C;&#x2500;&#x2500; Rerank enabled?
    &#x2502;     YES &#x2192; Take top-20 candidates
    &#x2502;           &#x2192; Rerank by cosine similarity
    &#x2502;           &#x2192; Return top-5
    &#x2502;     NO  &#x2192; Return top-5 directly
    &#x2502;
    &#x2514;&#x2500;&#x2500; Return to Orchestrator
</code></pre>
<p>Every step is independently toggleable via configuration, instrumented with Prometheus metrics (<code>rag_retrieval_hybrid_total</code>, <code>rag_rerank_latency</code>), and traced with OpenTelemetry spans.</p>
<h2 id="whats-next">What&apos;s Next</h2>
<p>These changes addressed the most impactful improvements. The system is meaningfully better, but there&apos;s more on the roadmap:</p>
<ul>
<li>
<p><strong>A proper cross-encoder reranker.</strong> The cosine similarity reranker is a stand-in. A real cross-encoder (<code>bge-reranker-v2-m3</code>) would give much better precision, at the cost of ~80ms latency and an additional inference sidecar.</p>
</li>
<li>
<p><strong>Query routing.</strong> Not every question needs RAG. A router agent that decides per query whether to use the cache, call a tool, run the RAG pipeline, or just let the LLM answer from its training data is the next architectural evolution.</p>
</li>
<li>
<p><strong>Better embedding model.</strong> <code>all-MiniLM-L12-v2</code> is better than L6, but models like <code>intfloat/e5-large-v2</code> or <code>BAAI/bge-large-en-v1.5</code> would be a step change in retrieval quality.</p>
</li>
<li>
<p><strong>Contextual retrieval.</strong> Anthropic&apos;s technique of prepending chunk-specific context before embedding (e.g., &quot;This chunk is from the autoscaling documentation&quot;) reduces retrieval failures by up to 67%. That&apos;s a significant number worth exploring.</p>
</li>
</ul>
<p>The full code is at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>. Deploy locally with <code>make dev-up &amp;&amp; make build &amp;&amp; make deploy &amp;&amp; make ingest</code> and try it yourself.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Kubernetes, Autoscaling & LLMs]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We finally got the first snowfall of the season this week in Stockholm. The weather is getting colder and the days shorter. That only motivates me to continue writing my third and final post in the RAG series.<br>
In <a href="https://aboullaite.me/production-rag-java-k8s-part1/">part one</a>, we explored the production challenges of RAG systems. In <a href="https://aboullaite.me/production-rag-java-k8s-part2/">part two</a>, we</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part3/</link><guid isPermaLink="false">6921f4d296cd710001aeb664</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[LLM]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sat, 22 Nov 2025 18:13:19 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_glsj04glsj04glsj.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_glsj04glsj04glsj.png" alt="Building Production-Grade RAG Systems: Kubernetes, Autoscaling &amp; LLMs"><p>We finally got the first snowfall of the season this week in Stockholm. The weather is getting colder and the days shorter. That only motivates me to continue writing my third and final post in the RAG series.<br>
In <a href="https://aboullaite.me/production-rag-java-k8s-part1/">part one</a>, we explored the production challenges of RAG systems. In <a href="https://aboullaite.me/production-rag-java-k8s-part2/">part two</a>, we dove deep into the architecture and component design. Now let&apos;s talk about the elephant in the room: <strong>why Kubernetes?</strong></p>
<h2 id="the-llm-infrastructure-problem">The LLM Infrastructure Problem</h2>
<p>LLM applications have unique operational requirements that traditional web applications don&apos;t face:</p>
<h3 id="1-gpu-resource-management">1. GPU Resource Management</h3>
<p>Running models like Gemma-2-2B requires GPUs. Not just any GPUs&#x2014;specific GPU types (L4, A100, H100) with minimum VRAM requirements. You need:</p>
<ul>
<li><strong>Dynamic allocation</strong>: Spin up GPU nodes when needed, tear down when idle</li>
<li><strong>Multi-tenancy</strong>: Share expensive GPUs across multiple services (when possible)</li>
<li><strong>Isolation</strong>: Ensure one model&apos;s OOM crash doesn&apos;t kill other workloads</li>
<li><strong>Scheduling</strong>: Route inference requests to GPU-backed pods automatically</li>
</ul>
<p>Kubernetes ticks all the boxes.</p>
<h3 id="2-heterogeneous-scaling">2. Heterogeneous Scaling</h3>
<p>Our RAG pipeline has components with radically different scaling profiles:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Scaling Trigger</th>
<th>Resource Type</th>
<th>Scale Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Retriever</td>
<td>CPU + Request rate</td>
<td>CPU-intensive</td>
<td>2-30 replicas</td>
</tr>
<tr>
<td>Orchestrator</td>
<td>Connection count</td>
<td>I/O-bound</td>
<td>2-10 replicas</td>
</tr>
<tr>
<td>LLM (KServe)</td>
<td>Inference load</td>
<td>GPU-bound</td>
<td>0-3 replicas (scale-to-zero)</td>
</tr>
<tr>
<td>Vector DB</td>
<td>Query volume</td>
<td>Memory + I/O</td>
<td>Managed/external</td>
</tr>
<tr>
<td>Cache (Redis)</td>
<td>Memory usage</td>
<td>Memory-bound</td>
<td>1 replica (stateful)</td>
</tr>
</tbody>
</table>
<p>Each component needs independent scaling logic. The retriever might scale out to 30 replicas during a traffic spike while the orchestrator stays at 2 replicas. The LLM should scale to zero when idle (saving $$$ on GPU costs) but warm up quickly when requests arrive.</p>
<p>Kubernetes HPA, KEDA, and KServe give you fine-grained control over each layer.</p>
<h3 id="3-network-complexity">3. Network Complexity</h3>
<p>Our RAG pipelines involve complex service-to-service communication:</p>
<ul>
<li>Orchestrator &#x2192; Retriever (HTTP)</li>
<li>Retriever &#x2192; Weaviate (gRPC)</li>
<li>Retriever &#x2192; OpenSearch (REST)</li>
<li>Orchestrator &#x2192; KServe (HTTP with streaming)</li>
<li>Orchestrator &#x2192; Redis (TCP)</li>
<li>All services &#x2192; OTEL Collector (gRPC)</li>
<li>All services &#x2192; Prometheus (HTTP scraping)</li>
</ul>
<p>We need:</p>
<ul>
<li>Service discovery (how does the orchestrator find the retriever?)</li>
<li>Load balancing (distribute requests across retriever replicas)</li>
<li>Retry logic (retry failed requests with backoff)</li>
<li>Circuit breaking (stop calling unhealthy services)</li>
<li>Observability (trace requests across service boundaries)</li>
</ul>
<p>Kubernetes Services, Ingress, and service meshes (Istio, Linkerd) handle this out of the box.</p>
<h3 id="4-deployment-complexity">4. Deployment Complexity</h3>
<p>Production deployments require:</p>
<ul>
<li><strong>Canary releases</strong>: Route 10% of traffic to new versions, monitor metrics, rollback if needed</li>
<li><strong>Blue-green deployments</strong>: Swap entire environments atomically</li>
<li><strong>Rolling updates</strong>: Replace pods gradually without downtime</li>
<li><strong>Rollback</strong>: Revert to previous versions quickly</li>
<li><strong>Health checks</strong>: Readiness and liveness probes to avoid routing to broken pods</li>
<li><strong>Resource limits</strong>: Prevent resource exhaustion and noisy neighbor problems</li>
</ul>
<p>Kubernetes Deployments, StatefulSets, and Rollouts (via Argo) provide these primitives.</p>
<h3 id="5-observability-at-scale">5. Observability at Scale</h3>
<p>When you have 30 retriever pods, 10 orchestrator pods, and multiple LLM replicas, you need:</p>
<ul>
<li><strong>Distributed tracing</strong>: See request flows across services</li>
<li><strong>Metrics aggregation</strong>: Scrape metrics from all pods automatically</li>
<li><strong>Log aggregation</strong>: Centralized logging with correlation IDs</li>
<li><strong>Dashboarding</strong>: Real-time visibility into system health</li>
</ul>
<p>Kubernetes + Prometheus + OpenTelemetry + Grafana + Tempo is the standard stack.</p>
<h2 id="why-not-serverless">Why Not Serverless?</h2>
<p>Serverless functions (Lambda, Cloud Functions, Cloud Run) work for stateless HTTP APIs. But LLM workloads break the serverless model:</p>
<p><strong>Cold starts</strong>: LLMs take 10-60 seconds to load into GPU memory. Cold starts are unacceptable.</p>
<p><strong>Execution time limits</strong>: Serverless has timeouts (AWS Lambda: 15 minutes, Cloud Functions: 60 minutes). Long-running inference or batch processing exceeds these limits.</p>
<p><strong>GPU support</strong>: Limited or expensive. AWS Lambda doesn&apos;t support GPUs. Cloud Run supports GPUs but at premium pricing without scale-to-zero.</p>
<p><strong>Stateful caching</strong>: Semantic caching requires shared state (Redis). Serverless architectures push state to external services, adding latency.</p>
<p><strong>Cost</strong>: Serverless pricing is optimized for bursty, short-lived workloads. LLM inference is compute-intensive and benefits from sustained usage discounts on VMs/GPUs.</p>
<p>Each of these constraints can be mitigated, but Kubernetes gives us serverless-like abstractions (KServe scale-to-zero) while keeping control over GPU resources, state management, and cost.</p>
<h2 id="the-kubernetes-deployment-strategy">The Kubernetes Deployment Strategy</h2>
<p>Let&apos;s walk through deploying this RAG system to Kubernetes, both locally (KinD) and in production (GKE).</p>
<h3 id="production-deployment-on-gke">Production Deployment on GKE</h3>
<p>Google Kubernetes Engine (GKE) provides managed Kubernetes with:</p>
<ul>
<li><strong>Autopilot mode</strong>: Google manages nodes, scaling, security patches</li>
<li><strong>GPU node pools</strong>: L4, A100, H100 GPU support</li>
<li><strong>Regional clusters</strong>: High availability across zones</li>
<li><strong>Integrated logging</strong>: Stackdriver integration</li>
<li><strong>VPC-native networking</strong>: Secure service-to-service communication</li>
</ul>
<p><strong>Creating the cluster</strong>:</p>
<pre><code class="language-bash"># Configure environment
vim .env  # Set GCP_PROJECT, GKE_REGION, REGISTRY

# Create GKE cluster
make gke-cluster
</code></pre>
<p>This provisions:</p>
<ul>
<li><strong>Node pool</strong>: 3 nodes, <code>e2-standard-4</code> (4 vCPU, 16GB RAM)</li>
<li><strong>Autopilot</strong>: Optional (use <code>make gke-autopilot</code> for fully managed)</li>
<li><strong>Region</strong>: <code>europe-west4</code> (Netherlands, low latency to EU users)</li>
<li><strong>Network</strong>: VPC-native with private IPs</li>
</ul>
<p><strong>Installing KServe</strong>:</p>
<p>KServe requires manual installation (for now). Run these commands:</p>
<pre><code class="language-bash"># Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager-webhook -n cert-manager

# Install Knative Serving
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-core.yaml

# Install Kourier networking layer
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.20.0/kourier.yaml
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch &apos;{&quot;data&quot;:{&quot;ingress-class&quot;:&quot;kourier.ingress.networking.knative.dev&quot;}}&apos;

# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve.yaml
kubectl wait --for=condition=available --timeout=300s deployment/kserve-controller-manager -n kserve
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.16.0/kserve-cluster-resources.yaml

# Configure raw deployment mode (no Knative autoscaling)
kubectl patch configmap/inferenceservice-config -n kserve --type merge \
  --patch &apos;{&quot;data&quot;:{&quot;deploy&quot;:&quot;{\&quot;defaultDeploymentMode\&quot;:\&quot;RawDeployment\&quot;}&quot;}}&apos;
</code></pre>
<p><strong>Create GPU node pool</strong>:</p>
<pre><code class="language-bash">make gke-gpu
</code></pre>
<p>This creates a separate node pool with:</p>
<ul>
<li><strong>GPU type</strong>: NVIDIA L4 (24GB VRAM, cost-effective for inference)</li>
<li><strong>Machine type</strong>: <code>g2-standard-4</code> (4 vCPU, 16GB RAM, 1x L4 GPU)</li>
<li><strong>Nodes</strong>: 1 node (autoscales 0-3 based on demand)</li>
<li><strong>Taints</strong>: <code>nvidia.com/gpu=present:NoSchedule</code> (only GPU workloads land here)</li>
</ul>
<p><strong>Deploy the stack</strong>:</p>
<pre><code class="language-bash"># Build and push images to Artifact Registry
gcloud auth configure-docker europe-north1-docker.pkg.dev
make build

# Deploy all services + LoadBalancer
make gke-deploy

# Ingest sample data
make ingest

# Get external IP
kubectl get svc orchestrator-public -n rag
</code></pre>
<p>Your production RAG system is now live at <code>http://&lt;EXTERNAL_IP&gt;</code>.</p>
<h2 id="autoscaling">Autoscaling</h2>
<p>Autoscaling is where Kubernetes shines. Let&apos;s break down each layer.</p>
<h3 id="horizontal-pod-autoscaler-hpa-cpu-based-scaling">Horizontal Pod Autoscaler (HPA): CPU-Based Scaling</h3>
<p>HPA scales pods based on resource metrics (CPU, memory). For the retriever service:</p>
<pre><code class="language-yaml"># deploy/hpa-retriever.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retriever-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retriever
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
</code></pre>
<p><strong>How it works</strong>:</p>
<ol>
<li>HPA queries Kubernetes Metrics Server for pod CPU usage</li>
<li>If average CPU &gt; 70%, scale up</li>
<li>If average CPU &lt; 70%, scale down (after stabilization window)</li>
<li>Scale-up is aggressive (100% increase or +4 pods, whichever is greater)</li>
<li>Scale-down is gradual (50% decrease every 60 seconds, after 5-minute stabilization)</li>
</ol>
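<p>Concretely, the controller derives the target count from the ratio of observed to desired utilization (this is the formula from the Kubernetes HPA documentation, sketched here with illustrative numbers):</p>
<pre><code class="language-java">public class HpaMath {

    // desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
    static int desiredReplicas(int currentReplicas, double currentUtilization, double targetUtilization) {
        return (int) Math.ceil(currentReplicas * currentUtilization / targetUtilization);
    }

    public static void main(String[] args) {
        // 4 retriever pods averaging 90% CPU against the 70% target -&gt; scale to 6
        System.out.println(desiredReplicas(4, 90, 70)); // 6
    }
}
</code></pre>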
<p><strong>Why these settings?</strong></p>
<ul>
<li><strong>minReplicas: 2</strong>: Ensures redundancy (if one pod crashes, traffic routes to the other)</li>
<li><strong>maxReplicas: 30</strong>: Handles extreme traffic spikes without unbounded cost</li>
<li><strong>averageUtilization: 70</strong>: Headroom for bursts without constant scaling oscillation</li>
<li><strong>scaleUp aggressive, scaleDown gradual</strong>: Prefer over-provisioning during spikes, slow drain during cooldowns</li>
</ul>
<h3 id="keda-event-driven-autoscaling">KEDA: Event-Driven Autoscaling</h3>
<p>HPA is great for CPU/memory scaling, but what about custom metrics? Enter KEDA (Kubernetes Event-Driven Autoscaling).</p>
<p>KEDA scales based on external metrics like:</p>
<ul>
<li>Prometheus queries</li>
<li>Kafka message lag</li>
<li>AWS SQS queue depth</li>
<li>Custom HTTP endpoints</li>
</ul>
<p>For the retriever, we scale based on <strong>requests per second</strong>:</p>
<pre><code class="language-yaml"># deploy/keda-retriever.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: retriever-keda
  namespace: rag
spec:
  scaleTargetRef:
    name: retriever
  minReplicaCount: 2
  maxReplicaCount: 30
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.observability.svc.cluster.local:9090
        query: |
          sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;,namespace=&quot;rag&quot;}[1m]))
        threshold: &quot;15&quot;
</code></pre>
<p><strong>How it works</strong>:</p>
<ol>
<li>KEDA queries Prometheus every 30 seconds</li>
<li>Executes the PromQL query: <code>sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;}[1m]))</code></li>
<li>Replicas are sized so each handles at most 15 RPS: <code>desiredReplicas = ceil(totalRPS / 15)</code>, so 100 RPS yields 7 replicas</li>
<li>When traffic drops, the replica count shrinks accordingly (after the 120-second cooldown)</li>
</ol>
<p><strong>Why RPS-based scaling?</strong></p>
<p>CPU utilization is a lagging indicator. By the time CPU hits 70%, users are already experiencing latency. RPS is a leading indicator&#x2014;traffic increases before CPU saturates.</p>
<p>With 15 RPS threshold:</p>
<ul>
<li>2 replicas handle 30 RPS</li>
<li>10 replicas handle 150 RPS</li>
<li>30 replicas handle 450 RPS</li>
</ul>
<p>In load testing, this scales the retriever from 2&#x2192;20 replicas within 90 seconds when traffic ramps from 10&#x2192;200 RPS.</p>
<h3 id="kserve-scale-to-zero-for-llms">KServe: Scale-to-Zero for LLMs</h3>
<p>GPUs are expensive. Leaving an L4 GPU idle costs ~$0.60/hour (~$430/month). Scale-to-zero is critical.</p>
<p>KServe (via Knative Serving) provides:</p>
<ul>
<li><strong>Scale to zero</strong>: Terminate pods when idle for 60 seconds</li>
<li><strong>Warm-up on demand</strong>: Spin up pods on first request</li>
<li><strong>Concurrency-based scaling</strong>: Scale based on in-flight requests</li>
</ul>
<pre><code class="language-yaml"># deploy/kserve-vllm.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: rag-llm
  namespace: rag
spec:
  predictor:
    minReplicas: 0        # Scale to zero when idle
    maxReplicas: 3        # Cap GPU-backed replicas
    scaleTarget: 20       # Target 20 concurrent requests per replica
    scaleMetric: concurrency
    model:
      runtime: vllm-runtime
      modelFormat:
        name: huggingface
      args:
        - --model
        - google/gemma-2-2b-it
        - --dtype
        - auto
        - --max-model-len
        - &quot;4096&quot;
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: &quot;3&quot;
          memory: 12Gi
</code></pre>
<p><strong>Cold start latency</strong>:</p>
<ul>
<li>Model download: ~10-20 seconds (cached after first run)</li>
<li>Model load to GPU: ~15-30 seconds</li>
<li>First inference: ~1-3 seconds</li>
<li><strong>Total cold start</strong>: ~30-50 seconds</li>
</ul>
<p><strong>Mitigation strategies</strong>:</p>
<ol>
<li><strong>Keep-alive</strong>: Background service pings the model every 45 seconds to prevent scale-down</li>
<li><strong>Fallback</strong>: Orchestrator uses deterministic fallback during cold starts</li>
<li><strong>Pre-warming</strong>: Scale to 1 replica before traffic spikes (e.g., before scheduled events)</li>
</ol>
<p>For production workloads with consistent traffic, set <code>minReplicas: 1</code> to avoid cold starts.</p>
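<p>The keep-alive mitigation can be as simple as a scheduled probe. A minimal sketch, pinging once a minute so Knative never observes 60 seconds of idle time (the predictor URL here is an assumption, not the repo&apos;s actual manifest):</p>
<pre><code class="language-yaml"># Illustrative keep-alive pinger for the KServe predictor.
# Adjust the URL to your actual InferenceService endpoint.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rag-llm-keepalive
  namespace: rag
spec:
  schedule: &quot;* * * * *&quot;    # every minute (CronJob minimum granularity)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ping
              image: curlimages/curl:8.5.0
              args:
                - -sf
                - http://rag-llm-predictor.rag.svc.cluster.local/v1/models/rag-llm
</code></pre>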
<h3 id="combining-hpa-keda-kserve">Combining HPA + KEDA + KServe</h3>
<p>Why use both HPA and KEDA for the retriever?</p>
<p>They complement each other:</p>
<ul>
<li><strong>HPA</strong>: Reacts to CPU saturation (protects against resource exhaustion)</li>
<li><strong>KEDA</strong>: Reacts to request rate (proactive scaling before CPU saturates)</li>
</ul>
<p>Their recommendations are evaluated together, and the highest replica count wins. During traffic spikes:</p>
<ol>
<li>KEDA detects rising RPS and scales to 3 replicas</li>
<li>CPU usage remains &lt;70% (HPA doesn&apos;t trigger)</li>
<li>If CPU later spikes above 70% (e.g., slow queries), the CPU signal takes over and pushes the count higher, say to 15 replicas</li>
<li>When traffic drops, KEDA scales down after cooldown</li>
</ol>
<p>This multi-signal approach prevents both under-provisioning (latency spikes) and over-provisioning (wasted cost).</p>
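<p>In practice, the cleanest way to combine the two signals is a single ScaledObject carrying both triggers: KEDA generates the underlying HPA itself, so keeping everything in one object avoids two controllers reconciling the same Deployment. A sketch:</p>
<pre><code class="language-yaml"># Illustrative: one ScaledObject with both the CPU and RPS signals.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: retriever-keda
  namespace: rag
spec:
  scaleTargetRef:
    name: retriever
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: cpu                  # reactive: CPU saturation
      metricType: Utilization
      metadata:
        value: &quot;70&quot;
    - type: prometheus           # proactive: request rate
      metadata:
        serverAddress: http://prometheus.observability.svc.cluster.local:9090
        query: sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;,namespace=&quot;rag&quot;}[1m]))
        threshold: &quot;15&quot;
</code></pre>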
<h2 id="observability-stack-seeing-whats-happening">Observability Stack: Seeing What&apos;s Happening</h2>
<p>You can&apos;t optimize what you can&apos;t measure. The observability stack provides end-to-end visibility.</p>
<h3 id="architecture-overview">Architecture Overview</h3>
<p><img src="https://aboullaite.me/content/images/2025/11/Untitled-diagram-2025-11-22-175452.png" alt="Building Production-Grade RAG Systems: Kubernetes, Autoscaling &amp; LLMs" loading="lazy"></p>
<h3 id="opentelemetry-tracing">OpenTelemetry Tracing</h3>
<p>Every request generates a distributed trace spanning multiple services. The OpenTelemetry Java agent instruments Spring Boot automatically:</p>
<pre><code class="language-yaml"># deploy/retriever.yaml

initContainers:
  - name: otel-agent-downloader
    image: busybox:1.36
    command:
      - sh
      - -c
      - &gt;
        wget -q -O /otel/javaagent.jar
        https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.21.0/opentelemetry-javaagent.jar
    volumeMounts:
      - name: otel-agent
        mountPath: /otel

containers:
  - name: retriever
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector.observability.svc.cluster.local:4317
      - name: JAVA_TOOL_OPTIONS
        value: &quot;-javaagent:/otel/javaagent.jar&quot;
    volumeMounts:
      - name: otel-agent
        mountPath: /otel
</code></pre>
<p><strong>Custom span attributes</strong> provide RAG-specific context:</p>
<pre><code class="language-java">// common/src/main/java/me/aboullaite/rag/common/tracing/TracingUtils.java

public class TracingUtils {
    public static void recordCacheHit(Span span, boolean hit) {
        span.setAttribute(&quot;rag.cache.hit&quot;, hit);
    }

    public static void recordRetrievedDocs(Span span, List&lt;RetrievedDoc&gt; docs) {
        span.setAttribute(&quot;rag.retrieval.count&quot;, docs.size());
        if (!docs.isEmpty()) {
            span.setAttribute(&quot;rag.retrieval.source&quot;,
                docs.get(0).metadata().source());
        }
    }

    public static void recordFallback(Span span, String reason) {
        span.setAttribute(&quot;rag.fallback.reason&quot;, reason);
    }

    public static void recordModelUsage(Span span, String model, long ttftMs, int tokens) {
        span.setAttribute(&quot;rag.model.name&quot;, model);
        span.setAttribute(&quot;rag.ttft_ms&quot;, ttftMs);
        span.setAttribute(&quot;rag.tokens.total&quot;, tokens);
    }
}
</code></pre>
<p>In Grafana Tempo, you can:</p>
<ul>
<li>Filter traces by <code>rag.cache.hit=false</code> (cache misses)</li>
<li>Find slow requests by <code>rag.ttft_ms &gt; 1000</code> (first token &gt; 1 second)</li>
<li>Identify fallback triggers by <code>rag.fallback.reason=weaviate-timeout</code></li>
</ul>
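<p>In Tempo&apos;s TraceQL, these filters look like this (attribute names match the <code>TracingUtils</code> helpers above):</p>
<pre><code>{ span.rag.cache.hit = false }
{ span.rag.ttft_ms &gt; 1000 }
{ span.rag.fallback.reason = &quot;weaviate-timeout&quot; }
</code></pre>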
<h3 id="prometheus-metrics">Prometheus Metrics</h3>
<p>Spring Boot Actuator exposes Prometheus metrics at <code>/actuator/prometheus</code>. The retriever and orchestrator emit custom metrics:</p>
<pre><code class="language-java">// Orchestrator metrics
this.askLatency = Timer.builder(&quot;rag_orchestrator_latency&quot;)
        .description(&quot;End-to-end /v1/ask latency&quot;)
        .register(meterRegistry);

this.cacheHitCounter = Counter.builder(&quot;rag_cache_hit_total&quot;)
        .description(&quot;Semantic cache hits&quot;)
        .register(meterRegistry);

this.tokensCounter = Counter.builder(&quot;rag_tokens_generated_total&quot;)
        .description(&quot;Total tokens generated by model responses&quot;)
        .register(meterRegistry);

this.costSummary = DistributionSummary.builder(&quot;rag_cost_usd_total&quot;)
        .description(&quot;Approximate request cost in USD&quot;)
        .register(meterRegistry);
</code></pre>
<p>Prometheus scrapes these metrics every 15 seconds:</p>
<pre><code class="language-yaml"># deploy/prometheus.yaml

scrape_configs:
  - job_name: &apos;retriever&apos;
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - rag
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
</code></pre>
<h3 id="grafana-dashboards">Grafana Dashboards</h3>
<p>The Grafana dashboard visualizes key RAG metrics:</p>
<pre><code>&#x250C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2510;
&#x2502; RAG Pipeline Dashboard                                      &#x2502;
&#x251C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2524;
&#x2502; Cache Hit Rate:  47.3%  &#x2502;  Avg Latency:     847ms           &#x2502;
&#x2502; Total Requests:  12.4k  &#x2502;  p95 Latency:    1.32s            &#x2502;
&#x2502; Fallback Rate:    8.2%  &#x2502;  p99 Latency:    2.15s            &#x2502;
&#x2502; Estimated Cost: $14.32  &#x2502;  Tokens/sec:      142             &#x2502;
&#x251C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2524;
&#x2502; [Line chart: Request rate over time (RPS)]                  &#x2502;
&#x2502; [Line chart: Cache hit ratio (percentage)]                  &#x2502;
&#x2502; [Line chart: Latency percentiles (p50/p95/p99)]             &#x2502;
&#x2502; [Bar chart: Replica count by service]                       &#x2502;
&#x2502; [Line chart: Token throughput (tokens/sec)]                 &#x2502;
&#x2502; [Line chart: Estimated cost per request]                    &#x2502;
&#x2514;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2518;
</code></pre>
<p><strong>Key metrics to watch</strong>:</p>
<ol>
<li>
<p><strong>Cache hit ratio</strong>: Should be &gt;40% for typical workloads. If &lt;20%, investigate query distribution or lower similarity threshold.</p>
</li>
<li>
<p><strong>p95 latency</strong>: Should be &lt;2 seconds. If higher, check:</p>
<ul>
<li>Retrieval timeout settings (too aggressive?)</li>
<li>LLM TTFT (model overloaded?)</li>
<li>Network latency (cross-region calls?)</li>
</ul>
</li>
<li>
<p><strong>Fallback rate</strong>: Should be &lt;10%. If higher, investigate:</p>
<ul>
<li>Weaviate performance (slow queries, resource exhaustion)</li>
<li>Timeout settings (too strict?)</li>
</ul>
</li>
<li>
<p><strong>Token throughput</strong>: Tracks LLM utilization. If low despite high traffic, you might need more GPU instances.</p>
</li>
<li>
<p><strong>Cost per request</strong>: Average should be ~$0.001-$0.005 depending on token generation. Spikes indicate cache misses or long responses.</p>
</li>
</ol>
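<p>Several of these thresholds translate directly into alert rules. A sketch (the histogram query assumes the latency timer publishes percentile histograms; the fallback counter name is illustrative):</p>
<pre><code class="language-yaml"># Illustrative Prometheus alerting rules for the SLOs above
groups:
  - name: rag-slos
    rules:
      - alert: RagP95LatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(rag_orchestrator_latency_seconds_bucket[5m]))) &gt; 2
        for: 10m
      - alert: RagFallbackRateHigh
        expr: |
          sum(rate(rag_retriever_fallback_total[5m]))
            / sum(rate(http_server_requests_seconds_count{app=&quot;retriever&quot;}[5m])) &gt; 0.10
        for: 10m
      - alert: RagCacheHitRateLow
        expr: |
          sum(rate(rag_cache_hit_total[15m]))
            / sum(rate(rag_orchestrator_latency_seconds_count[15m])) &lt; 0.20
        for: 30m
</code></pre>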
<h2 id="load-testing-and-performance-tuning">Load Testing and Performance Tuning</h2>
<p>Load testing validates autoscaling and identifies bottlenecks. The repo includes a k6 script:</p>
<pre><code class="language-javascript">// scripts/loadtest-k6.js

import http from &apos;k6/http&apos;;
import { check, sleep } from &apos;k6&apos;;

export let options = {
  stages: [
    { duration: &apos;1m&apos;, target: 10 },   // Ramp to 10 RPS
    { duration: &apos;3m&apos;, target: 30 },   // Sustain 30 RPS
    { duration: &apos;1m&apos;, target: 50 },   // Spike to 50 RPS
    { duration: &apos;2m&apos;, target: 10 },   // Cool down to 10 RPS
    { duration: &apos;1m&apos;, target: 0 },    // Drain
  ],
};

export default function () {
  const queries = [
    &apos;How does autoscaling work?&apos;,
    &apos;Explain the caching mechanism&apos;,
    &apos;What is the fallback strategy?&apos;,
    &apos;How do I deploy to Kubernetes?&apos;,
    &apos;What observability tools are used?&apos;,
  ];

  const prompt = queries[Math.floor(Math.random() * queries.length)];
  const payload = JSON.stringify({ prompt, topK: 5 });

  const res = http.post(&apos;http://localhost:8080/v1/ask&apos;, payload, {
    headers: { &apos;Content-Type&apos;: &apos;application/json&apos; },
  });

  check(res, {
    &apos;status is 200&apos;: (r) =&gt; r.status === 200,
    &apos;response time &lt; 3s&apos;: (r) =&gt; r.timings.duration &lt; 3000,
    &apos;has answer&apos;: (r) =&gt; JSON.parse(r.body).answer.length &gt; 0,
  });

  sleep(1);
}
</code></pre>
<p><strong>Run the test</strong>:</p>
<pre><code class="language-bash">make port-forward  # In one terminal
k6 run scripts/loadtest-k6.js  # In another terminal
</code></pre>
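<p>While k6 ramps, you can watch the autoscalers react from a third terminal (the <code>app=retriever</code> label is an assumption from the deployment manifests):</p>
<pre><code class="language-bash">kubectl -n rag get hpa,scaledobject -w        # replica recommendations
kubectl -n rag get pods -l app=retriever -w   # pods coming and going
</code></pre>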
<p><strong>Watch in Grafana</strong>:</p>
<ul>
<li>Retriever replicas scale from 2&#x2192;10&#x2192;20 as RPS increases</li>
<li>Cache hit ratio stabilizes around 45% after warm-up</li>
<li>p95 latency stays &lt;2s despite traffic spike</li>
<li>Fallback rate increases slightly during peak (10-15%)</li>
</ul>
<p><strong>Tuning recommendations</strong>:</p>
<ol>
<li><strong>If p95 latency &gt; 2s</strong>: Increase retriever <code>maxReplicas</code> or lower HPA CPU threshold to 60%</li>
<li><strong>If fallback rate &gt; 15%</strong>: Increase Weaviate timeout from 250ms to 500ms</li>
<li><strong>If cache hit ratio &lt; 30%</strong>: Lower similarity threshold from 0.90 to 0.85 (validate answer quality)</li>
<li><strong>If cost per request &gt; $0.01</strong>: Reduce <code>max_tokens</code> in LLM config or improve prompt efficiency</li>
</ol>
<h2 id="cost-optimization-strategies">Cost Optimization Strategies</h2>
<p>Running LLMs in production is expensive. Here&apos;s how to minimize cost without sacrificing quality:</p>
<h3 id="1-semantic-caching">1. Semantic Caching</h3>
<p>A cache hit ratio of 45% means you&apos;re avoiding 45% of LLM calls. At ~$0.002 per request, each cache hit saves the full ~$0.002 LLM call. For 100k requests/day:</p>
<ul>
<li>Without cache: $200/day</li>
<li>With 45% cache hit: $110/day</li>
<li><strong>Savings: $90/day = $32,850/year</strong></li>
</ul>
<h3 id="2-kserve-scale-to-zero">2. KServe Scale-to-Zero</h3>
<p>If your traffic has idle periods (e.g., nights, weekends), scale-to-zero saves GPU costs:</p>
<ul>
<li>L4 GPU: $0.60/hour = $14.40/day</li>
<li>If idle 50% of the time: <strong>$7.20/day savings = $2,628/year</strong></li>
</ul>
<h3 id="3-spotpreemptible-instances">3. Spot/Preemptible Instances</h3>
<p>GKE supports spot instances (70% discount on compute):</p>
<ul>
<li>Standard <code>g2-standard-4</code>: $0.35/hour</li>
<li>Spot <code>g2-standard-4</code>: $0.10/hour</li>
<li><strong>Savings: $0.25/hour = $6/day = $2,190/year</strong></li>
</ul>
<p>Caveat: Spot instances can be preempted. Use them for stateless services (retriever, orchestrator) but not for stateful stores (Redis, databases).</p>
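<p>On GKE, steering a stateless Deployment onto spot capacity is a scheduling concern. A minimal sketch using GKE&apos;s spot node label and taint:</p>
<pre><code class="language-yaml"># Illustrative: pin the retriever to spot nodes (GKE label/taint)
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: &quot;true&quot;
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: &quot;true&quot;
          effect: NoSchedule
</code></pre>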
<h3 id="4-right-sizing-resources">4. Right-Sizing Resources</h3>
<p>Monitor actual resource usage in Grafana:</p>
<ul>
<li>If retriever CPU consistently &lt;50%, reduce <code>requests.cpu</code> from <code>200m</code> to <code>100m</code></li>
<li>If orchestrator memory consistently &lt;200MB, reduce <code>requests.memory</code> from <code>256Mi</code> to <code>128Mi</code></li>
</ul>
<p>Over-provisioning wastes money. Under-provisioning causes OOM kills. Find the sweet spot.</p>
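<p>Right-sizing ends up as a small diff in the deployment manifest. For the retriever, applying the guidance above might look like this (the limits here are illustrative, not the repo&apos;s values):</p>
<pre><code class="language-yaml">resources:
  requests:
    cpu: 100m        # was 200m; observed usage stays under 50%
    memory: 256Mi
  limits:
    cpu: &quot;1&quot;
    memory: 512Mi
</code></pre>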
<h2 id="why-kubernetes-can-be-an-option-for-running-llm-apps">Why Kubernetes can be an option for running LLM apps</h2>
<p>Let&apos;s revisit the original question: <strong>why Kubernetes for LLM workloads?</strong></p>
<p>Because Kubernetes provides:</p>
<ol>
<li><strong>GPU orchestration</strong>: Dynamic allocation, multi-tenancy, isolation</li>
<li><strong>Heterogeneous scaling</strong>: Independent scaling per component (HPA, KEDA, KServe)</li>
<li><strong>Service mesh</strong>: Discovery, load balancing, retries, circuit breaking</li>
<li><strong>Deployment primitives</strong>: Canary, blue-green, rolling updates, rollbacks</li>
<li><strong>Observability integration</strong>: Prometheus, Tempo, Grafana, OpenTelemetry</li>
<li><strong>Cost optimization</strong>: Spot instances, autoscaling, scale-to-zero</li>
<li><strong>Portability</strong>: Run in GCP (GKE), AWS (EKS), Azure (AKS), on-prem</li>
</ol>
<p>LLM applications are distributed systems. They have complex dependencies, heterogeneous resource requirements, and demanding operational SLOs. Kubernetes is purpose-built for this.</p>
<p>Yes, there&apos;s a learning curve. Yes, it&apos;s more complex. But the operational benefits&#x2014;reliability, scalability, observability, cost efficiency&#x2014;are undeniable.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>We&apos;ve built a production-grade RAG system from the ground up:</p>
<p><strong>Part 1</strong>: We discussed the core challenges (latency, reliability, cost, quality, observability)</p>
<p><strong>Part 2</strong>: We designed a resilient architecture with semantic caching, hybrid retrieval, graceful degradation, and comprehensive instrumentation</p>
<p><strong>Part 3</strong>: We outlined the K8S deployment with intelligent autoscaling (HPA, KEDA, KServe), full observability (Prometheus, Tempo, Grafana), and cost optimization</p>
<p>The patterns here (service isolation, reactive programming, timeouts, fallbacks, tracing, metrics, autoscaling) are how you build systems that serve millions of users.</p>
<p>The code is open source: <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a></p>
<p>Thanks for following along this series! If you have questions or want to discuss RAG architectures, Kubernetes patterns, or LLM infrastructure, find me on <a href="https://twitter.com/laytoun?ref=aboullaite.me">Twitter</a> or <a href="https://linkedin.com/in/aboullaite?ref=aboullaite.me">LinkedIn</a>.</p>
<p>Now go build something great.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Architecture Deep Dive]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the <a href="https://aboullaite.me/production-rag-java-k8s-part1/">first part</a>, we explored the production challenges of RAG systems: latency, reliability, cost, quality, and observability. Now let&apos;s get our hands dirty with the actual architecture and implementation.</p>
<p>The codebase uses Java 25, Spring Boot 3.5.7, reactive programming with WebFlux, and follows production patterns</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part2/</link><guid isPermaLink="false">69143a65cda49600011ec317</guid><category><![CDATA[Java]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[artificial intelligence]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 16 Nov 2025 13:32:49 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_ldtvwvldtvwvldtv.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/Gemini_Generated_Image_ldtvwvldtvwvldtv.jpeg" alt="Building Production-Grade RAG Systems: Architecture Deep Dive"><p>In the <a href="https://aboullaite.me/production-rag-java-k8s-part1/">first part</a>, we explored the production challenges of RAG systems: latency, reliability, cost, quality, and observability. Now let&apos;s get our hands dirty with the actual architecture and implementation.</p>
<p>The codebase uses Java 25, Spring Boot 3.5.7, reactive programming with WebFlux, and follows production patterns you&apos;d see in enterprise systems. Every design decision has a reason, and I&apos;ll explain the tradeoffs as we go.</p>
<h2 id="service-boundaries-why-separation-matters">Service Boundaries: Why Separation Matters</h2>
<p>The system is split into three main modules:</p>
<pre><code>common/                # Shared DTOs and tracing helpers
retriever/             # Reactive Weaviate/OpenSearch retriever service
orchestrator/          # Orchestration, caching, LLM routing, SSE
</code></pre>
<p>This separation isn&apos;t arbitrary. Each service has distinct scaling characteristics and failure modes:</p>
<ul>
<li><strong>Retriever</strong> is CPU and I/O intensive, scales horizontally, and needs aggressive timeouts</li>
<li><strong>Orchestrator</strong> manages state (semantic cache), handles user connections (SSE), and coordinates the pipeline</li>
<li><strong>Common</strong> provides shared contracts (DTOs) and telemetry utilities that both services use</li>
</ul>
<p>By isolating these responsibilities, we can scale the retriever independently during traffic spikes while keeping the orchestrator stable. If the retriever service crashes, the orchestrator can still serve cached responses.</p>
<h2 id="the-retriever-service-hybrid-search-with-fallbacks">The Retriever Service: Hybrid Search with Fallbacks</h2>
<p>Let&apos;s start with document retrieval. The retriever service exposes a single endpoint:</p>
<pre><code class="language-java">POST /v1/retrieve
</code></pre>
<p><strong>Request</strong>:</p>
<pre><code class="language-json">{
  &quot;text&quot;: &quot;How does autoscaling work?&quot;,
  &quot;filters&quot;: {&quot;section&quot;: &quot;infrastructure&quot;},
  &quot;topK&quot;: 5
}
</code></pre>
<p><strong>Response</strong>:</p>
<pre><code class="language-json">[
  {
    &quot;id&quot;: &quot;doc-03-autoscaling&quot;,
    &quot;chunk&quot;: &quot;Autoscaling combines HPA and KEDA...&quot;,
    &quot;score&quot;: 0.87,
    &quot;metadata&quot;: {
      &quot;source&quot;: &quot;doc-03-autoscaling.md&quot;,
      &quot;section&quot;: &quot;infrastructure&quot;
    }
  }
]
</code></pre>
<p>The implementation looks deceptively simple, but it is designed first and foremost for resilience:</p>
<pre><code class="language-java">// retriever/src/main/java/me/aboullaite/rag/retriever/service/RetrieverService.java

@Service
public class RetrieverService {

    private final WeaviateGateway weaviateGateway;
    private final OpenSearchGateway openSearchGateway;
    private final RetrieverProperties properties;
    private final Timer retrievalLatency;
    private final Counter fallbackCounter;
    private final Tracer tracer;

    public Mono&lt;List&lt;RetrievedDoc&gt;&gt; retrieve(Query query) {
        int topK = query.topK() &gt; 0 ? query.topK() : properties.getTopKDefault();
        return Mono.defer(() -&gt; executeRetrieval(query, topK));
    }

    private Mono&lt;List&lt;RetrievedDoc&gt;&gt; executeRetrieval(Query query, int topK) {
        Span span = tracer.spanBuilder(&quot;rag.retrieve&quot;)
                .setAttribute(&quot;rag.request.topK&quot;, topK)
                .startSpan();
        Timer.Sample sample = Timer.start(meterRegistry);

        return weaviateGateway.search(query, topK)
                .timeout(Duration.ofMillis(properties.getTimeoutMs()))
                .onErrorResume(throwable -&gt; fallback(query, topK, span, throwable))
                .doOnNext(docs -&gt; TracingUtils.recordRetrievedDocs(span, docs))
                .doOnError(span::recordException)
                .doFinally(signalType -&gt; {
                    sample.stop(retrievalLatency);
                    span.end();
                });
    }

    private Mono&lt;List&lt;RetrievedDoc&gt;&gt; fallback(Query query, int topK, Span parentSpan, Throwable throwable) {
        boolean timeout = throwable instanceof TimeoutException;
        log.warn(&quot;Primary vector search failed (timeout={}): {}&quot;, timeout, throwable.getMessage());
        fallbackCounter.increment();
        TracingUtils.recordFallback(parentSpan, timeout ? &quot;weaviate-timeout&quot; : throwable.getClass().getSimpleName());

        if (!openSearchGateway.isEnabled()) {
            return Mono.just(List.of());
        }

        return openSearchGateway.search(query, topK)
                .doOnNext(docs -&gt; TracingUtils.recordRetrievedDocs(parentSpan, docs));
    }
}
</code></pre>
<h3 id="key-design-decisions">Key Design Decisions</h3>
<p><strong>1. Reactive Streams with Project Reactor</strong></p>
<p>Notice the return type: <code>Mono&lt;List&lt;RetrievedDoc&gt;&gt;</code>. This is Project Reactor&apos;s reactive type for 0-1 values. By using reactive programming:</p>
<ul>
<li>We avoid blocking threads during I/O</li>
<li>Timeouts are first-class citizens (<code>.timeout(Duration.ofMillis(250))</code>)</li>
<li>Error handling composes naturally (<code>.onErrorResume()</code>)</li>
<li>Observability hooks integrate seamlessly (<code>.doOnNext()</code>, <code>.doFinally()</code>)</li>
</ul>
<p>Spring Boot&apos;s WebFlux framework handles request threads efficiently, allowing the retriever to handle hundreds of concurrent requests without thread pool exhaustion.</p>
<p><strong>2. Aggressive Timeouts</strong></p>
<p>The default timeout is <strong>250ms</strong>. That&apos;s intentionally tight. Why?</p>
<ul>
<li>Users expect sub-second responses</li>
<li>Vector databases can have occasional slow queries (large result sets, index rebuilds, etc.)</li>
<li>We&apos;d rather fall back to lexical search than make users wait. That is of course debatable and depends on what we are optimizing for!</li>
</ul>
<p>In load testing, this timeout triggers fallback ~5-15% of the time under heavy load, which is acceptable given the graceful degradation.</p>
<p><strong>3. Observability at Every Step</strong></p>
<p>Every retrieval is instrumented:</p>
<ul>
<li><strong>OpenTelemetry Span</strong>: captures timing, document count, and fallback reasons</li>
<li><strong>Prometheus Timer</strong>: records latency histogram for p95/p99 analysis</li>
<li><strong>Prometheus Counter</strong>: tracks fallback frequency</li>
</ul>
<p>When debugging production issues, we can simply filter Tempo traces by <code>rag.fallback.reason=weaviate-timeout</code> to see exactly which requests degraded.</p>
<p><strong>4. Lexical Fallback via OpenSearch</strong></p>
<p>Weaviate is great for semantic search, but sometimes you need exact term matching. OpenSearch provides BM25 ranking, which excels at:</p>
<ul>
<li>Acronyms (e.g., &quot;HPA&quot;, &quot;KEDA&quot;, &quot;SSE&quot;)</li>
<li>Version numbers (e.g., &quot;Java 25&quot;, &quot;Spring Boot 3.5.7&quot;)</li>
<li>Exact phrases (e.g., &quot;Server-Sent Events&quot;)</li>
</ul>
<p>The fallback amounts to a lightweight hybrid retrieval strategy. Some RAG systems use re-rankers to combine vector and lexical signals; here, we fall back from one to the other for simplicity while maintaining quality.</p>
<h2 id="the-orchestrator-service-coordination-and-caching">The Orchestrator Service: Coordination and Caching</h2>
<p>The orchestrator is the brain of the system. It coordinates caching, retrieval, prompt assembly, generation, and streaming. Let&apos;s walk through the request flow.</p>
<h3 id="request-flow-diagram">Request Flow Diagram</h3>
<p><img src="https://aboullaite.me/content/images/2025/11/Personal-2025-11-12-075246.png" alt="Building Production-Grade RAG Systems: Architecture Deep Dive" loading="lazy"></p>
<h3 id="semantic-cache-implementation">Semantic Cache Implementation</h3>
<p>The semantic cache is the secret weapon for both latency and cost optimization. Here&apos;s the code:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/cache/SemanticCacheService.java

@Service
public class SemanticCacheService {

    private static final String CACHE_INDEX = &quot;rag:cache:index&quot;;
    private static final String CACHE_KEY_PREFIX = &quot;rag:cache:&quot;;
    private static final double SIMILARITY_THRESHOLD = 0.90;
    private static final Duration CACHE_TTL = Duration.ofMinutes(10);

    private final RedisTemplate&lt;String, String&gt; redisTemplate;

    public Mono&lt;CacheHit&gt; lookup(String normalizedQuery, double[] embedding) {
        return Mono.fromCallable(() -&gt; {
            Set&lt;String&gt; keys = redisTemplate.opsForSet().members(CACHE_INDEX);
            if (keys == null || keys.isEmpty()) {
                return null;
            }

            double maxSimilarity = 0.0;
            CacheEntry bestMatch = null;

            for (String key : keys) {
                String json = redisTemplate.opsForValue().get(key);
                if (json == null) continue;

                CacheEntry entry = deserialize(json);
                double similarity = SimilarityUtils.cosineSimilarity(embedding, entry.embedding());

                if (similarity &gt; maxSimilarity &amp;&amp; similarity &gt;= SIMILARITY_THRESHOLD) {
                    maxSimilarity = similarity;
                    bestMatch = entry;
                }
            }

            return bestMatch != null ? new CacheHit(bestMatch, maxSimilarity) : null;
        }).subscribeOn(Schedulers.boundedElastic());
    }

    public Mono&lt;Void&gt; put(String normalizedQuery, double[] embedding,
                          GenerationResponse response, List&lt;RetrievedDoc&gt; docs) {
        return Mono.fromRunnable(() -&gt; {
            String key = CACHE_KEY_PREFIX + UUID.randomUUID();
            CacheEntry entry = new CacheEntry(
                normalizedQuery,
                embedding,
                response.answer(),
                response.citations(),
                docs.stream().map(RetrievedDoc::id).toList(),
                System.currentTimeMillis()
            );

            String json = serialize(entry);
            redisTemplate.opsForValue().set(key, json, CACHE_TTL);
            redisTemplate.opsForSet().add(CACHE_INDEX, key);
        }).subscribeOn(Schedulers.boundedElastic()).then();
    }
}
</code></pre>
<h3 id="why-cosine-similarity-threshold-090">Why Cosine Similarity Threshold 0.90?</h3>
<p>This threshold balances precision and recall:</p>
<ul>
<li><strong>Too low (e.g., 0.70)</strong>: You&apos;d match dissimilar queries, returning wrong cached answers</li>
<li><strong>Too high (e.g., 0.98)</strong>: You&apos;d miss legitimate matches, reducing cache hit rate</li>
</ul>
<p>At 0.90, queries like:</p>
<ul>
<li>&quot;How does autoscaling work?&quot;</li>
<li>&quot;Explain the autoscaling mechanism&quot;</li>
<li>&quot;What is the autoscaling strategy?&quot;</li>
</ul>
<p>...all match and reuse the same cached answer. But unrelated queries like &quot;How do I ingest data?&quot; won&apos;t match.</p>
<p>In load testing with realistic query distributions, 0.90 yields ~45% cache hit rate, cutting LLM costs nearly in half.</p>
<p>That said, these figures come from my own tests and use cases; don&apos;t rely on them blindly. Run your own benchmarks, as your numbers may differ.</p>
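<p>For completeness, here is what the <code>SimilarityUtils.cosineSimilarity</code> helper used in the cache lookup can look like. This is a minimal sketch; the repo&apos;s actual implementation may differ:</p>

```java
// Minimal cosine similarity: dot(a, b) / (|a| * |b|). For the non-negative
// hash-based demo vectors in this series, the result lies in [0, 1].
public final class SimilarityUtils {
    private SimilarityUtils() {}

    public static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat similarity with a zero vector as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

<p>The same function works unchanged whether the vectors are the 8-dimensional demo embeddings or 384/1536-dimensional production embeddings.</p>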
<h3 id="deterministic-embeddings">Deterministic Embeddings</h3>
<p>For this demo, I&apos;m using deterministic 8-dimensional embeddings generated via SHA-256 hashing:</p>
<pre><code class="language-java">// common/src/main/java/me/aboullaite/rag/common/embedding/DeterministicEmbedding.java

public class DeterministicEmbedding {
    public static double[] embed(String text) {
        try {
            byte[] hash = MessageDigest.getInstance(&quot;SHA-256&quot;)
                    .digest(text.getBytes(StandardCharsets.UTF_8));  // fixed charset keeps it deterministic
            double[] embedding = new double[8];
            for (int i = 0; i &lt; 8; i++) {
                embedding[i] = (hash[i] &amp; 0xFF) / 255.0;
            }
            // normalize(...) scales the vector to unit length (definition omitted here)
            return normalize(embedding);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(&quot;SHA-256 must be available&quot;, e);
        }
    }
}
</code></pre>
<p><strong>Why deterministic embeddings?</strong></p>
<ul>
<li>No external embedding service dependency for the demo</li>
<li>Reproducible cache behavior in tests</li>
<li>Instant embedding computation (no API latency)</li>
</ul>
<p>In production, we&apos;d use proper sentence embeddings (e.g., <code>all-MiniLM-L6-v2</code> via Hugging Face or OpenAI&apos;s <code>text-embedding-3-small</code>). The cache logic remains identical; we just swap the embedding function.</p>
<h2 id="prompt-assembly-and-citation-tracking">Prompt Assembly and Citation Tracking</h2>
<p>Once documents are retrieved, we need to construct a prompt that:</p>
<ol>
<li>Provides clear instructions to the LLM</li>
<li>Injects retrieved context</li>
<li>Enforces citation requirements</li>
<li>Handles edge cases (no documents, partial results, etc.)</li>
</ol>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/prompt/PromptAssembler.java

@Component
public class PromptAssembler {

    public PromptBundle assemble(String userPrompt, List&lt;RetrievedDoc&gt; docs) {
        if (docs.isEmpty()) {
            return new PromptBundle(
                noContextPrompt(userPrompt),
                List.of(),
                List.of()
            );
        }

        StringBuilder prompt = new StringBuilder();
        prompt.append(&quot;You are a helpful assistant. Answer the question based ONLY on the provided documents.\n\n&quot;);
        prompt.append(&quot;Documents:\n&quot;);

        List&lt;String&gt; citations = new ArrayList&lt;&gt;();
        List&lt;CitationInfo&gt; citationDetails = new ArrayList&lt;&gt;();

        for (int i = 0; i &lt; docs.size(); i++) {
            RetrievedDoc doc = docs.get(i);
            String citationId = doc.id();
            citations.add(citationId);
            citationDetails.add(new CitationInfo(
                citationId,
                doc.metadata().source(),
                doc.metadata().section()
            ));

            prompt.append(String.format(&quot;[%s] %s\n\n&quot;, citationId, doc.chunk()));
        }

        prompt.append(&quot;Question: &quot;).append(userPrompt).append(&quot;\n\n&quot;);
        prompt.append(&quot;Instructions:\n&quot;);
        prompt.append(&quot;- Answer ONLY using information from the provided documents\n&quot;);
        prompt.append(&quot;- Cite sources using [doc-id] notation\n&quot;);
        prompt.append(&quot;- If the documents don&apos;t contain enough information, say &apos;I don&apos;t know&apos;\n&quot;);
        prompt.append(&quot;- Be concise and accurate\n\n&quot;);
        prompt.append(&quot;Answer:&quot;);

        return new PromptBundle(prompt.toString(), citations, citationDetails);
    }

    private String noContextPrompt(String userPrompt) {
        return &quot;You are a helpful assistant. The user asked: &quot; + userPrompt +
               &quot;\n\nNo relevant documents were found. Please respond with: I don&apos;t know.&quot;;
    }
}
</code></pre>
<h3 id="citation-enforcement">Citation Enforcement</h3>
<p>Notice the explicit instructions:</p>
<ul>
<li>&quot;Answer ONLY using information from the provided documents&quot;</li>
<li>&quot;Cite sources using [doc-id] notation&quot;</li>
<li>&quot;If the documents don&apos;t contain enough information, say &apos;I don&apos;t know&apos;&quot;</li>
</ul>
<p>LLMs are surprisingly good at following these instructions when they&apos;re clear and emphatic. In testing with Gemma-2-2B, citation compliance is &gt;85% for well-formed prompts.</p>
<p>The <code>PromptBundle</code> record encapsulates:</p>
<pre><code class="language-java">public record PromptBundle(
    String prompt,              // Full prompt sent to LLM
    List&lt;String&gt; citations,     // [doc-03-autoscaling, doc-09-infrastructure]
    List&lt;CitationInfo&gt; citationDetails  // Full metadata for UI rendering
) {}
</code></pre>
<p>This separation allows the orchestrator to:</p>
<ul>
<li>Send a clean prompt to the LLM</li>
<li>Return structured citations to the client</li>
<li>Track which documents contributed to each answer (for cache invalidation, analytics, etc.)</li>
</ul>
<h2 id="llm-integration-kserve-vllm">LLM Integration: KServe + vLLM</h2>
<p>The LLM layer uses <strong>KServe</strong> (Kubernetes serving framework) with <strong>vLLM</strong> runtime to host <strong>Gemma-2-2B-it</strong> (instruction-tuned).</p>
<h3 id="why-kserve">Why KServe?</h3>
<p>KServe provides:</p>
<ul>
<li><strong>Autoscaling</strong>: Scale-to-zero when idle, scale-up on demand</li>
<li><strong>GPU management</strong>: Automatic GPU resource allocation</li>
<li><strong>Inference optimization</strong>: vLLM uses PagedAttention for efficient memory usage</li>
<li><strong>Standardized API</strong>: OpenAI-compatible <code>/v1/chat/completions</code> endpoint</li>
</ul>
<p>The InferenceService definition:</p>
<pre><code class="language-yaml"># deploy/kserve-vllm.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: rag-llm
  namespace: rag
spec:
  predictor:
    minReplicas: 0        # Scale to zero when idle
    maxReplicas: 1
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      runtime: vllm-runtime
      modelFormat:
        name: huggingface
      args:
        - --model
        - google/gemma-2-2b-it
        - --dtype
        - auto
        - --max-model-len
        - &quot;4096&quot;
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: &quot;3&quot;
          memory: 12Gi
        requests:
          cpu: &quot;2&quot;
          memory: 8Gi
</code></pre>
<h3 id="vllm-runtime-configuration">vLLM Runtime Configuration</h3>
<p>vLLM is a high-performance inference engine optimized for LLMs. Key features:</p>
<ul>
<li><strong>PagedAttention</strong>: Reduces memory fragmentation, increases throughput</li>
<li><strong>Continuous batching</strong>: Processes multiple requests efficiently</li>
<li><strong>Quantization support</strong>: <code>--dtype auto</code> enables FP16/BF16 for faster inference</li>
</ul>
<p>With Gemma-2-2B on an L4 GPU (24GB), vLLM achieves:</p>
<ul>
<li><strong>Time-to-first-token (TTFT)</strong>: ~100-300ms</li>
<li><strong>Throughput</strong>: ~50-80 tokens/sec</li>
<li><strong>Concurrent requests</strong>: 4-8 (depending on sequence length)</li>
</ul>
<h3 id="llm-client-implementation">LLM Client Implementation</h3>
<p>The orchestrator calls KServe via a reactive client:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/client/LlmClient.java

@Component
public class LlmClient {

    private final WebClient webClient;
    private final OrchestratorProperties properties;

    public Mono&lt;LlmResponse&gt; generate(String prompt) {
        Map&lt;String, Object&gt; request = Map.of(
            &quot;model&quot;, properties.getModelName(),
            &quot;messages&quot;, List.of(
                Map.of(&quot;role&quot;, &quot;user&quot;, &quot;content&quot;, prompt)
            ),
            &quot;max_tokens&quot;, properties.getMaxTokens(),
            &quot;temperature&quot;, properties.getTemperature()
        );

        long startNano = System.nanoTime();
        AtomicLong ttftNano = new AtomicLong(0);

        return webClient.post()
                .uri(&quot;/v1/chat/completions&quot;)
                .bodyValue(request)
                .retrieve()
                .bodyToMono(Map.class)
                .map(response -&gt; {
                    // Non-streaming call: the whole body arrives at once, so this
                    // effectively records total response time (an upper bound on TTFT).
                    if (ttftNano.get() == 0) {
                        ttftNano.set(System.nanoTime() - startNano);
                    }

                    String content = extractContent(response);
                    int tokens = estimateTokens(content);
                    long ttftMillis = ttftNano.get() / 1_000_000;

                    return new LlmResponse(content, tokens, ttftMillis);
                })
                .timeout(Duration.ofSeconds(properties.getGenerationTimeoutSeconds()));
    }
}
</code></pre>
<p><strong>Time-to-First-Token (TTFT)</strong> is a critical metric for user experience. Measuring it accurately requires:</p>
<ol>
<li>Start timer when request begins</li>
<li>Capture timestamp on first response byte</li>
<li>Calculate delta in milliseconds</li>
</ol>
<p>This metric appears in OpenTelemetry spans as <code>rag.ttft_ms</code>, allowing us to track degradation trends in Grafana.</p>
<h2 id="streaming-responses-with-server-sent-events">Streaming Responses with Server-Sent Events</h2>
<p>One of the best UX improvements in modern LLM applications is streaming. Instead of waiting 3+ seconds for the complete answer, users see tokens as they&apos;re generated.</p>
<h3 id="sse-endpoint">SSE Endpoint</h3>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/web/AskController.java

@GetMapping(value = &quot;/v1/ask/stream&quot;, produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux&lt;ServerSentEvent&lt;String&gt;&gt; askStream(
        @RequestParam String prompt,
        @RequestParam(required = false) Map&lt;String, String&gt; filters,
        @RequestParam(required = false, defaultValue = &quot;5&quot;) Integer topK) {

    return askService.askStreaming(prompt, filters, topK)
            .map(chunk -&gt; {
                if (chunk.isComplete()) {
                    return ServerSentEvent.&lt;String&gt;builder()
                            .event(&quot;complete&quot;)
                            .data(toJson(chunk))
                            .build();
                } else {
                    return ServerSentEvent.&lt;String&gt;builder()
                            .event(&quot;token&quot;)
                            .data(chunk.token())
                            .build();
                }
            });
}
</code></pre>
<p><strong>Event Types</strong>:</p>
<ul>
<li><code>token</code>: Individual generated tokens (streamed progressively)</li>
<li><code>complete</code>: Final event containing citations and metadata</li>
</ul>
<p><strong>Client-side consumption</strong> (JavaScript):</p>
<pre><code class="language-javascript">const eventSource = new EventSource(&apos;/v1/ask/stream?prompt=How+does+caching+work&apos;);

eventSource.addEventListener(&apos;token&apos;, (event) =&gt; {
    document.getElementById(&apos;answer&apos;).textContent += event.data;
});

eventSource.addEventListener(&apos;complete&apos;, (event) =&gt; {
    const result = JSON.parse(event.data);
    displayCitations(result.citations);
    eventSource.close();
});
</code></pre>
<p>Progressive rendering dramatically improves perceived performance. Users engage with partial responses while generation continues, reducing perceived wait time by 50-70%.</p>
<h2 id="request-orchestration-putting-it-all-together">Request Orchestration: Putting It All Together</h2>
<p>Here&apos;s the core orchestration logic that ties everything together:</p>
<pre><code class="language-java">// orchestrator/src/main/java/me/aboullaite/rag/orchestrator/service/AskService.java

public Mono&lt;GenerationResponse&gt; ask(String prompt, Map&lt;String, String&gt; filters, Integer topK) {
    String sanitizedPrompt = redact(prompt);  // PII redaction
    double[] embedding = embeddingService.embed(sanitizedPrompt);
    Span span = tracer.spanBuilder(&quot;rag.ask&quot;).startSpan();
    Timer.Sample sample = Timer.start(meterRegistry);

    return cacheService.lookup(sanitizedPrompt, embedding)
            .flatMap(hit -&gt; onCacheHit(hit, span))
            .switchIfEmpty(Mono.defer(() -&gt; {
                cacheMissCounter.increment();
                TracingUtils.recordCacheHit(span, false);
                return generateWithRetrieval(sanitizedPrompt, filters, topK, embedding, span);
            }))
            .doOnError(span::recordException)
            .doFinally(signalType -&gt; {
                sample.stop(askLatency);
                span.end();
            });
}

private Mono&lt;GenerationResponse&gt; generateWithRetrieval(
        String sanitizedPrompt,
        Map&lt;String, String&gt; filters,
        Integer topK,
        double[] embedding,
        Span parentSpan) {
    Query query = new Query(sanitizedPrompt, filters, topK);

    return retrieverClient.retrieve(query)
            .flatMap(docs -&gt; produceAnswer(sanitizedPrompt, docs, embedding, parentSpan))
            .switchIfEmpty(Mono.defer(() -&gt; produceAnswer(sanitizedPrompt, List.of(), embedding, parentSpan)));
}

private Mono&lt;GenerationResponse&gt; produceAnswer(
        String sanitizedPrompt,
        List&lt;RetrievedDoc&gt; docs,
        double[] embedding,
        Span parentSpan) {
    PromptBundle promptBundle = promptAssembler.assemble(sanitizedPrompt, docs);

    return llmClient.generate(promptBundle.prompt())
            .map(response -&gt; toGenerationResponse(response, promptBundle, false, parentSpan))
            .flatMap(response -&gt; cacheService.put(sanitizedPrompt, embedding, response, docs)
                    .thenReturn(response))
            .onErrorResume(ex -&gt; {
                log.warn(&quot;LLM call failed, using fallback: {}&quot;, ex.getMessage());
                fallbackCounter.increment();
                TracingUtils.recordFallback(parentSpan, ex.getClass().getSimpleName());
                GenerationResponse fallback = fallbackResponse(docs, promptBundle.citationDetails());
                return cacheService.put(sanitizedPrompt, embedding, fallback, docs)
                        .onErrorResume(e -&gt; Mono.empty())
                        .thenReturn(fallback);
            });
}
</code></pre>
<h3 id="reactive-composition-explained">Reactive Composition Explained</h3>
<p>The flow uses reactive operators to compose asynchronous operations:</p>
<ol>
<li><strong><code>cacheService.lookup()</code></strong>: Check cache (non-blocking I/O to Redis)</li>
<li><strong><code>.flatMap(hit -&gt; onCacheHit())</code></strong>: If cache hit, return immediately</li>
<li><strong><code>.switchIfEmpty(Mono.defer(() -&gt; ...))</code></strong>: If cache miss, proceed to retrieval</li>
<li><strong><code>retrieverClient.retrieve()</code></strong>: Call retriever service (HTTP call)</li>
<li><strong><code>.flatMap(docs -&gt; produceAnswer())</code></strong>: Generate answer with retrieved docs</li>
<li><strong><code>llmClient.generate()</code></strong>: Call LLM (HTTP streaming)</li>
<li><strong><code>.flatMap(response -&gt; cacheService.put())</code></strong>: Cache the result</li>
<li><strong><code>.onErrorResume(ex -&gt; fallbackResponse())</code></strong>: Graceful degradation on error</li>
<li><strong><code>.doFinally()</code></strong>: Stop timer and close span (always executes)</li>
</ol>
<p>This composition is <strong>non-blocking</strong>. No threads wait on I/O. Spring WebFlux dispatches work efficiently across an event loop, enabling high concurrency with minimal thread overhead.</p>
<h2 id="component-summary">Component Summary</h2>
<p>Let&apos;s recap the key components and their roles:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Responsibility</th>
<th>Technology</th>
<th>Scaling Strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Orchestrator</strong></td>
<td>Request coordination, caching, streaming</td>
<td>Spring WebFlux, Redis</td>
<td>Horizontal (stateless except cache)</td>
</tr>
<tr>
<td><strong>Retriever</strong></td>
<td>Hybrid search (vector + lexical)</td>
<td>Spring WebFlux, Weaviate, OpenSearch</td>
<td>HPA (CPU) + KEDA (RPS)</td>
</tr>
<tr>
<td><strong>Semantic Cache</strong></td>
<td>Similarity-based response caching</td>
<td>Redis, cosine similarity</td>
<td>Vertical (single instance for consistency)</td>
</tr>
<tr>
<td><strong>Vector Store</strong></td>
<td>Semantic document search</td>
<td>Weaviate</td>
<td>Managed/external service</td>
</tr>
<tr>
<td><strong>Lexical Store</strong></td>
<td>Fallback keyword search</td>
<td>OpenSearch</td>
<td>Managed/external service</td>
</tr>
<tr>
<td><strong>LLM Serving</strong></td>
<td>Model inference with GPU</td>
<td>KServe, vLLM, Gemma-2-2B</td>
<td>KServe autoscaling (scale-to-zero)</td>
</tr>
<tr>
<td><strong>Observability</strong></td>
<td>Metrics, traces, dashboards</td>
<td>Prometheus, Tempo, Grafana, OTEL</td>
<td>N/A (infrastructure)</td>
</tr>
</tbody>
</table>
<h2 id="why-this-architecture-scales">Why This Architecture Scales</h2>
<p>The components here aren&apos;t just for demos; they&apos;re a solid starting point for building production systems:</p>
<ol>
<li><strong>Service Isolation</strong>: Retriever and orchestrator scale independently</li>
<li><strong>Reactive Programming</strong>: Non-blocking I/O maximizes throughput</li>
<li><strong>Timeouts Everywhere</strong>: Aggressive timeouts prevent cascading failures</li>
<li><strong>Graceful Degradation</strong>: Fallbacks at every layer (cache &#x2192; retrieval &#x2192; generation)</li>
<li><strong>Observability-First</strong>: Traces and metrics built into every code path</li>
<li><strong>Cost Awareness</strong>: Semantic caching reduces LLM spend by ~40-60%</li>
</ol>
<p>When traffic spikes, the retriever scales out (2&#x2192;30 replicas). When traffic drops, KServe scales the LLM to zero. When Weaviate slows down, OpenSearch takes over. When the LLM fails, deterministic fallbacks keep users informed.</p>
<p>Every point of failure has a fallback, keeping the experience resilient and useful for users.</p>
<p>The complete code is available at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>.</p>
<p>Stay tuned for the final part: a Kubernetes deep dive.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building Production-Grade RAG Systems: Understanding the Problem Space]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I&apos;ve been quiet on this blog for a while now. Truth is, I lost my appetite for writing these past months. Between traveling to conferences, delivering talks, and shipping some cool features at work, the keyboard just didn&apos;t feel the same. There was also this nagging</p>]]></description><link>https://aboullaite.me/production-rag-java-k8s-part1/</link><guid isPermaLink="false">6911ffc7cda49600011ec284</guid><category><![CDATA[Java]]></category><category><![CDATA[artificial intelligence]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 10 Nov 2025 16:00:00 GMT</pubDate><media:content url="https://aboullaite.me/content/images/2025/11/unnamed.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://aboullaite.me/content/images/2025/11/unnamed.jpg" alt="Building Production-Grade RAG Systems: Understanding the Problem Space"><p>I&apos;ve been quiet on this blog for a while now. Truth is, I lost my appetite for writing these past months. Between traveling to conferences, delivering talks, and shipping some cool features at work, the keyboard just didn&apos;t feel the same. There was also this nagging voice in my head: AI content has taken over the world: why bother writing another blog post when an LLM-generated version will probably be better anyway?</p>
<p>But here&apos;s the thing: as I kept hacking away on projects, I stumbled across posts that made me pause. Posts that weren&apos;t just technically correct; they had personality and insights born from battle scars, the kind of stuff one can&apos;t prompt-engineer. And I realised: maybe that&apos;s exactly what&apos;s missing. Stories from developers doing real work. So this is my modest attempt at bringing this blog back to life.</p>
<p>This post kicks off a three-part series where I dive deep into something I&apos;ve been building over the past few weeks: production-grade RAG applications. The kind that survives (hopefully) production traffic, handles failures gracefully, and doesn&apos;t bankrupt us on LLM costs. Along the way, I&apos;ll share the lessons I learned (some the hard way).</p>
<p>When we think about building a Retrieval Augmented Generation (RAG) system, the first instinct is often to grab a vector database, throw in some embeddings, connect an LLM, and call it a day. I&apos;ve been there. But production RAG systems are an entirely different beast. The gap between a proof-of-concept and a system that can handle real user traffic, maintain acceptable latency, and provide reliable answers is wider than what I initially (naively) thought.</p>
<p>This is the first post in a three-part series where I&apos;ll walk you through building a production-inspired RAG pipeline using Java 25, Spring Boot 3.5.7 (Spring AI), and Kubernetes. We&apos;ll explore not just the happy path, but the real challenges: graceful degradation, semantic caching, hybrid retrieval strategies, observability, and intelligent autoscaling.</p>
<h2 id="the-rag-promise-and-reality">The RAG Promise and Reality</h2>
<p>Retrieval Augmented Generation fundamentally solves a critical problem with Large Language Models: hallucinations and knowledge staleness. Instead of relying solely on the model&apos;s knowledge, RAG systems ground responses in retrieved documents. The architecture is conceptually simple:</p>
<ol>
<li>User asks a question</li>
<li>System retrieves relevant documents from a knowledge base</li>
<li>Documents are injected into the LLM prompt as context</li>
<li>LLM generates an answer grounded in the retrieved facts</li>
<li>Response is returned with citations</li>
</ol>
<p>But here&apos;s where things get interesting. This simple flow hides a multitude of production concerns that can make or break a system.</p>
<h2 id="what-goes-wrong-in-production-rag-systems">What Goes Wrong in Production RAG Systems?</h2>
<p>There are a few things that ruin your sleep and wake you up at 3am with an alert:</p>
<h3 id="the-latency-problem">The Latency Problem</h3>
<p>It obviously depends on your SLO, but traditional RAG systems often stack latencies in series: retrieval time + embedding time + LLM inference time + response streaming. Each step adds milliseconds (or worse, seconds) to your user&apos;s waiting time. When your vector database takes 300ms to return results, your LLM takes 2 seconds for first token, and you&apos;re processing embeddings for cache lookups, you&apos;re looking at 3+ seconds before users see anything.</p>
<p>Users might expect sub-second responses. Anything beyond 2 seconds feels broken. Again, it largely depends on your SLOs.</p>
<h3 id="the-reliability-problem">The Reliability Problem</h3>
<p>Vector databases can timeout. LLMs can be overloaded. Networks can fail. In a traditional RAG pipeline, any single component failure means the entire request fails. That&apos;s unacceptable in production.</p>
<p>What happens when Weaviate is under load and the semantic search times out? Do we return an error? Or do we have a fallback strategy that still delivers value to our users?</p>
<h3 id="the-cost-problem">The Cost Problem</h3>
<p>Every LLM call costs money. Every embedding calculation burns CPU cycles. When users ask the same question five different ways (&quot;How do I deploy to K8s?&quot;, &quot;What&apos;s the Kubernetes deployment process?&quot;, &quot;K8s deployment steps?&quot;), we&apos;re essentially paying for the same answer multiple times.</p>
<p>Even worse, we&apos;re making our users wait for responses we&apos;ve already computed.</p>
<h3 id="the-quality-problem">The Quality Problem</h3>
<p>Vector similarity alone isn&apos;t always enough. Sometimes lexical matching (good old BM25) finds documents that semantic search misses&#x2014;especially for exact terms, acronyms, or technical identifiers. Relying solely on embeddings can leave quality on the table.</p>
<h3 id="the-observability-problem">The Observability Problem</h3>
<p>When the RAG pipeline misbehaves&#x2014;returning poor answers, experiencing high latency, or burning through our LLM budget&#x2014;how do we debug it? Traditional application monitoring doesn&apos;t capture the nuances of retrieval quality, cache hit rates, or generation costs.</p>
<p>We need visibility into every stage of the pipeline, from retrieval to generation, with metrics that actually matter for RAG workloads.</p>
<h2 id="the-blueprint">The Blueprint (?)</h2>
<p>I have been searching for patterns and best practices on how to build a RAG system that addresses each of these concerns. For me, the target architecture wasn&apos;t just about making things work! It&apos;s about making them work reliably, cost-effectively, and observably at scale.</p>
<p>Here&apos;s the high-level blueprint:</p>
<p><img src="https://aboullaite.me/content/images/2025/11/Personal-2025-11-10-150053.png" alt="Building Production-Grade RAG Systems: Understanding the Problem Space" loading="lazy"></p>
<p>Let&apos;s break down how this architecture solves each problem:</p>
<h3 id="solving-latency-semantic-caching">Solving Latency: Semantic Caching</h3>
<p>Before doing any expensive operations, we check Redis for semantically similar queries. The cache doesn&apos;t just match exact strings&#x2014;it computes cosine similarity between query embeddings. If a user asks &quot;How does autoscaling work?&quot; and we&apos;ve previously answered &quot;Explain the autoscaling mechanism&quot;, we detect that similarity (threshold <code>0.90</code> for example ) and return the cached response immediately.</p>
<p>This short-circuits the entire pipeline. No retrieval. No LLM call. Sub-100ms response times.</p>
<p>The cache stores:</p>
<ul>
<li>Normalized query text (with PII redaction)</li>
<li>Deterministic query embedding (8-dimensional for demo purposes)</li>
<li>Complete generated answer</li>
<li>Citation list</li>
<li>Retrieved document IDs</li>
<li>Timestamp for observability</li>
</ul>
<p>Cache entries expire after 10 minutes by default, keeping answers fresh as documentation evolves.</p>
<h3 id="solving-reliability-layered-fallbacks">Solving Reliability: Layered Fallbacks</h3>
<p>The system implements graceful degradation at every level:</p>
<p><strong>Retriever Fallback</strong>: Weaviate has a strict timeout budget (<code>250ms</code>). If it doesn&apos;t respond in time, the retriever automatically falls back to OpenSearch for lexical BM25 search. The user still gets an answer... maybe not the semantically perfect one, but a relevant one based on keyword matching.</p>
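<p>The pattern itself is small: run the primary search under a strict deadline, and switch to the lexical store on timeout or error. A standalone sketch (hypothetical names; the real code uses a reactive pipeline rather than executors):</p>

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical standalone sketch of "primary search with a strict timeout
// budget, fall back to lexical search on timeout or error".
public class FallbackSearch {

    public static List<String> search(Callable<List<String>> vectorSearch,
                                      Callable<List<String>> lexicalSearch,
                                      long budgetMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<List<String>> primary = pool.submit(vectorSearch);
            try {
                return primary.get(budgetMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                primary.cancel(true);        // stop waiting on the slow vector store
                return lexicalSearch.call(); // degrade to BM25 keyword results
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

<p>In Reactor, the same shape is a <code>timeout(Duration.ofMillis(250))</code> followed by <code>onErrorResume</code> to the lexical client.</p>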
<p><strong>Generator Fallback</strong>: If the LLM endpoint times out or returns an error, the orchestrator doesn&apos;t fail. Instead, it synthesizes a deterministic answer by summarizing the top retrieved chunks, clearly marking it as partial and including citations. Users get actionable information even when the model is unavailable.</p>
<p><strong>Streaming Resilience</strong>: Server-Sent Events (SSE) provide progressive rendering. Users see tokens as they&apos;re generated, and the final event includes citations and a <code>partial</code> flag indicating any degradation.</p>
<p>Every fallback event is instrumented: emitting OpenTelemetry spans with attributes like <code>rag.fallback.reason=weaviate-timeout</code> so we can measure how often each degradation path triggers.</p>
<h3 id="solving-cost-intelligent-caching-and-deduplication">Solving Cost: Intelligent Caching and Deduplication</h3>
<p>Semantic caching isn&apos;t just about latency&#x2014;it&apos;s about cost. LLM calls are expensive. With a well-tuned cache, we can reduce redundant generation by 40-60% depending on the query distribution.</p>
<p>The cache uses deterministic embeddings for the demo (SHA-256 based hashing producing 8-dimensional vectors), but in production we&apos;d use proper sentence embeddings. The key insight is that cosine similarity &gt; <code>0.90</code> means &quot;close enough&quot; to reuse the answer.</p>
<p>Beyond caching, we instrument every request with estimated cost metrics. At $0.002 per 1K tokens (approximate Gemma-2 pricing), a Grafana dashboard visualizes cost-per-request trends, helping you optimize both caching and prompt engineering.</p>
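<p>The per-request estimate itself is simple arithmetic. A sketch using the approximate rate above (the constant is an assumed figure for illustration, not real pricing):</p>

```java
// Rough cost estimate at a flat per-1K-token rate (assumed figure for
// illustration; substitute your provider's actual pricing).
public class CostEstimator {
    private static final double USD_PER_1K_TOKENS = 0.002;

    public static double estimateUsd(long totalTokens) {
        return (totalTokens / 1000.0) * USD_PER_1K_TOKENS;
    }
}
```

<p>Recording this value on a counter per request is what feeds the cost-per-request panels in Grafana.</p>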
<h3 id="solving-quality-hybrid-retrieval">Solving Quality: Hybrid Retrieval</h3>
<p>Vector search excels at semantic similarity but can miss exact matches for technical terms, version numbers, or acronyms. OpenSearch provides lexical BM25 ranking that captures these cases.</p>
<p>The retriever service prioritizes vector search (Weaviate) but automatically falls back to lexical search (OpenSearch) when vector queries timeout. This hybrid approach ensures we get the best of both worlds: semantic understanding when available, lexical precision when needed.</p>
<p>In future iterations, we could combine both signals using a re-ranking model, but for many workloads, the fallback strategy alone provides sufficient quality.</p>
<h3 id="solving-observability-opentelemetry-prometheus-grafana">Solving Observability: OpenTelemetry + Prometheus + Grafana</h3>
<p>Every request flows through instrumented code paths. The observability stack captures:</p>
<p><strong>Traces (OpenTelemetry + Tempo)</strong>: End-to-end request traces showing retrieval time, document count, cache decisions, LLM first-token latency, and total tokens generated. Custom span attributes include:</p>
<ul>
<li><code>rag.cache.hit</code>: boolean</li>
<li><code>rag.retrieval.count</code>: number of documents retrieved</li>
<li><code>rag.retrieval.source</code>: &quot;weaviate&quot; or &quot;opensearch&quot;</li>
<li><code>rag.fallback.reason</code>: why degradation occurred</li>
<li><code>rag.model.name</code>: which LLM was used</li>
<li><code>rag.tokens.total</code>: generated token count</li>
<li><code>rag.ttft_ms</code>: time to first token</li>
</ul>
<p><strong>Metrics (Prometheus + Grafana)</strong>: Counters and histograms for:</p>
<ul>
<li><code>rag_orchestrator_latency</code>: p50/p95/p99 end-to-end latency</li>
<li><code>rag_cache_hit_total</code> / <code>rag_cache_miss_total</code>: cache efficiency</li>
<li><code>rag_retrieval_fallback_total</code>: how often fallback triggered</li>
<li><code>rag_tokens_generated_total</code>: token consumption trends</li>
<li><code>rag_cost_usd_total</code>: estimated spend per request</li>
</ul>
<p>Grafana dashboards visualize these metrics alongside autoscaling replica counts, giving operators a complete view of system behavior under load.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Whether you&apos;re building internal documentation search, customer support automation, or code assistance tools, you&apos;ll probably face the same tradeoffs around latency, reliability, cost, and quality.</p>
<p>The architecture I&apos;m suggesting here worked for me, and might be useful for someone on the internet facing the same challenges. It&apos;s built for <strong>failure</strong>.</p>
<p>The complete code is available at <a href="https://github.com/aboullaite/rag-java-k8s?ref=aboullaite.me">github.com/aboullaite/rag-java-k8s</a>. You can run the entire stack locally with <code>make dev-up</code> or deploy to GKE with <code>make gke-cluster</code>.</p>
<p>Stay tuned. The next post will get hands-on with code and architectural patterns.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A look into Deep Java Library!]]></title><description><![CDATA[<p>When you think about building machine learning apps, Java is not the first language that comes to mind, probably not even in the top 3 or 5! But Java has proved time and again that it is capable of modernising itself, and even if it&apos;s not the first</p>]]></description><link>https://aboullaite.me/djl-ml-java/</link><guid isPermaLink="false">64862b43cda49600011ec11d</guid><category><![CDATA[Java]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[artificial intelligence]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 12 Jun 2023 18:10:42 GMT</pubDate><content:encoded><![CDATA[<p>When you think about building machine learning apps, Java is not the first language that comes to mind, probably not even in the top 3 or 5! But Java has proved time and again that it is capable of modernising itself, and even if it&apos;s not the first choice for many use cases, it offers a choice for the 10 million developers that are using it.</p><p>A few weeks back I started exploring a new Java library called <a href="https://djl.ai">DJL</a>, an open source, engine-agnostic Java framework for deep learning.
In this post we&apos;re going to explore some of DJL&apos;s capabilities by building a speech recognition application.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Crooked Lake, IN - 7/4/17" loading="lazy" width="5472" height="3648" srcset="https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1501526029524-a8ea952b15be?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxtYWNoaW5lJTIwbGVhcm5pbmd8ZW58MHx8fHwxNjg2NTE0ODY3fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@hharritt?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Hunter Harritt</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="deep-java-library">Deep Java Library</h2><p>DJL was first released in 2019 by Amazon Web Services, aiming to offer a simple-to-use, easy-to-get-started machine learning framework for Java developers. It offers multiple Java APIs that simplify training, testing, deploying, analysing, and predicting with deep-learning models.</p><p>DJL APIs abstract away the complexity involved in developing deep learning models, making them easy to learn and easy to apply. With the bundled set of pre-trained models in the <a href="https://github.com/deepjavalibrary/djl/blob/master/docs/model-zoo.md?ref=aboullaite.me">model-zoo</a>, users can immediately start integrating deep learning into their Java applications.</p><h2 id="showtime">Showtime</h2><p>As I mentioned earlier, we&apos;re building a small speech recognition application. The backend is built using Java 17 and Spring Boot 3.1. The frontend is built with React 18.2. The full application code is shared in <a href="https://github.com/aboullaite/djl-demo?ref=aboullaite.me">this repo</a>.</p><!--kg-card-begin: html--><div style="width:100%;height:0;padding-bottom:61%;position:relative;"><iframe src="https://giphy.com/embed/Z5s6b6dwbVo3sdtYj6" width="100%" height="100%" style="position:absolute" frameborder="0" class="giphy-embed" allowfullscreen></iframe></div><p><a href="https://giphy.com/gifs/Z5s6b6dwbVo3sdtYj6?ref=aboullaite.me">via GIPHY</a></p><!--kg-card-end: html--><h2 id="backend-configuration">Backend configuration</h2><!--kg-card-begin: markdown--><p>First of all, we need to add the necessary DJL dependencies. I am using DJL version <code>0.22.1</code>, the latest release as of this writing. We need two specific DJL dependencies for this application:</p>
<ul>
<li><code>djl-api</code>: the DJL core API.</li>
<li><code>pytorch-engine</code>: the DJL implementation of the PyTorch engine, enabling us to load and run PyTorch-built models.</li>
</ul>
<pre><code class="language-xml">    &lt;dependency&gt;
      &lt;groupId&gt;ai.djl&lt;/groupId&gt;
      &lt;artifactId&gt;api&lt;/artifactId&gt;
      &lt;version&gt;${djl.version}&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;ai.djl.pytorch&lt;/groupId&gt;
      &lt;artifactId&gt;pytorch-engine&lt;/artifactId&gt;
      &lt;version&gt;${djl.version}&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<p>Next, we need to configure DJL and specify which model we want to use for inference (prediction).<br>
The <code>loadModel</code> method builds a <code>Criteria</code> object to locate the model we want to use. In the Criteria we specified:</p>
<ul>
<li>Engine: which engine the model should be loaded with, <code>PyTorch</code> in our case</li>
<li>Input/output data types: the desired input type (<code>Audio</code> in our example) and output type (a <code>String</code> transcription)</li>
<li>Model URL: where the model is located</li>
<li>Translator: custom pre- and post-processing logic applied by the <code>ZooModel</code></li>
</ul>
<p>We then load the pre-trained model through the <a href="https://javadoc.io/doc/ai.djl/api/latest/ai/djl/repository/zoo/ModelZoo.html?ref=aboullaite.me">ModelZoo</a> directly from a URL for convenience. The model we&apos;ll be using is <a href="https://arxiv.org/abs/2006.11477?ref=aboullaite.me">wav2vec 2.0</a>, a speech model that accepts a float array corresponding to the raw waveform of the speech signal.</p>
<pre><code class="language-java">@Configuration
public class ModelConfiguration {

  private static final Logger LOG = LoggerFactory.getLogger(ModelConfiguration.class);
  
  @Bean
  public ZooModel&lt;Audio, String&gt; loadModel() throws IOException, ModelException, TranslateException {
    // Load model.
    String url = &quot;https://resources.djl.ai/test-models/pytorch/wav2vec2.zip&quot;;
    Criteria&lt;Audio, String&gt; criteria =
        Criteria.builder()
            .setTypes(Audio.class, String.class)
            .optModelUrls(url)
            .optTranslatorFactory(new SpeechRecognitionTranslatorFactory())
            .optModelName(&quot;wav2vec2.ptl&quot;)
            .optEngine(&quot;PyTorch&quot;)
            .build();

    return criteria.loadModel();
  }

  @Bean
  public Supplier&lt;Predictor&lt;Audio, String&gt;&gt; predictorProvider(ZooModel&lt;Audio, String&gt; model) {
    return model::newPredictor;
  }

}
</code></pre>
<p>That&apos;s pretty much all the configuration we need in order to start using our model. The service class simply calls the <code>predictor</code> for inference.</p>
<pre><code class="language-java">  @Resource
  private Supplier&lt;Predictor&lt;Audio, String&gt;&gt; predictorProvider;

  public String predict(InputStream stream) throws IOException, ModelException, TranslateException {
    Audio audio = AudioFactory.newInstance().fromInputStream(stream);

    try (var predictor = predictorProvider.get()) {
      return predictor.predict(audio);
    }
  }
</code></pre>
<p>The rest is pretty much standard Spring Boot configuration.</p>
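<p>For completeness, the HTTP layer can be a single endpoint that hands the uploaded audio to the service above. This is a sketch only: the controller name, route, request field and <code>SpeechService</code> wrapper are illustrative, not taken from the repo.</p>
<pre><code class="language-java">@RestController
public class TranscriptionController {

  // Wraps the predict(InputStream) method shown above
  @Resource
  private SpeechService speechService;

  @PostMapping(&quot;/api/transcribe&quot;)
  public String transcribe(@RequestParam(&quot;audio&quot;) MultipartFile file) throws Exception {
    // Forward the uploaded wav file to the DJL predictor
    return speechService.predict(file.getInputStream());
  }
}
</code></pre>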
<!--kg-card-end: markdown--><p><strong>Frontend Configuration</strong></p><p>The frontend makes use of the amazing <a href="https://github.com/jiwenjiang/react-audio-analyser?ref=aboullaite.me">react-audio-analyser library</a>, offering the possibility to record audio from the browser and convert it to <em>wav</em> format. The rest is pretty much straightforward, only making a REST call to the transcription endpoint and showing the result in the browser.</p>]]></content:encoded></item><item><title><![CDATA[Pixie, the missing developer observability tool!]]></title><description><![CDATA[<p>Needless to say how important monitoring and observability is, especially in a cloud native, distributed world! No system should go to production without having monitoring tools in place.<br>On the other hand, the devops movement and cloud native era introduced a plethora of tools to run, deploy and monitor our</p>]]></description><link>https://aboullaite.me/pixie-observability/</link><guid isPermaLink="false">647342fecda49600011ebf64</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[cloud native]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 28 May 2023 16:27:47 GMT</pubDate><content:encoded><![CDATA[<p>Needless to say how important monitoring and observability is, especially in a cloud native, distributed world! No system should go to production without having monitoring tools in place.<br>On the other hand, the devops movement and cloud native era introduced a plethora of tools to run, deploy and monitor our applications, which drastically increased their complexity. 
</p><p>With the increased number of tools and the complexity of our architectures, developers face an ever-growing challenge to debug their systems, spot bottlenecks, identify hotspots, and improve system performance.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Ready for Takeoff" loading="lazy" width="4928" height="3264" srcset="https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1456824399588-844440089f4b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDR8fHBsYW5lJTIwZGFzaGJvYXJkfGVufDB8fHx8MTY4NTI3NjgzNXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 1200px) 1200px"><figcaption>Photo by <a href="https://unsplash.com/@valeon?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Mitchel Boot</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="enter-pixie">Enter
Pixie!</h2><p>I recently stumbled upon a new CNCF tool called <a href="https://www.cncf.io/projects/pixie/?ref=aboullaite.me">Pixie</a>, an open source observability tool for Kubernetes applications. Pixie was contributed by New Relic to the CNCF in 2021.</p><p>What triggered my interest in Pixie is, unlike other observability tools (at least that I know of), its focus on developers and DX (developer experience). Pixie offers both a high-level overview of the k8s cluster, as well as a drill-down, more granular and detailed view of the health and performance of your app.</p><p>Pixie uses <a href="https://docs.px.dev/about-pixie/pixie-ebpf?ref=aboullaite.me">eBPF</a> to collect metrics and events, without the need for manual instrumentation (code changes, redeploys ...). Pixie also stores and computes telemetry data in-memory within the cluster. Collected data is retained for up to 24h, with the possibility to export it in the <a href="https://opentelemetry.io/?ref=aboullaite.me">OpenTelemetry</a> format to your favorite monitoring tool for long-term retention.</p><p>The heavy-lifting-done-locally approach that Pixie takes comes with tradeoffs nevertheless. It has the advantage of ensuring better security (no data needs to leave your cluster) and scalability. However, the performance overhead is between 2-5% of node CPU usage, as Pixie claims, and at least 1GiB of memory is required per node. </p><h2 id="pixie-in-action">Pixie in action!</h2><p>For the demo, I created a standard k8s cluster in <a href="https://cloud.google.com/kubernetes-engine?ref=aboullaite.me">GKE</a>, since Autopilot mode is still <a href="https://github.com/pixie-io/pixie/issues/278?ref=aboullaite.me">not supported</a> in Pixie. </p><!--kg-card-begin: markdown--><p>Installing Pixie is pretty straightforward, just run:</p>
<pre><code class="language-bash">$ bash -c &quot;$(curl -fsSL https://withpixie.ai/install.sh)&quot;
</code></pre>
<p>A prompt will appear asking you to sign in or register for a Pixie account.<br>
Once authenticated, we can deploy Pixie on our GKE cluster using:</p>
<pre><code class="language-bash">$ px deploy
</code></pre>
<p>This installs, among other things, <a href="https://docs.px.dev/reference/architecture?ref=aboullaite.me#vizier">Vizier</a>, Pixie&apos;s data plane, responsible for collecting and processing data within the cluster being monitored.</p>
<!--kg-card-end: markdown--><p>For convenience, I reused the manifests from <a href="https://github.com/aboullaite/service-mesh/tree/master/1-deploy-app?ref=aboullaite.me">my service mesh demo</a>, based on the <a href="https://github.com/microservices-demo/microservices-demo?ref=aboullaite.me">sock-shop microservices app</a> from Weaveworks.</p><p>Pixie supports three ways of interacting with the platform: the <a href="https://docs.px.dev/using-pixie/using-cli/?ref=aboullaite.me">CLI</a>, the web-based <a href="https://docs.px.dev/using-pixie/using-live-ui?ref=aboullaite.me">Live UI</a> or the <a href="https://docs.px.dev/using-pixie/api-quick-start?ref=aboullaite.me">API</a>. Unsurprisingly, the web UI is the easiest and most intuitive way to interact with Pixie and check your data, especially if you are new to it.</p><p>Once connected to the <a href="https://work.withpixie.ai/auth/login?ref=aboullaite.me" rel="noopener noreferrer">Pixie Console UI</a>, you&apos;d need to select which cluster to interact with and which script to execute. PxL scripts use the <a href="https://docs.px.dev/reference/pxl?ref=aboullaite.me">Pixie Language</a> (PxL), a DSL to query cluster data and transform/visualize metrics. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.00.38.png" class="kg-image" alt loading="lazy" width="1325" height="643" srcset="https://aboullaite.me/content/images/size/w600/2023/05/Screenshot-2023-05-28-at-18.00.38.png 600w, https://aboullaite.me/content/images/size/w1000/2023/05/Screenshot-2023-05-28-at-18.00.38.png 1000w, https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.00.38.png 1325w" sizes="(min-width: 720px) 720px"><figcaption>Pixie dashboard</figcaption></figure><!--kg-card-begin: markdown--><p>The Pixie CLI is as fun to play with as the web UI; it is rich and interactive. 
You can use <code>px help</code> to list all Pixie CLI options, and <code>px scripts list</code> to list all built-in scripts. Below is an image of running the <code>px live px/http_data</code> script, which shows a sample of the HTTP/2 traffic flowing through your cluster. Notice the link at the top that sends you to the web UI, which makes it very convenient to go back and forth.<br>
<img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-28-at-18.11.14.png" alt="Screenshot-2023-05-28-at-18.11.14" loading="lazy"></p>
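<p>To give a flavour of PxL, here is a minimal script (a sketch; the table and column names follow the px documentation and may differ in newer releases) that samples the last five minutes of HTTP events and shows them per pod:</p>
<pre><code class="language-python">import px

# Pull recent HTTP events captured by Pixie's eBPF probes
df = px.DataFrame(table='http_events', start_time='-5m')
df.pod = df.ctx['pod']
df = df[['pod', 'req_path', 'resp_status', 'latency']]
px.display(df)
</code></pre>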
<p>A great example of Pixie usage is application profiling, to detect hotspots and analyse CPU spikes. Pixie&apos;s <code>px/pod</code> script gives an overview of high-level application metrics (latency, errors, throughput ...) and resource utilization for the selected pod. What excited me is the <em>Pod Performance Flamegraph</em> at the end of the page, which is greatly useful for identifying performance issues. Below you can see a CPU spike at the start of the Java orders app while the JVM warms up and the <a href="https://aboullaite.me/understanding-jit-compiler-just-in-time-compiler/">JIT</a> compiler runs, slowly cooling down as compilation finishes.<br>
<img src="https://aboullaite.me/content/images/2023/05/Screenshot-2023-05-27-at-18.55.14.png" alt="Screenshot-2023-05-27-at-18.55.14" loading="lazy"></p>
<!--kg-card-end: markdown--><p>Those are just a few of the many features and options that Pixie offers (which I am still uncovering myself). Head over to <a href="https://docs.px.dev/?ref=aboullaite.me">the documentation page</a> to read more about it!</p>]]></content:encoded></item><item><title><![CDATA[What the CRaC ?!]]></title><description><![CDATA[<p>If you&apos;ve been following the news lately in the Java ecosystem (aside from Java&apos;s 28th anniversary), you should&apos;ve heard of <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">CRaC</a>. Two big announcements were revealed this week:</p><ul><li>Azul announced earlier this week the general availability of and commercial support for <a href="https://www.azul.com/products/components/crac/?ref=aboullaite.me">Azul Zulu Builds of OpenJDK</a></li></ul>]]></description><link>https://aboullaite.me/what-the-crac/</link><guid isPermaLink="false">6467a076cda49600011ebca1</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sat, 20 May 2023 13:34:26 GMT</pubDate><content:encoded><![CDATA[<p>If you&apos;ve been following the news lately in the Java ecosystem (aside from Java&apos;s 28th anniversary), you should&apos;ve heard of <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">CRaC</a>. Two big announcements were revealed this week:</p><ul><li>Azul announced earlier this week the general availability of and commercial support for <a href="https://www.azul.com/products/components/crac/?ref=aboullaite.me">Azul Zulu Builds of OpenJDK for Java 17 including CRaC</a> functionality.</li><li>The next release of the Spring framework, 6.1, will add support for CRaC. 
</li></ul><p>If you are wondering what CRaC is all about, I got you covered, read on :)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="Flying through the water!" loading="lazy" width="3072" height="2048" srcset="https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1482614312710-79c1d29bda2a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDk5fHxzcGVlZHxlbnwwfHx8fDE2ODQ1MTM3MjR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@joshcala?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Josh Calabrese</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="what-is-crac">What is CRaC?!</h2><h3 id="explain-it-like-i-am-6-years-old-sort-of">Explain it like I am 6 years old (Sort of!)</h3><p>In a world where streaming services are omnipresent, we can stop watching a video whenever we want, and we
expect to resume from (almost) the same position where we left off, even on another device. Imagine if we could apply the same analogy to our running applications: take a snapshot (pause) of the running state, and restore (resume) it on another server.</p><h3 id="in-more-technical-terms">In more technical terms </h3><p>CRaC stands for Coordinated Restore at Checkpoint. It is an <a href="https://openjdk.org/projects/crac/?ref=aboullaite.me">OpenJDK project</a>, developed by Azul Systems, with the aim of speeding up JVM startup time by capturing/freezing its running state after all the heavy lifting is performed (loading classes, JIT compilation, code optimizations...), serializing that state to disk (Checkpoint), and resuming it later from that state (Restore) to run exactly as it was at the time of the freeze.</p><p>CRaC uses <a href="https://criu.org/?ref=aboullaite.me">CRIU</a> technology under the hood to perform its magic. CRIU is a C library facilitating the implementation of checkpoint/restore functionalities for Linux, and the maintainers <a href="https://criu.org/Comparison_to_other_CR_projects?ref=aboullaite.me">claim</a> it is the most feature-rich and up-to-date with the kernel for implementing CR in Linux.</p><!--kg-card-begin: markdown--><p>CRIU is also the technology behind the <code>docker checkpoint</code> <a href="https://docs.docker.com/engine/reference/commandline/checkpoint/?ref=aboullaite.me">experimental command</a>, allowing you to take serializable snapshots of a running container and recreate them later (even on another host). Podman has a <a href="https://docs.podman.io/en/latest/markdown/podman-container-checkpoint.1.html?ref=aboullaite.me">similar feature</a> with <code>podman container checkpoint</code>. Similarly, CRIU has support for <a href="https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/?ref=aboullaite.me">Kubernetes</a> and <a href="https://criu.org/LXC?ref=aboullaite.me">LXC/LXD</a> as well.</p>
<p>In the Java space, CRIU is also used in <a href="https://www.eclipse.org/openj9/?ref=aboullaite.me">OpenJ9</a> to improve JVM startup time, and empower <a href="https://openliberty.io/blog/2022/09/29/instant-on-beta.html?ref=aboullaite.me">InstantOn</a> Project from <a href="https://openliberty.io/?ref=aboullaite.me">Open Liberty</a></p>
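<p>To make the container analogy concrete, a <code>docker checkpoint</code> session looks like the following (a sketch: it requires a Linux host with CRIU installed and the Docker experimental flag enabled, and the container name is illustrative):</p>
<pre><code class="language-bash">$ docker run -d --name counter busybox sh -c 'i=0; while true; do echo $i; i=$((i+1)); sleep 1; done'
$ docker checkpoint create counter cp1
$ docker start --checkpoint cp1 counter
</code></pre>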
<!--kg-card-end: markdown--><h2 id="showtime">Showtime</h2><p>CRaC is only available on Linux, so in order to run this demo you&apos;d need a Linux machine. I tried to use Docker on Mac but had little success and stumbled upon many issues.</p><p>I am also using this simple Spring Boot <a href="https://github.com/sdeleuze/spring-boot-crac-demo?ref=aboullaite.me">code</a> showcasing the upcoming support for CRaC in Spring framework 6.1. Kudos to the Spring team and <a href="https://twitter.com/sdeleuze?ref=aboullaite.me" rel="nofollow me">@sdeleuze</a> for the amazing work.</p><!--kg-card-begin: markdown--><p>First, you&apos;d need to install the recently available Zulu JDK with CRaC support. You can either install it manually or via sdkman:</p>
<pre><code class="language-bash">$ sdk install java 17.0.7.crac-zulu
</code></pre>
<p>Next, we need to build the project by running the command below. This assumes you have already cloned the project and that it is your current directory.</p>
<pre><code class="language-bash">$ mvn clean verify
</code></pre>
<p>Once it finishes building, we can run our app using</p>
<pre><code class="language-bash">$ java -XX:CRaCCheckpointTo=./crac-files -jar target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
</code></pre>
<p>Notice anything new? The Java argument <code>-XX:CRaCCheckpointTo=path</code> tells the JVM to enable CRaC and defines the path where the checkpoint image will be stored.<br>
If everything goes as expected, the app should be running after a few seconds. Make sure to hit it with a few requests to warm up the application:</p>
<pre><code class="language-bash">$ curl localhost:8080
Greetings from Spring Boot!
</code></pre>
<p>Now leave your app running (or run it in the background), and in another terminal, use the <code>jcmd</code> command to trigger a checkpoint:</p>
<pre><code class="language-bash">$ jcmd target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar JDK.checkpoint
201931:
CR: Checkpoint ...
</code></pre>
<p><code>201931</code> is the PID of our running Spring Boot app, which should now be stopped, as indicated in the logs:</p>
<pre><code class="language-log">2023-05-20T12:02:06.610Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping Spring-managed lifecycle beans before JVM checkpoint
2023-05-20T12:02:06.615Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 2147482623
2023-05-20T12:02:06.617Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;webServerGracefulShutdown&apos; completed its stop procedure
2023-05-20T12:02:06.617Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase 2147481599
2023-05-20T12:02:06.618Z  INFO 201931 --- [Attach Listener] org.eclipse.jetty.server.Server          : Stopped Server@53f3bdbd{STOPPING}[11.0.15,sto=0]
2023-05-20T12:02:06.624Z  INFO 201931 --- [Attach Listener] o.e.jetty.server.AbstractConnector       : Stopped ServerConnector@1a4927d6{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2023-05-20T12:02:06.629Z  INFO 201931 --- [Attach Listener] o.e.j.s.h.ContextHandler.application     : Destroying Spring FrameworkServlet &apos;dispatcherServlet&apos;
2023-05-20T12:02:06.632Z  INFO 201931 --- [Attach Listener] o.e.jetty.server.handler.ContextHandler  : Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@35399441{application,/,[file:///tmp/jetty-docbase.8080.3095195653033098747/],STOPPED}
2023-05-20T12:02:06.638Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;webServerStartStop&apos; completed its stop procedure
2023-05-20T12:02:06.639Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Stopping beans in phase -2147483647
2023-05-20T12:02:06.640Z DEBUG 201931 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Bean &apos;springBootLoggingLifecycle&apos; completed its stop procedure
Killed
</code></pre>
<p>Inspecting your files now, you should see the <code>crac-files</code> folder created with different <code>.img</code> files. Those are all the images generated by the checkpoint operation, and they can be inspected using the <a href="https://criu.org/CRIT?ref=aboullaite.me">crit</a> tool. If you are using Ubuntu, you can install the <code>crit</code> command-line tool as part of the <code>criu</code> package using <code>apt-get install criu</code>.</p>
<p><code>crit</code> is pretty handy for checking the contents of the images folder. We can, for example, check which process we checkpointed:</p>
<pre><code class="language-bash">$ crit x crac-files ps
    PID   PGID    SID   COMM
 201931 201931 201381   java
</code></pre>
<p>We can inspect checkpoint files descriptors:</p>
<pre><code class="language-bash">$ crit x crac-files fds
          201931
	      0: TTY.36
	      1: TTY.36
	      2: TTY.36
	      3: /root/.sdkman/candidates/java/17.0.7.crac-zulu/lib/modules
	      4: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      5: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      6: /home/maboullaite/spring-boot-crac-demo/target/spring-boot-crac-demo-1.0.0-SNAPSHOT.jar
	      7: /home/maboullaite/spring-boot-crac-demo/crac-files/perfdata
	      8: /dev/random
	      9: /dev/urandom
	    cwd: /home/maboullaite/spring-boot-crac-demo
	   root: /
</code></pre>
<p>We can even dump the contents of a single image using <code>crit show</code>:</p>
<pre><code class="language-bash">$ crit show crac-files/core-201931.img
{
    &quot;magic&quot;: &quot;CORE&quot;,
    &quot;entries&quot;: [
        {
            &quot;mtype&quot;: &quot;X86_64&quot;,
            &quot;thread_info&quot;: {
                &quot;clear_tid_addr&quot;: &quot;0x7f19b70d5550&quot;,
                &quot;gpregs&quot;: {
                ...}
         &quot;thread_core&quot;: {
                &quot;futex_rla&quot;: 139748422014304,
                &quot;futex_rla_len&quot;: 24,
                &quot;sched_nice&quot;: 0,
                &quot;sched_policy&quot;: 0,
                &quot;sas&quot;: {
                    &quot;ss_sp&quot;: 0,
                    &quot;ss_size&quot;: 0,
                    &quot;ss_flags&quot;: 2
                },
                &quot;signals_p&quot;: {},
                &quot;creds&quot;: {
                    &quot;uid&quot;: 0,
                    &quot;gid&quot;: 0,
                    &quot;euid&quot;: 0,
                    &quot;egid&quot;: 0,
                    &quot;suid&quot;: 0,
                    &quot;sgid&quot;: 0,
                    &quot;fsuid&quot;: 0,
                    &quot;fsgid&quot;: 0,
                    &quot;cap_inh&quot;: [
                        0,
                        0
                    ],
                    &quot;cap_prm&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;cap_eff&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;cap_bnd&quot;: [
                        4294967295,
                        511
                    ],
                    &quot;secbits&quot;: 0,
                    &quot;groups&quot;: [
                        0
                    ]
                },
                &quot;comm&quot;: &quot;java&quot;
            }
        }
    ]
}
</code></pre>
<p>The <code>crac-files</code> directory also contains log files, which are pretty handy when debugging issues.<br>
To restore our image and run the app from its saved state, we simply run:</p>
<pre><code class="language-bash">$ java -XX:CRaCRestoreFrom=./crac-files
</code></pre>
<p>This results in a lightning-fast start compared to the initial run.<br>
<img src="https://aboullaite.me/content/images/2023/05/IMG_AB599F58F78D-1.jpeg" alt="IMG_AB599F58F78D-1" loading="lazy"></p>
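<p>The &quot;Coordinated&quot; part of CRaC is worth a mention here: applications can participate in the checkpoint through the <code>org.crac</code> API, typically to close sockets or files before the snapshot and reopen them on restore. A minimal sketch (assuming the <code>org.crac</code> dependency is on the classpath; the class name and method bodies are illustrative):</p>
<pre><code class="language-java">import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class ConnectionLifecycle implements Resource {

    @Override
    public void beforeCheckpoint(Context&lt;? extends Resource&gt; context) throws Exception {
        // Close live connections so no open socket ends up in the image
        closeConnections();
    }

    @Override
    public void afterRestore(Context&lt;? extends Resource&gt; context) throws Exception {
        // Re-establish connections once the process is restored
        openConnections();
    }

    private void closeConnections() { /* illustrative */ }

    private void openConnections() { /* illustrative */ }

    public void register() {
        // The global context holds a weak reference, so keep this object reachable
        Core.getGlobalContext().register(this);
    }
}
</code></pre>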
<!--kg-card-end: markdown--><h2 id="but-what-about-aot-and-graal-native-images">But what about AoT and Graal Native images?</h2><p>Well, it is quite different. While native images achieve very fast startup times and a very small memory footprint, they aren&apos;t the cure to all problems. Native image generation requires that each class you need at runtime be made available at build time for the compilation to succeed, which might represent some challenges for Java developers. Debugging is another aspect where native images fall short.<br><br>CRaC (and similar tools) allows us to keep the JVM capabilities we&apos;re familiar with while benefiting from the fast startup needed for many cloud-native workloads. On the other hand, as <a href="https://twitter.com/thomaswue?s=21&amp;t=oF9ZqYERUY0-bcmBvB8dNA&amp;ref=aboullaite.me">Thomas</a> brought to my attention, the size of the snapshot is orders of magnitude bigger than the size of the native image.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Maybe you can also mention the size of the snapshot as another disadvantage compared to native image. Furthermore, in scenarios like serverless, there is no possibility to debug with regular Java mechanisms in production on many systems anyway.</p>&#x2014; Thomas Wuerthinger (thomaswue.dev) &#x1F499; (@thomaswue) <a href="https://twitter.com/thomaswue/status/1660030978734145537?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">May 20, 2023</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</figure><h2 id="final-thoughts">Final Thoughts</h2><p>The general availability of CRaC would help boost the adoption of CR technology in the Java space, making the Java language even more modern and more suitable for the cloud-native world. Exciting times!<br>Finally, it is worth mentioning that CR technology is not new; <a href="https://www.linuxplumbersconf.org/event/2/contributions/69/attachments/205/374/Task_Migration_at_Scale_Using_CRIU_-_LPC_2018.pdf?ref=aboullaite.me">Google uses it</a> to migrate batch jobs in Borg.</p><h3 id="ressources">Resources</h3><ul><li><a href="https://blog.openj9.org/2022/09/26/fast-jvm-startup-with-openj9-criu-support/?ref=aboullaite.me">https://blog.openj9.org/2022/09/26/fast-jvm-startup-with-openj9-criu-support/</a></li><li><a href="https://github.com/CRaC/docs/blob/master/STEP-BY-STEP.md?ref=aboullaite.me">https://github.com/CRaC/docs/blob/master/STEP-BY-STEP.md</a></li><li><a href="https://www.youtube.com/watch?v=bWmuqh6wHgE&amp;ref=aboullaite.me">https://www.youtube.com/watch?v=bWmuqh6wHgE</a></li></ul>]]></content:encoded></item><item><title><![CDATA[My home office setup!]]></title><description><![CDATA[<p>Hello dear reader &#x1F44B;<br>Let me set some context first before diving into how I set up my home office. I am a software engineer, a meticulous one, you could say! I am sharing my own setup because many friends asked me to do so (and I</p>]]></description><link>https://aboullaite.me/my-home-office-setup/</link><guid isPermaLink="false">6460c00ecda49600011ebb34</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Tue, 31 Aug 2021 13:50:16 GMT</pubDate><content:encoded><![CDATA[<p>Hello dear reader &#x1F44B;<br>Let me set some context first before diving into how I set up my home office. I am a software engineer, a meticulous one, you could say! 
I am sharing my own setup because many friends asked me to do so (and I truly thank them for the kind words and encouragement). The items you&apos;re going to see below are my own preferences, based on my own research. So don&apos;t take everything you read here as a recommendation; do your own research and comparisons &#x270C;&#xFE0F; </p><p>With that being said, let&apos;s get back to business! I relocated to Stockholm a couple of months ago, as I joined Spotify. Needless to say, remote/home work is nowadays the new normal. Worse, it doesn&apos;t look like it&apos;s going to change anytime soon. At first, building my home workspace wasn&apos;t really a priority, but as I started to have more back pain, a drop in productivity and a bit of anxiety, it became a necessity.</p><p>After 2 months of gathering gadgets and items from here and there, here&apos;s the final look:</p><figure class="kg-card kg-image-card"><img src="https://aboullaite.me/content/images/2021/08/IMG_1212.png" class="kg-image" alt loading="lazy"></figure><h2 id="list-of-equipment-and-details">List of Equipment and details</h2><ul><li><a href="https://www.dell.com/en-us/work/shop/dell-ultrasharp-34-curved-usb-c-hub-monitor-u3421we/apd/210-axqs/monitors-monitor-accessories?ref=aboullaite.me">Dell UltraSharp 34 Curved USB-C Hub Monitor (U3421WE)</a>: I like the &quot;hub&quot; aspect of this monitor. It&apos;s actually very useful, especially for a MacBook Pro user like me. Everything plugs into the monitor itself&#x2014;all your USB gear (mouse, keyboard, backup drive, microphone, whatever you&apos;ve got)&#x2014;even ethernet! 
And then ONE cable connects to your laptop, and even charges it.</li><li><a href="https://www.amazon.com/Aluminum-Wireless-Charging-Transfer-Computer/dp/B089VY18WH?th=1&amp;ref=aboullaite.me">Vaydeer 2 Tiers Aluminum Monitor Stand with Wireless Charging</a>: What I like about this stand is the modern look, solidity and compactness. It also has a middle tray for notes and other stuff, and the best part is the four-port USB hub plus wireless phone charging!</li><li><a href="https://www.raindesigninc.com/mstand.html?ref=aboullaite.me">Rain Design mStand Laptop Stand</a>: Very solid. I used it with both the 15&quot; and 13&quot; MacBook Pro. It fits really well, and its aluminum material is a nice match.</li><li><a href="https://www.amazon.com/Neewer-Ring-Light-Kit-Self-Portrait/dp/B01LXDNNBW?ref=aboullaite.me">Neewer Ring Light Kit: 18&quot;/48cm Outer 55W 5500K Dimmable LED Ring Light</a>: This one is very popular among beauty bloggers &#x1F605; It&apos;s affordable compared to other alternatives, with a (very) powerful light. Easy to set up and use.</li><li><a href="https://en.yeelight.com/product/1512.html?ref=aboullaite.me">Yeelight LED Screen Light Bar Pro</a>: This one surprised me; I wasn&apos;t expecting it to be that good. It comes with a nice round remote controller, or you can use the Yeelight app for more light themes.</li><li>I put more lighting under my desk using a <a href="https://eu.govee.com/products/rgbic-smart-led-strip-lights?utm_campaign=govee&amp;utm_source=google&amp;utm_medium=cpc&amp;gclid=Cj0KCQjwpreJBhDvARIsAF1_BU2Nj6sWlvCL99rzp1yjBRI_PCJ-yfMcoOW2miY-5oS09U_diovh1W0aApEKEALw_wcB">Govee WiFi LED Strip</a>. 
It even has a music mode, and it&apos;s CRAZY!</li><li><a href="https://www.amazon.com/Tablet-Stand-Adjustable-Lamicall-Reader/dp/B01DBV1OKY?ref=aboullaite.me">Lamicall Tablet Stand</a> for iPad Pro: Robust, feels nice and does its job very well!</li><li><a href="https://www.amazon.com/Sony-Noise-Cancelling-Headphones-WH1000XM3/dp/B07G4MNFS1?ref=aboullaite.me">Sony WH-1000XM3 Noise Cancelling Wireless Headphones</a>: Although I usually use AirPods Pro for almost all my meetings, I have to admit I like these headphones a lot. The noise canceling is amazing. The sound quality is excellent, and with the Headphones app you can change the bass and other EQ settings.</li><li><a href="https://www.amazon.com/Headphone-New-Earphone-Supporting-Headphones/dp/B01GJQ7N94?ref=aboullaite.me">New Bee Headphones Stand</a>: It looks nice, fits well and does the job!</li><li><a href="https://www.logitech.com/en-us/products/webcams/c925e-business-webcam.960-001075.html?ref=aboullaite.me">Logitech C925e webcam</a>: Nothing much to say here; it&apos;s a webcam. Not very happy with it though, I may change it later tbh.</li><li>Large <a href="https://www.amazon.com/AmazonBasics-Large-Extended-Gaming-Computer/dp/B06X19FLTC?ref=aboullaite.me">Amazon Basics</a> mouse pad. Not fancy, but quite practical.</li><li>Apple MacBook Pro 13&quot;, AirPods Pro, iPad Pro 12.5&quot;, Magic Mouse and Magic Keyboard! 
You can call me an Apple fanboy &#x1F34E;</li><li>Herman Miller <a href="https://www.hermanmiller.com/en_lac/products/seating/office-chairs/aeron-chairs/?ref=aboullaite.me">chair</a> (Aeron) and <a href="https://www.hermanmiller.com/en_lac/products/tables/sit-to-stand-tables/nevi-sit-to-stand-tables/?ref=aboullaite.me">adjustable desk</a> (Nevi desk).</li></ul>]]></content:encoded></item><item><title><![CDATA[Building Native Covid19 Tracker CLI using Java, PicoCLI &amp; GraalVM]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When it comes to building CLI apps, Java is not the first choice that comes to mind (not even among the top three). However, one of the amazing things about Java is its ecosystem and vibrant community, which means you can find tools and libraries for (nearly) everything.</p>
<p><a href="https://golang.org/?ref=aboullaite.me">Golang</a></p>]]></description><link>https://aboullaite.me/java-covid19-cli-picocli-graalvm/</link><guid isPermaLink="false">6460c00ecda49600011ebb33</guid><category><![CDATA[Java]]></category><category><![CDATA[GraalVM]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Mon, 11 May 2020 22:40:09 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>When it comes to building CLI apps, Java is not the first choice that comes to mind (not even among the top three). However, one of the amazing things about Java is its ecosystem and vibrant community, which means you can find tools and libraries for (nearly) everything.</p>
<p><a href="https://golang.org/?ref=aboullaite.me">Golang</a> particularly excels in this area for several reasons, but one aspect where Go shines is its ability to compile a program into a single, small, native executable that runs fast and is much easier to distribute. Java apps, however, have traditionally been hard to distribute since they require the JVM to be installed on the target machine.</p>
<p>In this post, I describe my experience building a small CLI app to track covid19, using <a href="https://picocli.info/?ref=aboullaite.me">picocli</a> and turning it into a lightweight, standalone binary that is easy to use and distribute, using <a href="https://www.graalvm.org/?ref=aboullaite.me">Graal VM</a>.</p>
<p>The complete source code for this application can be found in <a href="https://github.com/aboullaite/covid-19-picocli?ref=aboullaite.me">this Github repo</a>.</p>
<p><a href="https://asciinema.org/a/GZgh2sqHtTab8j6NXRGGnplnD?ref=aboullaite.me" target="_blank"><img src="https://asciinema.org/a/GZgh2sqHtTab8j6NXRGGnplnD.svg"></a></p>
<h3 id="picocli">PicoCLI</h3>
<p><a href="https://picocli.info/?ref=aboullaite.me">Picocli</a> is a modern library for building command line applications on the JVM.<br>
Picocli aims to be <em>the easiest way to create rich command line applications that can run on and off the JVM</em>. It offers <a href="https://picocli.info/?ref=aboullaite.me#_ansi_colors_and_styles">colored output</a>, <a href="https://picocli.info/autocomplete.html?ref=aboullaite.me">TAB autocompletion</a> and nested subcommands, and comes with a couple of great features compared to other JVM CLI libraries, such as <a href="https://picocli.info/?ref=aboullaite.me#_negatable_options">negatable options</a>, <a href="https://picocli.info/?ref=aboullaite.me#_repeating_composite_argument_groups">repeating composite argument groups</a>, <a href="https://picocli.info/?ref=aboullaite.me#_repeatable_subcommands">repeatable subcommands</a> and <a href="https://picocli.info/?ref=aboullaite.me#_custom_parameter_processing">custom parameter processing</a>.</p>
<p>Picocli-based applications can also easily be integrated with dependency injection containers. Picocli ships with a <a href="https://github.com/remkop/picocli/tree/master/picocli-spring-boot-starter?ref=aboullaite.me"><code>picocli-spring-boot-starter</code> module</a> that includes a <code>PicocliSpringFactory</code> and Spring Boot auto-configuration to use Spring dependency injection in your picocli command line application.</p>
<p>The <a href="https://micronaut.io/?ref=aboullaite.me">Micronaut</a> microservices framework has <a href="https://docs.micronaut.io/latest/guide/index.html?ref=aboullaite.me#commandLineApps">built-in support</a> for picocli.</p>
<h3 id="covid19trackerapp">Covid-19 Tracker app</h3>
<h4 id="covid19data">Covid-19 Data</h4>
<p>The CLI app gets data from the <a href="https://corona.lmao.ninja/?ref=aboullaite.me">Novel COVID API</a>, a free and easy-to-use API that gathers data from multiple sources (Johns Hopkins University, the New York Times, Worldometers and Apple reports).</p>
<h4 id="dependencies">Dependencies</h4>
<p>There are a couple of libraries that I used to build this app. First and foremost, <code>picocli</code> as the heart of the CLI app. I opted for <a href="https://eclipse-ee4j.github.io/jersey/?ref=aboullaite.me">Jersey Client</a> to handle HTTP communication with the REST server and collect data, as well as <a href="https://github.com/FasterXML/jackson?ref=aboullaite.me">Jackson</a>, the well-known Java JSON library.</p>
<p>The hard part was finding ASCII-based tables and graphs, and honestly my choices were very limited. I ended up using <a href="https://github.com/freva/ascii-table?ref=aboullaite.me">ascii-table</a> to create and customize ASCII tables and <a href="https://github.com/MitchTalmadge/ASCII-Data?ref=aboullaite.me">ascii-data</a> to generate some nice-looking text-based line graphs.</p>
<p>This is what my pom file&apos;s dependencies section contains:</p>
<pre><code class="language-xml">...
    &lt;dependency&gt;
        &lt;groupId&gt;info.picocli&lt;/groupId&gt;
        &lt;artifactId&gt;picocli&lt;/artifactId&gt;
        &lt;version&gt;4.2.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.core&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-client&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.media&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-media-json-jackson&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.github.freva&lt;/groupId&gt;
        &lt;artifactId&gt;ascii-table&lt;/artifactId&gt;
        &lt;version&gt;1.1.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.mitchtalmadge&lt;/groupId&gt;
        &lt;artifactId&gt;ascii-data&lt;/artifactId&gt;
        &lt;version&gt;1.4.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.glassfish.jersey.inject&lt;/groupId&gt;
        &lt;artifactId&gt;jersey-hk2&lt;/artifactId&gt;
        &lt;version&gt;2.30.1&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<h4 id="showmethecode">Show me the code</h4>
<p>Now that I have everything the app needs, let&apos;s have a look at the code. Below is the main class:</p>
<pre><code class="language-java">@Command(description = &quot;Track covid-19 from your command line&quot;,
        name = &quot;cov19&quot;, mixinStandardHelpOptions = true, version = &quot;cov19 1.0&quot;)
public class Covid19Cli implements Callable&lt;Integer&gt; {
    @Option(names = {&quot;-c&quot;, &quot;--country&quot;}, description = &quot;Country to display data for&quot;, defaultValue = &quot;all&quot;)
    String country;
    @Option(names = {&quot;-g&quot;, &quot;--graph&quot;}, description = &quot;show data as graph history of last 30 days&quot;)
    boolean graph;
    @Option(names = {&quot;-a&quot;, &quot;--all&quot;}, description = &quot;show data for all affected countries&quot;)
    boolean all;

    CovidAPI covidAPI = new CovidAPI();

    public static void main(String[] args) {
        int exitCode = new CommandLine(new Covid19Cli()).execute(args);
        System.exit(exitCode);
    }

    public Integer call() throws Exception {
        if (this.all &amp;&amp; !this.country.equals(&quot;all&quot;)){
            System.out.println(Ansi.AUTO.string(&quot;@|bold,red, ****** Cannot combine global (`-a`) and country (`-c`) options ****** |@\n&quot;));
            return 1;
        }

        this.colorise(this.country);
        if(this.graph){
            PrintUtils.printGrapgh(covidAPI.history(this.country));
            return 0;
        }
        if (this.all){
            PrintUtils.printCountryStatTable(covidAPI.allCountryStats());
            return 0;
        }
        if(this.country.equals(&quot;all&quot;)) {
            PrintUtils.printGlobalTable(Arrays.asList(covidAPI.globalStats()));
            return 0;
        }
        PrintUtils.printCountryStatTable(Arrays.asList(covidAPI.countryStats(this.country)));
        return 0;
    }
}
</code></pre>
<p>A couple of interesting things here:</p>
<ul>
<li>The <code>@Command</code> annotation from picocli enables us to define the general information about the command.</li>
<li>The <code>mixinStandardHelpOptions</code> attribute magically adds <code>--help</code> and <code>--version</code> flags to the CLI.</li>
<li>The class implements <code>Callable&lt;Integer&gt;</code>, as <code>picocli</code> needs a predictable way of executing the command, parsing parameters and options, and returning an exit code.</li>
<li>The <code>execute</code> method shows the usage help or version information if requested by the user.</li>
<li>Invalid user input will result in a helpful error message. If the user input was valid, the business logic, present in <code>call</code> method, is invoked.</li>
<li>Finally, the <code>execute</code> method returns an exit status code that can be used to call <code>System.exit</code> if desired. By default, the <code>execute</code> method returns <code>CommandLine.ExitCode.OK (0)</code> on success, <code>CommandLine.ExitCode.SOFTWARE (1)</code> when an exception occurred in the Runnable, Callable or command method, and <code>CommandLine.ExitCode.USAGE (2)</code> for invalid input.</li>
<li>The fields of the class are annotated with <code>@Option</code> to declare what options the application expects. Picocli initializes these fields based on the command line arguments, which commonly start with <code>-</code> or <code>--</code>.</li>
<li>Note that options can have one name or several; here each option has a short (<code>-c</code>) and a long (<code>--country</code>) form.</li>
<li>Options can have default values using the <code>defaultValue</code> annotation attribute.</li>
</ul>
<h4 id="buildingandtestingtheapp">Building and testing the app</h4>
<p>As with any Maven-based Java app, we run <code>mvn clean package</code> to compile the app and generate the <code>jar</code> file. I used the <a href="https://maven.apache.org/plugins/maven-shade-plugin/index.html?ref=aboullaite.me"><code>maven shade plugin</code></a> to package the artifact as an uber-jar (including all its dependencies).<br>
Now, we can verify that our CLI is working using:</p>
<pre><code class="language-bash">$ java -jar covid-java-cli-1.0-SNAPSHOT.jar --help                                                                                                   

Usage: cov19 [-aghV] [-c=&lt;country&gt;]
Track covid-19 from your command line
  -a, --all                 show data for all affected countries
  -c, --country=&lt;country&gt;   Country to display data for
  -g, --graph               show data as graph history of last 30 days
  -h, --help                Show this help message and exit.
  -V, --version             Print version information and exit.
</code></pre>
<p>So far, the application is working, but it doesn&apos;t feel much like an actual CLI. Ideally, we should aim for a more native experience and simply run <code>./mycli</code> instead of calling <code>java -jar</code> each time!<br>
This is what we will try to accomplish in the next section with GraalVM.</p>
<h3 id="graalvmbuildinganativeimage">GraalVM, Building a native image</h3>
<p>This was the hardest part of working on this app, for the simple reason that the GraalVM native image compiler&apos;s support for reflection is <a href="https://github.com/oracle/graal/blob/master/substratevm/CONFIGURE.md?ref=aboullaite.me">partial and requires additional configuration</a>.<br>
This impacts my application in two ways:</p>
<ul>
<li>Picocli uses reflection to discover classes and methods annotated with <code>@Command</code>, and fields, methods or method parameters annotated with <code>@Option</code>.</li>
<li>Jersey Client uses reflection as well.</li>
</ul>
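<p>To make the problem concrete, here is a small, self-contained sketch (my own illustration, not picocli&apos;s actual code) of the kind of reflective annotation lookup picocli performs at runtime; the annotation and command class below are hypothetical. A closed-world, ahead-of-time compiler cannot discover these accesses through static analysis alone, which is why the metadata must be declared in configuration files:</p>

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class ReflectionDemo {
    // Hypothetical stand-in for picocli's @Option annotation
    @Retention(RetentionPolicy.RUNTIME)
    @interface Option { String name(); }

    // Hypothetical command class, in the style of the main class above
    static class GreetCommand {
        @Option(name = "--name")
        String name = "world";
    }

    public static void main(String[] args) throws Exception {
        Object cmd = new GreetCommand();
        // Walk the annotated fields at runtime, as picocli does for
        // @Option; native-image must retain this field and annotation
        // metadata in the binary for the loop to find anything:
        for (Field f : cmd.getClass().getDeclaredFields()) {
            Option opt = f.getAnnotation(Option.class);
            if (opt != null) {
                System.out.println(opt.name() + " -> " + f.get(cmd));
            }
        }
    }
}
```

<p>Without reflection configuration, a native image may simply drop the annotation metadata, and a loop like this would silently find nothing.</p>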
<p>Picocli includes a <a href="https://github.com/remkop/picocli/tree/master/picocli-codegen?ref=aboullaite.me"><code>picocli-codegen</code> module</a>, which contains an annotation processor that generates GraalVM configuration files at compile time rather than at runtime. So the first issue was easy to fix by adding the config below to my <code>pom.xml</code> file:</p>
<pre><code class="language-xml">             &lt;plugin&gt;
                &lt;groupId&gt;org.apache.maven.plugins&lt;/groupId&gt;
                &lt;artifactId&gt;maven-compiler-plugin&lt;/artifactId&gt;
                &lt;version&gt;3.8.1&lt;/version&gt;
                &lt;configuration&gt;
                    &lt;annotationProcessorPaths&gt;
                        &lt;path&gt;
                            &lt;groupId&gt;info.picocli&lt;/groupId&gt;
                            &lt;artifactId&gt;picocli-codegen&lt;/artifactId&gt;
                            &lt;version&gt;4.2.0&lt;/version&gt;
                        &lt;/path&gt;
                    &lt;/annotationProcessorPaths&gt;
                &lt;/configuration&gt;
            &lt;/plugin&gt;
</code></pre>
<p>It generates configuration files for reflection, resources and dynamic proxies:</p>
<pre><code class="language-text">target
&#x251C;&#x2500;&#x2500; classes
&#x2502;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; META-INF
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; native-image
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;     &#x2514;&#x2500;&#x2500; picocli-generated
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x251C;&#x2500;&#x2500; proxy-config.json
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x251C;&#x2500;&#x2500; reflect-config.json
&#x2502;&#xA0;&#xA0; &#x2502;&#xA0;&#xA0;         &#x2514;&#x2500;&#x2500; resource-config.json

</code></pre>
<p>As for Jersey, I had to do some testing and debugging to generate the <code>reflection.json</code> that makes our application Graal-enabled! Below is a snippet from it:</p>
<pre><code class="language-json">...
  {
    &quot;name&quot; : &quot;org.glassfish.jersey.internal.config.ExternalPropertiesConfigurationFeature&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
  {
    &quot;name&quot; : &quot;org.glassfish.jersey.message.internal.MessageBodyFactory&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
  {
    &quot;name&quot; : &quot;com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector&quot;,
    &quot;allDeclaredConstructors&quot;: true,
    &quot;allPublicConstructors&quot;: true,
    &quot;allDeclaredFields&quot;: true,
    &quot;allPublicFields&quot;: true,
    &quot;allDeclaredMethods&quot;: true,
    &quot;allPublicMethods&quot;: true
  },
</code></pre>
<p>Now that our reflection config is in place, we are pretty much done with the application. The next natural step is to compile it ahead of time and generate the native binary.</p>
<p>First off, we need to install the GraalVM <a href="https://www.graalvm.org/docs/reference-manual/aot-compilation/?ref=aboullaite.me">native-image tool</a> and call it manually. However, recent GraalVM releases added the possibility to build native images right out of Maven, without running the <code>native-image</code> tool as a separate step after building the <code>uber-jar</code>. For the plugin to run, it expects <code>JAVA_HOME</code> to point to a GraalVM installation; it will not work otherwise.</p>
<pre><code class="language-xml">          &lt;plugin&gt;
                &lt;groupId&gt;org.graalvm.nativeimage&lt;/groupId&gt;
                &lt;artifactId&gt;native-image-maven-plugin&lt;/artifactId&gt;
                &lt;version&gt;20.0.0&lt;/version&gt;
                &lt;configuration&gt;
                    &lt;mainClass&gt;me.aboullaite.Covid19Cli&lt;/mainClass&gt;
                    &lt;imageName&gt;cov19-cli&lt;/imageName&gt;
                    &lt;buildArgs&gt;
                        --no-fallback
                        --report-unsupported-elements-at-runtime
                        --allow-incomplete-classpath
                        -H:ReflectionConfigurationFiles=classes/reflection.json
                        -H:+ReportExceptionStackTraces
                        -H:EnableURLProtocols=https
                    &lt;/buildArgs&gt;
                    &lt;skip&gt;false&lt;/skip&gt;
                &lt;/configuration&gt;
                &lt;executions&gt;
                    &lt;execution&gt;
                        &lt;goals&gt;
                            &lt;goal&gt;native-image&lt;/goal&gt;
                        &lt;/goals&gt;
                        &lt;phase&gt;verify&lt;/phase&gt;
                    &lt;/execution&gt;
                &lt;/executions&gt;
            &lt;/plugin&gt;
</code></pre>
<p>Everything is ready. Now we can generate a native image by running <code>mvn clean verify</code>, which will trigger native image compilation. The process will take about a minute to complete.<br>
At the end, we have a native executable under <code>target/cov19-cli</code>.</p>
<pre><code class="language-bash">$ ./target/cov19-cli --help 
                                                                                                                               
Usage: cov19 [-aghV] [-c=&lt;country&gt;]
Track covid-19 from your command line
  -a, --all                 show data for all affected countries
  -c, --country=&lt;country&gt;   Country to display data for
  -g, --graph               show data as graph history of last 30 days
  -h, --help                Show this help message and exit.
  -V, --version             Print version information and exit.

</code></pre>
<h4 id="comparingstartuptime">Comparing startup time</h4>
<p>I couldn&apos;t resist the thought of comparing the startup time of the application on a normal JIT-based JVM to that of the native image. Below are the results I got on my machine:</p>
<pre><code class="language-bash">$ gtime -p java -jar covid-java-cli-1.0-SNAPSHOT.jar --help                                                                                        
real 0.32
user 0.67
sys 0.09
                                                                          
$ gtime -p ./cov19-cli --help                                                                                                                    
real 0.01
user 0.00
sys 0.00
</code></pre>
<h3 id="finalswords">Final words</h3>
<p>Building Java-based native CLI tools is becoming possible nowadays with Picocli and GraalVM. Of course, there are several limitations in the native-image compiler, mainly the partial reflection support. Nevertheless, the combination of both tools to create CLI tools without JVM overhead looks promising.</p>
<h5 id="ressources">Resources:</h5>
<ul>
<li><a href="https://picocli.info/?ref=aboullaite.me#_introduction">https://picocli.info/#_introduction</a></li>
<li><a href="https://medium.com/graalvm/simplifying-native-image-generation-with-maven-plugin-and-embeddable-configuration-d5b283b92f57?ref=aboullaite.me">https://medium.com/graalvm/simplifying-native-image-generation-with-maven-plugin-and-embeddable-configuration-d5b283b92f57</a></li>
<li><a href="https://www.infoq.com/articles/java-native-cli-graalvm-picocli/?ref=aboullaite.me">https://www.infoq.com/articles/java-native-cli-graalvm-picocli/</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 features: Text Blocks &amp; Foreign-Memory Access API]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the fourth and last post in the blog post series I wrote covering the features that have been added to Java 14, released just a couple of days ago.</p>
<blockquote class="twitter-tweet"><p lang="in" dir="ltr">Java 14 / JDK 14: General Availability: <a href="https://t.co/THxJ9llBpj?ref=aboullaite.me">https://t.co/THxJ9llBpj</a> <a href="https://twitter.com/hashtag/jdk14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#jdk14</a> <a href="https://twitter.com/hashtag/java14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java14</a> <a href="https://twitter.com/hashtag/openjdk?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#openjdk</a> <a href="https://twitter.com/hashtag/java?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java</a></p>&#x2014; Mark Reinhold (@mreinhold) <a href="https://twitter.com/mreinhold/status/1239969686449606658?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">March</a></blockquote>]]></description><link>https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/</link><guid isPermaLink="false">6460c00ecda49600011ebb32</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 22 Mar 2020 11:46:32 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the fourth and last post in the blog post series I wrote covering the features that have been added to Java 14, released just a couple of days ago.</p>
<blockquote class="twitter-tweet"><p lang="in" dir="ltr">Java 14 / JDK 14: General Availability: <a href="https://t.co/THxJ9llBpj?ref=aboullaite.me">https://t.co/THxJ9llBpj</a> <a href="https://twitter.com/hashtag/jdk14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#jdk14</a> <a href="https://twitter.com/hashtag/java14?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java14</a> <a href="https://twitter.com/hashtag/openjdk?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#openjdk</a> <a href="https://twitter.com/hashtag/java?src=hash&amp;ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">#java</a></p>&#x2014; Mark Reinhold (@mreinhold) <a href="https://twitter.com/mreinhold/status/1239969686449606658?ref_src=twsrc%5Etfw&amp;ref=aboullaite.me">March 17, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>In this post, I will cover 2 more features: <code>Text Blocks</code> (Second Preview) and <code>Foreign-Memory Access API</code> (Incubator).</p>
<h5 id="java14rewfeaturesarticles">Java 14 new features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-records/">Records</a></li>
<li><a href="https://aboullaite.me/java-14-se-jfrs">Switch Expressions, JFR Event Streaming and more</a></li>
</ul>
<h3 id="jep368textblockssecondpreview">JEP 368: Text Blocks (Second Preview)</h3>
<p>The first preview of Text Blocks was introduced in <a href="https://openjdk.java.net/jeps/355?ref=aboullaite.me">Java 13</a> as a new, more concrete and concise vision of how <a href="https://openjdk.java.net/jeps/326?ref=aboullaite.me">Raw String Literals</a> should work in Java. You can read more about the withdrawal of JEP 326 <a href="http://mail.openjdk.java.net/pipermail/jdk-dev/2018-December/002402.html?ref=aboullaite.me">here</a>.</p>
<p>A <code>text block</code> is a multi-line string literal that avoids the need for most escape sequences, automatically formats the string in a predictable way, makes inline multi-line strings more readable and gives the developer control over the format when desired.</p>
<h4 id="usage">Usage</h4>
<p>A text block starts with an opening delimiter of three double-quote characters (<code>&quot;&quot;&quot;</code>), followed by zero or more space, tab and form feed characters, and a line terminator. It ends with a closing delimiter, which is another sequence of three double-quote characters (<code>&quot;&quot;&quot;</code>).</p>
<pre><code>&quot;&quot;&quot;
Text
Block
Example
&quot;&quot;&quot;
</code></pre>
<p>Optionally, the closing delimiter can be placed directly at the end of the last line of content:</p>
<pre><code>&quot;&quot;&quot;
Text
Block
Example&quot;&quot;&quot;
</code></pre>
<p>Note that the result type of a text block is still a <code>String</code>; text blocks just give us another way to write string literals in our source code.</p>
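<p>A quick, compilable check of that point (my own example): the text block below produces exactly the same <code>String</code> as the escaped, concatenated literal.</p>

```java
public class TextBlockDemo {
    public static void main(String[] args) {
        String concat = "Text\nBlock\nExample\n";
        // Same value, written as a text block:
        String block = """
                Text
                Block
                Example
                """;
        System.out.println(block.equals(concat)); // true
    }
}
```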
<h4 id="indentation">Indentation</h4>
<p>Text blocks make it a bit easier to indent our code properly. To calculate how many whitespace characters should be removed from every line, the compiler determines the line with the least leading whitespace and shifts the complete text block to the left by that amount, differentiating incidental whitespace from essential whitespace.</p>
<p>For example, including the trailing blank line with the closing delimiter, the common white space prefix is 11, so eleven white spaces are removed from the start of each line.</p>
<pre><code class="language-java">// spaces (dots) will be removed
        String text= &quot;&quot;&quot;
...........     some text
...........     having fun
...........     with Text Blocks
...........&quot;&quot;&quot;;
</code></pre>
<p>Now, suppose the closing delimiter is moved slightly to the right of the content; in this case, 16 white spaces are removed from the start of each line:</p>
<pre><code class="language-java">// spaces (dots) will be removed
        String text= &quot;&quot;&quot;
................some text
................having fun
................with Text Blocks
................   &quot;&quot;&quot;;
</code></pre>
<p>The spaces visualized with dots are considered to be incidental and hence will be removed.</p>
<h4 id="escaping">Escaping</h4>
<p>The use of the escape sequences <code>\&quot;</code> and <code>\n</code> is permitted in a text block, but is neither necessary nor recommended. However, representing the sequence <code>&quot;&quot;&quot;</code> in a text block requires escaping at least one <code>&quot;</code> character, to avoid mimicking the closing delimiter.</p>
<pre><code class="language-java">String code =
    &quot;&quot;&quot;
    String text = \&quot;&quot;&quot;
        This is a Text Block inside a Text Block
    \&quot;&quot;&quot;;
    &quot;&quot;&quot;;
</code></pre>
<p>The string represented by a text block is not the literal sequence of characters in the content. Instead, the string represented by a text block is the result of applying the following transformations to the content, in order:</p>
<ul>
<li>Line terminators are normalized to the ASCII LF character, as follows:
<ul>
<li>An ASCII CR (Carriage Return) character followed by an ASCII LF (Line Feed) character is translated to an ASCII LF character.</li>
<li>An ASCII CR character is translated to an ASCII LF character.</li>
</ul>
</li>
<li>Incidental white space is removed, as if by execution of <code>String::stripIndent</code> on the characters resulting from step 1.</li>
<li>Escape sequences are interpreted, as if by execution of <code>String::translateEscapes</code> on the characters resulting from step 2.</li>
</ul>
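<p>Steps 2 and 3 are also exposed directly as <code>String</code> methods (finalized in later JDK releases), so their effect can be observed on ordinary strings; a minimal sketch:</p>

```java
public class TextBlockTransforms {

    public static void main(String[] args) {
        // Step 2: incidental white space removal (String::stripIndent);
        // the common 4-space prefix is stripped from both lines
        String indented = "    line one\n    line two";
        System.out.println(indented.stripIndent().equals("line one\nline two")); // true

        // Step 3: escape-sequence interpretation (String::translateEscapes);
        // the two characters \ and t become a single tab character
        String escaped = "a\\tb";
        System.out.println(escaped.translateEscapes().equals("a\tb")); // true
    }
}
```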
<h5 id="newescapesequences">New escape sequences</h5>
<p>With Java 14, text blocks gained two new escape sequences:</p>
<ul>
<li>The <code>\&lt;line-terminator&gt;</code> escape sequence explicitly suppresses the insertion of a newline character. This is very useful when you have long lines of text in the source code that you want to format in a readable way.</li>
<li>The new <code>\s</code> escape sequence simply translates to a single space (<code>\u0020</code>). It essentially tells the compiler to preserve any spaces in front of this escaped space, instead of stripping them (the default behaviour).</li>
</ul>
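<p>Both escape sequences are easy to see in action; in the sketch below (field names mine), the first text block collapses to a single line, and the second keeps its trailing spaces so both lines end up six characters wide:</p>

```java
public class NewEscapes {

    // \ at the end of a line suppresses the newline, joining the lines
    static final String JOINED = """
            Lorem ipsum \
            dolor sit amet""";

    // \s translates to a space and protects the spaces before it
    // from the default trailing-white-space stripping
    static final String PADDED = """
            red  \s
            green\s""";

    public static void main(String[] args) {
        System.out.println(JOINED); // Lorem ipsum dolor sit amet
    }
}
```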
<h4 id="methods">Methods</h4>
<p>To support the new features of Text Blocks, a couple of methods have been introduced (some of them already mentioned above):</p>
<ul>
<li><code>String::stripIndent()</code>: used to strip away incidental white space from the text block content</li>
<li><code>String::translateEscapes()</code>: used to translate escape sequences</li>
<li><code>String::formatted(Object... args)</code>: simplify value substitution in the text block</li>
</ul>
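<p>For example, <code>formatted</code> makes value substitution inside a text block read naturally; the JSON template below is just an illustration, and behaves exactly like <code>String.format</code>:</p>

```java
public class FormattedExample {

    // %s and %d placeholders are filled in by formatted(),
    // keeping the multi-line template readable inline
    static String userJson(String name, int age) {
        return """
               {
                 "name": "%s",
                 "age": %d
               }
               """.formatted(name, age);
    }

    public static void main(String[] args) {
        System.out.print(userJson("Duke", 25));
    }
}
```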
<h3 id="jep370foreignmemoryaccessapiincubator">JEP 370: Foreign-Memory Access API (Incubator)</h3>
<p>This incubating feature enables efficient, safe and deterministic access to native memory segments outside the JVM heap (off-heap). <a href="https://openjdk.java.net/jeps/370?ref=aboullaite.me">The JEP</a> also states that this foreign-memory API is intended as an alternative to the currently used approaches (<code>java.nio.ByteBuffer</code>, introduced in 2002 with <code>Java 1.4</code>, and <code>sun.misc.Unsafe</code>, long before that).</p>
<p>The foreign-memory access API, part of <a href="https://openjdk.java.net/projects/panama/?ref=aboullaite.me">Project Panama</a>, introduces three main abstractions:</p>
<ul>
<li><code>MemorySegment</code>: is used to model a contiguous memory region with given spatial and temporal bounds.</li>
<li><code>MemoryAddress</code>: can be thought of as an offset within a segment.</li>
<li><code>MemoryLayout</code>: a way to define the layout of a memory segment in a language neutral fashion.</li>
</ul>
<p>To start playing with this API, you first need to add the <code>jdk.incubator.foreign</code> module manually (for example with <code>--add-modules jdk.incubator.foreign</code>).<br>
The simple example below allocates 4 bytes of memory outside the JVM heap space and prints its base address.</p>
<pre><code class="language-java">import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemorySegment;

public class FmaExample {
    public static void main(String[] args) {

        MemoryAddress address = MemorySegment.allocateNative(4).baseAddress();
        System.out.print(address);
    }
}

// Prints
WARNING: Using incubator modules: jdk.incubator.foreign
MemoryAddress{ region: MemorySegment{ id=0x1406e03c limit: 4 } offset=0x0 }  
</code></pre>
<p>In the above, we are using the overloaded <code>allocateNative()</code> which takes a <code>long</code> value of the size in bytes and creates a new native <code>memory segment</code> that models a newly allocated block of off-heap memory. There are two other versions of this method: one accepts a <code>MemoryLayout</code>, and one accepts a size in bytes together with the byte alignment.</p>
<p>In order to use the memory segment from the example above, a <code>memory-access var handle</code> should be used. These are obtained using factory methods in the <code>MemoryHandles</code> class. The example below allocates a 10-byte segment and stores the int value 10 at its base address:</p>
<pre><code class="language-java">import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemoryHandles;
import jdk.incubator.foreign.MemorySegment;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class FmaExample {
    public static void main(String[] args) {

        MemoryAddress address = MemorySegment.allocateNative(10).baseAddress();
        VarHandle handle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
        handle.set(address, 10);

        System.out.println(&quot;Memory Value: &quot; + handle.get(address));
    }
}

// Prints
WARNING: Using incubator modules: jdk.incubator.foreign
Memory Value: 10
</code></pre>
<hr>
<h4 id="ressourcesandfurtherreading">Resources and further reading:</h4>
<ul>
<li><a href="https://docs.oracle.com/javase/specs/jls/se14/preview/specs/text-blocks-jls.html?ref=aboullaite.me">https://docs.oracle.com/javase/specs/jls/se14/preview/specs/text-blocks-jls.html</a></li>
<li><a href="https://www.baeldung.com/java-text-blocks?ref=aboullaite.me">https://www.baeldung.com/java-text-blocks</a></li>
<li><a href="https://www.jrebel.com/blog/using-text-blocks-in-java-13?ref=aboullaite.me">https://www.jrebel.com/blog/using-text-blocks-in-java-13</a></li>
<li><a href="https://medium.com/@youngty1997/jdk-14-foreign-memory-access-api-overview-70951fe221c9?ref=aboullaite.me">https://medium.com/@youngty1997/jdk-14-foreign-memory-access-api-overview-70951fe221c9</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 features: Switch Expressions, JFR Event Streaming and more]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the third post in a series of blog posts highlighting some features and improvements that will be introduced in Java 14, expected to go GA in a couple of days.<br>
In this post, We will have a look into Switch expression, JFR streaming, as well as some various</p>]]></description><link>https://aboullaite.me/java-14-se-jfrs/</link><guid isPermaLink="false">6460c00ecda49600011ebb31</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 11 Mar 2020 15:05:13 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the third post in a series of blog posts highlighting some features and improvements that will be introduced in Java 14, expected to go GA in a couple of days.<br>
In this post, we will have a look at switch expressions, JFR event streaming, as well as various minor improvements.</p>
<h5 id="java14rewfeaturesarticles">Java 14 new features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-records/">Records</a></li>
<li><a href="https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/">Text Blocks &amp; Foreign-Memory Access API</a></li>
</ul>
<h3 id="switchexpressionstandardjep361">Switch Expression (Standard): JEP 361</h3>
<p>Switch expressions were first introduced in JDK 12 as a preview feature, then refined in JDK 13, and they are made final and permanent in JDK 14.</p>
<h4 id="alookintoswitchexpression">A look into Switch expression</h4>
<p>Often, a switch statement produces a value in each of its case blocks. Switch expressions enable a more concise syntax: fewer repetitive <code>case</code> and <code>break</code> keywords, and less error-prone code.<br>
Consider the following example:</p>
<pre><code class="language-java">        WeekDay day = WeekDay.FRIDAY;
        String dayType;
        switch (day) {
            case MONDAY:
            case TUESDAY:
            case WEDNESDAY:
            case THURSDAY:
            case FRIDAY:
                dayType = &quot;Weekday&quot;;
                break;
            case SATURDAY:
            case SUNDAY:
                dayType = &quot;Weekend&quot;;
                break;

            default:
                throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        }
</code></pre>
<p>That&apos;s how we check whether a specific day is a weekday, using our good old switch statement. It would be better if we could <strong>return</strong> this information without having to store it in the variable <code>dayType</code>; we can do this with a switch expression, which is both clearer and safer:</p>
<pre><code class="language-java"> String dayType = switch (day){
            case MONDAY, THURSDAY, WEDNESDAY, TUESDAY, FRIDAY -&gt; &quot;Weekday&quot;;
            case SATURDAY, SUNDAY -&gt; &quot;Weekend&quot;;
            default -&gt; throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        };
</code></pre>
<p>As you can see, instead of having to break out of the different cases, we used the new lambda-style switch syntax, which executes the expression on the right when the label matches. This is a more straightforward control flow, free of fall-through (no need for <code>break</code> statements).<br>
Furthermore, the example above used &quot;<em>arrow case</em>&quot; labels, with the arrow between the label and the execution. We could instead use &quot;<em>colon case</em>&quot; labels:</p>
<pre><code class="language-java">        String dayType = switch (day){
            case MONDAY, THURSDAY, WEDNESDAY, TUESDAY, FRIDAY:
                yield &quot;Weekday&quot;;
            case SATURDAY, SUNDAY:
                yield &quot;Weekend&quot;;
            default:
                throw new IllegalArgumentException(&quot;Invalid Day&quot;);
        };
</code></pre>
<p>But what is <code>yield</code>? The <code>yield</code> statement was introduced in <a href="https://openjdk.java.net/jeps/354?ref=aboullaite.me">JDK 13</a>! It takes one argument: the value that the case label produces in a switch expression. Its presence is also an easy rule of thumb to differentiate a switch expression from a switch statement.</p>
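<p>For completeness, here is the weekday example as a small self-contained class (the enum and class names are mine). Note that because the switch covers every enum constant, the compiler accepts it without a <code>default</code> branch:</p>

```java
public class SwitchExpressions {

    enum WeekDay { MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY }

    static String dayType(WeekDay day) {
        // arrow labels: no fall-through, each case produces a value
        return switch (day) {
            case SATURDAY, SUNDAY -> "Weekend";
            case MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY -> "Weekday";
        };
    }

    public static void main(String[] args) {
        System.out.println(dayType(WeekDay.FRIDAY)); // Weekday
    }
}
```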
<h3 id="jfreventstreamingjep349">JFR Event Streaming: JEP 349</h3>
<p><code>Java Flight Recorder</code> has a long history. It was first part of the BEA JRockit JVM. Then, after Oracle acquired BEA, it became a commercial feature of the Oracle JDK, before finally being open sourced with the release of OpenJDK 11 (JEP 328); it is also <a href="https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-January/011063.html?ref=aboullaite.me">in the process of being backported to 8</a>.<br>
The arrival of JDK 14 introduces a new feature to JFR: the ability for JFR to produce a continuous stream of events.</p>
<h4 id="whatisjfr">What is JFR?</h4>
<p>JFR is basically a monitoring tool that collects information about the events in a Java Virtual Machine (JVM) during the execution of a Java application. It is designed to affect the performance of a running application as little as possible.</p>
<h4 id="jfrinjdk14">JFR in JDK 14</h4>
<p>With <a href="https://openjdk.java.net/jeps/349?ref=aboullaite.me">JEP 349</a>, a new usage mode for JFR becomes available: JFR Event Streaming. This API provides a way for programs to receive callbacks when JFR events occur and respond to them immediately, for both in-process and out-of-process applications. The same set of events can be recorded as in the non-streaming mode, and streaming can take place at the same time as a regular recording.<br>
Check out the following example:</p>
<pre><code class="language-java">        Configuration config = Configuration.getConfiguration(&quot;default&quot;);
        try (var es = new RecordingStream(config)) {
            es.onEvent(&quot;jdk.GarbageCollection&quot;, System.out::println);
            es.onEvent(&quot;jdk.CPULoad&quot;, System.out::println);
            es.onEvent(&quot;jdk.JVMInformation&quot;, System.out::println);
            es.setMaxAge(Duration.ofSeconds(10));
            es.start();
        }
</code></pre>
<p>This snippet starts JFR on the local JVM using the default recorder settings and prints the <code>Garbage Collection</code>, <code>CPU Load</code> and <code>JVM Information</code> events to standard output:</p>
<pre><code>jdk.JVMInformation {
  startTime = 12:13:28.724
  jvmName = &quot;OpenJDK 64-Bit Server VM&quot;
  jvmVersion = &quot;OpenJDK 64-Bit Server VM (14+36-1461) for bsd-amd64 JRE (14+36-1461), built on Feb  6 2020 19:03:05 by &quot;mach5one&quot; with clang 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)&quot;
  jvmArguments = N/A
  jvmFlags = N/A
  javaArguments = &quot;me.aboullaite.JFRStreamTest&quot;
  jvmStartTime = 12:13:28.415
  pid = 72713
}
jdk.CPULoad {
  startTime = 12:13:30.682
  jvmUser = 0.98%
  jvmSystem = 0.08%
  machineTotal = 2.86%
}
</code></pre>
<h3 id="numaawarememoryallocationforg1jep345">NUMA-Aware Memory Allocation for G1: JEP 345</h3>
<p><a href="https://queue.acm.org/detail.cfm?id=2513149&amp;ref=aboullaite.me">NUMA (Non-uniform memory access)</a> is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the ability of the system to be expanded.</p>
<p>This JEP aims to improve G1 performance on large machines by implementing NUMA-aware memory allocation. G1&apos;s heap is organized as a collection of fixed-size regions. A region is typically a set of physical pages, although when using large pages (via <code>-XX:+UseLargePages</code>) several regions may make up a single physical page. If the <code>-XX:+UseNUMA</code> option is specified then, when the JVM is initialized, the regions will be evenly spread across the total number of available NUMA nodes.</p>
<h3 id="nonvolatilemappedbytebuffersjep352">Non-Volatile Mapped Byte Buffers: JEP 352</h3>
<p>This JEP improves the <code>FileChannel</code> API to support creating mapped byte buffers on non-volatile memory (persistent memory). The only API change required is a new enumeration employed by <code>FileChannel</code> clients to request mapping of a file located on an NVM-backed file system rather than a conventional file storage system. The new enumeration values are used when calling the <code>FileChannel::map</code> method to create, respectively, a read-only or read-write MappedByteBuffer mapped over an NVM device file. This feature is only supported on the Linux/x64 and Linux/AArch64 platforms.</p>
<h3 id="deprecatethesolarisandsparcportsjep362">Deprecate the Solaris and SPARC Ports: JEP 362</h3>
<p>The Solaris/SPARC, Solaris/x64, and Linux/SPARC ports are deprecated and will be removed in a future release. The main motivation is to enable OpenJDK Community contributors to accelerate the development of new features that move the platform forward.</p>
<h3 id="removetheconcurrentmarksweepcmsgarbagecollectorjep363">Remove the Concurrent Mark Sweep (CMS) Garbage Collector: JEP 363</h3>
<p>The CMS garbage collector was deprecated in <a href="https://openjdk.java.net/jeps/291?ref=aboullaite.me">Java 9</a>, and it is removed in Java 14.</p>
<h3 id="zgconmacosjep364andwindowsjep365">ZGC on macOS (JEP 364) and Windows (JEP 365)</h3>
<p>ZGC was introduced in Java 11, but it was only supported on Linux. Now it is also available on macOS and Windows. On Windows, ZGC is not supported on Windows 10 and Windows Server releases older than version 1803, since older versions lack the required API for placeholder memory reservations.</p>
<h3 id="deprecatetheparallelscavengeserialoldgccombinationjep366">Deprecate the ParallelScavenge + SerialOld GC Combination: JEP 366</h3>
<p>The Parallel Scavenge young and Serial Old garbage collector combination is deprecated due to little use and the significant amount of maintenance effort it requires.</p>
<h3 id="removethepack200toolsandapijep367">Remove the Pack200 Tools and API: JEP 367</h3>
<p>The Pack200 tools and API were deprecated in <a href="https://openjdk.java.net/jeps/336?ref=aboullaite.me">Java 11</a>, and they are removed in Java 14.</p>
<hr>
<h4 id="resourcesandfurtherreading">Resources and further reading:</h4>
<ul>
<li><a href="https://openjdk.java.net/projects/jdk/14/?ref=aboullaite.me">https://openjdk.java.net/projects/jdk/14/</a></li>
<li><a href="https://blog.codefx.org/java/switch-expressions/?ref=aboullaite.me#No-Fall-Through">https://blog.codefx.org/java/switch-expressions/#No-Fall-Through</a></li>
<li><a href="https://docs.oracle.com/en/java/javase/13/language/switch-expressions.html?ref=aboullaite.me">https://docs.oracle.com/en/java/javase/13/language/switch-expressions.html</a></li>
<li><a href="https://blogs.oracle.com/javamagazine/java-flight-recorder-and-jfr-event-streaming-in-java-14?ref=aboullaite.me">https://blogs.oracle.com/javamagazine/java-flight-recorder-and-jfr-event-streaming-in-java-14</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Skaffold, OKE & OCIR!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>If you&apos;re working on cloud-native apps and containers, you probably already noticed that, amid all features that containers offer, they somehow added a new layer of complexity to the development workflow! We spend a great amount of time building container images, pushing them across registries, updating Kubernetes manifests,</p>]]></description><link>https://aboullaite.me/skaffold-oke-ocir/</link><guid isPermaLink="false">6460c00ecda49600011ebb30</guid><category><![CDATA[Docker]]></category><category><![CDATA[Devops]]></category><category><![CDATA[kubernetes]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Fri, 06 Mar 2020 03:31:16 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>If you&apos;re working on cloud-native apps and containers, you probably already noticed that, amid all features that containers offer, they somehow added a new layer of complexity to the development workflow! We spend a great amount of time building container images, pushing them across registries, updating Kubernetes manifests, redeploying the application and checking if everything works as intended... even for the smallest changes. The feedback loop gets bigger and bigger!</p>
<p>One of the open source tools that helps to solve this issue, especially while working with kubernetes, is <a href="https://skaffold.dev/?ref=aboullaite.me">Skaffold</a>! Skaffold is a command line tool by Google, that facilitates continuous development for Kubernetes applications. The goal is to help developers to focus on writing and maintaining code rather than managing the repetitive steps required during the edit-debug-deploy inner loop.</p>
<p>In this post, I describe the steps to continuously deploy your cloud-native apps, focus on coding and boost productivity, using Skaffold and <a href="https://cloud.oracle.com/?ref=aboullaite.me">Oracle Cloud</a>, mainly OKE and OCIR.</p>
<h3 id="prerequisites">Prerequisites!</h3>
<p>Make sure that you have Docker installed on your machine. If not, you can either install Docker Desktop for Mac and Windows, or Docker Engine for Linux. This <a href="https://docs.docker.com/install/?ref=aboullaite.me">link</a> describes the necessary steps to guide you through.</p>
<p>Additionally, since we&apos;ll be interacting with K8S, we need the Kubernetes command-line tool: <code>kubectl</code>. The complete guide on how to install and configure <code>kubectl</code> can be found <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/?ref=aboullaite.me">here</a>.</p>
<h3 id="installingskaffold">Installing Skaffold</h3>
<p>Installing Skaffold is pretty straightforward. Below the details to configure  Skaffold on Mac, Windows and Linux:</p>
<h4 id="mac">Mac</h4>
<p>If you&apos;re familiar with <a href="https://brew.sh/?ref=aboullaite.me">Homebrew</a>, just run <code>brew install skaffold</code> to set up Skaffold on your machine. Otherwise, run the commands below in your terminal, which download the binary and place it in the <code>/usr/local/bin</code> folder:</p>
<pre><code>curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-darwin-amd64
chmod +x skaffold
sudo mv skaffold /usr/local/bin
</code></pre>
<h4 id="linux">Linux</h4>
<p>Linux users can run the following commands to install and configure Skaffold:</p>
<pre><code>curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64
chmod +x skaffold
sudo mv skaffold /usr/local/bin
</code></pre>
<h4 id="windows">Windows</h4>
<p>If you&apos;re using Windows, you need to download the <code>.exe</code> file from <a href="https://storage.googleapis.com/skaffold/releases/latest/skaffold-windows-amd64.exe?ref=aboullaite.me">here</a> and place it under your <code>PATH</code> folder.</p>
<p>More details can be found on <a href="https://skaffold.dev/docs/install/?ref=aboullaite.me"> Skaffold&apos;s Getting Started Guide page</a>.</p>
<h3 id="oraclecloudconfiguration">Oracle Cloud configuration</h3>
<p>Since you&apos;re reading this, I suppose you already have an Oracle Cloud account! If not, head over to the <a href="https://www.oracle.com/cloud/free/?ref=aboullaite.me">Always Free Services</a> page and create one. Yes, it&apos;s free... forever (at least for now)!<br>
Once done, we need to set up a Kubernetes cluster and a container registry. This can easily be done by accessing <strong>Developer services</strong> from the side menu, under <strong>Solutions and Platform</strong>, where you can create and configure your OKE cluster and private OCIR! Detailed step-by-step guides with an in-depth description of the process can be found <a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/oci/oke-full/index.html?ref=aboullaite.me">here</a> and <a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/oci/registry/index.html?ref=aboullaite.me">here</a>.</p>
<h4 id="okeconfig">OKE config</h4>
<p>Like any cloud service provider, Oracle Cloud has its own command-line tool to work with and manage Oracle Cloud services. Make sure to install and configure it by following this <a href="https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm?ref=aboullaite.me">link</a>.</p>
<p><img src="https://aboullaite.me/content/images/2020/03/Screen-Shot-2020-03-06-at-12.33.22-AM.jpg" alt loading="lazy"><br>
Once the OCI CLI setup is complete, go to your OKE cluster page and hit the <strong>Access Kubeconfig</strong> button at the top of the page. Following the instructions will help you create the correct <code>kubectl</code> configuration. It is worth mentioning that the OCI CLI works with multiple contexts, which means it will keep your previous <code>kubeconfig</code> intact while adding/merging the new config into it. This can easily be verified by running <code>kubectl config view</code> to check the <code>kubeconfig</code> settings, or <code>kubectl config get-contexts</code> to list all your contexts.</p>
<p>The last step is to set the default context to your OKE cluster by running: <code>kubectl config use-context &lt;oke-cluster-id&gt;</code></p>
<h4 id="ocirconfig">OCIR config</h4>
<p>By now, you should have created your private container registry in Oracle Cloud. Make sure it&apos;s private; even if not mandatory, that is how things should be from a security and enterprise perspective.<br>
<strong>OCIR</strong> stands for Oracle Cloud Infrastructure Registry. It&apos;s basically an Oracle-managed registry for your Docker container images. You can read more about Docker registries <a href="https://docs.docker.com/registry/introduction/?ref=aboullaite.me">here</a>.</p>
<p>Since our OCIR is private, we need to configure a token to access it for both pushing and pulling our containers images. Head over to your Oracle Cloud console page, click <strong>User Settings</strong> under your profile image, hit the <strong>Auth Tokens</strong> page and then click the <strong>Generate Token</strong> button. Carefully note down the generated token as we will need it in the next steps.</p>
<p>Afterwards, we need to make sure that we can access our registry with the generated token. For that, we log in to OCIR from the Docker CLI by typing the following command in your terminal:<br>
<code>docker login &lt;region-key&gt;.ocir.io</code></p>
<p>where <code>&lt;region-key&gt;</code> is the key for the Oracle Cloud Infrastructure Registry region you&apos;re using. This <a href="https://docs.cloud.oracle.com/en-us/iaas/Content/General/Concepts/regions.htm?ref=aboullaite.me">link</a> contains a list of oracle cloud region keys.</p>
<p>You will be prompted to provide a username and password! The username follows the format: <code>&lt;tenancy-namespace&gt;/&lt;username&gt;</code>. If your tenancy is federated with Oracle Identity Cloud Service, use the format <code>&lt;tenancy-namespace&gt;/oracleidentitycloudservice/&lt;username&gt;</code>.  Note that <code>tenancy-namespace</code> is the auto-generated Object Storage namespace string of the tenancy containing the repository from which the application is to pull the image. The password is the <strong>auth token</strong> you copied earlier.<br>
If everything is fine, you should get a <code>Login Succeeded</code> message. If the login fails, try to verify and repeat the step above.</p>
<p>Now that we&apos;re sure that the registry is accessible, we create a <code>Secret</code> that will be used in our K8S manifests to pull the image from it! This can be achieved by running:</p>
<pre><code class="language-shell">## An email address is required, but it doesn&apos;t matter what you specify
$ kubectl create secret docker-registry &lt;secret-name&gt; --docker-server=&lt;region-key&gt;.ocir.io --docker-username=&apos;&lt;tenancy-namespace&gt;/&lt;oci-username&gt;&apos; --docker-password=&apos;&lt;oci-auth-token&gt;&apos; --docker-email=&apos;&lt;email-address&gt;&apos;
</code></pre>
<h3 id="helloworld">Hello World!</h3>
<p>To put everything together, we&apos;ll use an example from the Skaffold samples to check our setup. The example can be found <a href="https://github.com/GoogleContainerTools/skaffold/tree/master/examples/getting-started?ref=aboullaite.me">here</a>. The folder contains a single-file Go application that prints <code>Hello World!</code> every second. To containerize the app, the Dockerfile uses the multi-stage build feature: the app is built in the first stage (builder), and the generated binary is copied and run in the second/production stage.</p>
<p>The example also provides a simple <code>k8s-pod.yaml</code> to run the app in the K8S cluster. This file needs to be updated to specify the Docker secret created to access OCIR, using <code>imagePullSecrets</code>. Below is the updated file:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: getting-started
spec:
  containers:
  - name: getting-started
    image: skaffold-example
  imagePullSecrets:
  - name: ocirsecret
</code></pre>
<p>Finally, you can either change the <code>skaffold.yaml</code> file to match the new registry, or use the <code>--default-repo</code> flag to prefix the image name with the OCIR registry, with no manual YAML editing! The Skaffold config file contains multiple stages specifying the steps to build and deploy your application. More details can be found <a href="https://skaffold.dev/docs/pipeline-stages/?ref=aboullaite.me">here</a>.</p>
<p>Now, you can continuously develop, deploy and test your changes using:</p>
<pre><code>$ skaffold dev --default-repo=&lt;region-key&gt;.ocir.io/tenancy-namespace&gt;/&lt;project-id&gt;
</code></pre>
<p>You can make changes to the <code>main.go</code> file, and Skaffold will build a new image, push it to OCIR, deploy it on OKE and print the logs for you!<br>
<img src="https://aboullaite.me/content/images/2020/03/Screen-Shot-2020-03-06-at-4.24.43-AM.png" alt loading="lazy"></p>
<hr>
<p>Resources:</p>
<ul>
<li><a href="https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga?ref=aboullaite.me">https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga</a></li>
<li><a href="https://docs.cloud.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypullingimagesfromocir.htm?ref=aboullaite.me">https://docs.cloud.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypullingimagesfromocir.htm</a></li>
<li><a href="https://www.ateam-oracle.com/continuous-deployments-with-skaffold-on-oracle-cloud-infrastructure-container-engine-for-kubernetes-oke?ref=aboullaite.me">https://www.ateam-oracle.com/continuous-deployments-with-skaffold-on-oracle-cloud-infrastructure-container-engine-for-kubernetes-oke</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[5 reasons to attend DevNexus 2020]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We are writing these lines from  2 different beautiful cities; Brussels for me where I&apos;m attending, for the first time the amazing <a href="https://fosdem.org/2020/?ref=aboullaite.me">FOSDEM</a>; and Copenhagen for my dear friend, java champion and boss <a href="https://twitter.com/badrelhouari?ref=aboullaite.me">Badr El Houari</a> after participating in <a href="https://jspirit.org/?ref=aboullaite.me">jspirit unconference</a>, as he likes to describe himself lately:</p>]]></description><link>https://aboullaite.me/5-reasons-to-attend-devnexus-2020/</link><guid isPermaLink="false">6460c00ecda49600011ebb2f</guid><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Sun, 02 Feb 2020 18:19:05 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>We are writing these lines from  2 different beautiful cities; Brussels for me where I&apos;m attending, for the first time the amazing <a href="https://fosdem.org/2020/?ref=aboullaite.me">FOSDEM</a>; and Copenhagen for my dear friend, java champion and boss <a href="https://twitter.com/badrelhouari?ref=aboullaite.me">Badr El Houari</a> after participating in <a href="https://jspirit.org/?ref=aboullaite.me">jspirit unconference</a>, as he likes to describe himself lately: unconference advocate!</p>
<p>Our next conference this year will be <a href="https://www.jfokus.se/?ref=aboullaite.me">Jfokus</a> in Stockholm, then I&apos;ll fly to Atlanta to attend a conference I&apos;ve wanted to join for a while: <a href="https://devnexus.com/?ref=aboullaite.me">DevNexus</a>.</p>
<p>In this post, we share with you the top 5 reasons why I&apos;m attending, and why you should join me at this year&apos;s Devnexus.</p>
<h3 id="jugleaderssummit">JUG Leaders Summit</h3>
<p>I am attending DevNexus not as a speaker (unfortunately), but as a JUG leader representing MoroccoJUG (a great honor). This year, the DevNexus team is organizing a GLOBAL JUG LEADERS SUMMIT!</p>
<p>MoroccoJUG is the only active JUG in Morocco, a previous member of the JCP, and has been at the forefront of Adopt-a-JSR from the very start. In fact, the JUG was recognized as an &quot;Outstanding Adopt-a-JSR Participant&quot; for its contributions to Java EE 7.</p>
<p>The JUG leaders summit is an amazing opportunity to meet fellow JUG leaders, discuss common challenges, exchange ideas, give feedback, and pick up tips on how to build an engaged community and run successful events.</p>
<p>The JUG leaders summit is organized during the first day of the conference, Feb 19, and I am already super excited to be part of it :)</p>
<h3 id="byjavacommunityforjavacommunity">By Java Community for Java community</h3>
<p>Devnexus is the largest independent Java platform conference in the USA, run by the <a href="https://ajug.org/?ref=aboullaite.me">Atlanta JUG</a>. Devnexus has grown to an annual attendance of over 2,000 software developers and has become one of the leading technology events held annually around the globe.</p>
<p>I heard a lot of cool things about the conference and how the organizers aim to connect developers from all over the world and promote open-source values. Besides the technical talks, there will be many opportunities at the conference to meet with the community and to network.</p>
<h3 id="meetingtheusualsuspect">Meeting the usual suspect</h3>
<p>Usual suspect are everywhere! That&apos;s a fact. But Devnexus is one of the annual rendezvous for many usual suspect to meet, hang out, share knowledge, learn from each other, and have fun. Never underestimate the power of a little fun mixed with some interesting people!</p>
<p><img src="https://media.giphy.com/media/b2omCv2khTGiA/giphy.gif" alt loading="lazy"></p>
<p>I typically spend as much time talking to people in the hallways as I do attending talks. It&apos;s a great way to build new relationships and make connections with attendees from diverse backgrounds who have a lot to share.</p>
<h3 id="greatspeakerlineup">Great speaker lineup</h3>
<p>With many rock-star <a href="https://devnexus.com/speakers/?ref=aboullaite.me">speakers</a>, 14 concurrent tracks and <a href="https://devnexus.com/schedule?ref=aboullaite.me">150+ individual sessions</a>, Devnexus brings participants unparalleled opportunities both to learn about the latest technology trends and to dive deep into the technologies that interest them.</p>
<p>The sheer amount of content at Devnexus is nothing less than astounding! With so many diverse sessions happening simultaneously from early morning until late evening, covering a wide range of technology trends, there&#x2019;ll be something you will learn and take away from this conference no matter what.</p>
<h3 id="greatlocation">Great Location</h3>
<p>Atlanta is the No. 1 filming location for movies and TV shows in the world, according to <a href="https://www.filmla.com/?ref=aboullaite.me">FilmL.A</a>. It&apos;s a city that has many people buzzing: millions swing by Georgia&apos;s capital every year to feel its historical significance and get a taste of its vibrant culture.</p>
<p>Devnexus is an opportunity for me to visit Atlanta and discover its southern charm, dynamic culture and rich history.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Java 14 new features: Records]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is the second article in the blog post series discussing the new features introduced in java 14. Today&apos;s article is focused on <code>Records</code> that aims to provide a compact &amp; concise way for declaring data classes.</p>
<h5 id="java14rewfeaturesarticles">Java 14 rew features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp;</a></li></ul>]]></description><link>https://aboullaite.me/java-14-records/</link><guid isPermaLink="false">6460c00ecda49600011ebb2e</guid><category><![CDATA[Java]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 29 Jan 2020 15:56:04 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is the second article in the blog post series discussing the new features introduced in java 14. Today&apos;s article is focused on <code>Records</code> that aims to provide a compact &amp; concise way for declaring data classes.</p>
<h5 id="java14rewfeaturesarticles">Java 14 rew features articles:</h5>
<ul>
<li><a href="https://aboullaite.me/java-14-instanceof-jpackage-npes/">Pattern Matching for <code>instanceof</code>, <code>jpackage</code> &amp; helpful NPEs</a></li>
<li><a href="https://aboullaite.me/java-14-se-jfrs">Switch Expressions, JFR Event Streaming and more</a></li>
<li><a href="https://aboullaite.me/java-14-text-blocks-foreign-memory-access-api/">Text Blocks &amp; Foreign-Memory Access API</a></li>
</ul>
<h3 id="why">Why ?</h3>
<blockquote>
<p>Java is verbose!</p>
</blockquote>
<p>You&apos;ve surely heard this statement before: from your colleagues, at a conference, probably in meetups, or on Twitter or Reddit!<br>
<a href="https://twitter.com/BrianGoetz?ref=aboullaite.me">Brian Goetz</a>, Java Language Architect at Oracle, wrote a detailed post on the matter, stating that, for example, developers who want to create simple data-carrier classes that are easy to understand have to write a lot of low-value, repetitive, error-prone code: <code>constructors</code>, <code>accessors</code>, <code>equals()</code>, <code>hashCode()</code>, <code>toString()</code>...<br>
To avoid the frustration, some rely on IDE capabilities to do the legwork of writing the boilerplate, but fail to consider much beyond the functionality of the code itself to help the reader distill the design intent. Others use libraries such as <a href="https://projectlombok.org/?ref=aboullaite.me">Lombok</a>, while the lazy ones simply omit those methods, leading to surprising behavior and poor debuggability.</p>
<h3 id="anewtypedeclarationrecord">A new type declaration: Record!</h3>
<p>Records are a special kind of lightweight class in Java, intended to be simple data carriers, similar to what exists in other languages (such as <code>case</code> classes in Scala, <code>data</code> classes in Kotlin and <code>record</code> classes in C#). The aim is to extend the Java language syntax and create a way to say that a type represents only data. By making this statement, we&apos;re telling the compiler to do all the work for us and generate the methods without any extra effort on our part.</p>
<h4 id="showmethecode">Show me the code</h4>
<p>Let&apos;s start with the following <code>Person</code> record:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){}
</code></pre>
<p>The record class is an immutable, transparent carrier for a fixed set of fields known as the record <code>components</code>, which provide a <code>state</code> description for the record. Each component gives rise to a <code>final</code> field that holds the provided value and an <code>accessor</code> method to retrieve it. The field name and the accessor name match the name of the component.</p>
<p>Let&apos;s now try to compile the <code>Person</code> class. Since records are still a <code>preview language feature</code>, we need to enable the preview flag:</p>
<pre><code>javac --enable-preview -source 14 Person.java
</code></pre>
<p>Now if we examine the class file with <code>javap</code>, we can see that the compiler has <strong>autogenerated</strong> a bunch of boilerplate code:</p>
<pre><code>$ javap Person
Compiled from &quot;Person.java&quot;
public final class Person extends java.lang.Record {
  public Person(java.lang.String, java.lang.String, int, java.lang.String, java.util.Date);
  public java.lang.String toString();
  public final int hashCode();
  public final boolean equals(java.lang.Object);
  public java.lang.String firstName();
  public java.lang.String lastName();
  public int age();
  public java.lang.String address();
  public java.util.Date birthday();
}
</code></pre>
<p>Notice a few things here:</p>
<ul>
<li>a private <code>final</code> field, with the same name and type, for each component in the state description;</li>
<li>a public read <code>accessor</code> method, with the same name and type, for each component in the state description;</li>
<li>a public <code>constructor</code>, whose signature is the same as the state description, which initializes each field from the corresponding argument;</li>
<li>implementations of <code>equals</code> and <code>hashCode</code> that say two records are equal if they are of the same type and contain the same state;</li>
<li>implementation of <code>toString</code> that includes all the components, with their names.</li>
</ul>
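<p>To see these generated members in action, here is a minimal, self-contained sketch. It uses a simplified two-component record (a hypothetical <code>Point</code>, not the article&apos;s <code>Person</code>) just to keep the output short:</p>
<pre><code class="language-java">public class RecordDemo {
    // A simplified record with two components
    record Point(int x, int y) {}

    public static void main(String[] args) {
        Point p1 = new Point(3, 4);
        Point p2 = new Point(3, 4);

        System.out.println(p1.x());        // generated accessor: 3
        System.out.println(p1);            // generated toString: Point[x=3, y=4]
        System.out.println(p1.equals(p2)); // generated state-based equals: true
    }
}
</code></pre>
<p>On JDK 14 this compiles with <code>--enable-preview</code>; on recent JDKs records are a standard feature and no flag is needed.</p>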
<p>Looking further and examining the bytecode, we notice that <code>hashCode</code>, <code>equals</code> and <code>toString</code> all rely on <code>invokedynamic</code> to dynamically invoke the appropriate method containing the implicit implementation.</p>
<pre><code> public java.lang.String toString();
    descriptor: ()Ljava/lang/String;
    flags: (0x0001) ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokedynamic #32,  0             // InvokeDynamic #0:toString:(LPerson;)Ljava/lang/String;
         6: areturn
      LineNumberTable:
        line 2: 0

  public final int hashCode();
    descriptor: ()I
    flags: (0x0011) ACC_PUBLIC, ACC_FINAL
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokedynamic #36,  0             // InvokeDynamic #0:hashCode:(LPerson;)I
         6: ireturn
      LineNumberTable:
        line 2: 0

  public final boolean equals(java.lang.Object);
    descriptor: (Ljava/lang/Object;)Z
    flags: (0x0011) ACC_PUBLIC, ACC_FINAL
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: aload_1
         2: invokedynamic #40,  0             // InvokeDynamic #0:equals:(LPerson;Ljava/lang/Object;)Z
         7: ireturn
      LineNumberTable:
        line 2: 0
</code></pre>
<h3 id="canidefineadditionalmethodsfields">Can I define additional methods, fields...</h3>
<p>The short answer to this question is yes, you can add static fields and methods! The real question, however, is: should you?<br>
Keep in mind that the goal behind records is to let developers group related fields together as a single immutable data item without having to write verbose code. So whenever you feel the temptation to add more fields or methods to your <code>record</code>, consider whether a full class makes more sense and should be used instead.<br>
For example, we can define a method that returns a <code>Person</code>&apos;s full name:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){
    public String fullName(){
        return firstName + &quot; &quot; + lastName;
    }
}
</code></pre>
<h4 id="compactconstructor">Compact constructor</h4>
<p>Additionally, records introduce the <code>compact constructor</code>, where only validation and/or normalization code needs to be given in the constructor body. The remaining initialization code is supplied by the compiler.<br>
For example, if we want to validate a <code>Person</code>&apos;s age to make sure it&apos;s not negative, the code would look similar to:</p>
<pre><code class="language-java">public record Person(
    String firstName,
    String lastName,
    int age,
    String address,
    Date birthday
){
    public Person{
        if (age &lt; 0) {
            throw new IllegalArgumentException(&quot;Age must be greater than 0!&quot;);
        }
    }
}
</code></pre>
<p>Notice that no explicit parameter list is given for the compact constructor; it is derived from the record component list.</p>
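<p>As a quick sanity check, a hypothetical caller passing a negative age now fails fast, because the validation runs before any field is assigned:</p>
<pre><code class="language-java">import java.util.Date;

public class CompactConstructorDemo {
    // Same Person record as above, repeated here so the sketch is self-contained
    record Person(String firstName, String lastName, int age,
                  String address, Date birthday) {
        Person {
            if (age &lt; 0) {
                throw new IllegalArgumentException(&quot;Age must be greater than 0!&quot;);
            }
        }
    }

    public static void main(String[] args) {
        try {
            new Person(&quot;Jane&quot;, &quot;Doe&quot;, -1, &quot;Atlanta&quot;, new Date());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Age must be greater than 0!
        }
    }
}
</code></pre>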
<h3 id="finalword">Final word</h3>
<p>Records address a common issue with using classes as wrappers for data. Plain data classes are significantly reduced from several lines of code to a one-liner.<br>
Keep in mind that Records are a preview language feature, which means that, although it is fully implemented, it is not yet standardized in the JDK.</p>
<hr>
<h5 id="ressources">Ressources:</h5>
<ul>
<li><a href="https://openjdk.java.net/jeps/359?ref=aboullaite.me">https://openjdk.java.net/jeps/359</a></li>
<li><a href="https://cr.openjdk.java.net/~briangoetz/amber/datum.html?ref=aboullaite.me">https://cr.openjdk.java.net/~briangoetz/amber/datum.html</a></li>
<li><a href="http://cr.openjdk.java.net/~gbierman/jep359/jep359-20191125/specs/records-jls.html?ref=aboullaite.me#jls-8.10.5">http://cr.openjdk.java.net/~gbierman/jep359/jep359-20191125/specs/records-jls.html#jls-8.10.5</a></li>
<li><a href="https://blogs.oracle.com/javamagazine/records-come-to-java?ref=aboullaite.me">https://blogs.oracle.com/javamagazine/records-come-to-java</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Tweets Sentiment Analysis using Stanford CoreNLP]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We&apos;re living in an era where <a href="https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data?ref=aboullaite.me">data become the most valuable resource</a>! Nearly every app in the market now, tries to understand its users, their behaviours, preferences, reactions and words! How many times, just after mentioning a watch &#x231A; in a private conversation with your friend on messenger,</p>]]></description><link>https://aboullaite.me/stanford-corenlp-java/</link><guid isPermaLink="false">6460c00ecda49600011ebb2d</guid><category><![CDATA[Java]]></category><category><![CDATA[Spring Boot]]></category><category><![CDATA[NLP]]></category><dc:creator><![CDATA[Mohammed Aboullaite]]></dc:creator><pubDate>Wed, 08 Jan 2020 13:04:30 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>We&apos;re living in an era where <a href="https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data?ref=aboullaite.me">data become the most valuable resource</a>! Nearly every app in the market now, tries to understand its users, their behaviours, preferences, reactions and words! How many times, just after mentioning a watch &#x231A; in a private conversation with your friend on messenger, your Facebook feed starts popping up ads about watches from different vendors?! It Happens EVERY single time!</p>
<p>Understanding this kind of data, classifying it and representing it is the challenge that Natural Language Processing (NLP) tries to solve.<br>
In this article, I describe how I built a small application to perform sentiment analysis on tweets, using <a href="https://stanfordnlp.github.io/CoreNLP?ref=aboullaite.me">Stanford CoreNLP library</a>, <a href="http://twitter4j.org/?ref=aboullaite.me">Twitter4J</a>,  <a href="https://spring.io/projects/spring-boot?ref=aboullaite.me">Spring Boot</a> and <a href="https://reactjs.org/?ref=aboullaite.me">ReactJs</a>! The code is available on <a href="https://github.com/aboullaite/sentiment-analysis?ref=aboullaite.me">GitHub</a>.<br>
<img src="https://aboullaite.me/content/images/2020/01/sentiment-analysys-twitter-1.gif" alt loading="lazy"></p>
<h3 id="application">Application</h3>
<p>For everything related to machine learning, Java is generally not a popular choice. However, given the language&apos;s popularity, there are libraries and frameworks for pretty much everything!<br>
The application uses the Stanford CoreNLP library&apos;s Java API to analyse tweets extracted via the <a href="http://twitter4j.org/?ref=aboullaite.me">Twitter4J</a> library. The backend server is developed using Spring Boot, and the frontend is built using ReactJS.<br>
As its main functionality, the application lets you, based on a keyword, either analyse and classify live Twitter stream data, or perform a search and post-analyse the tweets. The default behaviour is streaming mode, but you can easily switch to search mode with the click of a button!</p>
<h4 id="stanfordcorenlp">Stanford CoreNLP</h4>
<p>The Stanford CoreNLP is a Java natural language analysis library that provides statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, and can be incorporated into applications with human language technology needs.</p>
<p>Stanford CoreNLP integrates many NLP tools, including the <code>part-of-speech</code> (POS) tagger, the <code>named entity recognizer</code> (NER), the <code>parser</code>, the <code>coreference resolution</code> system and the <code>sentiment analysis</code> tools, and provides model files for analysing multiple languages.</p>
<p>The snippet below shows <code>analyse(String tweet)</code> method from <code>SentimentAnalyzerService</code> class which runs sentiment analysis on a single tweet, scores it from 0 to 4 based on whether the analysis comes back with <code>Very Negative</code>, <code>Negative</code>, <code>Neutral</code>, <code>Positive</code> or <code>Very Positive</code> respectively.</p>
<pre><code class="language-java">public int analyse(String tweet) {

        Properties props = new Properties();
        props.setProperty(&quot;annotators&quot;, &quot;tokenize, ssplit, pos, parse, sentiment&quot;);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = pipeline.process(tweet);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            return RNNCoreAnnotations.getPredictedClass(tree);
        }
        return 0;
    }
</code></pre>
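<p>The integer returned by <code>analyse</code> corresponds directly to CoreNLP&apos;s five sentiment classes. A small hypothetical helper (not part of the project&apos;s code) makes the mapping explicit, using a Java 14 switch expression:</p>
<pre><code class="language-java">public class SentimentLabel {
    // Maps the class predicted by the model (0 to 4) to its sentiment label
    static String label(int score) {
        return switch (score) {
            case 0 -&gt; &quot;Very Negative&quot;;
            case 1 -&gt; &quot;Negative&quot;;
            case 2 -&gt; &quot;Neutral&quot;;
            case 3 -&gt; &quot;Positive&quot;;
            case 4 -&gt; &quot;Very Positive&quot;;
            default -&gt; throw new IllegalArgumentException(&quot;Unknown class: &quot; + score);
        };
    }

    public static void main(String[] args) {
        System.out.println(label(3)); // Positive
    }
}
</code></pre>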
<h4 id="fetchingtweets">Fetching Tweets</h4>
<p>I made use of the popular open source java library Twitter4J to fetch tweets. It provides a convenient API for accessing the <a href="https://developer.twitter.com/en/docs?ref=aboullaite.me">Twitter API</a>.<br>
The <code>TwitterService</code> class contains the main methods interacting with Twitter API to search for tweets based on keywords:</p>
<ul>
<li><code>fetchTweets</code> builds a <code>Query</code> to search for tweets containing a specific keyword. It has a second parameter, <code>count</code>, which specifies the number of tweets to return per page, up to a maximum of 100. I also filter the search results to make sure no retweets or replies are returned.</li>
</ul>
<pre><code class="language-java">public Flux&lt;TwitterStatus&gt; fetchTweets(String keyword, int count) throws TwitterException {
        Twitter twitter = this.config.twitter(this.config.twitterFactory());
        Query query = new Query(keyword.concat(&quot; -filter:retweets -filter:replies&quot;));
        query.setCount(count);
        query.setLocale(&quot;en&quot;);
        query.setLang(&quot;en&quot;);
        return Flux.fromStream( twitter.search(query).getTweets().stream()).map(status -&gt; this.cleanTweets(status));

    }
</code></pre>
<ul>
<li><code>streamTweets</code> collects live tweets matching a specific keyword.</li>
</ul>
<pre><code class="language-java">    public Flux&lt;TwitterStatus&gt; streamTweets(String keyword){
        TwitterStream stream = config.twitterStream();
        FilterQuery tweetFilterQuery = new FilterQuery();
        tweetFilterQuery.track(new String[]{keyword});
        tweetFilterQuery.language(new String[]{&quot;en&quot;});
        return Flux.create(sink -&gt; {
            stream.onStatus(status -&gt; sink.next(this.cleanTweets(status)));
            stream.onException(sink::error);
            stream.filter(tweetFilterQuery);
            sink.onCancel(stream::shutdown);
        });
    }
</code></pre>
<p>Both methods fetch only tweets in English and return a <a href="https://github.com/reactor/reactor-core?ref=aboullaite.me">reactor</a> <code>Flux</code>, capable of emitting a stream of 0 or more items and then optionally either completing or erroring.</p>
<p>You may have noticed the call to <code>cleanTweets</code> before passing the tweets to the analyzer service. This method performs some cleanup on the tweet text, removing unneeded elements like links, hashtags, usernames ...</p>
<pre><code class="language-java">    private TwitterStatus cleanTweets(Status status){
        TwitterStatus twitterStatus = new TwitterStatus(status.getCreatedAt(), status.getId(), status.getText(), null, status.getUser().getName(), status.getUser().getScreenName(), status.getUser().getProfileImageURL());
        // Clean up tweets
        String text = status.getText().trim()
                // remove links
                .replaceAll(&quot;http.*?[\\S]+&quot;, &quot;&quot;)
                // remove usernames
                .replaceAll(&quot;@[\\S]+&quot;, &quot;&quot;)
                // replace hashtags by just words
                .replaceAll(&quot;#&quot;, &quot;&quot;)
                // correct all multiple white spaces to a single white space
                .replaceAll(&quot;[\\s]+&quot;, &quot; &quot;);
        twitterStatus.setText(text);
        twitterStatus.setSentimentType(analyzerService.analyse(text));
        return twitterStatus;
    }
</code></pre>
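<p>To illustrate, the same regex chain can be run standalone on a made-up sample tweet (the sample text and class name are just for the example):</p>
<pre><code class="language-java">public class TweetCleanupDemo {
    // The cleanup chain from cleanTweets, extracted into a standalone helper
    static String clean(String text) {
        return text.trim()
                .replaceAll(&quot;http.*?[\\S]+&quot;, &quot;&quot;) // remove links
                .replaceAll(&quot;@[\\S]+&quot;, &quot;&quot;)       // remove usernames
                .replaceAll(&quot;#&quot;, &quot;&quot;)             // keep hashtag words, drop the &apos;#&apos;
                .replaceAll(&quot;[\\s]+&quot;, &quot; &quot;);      // collapse multiple white spaces
    }

    public static void main(String[] args) {
        // Link, username and hashtag marker are stripped from the sample
        System.out.println(clean(&quot;Loving #Java at @devnexus https://devnexus.com&quot;));
    }
}
</code></pre>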
<h4 id="showingtheanalyzeddata">Showing the analyzed data</h4>
<p>Now that we have our backend service ready, the final step is to consume our resources. Both endpoints implement SSE (Server-Sent Events), an HTTP standard that allows a web application to handle a unidirectional event stream and receive updates whenever the server emits data.</p>
<p>I used ReactJs with Typescript to build the Web UI components and consume the exposed REST endpoints. The main component is <code>TweetList</code>, which handles the calls and shares data with the other components.</p>
<p>Once loaded, the component opens an event stream with the server, calling the <code>/stream</code> endpoint, looking for all tweets containing the <code>Java</code> keyword and saving them into an array. It runs the effect and cleans it up only once.</p>
<pre><code class="language-javascript">React.useEffect(() =&gt; {
    const eventSource = new EventSource(
      state.API_URL + &quot;stream/&quot; + state.hashtag
    );
    eventSource.onmessage = (event: any) =&gt; {
      const tweet = JSON.parse(event.data);
      let tweets = [...state.tweets, tweet];
      setState({ ...state, tweets: tweets });
    };
    eventSource.onerror = (event: any) =&gt; eventSource.close();
    setState({ ...state, eventSource: eventSource });
    return () =&gt; eventSource.close();
  }, []);
</code></pre>
<p>It keeps adding tweets to the array whenever a message is received from the server. This effect runs whenever <code>tweets</code>, <code>eventSource</code> or <code>hashtag</code> change.</p>
<pre><code class="language-javascript">  React.useEffect(() =&gt; {
    if (state.eventSource) {
      state.eventSource.onmessage = (event: any) =&gt; {
        const tweet = JSON.parse(event.data);
        let tweets = [...state.tweets, tweet];
        setState({ ...state, tweets: tweets });
      };
    }
  }, [state.tweets, state.eventSource, state.hashtag]);
</code></pre>
<p>Finally, the render function looks like below:</p>
<pre><code class="language-javascript">return (
    &lt;Row&gt;
      &lt;Col xs={12} md={8}&gt;
        &lt;Col md={10}&gt;
          &lt;h2&gt;
            Tracked Keyword:
            &lt;Badge variant=&quot;secondary&quot;&gt;{state.hashtag}&lt;/Badge&gt;
          &lt;/h2&gt;
        &lt;/Col&gt;
        &lt;Col md={2}&gt;
          &lt;Spinner animation=&quot;grow&quot; variant=&quot;primary&quot; /&gt;
        &lt;/Col&gt;
        &lt;form
          onSubmit={e =&gt; {
            e.preventDefault();
          }}
        &gt;
          &lt;div className=&quot;input-group mb-3&quot;&gt;
            &lt;input
              type=&quot;text&quot;
              name=&quot;hashtag&quot;
              value={state.hashtag}
              onChange={e =&gt; setState({ ...state, hashtag: e.target.value })}
              className=&quot;form-control&quot;
              placeholder={state.hashtag}
              aria-label={state.hashtag}
              aria-describedby=&quot;basic-addon2&quot;
            /&gt;
            &lt;div className=&quot;input-group-append&quot;&gt;
              &lt;Button
                variant=&quot;outline-primary&quot;
                type=&quot;submit&quot;
                onClick={() =&gt; {
                  setState({
                    ...state,
                    eventSource: newSearch(true, state, setState),
                    tweets: []
                  });
                }}
              &gt;
                Stream
              &lt;/Button&gt;
              &lt;Button
                variant=&quot;primary&quot;
                type=&quot;submit&quot;
                onClick={() =&gt; {
                  setState({
                    ...state,
                    eventSource: newSearch(false, state, setState),
                    tweets: []
                  });
                }}
              &gt;
                Search
              &lt;/Button&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/form&gt;
        &lt;div id=&quot;tweets&quot;&gt;
          {tweets
            .filter(tweet =&gt; tweet !== undefined)
            .reverse()
            .slice(0, 49)
            .map((tweet: Tweet) =&gt; (
              &lt;Alert
                key={tweet.id}
                variant={sentiment[tweet.sentimentType] as &quot;success&quot;}
              &gt;
                &lt;Alert.Heading&gt;
                  &lt;img src={tweet.profileImageUrl} /&gt;
                  &lt;a
                    href={&quot;https://twitter.com/&quot; + tweet.screenName}
                    className=&quot;text-muted&quot;
                  &gt;
                    {tweet.userName}
                  &lt;/a&gt;
                &lt;/Alert.Heading&gt;
                {tweet.originalText}
                &lt;hr /&gt;
                &lt;p className=&quot;mb-0&quot;&gt;
                  &lt;Moment fromNow&gt;{tweet.createdAt}&lt;/Moment&gt;
                &lt;/p&gt;
              &lt;/Alert&gt;
            ))}
        &lt;/div&gt;
      &lt;/Col&gt;
      &lt;Col xs={4} md={4}&gt;
        &lt;Desc tweets={tweets.length} /&gt;
        &lt;Doughnut tweets={tweets} /&gt;
        &lt;Color /&gt;
      &lt;/Col&gt;
    &lt;/Row&gt;
  );
</code></pre>
<h4 id="runningtheapp">Running the app</h4>
<p>Now, before running the app, make sure to update the <code>application.yaml</code> file with the required authentication keys that allow you to authenticate correctly when calling the Twitter API to retrieve tweets. You probably need to create a <a href="https://developer.twitter.com/?ref=aboullaite.me">Twitter developer account</a> and create an application.<br>
Afterward, start the backend server using <code>mvn spring-boot:run</code> and the frontend with <code>npm start</code>.</p>
<p>That&apos;s it folks! If you have any remarks or suggestions, leave them in the comments below or file a <a href="https://github.com/aboullaite/sentiment-analysis?ref=aboullaite.me">GitHub issue</a>.</p>
<hr>
<h4 id="ressource">Ressource:</h4>
<ul>
<li><a href="https://www.quora.com/How-does-the-sentiment-analysis-in-Stanford-NLP-work-Is-there-a-way-for-Stanford-NLP-to-take-the-overall-sentiment-of-multiple-sentences?ref=aboullaite.me">https://www.quora.com/How-does-the-sentiment-analysis-in-Stanford-NLP-work-Is-there-a-way-for-Stanford-NLP-to-take-the-overall-sentiment-of-multiple-sentences</a></li>
<li><a href="https://blog.openshift.com/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java/?ref=aboullaite.me">https://blog.openshift.com/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java/</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>