<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jeremylem.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jeremylem.github.io/" rel="alternate" type="text/html" /><updated>2026-02-18T12:00:37+00:00</updated><id>https://jeremylem.github.io/feed.xml</id><title type="html">Cognitive offloading</title><subtitle>Emptying the mind, one article at a time</subtitle><author><name>Jeremy L.</name></author><entry><title type="html">AWS KMS XKS Without an HSM: Threshold Cryptography Across Cloud Providers</title><link href="https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto.html" rel="alternate" type="text/html" title="AWS KMS XKS Without an HSM: Threshold Cryptography Across Cloud Providers" /><published>2026-02-17T00:00:00+00:00</published><updated>2026-02-17T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto.html"><![CDATA[<p>In a <a href="/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html">previous post</a>, I backed AWS KMS with a SoftHSM on my local machine via an SSH tunnel, based on the <a href="https://github.com/aws-samples/aws-kms-xks-proxy">AWS XKS proxy sample</a>. It proved the concept: the encryption key lives on hardware I control, and AWS KMS calls my proxy for every cryptographic operation.</p>

<p>This post takes a different approach. Instead of storing the AES key in one place and guarding it, I split the key so it never exists in any single location. Two independent services on two different cloud providers each hold a share. They cooperate to perform encryption, but neither can derive the key alone. No HSM required.</p>

<h2 id="the-problem-with-a-single-key-in-a-single-place">The Problem with a Single Key in a Single Place</h2>

<p>HSMs are expensive, complex to operate, and tied to a specific vendor and facility. If you want data sovereignty guarantees across European jurisdictions, you need HSMs in those jurisdictions, and you need to double the hardware for resilience, with all the procurement and compliance overhead that entails.</p>

<p>There is a way to make it simpler.</p>

<h2 id="split-key-across-cloud-providers">Split Key Across Cloud Providers</h2>

<p>The scheme is 2-of-3 threshold cryptography over the NIST P-256 elliptic curve. Three key shares are generated in a one-time ceremony. Any two shares can derive the AES-256 encryption key for a given key identifier. One share alone is mathematically useless.</p>
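<p>A minimal Python model of that ceremony (illustrative only – the deployed system is Rust, and the actual ceremony code is not shown in this post): shares are points on a random degree-1 polynomial over the P-256 scalar field, so any two of them interpolate the secret at zero.</p>

```python
# Sketch of the 2-of-3 ceremony (illustrative; not the deployed code).
import secrets

# Order of the P-256 base-point group (the scalar field).
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def make_shares(secret: int) -> dict[int, int]:
    """Split `secret` with threshold 2: share_i = f(i) for f(x) = secret + a*x."""
    a = secrets.randbelow(N - 1) + 1               # random non-zero slope
    return {i: (secret + a * i) % N for i in (1, 2, 3)}

def reconstruct(two_shares: dict[int, int]) -> int:
    """Lagrange-interpolate f(0) from any two shares."""
    (i, si), (j, sj) = two_shares.items()
    li = j * pow(j - i, -1, N) % N                 # coefficient of share i at x=0
    lj = i * pow(i - j, -1, N) % N
    return (li * si + lj * sj) % N

secret = secrets.randbelow(N)
shares = make_shares(secret)
assert reconstruct({1: shares[1], 2: shares[2]}) == secret   # normal pair
assert reconstruct({1: shares[1], 3: shares[3]}) == secret   # failover pairs
assert reconstruct({2: shares[2], 3: shares[3]}) == secret
```

<p>With a degree-1 polynomial, two points determine the line exactly, while a single share is consistent with every possible secret – which is why one share reveals nothing.</p>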

<p>For normal operation, two shares participate:</p>

<ul>
  <li><strong>Share 1</strong> lives in a Scaleway Serverless Container (fr-par, Paris). This is the XKS proxy that AWS KMS calls. Scaleway provides the HTTPS endpoint; the container runs plain HTTP behind Scaleway’s TLS termination.</li>
  <li><strong>Share 2</strong> lives on an Exoscale compute instance (ch-gva-2, Geneva). A minimal service that computes one elliptic curve operation per request, protected by mutual TLS.</li>
  <li><strong>Share 3</strong> is an offline backup. Stored securely, never deployed. Used only if one of the other two services is permanently lost.</li>
</ul>

<p>When AWS KMS needs to encrypt or decrypt, the XKS proxy computes its part locally, asks the EU service for the second part over mTLS, combines the two parts mathematically, and derives the AES-256 key. The key exists in memory for the duration of one request, then is zeroed.</p>
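<p>The last step of that flow – turning the combined curve point into a symmetric key – can be sketched with the standard library. The post only says HKDF then AES-GCM; the salt, info string, and encoded point below are hypothetical stand-ins, not the real parameters.</p>

```python
# Pure-stdlib HKDF-SHA256 (RFC 5869). The inputs marked "hypothetical"
# are illustrative values, not the deployed system's actual parameters.
import hashlib, hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()          # extract step
    okm, block = b"", b""
    for i in range((length + 31) // 32):                        # expand step
        block = hmac.new(prk, block + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

shared_point = bytes.fromhex("04" + "11" * 64)   # hypothetical encoded point
aes_key = hkdf_sha256(shared_point, salt=b"xks-threshold", info=b"encryption/key-1")
assert len(aes_key) == 32                        # AES-256 key material
```

<p>Because HKDF is deterministic, the same combined point always yields the same AES-256 key – the property KMS relies on for decrypt to match encrypt.</p>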

<p>The critical property: the private key is never reconstructed, not even transiently. Each party computes its own partial result on its own share. The proxy combines these partial results using Lagrange interpolation on the elliptic curve. The math guarantees this produces the same shared secret as if the full private key had been used, but no single party ever held that full key.</p>
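<p>Because the map from a scalar <code class="language-plaintext highlighter-rouge">s</code> to the point <code class="language-plaintext highlighter-rouge">s·V</code> is linear, plain modular scalars can stand in for curve points to check the combination step. This is a simplified model of the protocol, not the production code:</p>

```python
# Each party multiplies V by its own share; the proxy scales the partials
# by the public Lagrange coefficients and adds them. The full secret
# (l1*s1 + l2*s2) is never formed as a scalar anywhere.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def lagrange_at_zero(i: int, j: int) -> int:
    """Coefficient for share i when combining with share j (evaluated at x=0)."""
    return j * pow(j - i, -1, N) % N

secret, slope = 0xCAFE, 0xBEEF            # toy ceremony: f(x) = secret + slope*x
s1, s2 = (secret + slope * 1) % N, (secret + slope * 2) % N
V = 0x1234                                # stand-in for the hashed curve point

partial_1 = s1 * V % N                    # computed on Scaleway
partial_2 = s2 * V % N                    # computed on Exoscale
combined = (lagrange_at_zero(1, 2) * partial_1 +
            lagrange_at_zero(2, 1) * partial_2) % N

assert combined == secret * V % N         # same result as using the full key
```

<p>The final assertion is exactly the "math guarantees" claim: scaling and adding the partials gives <code class="language-plaintext highlighter-rouge">(λ₁s₁ + λ₂s₂)·V = secret·V</code>, yet neither party and not even the proxy ever evaluates <code class="language-plaintext highlighter-rouge">λ₁s₁ + λ₂s₂</code> itself.</p>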

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AWS KMS
  |
  |  HTTPS + SigV4
  v
Scaleway (fr-par)                      Exoscale (ch-gva-2)
  XKS Proxy — Holds: Share 1            EU Share Service — Holds: Share 2
  |                                      |
  | 1. V = hash_to_curve(keyId)          |
  | 2. partial_1 = share_1 * V           |
  |                                      |
  |------------ mTLS -----------------&gt; |
  |    "compute share_2 * V"             | 3. partial_2 = share_2 * V
  | &lt;----------------------------------- |
  |                                      |
  | 4. Lagrange combine partials         |
  | 5. HKDF -&gt; AES-256 key               |
  | 6. AES-GCM encrypt/decrypt           |
  | 7. Zeroize key                       |
  |
  v
AWS KMS (response)
</code></pre></div></div>

<p>The virtual point <code class="language-plaintext highlighter-rouge">V</code> is derived deterministically from the key identifier using hash-to-curve (RFC 9380). This means the same <code class="language-plaintext highlighter-rouge">keyId</code> always produces the same AES-256 key – a requirement for KMS, where encrypt and decrypt must use the same key.</p>
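<p>For intuition, here is the simpler “try-and-increment” way to hash a string onto P-256. The deployment uses RFC 9380, which is constant-time and considerably more involved; this sketch only illustrates the determinism:</p>

```python
# Illustrative only: hash the keyId to a candidate x-coordinate and bump
# it until x lies on P-256. Same keyId -> same point, every time.
import hashlib

# P-256 field prime and curve constant b (curve: y^2 = x^3 - 3x + b mod P)
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
B = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B

def hash_to_point(key_id: str) -> tuple[int, int]:
    x = int.from_bytes(hashlib.sha256(key_id.encode()).digest(), "big") % P
    while True:
        rhs = (pow(x, 3, P) - 3 * x + B) % P
        y = pow(rhs, (P + 1) // 4, P)          # sqrt attempt; valid since P % 4 == 3
        if y * y % P == rhs:                   # x is on the curve: done
            return x, y
        x = (x + 1) % P                        # otherwise try the next x

v = hash_to_point("encryption/key-1")
assert v == hash_to_point("encryption/key-1")  # deterministic, as KMS requires
```

<p>Roughly half of all x values have a matching y, so the loop terminates after a couple of tries on average – but its data-dependent timing is one reason real deployments use the RFC 9380 constructions instead.</p>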

<h2 id="what-each-party-sees">What Each Party Sees</h2>

<table>
  <thead>
    <tr>
      <th>Party</th>
      <th>Knows</th>
      <th>Cannot derive</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>XKS Proxy — Scaleway (Share 1)</td>
      <td>Its share, the virtual point, its own partial result</td>
      <td>The AES key without Share 2’s partial</td>
    </tr>
    <tr>
      <td>EU Share Service — Exoscale (Share 2)</td>
      <td>Its share, the virtual point, its own partial result</td>
      <td>The AES key, the plaintext, the ciphertext, anything about the AWS request</td>
    </tr>
    <tr>
      <td>AWS KMS</td>
      <td>Ciphertext, AAD</td>
      <td>The AES key (no shares, no partials)</td>
    </tr>
    <tr>
      <td>Attacker with 1 share</td>
      <td>One share scalar</td>
      <td>The AES key (need 2 of 3)</td>
    </tr>
  </tbody>
</table>

<p>The Exoscale service is deliberately minimal. It receives a point, multiplies it by its share scalar, and returns the result. It never sees plaintext, ciphertext, additional authenticated data, or the derived AES key. It doesn’t even know whether the request is for an encrypt or decrypt operation.</p>
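<p>The computational core of such a service fits in a few lines. This is a hedged Python sketch with textbook affine curve arithmetic – the real service is Rust behind mTLS, and the handler name is invented:</p>

```python
# Minimal model of the share service's work: validate that the incoming
# point is on P-256, multiply it by the share scalar, return the result.
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
A = P - 3
B = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B

def on_curve(pt) -> bool:
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P == 0

def ec_add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        m = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P   # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, -1, P) % P          # chord slope
    x3 = (m * m - x1 - x2) % P
    return x3, (m * (x1 - x3) - y1) % P

def ec_mul(k: int, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

def handle_partial(share: int, point):
    """One request: reject malformed points, return share * point."""
    assert on_curve(point), "reject malformed points"
    return ec_mul(share, point)

# P-256 base point as a sample input
G = (0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296,
     0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5)
assert on_curve(handle_partial(5, G))
```

<p>The on-curve check matters: answering scalar multiplications on attacker-chosen invalid points is a classic way to leak a secret scalar, so even a "deliberately minimal" service must validate its input.</p>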

<h2 id="what-this-protects-and-what-it-doesnt">What This Protects and What It Doesn’t</h2>

<p>Like the previous HSM-based approach, this is about <strong>data at rest</strong>. When you upload a file to S3 with KMS encryption, the file is encrypted before it hits storage. When you download it, S3 asks KMS to decrypt, KMS asks my proxy, my proxy cooperates with Exoscale, and the plaintext is returned through the chain. At each step of that chain, the data passes through systems in the clear – AWS sees the plaintext when it serves the GET request. There is no way around this when using server-side encryption.</p>

<p>The protection is against someone gaining access to the stored data without going through the live system. If S3 storage is exfiltrated, the encrypted objects are useless without the key. And the key cannot be derived without the cooperation of two independent services.</p>

<h3 id="the-monitoring-advantage">The monitoring advantage</h3>

<p>This is where split keys differ meaningfully from a single HSM. Every decryption operation requires the Exoscale service to participate. Every request is logged: which key path was used, when, and from which client. If someone – including the operator of the XKS proxy – tries to mass-decrypt stored data, the Exoscale service sees a sudden flood of requests. This is visible in Exoscale’s logs, independently of anything happening on the AWS or Scaleway side.</p>

<p>Here are the actual logs from the deployed system. First, the Scaleway XKS proxy starting up and handling requests:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Scaleway XKS Proxy (fr-par)
18:07:01  Share store initialized, share_id=1, key_count=5
18:07:01  EU client mTLS configured (client cert + custom CA)
18:07:01  EU client initialized, base_url=https://92.39.60.218:8443
18:07:01  XKS Proxy listening, addr=0.0.0.0:8080
18:20:57  Encrypt   kmsRequestId=f6be4217-...  key_id=test-key-1
18:21:06  Decrypt   kmsRequestId=b836a273-...  key_id=test-key-1
18:34:23  Decrypt   kmsRequestId=544d2f24-...  key_id=test-key-1
18:34:26  Decrypt   kmsRequestId=45b3ecaa-...  key_id=test-key-1
18:35:34  Decrypt   kmsRequestId=1ab54f6b-...  key_id=test-key-1
</code></pre></div></div>

<p>And the Exoscale EU share service, which sees only partial ECDH requests — no plaintext, no ciphertext, no indication of whether it’s an encrypt or decrypt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Exoscale EU Share Service (ch-gva-2)
17:02:54  Share store initialized, share_id=2, key_count=5
17:02:54  EU Share Service listening (mTLS), addr=0.0.0.0:8443
17:20:57  Partial ECDH request  request_id=6a7621ce772b8015  key_path=encryption/key-1
17:21:06  Partial ECDH request  request_id=1d5d05530cdc70a4  key_path=encryption/key-1
17:34:23  Partial ECDH request  request_id=8defb91565227a92  key_path=encryption/key-1
17:34:26  Partial ECDH request  request_id=ed1cf2052a008935  key_path=encryption/key-1
17:35:34  Partial ECDH request  request_id=22c75b772d54ffaa  key_path=encryption/key-1
</code></pre></div></div>

<p>Every operation leaves a trace on both providers independently. The KMS request UUID from AWS flows through to the Scaleway proxy logs. The Exoscale service logs the key derivation path but has no visibility into what the operation is for.</p>

<p>More importantly, the Exoscale operator can act on anomalies. If they see thousands of partial computation requests in a few minutes for key paths that normally see a handful per hour, they can shut down the instance. The XKS proxy still holds its share, but one share alone is useless – mathematically, it reveals nothing about the key.</p>
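<p>A sliding-window counter over the service's own logs is enough to catch that pattern. The sketch below is hypothetical and not part of the deployed system; the window and threshold values are made up:</p>

```python
# Per-key-path request rate check: flag any path whose request count in
# the sliding window exceeds a threshold. (Illustrative values only.)
from collections import deque

class RateMonitor:
    def __init__(self, window_s: float, max_requests: int):
        self.window_s, self.max_requests = window_s, max_requests
        self.events: dict[str, deque] = {}

    def record(self, key_path: str, t: float) -> bool:
        """Record one request at time t; return True if the path looks anomalous."""
        q = self.events.setdefault(key_path, deque())
        q.append(t)
        while q and t - q[0] > self.window_s:   # expire events outside the window
            q.popleft()
        return len(q) > self.max_requests

mon = RateMonitor(window_s=60.0, max_requests=100)
assert not mon.record("encryption/key-1", 0.0)          # a handful per hour: fine
flags = [mon.record("encryption/key-1", 0.1 * i) for i in range(200)]
assert flags[-1]                                        # a flood in seconds: flagged
```

<p>The operator's response to a flag can be as blunt as stopping the instance – which, per the threshold property, immediately revokes the key.</p>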

<p>This is a property that no single-HSM setup provides. When the key is in one place, whoever controls that place can silently decrypt everything. With split keys, any large-scale decryption is necessarily visible to the other party, and either party can pull the plug.</p>

<h2 id="performance-the-250ms-budget">Performance: The 250ms Budget</h2>

<p>AWS KMS requires the XKS proxy to respond within 250ms. The threshold protocol adds a network round-trip to a second cloud provider in a different country. Can it stay within budget?</p>

<p>The XKS proxy runs on Scaleway Serverless Containers (fr-par, Paris). The EU share service runs on Exoscale (ch-gva-2, Geneva). Paris to Geneva is ~340km, with network round-trips typically 5-15ms.</p>

<h3 id="warm-requests">Warm requests</h3>

<p>With connection reuse (the typical case after the first request), the round-trip to Geneva is fast. The elliptic curve operations take about 1ms on each side. The bottleneck is network latency, not computation. All operations stay well within the 250ms budget.</p>

<h3 id="cold-start">Cold start</h3>

<p>The Scaleway container starts in under a second, and periodic KMS health checks keep it warm. Both the XKS proxy and the EU share service are static Rust binaries in distroless Docker images, so there is little to initialize: on startup the proxy loads its share store, configures the mTLS client to Exoscale, and begins listening.</p>

<h3 id="exoscale-service">Exoscale service</h3>

<p>The EU share service runs on an Exoscale compute instance with rustls handling mTLS directly (no TLS termination proxy). It performs a single scalar-point multiplication per request (~1ms of computation). The mTLS connection ensures only the XKS proxy – holding the correct client certificate signed by the pinned CA – can reach it.</p>

<h2 id="the-sovereignty-argument">The Sovereignty Argument</h2>

<p>The real value isn’t performance – it’s the trust model.</p>

<p>With a traditional HSM-backed XKS, the key is in one place. Whoever controls that place controls the key. With threshold cryptography, the key is split across providers:</p>

<ul>
  <li><strong>No single cloud provider</strong> can derive the key. Scaleway holds one share but needs Exoscale’s cooperation. Exoscale holds one share but never sees the key or the data. AWS holds no shares at all.</li>
  <li><strong>Revocation is instant.</strong> Shut down the Exoscale instance, and AWS KMS can no longer encrypt or decrypt. The kill switch is a single command to a Swiss provider that has no relationship with AWS or Scaleway.</li>
  <li><strong>No HSM vendor lock-in.</strong> The system is pure software. The shares are 34-byte values. They can be deployed to any compute platform that runs a Rust binary.</li>
</ul>

<h2 id="three-providers-three-countries">Three Providers, Three Countries</h2>

<p>The current deployment already spans two providers and two countries: Scaleway (Paris, France) for the XKS proxy and Exoscale (Geneva, Switzerland) for the EU share service. But the 2-of-3 threshold scheme is designed for three:</p>

<table>
  <thead>
    <tr>
      <th>Share</th>
      <th>Provider</th>
      <th>Location</th>
      <th>Jurisdiction</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Share 1</td>
      <td>Scaleway Serverless Containers</td>
      <td>fr-par (Paris)</td>
      <td>France</td>
      <td>Active — XKS proxy</td>
    </tr>
    <tr>
      <td>Share 2</td>
      <td>Exoscale Compute</td>
      <td>ch-gva-2 (Geneva)</td>
      <td>Switzerland</td>
      <td>Active — EU share service</td>
    </tr>
    <tr>
      <td>Share 3</td>
      <td>IONOS or Hetzner</td>
      <td>Frankfurt</td>
      <td>Germany</td>
      <td>Cold standby</td>
    </tr>
  </tbody>
</table>

<p>Any two of the three can derive the key. This gives you:</p>

<ul>
  <li><strong>Provider resilience.</strong> If one provider has an outage, the other two can still operate. Reconfigure the XKS proxy to call the remaining provider. Since the Lagrange interpolation is parameterized by share IDs, switching from Share 2 to Share 3 is a configuration change, not a key change. The derived AES key is identical regardless of which two shares are used.</li>
  <li><strong>Jurisdictional resilience.</strong> If one country’s regulator orders a provider to freeze or hand over data, they get one share – which is mathematically useless without a second share from a different jurisdiction. The other two providers can continue operating.</li>
  <li><strong>No single point of legal compulsion.</strong> A French court order to Scaleway yields Share 1. A Swiss court order to Exoscale yields Share 2. Neither alone produces the key. Compelling two jurisdictions simultaneously requires international cooperation – a significantly higher bar than a single domestic order.</li>
  <li><strong>Mutual oversight.</strong> Each share service logs every request. Unusual decryption patterns are visible to at least two independent operators in two different countries. A government quietly compelling mass decryption through one provider is immediately visible to the other.</li>
</ul>

<h3 id="cross-border-latency">Cross-border latency</h3>

<p>Paris to Geneva is ~340km. Paris to Frankfurt is ~480km. Network round-trips between these cities are typically 5-15ms.</p>

<p>The current Paris-Geneva deployment already works within the 250ms budget. Adding a third share in Frankfurt would not change the latency significantly – the XKS proxy only calls one partner per request, and Paris to Frankfurt is a comparable distance.</p>

<h3 id="swiss-neutrality">Swiss neutrality</h3>

<p>Switzerland is particularly interesting as a share location. It’s not an EU member state, so it’s outside the reach of EU-wide data access orders. Swiss data protection law (nDSG/LPD) is independently strong. Providers like Infomaniak and Exoscale specifically market sovereignty and data localization. A share stored with a Swiss provider adds jurisdictional diversity that goes beyond having multiple EU locations.</p>

<h3 id="the-operational-model">The operational model</h3>

<p>In normal operation, only two shares are active. The third is cold standby. If the XKS proxy needs to switch to a different partner:</p>

<ol>
  <li>Activate the standby share service (deploy the container to the standby provider)</li>
  <li>Update the XKS proxy’s configuration: new URL and the new share ID</li>
  <li>The Lagrange coefficients adjust automatically based on the share IDs involved</li>
  <li>The derived AES key is identical – the math guarantees it</li>
</ol>
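<p>Steps 3 and 4 can be checked numerically (toy scalars standing in for curve points, since <code class="language-plaintext highlighter-rouge">s·V</code> is linear in <code class="language-plaintext highlighter-rouge">s</code>): the Lagrange coefficients depend only on which share IDs participate, and any pair of shares derives the same value.</p>

```python
# Failover invariance check: combining shares {1,2} and shares {1,3}
# yields the identical derived value. Toy numbers, not real key material.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def combine(pair: dict[int, int], v: int) -> int:
    """Lagrange-combine two partials share_i * v at x=0."""
    (i, si), (j, sj) = pair.items()
    li = j * pow(j - i, -1, N) % N   # coefficients recomputed from the IDs
    lj = i * pow(i - j, -1, N) % N
    return (li * si * v + lj * sj * v) % N

secret, slope, v = 0x5151, 0xA0A0, 0x7777          # toy ceremony values
shares = {i: (secret + slope * i) % N for i in (1, 2, 3)}

k_12 = combine({1: shares[1], 2: shares[2]}, v)    # normal operation
k_13 = combine({1: shares[1], 3: shares[3]}, v)    # failover to Share 3
assert k_12 == k_13 == secret * v % N              # identical derived value
```

<p>Swapping in the standby share is therefore invisible to AWS KMS: nothing about the external key changes, only the proxy's partner configuration.</p>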

<h2 id="cost">Cost</h2>

<p>The entire system runs on near-free infrastructure:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Scaleway Serverless Container (XKS proxy)</td>
      <td>Free tier (400k GB-s/month)</td>
    </tr>
    <tr>
      <td>Exoscale Compute (EU share service)</td>
      <td>~$15/month (standard.small)</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$15/month</strong></td>
    </tr>
  </tbody>
</table>

<p>Scaleway provides the HTTPS endpoint for free with its serverless containers. The Exoscale VM is the only fixed cost.</p>

<p>Compare this to a Thales Luna HSM ($15-25k/year) or a cloud-managed HSM billed per key per hour. The threshold approach costs almost nothing at low to moderate volumes.</p>

<h2 id="what-this-does-not-replace">What This Does Not Replace</h2>

<p>This is not FIPS 140-3 Level 3. There is no tamper-resistant hardware. If you need a compliance checkbox that says “FIPS”, you need an HSM.</p>

<p>And as with any server-side encryption scheme, the data passes through AWS in the clear during normal operations. AWS could retain plaintext, log decrypted data, or be compelled to do so. This is inherent to the model: you’re asking AWS to encrypt and decrypt on your behalf.</p>

<p>What threshold cryptography adds is control over the key at rest and visibility into its use. The key cannot be derived without active cooperation from two independent parties on two different cloud providers in two different countries. Unusual access patterns are visible to both. And either party can revoke access instantly by shutting down their service.</p>

<p>For the actual security properties most people want from external key management – control over key material, revocation capability, multi-jurisdiction sovereignty, defense against silent key extraction – threshold cryptography provides stronger guarantees than a single HSM. A compromised HSM exposes all its keys. A compromised share exposes nothing.</p>

<h2 id="regulatory-landscape">Regulatory Landscape</h2>

<p>The threshold approach maps to several EU regulations, existing and upcoming.</p>

<h3 id="dora-digital-operational-resilience-act">DORA (Digital Operational Resilience Act)</h3>

<p>DORA Article 7 (RTS) requires key lifecycle management, controls against loss and unauthorized access proportional to risk, key replacement procedures, and certificate registers. The threshold approach addresses each:</p>

<ul>
  <li><strong>Loss protection</strong> – 2-of-3 means losing one share doesn’t lose the key. Offline backup share provides recovery.</li>
  <li><strong>Unauthorized access</strong> – no single party holds the full key. Compromising one provider reveals nothing mathematically useful.</li>
  <li><strong>Key replacement</strong> – re-run the ceremony, generate new shares, redeploy.</li>
  <li><strong>Third-party concentration risk</strong> (DORA Art. 28+) – shares on separate providers in separate jurisdictions directly addresses this. DORA specifically calls out concentration risk with single ICT providers.</li>
  <li><strong>Auditability</strong> – every operation logged independently on each provider.</li>
  <li><strong>Revocability</strong> – either party can shut down their service instantly.</li>
</ul>

<h3 id="nis2-directive">NIS2 Directive</h3>

<p>Article 21.2(h) requires encryption and key management proportional to risk for essential and important entities (energy, healthcare, transport, digital infrastructure). AES-256 meets the minimum. Splitting the key across jurisdictions is a stronger control than storing it in one place.</p>

<h3 id="gdpr">GDPR</h3>

<p>Article 32 requires “appropriate technical measures” including encryption. Article 34 says if data is encrypted and the key is not exposed, breach notification to individuals may not be required. With threshold splitting, the cloud provider never holds the key – they are a processor with no access to key material. If S3 storage is breached, the encrypted objects are useless without cooperation of two independent parties.</p>

<h3 id="eu-data-act">EU Data Act</h3>

<p>Article 28 (applicable since September 2025) requires cloud providers to take “all reasonable measures, including encryption” to prevent unlawful access to data, especially from third-country government requests. This is where threshold splitting has its strongest argument.</p>

<p>The US CLOUD Act allows US law enforcement to compel American companies to hand over data stored abroad. But the CLOUD Act is encryption-neutral – it does not compel providers to decrypt data they cannot decrypt. A subpoena to AWS yields only ciphertext: AWS holds no shares and no key material.</p>

<h3 id="eucs-european-cybersecurity-certification-scheme-for-cloud-services">EUCS (European Cybersecurity Certification Scheme for Cloud Services)</h3>

<p>Still being finalized by ENISA. Earlier drafts included a “high+” level requiring encryption keys to be held outside the cloud provider’s control and within EU jurisdiction. Even though sovereignty requirements were dropped from the latest draft, individual member states can still impose them. The threshold approach is ready for the strictest interpretation: keys are split across European providers, none held by a US-subject entity.</p>

<hr />]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[In a previous post, I backed AWS KMS with a SoftHSM on my local machine via an SSH tunnel, based on the AWS XKS proxy sample. It proved the concept: the encryption key lives on hardware I control, and AWS KMS calls my proxy for every cryptographic operation.]]></summary></entry><entry><title type="html">AWS KMS External Key Store: Keep Your Encryption Keys Out of the Cloud</title><link href="https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html" rel="alternate" type="text/html" title="AWS KMS External Key Store: Keep Your Encryption Keys Out of the Cloud" /><published>2026-02-11T00:00:00+00:00</published><updated>2026-02-11T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html"><![CDATA[<p>A proof of concept demonstrating AWS KMS encryption backed by a local SoftHSM key, using the <a href="https://docs.aws.amazon.com/kms/latest/developerguide/keystore-external.html">XKS Proxy API</a>. The cryptographic master key lives on a machine I fully own, not in AWS, not in a managed service, but on my personal hardware.</p>

<h2 id="motivation">Motivation</h2>

<p>I wanted to back AWS KMS with a hardware security module under my physical control. SoftHSM on a local machine serves as a stand-in for a real HSM for this POC. The EC2 instance and SSH reverse tunnel exist only because AWS KMS requires an HTTPS endpoint with a valid TLS certificate to reach the XKS proxy. The actual key material never leaves the local machine.</p>

<hr />

<h2 id="architecture">Architecture</h2>

<h3 id="the-intended-model">The intended model</h3>

<p>The idea behind XKS is straightforward: you run an HSM on your premises, front it with a service layer that implements the <a href="https://github.com/aws-samples/aws-kms-xks-proxy">XKS Proxy API</a>, and AWS KMS calls that proxy whenever it needs to encrypt or decrypt. The proxy talks to any Hardware Security Module that supports PKCS#11 v2.40 – Thales Luna, Entrust nShield, or anything else that speaks the standard. AWS publishes the API spec and a reference implementation in Rust, so you can build and run your own.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S3 (SSE-KMS) → KMS → XKS Proxy → HSM (PKCS#11 v2.40)
</code></pre></div></div>

<p>In production, the proxy sits next to the HSM on the same network. Simple.</p>

<h3 id="my-workaround">My workaround</h3>

<p>I don’t have an HSM or a server with a public IP and a TLS certificate. So I stitched together a workaround to validate the concept: SoftHSM on my Mac as the HSM stand-in, an EC2 instance running the XKS proxy behind an ALB for TLS termination, and p11-kit remoting the PKCS#11 calls back to my machine over an SSH reverse tunnel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S3 (SSE-KMS) → KMS → XKS Proxy (EC2) → p11-kit client
                                              ↓
                                     SSH reverse tunnel
                                              ↓
                                     p11-kit server (Mac) → SoftHSM
</code></pre></div></div>

<p>Not what you’d run in production. But it proves the point: when S3 encrypts an object, the actual cryptographic operation happens on my machine, with a key that never leaves it.</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Location</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>xks-proxy</td>
      <td>EC2 (aarch64, t4g.nano)</td>
      <td><a href="https://github.com/aws-samples/aws-kms-xks-proxy">AWS reference XKS proxy</a> (Rust)</td>
    </tr>
    <tr>
      <td>ALB</td>
      <td>AWS</td>
      <td>TLS termination with valid certificate</td>
    </tr>
    <tr>
      <td>p11-kit 0.26.2</td>
      <td>EC2 + Mac</td>
      <td>PKCS#11 remoting over Unix sockets</td>
    </tr>
    <tr>
      <td>SoftHSM v2</td>
      <td>Mac (Homebrew)</td>
      <td>Software HSM holding the AES-256 key</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="setup">Setup</h2>

<h3 id="create-the-softhsm-aes-256-key">Create the SoftHSM AES-256 key</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>softhsm2-util <span class="nt">--init-token</span> <span class="nt">--slot</span> 0 <span class="nt">--label</span> foo <span class="nt">--pin</span> 1234 <span class="nt">--so-pin</span> 0000
softhsm2-util <span class="nt">--show-slots</span>

pkcs11-tool <span class="nt">--module</span> /opt/homebrew/lib/softhsm/libsofthsm2.so <span class="se">\</span>
  <span class="nt">--login</span> <span class="nt">--pin</span> 1234 <span class="se">\</span>
  <span class="nt">--keygen</span> <span class="nt">--key-type</span> AES:32 <span class="nt">--label</span> foo <span class="se">\</span>
  <span class="nt">--token-label</span> foo
</code></pre></div></div>

<p>Expected output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Secret Key Object; AES length 32
  label:      foo
  Usage:      encrypt, decrypt, sign, verify, wrap, unwrap
  Access:     never extractable, local
</code></pre></div></div>

<h3 id="deploy-the-infrastructure">Deploy the infrastructure</h3>

<p>A single <code class="language-plaintext highlighter-rouge">create.sh</code> script handles everything: cross-compiling xks-proxy for aarch64 via <code class="language-plaintext highlighter-rouge">cargo zigbuild</code>, deploying CloudFormation (ALB, EC2, Route53, CloudWatch), SCPing the binary and config to EC2, and starting the systemd service.</p>

<h3 id="the-p11-kit-version-issue">The p11-kit version issue</h3>

<p>This is where I spent most of my time. The AL2023 repo ships p11-kit <strong>0.24.1</strong>, which lacks AES-GCM RPC serialization entirely; every encrypt/decrypt call fails with <code class="language-plaintext highlighter-rouge">CKR_MECHANISM_INVALID</code>.</p>

<table>
  <thead>
    <tr>
      <th>Version</th>
      <th>AES-GCM RPC</th>
      <th>Object Handles</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0.24.1 (AL2023 repo)</td>
      <td>No</td>
      <td>N/A</td>
      <td><code class="language-plaintext highlighter-rouge">CKR_MECHANISM_INVALID</code></td>
    </tr>
    <tr>
      <td>0.25.x</td>
      <td>Yes</td>
      <td>Broken across sessions</td>
      <td><code class="language-plaintext highlighter-rouge">CKR_OBJECT_HANDLE_INVALID</code></td>
    </tr>
    <tr>
      <td><strong>0.26.2</strong></td>
      <td><strong>Yes</strong></td>
      <td><strong>Works</strong></td>
      <td><strong>Working</strong></td>
    </tr>
  </tbody>
</table>

<p>Version 0.26.2 must be built from source on EC2. And it must match on both ends – a version mismatch between client (EC2) and server (Mac) causes RPC protocol errors.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># On EC2 (t4g.nano, 512MB RAM)</span>
<span class="nb">sudo </span>dnf <span class="nb">install</span> <span class="nt">-y</span> meson ninja-build gcc libtasn1-devel libffi-devel
curl <span class="nt">-sL</span> https://github.com/p11-glue/p11-kit/releases/download/0.26.2/p11-kit-0.26.2.tar.xz | <span class="nb">tar </span>xJ
<span class="nb">cd </span>p11-kit-0.26.2
meson setup _build <span class="nt">--prefix</span><span class="o">=</span>/usr <span class="nt">--libdir</span><span class="o">=</span>/usr/lib64
ninja <span class="nt">-C</span> _build <span class="nt">-j1</span>    <span class="c"># -j1 required: t4g.nano OOMs on parallel builds</span>
<span class="nb">sudo </span>ninja <span class="nt">-C</span> _build <span class="nb">install</span>
</code></pre></div></div>

<h3 id="start-the-pkcs11-tunnel">Start the PKCS#11 tunnel</h3>

<p>Two terminals on the Mac:</p>

<p><strong>Terminal 1 – p11-kit server:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p11-kit server <span class="nt">--provider</span> /opt/homebrew/lib/softhsm/libsofthsm2.so <span class="s2">"pkcs11:"</span>
</code></pre></div></div>

<p><strong>Terminal 2 – SSH reverse tunnel:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">P11_KIT_SERVER_ADDRESS</span><span class="o">=</span>unix:path<span class="o">=</span>/var/folders/.../pkcs11-XXXX

ssh <span class="nt">-i</span> ~/Downloads/EC2Tutorial2.pem <span class="se">\</span>
  <span class="nt">-R</span> /home/ec2-user/.p11-kit.sock:<span class="k">${</span><span class="nv">P11_KIT_SERVER_ADDRESS</span><span class="p">#unix</span>:path<span class="p">=</span><span class="k">}</span> <span class="se">\</span>
  ec2-user@&lt;EC2_IP&gt;
</code></pre></div></div>

<p>This forwards the EC2 Unix socket to the Mac’s p11-kit server socket. From EC2’s perspective, PKCS#11 operations on <code class="language-plaintext highlighter-rouge">/home/ec2-user/.p11-kit.sock</code> transparently reach SoftHSM on the Mac.</p>

<h3 id="test-end-to-end">Test end-to-end</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Health check</span>
curl https://xks.lemaire.tel/ping

<span class="c"># Upload with XKS encryption</span>
<span class="nb">echo</span> <span class="s2">"hello from softhsm"</span> <span class="o">&gt;</span> /tmp/test.txt
aws s3 <span class="nb">cp</span> /tmp/test.txt s3://xks-proxy-poc-test/test.txt <span class="se">\</span>
  <span class="nt">--region</span> eu-west-3 <span class="se">\</span>
  <span class="nt">--sse</span> aws:kms <span class="se">\</span>
  <span class="nt">--sse-kms-key-id</span> cd0608a9-0726-4187-b1b1-d0b08370d8f9

<span class="c"># Download (decryption goes through SoftHSM)</span>
aws s3 <span class="nb">cp </span>s3://xks-proxy-poc-test/test.txt /tmp/downloaded.txt <span class="nt">--region</span> eu-west-3
<span class="nb">cat</span> /tmp/downloaded.txt
<span class="c"># hello from softhsm</span>
</code></pre></div></div>

<hr />

<h2 id="a-rust-undefined-behavior-bug-in-the-xks-proxy">A Rust Undefined Behavior Bug in the XKS Proxy</h2>

<p>The interesting problem I hit wasn’t infrastructure: it was an optimization-triggered bug in the AWS reference implementation.</p>

<p>The <code class="language-plaintext highlighter-rouge">GetKeyMetadata</code> handler declares stack variables as immutable (<code class="language-plaintext highlighter-rouge">let key_type = 0;</code>) and passes pointers to them via <code class="language-plaintext highlighter-rouge">set_ck_ulong()</code> for <code class="language-plaintext highlighter-rouge">C_GetAttributeValue</code> to write into. The C function writes the correct values (e.g., <code class="language-plaintext highlighter-rouge">key_type=31</code> for CKK_AES), but the Rust compiler’s release-mode optimizer treats the immutable bindings as compile-time constants and inlines <code class="language-plaintext highlighter-rouge">0</code> wherever they’re subsequently read.</p>

<p>Impact: <code class="language-plaintext highlighter-rouge">keyspec(0, 0)</code> = <code class="language-plaintext highlighter-rouge">"RSA_0"</code> instead of <code class="language-plaintext highlighter-rouge">keyspec(31, 32)</code> = <code class="language-plaintext highlighter-rouge">"AES_256"</code>. KMS rejected the key metadata.</p>

<p>This only manifests in release builds. Debug builds work fine because the optimizer doesn’t inline the constants.</p>
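<p>For intuition, here is a hedged Python reconstruction of the mapping (the <code>keyspec</code> helper and constant names are assumptions about the proxy’s internals; the PKCS#11 values CKK_RSA = 0 and CKK_AES = 0x1F are standard):</p>

```python
# Hypothetical sketch of the proxy's keyspec logic; names are illustrative.
CKK_RSA = 0x00  # PKCS#11 key type for RSA
CKK_AES = 0x1F  # PKCS#11 key type for AES (decimal 31)

def keyspec(key_type: int, key_size: int) -> str:
    """Map a PKCS#11 key type plus size (bytes, for AES) to a KMS-style key spec."""
    if key_type == CKK_AES:
        return f"AES_{key_size * 8}"
    if key_type == CKK_RSA:
        return f"RSA_{key_size * 8}"
    return f"UNKNOWN_{key_type}"

print(keyspec(31, 32))  # AES_256: the metadata KMS expects
print(keyspec(0, 0))    # RSA_0: what the mis-optimized release build produced
```

<p>With the bug, both values read back as <code>0</code>, so KMS saw <code>RSA_0</code> and rejected the key.</p>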

<p><strong>To fix:</strong> Force the compiler to re-read from actual memory after the C call:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">key_type</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">(</span><span class="o">&amp;</span><span class="n">key_type</span><span class="p">)</span> <span class="p">};</span>
<span class="k">let</span> <span class="n">key_size</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">(</span><span class="o">&amp;</span><span class="n">key_size</span><span class="p">)</span> <span class="p">};</span>
</code></pre></div></div>

<p>The root cause is Rust undefined behavior: writing through a raw pointer derived from an immutable reference. The proper long-term fix would be <code class="language-plaintext highlighter-rouge">UnsafeCell</code> or <code class="language-plaintext highlighter-rouge">MaybeUninit</code> in the <code class="language-plaintext highlighter-rouge">rust-pkcs11</code> crate’s <code class="language-plaintext highlighter-rouge">CK_ATTRIBUTE::set_ck_ulong</code> implementation.</p>

<hr />

<h2 id="what-xks-does-and-does-not-protect">What XKS Does and Does Not Protect</h2>

<h3 id="xks-protects-against">XKS protects against</h3>

<ul>
  <li><strong>Future unauthorized access by AWS.</strong> Disconnect the tunnel, shut down the proxy, and AWS cannot decrypt data going forward.</li>
  <li><strong>Regulatory and sovereignty requirements.</strong> Cryptographic master keys remain under your control, with an audit trail of all key operations.</li>
  <li><strong>Cloud provider key management concerns.</strong> The master key material stays entirely outside AWS.</li>
</ul>

<h3 id="xks-does-not-protect-against">XKS does not protect against</h3>

<ul>
  <li><strong>AWS actively retaining or exfiltrating data.</strong> They could copy plaintext before encryption or retain data encryption keys.</li>
  <li><strong>Legal compulsion.</strong> If DEKs were retained in AWS systems, AWS could be compelled to produce them.</li>
  <li><strong>Retrospective decryption.</strong> If AWS legitimately decrypted an object to serve a GET request, they could have retained the plaintext.</li>
</ul>

<h3 id="bottom-line">Bottom line</h3>

<p>External key management is about <strong>governance</strong>, <strong>operational control</strong>, and <strong>trust reduction</strong>. It is not about protecting against a malicious provider – they see plaintext anyway. If the cloud provider itself is your threat model, encrypt client-side before uploading, or don’t use the cloud.</p>

<hr />

<h2 id="cloud-provider-comparison">Cloud Provider Comparison</h2>

<h3 id="aws--external-key-store-xks">AWS – External Key Store (XKS)</h3>

<p><strong>DIY implementation: YES.</strong> AWS publishes the <a href="https://github.com/aws/aws-kms-xksproxy-api-spec">XKS Proxy API specification</a> and a reference implementation. You can build your own proxy backed by any PKCS#11-compatible HSM or key manager.</p>

<h3 id="gcp--cloud-external-key-manager-ekm">GCP – Cloud External Key Manager (EKM)</h3>

<p><strong>DIY implementation: NO.</strong> No public API specification, no reference implementation. You must use certified partner solutions (Thales CipherTrust, Fortanix DSM).</p>

<h3 id="azure--no-external-key-store-equivalent">Azure – No external key store equivalent</h3>

<p><strong>DIY implementation: N/A.</strong> Azure offers BYOK (keys end up in Azure), Managed HSM (single-tenant, still in Azure), and Dedicated HSM (Thales Luna in Azure). No equivalent to XKS or EKM where keys remain outside the cloud.</p>

<hr />

<h2 id="source-code">Source Code</h2>

<p>Available on GitHub: <a href="https://github.com/jlemaire/aws-kms-xks-poc">aws-kms-xks-poc</a></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws-kms-xks-poc/
  cloudformation.yaml          # EC2 + ALB + Route53 + CloudWatch
  configuration/settings.toml  # xks-proxy config
  create.sh                    # One-shot deploy script
</code></pre></div></div>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[A proof of concept demonstrating AWS KMS encryption backed by a local SoftHSM key, using the XKS Proxy API. The cryptographic master key lives on a machine I fully own, not in AWS, not in a managed service, but on my personal hardware.]]></summary></entry><entry><title type="html">Virtual Me 2.0: 21x Faster Cold Starts with Swift and S3 Vectors</title><link href="https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors.html" rel="alternate" type="text/html" title="Virtual Me 2.0: 21x Faster Cold Starts with Swift and S3 Vectors" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors.html"><![CDATA[<h2 id="the-evolution">The Evolution</h2>

<p>A couple of weeks ago, I built <a href="https://chat.lemaire.tel">Virtual Me</a>, a RAG-powered chatbot that answers questions about my professional experience. It worked, but I wasn’t fully satisfied.</p>

<p><strong>The v1.0 stack:</strong></p>
<ul>
  <li>Python Lambda with custom DynamoDB vector store</li>
  <li>LangGraph orchestration</li>
  <li>Manual embedding generation and chunking</li>
  <li>4-second cold starts</li>
  <li>1,488 lines of application code</li>
</ul>

<p>It was over-engineered to avoid using ElasticSearch.</p>

<p>So I rebuilt it from scratch. <strong>Virtual Me 2.0</strong> is now simpler, faster, and cheaper.</p>

<hr />

<h2 id="what-changed">What Changed</h2>

<h3 id="architecture-simplification">Architecture Simplification</h3>

<p><strong>Before (v1.0):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lambda (Python) → Custom Vector Search (DynamoDB) → Bedrock
</code></pre></div></div>

<p><strong>After (v2.0):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lambda (Swift) → Bedrock Knowledge Base → S3 Vectors → Bedrock
</code></pre></div></div>

<ul>
  <li><strong>S3 Vectors</strong> replaced my custom hacked DynamoDB similarity search</li>
  <li><strong>Bedrock Knowledge Base</strong> handles document ingestion, chunking, and embedding automatically</li>
  <li><strong>Swift runtime</strong> replaced Python for faster cold starts</li>
</ul>

<h3 id="the-numbers">The Numbers</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>v1.0</th>
      <th>v2.0</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Cold Start</strong></td>
      <td>~4000ms</td>
      <td>~190ms</td>
      <td><strong>21x faster</strong></td>
    </tr>
    <tr>
      <td><strong>Application Code</strong></td>
      <td>1,488 lines</td>
      <td>205 lines</td>
      <td><strong>86% reduction</strong></td>
    </tr>
    <tr>
      <td><strong>Memory</strong></td>
      <td>512MB</td>
      <td>256MB</td>
      <td><strong>50% reduction</strong></td>
    </tr>
    <tr>
      <td><strong>Monthly Cost</strong></td>
      <td>~$3-5</td>
      <td>~$2-3</td>
      <td><strong>40% cheaper</strong></td>
    </tr>
  </tbody>
</table>

<p>The cold start improvement is measured via CloudWatch Logs Insights:</p>
<ul>
  <li>P50: 177ms</li>
  <li>P99: 214ms</li>
  <li>Range: 173-214ms</li>
</ul>

<hr />

<h2 id="why-swift">Why Swift?</h2>

<p>I first explored server-side Swift in a previous job, experimenting with Vapor and Kitura while working on a mobile app. I’m really happy to see the language has reached the cloud.</p>

<p>I chose Swift for Lambda because of its compiled nature and minimal runtime overhead. Running memory-efficient code cuts costs, and a strongly typed language is essential for catching errors now that so much of our code is AI-generated.</p>

<p>Swift uses <a href="https://docs.swift.org/swift-book/documentation/the-swift-programming-language/automaticreferencecounting/">Automatic Reference Counting (ARC)</a> for memory management, so there are none of the garbage-collection pauses of JVM languages (Java, Kotlin): memory is deallocated deterministically when the reference count drops to zero.</p>
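<p>CPython happens to use reference counting too (plus a cycle collector), so the deterministic-deallocation behavior is easy to demonstrate even from Python:</p>

```python
# Deterministic deallocation under reference counting (CPython semantics).
log = []

class Resource:
    def __del__(self):
        # Runs the moment the last reference disappears, not at a later GC pause.
        log.append("freed")

r = Resource()
log.append("before del")
del r  # refcount drops to zero; __del__ fires immediately
log.append("after del")
print(log)  # ['before del', 'freed', 'after del']
```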

<p><strong>Cold start breakdown:</strong></p>
<ol>
  <li><strong>Init Duration</strong>: Time to initialize the Lambda execution environment</li>
  <li><strong>Duration</strong>: Actual function execution time</li>
</ol>

<p>Python’s interpreted nature means importing libraries (boto3, langchain, etc.) adds significant overhead to every cold start. Swift compiles to a native binary with all dependencies linked.</p>
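<p>The import cost is easy to observe. This sketch re-imports stdlib modules after evicting them from <code>sys.modules</code> (a re-import is cheaper than a true cold import, since submodules stay cached, but the shape of the cost is visible):</p>

```python
import importlib
import sys
import time

def timed_import(module_name: str) -> float:
    """Return seconds spent importing module_name after evicting its cache entry."""
    sys.modules.pop(module_name, None)  # force the import machinery to run again
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

for mod in ("json", "decimal", "email"):
    print(f"{mod}: {timed_import(mod) * 1000:.3f} ms")
```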

<p><strong>The result exceeded my expectations:</strong> 190ms cold starts vs 4000ms in Python.</p>

<h3 id="swift-lambda-code">Swift Lambda Code</h3>

<p>Here’s the complete Lambda handler (simplified):</p>

<div class="language-swift highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">import</span> <span class="kt">AWSLambdaRuntime</span>
<span class="kd">import</span> <span class="kt">AWSLambdaEvents</span>
<span class="kd">import</span> <span class="kt">SotoBedrockAgentRuntime</span>

<span class="kd">@main</span>
<span class="kd">struct</span> <span class="kt">VirtualMeLambda</span> <span class="p">{</span>
    <span class="kd">static</span> <span class="kd">func</span> <span class="nf">main</span><span class="p">()</span> <span class="k">async</span> <span class="k">throws</span> <span class="p">{</span>
        <span class="k">let</span> <span class="nv">runtime</span> <span class="o">=</span> <span class="kt">LambdaRuntime</span> <span class="p">{</span> 
            <span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="kt">APIGatewayV2Request</span><span class="p">,</span> <span class="nv">context</span><span class="p">:</span> <span class="kt">LambdaContext</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="kt">APIGatewayV2Response</span> <span class="k">in</span>
            <span class="k">try</span> <span class="k">await</span> <span class="nf">handleRequest</span><span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="n">event</span><span class="p">)</span>
        <span class="p">}</span>
        <span class="k">try</span> <span class="k">await</span> <span class="n">runtime</span><span class="o">.</span><span class="nf">run</span><span class="p">()</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kd">func</span> <span class="nf">handleRequest</span><span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="kt">APIGatewayV2Request</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="kt">APIGatewayV2Response</span> <span class="p">{</span>
    <span class="k">guard</span> <span class="k">let</span> <span class="nv">body</span> <span class="o">=</span> <span class="n">event</span><span class="o">.</span><span class="n">body</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nf">errorResponse</span><span class="p">(</span><span class="mi">400</span><span class="p">,</span> <span class="s">"Missing request body"</span><span class="p">)</span>
    <span class="p">}</span>
    
    <span class="k">let</span> <span class="nv">request</span> <span class="o">=</span> <span class="k">try</span> <span class="kt">JSONDecoder</span><span class="p">()</span><span class="o">.</span><span class="nf">decode</span><span class="p">(</span><span class="kt">ChatRequest</span><span class="o">.</span><span class="k">self</span><span class="p">,</span> <span class="nv">from</span><span class="p">:</span> <span class="kt">Data</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">utf8</span><span class="p">))</span>
    <span class="k">let</span> <span class="nv">question</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="n">messages</span><span class="o">.</span><span class="n">last</span><span class="p">?</span><span class="o">.</span><span class="n">text</span> <span class="p">??</span> <span class="s">""</span>
    
    <span class="c1">// Call Bedrock Knowledge Base</span>
    <span class="k">let</span> <span class="nv">answer</span> <span class="o">=</span> <span class="k">try</span> <span class="k">await</span> <span class="nf">retrieveAndGenerate</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
    
    <span class="k">let</span> <span class="nv">responseBody</span> <span class="o">=</span> <span class="k">try</span> <span class="kt">JSONEncoder</span><span class="p">()</span><span class="o">.</span><span class="nf">encode</span><span class="p">(</span><span class="kt">ChatResponse</span><span class="p">(</span><span class="nv">text</span><span class="p">:</span> <span class="n">answer</span><span class="p">))</span>
    <span class="k">return</span> <span class="kt">APIGatewayV2Response</span><span class="p">(</span>
        <span class="nv">statusCode</span><span class="p">:</span> <span class="o">.</span><span class="n">ok</span><span class="p">,</span>
        <span class="nv">headers</span><span class="p">:</span> <span class="p">[</span><span class="s">"Content-Type"</span><span class="p">:</span> <span class="s">"application/json"</span><span class="p">],</span>
        <span class="nv">body</span><span class="p">:</span> <span class="kt">String</span><span class="p">(</span><span class="nv">data</span><span class="p">:</span> <span class="n">responseBody</span><span class="p">,</span> <span class="nv">encoding</span><span class="p">:</span> <span class="o">.</span><span class="n">utf8</span><span class="p">)</span>
    <span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>No custom vector search. No manual embedding generation. Just call Bedrock Knowledge Base and return the result.</p>

<hr />

<h2 id="s3-vectors-native-vector-storage">S3 Vectors: Native Vector Storage</h2>

<p>S3 Vectors is AWS’s managed vector database. It integrates directly with Bedrock Knowledge Base.</p>

<p><strong>What it handles:</strong></p>
<ul>
  <li>Vector indexing (automatic)</li>
  <li>Similarity search (sub-100ms)</li>
  <li>Scaling (automatic)</li>
  <li>Cost (pay per query, not per GB stored)</li>
</ul>

<p><strong>What I don’t have to build:</strong></p>
<ul>
  <li>Cosine similarity calculations</li>
  <li>Vector normalization</li>
  <li>Index management</li>
  <li>Query optimization</li>
</ul>

<p>My v1.0 DynamoDB implementation had ~400 lines of code for vector search. S3 Vectors replaced all of it. Not free like DynamoDB, but affordable for my budget.</p>
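<p>Condensed, the scan-and-score pattern those ~400 lines implemented looks like this (an in-memory list stands in for the DynamoDB scan; all names are illustrative):</p>

```python
# Client-side vector search: scan everything, score everything, keep the top k.
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(items, query_vector, k=2):
    # v1.0 did this over a full DynamoDB scan; S3 Vectors does it server-side
    scored = ((cosine_similarity(query_vector, item["embedding"]), item["id"])
              for item in items)
    return heapq.nlargest(k, scored)

items = [
    {"id": "doc-a", "embedding": [1.0, 0.0]},
    {"id": "doc-b", "embedding": [3.0, 4.0]},
    {"id": "doc-c", "embedding": [0.0, 1.0]},
]
print(top_k(items, [1.0, 0.0]))  # [(1.0, 'doc-a'), (0.6, 'doc-b')]
```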

<hr />

<h2 id="infrastructure-aws-sam">Infrastructure: AWS SAM</h2>

<p>I migrated from Terraform to AWS SAM (Serverless Application Model) for better Lambda development workflow.</p>

<p><strong>Why SAM:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">sam build</code> - Automatic dependency packaging</li>
  <li>Built-in best practices (IAM, X-Ray, CORS)</li>
  <li>Faster iteration cycle</li>
</ul>

<p>Nested stacks keep the templates modular. Each stack is ~150 lines and specialized.</p>

<h3 id="local-testing-with-swift-lambda-runtime">Local Testing with Swift Lambda Runtime</h3>

<p>The Swift AWS Lambda Runtime automatically starts a local HTTP server when not running in a Lambda execution environment. This makes testing incredibly simple:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start local server on http://127.0.0.1:7000/invoke</span>
<span class="nb">cd </span>sam
make run-local
</code></pre></div></div>

<p>Under the hood, this runs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">KNOWLEDGE_BASE_ID</span><span class="o">=</span>RPXCA7UUQN <span class="nv">LLM_MODEL</span><span class="o">=</span>nova-2-lite <span class="nv">LLM_TEMPERATURE</span><span class="o">=</span>0.1 swift run
</code></pre></div></div>

<p><strong>Test with curl:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST http://127.0.0.1:7000/invoke <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "version": "2.0",
    "routeKey": "POST /chat",
    "body": "{\"messages\":[{\"role\":\"user\",\"text\":\"What is your experience?\"}]}"
  }'</span>
</code></pre></div></div>

<p>This is much simpler than <code class="language-plaintext highlighter-rouge">sam local start-api</code>, which requires Docker and emulates the entire API Gateway + Lambda stack. The Swift runtime’s built-in local server connects directly to real AWS services (Bedrock, S3 Vectors) for authentic integration testing.</p>

<hr />

<h2 id="measuring-cold-starts">Measuring Cold Starts</h2>

<p>CloudWatch automatically captures Lambda cold start metrics. I added a Makefile command to query them:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make cold-start-metrics
</code></pre></div></div>

<p>This runs a CloudWatch Logs Insights query:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fields @initDuration, @duration, @memorySize
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart,
        max(@initDuration) as maxColdStart,
        pct(@initDuration, 50) as p50ColdStart,
        pct(@initDuration, 99) as p99ColdStart,
        count() as totalColdStarts
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">@initDuration</code> field only appears on cold starts, making it easy to track.</p>
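<p>The same aggregation the Logs Insights query performs can be sketched locally (the REPORT lines below are illustrative samples, not real measurements):</p>

```python
# Parse Lambda REPORT log lines and aggregate cold-start init durations.
import re
import statistics

logs = [
    "REPORT RequestId: a1 Duration: 35.2 ms Init Duration: 190.1 ms",
    "REPORT RequestId: a2 Duration: 33.9 ms",  # warm start: no Init Duration
    "REPORT RequestId: a3 Duration: 36.4 ms Init Duration: 177.0 ms",
    "REPORT RequestId: a4 Duration: 34.7 ms Init Duration: 214.3 ms",
]

init_durations = [
    float(m.group(1))
    for line in logs
    if (m := re.search(r"Init Duration: ([\d.]+) ms", line))
]

p50 = statistics.median(init_durations)
print(f"cold starts: {len(init_durations)}, p50: {p50} ms, max: {max(init_durations)} ms")
```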

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<h3 id="1-s3-vectors">1. S3 Vectors</h3>

<p>My v1.0 custom DynamoDB vector store wasn’t over-engineering; it was a necessary compromise. ElasticSearch was too expensive for a personal project. DynamoDB’s free tier (25GB storage, 25 RCU/WCU) made it the only viable option for vector storage.</p>

<p>The complexity was the price of staying affordable.</p>

<p>When S3 Vectors reached general availability on December 2, 2025, vector search became accessible for personal projects. The managed service eliminated 86% of my code while providing capabilities (sub-100ms similarity search, automatic indexing) that my DynamoDB implementation couldn’t match.</p>

<p><strong>The lesson:</strong> Sometimes complexity is justified by constraints. When those constraints change (new services, pricing models), it’s worth revisiting your architecture.</p>

<h3 id="2-measure">2. Measure</h3>

<p>I assumed Python would be “fast enough” for cold starts. Measuring with CloudWatch proved otherwise. Swift’s 21x improvement was worth the migration effort.</p>

<p><strong>Rule:</strong> Measure before optimizing, but also measure to validate assumptions.</p>

<h3 id="3-compiled--interpreted-for-lambda">3. Compiled &gt; Interpreted for Lambda</h3>

<p>Python’s flexibility comes at a cost: import overhead. Swift’s compiled binary starts instantly.</p>

<p>For Lambda, prefer compiled languages over interpreted ones (Python, Node.js) when cold start matters.</p>

<h3 id="4-infrastructure-as-code">4. Infrastructure as Code</h3>

<p>Terraform worked, but SAM’s Lambda-specific features (automatic packaging) made development faster.</p>

<p><strong>Rule:</strong> Choose IaC tools that match your workload. SAM for serverless, Terraform for multi-cloud.</p>

<hr />

<h2 id="cost-breakdown">Cost Breakdown</h2>

<p><strong>Monthly cost for ~100 conversations:</strong></p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>v1.0</th>
      <th>v2.0</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Lambda</td>
      <td>$0.50</td>
      <td>$0.25</td>
    </tr>
    <tr>
      <td>DynamoDB</td>
      <td>$1.50</td>
      <td>$0</td>
    </tr>
    <tr>
      <td>S3 Vectors</td>
      <td>$0</td>
      <td>$0.75</td>
    </tr>
    <tr>
      <td>Bedrock</td>
      <td>$1.50</td>
      <td>$1.00</td>
    </tr>
    <tr>
      <td>Other (S3, CloudFront, Route53)</td>
      <td>$1.00</td>
      <td>$1.00</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$4.50</strong></td>
      <td><strong>~$3.00</strong></td>
    </tr>
  </tbody>
</table>

<p>The cost reduction comes from:</p>
<ul>
  <li>50% less Lambda memory (512MB → 256MB)</li>
  <li>No DynamoDB scan operations</li>
  <li>Bedrock Knowledge Base efficiency (fewer API calls)</li>
</ul>

<hr />

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The complete source code is on GitHub: <a href="https://github.com/jeremylem/virtualme2">github.com/jeremylem/virtualme2</a></p>

<p><strong>Quick start:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/jeremylem/virtualme2
<span class="nb">cd </span>virtualme2/sam
sam build
sam deploy <span class="nt">--guided</span>
</code></pre></div></div>

<p><strong>Local testing:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make run-local          <span class="c"># Start Swift Lambda locally</span>
make test-local         <span class="c"># Send test request</span>
make cold-start-metrics <span class="c"># View CloudWatch metrics</span>
</code></pre></div></div>

<p>Try the live version: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<hr />

<p><strong>Tech Stack:</strong> Swift 6.0 · AWS Lambda · Amazon Bedrock · S3 Vectors · AWS SAM · CloudFormation</p>

<p><strong>Performance:</strong> 190ms cold starts · 256MB memory · $3/month</p>

<p><strong>Code:</strong> 86% reduction · 62% total codebase reduction</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The Evolution]]></summary></entry><entry><title type="html">Multi-Agent RAG with S3 Vectors and Bedrock AgentCore</title><link href="https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors.html" rel="alternate" type="text/html" title="Multi-Agent RAG with S3 Vectors and Bedrock AgentCore" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors.html"><![CDATA[<p>A RAG chatbot for personal notes using S3 Vectors and Bedrock Agents. Features true multi-agent collaboration with critique-driven feedback loop and multi-turn conversations. Built as a learning exercise after re:Invent 2025.</p>

<h2 id="why-i-built-this">Why I Built This</h2>

<p>I already have <a href="https://github.com/ox00004a/virtualme">virtualme</a> running in production. It’s a RAG chatbot using DynamoDB for vector storage. It works, stays in the free tier, does the job.</p>

<p>But re:Invent 2025 announced two things that caught my attention:</p>

<ul>
  <li>
    <p><strong>Amazon S3 Vectors</strong> (December 2025). Native vector storage with server-side similarity search. Supports cosine, euclidean, and dot product metrics. Up to 10,000 dimensions per vector. Metadata filtering built-in. No more client-side cosine calculations or DynamoDB scan-and-compute patterns.</p>
  </li>
  <li>
    <p><strong>Amazon Bedrock AgentCore</strong> (December 2025). Managed agent infrastructure with tool orchestration. Agents get knowledge base access, action groups, and session memory. Multi-agent setups possible: agents can invoke other agents, share context, or work in parallel. Built-in trace for debugging retrieval and reasoning steps.</p>
  </li>
</ul>
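<p>The three distance metrics S3 Vectors supports can be computed by hand for intuition (pure-Python sketch; S3 Vectors does this server-side):</p>

```python
# Cosine distance, Euclidean distance, and dot product on toy 2-D vectors.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [1.0, 1.0]
print(round(cosine_distance(a, b), 4))  # 0.2929
print(euclidean(a, b))                  # 1.0
print(dot(a, b))                        # 1.0
```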

<p>I wanted to understand how these compare to my DynamoDB approach.</p>

<hr />

<h2 id="multi-agent-architecture">Multi-Agent Architecture</h2>

<p>True multi-agent collaboration with specialized roles:</p>

<ul>
  <li><strong>Research Agent</strong>: Searches knowledge base, extracts facts</li>
  <li><strong>Critique Agent</strong>: Evaluates quality, provides feedback (1-10 scoring)</li>
  <li><strong>Formatter Agent</strong>: Creates natural user responses</li>
</ul>

<h3 id="orchestration-pattern">Orchestration Pattern</h3>

<p>Sequential agent handoff with feedback loop:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query → Research Agent → Critique Agent → Score Check
                                      ↓
                              Score ≥ 7? → Formatter Agent → Response
                                      ↓
                              Score &lt; 7? → Feedback to Research (max 3x)
</code></pre></div></div>
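<p>Stripped of the Bedrock calls, the feedback loop above can be sketched with stub agents (all names are illustrative; the real agents are Bedrock invocations):</p>

```python
# Critique-driven orchestration loop with stub agents.
MAX_ITERATIONS = 3
SCORE_THRESHOLD = 7

def research(query, feedback=None):
    # Stub: a real implementation would invoke the Research Agent
    return f"facts about {query}" + (f" (revised: {feedback})" if feedback else "")

def critique(draft):
    # Stub: returns (score, feedback); the real agent scores 1-10
    return (9, None) if "revised" in draft else (5, "add more detail")

def format_answer(draft):
    # Stub for the Formatter Agent
    return f"Answer: {draft}"

def orchestrate(query):
    feedback = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = research(query, feedback)
        score, feedback = critique(draft)
        if score >= SCORE_THRESHOLD:
            break  # good enough, hand off to the formatter
    return format_answer(draft), iteration, score

answer, iterations, score = orchestrate("What is DynamoDB?")
print(iterations, score)  # 2 9
```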

<h3 id="session-management">Session Management</h3>

<ul>
  <li>Shared session IDs across all agent calls</li>
  <li>Each query gets unique session ID</li>
  <li>All 3 agents share conversation memory</li>
  <li>Natural multi-turn conversations</li>
</ul>

<hr />

<h2 id="architecture">Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                        DEPLOYMENT
                        ==========

./deploy.sh
     |
     v
+------------------+
|  S3 Bucket       |  &lt;-- your .md/.txt notes go here
|  (notes/)        |
+--------+---------+
         |
         v
+------------------+
|  Bedrock         |  reads notes, chunks them
|  Knowledge Base  |  calls Titan Embeddings (1024 dims)
+--------+---------+
         |
         v
+------------------+
|  S3 Vectors      |  &lt;-- vectors stored here
|  Index           |      native similarity search
+------------------+
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                         QUERY FLOW
                         ==========

python client.py "What is DynamoDB?"
     |
     | HTTPS + SigV4 signing
     v
+------------------+
|  API Lambda      |
|  Function URL    |
+--------+---------+
         |
         v
+------------------+
|  Orchestrator    |  manages agent workflow
|  Lambda          |  extracts citations from trace
+--------+---------+
         |
         +---&gt; Research Agent ---&gt; Critique Agent
         |           ^                   |
         |           |     score &lt; 7     |
         |           +-------------------+
         |                   |
         |            score &gt;= 7
         |                   v
         +---&gt; Formatter Agent ---&gt; Response with Sources
</code></pre></div></div>

<h3 id="orchestrator-lambda">Orchestrator Lambda</h3>

<p>The orchestrator is a Python Lambda (<code class="language-plaintext highlighter-rouge">orchestrator.handler</code>) that coordinates the multi-agent workflow. It receives the query, manages the critique loop, and assembles the final response.</p>

<p>Key implementation details:</p>

<ul>
  <li>Calls <code class="language-plaintext highlighter-rouge">bedrock.invoke_agent()</code> with <code class="language-plaintext highlighter-rouge">enableTrace=True</code> to capture retrieval metadata</li>
  <li>Parses <code class="language-plaintext highlighter-rouge">knowledgeBaseLookupOutput.retrievedReferences</code> from trace to extract real S3 URIs</li>
  <li>Extracts filenames from URIs and appends them as sources (agents hallucinate filenames, so this is done server-side)</li>
  <li>Shares same <code class="language-plaintext highlighter-rouge">session_id</code> across all agent calls for conversation continuity</li>
  <li>Returns structured response with <code class="language-plaintext highlighter-rouge">iterations</code>, <code class="language-plaintext highlighter-rouge">final_score</code>, and <code class="language-plaintext highlighter-rouge">sources</code> for observability</li>
</ul>
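<p>The citation extraction can be sketched as follows (the sample trace dict mirrors the <code>knowledgeBaseLookupOutput</code> shape described above; the exact field nesting is an assumption):</p>

```python
# Extract source filenames from a retrieval trace instead of trusting the agent.
import os

trace_event = {
    "observation": {
        "knowledgeBaseLookupOutput": {
            "retrievedReferences": [
                {"location": {"s3Location": {"uri": "s3://notes-bucket/notes/dynamodb.md"}}},
                {"location": {"s3Location": {"uri": "s3://notes-bucket/notes/s3-vectors.md"}}},
            ]
        }
    }
}

def extract_sources(event):
    refs = (event.get("observation", {})
                 .get("knowledgeBaseLookupOutput", {})
                 .get("retrievedReferences", []))
    uris = [r["location"]["s3Location"]["uri"] for r in refs]
    # Filenames come from the real S3 URIs, not from the agent's own text,
    # because agents hallucinate filenames.
    return sorted({os.path.basename(u) for u in uris})

print(extract_sources(trace_event))  # ['dynamodb.md', 's3-vectors.md']
```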

<hr />

<h2 id="what-i-wanted-to-learn">What I Wanted to Learn</h2>

<h3 id="1-aws-sam">1. AWS SAM</h3>

<p>I’ve used Terraform before. Never SAM.</p>

<p>SAM handles Lambda packaging automatically. You point it at a directory, it zips and uploads:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">OrchestratorLambda</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::Serverless::Function</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">Runtime</span><span class="pi">:</span> <span class="s">python3.13</span>
    <span class="na">Handler</span><span class="pi">:</span> <span class="s">orchestrator.handler</span>
    <span class="na">CodeUri</span><span class="pi">:</span> <span class="s">../lambda/</span> <span class="c1"># SAM packages this</span>
</code></pre></div></div>

<p>Run <code class="language-plaintext highlighter-rouge">sam build</code>, it creates <code class="language-plaintext highlighter-rouge">.aws-sam/build/</code> with deployment artifacts. Run <code class="language-plaintext highlighter-rouge">sam deploy --resolve-s3</code>, it handles the S3 bucket for you.</p>

<p>Good: Less config than raw CloudFormation.
Bad: Another abstraction layer to debug when things break.</p>

<h3 id="2-s3-vectors-vs-dynamodb">2. S3 Vectors vs DynamoDB</h3>

<p><strong>virtualme approach:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Client-side similarity calculation
</span><span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">dynamodb</span><span class="p">.</span><span class="n">scan</span><span class="p">():</span>
    <span class="n">score</span> <span class="o">=</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">query_vector</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s">'embedding'</span><span class="p">])</span>
</code></pre></div></div>
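<p>Expanded into something self-contained, the client-side scoring looks roughly like this (an in-memory list stands in for the DynamoDB scan results):</p>

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, items, k=3):
    # items: list of {"text": ..., "embedding": [...]} as stored in DynamoDB
    scored = [(cosine_similarity(query_vector, it["embedding"]), it) for it in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:k]]
```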

<p><strong>S3 Vectors approach:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">VectorIndex</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::S3Vectors::Index</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">Dimension</span><span class="pi">:</span> <span class="m">1024</span>
    <span class="na">DistanceMetric</span><span class="pi">:</span> <span class="s">cosine</span> <span class="c1"># server-side!</span>
</code></pre></div></div>

<p>No Python similarity code. Bedrock handles the vector search natively.</p>

<p>Trade-off: S3 Vectors has a 2048-byte limit on filterable metadata per record (details below).</p>

<h3 id="3-multi-agent-patterns">3. Multi-Agent Patterns</h3>

<p>I learned the difference between:</p>

<ul>
  <li><strong>Bedrock Flow</strong>: Service orchestration, explicit control, no session memory between queries</li>
  <li><strong>Bedrock Agent</strong>: Autonomous decision-making within a role, built-in session management</li>
  <li><strong>AgentCore</strong>: Multi-agent collaboration, agents can delegate to each other</li>
</ul>

<p>For this project, I chose individual agents coordinated by an orchestrator Lambda. This gives full control over the workflow while leveraging agent autonomy within each role.</p>

<hr />

<h2 id="key-technical-choices">Key Technical Choices</h2>

<ol>
  <li><strong>Agent Autonomy</strong>: Each agent makes decisions within its role (search strategy, evaluation criteria, formatting style)</li>
  <li><strong>Shared Sessions</strong>: Context preservation across agent calls via session ID</li>
  <li><strong>Feedback Loops</strong>: Critique-driven improvement (max 3 iterations to limit cost)</li>
  <li><strong>Source Extraction</strong>: Orchestrator extracts real filenames from Bedrock trace (agents were hallucinating sources)</li>
  <li><strong>Modular Architecture</strong>: Infrastructure, agents, and API in separate nested stacks</li>
</ol>

<hr />

<h2 id="things-that-broke">Things That Broke</h2>

<h3 id="the-2048-byte-metadata-limit">The 2048-byte Metadata Limit</h3>

<p>First deployment. Ingestion job fails with <code class="language-plaintext highlighter-rouge">Filterable metadata must have at most 2048 bytes</code>.</p>

<p>Bedrock stores chunk text in filterable metadata by default. Even small paragraphs exceed 2048 bytes. I tried shorter filenames, smaller chunks, different chunking strategies. Nothing worked because the limit is per-record, not total.</p>

<p>The fix is to configure the index to treat Bedrock metadata as non-filterable:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">VectorIndex</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::S3Vectors::Index</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">MetadataConfiguration</span><span class="pi">:</span>
      <span class="na">NonFilterableMetadataKeys</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">AMAZON_BEDROCK_TEXT</span>
        <span class="pi">-</span> <span class="s">AMAZON_BEDROCK_METADATA</span>
</code></pre></div></div>

<p>Catch: This must be set at index creation. I had to destroy the stack and redeploy.</p>

<h3 id="agent-source-hallucination">Agent Source Hallucination</h3>

<p>Agents consistently invented plausible-sounding filenames instead of citing actual sources. Asked about SOLID principles, got “From Design Patterns Basics.md” when the real file was “SOLID &amp; Design pattern.md”.</p>

<p>Tried multiple approaches:</p>

<ul>
  <li>Explicit instructions to copy exact filenames</li>
  <li>Critique agent penalizing invented sources</li>
  <li>Different prompt formats</li>
</ul>

<p>None worked reliably. Nova Lite models don’t seem to have clear visibility into the S3 URIs from retrieval results.</p>

<p>The fix: extract citations in the orchestrator. Bedrock’s <code class="language-plaintext highlighter-rouge">invoke_agent</code> response includes trace data with <code class="language-plaintext highlighter-rouge">knowledgeBaseLookupOutput.retrievedReferences</code>. Each reference has <code class="language-plaintext highlighter-rouge">location.s3Location.uri</code>. Parse the filename, append it to the response. Real sources, no hallucination.</p>
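<p>A sketch of that extraction logic. The key path below reflects the trace structure I observed; treat it as an assumption to verify against your own <code class="language-plaintext highlighter-rouge">invoke_agent</code> trace events:</p>

```python
import os

def extract_sources(trace_events):
    # Each event from invoke_agent() may carry a "trace" payload. Retrieval
    # results live under knowledgeBaseLookupOutput.retrievedReferences, and
    # each reference points at the chunk's real S3 object.
    sources = set()
    for event in trace_events:
        obs = (event.get("trace", {})
                    .get("trace", {})
                    .get("orchestrationTrace", {})
                    .get("observation", {}))
        refs = obs.get("knowledgeBaseLookupOutput", {}).get("retrievedReferences", [])
        for ref in refs:
            uri = ref.get("location", {}).get("s3Location", {}).get("uri", "")
            if uri:
                sources.add(os.path.basename(uri))  # keep just the filename
    return sorted(sources)
```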

<hr />

<h2 id="cold-start-reality">Cold Start Reality</h2>

<p>First query after deployment or idle:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>API Lambda init</td>
      <td>~500ms</td>
    </tr>
    <tr>
      <td>Orchestrator Lambda init</td>
      <td>~500ms</td>
    </tr>
    <tr>
      <td>Bedrock Agent session</td>
      <td>variable</td>
    </tr>
    <tr>
      <td>Knowledge Base connection</td>
      <td>first query slower</td>
    </tr>
  </tbody>
</table>

<p><strong>First query:</strong> 15-30 seconds (everything cold)
<strong>Warm queries:</strong> 3-8 seconds
<strong>After 10 min idle:</strong> Agent session expires, partial cold start</p>

<p>virtualme has the same cold start issues. Serverless trade-off.</p>

<hr />

<h2 id="cost-comparison">Cost Comparison</h2>

<h3 id="this-project-s3-vectors--multi-agent">This Project (S3 Vectors + Multi-Agent)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Monthly Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>S3 Vectors storage</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>S3 Vectors queries</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>Titan Embeddings (ingestion)</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>Nova Lite (3 agents)</td>
      <td>~$0.30-0.70</td>
    </tr>
    <tr>
      <td>Lambda</td>
      <td>free tier</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$0.35-0.75</strong></td>
    </tr>
  </tbody>
</table>

<p>Multi-agent multiplies Bedrock costs: 3 agents per query, up to 3 iterations if critique score &lt; 7. Worst case: 9 model calls per query.</p>

<p><strong>Pricing reference</strong> (Nova Lite): $0.06/1M input tokens, $0.24/1M output tokens.</p>
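<p>A quick back-of-the-envelope check against those rates (the per-call token counts are illustrative guesses, not measurements from this project):</p>

```python
# Nova Lite rates from above; per-call token counts are illustrative guesses.
INPUT_RATE = 0.06 / 1_000_000   # $ per input token
OUTPUT_RATE = 0.24 / 1_000_000  # $ per output token

def query_cost(model_calls=9, input_tokens=2_000, output_tokens=500):
    # Worst case: 3 agents x 3 iterations = 9 model calls per query.
    per_call = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    return model_calls * per_call

# 9 * (2000 * $0.06/1M + 500 * $0.24/1M) = 9 * $0.00024, i.e. roughly
# $0.002 per worst-case query, which is consistent with a monthly bill
# well under a dollar at ~100 queries/month.
```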

<h3 id="virtualme-dynamodb">virtualme (DynamoDB)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Monthly Cost</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DynamoDB</td>
      <td>$0</td>
      <td>Free tier: 25 RCUs/WCUs, 25GB</td>
    </tr>
    <tr>
      <td>Nova Lite</td>
      <td>~$0.10-0.30</td>
      <td>Single agent per query</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$0.10-0.30</strong></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<h3 id="session-storage">Session Storage</h3>

<p>Bedrock Agent sessions are configured with <code class="language-plaintext highlighter-rouge">IdleSessionTTLInSeconds: 600</code> (10 minutes). After 10 minutes of inactivity, the session expires and conversation context is lost.</p>

<p>Session storage itself is free. Idle time doesn’t cost anything. What costs tokens is conversation history. The agent includes previous turns in each request:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Turn 1: "What is S3?"           →  ~50 input tokens
Turn 2: "Tell me more"          → ~150 input tokens (includes Turn 1)
Turn 3: "How about pricing?"    → ~300 input tokens (includes Turn 1+2)
</code></pre></div></div>
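<p>The growth pattern above is worth making explicit: per-request input grows linearly with the number of turns, so total input spend over a conversation grows quadratically. A toy model (per-turn token counts are illustrative):</p>

```python
def cumulative_input_tokens(turn_tokens):
    # Each request re-sends the full history, so the input size at turn n
    # is the sum of new tokens from turns 1..n.
    totals, running = [], 0
    for tokens in turn_tokens:
        running += tokens
        totals.append(running)
    return totals
```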

<p>For light usage (~100 queries/month), the difference is negligible.</p>

<h3 id="final-outcome">Final Outcome</h3>

<p>DynamoDB is cheaper because I hacked it. In virtualme, I scan all items and calculate cosine similarity client-side. This works because my notes corpus produces fewer than 1000 chunks. Beyond that, it would get slow and expensive.</p>

<p>S3 Vectors costs slightly more but removes all that custom code. Native similarity search, no client-side calculations, no manual orchestration. For anything larger than a small personal project, managed infrastructure wins.</p>

<p>Multi-agent adds overhead. Each query involves 3+ model calls. Worth it for complex reasoning tasks, overkill for simple Q&amp;A.</p>

<hr />

<h2 id="model-specialization-future-improvement">Model Specialization (Future Improvement)</h2>

<p>Currently all three agents use the same model (Nova 2 Lite). Using different models per role would improve results:</p>

<ul>
  <li><strong>Different Strengths</strong>: Each model brings different capabilities</li>
  <li><strong>Error Correction</strong>: Claude might catch what Nova misses</li>
  <li><strong>Cost Optimization</strong>: Use expensive models only where needed (e.g., Claude for critique, Nova for research)</li>
  <li><strong>Quality Improvement</strong>: Specialized models for specialized tasks</li>
</ul>

<p>I kept it simple for this learning exercise. Model specialization is a next step.</p>

<hr />

<h2 id="next-learning-path">Next Learning Path</h2>

<h3 id="1-peer-to-peer-agent-communication">1. Peer-to-Peer Agent Communication</h3>

<p>Current architecture uses orchestrator-driven coordination. Agents don’t talk to each other directly.</p>

<p>Next exploration:</p>

<ul>
  <li>Research agent directly asks critique agent for guidance mid-search</li>
  <li>Agents negotiate who handles which part of a complex query</li>
  <li>Dynamic task decomposition without central orchestrator</li>
</ul>

<p>This requires AgentCore’s agent-to-agent invocation. Different trade-offs: more autonomous, but less predictable and likely harder to debug.</p>

<h3 id="2-model-specialization-experiment">2. Model Specialization Experiment</h3>

<ul>
  <li>A/B test different model combinations per agent role</li>
  <li>Compare: Claude Haiku vs Nova Lite 2 vs Nova Micro 2</li>
  <li>Track: cost per query, response quality, iteration count</li>
</ul>

<hr />

<h2 id="what-i-learned">What I Learned</h2>

<ol>
  <li>
    <p><strong>S3 Vectors has edge cases.</strong> The 2048-byte metadata limit is not documented prominently. Cost me a few hours.</p>
  </li>
  <li>
    <p><strong>Different LLMs behave differently.</strong> Nova and Claude interpret the same agent prompts differently. Test with your actual model.</p>
  </li>
  <li>
    <p><strong>Agents hallucinate sources.</strong> Even with explicit instructions, models invent plausible filenames. Extract citations from the API trace, not the model output.</p>
  </li>
  <li>
    <p><strong>SAM is convenient but adds abstraction.</strong> When it works, great. When it breaks, you’re debugging two layers.</p>
  </li>
  <li>
    <p><strong>Multi-agent adds complexity and cost.</strong> 3 agents × 3 iterations = 9 model calls worst case. Worth it for quality, overkill for simple queries.</p>
  </li>
  <li>
    <p><strong>Orchestrator gives control.</strong> Lambda-based orchestration lets you extract trace data, manage iterations, and append real sources. Pure agent-to-agent would lose this visibility.</p>
  </li>
  <li>
    <p><strong>Free tier is hard to beat (in my case).</strong> DynamoDB + custom code is cheaper only because I have fewer than 1000 chunks. This hack won’t scale. For anything larger, managed services like S3 Vectors are the right choice.</p>
  </li>
</ol>

<hr />

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-preview/">S3 Vectors announcement</a></li>
  <li><a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html">Bedrock AgentCore docs</a></li>
  <li><a href="https://github.com/jeremylem/virtualme">virtualme (the DynamoDB approach)</a></li>
</ul>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[A RAG chatbot for personal notes using S3 Vectors and Bedrock Agents. Features true multi-agent collaboration with critique-driven feedback loop and multi-turn conversations. Built as a learning exercise after re:Invent 2025.]]></summary></entry><entry><title type="html">Building a Virtual Me: RAG-Powered Resume Chatbot on AWS</title><link href="https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot.html" rel="alternate" type="text/html" title="Building a Virtual Me: RAG-Powered Resume Chatbot on AWS" /><published>2026-01-11T00:00:00+00:00</published><updated>2026-01-11T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot.html"><![CDATA[<h2 id="the-resume-problem">The Resume Problem</h2>

<p>I’ve been frustrated with traditional resumes for a while now. Not because mine is bad, but because the format itself is a bit broken.</p>

<p><strong>Two core problems:</strong></p>

<ol>
  <li>
    <p><strong>Resumes don’t translate experience well.</strong> You spend hours crafting bullet points that technically describe what you did, but they fail to capture the <em>why</em> behind your decisions, the trade-offs you considered, or the context that made your work meaningful. A line like “Implemented serverless RAG pipeline” does not tell you much.</p>
  </li>
  <li>
    <p><strong>Resumes are boring.</strong> Reading a resume is like reading a phone book. It’s a static, one-dimensional artifact that forces recruiters to play detective, inferring things that should be explicit. Want to know the why behind a decision, or the trade-offs considered? You’d have to schedule a call.</p>
  </li>
</ol>

<p>I wanted something better. Something that could answer questions about my experience for me.</p>

<p>So I built <strong>Virtual Me</strong>, an AI chatbot that represents me, fed with my actual resume and technical knowledge, using Retrieval Augmented Generation (RAG).</p>

<p>You can try it here: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<p>The rest of this post covers some technical details I learned along the way.</p>

<hr />

<h2 id="requestresponse-flow-with-json-examples">Request/Response Flow with JSON Examples</h2>

<p>Here is the complete end-to-end flow from user input to AI response, including actual JSON payloads exchanged between components, CORS handling, validation, RAG pipeline execution, and error cases.</p>

<h3 id="request-flow-diagram">Request Flow Diagram</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Input
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 1. Web UI (Deep Chat)                                       │
│    POST https://api.lemaire.tel/chat                        │
│    Content-Type: application/json                           │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Route 53                                                  │
│    DNS: api.lemaire.tel → API Gateway Endpoint              │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 3. API Gateway HTTP API                                      │
│    Transforms HTTP → Lambda Event (API Gateway v2 format)   │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 4. Lambda (lambda_function.py)                               │
│    - Validates with Pydantic (ChatRequest)                  │
│    - Truncates to last 20 messages                          │
│    - Invokes run_rag_pipeline(question)                     │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 5. LangGraph Workflow (rag/pipeline.py)                      │
│    ┌──────────────┐      ┌──────────────┐                  │
│    │ retrieve_node│ ───► │ generate_node│                  │
│    └──────────────┘      └──────────────┘                  │
│           │                      │                          │
│           ↓                      ↓                          │
│      DynamoDB              Amazon Bedrock                   │
└─────────────────────────────────────────────────────────────┘
    ↓
Response (reverse path)
</code></pre></div></div>

<h3 id="cors-preflight-handling">CORS Preflight Handling</h3>

<p>Before the actual POST request, browsers send a preflight OPTIONS request when making cross-origin calls (<code class="language-plaintext highlighter-rouge">chat.lemaire.tel</code> → <code class="language-plaintext highlighter-rouge">api.lemaire.tel</code>). The Lambda function explicitly returns 200 OK with CORS headers to allow the request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># lambda_function.py handles OPTIONS preflight
</span><span class="k">if</span> <span class="n">event</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"requestContext"</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">"http"</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">"method"</span><span class="p">)</span> <span class="o">==</span> <span class="s">"OPTIONS"</span><span class="p">:</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"statusCode"</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
        <span class="s">"headers"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"Access-Control-Allow-Origin"</span><span class="p">:</span> <span class="s">"https://chat.lemaire.tel"</span><span class="p">,</span>
            <span class="s">"Access-Control-Allow-Methods"</span><span class="p">:</span> <span class="s">"POST, OPTIONS"</span><span class="p">,</span>
            <span class="s">"Access-Control-Allow-Headers"</span><span class="p">:</span> <span class="s">"content-type"</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<h3 id="step-1-web-ui-request-deep-chat-format">Step 1: Web UI Request (Deep Chat Format)</h3>

<p><strong>The Lambda backend is completely stateless.</strong> All conversation history is maintained client-side by the Deep Chat UI and sent with every request.</p>

<p>First message:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is your experience with AWS?"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Third message in conversation (includes full history):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What technologies do you work with?"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ai"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I work with Python, AWS Lambda, Terraform, and Amazon Bedrock..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Tell me more about your AWS experience"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Why client-side state?</strong> Zero backend storage cost, instant scaling (no session affinity), privacy (conversations never stored), and simplicity.</p>

<h3 id="step-2-api-gateway-event-lambda-input">Step 2: API Gateway Event (Lambda Input)</h3>

<p>API Gateway transforms the HTTP request into a Lambda event (AWS API Gateway v2 format):</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.0"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"routeKey"</span><span class="p">:</span><span class="w"> </span><span class="s2">"POST /chat"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"rawPath"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/chat"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"requestContext"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"accountId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"123456789012"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"apiId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc123xyz"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"domainName"</span><span class="p">:</span><span class="w"> </span><span class="s2">"api.lemaire.tel"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"requestId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc-123-def-456"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"http"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"POST"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/chat"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"protocol"</span><span class="p">:</span><span class="w"> </span><span class="s2">"HTTP/1.1"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"sourceIp"</span><span class="p">:</span><span class="w"> </span><span class="s2">"203.0.113.42"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"userAgent"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Mozilla/5.0..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"10/Jan/2026:14:23:45 +0000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"timeEpoch"</span><span class="p">:</span><span class="w"> </span><span class="mi">1736517825000</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"headers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"content-type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/json"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"host"</span><span class="p">:</span><span class="w"> </span><span class="s2">"api.lemaire.tel"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"origin"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://chat.lemaire.tel"</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"body"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">messages</span><span class="se">\"</span><span class="s2">:[{</span><span class="se">\"</span><span class="s2">role</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">user</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">text</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">What is your experience with AWS?</span><span class="se">\"</span><span class="s2">}]}"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"isBase64Encoded"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Note</strong>: Simplified for readability. Actual events include additional fields and more headers from the browser.</p>

<h3 id="step-3-lambda-processing">Step 3: Lambda Processing</h3>

<p>Lambda validates, truncates, and extracts the request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># lambda_function.py
</span><span class="n">body</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">event</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"body"</span><span class="p">,</span> <span class="s">"{}"</span><span class="p">))</span>

<span class="c1"># Validation: Pydantic ensures schema validity. Malformed requests fail early.
</span><span class="n">chat_request</span> <span class="o">=</span> <span class="n">ChatRequest</span><span class="p">(</span><span class="o">**</span><span class="n">body</span><span class="p">)</span>

<span class="c1"># Truncation: Only last 20 messages retained to limit token usage and costs.
# Older context is irrelevant for immediate queries.
</span><span class="n">messages</span> <span class="o">=</span> <span class="n">chat_request</span><span class="p">.</span><span class="n">messages</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">20</span><span class="p">:</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>

<span class="c1"># Extract question and invoke RAG pipeline
</span><span class="n">question</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">text</span>
<span class="n">answer</span> <span class="o">=</span> <span class="n">run_rag_pipeline</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>  <span class="c1"># Invokes LangGraph workflow
</span></code></pre></div></div>

<h3 id="step-4-langgraph-internal-state">Step 4: LangGraph Internal State</h3>

<p>The RAG pipeline (<code class="language-plaintext highlighter-rouge">rag/pipeline.py</code>) implements a LangGraph state machine with two nodes:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">retrieve_node</code>: Calls DynamoDBRetriever, fetches chunks, flattens to context string</li>
  <li><code class="language-plaintext highlighter-rouge">generate_node</code>: Injects context into System Prompt, calls Bedrock</li>
</ul>

<p>State flows through the workflow:</p>

<p><strong>Initial State (after retrieve_node):</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is your experience with AWS?"</span><span class="p">,</span>
  <span class="s">"context"</span><span class="p">:</span> <span class="s">"Jeremy Lemaire</span><span class="se">\n\n</span><span class="s">AWS Solutions Architect..."</span><span class="p">,</span>
  <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"You are a helpful assistant representing Jeremy..."</span><span class="p">),</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"What is your experience with AWS?"</span><span class="p">)</span>
  <span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>After generate_node:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is your experience with AWS?"</span><span class="p">,</span>
  <span class="s">"context"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">,</span>
  <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"You are a helpful assistant..."</span><span class="p">),</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"What is your experience with AWS?"</span><span class="p">),</span>
    <span class="n">AIMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"I have 8 years of experience as an AWS Solutions Architect..."</span><span class="p">)</span>
  <span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="step-5-dynamodb-query-internal">Step 5: DynamoDB Query (Internal)</h3>

<p>The embedding for “What is your experience with AWS?” is computed:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query_embedding</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.123</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.789</span><span class="p">,</span> <span class="p">...]</span>  <span class="c1"># 1024-dimensional vector
</span></code></pre></div></div>

<p>DynamoDB scan retrieves all items and computes cosine similarity client-side:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Results sorted by similarity score
</span><span class="p">[</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_0_abc123"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Jeremy Lemaire</span><span class="se">\n\n</span><span class="s">AWS Solutions Architect..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.87</span>
  <span class="p">},</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_3_def456"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Certifications:</span><span class="se">\n</span><span class="s">- AWS Certified Solutions Architect..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.82</span>
  <span class="p">},</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_1_ghi789"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Technical Skills:</span><span class="se">\n</span><span class="s">- Python, Terraform, AWS Lambda..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.78</span>
  <span class="p">}</span>
<span class="p">]</span>
<span class="c1"># Top 3 are concatenated into the context string
</span></code></pre></div></div>

<h3 id="step-6-bedrock-api-call-internal">Step 6: Bedrock API Call (Internal)</h3>

<p>Generate Node calls Amazon Bedrock:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"modelId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us.amazon.nova-lite-v2:0"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a Virtual Clone representing Jeremy Lemaire. Answer ONLY using information from the CONTEXT below.</span><span class="se">\n\n</span><span class="s2">CONTEXT:</span><span class="se">\n</span><span class="s2">Jeremy Lemaire</span><span class="se">\n\n</span><span class="s2">AWS Solutions Architect..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is your experience with AWS?"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"inferenceConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"temperature"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
    </span><span class="nl">"maxTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">500</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Bedrock Response:</strong></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"output"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"message"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I have 8 years of experience as an AWS Solutions Architect..."</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"usage"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"inputTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">234</span><span class="p">,</span><span class="w">
    </span><span class="nl">"outputTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">47</span><span class="p">,</span><span class="w">
    </span><span class="nl">"totalTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">281</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="architecture--lifecycle-cold-vs-warm-start">Architecture &amp; Lifecycle: Cold vs Warm Start</h2>

<h3 id="cold-start-4-seconds">Cold Start (~4 seconds)</h3>
<p>Occurs when no active execution environment exists (after ~15 minutes of inactivity).</p>
<ol>
  <li><strong>Python Imports</strong>: <code class="language-plaintext highlighter-rouge">import langchain</code> (~1.5s).</li>
  <li><strong>Module Level Initialization</strong>: Global singleton variables initialized.</li>
  <li><strong>Handler Execution</strong>:
    <ul>
      <li>Detects <code class="language-plaintext highlighter-rouge">_vector_store is None</code>.</li>
      <li><strong>Initialization</strong>: Establishes DynamoDB connection (SSL handshake ~0.5s), compiles graph (~0.5s).</li>
      <li><strong>Total Latency</strong>: ~4s</li>
    </ul>
  </li>
</ol>

<h3 id="warm-start-400ms">Warm Start (&lt;400ms)</h3>
<p>Occurs on subsequent requests to the same container.</p>
<ol>
  <li><strong>Container State</strong>: Memory preserved.</li>
  <li><strong>No Imports</strong>: Modules already loaded.</li>
  <li><strong>Persisted Globals</strong>: <code class="language-plaintext highlighter-rouge">_vector_store</code> already initialized.
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_retriever</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">_vector_store</span>
    <span class="k">if</span> <span class="n">_vector_store</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">_vector_store</span>  <span class="c1"># Immediate return
</span></code></pre></div>    </div>
  </li>
  <li><strong>Execution</strong>: Direct <code class="language-plaintext highlighter-rouge">graph.invoke()</code>.
    <ul>
      <li>Runtime limited to API I/O: ~20ms DynamoDB scan + ~300ms Bedrock generation.</li>
    </ul>
  </li>
</ol>

<p><strong>Optimization</strong>: Module-level global variables reuse connections across invocations.</p>

<hr />

<h2 id="technical-key-points">Technical Key Points</h2>

<h3 id="dynamodb-as-a-vector-store">DynamoDB as a Vector Store</h3>

<p>Vector search requires comparing the query against <strong>every single document</strong> to determine semantic proximity.</p>

<ul>
  <li><strong>Implementation</strong>: Full table scan with client-side cosine similarity</li>
  <li><strong>Process</strong>: Loads all 50 chunks into memory (~10ms) and computes cosine similarity in Python</li>
  <li><strong>Scalability</strong>: Effective up to ~1000 chunks</li>
</ul>

<p><strong>Why this works</strong>: For &lt;1000 items, brute-force scanning is practically free and requires zero maintenance. 1000 chunks × 4KB = 4MB data. DynamoDB scans 1MB per request = 4 round-trips (~60ms) + Python cosine calculation (~40ms) = ~100ms total.</p>
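<p>The ranking step can be sketched in isolation, with the DynamoDB scan replaced by an in-memory list (the chunk data and 4-dimensional vectors here are illustrative stand-ins for the real 1024-dimensional embeddings):</p>

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); assumes non-zero vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_embedding, chunks, k=3):
    # chunks: list of {"id", "text", "embedding"} dicts, as a full scan returns them
    scored = [
        {**c, "similarity": cosine_similarity(query_embedding, c["embedding"])}
        for c in chunks
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored[:k]

chunks = [
    {"id": "doc_0", "text": "AWS experience...", "embedding": [0.9, 0.1, 0.0, 0.2]},
    {"id": "doc_1", "text": "Python skills...", "embedding": [0.1, 0.8, 0.3, 0.0]},
    {"id": "doc_2", "text": "Certifications...", "embedding": [0.7, 0.2, 0.1, 0.3]},
]
results = top_k([1.0, 0.0, 0.0, 0.1], chunks, k=2)
```

Everything happens client-side in plain Python; the only AWS-specific part is where the `chunks` list comes from.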

<p><strong>When it doesn’t</strong>: Beyond 1000 chunks, you need Approximate Nearest Neighbor (ANN) algorithms via AWS OpenSearch, ChromaDB, or PostgreSQL with pgvector. ANN reduces search from O(n) to O(log n).</p>

<h3 id="binary-embedding-optimization">Binary Embedding Optimization</h3>
<p>Embeddings are <strong>1024 floats</strong> (Titan V2 default).</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="mf">0.123456789</span><span class="p">,</span><span class="w"> </span><span class="mf">0.23456789</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="p">]</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">JSON</span><span class="w"> </span><span class="err">representation</span><span class="w"> </span><span class="err">≈</span><span class="w"> </span><span class="mi">12</span><span class="err">KB</span><span class="w"> </span><span class="err">per</span><span class="w"> </span><span class="err">row</span><span class="w">
</span></code></pre></div></div>

<p>Packed into binary using <code class="language-plaintext highlighter-rouge">struct.pack</code>:</p>
<ul>
  <li>Standard float = 4 bytes. 1024 * 4 = <strong>4KB</strong></li>
  <li><strong>Result</strong>: 3x reduction in storage size and read throughput costs</li>
  <li><strong>Impact</strong>: Saves ~8GB for 1M rows</li>
</ul>
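<p>A minimal sketch of the packing round-trip (the helper names are mine, not the project’s; the essential part is the <code class="language-plaintext highlighter-rouge">struct</code> format string):</p>

```python
import struct

DIM = 1024  # Titan V2 embedding dimension

def pack_embedding(vector):
    # 1024 float32 values -> 4096-byte binary blob for the DynamoDB item
    return struct.pack(f"{len(vector)}f", *vector)

def unpack_embedding(blob):
    # Inverse: binary blob -> list of floats (at float32 precision)
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

vector = [0.001 * i for i in range(DIM)]
blob = pack_embedding(vector)        # 4KB, vs ~12KB as JSON text
restored = unpack_embedding(blob)
```

Note that `"f"` is single-precision: values round-trip only to float32 accuracy, which is far more precision than cosine similarity needs.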

<h3 id="langgraph-vs-langchain">LangGraph vs LangChain</h3>

<p><strong>LangChain (The “Traditional” Way)</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Hidden state flow. A bit opaque
</span><span class="n">chain</span> <span class="o">=</span> <span class="n">retriever</span> <span class="o">|</span> <span class="n">prompt</span> <span class="o">|</span> <span class="n">llm</span> <span class="o">|</span> <span class="n">parser</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">chain</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="s">"question"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>With LangGraph</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Clearly defined state schema
</span><span class="k">class</span> <span class="nc">GraphState</span><span class="p">(</span><span class="n">TypedDict</span><span class="p">):</span>
    <span class="n">question</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">context</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">messages</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">BaseMessage</span><span class="p">]</span>

<span class="c1"># Pure function nodes
</span><span class="k">def</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="n">retriever</span><span class="p">.</span><span class="n">get_relevant_documents</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"context"</span><span class="p">:</span> <span class="n">format_docs</span><span class="p">(</span><span class="n">docs</span><span class="p">)}</span>

<span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">prompt</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">context</span><span class="o">=</span><span class="n">state</span><span class="p">[</span><span class="s">"context"</span><span class="p">]))</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span><span class="n">response</span><span class="p">]}</span>

<span class="c1"># Explicit control flow
</span><span class="n">workflow</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s">"retrieve"</span><span class="p">,</span> <span class="s">"generate"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Advantage</strong>: Full visibility into data transformations at every step.</p>

<hr />

<h2 id="few-performance-tricks">A Few Performance Tricks</h2>

<p>A bit of over-engineering for what I need, but it was fun to dig into.</p>

<h3 id="l1-cold-start-optimization">L1 Cold Start Optimization</h3>

<p>AWS uses <strong>Firecracker</strong> micro-VMs for Lambda execution environments. Once created, the environment is frozen and reused. I exploit this by using global variables to persist state:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># src/rag/dynamodb_retriever.py
</span><span class="n">_vector_store</span> <span class="o">=</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">get_vector_store</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">_vector_store</span>
    <span class="k">if</span> <span class="n">_vector_store</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c1"># EXPENSIVE: Runs only on new environment (~4s)
</span>        <span class="n">_vector_store</span> <span class="o">=</span> <span class="n">DynamoDBVectorStore</span><span class="p">(...)</span>
    <span class="k">return</span> <span class="n">_vector_store</span>
</code></pre></div></div>

<p>This avoids re-paying the <strong>~4-second</strong> initialization on every warm start.</p>

<h3 id="context-sliding-window">Context Sliding Window</h3>

<p>Unbounded conversation history causes <strong>Token Explosion</strong>.</p>

<p>If I send 100 messages of history, the 101st request pays for processing all 100 previous turns. Per-request cost grows linearly with conversation length.</p>

<p><strong>Solution</strong>: Strict cap at 20 messages.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">20</span><span class="p">:</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
</code></pre></div></div>

<p>Deterministic cost ceiling: Max cost per turn = (20 msgs × avg_tokens) + new_query</p>
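<p>The cap is easy to sanity-check in isolation (plain strings stand in for LangChain message objects here):</p>

```python
def apply_sliding_window(messages, max_messages=20):
    # Keep only the most recent messages; the oldest are dropped first
    if len(messages) > max_messages:
        return messages[-max_messages:]
    return messages

history = [f"msg_{i}" for i in range(25)]
window = apply_sliding_window(history)
# 25 messages in: the oldest 5 are dropped, the newest 20 kept
```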

<h3 id="rag-hyperparameters">RAG Hyperparameters</h3>

<h4 id="top-k--3">Top K = 3</h4>
<ul>
  <li><strong>Why not 1?</strong> Markdown splitting separates headers from content. Retrieving 3 captures surrounding semantic hierarchy.</li>
  <li><strong>Why not 10?</strong> Signal-to-noise ratio degrades. LLMs are known to ignore information buried in the middle (“Lost in the Middle” problem). 10 chunks = 3x more input tokens.</li>
</ul>

<h4 id="temperature--01">Temperature = 0.1</h4>
<ul>
  <li><strong>Goal</strong>: Determinism.</li>
  <li><strong>Logic</strong>: The LLM should act as a <strong>Retrieval Engine</strong>, not a Creative Writer.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">0.1</code>: “According to the text, Jeremy studied AWS.” (Fact)</li>
      <li><code class="language-plaintext highlighter-rouge">0.9</code>: “Jeremy, a cloud wizard, soared through the AWS skies…” (Hallucination)</li>
    </ul>
  </li>
</ul>

<h3 id="aws-adaptive-retries">AWS Adaptive Retries</h3>

<p>Naive fixed-interval retries make AWS throttling worse (the thundering-herd problem). Adaptive mode backs off dynamically instead:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">botocore.config</span> <span class="kn">import</span> <span class="n">Config</span>

<span class="n">BEDROCK_RETRY_CONFIG</span> <span class="o">=</span> <span class="n">Config</span><span class="p">(</span>
    <span class="n">retries</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'max_attempts'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'mode'</span><span class="p">:</span> <span class="s">'adaptive'</span>  <span class="c1"># Dynamic backoff for HTTP 429
</span>    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>

<h3 id="aws-x-ray-tracing">AWS X-Ray Tracing</h3>

<p>Lambda X-Ray is enabled:</p>
<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">resource</span> <span class="s2">"aws_lambda_function"</span> <span class="s2">"virtual_me"</span> <span class="p">{</span>
  <span class="nx">tracing_config</span> <span class="p">{</span>
    <span class="nx">mode</span> <span class="p">=</span> <span class="s2">"Active"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This automatically traces:</p>
<ul>
  <li>Lambda execution time and cold starts</li>
  <li>DynamoDB Scan operations with latency</li>
  <li>Bedrock InvokeModel calls with token counts and latency</li>
  <li>All boto3 SDK calls</li>
</ul>

<p><strong>No Python SDK required</strong> - Lambda’s X-Ray integration handles it automatically.</p>

<p><strong>API Gateway Limitation</strong>: I use <strong>HTTP API (v2)</strong> instead of REST API (v1) because:</p>
<ul>
  <li><strong>70% cheaper</strong>: $1.00/million vs $3.50/million requests</li>
  <li><strong>Simpler CORS</strong>: Native configuration vs manual OPTIONS handling</li>
</ul>

<p>Trade-off: HTTP API v2 does <strong>not support X-Ray tracing</strong>. Only Lambda traces are captured.</p>

<hr />

<h2 id="resource-utilization">Resource Utilization</h2>

<h3 id="lambda-memory-512mb">Lambda Memory: 512MB</h3>

<p><strong>Breakdown:</strong></p>
<ul>
  <li>Python Runtime + Boto3: ~120MB</li>
  <li>LangChain + Dependencies: ~180MB</li>
  <li>Graph Compilation &amp; Working State: ~50MB</li>
  <li><strong>Total Overhead</strong>: ~350MB</li>
  <li><strong>Safety Margin</strong>: ~160MB (for embedding processing and JSON overhead)</li>
</ul>

<p>Allocating less than 512MB triggers <code class="language-plaintext highlighter-rouge">Memory Limit Exceeded</code> during LangChain initialization.</p>

<hr />

<h2 id="data-storage-strategy">Data Storage Strategy</h2>

<h3 id="dynamodb-schema">DynamoDB Schema</h3>

<p><strong>Item Structure:</strong></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"doc_0_abc123"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I have 8 years..."</span><span class="p">,</span><span class="w">       </span><span class="err">//</span><span class="w"> </span><span class="err">Retrieved</span><span class="w"> </span><span class="err">&amp;</span><span class="w"> </span><span class="err">sent</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">LLM</span><span class="w">
  </span><span class="nl">"embedding"</span><span class="p">:</span><span class="w"> </span><span class="err">&lt;Binary</span><span class="w"> </span><span class="err">Blob&gt;</span><span class="p">,</span><span class="w">    </span><span class="err">//</span><span class="w"> </span><span class="err">Search</span><span class="w"> </span><span class="err">key</span><span class="w"> </span><span class="err">(cosine</span><span class="w"> </span><span class="err">similarity)</span><span class="w">
  </span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">source</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">...</span><span class="se">\"</span><span class="s2">}"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="dynamodb-vs-dedicated-vector-db">DynamoDB vs Dedicated Vector DB</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: left">DynamoDB (My Approach)</th>
      <th style="text-align: left">Vector DB (Chroma)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Search Logic</strong></td>
      <td style="text-align: left">Client-side (Python) - fetch everything, compute similarity</td>
      <td style="text-align: left">Server-side - DB engine finds neighbors</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Scalability</strong></td>
      <td style="text-align: left">O(n) - slower as data grows</td>
      <td style="text-align: left">O(log n) - instant even with millions of rows</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Cost</strong></td>
      <td style="text-align: left">High read cost (pay to read every row)</td>
      <td style="text-align: left">Optimized for search</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Maintenance</strong></td>
      <td style="text-align: left">Zero</td>
      <td style="text-align: left">Run dedicated cluster ($hundreds/month)</td>
    </tr>
  </tbody>
</table>

<p><strong>Why DynamoDB</strong>: “Serverless Poor Man’s Vector DB”. For &lt;1000 items, brute-force scanning is practically free and requires zero maintenance.</p>

<hr />

<h2 id="cost-estimate">Cost Estimate</h2>

<p>For personal use (~100 conversations/month): <strong>~$0.60-$1.00/month</strong></p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>Estimated Cost</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Lambda</td>
      <td>$0</td>
      <td>Free tier: 1M requests/month</td>
    </tr>
    <tr>
      <td>API Gateway</td>
      <td>$0</td>
      <td>Free tier: 1M requests/month</td>
    </tr>
    <tr>
      <td>DynamoDB</td>
      <td>$0</td>
      <td>Free tier: 25 RCUs/WCUs, 25GB storage</td>
    </tr>
    <tr>
      <td>Bedrock (Nova Lite)</td>
      <td>$0.10 - $0.30</td>
      <td>~700 queries × 2K input + 300 output tokens</td>
    </tr>
    <tr>
      <td>S3 + CloudFront</td>
      <td>$0.05 - $0.15</td>
      <td>Static frontend hosting</td>
    </tr>
    <tr>
      <td>Route53</td>
      <td>$0.50</td>
      <td>1 hosted zone</td>
    </tr>
  </tbody>
</table>

<p><strong>Idle Cost</strong>: $0. Pure pay-per-request serverless architecture.</p>

<p><strong>Pricing reference</strong> (Nova Lite): $0.06/1M input tokens, $0.24/1M output tokens.</p>
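<p>Plugging the estimates above into those prices confirms the Bedrock line item (query volume and token counts are the rough figures from the table, not measurements):</p>

```python
# Nova Lite pricing, USD per 1M tokens
INPUT_PRICE = 0.06
OUTPUT_PRICE = 0.24

queries = 700
input_tokens = queries * 2_000   # ~2K input tokens per query (prompt + context)
output_tokens = queries * 300    # ~300 output tokens per answer

cost = (input_tokens / 1_000_000) * INPUT_PRICE \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE
# 1.4M input -> $0.084, 0.21M output -> $0.0504, total ≈ $0.13
```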

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Traditional resumes are broken. They don’t capture context, trade-offs, or technical depth. They’re boring documents that force everyone to play guessing games.</p>

<p>The architecture is fully serverless:</p>
<ul>
  <li><strong>Compute</strong>: Lambda (pay-per-request)</li>
  <li><strong>Storage</strong>: DynamoDB (pay-per-request)</li>
  <li><strong>Inference</strong>: Bedrock (pay-per-token)</li>
  <li><strong>Idle Cost</strong>: $0</li>
</ul>

<p>It’s cheap, fast (400ms warm start), and actually represents how I think about my work - with context and technical nuance.</p>

<p>Try it: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<p>Source code: <a href="https://github.com/ox00004a/virtualme">github.com/ox00004a/virtualme</a></p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The Resume Problem]]></summary></entry><entry><title type="html">Migrating Spring PetClinic to DDD with Spring Modulith and jMolecules</title><link href="https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith.html" rel="alternate" type="text/html" title="Migrating Spring PetClinic to DDD with Spring Modulith and jMolecules" /><published>2025-12-30T00:00:00+00:00</published><updated>2025-12-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith.html"><![CDATA[<p>The classic Spring PetClinic is everyone’s first Spring app. Simple, clean, easy to follow. But as codebases grow, that simplicity tends to erode. Business logic scatters across controllers and services. Changes ripple unpredictably. New developers take months to become productive.</p>

<p>I modernized PetClinic as a reference implementation for Domain-Driven Design using Spring Modulith, jMolecules, ByteBuddy, and ArchUnit. Here’s what I learned.</p>

<h2 id="why-ddd-now">Why DDD Now?</h2>

<p>Eric Evans published the Blue Book in 2003. For years, applying DDD to Spring/Hibernate meant wrestling with anemic domain models: entities reduced to data containers with getters and setters, business logic scattered across service layers.</p>

<p>That’s changed. Thanks to <a href="https://odrotbohm.de/">Oliver Drotbohm</a> and the work on Spring Modulith and jMolecules, we can build rich domain models with proper encapsulation, enforce module boundaries at compile time, and prepare a monolith for eventual microservice extraction, all while keeping Spring productive.</p>

<p>DDD is seeing a renaissance for three reasons:</p>

<ul>
  <li><strong>Microservices need boundaries.</strong> Teams discovered that decomposing monoliths without clear domain boundaries leads to distributed monoliths. DDD’s Bounded Contexts provide a principled way to define service boundaries.</li>
  <li><strong>Event-driven architecture.</strong> Modern systems communicate through events. DDD’s Domain Events pattern maps directly to event sourcing and message-driven microservices.</li>
  <li><strong>Better tooling.</strong> Frameworks like Spring Modulith and jMolecules finally make DDD practical in Java without fighting the framework.</li>
</ul>

<h2 id="the-architecture">The Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────┐
│                    Spring PetClinic                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────┐         ┌──────────────────┐          │
│  │  Owner Module    │         │   Vet Module     │          │
│  │                  │ events  │                  │          │
│  │  - Owner         │────────▶│  - Vet           │          │
│  │  - Pet           │         │  - Specialty     │          │
│  │  - Visit         │         │  - Patient       │          │
│  │  - PetType       │         │    Tracking      │          │
│  └────────┬─────────┘         └────────┬─────────┘          │
│           │                            │                     │
│           ▼                            ▼                     │
│  ┌─────────────────────────────────────────────────┐        │
│  │              Shared Kernel (model)               │        │
│  │         Person, PersonName, NamedEntity          │        │
│  └─────────────────────────────────────────────────┘        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Modules communicate through events, not direct calls. When you’re ready to extract a microservice, the boundaries are already clean.</p>

<h2 id="the-stack">The Stack</h2>

<p>Four pieces work together:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Spring Modulith</strong></td>
      <td>Defines and verifies module boundaries</td>
    </tr>
    <tr>
      <td><strong>jMolecules</strong></td>
      <td>DDD building blocks as interfaces (Entity, ValueObject, AggregateRoot)</td>
    </tr>
    <tr>
      <td><strong>ByteBuddy</strong></td>
      <td>Weaves JPA annotations at compile time so domain classes stay clean</td>
    </tr>
    <tr>
      <td><strong>ArchUnit</strong></td>
      <td>Enforces architecture rules in tests</td>
    </tr>
  </tbody>
</table>

<h2 id="tactical-ddd-building-blocks">Tactical DDD Building Blocks</h2>

<h3 id="type-safe-identifiers">Type-Safe Identifiers</h3>

<p>Primitive obsession, using <code class="language-plaintext highlighter-rouge">Integer</code> or <code class="language-plaintext highlighter-rouge">Long</code> for IDs, leads to accidentally mixing different entity IDs. Wrap identifiers in type-safe records:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">PetId</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"id"</span><span class="o">)</span> <span class="no">UUID</span> <span class="n">value</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">Identifier</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="nf">PetId</span><span class="o">()</span> <span class="o">{</span> <span class="k">this</span><span class="o">(</span><span class="no">UUID</span><span class="o">.</span><span class="na">randomUUID</span><span class="o">());</span> <span class="o">}</span>
<span class="o">}</span>

<span class="kd">public</span> <span class="n">record</span> <span class="nf">OwnerId</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"id"</span><span class="o">)</span> <span class="no">UUID</span> <span class="n">value</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">Identifier</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="nf">OwnerId</span><span class="o">()</span> <span class="o">{</span> <span class="k">this</span><span class="o">(</span><span class="no">UUID</span><span class="o">.</span><span class="na">randomUUID</span><span class="o">());</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Now you can’t pass an <code class="language-plaintext highlighter-rouge">OwnerId</code> where a <code class="language-plaintext highlighter-rouge">PetId</code> is expected. Compile-time safety.</p>
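<p>A quick, self-contained sketch of what this buys you (plain-Java copies of the records above, minus the JPA mapping; <code class="language-plaintext highlighter-rouge">findPet</code> is illustrative):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.UUID;

public class IdDemo {
    // Mirrors the article's records, without @Column / Identifier
    record PetId(UUID value) { PetId() { this(UUID.randomUUID()); } }
    record OwnerId(UUID value) { OwnerId() { this(UUID.randomUUID()); } }

    static String findPet(PetId id) { return "looking up pet " + id.value(); }

    public static void main(String[] args) {
        System.out.println(findPet(new PetId()));   // fine
        // findPet(new OwnerId());                  // does not compile: incompatible types
    }
}
</code></pre></div></div>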

<h3 id="value-objects">Value Objects</h3>

<p>Create immutable Value Objects that encapsulate both data and behavior:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">BirthDate</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"birth_date"</span><span class="o">)</span> <span class="nc">LocalDate</span> <span class="n">date</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">ValueObject</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="nc">BirthDate</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">date</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Birth date must not be null"</span><span class="o">);</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">date</span><span class="o">.</span><span class="na">isAfter</span><span class="o">(</span><span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">()))</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Birth date cannot be in the future"</span><span class="o">);</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">int</span> <span class="nf">getAgeInYears</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">return</span> <span class="nc">Period</span><span class="o">.</span><span class="na">between</span><span class="o">(</span><span class="n">date</span><span class="o">,</span> <span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">()).</span><span class="na">getYears</span><span class="o">();</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isElderly</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="n">getAgeInYears</span><span class="o">()</span> <span class="o">&gt;=</span> <span class="mi">7</span><span class="o">;</span> <span class="o">}</span>
    <span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isPuppy</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="n">getAgeInYears</span><span class="o">()</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="o">;</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Self-validating. Behavior lives with the data. No more validation logic scattered across services.</p>
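<p>Here is the record in action as a self-contained sketch (the <code class="language-plaintext highlighter-rouge">@Column</code> annotation is omitted so it runs as plain Java):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.time.LocalDate;
import java.time.Period;

public class BirthDateDemo {
    // Plain-Java copy of the record above, minus the JPA mapping
    record BirthDate(LocalDate date) {
        BirthDate {
            if (date == null) throw new IllegalArgumentException("Birth date must not be null");
            if (date.isAfter(LocalDate.now())) throw new IllegalArgumentException("Birth date cannot be in the future");
        }
        int getAgeInYears() { return Period.between(date, LocalDate.now()).getYears(); }
        boolean isElderly() { return getAgeInYears() &gt;= 7; }
    }

    public static void main(String[] args) {
        BirthDate senior = new BirthDate(LocalDate.now().minusYears(8));
        System.out.println(senior.isElderly()); // prints true

        // Invalid state is unrepresentable: construction itself throws
        try {
            new BirthDate(LocalDate.now().plusDays(1));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
</code></pre></div></div>

<p>The invariant travels with the type: any code holding a <code class="language-plaintext highlighter-rouge">BirthDate</code> can trust it is valid.</p>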

<h3 id="entities-and-aggregates">Entities and Aggregates</h3>

<p>An <strong>Entity</strong> has identity that persists across state changes. An <strong>Aggregate</strong> is a cluster of entities and value objects treated as a single consistency unit. One entity is the <strong>Aggregate Root</strong>—all access goes through it.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Owner is the Aggregate Root</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Owner</span> <span class="kd">extends</span> <span class="nc">Person</span> <span class="kd">implements</span> <span class="nc">AggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">OwnerId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">OwnerId</span> <span class="n">id</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OwnerId</span><span class="o">();</span>
    <span class="kd">private</span> <span class="nc">Set</span><span class="o">&lt;</span><span class="nc">Pet</span><span class="o">&gt;</span> <span class="n">pets</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">LinkedHashSet</span><span class="o">&lt;&gt;();</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addPet</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">pets</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="nc">Pet</span> <span class="nf">getPet</span><span class="o">(</span><span class="nc">String</span> <span class="n">name</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="n">pets</span><span class="o">.</span><span class="na">stream</span><span class="o">()</span>
            <span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">p</span> <span class="o">-&gt;</span> <span class="n">p</span><span class="o">.</span><span class="na">getName</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">name</span><span class="o">))</span>
            <span class="o">.</span><span class="na">findFirst</span><span class="o">()</span>
            <span class="o">.</span><span class="na">orElse</span><span class="o">(</span><span class="kc">null</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>

<span class="c1">// Pet is an Entity within the Owner aggregate</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">extends</span> <span class="nc">NamedEntity</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">PetId</span> <span class="n">id</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PetId</span><span class="o">();</span>
    <span class="kd">private</span> <span class="nc">BirthDate</span> <span class="n">birthDateValue</span><span class="o">;</span>
    <span class="kd">private</span> <span class="nc">Set</span><span class="o">&lt;</span><span class="nc">Visit</span><span class="o">&gt;</span> <span class="n">visits</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">LinkedHashSet</span><span class="o">&lt;&gt;();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Rules: Only the Aggregate Root has a repository. External objects reference the aggregate by ID only. Invariants are enforced within the aggregate boundary.</p>
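<p>In code, "only the Aggregate Root has a repository" looks roughly like this (a sketch assuming Spring Data JPA; <code class="language-plaintext highlighter-rouge">OwnerRepository</code> is illustrative):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

// The aggregate root gets a repository...
@Repository
public interface OwnerRepository extends JpaRepository&lt;Owner, OwnerId&gt; {
}

// ...but there is deliberately no PetRepository. Pets are loaded and
// saved only through their Owner, so aggregate invariants can't be bypassed:
//
//   Owner owner = owners.findById(ownerId).orElseThrow();
//   owner.addPet(pet);
//   owners.save(owner);
</code></pre></div></div>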

<h3 id="cross-aggregate-references-with-association">Cross-Aggregate References with Association</h3>

<p>Direct references between aggregates create tight coupling. If <code class="language-plaintext highlighter-rouge">Pet</code> holds a direct reference to <code class="language-plaintext highlighter-rouge">PetType</code>, changes to <code class="language-plaintext highlighter-rouge">PetType</code> can break <code class="language-plaintext highlighter-rouge">Pet</code>. Worse, JPA eagerly loads the entire object graph.</p>

<p>Use <code class="language-plaintext highlighter-rouge">Association&lt;T, ID&gt;</code> to hold only the ID reference:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="c1">// Don't do this - crosses aggregate boundary</span>
    <span class="c1">// private PetType type;</span>

    <span class="c1">// Do this - store only the reference</span>
    <span class="kd">private</span> <span class="nc">Association</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">type</span><span class="o">;</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">setType</span><span class="o">(</span><span class="nc">PetType</span> <span class="n">type</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">=</span> <span class="n">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="nc">Association</span><span class="o">.</span><span class="na">forAggregate</span><span class="o">(</span><span class="n">type</span><span class="o">)</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="nc">PetTypeId</span> <span class="nf">getTypeId</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span><span class="o">.</span><span class="na">getId</span><span class="o">()</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="resolving-associations-with-associationresolver">Resolving Associations with AssociationResolver</h4>

<p>When you need the actual <code class="language-plaintext highlighter-rouge">PetType</code> object, resolve it explicitly using <code class="language-plaintext highlighter-rouge">AssociationResolver</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">Association</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">type</span><span class="o">;</span>

    <span class="c1">// Resolve when needed - caller provides the resolver</span>
    <span class="kd">public</span> <span class="nc">PetType</span> <span class="nf">resolveType</span><span class="o">(</span><span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">resolver</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="n">resolver</span><span class="o">.</span><span class="na">resolve</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">type</span><span class="o">).</span><span class="na">orElse</span><span class="o">(</span><span class="kc">null</span><span class="o">)</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The repository implements <code class="language-plaintext highlighter-rouge">AssociationResolver</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Repository</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">PetTypeRepository</span>
        <span class="kd">extends</span> <span class="nc">JpaRepository</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;,</span>
                <span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="o">{</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Usage in application service:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">PetApplicationService</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">PetTypeRepository</span> <span class="n">petTypes</span><span class="o">;</span>

    <span class="kd">public</span> <span class="nc">PetTypeInfo</span> <span class="nf">getPetTypeInfo</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">PetType</span> <span class="n">type</span> <span class="o">=</span> <span class="n">pet</span><span class="o">.</span><span class="na">resolveType</span><span class="o">(</span><span class="n">petTypes</span><span class="o">);</span>  <span class="c1">// Explicit resolution</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">PetTypeInfo</span><span class="o">(</span><span class="n">type</span><span class="o">.</span><span class="na">getName</span><span class="o">());</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="why-this-pattern-matters">Why This Pattern Matters</h4>

<table>
  <thead>
    <tr>
      <th>Benefit</th>
      <th>Explanation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Aggregate boundaries respected</strong></td>
      <td><code class="language-plaintext highlighter-rouge">Pet</code> doesn’t hold a direct reference to <code class="language-plaintext highlighter-rouge">PetType</code>—only its ID</td>
    </tr>
    <tr>
      <td><strong>Type safety</strong></td>
      <td><code class="language-plaintext highlighter-rouge">AssociationResolver&lt;PetType, PetTypeId&gt;</code> ensures you can’t accidentally resolve to wrong type</td>
    </tr>
    <tr>
      <td><strong>Explicit dependencies</strong></td>
      <td>Resolution requires injecting the resolver—no hidden database calls</td>
    </tr>
    <tr>
      <td><strong>Testable</strong></td>
      <td>Easy to mock <code class="language-plaintext highlighter-rouge">AssociationResolver</code> in unit tests</td>
    </tr>
    <tr>
      <td><strong>Lazy loading</strong></td>
      <td>Associations are resolved only when explicitly requested, not eagerly by JPA</td>
    </tr>
    <tr>
      <td><strong>jMolecules standard</strong></td>
      <td>Follows the framework’s best practices for cross-aggregate references</td>
    </tr>
  </tbody>
</table>

<h4 id="testing-with-mocked-resolver">Testing with Mocked Resolver</h4>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kt">void</span> <span class="nf">shouldResolvePetType</span><span class="o">()</span> <span class="o">{</span>
    <span class="nc">PetType</span> <span class="n">dog</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PetType</span><span class="o">(</span><span class="s">"Dog"</span><span class="o">);</span>
    <span class="nc">Pet</span> <span class="n">pet</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Pet</span><span class="o">();</span>
    <span class="n">pet</span><span class="o">.</span><span class="na">setType</span><span class="o">(</span><span class="n">dog</span><span class="o">);</span>

    <span class="c1">// Mock the resolver</span>
    <span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">resolver</span> <span class="o">=</span> <span class="n">mock</span><span class="o">(</span><span class="nc">AssociationResolver</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
    <span class="n">when</span><span class="o">(</span><span class="n">resolver</span><span class="o">.</span><span class="na">resolve</span><span class="o">(</span><span class="n">any</span><span class="o">())).</span><span class="na">thenReturn</span><span class="o">(</span><span class="nc">Optional</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">dog</span><span class="o">));</span>

    <span class="nc">PetType</span> <span class="n">resolved</span> <span class="o">=</span> <span class="n">pet</span><span class="o">.</span><span class="na">resolveType</span><span class="o">(</span><span class="n">resolver</span><span class="o">);</span>

    <span class="n">assertThat</span><span class="o">(</span><span class="n">resolved</span><span class="o">.</span><span class="na">getName</span><span class="o">()).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="s">"Dog"</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>No database needed. The domain logic is fully testable in isolation.</p>

<h3 id="domain-events">Domain Events</h3>

<p>Domain Events are the key to decoupling modules. When something significant happens in one module, it publishes an event. Other modules react without knowing about each other.</p>

<h4 id="defining-events">Defining Events</h4>

<p>Events are immutable records of something that happened. Use past tense naming—<code class="language-plaintext highlighter-rouge">PetAdopted</code>, not <code class="language-plaintext highlighter-rouge">AdoptPet</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">PetAdoptedEvent</span><span class="o">(</span>
    <span class="nc">PetId</span> <span class="n">petId</span><span class="o">,</span>
    <span class="nc">PetTypeId</span> <span class="n">petTypeId</span><span class="o">,</span>
    <span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">,</span>
    <span class="nc">LocalDate</span> <span class="n">adoptionDate</span>
<span class="o">)</span> <span class="kd">implements</span> <span class="nc">DomainEvent</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kd">static</span> <span class="nc">PetAdoptedEvent</span> <span class="nf">of</span><span class="o">(</span><span class="nc">PetId</span> <span class="n">petId</span><span class="o">,</span> <span class="nc">PetTypeId</span> <span class="n">petTypeId</span><span class="o">,</span> <span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">PetAdoptedEvent</span><span class="o">(</span><span class="n">petId</span><span class="o">,</span> <span class="n">petTypeId</span><span class="o">,</span> <span class="n">ownerId</span><span class="o">,</span> <span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">());</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Include only the data consumers need. Use IDs, not full entities—keeps events lightweight and avoids coupling to aggregate internals.</p>

<h4 id="publishing-events-two-patterns">Publishing Events: Two Patterns</h4>

<p><strong>Pattern 1: From the Aggregate (Pure DDD)</strong></p>

<p>The aggregate registers events internally. Spring Data publishes them when the aggregate is saved:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Owner</span> <span class="kd">extends</span> <span class="nc">AbstractAggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">&gt;</span>
        <span class="kd">implements</span> <span class="nc">AggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">OwnerId</span><span class="o">&gt;</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addPet</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">pets</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
        <span class="c1">// Register event - published after save() completes</span>
        <span class="n">registerEvent</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">(),</span> <span class="n">pet</span><span class="o">.</span><span class="na">getTypeId</span><span class="o">(),</span> <span class="k">this</span><span class="o">.</span><span class="na">id</span><span class="o">));</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The event is published automatically when <code class="language-plaintext highlighter-rouge">ownerRepository.save(owner)</code> commits. No explicit publisher needed.</p>

<p><strong>Pattern 2: From the Application Service (Pragmatic)</strong></p>

<p>The application service publishes events explicitly:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">PetApplicationService</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">OwnerRepository</span> <span class="n">owners</span><span class="o">;</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">ApplicationEventPublisher</span> <span class="n">events</span><span class="o">;</span>

    <span class="nd">@Transactional</span>
    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">adoptPet</span><span class="o">(</span><span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">,</span> <span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">Owner</span> <span class="n">owner</span> <span class="o">=</span> <span class="n">owners</span><span class="o">.</span><span class="na">findById</span><span class="o">(</span><span class="n">ownerId</span><span class="o">).</span><span class="na">orElseThrow</span><span class="o">();</span>
        <span class="n">owner</span><span class="o">.</span><span class="na">addPet</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
        <span class="n">owners</span><span class="o">.</span><span class="na">save</span><span class="o">(</span><span class="n">owner</span><span class="o">);</span>

        <span class="n">events</span><span class="o">.</span><span class="na">publishEvent</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">(),</span> <span class="n">pet</span><span class="o">.</span><span class="na">getTypeId</span><span class="o">(),</span> <span class="n">ownerId</span><span class="o">));</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>When to use which:</strong></p>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Aggregate</td>
      <td>Event is intrinsic to domain logic; you want pure domain model</td>
    </tr>
    <tr>
      <td>Application Service</td>
      <td>Event depends on application context; you need more control over timing</td>
    </tr>
  </tbody>
</table>

<h4 id="subscribing-to-events">Subscribing to Events</h4>

<p>Use <code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> for cross-module event handling:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">class</span> <span class="nc">VetPatientTrackingService</span> <span class="o">{</span>

    <span class="nd">@ApplicationModuleListener</span>
    <span class="kt">void</span> <span class="nf">onPetAdopted</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"New patient registered: Pet ID="</span> <span class="o">+</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">());</span>
        <span class="c1">// Update vet module's view of patients</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> is Spring Modulith’s annotation that combines <code class="language-plaintext highlighter-rouge">@TransactionalEventListener</code> with <code class="language-plaintext highlighter-rouge">@Async</code> and <code class="language-plaintext highlighter-rouge">@Transactional(propagation = Propagation.REQUIRES_NEW)</code>. The listener runs in a separate thread and a new transaction after the publishing transaction commits.</p>
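<p>Expanded, the meta-annotation is roughly equivalent to this combination (a sketch based on the Spring Modulith reference documentation, not a verbatim copy of its source):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// What @ApplicationModuleListener stands for, spelled out:
@Async                                                  // runs on a separate thread
@Transactional(propagation = Propagation.REQUIRES_NEW)  // in its own transaction
@TransactionalEventListener                             // after the publishing tx commits
void onPetAdopted(PetAdoptedEvent event) {
    // ...
}
</code></pre></div></div>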

<h4 id="transactional-event-handling">Transactional Event Handling</h4>

<p>Understanding transaction boundaries is critical:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────┐
│  Publishing Transaction                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ 1. owner.addPet(pet)                                         │   │
│  │ 2. ownerRepository.save(owner)                               │   │
│  │ 3. Event stored in publication registry (if enabled)        │   │
│  │ 4. COMMIT                                                     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Listener Transaction (separate)                              │   │
│  │ 5. @ApplicationModuleListener receives event                 │   │
│  │ 6. Listener does its work                                    │   │
│  │ 7. COMMIT (or ROLLBACK - doesn't affect publisher)          │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Key insight: listener failures don’t roll back the publishing transaction. The pet is adopted even if the vet notification fails. This is usually what you want—but you need to handle listener failures.</p>

<h4 id="reliable-event-publication">Reliable Event Publication</h4>

<p>What happens if the listener fails? Or the application crashes between publishing and processing? Spring Modulith’s <strong>Event Publication Registry</strong> solves this.</p>

<p>Add the dependency and configure a database-backed registry:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-jpa<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Configuration</span>
<span class="kd">class</span> <span class="nc">ModulithConfig</span> <span class="o">{</span>

    <span class="nd">@Bean</span>
    <span class="nc">ApplicationRunner</span> <span class="nf">eventPublicationRegistrar</span><span class="o">(</span><span class="nc">IncompleteEventPublications</span> <span class="n">incomplete</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="n">args</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="c1">// On startup, retry any incomplete publications</span>
            <span class="n">incomplete</span><span class="o">.</span><span class="na">resubmitIncompletePublications</span><span class="o">(</span><span class="n">publication</span> <span class="o">-&gt;</span> <span class="kc">true</span><span class="o">);</span>
        <span class="o">};</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>How it works:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────┐
│  1. Event Published                                                  │
│     └─&gt; Stored in EVENT_PUBLICATION table with status INCOMPLETE    │
│                                                                      │
│  2. Listener Invoked                                                 │
│     └─&gt; If SUCCESS: Mark publication COMPLETED                      │
│     └─&gt; If FAILURE: Publication remains INCOMPLETE                  │
│                                                                      │
│  3. On Application Restart                                           │
│     └─&gt; resubmitIncompletePublications() retries failed events      │
└─────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>The registry guarantees at-least-once delivery. Listeners must be idempotent—they might receive the same event twice after a crash recovery.</p>
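<p>Idempotency can be as simple as remembering which event ids have already been processed. A plain-Java sketch of the idea (hypothetical names; persistence and Spring wiring omitted so the snippet stands alone):</p>

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical listener: skips events it has already handled,
// so an at-least-once redelivery after a crash is harmless.
class IdempotentPatientRegistrar {

    // In production this would be a database table, not an in-memory set.
    private final Set<UUID> processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean onPetAdopted(UUID eventId, int petId) {
        if (!processed.add(eventId)) {
            return false; // already seen: do nothing on redelivery
        }
        registerPatient(petId);
        return true;
    }

    private void registerPatient(int petId) {
        // side effect goes here, e.g. inserting a vet patient record
    }
}
```

<p>Deduplicating on a stable event id, rather than on payload fields, keeps the check cheap and unambiguous.</p>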

<h4 id="async-vs-sync-processing">Async vs Sync Processing</h4>

<p>By default, <code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> is async—listeners run in a separate thread after the transaction commits. For synchronous processing within the same transaction:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@TransactionalEventListener</span><span class="o">(</span><span class="n">phase</span> <span class="o">=</span> <span class="nc">TransactionPhase</span><span class="o">.</span><span class="na">BEFORE_COMMIT</span><span class="o">)</span>
<span class="kt">void</span> <span class="nf">onPetAdoptedSync</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
    <span class="c1">// Runs in same transaction as publisher</span>
    <span class="c1">// If this fails, the whole transaction rolls back</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Use sync listeners sparingly. They couple the modules more tightly—a listener failure affects the publisher.</p>

<h4 id="error-handling-in-listeners">Error Handling in Listeners</h4>

<p>Async listeners need explicit error handling:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModuleListener</span>
<span class="kt">void</span> <span class="nf">onPetAdopted</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">try</span> <span class="o">{</span>
        <span class="n">vetPatientService</span><span class="o">.</span><span class="na">registerPatient</span><span class="o">(</span><span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">());</span>
    <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">log</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"Failed to register patient for pet {}"</span><span class="o">,</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">(),</span> <span class="n">e</span><span class="o">);</span>
        <span class="c1">// Options:</span>
        <span class="c1">// 1. Rethrow - event stays INCOMPLETE, retried on restart</span>
        <span class="c1">// 2. Swallow - event marked COMPLETED, lost</span>
        <span class="c1">// 3. Send to dead letter queue for manual handling</span>
        <span class="k">throw</span> <span class="n">e</span><span class="o">;</span>  <span class="c1">// Prefer rethrowing for automatic retry</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="exposing-events-as-a-module-api">Exposing Events as a Module API</h4>

<p>Events are the public API between modules. Expose them through a named interface:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>owner/
├── package-info.java          # @ApplicationModule
├── Owner.java                 # Internal
├── OwnerRepository.java       # Internal
└── events/
    ├── package-info.java      # Named interface
    └── PetAdoptedEvent.java   # Public API
</code></pre></div></div>

<p>Other modules declare dependency on events only:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Vet Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="o">{</span> <span class="s">"model"</span><span class="o">,</span> <span class="s">"owner::events"</span> <span class="o">}</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.vet</span><span class="o">;</span>
</code></pre></div></div>

<p>The vet module can listen to <code class="language-plaintext highlighter-rouge">PetAdoptedEvent</code> but cannot access <code class="language-plaintext highlighter-rouge">Owner</code>, <code class="language-plaintext highlighter-rouge">OwnerRepository</code>, or any other internal class. True decoupling.</p>

<h4 id="testing-events">Testing Events</h4>

<p>Spring Modulith provides testing support:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModuleTest</span>
<span class="kd">class</span> <span class="nc">OwnerModuleTests</span> <span class="o">{</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">petAdoptionPublishesEvent</span><span class="o">(</span><span class="nc">Scenario</span> <span class="n">scenario</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">scenario</span><span class="o">.</span><span class="na">stimulate</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">petService</span><span class="o">.</span><span class="na">adoptPet</span><span class="o">(</span><span class="n">ownerId</span><span class="o">,</span> <span class="n">pet</span><span class="o">))</span>
            <span class="o">.</span><span class="na">andWaitForEventOfType</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">class</span><span class="o">)</span>
            <span class="o">.</span><span class="na">matching</span><span class="o">(</span><span class="n">event</span> <span class="o">-&gt;</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">()))</span>
            <span class="o">.</span><span class="na">toArriveAndVerify</span><span class="o">(</span><span class="n">event</span> <span class="o">-&gt;</span> <span class="o">{</span>
                <span class="n">assertThat</span><span class="o">(</span><span class="n">event</span><span class="o">.</span><span class="na">ownerId</span><span class="o">()).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="n">ownerId</span><span class="o">);</span>
            <span class="o">});</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Scenario</code> API lets you verify events are published with the right data, without coupling tests to listener implementations.</p>

<h2 id="spring-modulith-defining-boundaries">Spring Modulith: Defining Boundaries</h2>

<p>Use <code class="language-plaintext highlighter-rouge">package-info.java</code> to declare modules and their allowed dependencies:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Owner Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="s">"model"</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.owner</span><span class="o">;</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Vet Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="o">{</span> <span class="s">"model"</span><span class="o">,</span> <span class="s">"owner::events"</span> <span class="o">}</span>  <span class="c1">// Named interface - only events subpackage</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.vet</span><span class="o">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">owner::events</code> syntax is a <strong>named interface</strong>—it exposes only the <code class="language-plaintext highlighter-rouge">events</code> subpackage while keeping <code class="language-plaintext highlighter-rouge">Owner</code>, <code class="language-plaintext highlighter-rouge">OwnerRepository</code>, and other internals hidden. Combined with the event publication registry described above, this creates truly independent modules that communicate only through events.</p>
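<p>On the owner side, the <code class="language-plaintext highlighter-rouge">events</code> subpackage opts in as a named interface via its own <code class="language-plaintext highlighter-rouge">package-info.java</code>. A sketch, assuming Spring Modulith's <code class="language-plaintext highlighter-rouge">@NamedInterface</code> annotation:</p>

```java
// owner/events/package-info.java
@org.springframework.modulith.NamedInterface("events")
package org.springframework.samples.petclinic.owner.events;
```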

<p>Verify structure in tests:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">ModulithStructureTest</span> <span class="o">{</span>
    <span class="nc">ApplicationModules</span> <span class="n">modules</span> <span class="o">=</span> <span class="nc">ApplicationModules</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="s">"org.springframework.samples.petclinic"</span><span class="o">);</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">verifiesModularStructure</span><span class="o">()</span> <span class="o">{</span>
        <span class="n">modules</span><span class="o">.</span><span class="na">verify</span><span class="o">();</span>  <span class="c1">// Fails if boundaries are violated</span>
    <span class="o">}</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">generateDocumentation</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">new</span> <span class="nf">Documenter</span><span class="o">(</span><span class="n">modules</span><span class="o">)</span>
            <span class="o">.</span><span class="na">writeModulesAsPlantUml</span><span class="o">()</span>
            <span class="o">.</span><span class="na">writeIndividualModulesAsPlantUml</span><span class="o">();</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h2 id="archunit-enforcing-architecture">ArchUnit: Enforcing Architecture</h2>

<p>Spring Modulith uses ArchUnit under the hood. The <code class="language-plaintext highlighter-rouge">modules.verify()</code> call checks:</p>
<ul>
  <li>No cycles between modules</li>
  <li>Modules only access their declared dependencies</li>
  <li>Internal packages are not accessed from outside</li>
</ul>

<p>jMolecules adds DDD-specific rules and layered architecture enforcement:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────┐
│                   @InterfaceLayer                        │
│              Controllers, REST endpoints                 │
├─────────────────────────────────────────────────────────┤
│                   @ApplicationLayer                      │
│           Application services, use cases                │
├─────────────────────────────────────────────────────────┤
│                     @DomainLayer                         │
│         Entities, Value Objects, Domain Services         │
├─────────────────────────────────────────────────────────┤
│                 @InfrastructureLayer                     │
│            Repositories, external services               │
└─────────────────────────────────────────────────────────┘

        Dependencies flow DOWN only (enforced by ArchUnit)
</code></pre></div></div>

<p>Run all rules in a test:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@AnalyzeClasses</span><span class="o">(</span><span class="n">packages</span> <span class="o">=</span> <span class="s">"org.springframework.samples.petclinic"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">JMoleculesRulesUnitTest</span> <span class="o">{</span>

    <span class="nd">@ArchTest</span>
    <span class="nc">ArchRule</span> <span class="n">dddRules</span> <span class="o">=</span> <span class="nc">JMoleculesDddRules</span><span class="o">.</span><span class="na">all</span><span class="o">();</span>

    <span class="nd">@ArchTest</span>
    <span class="nc">ArchRule</span> <span class="n">layeredArchitecture</span> <span class="o">=</span> <span class="nc">JMoleculesArchitectureRules</span><span class="o">.</span><span class="na">ensureLayering</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>When a rule is violated, your build fails:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] -
Rule 'classes that implement Entity should have identity' was violated (1 times):
    Class Pet does not have an @Id annotated field
</code></pre></div></div>

<h2 id="bytebuddy-keeping-domain-classes-clean">ByteBuddy: Keeping Domain Classes Clean</h2>

<p>This is the magic that makes everything work smoothly. ByteBuddy weaves JPA annotations at compile time based on jMolecules interfaces. Your domain classes stay clean—no <code class="language-plaintext highlighter-rouge">@Entity</code>, no <code class="language-plaintext highlighter-rouge">@Id</code> annotations polluting the model.</p>

<p>Configure the Maven plugin:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;plugin&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>net.bytebuddy<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>byte-buddy-maven-plugin<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;executions&gt;</span>
        <span class="nt">&lt;execution&gt;</span>
            <span class="nt">&lt;goals&gt;&lt;goal&gt;</span>transform-extended<span class="nt">&lt;/goal&gt;&lt;/goals&gt;</span>
        <span class="nt">&lt;/execution&gt;</span>
    <span class="nt">&lt;/executions&gt;</span>
    <span class="nt">&lt;configuration&gt;</span>
        <span class="nt">&lt;classPathDiscovery&gt;</span>true<span class="nt">&lt;/classPathDiscovery&gt;</span>
    <span class="nt">&lt;/configuration&gt;</span>
<span class="nt">&lt;/plugin&gt;</span>
</code></pre></div></div>

<p>Write clean domain classes. ByteBuddy adds the JPA infrastructure.</p>
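<p>What "clean" looks like in practice, as a self-contained sketch (hypothetical names; the jMolecules <code class="language-plaintext highlighter-rouge">AggregateRoot</code>/<code class="language-plaintext highlighter-rouge">Identifier</code> interface declarations are elided so the snippet compiles on its own; in the real project the classes would implement them and ByteBuddy would weave the JPA annotations):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Type-safe identifier: a plain value wrapper, no JPA annotations.
record OwnerId(UUID value) {
    static OwnerId random() { return new OwnerId(UUID.randomUUID()); }
}

// Value object with its invariant enforced at construction.
record PetName(String value) {
    PetName {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Pet name must not be blank");
        }
    }
}

// Aggregate root: pure domain logic, no @Entity or @Id anywhere.
class Owner {
    private final OwnerId id;
    private final List<PetName> pets = new ArrayList<>();

    Owner(OwnerId id) { this.id = id; }

    OwnerId id() { return id; }

    void adoptPet(PetName name) { pets.add(name); }

    List<PetName> pets() { return List.copyOf(pets); }
}
```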

<h2 id="the-path-to-microservices">The Path to Microservices</h2>

<p>This architecture is a stepping stone:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                     Monolith with Modules
┌─────────────────────────────────────────────────────────┐
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Owner     │  │     Vet     │  │   Billing   │     │
│  │  Context    │──│   Context   │──│   Context   │     │
│  │  (Module)   │  │  (Module)   │  │  (Module)   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
│         │                │                │             │
│         └────── Events ──┴──── Events ────┘             │
└─────────────────────────────────────────────────────────┘
                            │
                            │ Extract when ready
                            ▼
                    Microservices
┌─────────────────┐  ┌─────────────┐  ┌─────────────┐
│   Owner         │  │     Vet     │  │   Billing   │
│  Service        │──│   Service   │──│   Service   │
└─────────────────┘  └─────────────┘  └─────────────┘
        │                   │                   │
        └────── Kafka/RabbitMQ ─────────────────┘
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>DDD Concept</th>
      <th>Monolith</th>
      <th>Microservices</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bounded Context</td>
      <td>Spring Modulith Module</td>
      <td>Separate Service</td>
    </tr>
    <tr>
      <td>Aggregate</td>
      <td>Transactional boundary</td>
      <td>Service boundary</td>
    </tr>
    <tr>
      <td>Domain Event</td>
      <td><code class="language-plaintext highlighter-rouge">ApplicationEventPublisher</code></td>
      <td>Kafka/RabbitMQ message</td>
    </tr>
    <tr>
      <td>Anti-Corruption Layer</td>
      <td>Module adapter</td>
      <td>API Gateway / BFF</td>
    </tr>
  </tbody>
</table>

<p>Module boundaries are already defined and verified. Events decouple modules. Aggregate boundaries map naturally to service boundaries. Type-safe IDs prevent accidental coupling. When a module needs independent scaling, extract it—the work is already done.</p>

<h2 id="when-not-to-use-ddd">When NOT to Use DDD</h2>

<p>DDD is not free. It adds concepts, abstractions, and ceremony.</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Why DDD Is Overkill</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Simple CRUD apps</strong></td>
      <td>If your app is mostly forms over data with little business logic, a simple layered architecture suffices.</td>
    </tr>
    <tr>
      <td><strong>Short-lived projects</strong></td>
      <td>Prototypes, MVPs, or throwaway code don’t benefit from the upfront investment.</td>
    </tr>
    <tr>
      <td><strong>Small teams without domain experts</strong></td>
      <td>DDD assumes collaboration with domain experts. Without them, you’re guessing.</td>
    </tr>
    <tr>
      <td><strong>Stable, simple domains</strong></td>
      <td>If the domain is unlikely to change, the flexibility DDD provides isn’t needed.</td>
    </tr>
  </tbody>
</table>

<p>The pragmatic approach: Use DDD <strong>tactically</strong> for complex domain logic. Use DDD <strong>strategically</strong> when you have multiple teams or are planning microservices.</p>

<h2 id="project-setup">Project Setup</h2>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">&lt;!-- Spring Modulith --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-core<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-test<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- jMolecules DDD --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-ddd<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-layered-architecture<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-jpa<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- ByteBuddy for JPA annotation weaving --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-bytebuddy-nodep<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>provided<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- ArchUnit for architecture verification --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>com.tngtech.archunit<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>archunit-junit5<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-archunit<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
</code></pre></div></div>

<h2 id="quick-reference">Quick Reference</h2>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>jMolecules Type</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Aggregate Root</td>
      <td><code class="language-plaintext highlighter-rouge">AggregateRoot&lt;T, ID&gt;</code></td>
      <td>Entry point to aggregate, owns repository</td>
    </tr>
    <tr>
      <td>Entity</td>
      <td><code class="language-plaintext highlighter-rouge">Entity&lt;AggregateRoot, ID&gt;</code></td>
      <td>Has identity, belongs to aggregate</td>
    </tr>
    <tr>
      <td>Value Object</td>
      <td><code class="language-plaintext highlighter-rouge">ValueObject</code></td>
      <td>Immutable, equality by value</td>
    </tr>
    <tr>
      <td>Identifier</td>
      <td><code class="language-plaintext highlighter-rouge">Identifier</code></td>
      <td>Type-safe ID wrapper</td>
    </tr>
    <tr>
      <td>Association</td>
      <td><code class="language-plaintext highlighter-rouge">Association&lt;T, ID&gt;</code></td>
      <td>Cross-aggregate reference (ID only)</td>
    </tr>
    <tr>
      <td>Domain Event</td>
      <td><code class="language-plaintext highlighter-rouge">DomainEvent</code></td>
      <td>Notification of state change</td>
    </tr>
    <tr>
      <td>Repository</td>
      <td><code class="language-plaintext highlighter-rouge">Repository&lt;T, ID&gt;</code></td>
      <td>Aggregate persistence</td>
    </tr>
  </tbody>
</table>

<h2 id="running-the-project">Running the Project</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/jeremylem/petclinic-exploration.git
<span class="nb">cd </span>petclinic-exploration
./mvnw spring-boot:run
</code></pre></div></div>

<p>Access at http://localhost:8080</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.domainlanguage.com/ddd/">Domain-Driven Design</a> — Eric Evans (2003)</li>
  <li><a href="https://odrotbohm.de/2020/03/Implementing-DDD-Building-Blocks-in-Java/">Implementing DDD Building Blocks in Java</a> — Oliver Drotbohm</li>
  <li><a href="https://docs.spring.io/spring-modulith/reference/">Spring Modulith Reference</a></li>
  <li><a href="https://github.com/xmolecules/jmolecules">jMolecules GitHub</a></li>
  <li><a href="https://github.com/odrotbohm/tactical-ddd-workshop">Tactical DDD Workshop</a></li>
</ul>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The classic Spring PetClinic is everyone’s first Spring app. Simple, clean, easy to follow. But as codebases grow it might turn differently. Business logic scatters across controllers and services. Changes ripple unpredictably. New developers take months to become productive.]]></summary></entry><entry><title type="html">FusionRAG Just Got Simpler: BM25 is Now in PostgreSQL</title><link href="https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL.html" rel="alternate" type="text/html" title="FusionRAG Just Got Simpler: BM25 is Now in PostgreSQL" /><published>2025-12-30T00:00:00+00:00</published><updated>2025-12-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL.html"><![CDATA[<p>In my <a href="/2025-10-28-MCP_RAG">previous post about building a Local Knowledge Base MCP Server</a>, I landed on Fusion RAG (BM25 + Vector) as the winning pattern. It caught both keywords and semantics, hitting 100% recall at 23ms.</p>

<p>The stack was: ChromaDB for vectors, the rank-bm25 Python library for keyword search, and custom fusion logic to merge results.</p>

<p>That stack just became simpler. I discovered that two PostgreSQL extensions can handle both pieces.</p>

<h2 id="the-old-architecture">The Old Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────┐     ┌─────────────┐
│  ChromaDB   │     │  rank-bm25  │
│  (Vectors)  │     │  (Keywords) │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └───────┬───────────┘
               │
        ┌──────▼──────┐
        │ Python Code │
        │ (Fusion)    │
        └─────────────┘
</code></pre></div></div>

<p>Two data stores. Sync issues. Custom fusion logic. Works, but more moving parts than necessary.</p>

<h2 id="the-new-architecture">The New Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────────────────────────────┐
│          PostgreSQL            │
│  ┌──────────┐  ┌────────────┐  │
│  │ pgvector │  │ pg_search  │  │
│  │ (Vectors)│  │   (BM25)   │  │
│  └──────────┘  └────────────┘  │
│         ┌──────────┐           │
│         │   RRF    │           │
│         │  (SQL)   │           │
│         └──────────┘           │
└────────────────────────────────┘
</code></pre></div></div>

<p>One database. One source of truth. Fusion happens in SQL.</p>
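<p>Before looking at the SQL, it helps to see what RRF actually computes: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in, with k conventionally set to 60. A plain-Java illustration (hypothetical data):</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Rrf {
    static final int K = 60; // conventional damping constant

    // Each inner list holds document ids in rank order (best first).
    static Map<String, Double> fuse(List<List<String>> rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // 1-based rank, as in the usual RRF formulation
                scores.merge(list.get(rank), 1.0 / (K + rank + 1), Double::sum);
            }
        }
        return scores;
    }
}
```

<p>A document that ranks merely decently in both lists can out-score one that tops a single list, which is exactly the behavior that makes hybrid search robust.</p>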

<h2 id="why-bm25-matters">Why BM25 Matters</h2>

<p>Standard PostgreSQL full-text search (tsvector) is essentially boolean matching: a document either matches or it doesn’t. ts_rank adds basic frequency weighting, but nothing close to real relevance scoring.</p>

<p>BM25 solves four problems:</p>

<ul>
  <li><strong>Term Frequency Saturation</strong>: Mentioning a word 12 times doesn’t make a doc 12x more relevant. After a few mentions, additional repetitions barely help.</li>
  <li><strong>Inverse Document Frequency</strong>: Rare terms get higher weight. “Kubernetes” in a general corpus signals more than “the.”</li>
  <li><strong>Length Normalization</strong>: A focused 15-word answer beats an 80-word doc that mentions your query in passing.</li>
  <li><strong>Ranked Retrieval</strong>: Every result gets a meaningful score, not just match/no-match.</li>
</ul>

<p>BM25 is the algorithm powering Elasticsearch and Apache Lucene.</p>
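<p>All four effects fall out of one formula. A minimal single-term sketch in Java (standard parameters k1 = 1.2, b = 0.75; corpus statistics are hypothetical):</p>

```java
// Single-term BM25:
//   score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
class Bm25 {
    static final double K1 = 1.2;  // controls term-frequency saturation
    static final double B  = 0.75; // controls length normalization

    static double idf(long totalDocs, long docsWithTerm) {
        // Smoothed IDF: the rarer the term, the higher the weight.
        return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    static double score(double tf, double docLen, double avgDocLen,
                        long totalDocs, long docsWithTerm) {
        double lengthNorm = 1 - B + B * (docLen / avgDocLen);
        return idf(totalDocs, docsWithTerm) * (tf * (K1 + 1)) / (tf + K1 * lengthNorm);
    }
}
```

<p>Doubling tf from 1 to 2 does not double the score, and pushing it to 12 helps even less: that is the saturation behavior from the list above, with rare-term weighting and length penalties handled by the other two factors.</p>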

<h2 id="the-extensions">The Extensions</h2>

<p>These are not built into PostgreSQL core. They are extensions you install separately.</p>

<p><a href="https://github.com/pgvector/pgvector"><strong>pgvector</strong></a>: First released April 2021. Adds vector data types and similarity search operators. HNSW indexing (the fast one) arrived in v0.5.0 (August 2023). Now at v0.8.x with broad cloud provider support.</p>

<p><a href="https://github.com/paradedb/paradedb"><strong>pg_search</strong></a>: First stable release November 2023 (originally called pg_bm25). Built on <a href="https://github.com/quickwit-oss/tantivy">Tantivy</a>, the Rust alternative to Lucene. Adds BM25 scoring and full-text search operators.</p>

<h3 id="pgvector-vs-chromadb">pgvector vs ChromaDB</h3>

<p>Here’s how they compare:</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>ChromaDB</th>
      <th>pgvector</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Type</strong></td>
      <td>Standalone vector database</td>
      <td>PostgreSQL extension</td>
    </tr>
    <tr>
      <td><strong>Best for</strong></td>
      <td>Prototyping, up in 5 minutes</td>
      <td>Production, existing PostgreSQL stack</td>
    </tr>
    <tr>
      <td><strong>Concurrency</strong></td>
      <td>Degrades under load</td>
      <td>Handles concurrent queries well</td>
    </tr>
    <tr>
      <td><strong>SQL joins</strong></td>
      <td>Separate data store, needs sync</td>
      <td>Native joins with relational data</td>
    </tr>
    <tr>
      <td><strong>ACID</strong></td>
      <td>No</td>
      <td>Full transactions</td>
    </tr>
    <tr>
      <td><strong>Scaling</strong></td>
      <td>Purpose-built for vectors</td>
      <td>PostgreSQL scaling patterns</td>
    </tr>
  </tbody>
</table>

<p>ChromaDB excels at rapid prototyping. Single queries are fast. But under concurrent load, pgvector tends to handle it better due to PostgreSQL’s mature connection pooling and query optimization.</p>

<p>If you already run PostgreSQL, you eliminate a separate data store. User metadata, document content, and embeddings live in one place. One backup. One connection pool. No sync logic.</p>

<h2 id="setting-it-up">Setting It Up</h2>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">vector</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">pg_search</span><span class="p">;</span>
</code></pre></div></div>

<p>Create your table with both vector and text columns:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">documents</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">SERIAL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">content</span> <span class="nb">TEXT</span><span class="p">,</span>
    <span class="n">embedding</span> <span class="n">vector</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Create both indexes:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Vector index (HNSW for fast approximate search)</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_docs_vector</span> <span class="k">ON</span> <span class="n">documents</span>
<span class="k">USING</span> <span class="n">hnsw</span> <span class="p">(</span><span class="n">embedding</span> <span class="n">vector_cosine_ops</span><span class="p">);</span>

<span class="c1">-- BM25 index</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_docs_bm25</span> <span class="k">ON</span> <span class="n">documents</span>
<span class="k">USING</span> <span class="n">bm25</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">content</span><span class="p">)</span>
<span class="k">WITH</span> <span class="p">(</span><span class="n">key_field</span><span class="o">=</span><span class="s1">'id'</span><span class="p">);</span>
</code></pre></div></div>

<h2 id="hybrid-search-in-pure-sql">Hybrid Search in Pure SQL</h2>

<p>Reciprocal Rank Fusion (RRF) in a single query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">bm25_results</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">paradedb</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">rank</span>
  <span class="k">FROM</span> <span class="n">documents</span>
  <span class="k">WHERE</span> <span class="n">content</span> <span class="o">@@@</span> <span class="s1">'kubernetes deployment strategy'</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">paradedb</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">DESC</span>
  <span class="k">LIMIT</span> <span class="mi">20</span>
<span class="p">),</span>

<span class="n">vector_results</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">embedding</span> <span class="o">&lt;=&gt;</span> <span class="err">$</span><span class="mi">1</span><span class="p">)</span> <span class="k">AS</span> <span class="n">rank</span>
  <span class="k">FROM</span> <span class="n">documents</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">embedding</span> <span class="o">&lt;=&gt;</span> <span class="err">$</span><span class="mi">1</span>
  <span class="k">LIMIT</span> <span class="mi">20</span>
<span class="p">),</span>

<span class="n">fused</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">bm25_results</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">vector_results</span>
<span class="p">)</span>

<span class="k">SELECT</span>
  <span class="n">d</span><span class="p">.</span><span class="n">id</span><span class="p">,</span>
  <span class="n">d</span><span class="p">.</span><span class="n">content</span><span class="p">,</span>
  <span class="k">SUM</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">score</span><span class="p">)</span> <span class="k">AS</span> <span class="n">relevance</span>
<span class="k">FROM</span> <span class="n">fused</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">documents</span> <span class="n">d</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">d</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">content</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">relevance</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>The magic number 60 in RRF controls score decay. Lower values favor top results more aggressively.</p>
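<p>To see that decay in isolation, here is a small standalone Python sketch (purely illustrative, not part of the query):</p>

```python
# Illustrative: the RRF contribution of a result at a given 1-based rank.
def rrf_score(rank, k=60):
    return 1.0 / (k + rank)

# With k=10, rank 1 scores roughly 1.8x rank 10; with k=60 the gap shrinks
# to about 1.15x, so a lower constant rewards top-ranked hits more aggressively.
steep = rrf_score(1, 10) / rrf_score(10, 10)
flat = rrf_score(1, 60) / rrf_score(10, 60)
```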

<h2 id="weighted-fusion">Weighted Fusion</h2>

<p>In my MCP server, I used 30% keywords, 70% semantics. Same thing in SQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fused</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">3</span> <span class="o">*</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">bm25_results</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7</span> <span class="o">*</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">vector_results</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Tune based on your data. Technical documentation with exact terms? Bump BM25 weight. Conversational queries? Favor vectors.</p>
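<p>If it helps to experiment with weights before touching the query, the fusion step can be mirrored in a few lines of Python (a hypothetical helper; the function and argument names are invented here):</p>

```python
# Weighted Reciprocal Rank Fusion over two rank maps (doc_id -> 1-based rank).
def fuse(bm25_ranks, vector_ranks, w_bm25=0.3, w_vector=0.7, k=60):
    scores = {}
    for doc_id, rank in bm25_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + w_bm25 / (k + rank)
    for doc_id, rank in vector_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + w_vector / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# At 30/70 the semantically top-ranked doc wins; flip the weights and the
# keyword-ranked doc takes over.
order = fuse({"a": 1, "b": 5}, {"b": 1, "a": 8})
```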

<h2 id="what-this-replaces">What This Replaces</h2>

<table>
  <thead>
    <tr>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ChromaDB</td>
      <td>pgvector</td>
    </tr>
    <tr>
      <td>rank-bm25 (Python)</td>
      <td>pg_search</td>
    </tr>
    <tr>
      <td>Custom fusion code</td>
      <td>SQL CTE (Common Table Expression)</td>
    </tr>
    <tr>
      <td>Two data stores</td>
      <td>One database</td>
    </tr>
    <tr>
      <td>Sync logic</td>
      <td>ACID transactions</td>
    </tr>
  </tbody>
</table>

<h2 id="operational-simplicity">Operational Simplicity</h2>

<p>Fewer dependencies, but also:</p>

<ul>
  <li><strong>Backups</strong>: One database to back up.</li>
  <li><strong>Consistency</strong>: ACID transactions across text and vectors.</li>
  <li><strong>Scaling</strong>: PostgreSQL scaling patterns you already know.</li>
  <li><strong>Monitoring</strong>: One set of metrics.</li>
</ul>

<p>When your documents update, both indexes update atomically. No sync jobs. No eventual consistency headaches.</p>
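<p>From application code, an atomic write looks like this sketch (hypothetical helper; <code>cur</code> would be any DB-API cursor such as psycopg2’s, shown here with a stand-in recorder so the snippet runs without a database):</p>

```python
# Upsert content and embedding together; within one transaction, both the
# HNSW and BM25 indexes see the change atomically.
def upsert_document(cur, doc_id, content, embedding):
    vec_literal = "[" + ",".join(str(x) for x in embedding) + "]"  # pgvector text format
    cur.execute(
        "INSERT INTO documents (id, content, embedding) VALUES (%s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, "
        "embedding = EXCLUDED.embedding",
        (doc_id, content, vec_literal),
    )

class RecordingCursor:
    """Stand-in that records execute() calls instead of talking to PostgreSQL."""
    def __init__(self):
        self.calls = []
    def execute(self, sql, params=None):
        self.calls.append((sql, params))

cur = RecordingCursor()
upsert_document(cur, 1, "hello world", [0.1, 0.2])
```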

<h2 id="when-to-still-use-elasticsearch">When to Still Use Elasticsearch</h2>

<p>The 1% cases:</p>

<ul>
  <li>Multi-petabyte scale with sub-100ms requirements</li>
  <li>Complex faceted search with dozens of filters</li>
  <li>Geo-spatial + full-text + vector in the same query at massive scale</li>
</ul>

<p>For the rest of us building RAG pipelines, knowledge bases, and semantic search? PostgreSQL handles it.</p>

<h2 id="trying-it-out">Trying It Out</h2>

<p>Easiest path: the <a href="https://hub.docker.com/r/paradedb/paradedb">ParadeDB Docker image</a> comes with both extensions pre-installed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">--name</span> paradedb <span class="nt">-e</span> <span class="nv">POSTGRES_PASSWORD</span><span class="o">=</span>password <span class="nt">-p</span> 5432:5432 paradedb/paradedb
</code></pre></div></div>

<h2 id="next-step-for-the-mcp-server">Next Step for the MCP Server</h2>

<p>The <a href="https://github.com/j3r3myfoobar/knowledge_base_mcp">Knowledge Base MCP Server</a> currently uses ChromaDB + rank-bm25. Migrating to PostgreSQL would:</p>

<ol>
  <li>Remove two dependencies (chromadb, rank-bm25)</li>
  <li>Simplify deployment (just needs a PostgreSQL connection)</li>
  <li>Enable SQL-based analytics on search patterns</li>
  <li>Make it easier to integrate with existing enterprise databases</li>
</ol>

<p>The fusion logic moves from Python to a SQL view. The MCP server becomes a thin query layer.</p>

<p>Same accuracy. Simpler stack.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[In my previous post about building a Local Knowledge Base MCP Server, I landed on Fusion RAG (BM25 + Vector) as the winning pattern. It caught both keywords and semantics, hitting 100% recall at 23ms.]]></summary></entry><entry><title type="html">From Continuous Delivery to Continuous Deployment</title><link href="https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD.html" rel="alternate" type="text/html" title="From Continuous Delivery to Continuous Deployment" /><published>2025-12-29T00:00:00+00:00</published><updated>2025-12-29T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD.html"><![CDATA[<p>After reading <a href="https://itrevolution.com/product/accelerate/"><em>Accelerate</em></a> by Nicole Forsgren, Jez Humble, and Gene Kim, I started rethinking what CD actually means. For years, I worked in environments where CD meant Continuous Delivery: code ready to deploy, waiting for approval.</p>

<p>It is still CI/CD; the difference is how fast changes reach production.</p>

<h2 id="delivery-vs-deployment">Delivery vs. Deployment</h2>

<p><strong>Continuous Delivery</strong>: Code is built, tested, and pushed to a staging environment automatically. It can be deployed to production at any time, but a human decision or a scheduled window triggers the go-live.</p>

<p><strong>Continuous Deployment</strong>: Every change that passes the automated test suite is deployed to Production immediately, without human intervention.</p>

<p>In regulated environments, teams simulate Continuous Deployment by automating Change Request ticket creation and approval based on test evidence. The old world was a manager clicking Approve in ServiceNow. The new world is automated governance: the pipeline generates an attestation document proving that tests passed, security scans completed, and peer review happened. The auditor is satisfied without stopping the assembly line.</p>
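<p>The attestation itself can be as simple as a JSON document the pipeline emits and signs. A rough sketch (field names invented; real schemas depend on your ITSM tool):</p>

```python
import json

# Build the evidence record a pipeline could attach to an automated change request.
def build_attestation(commit, tests_passed, scans_clean, reviewers):
    record = {
        "commit": commit,
        "tests_passed": tests_passed,
        "security_scans_clean": scans_clean,
        "peer_reviewers": reviewers,
        # Auto-approve only when every gate is green and at least one human reviewed.
        "auto_approved": tests_passed and scans_clean and len(reviewers) >= 1,
    }
    return json.dumps(record, sort_keys=True)
```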

<h2 id="key-concept-decoupling-deployment-from-release">Key Concept: Decoupling Deployment from Release</h2>

<p>This is the most important concept I took from Accelerate:</p>

<ul>
  <li><strong>Deployment</strong> (Technical Act): Moving code to the production server. Happens continuously.</li>
  <li><strong>Release</strong> (Business Act): Making the feature visible to the customer. Happens when the business is ready.</li>
</ul>

<p>You can deploy on Tuesday at 10 AM, but release on Friday for a Business launch.</p>

<h2 id="the-mechanic-feature-flags">The Mechanic: Feature Flags</h2>

<p>How do you deploy code in the middle of a sprint without breaking the user experience?</p>

<p><strong>Day 3 of Sprint:</strong> You finish the backend for a new payment feature. It deploys to Prod immediately. Safe because the code is wrapped in a Feature Flag set to <code class="language-plaintext highlighter-rouge">False</code>. Users cannot hit it.</p>

<p><strong>Day 7 of Sprint:</strong> The UI is done. It deploys to Prod. Flag is still <code class="language-plaintext highlighter-rouge">False</code>.</p>

<p><strong>End of Sprint (Review):</strong> You toggle the flag to <code class="language-plaintext highlighter-rouge">True</code> only for internal users to demo it in Production.</p>

<p><strong>Release Day:</strong> The business toggles the flag to <code class="language-plaintext highlighter-rouge">True</code> for 100% of users.</p>

<h3 id="feature-flag-frameworks">Feature Flag Frameworks</h3>

<ul>
  <li><a href="https://launchdarkly.com/"><strong>LaunchDarkly</strong></a> (SaaS): Deep audit logging, RBAC, SSO integration. Expensive at scale.</li>
  <li><a href="https://www.getunleash.io/"><strong>Unleash</strong></a> (Open Source, Self-Hosted): Self-host inside your private cloud. No data leaves your network.</li>
  <li><a href="https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html"><strong>AWS AppConfig</strong></a>: Good if you want to avoid buying another tool.</li>
  <li><a href="https://openfeature.dev/"><strong>OpenFeature</strong></a> (CNCF Project): An open specification that lets you swap vendors without rewriting application code.</li>
</ul>

<h3 id="feature-flags-in-code">Feature Flags in Code</h3>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// OLD WAY: Hardcoded or Config File</span>
<span class="k">if</span> <span class="o">(</span><span class="n">config</span><span class="o">.</span><span class="na">isNewPaymentFlowEnabled</span><span class="o">())</span> <span class="o">{</span>
    <span class="n">runNewPaymentLogic</span><span class="o">();</span>
<span class="o">}</span>

<span class="c1">// NEW WAY: Feature Flag SDK</span>
<span class="kt">boolean</span> <span class="n">showNewFeature</span> <span class="o">=</span> <span class="n">featureFlagClient</span><span class="o">.</span><span class="na">boolVariation</span><span class="o">(</span>
    <span class="s">"new-payment-flow"</span><span class="o">,</span> <span class="n">userContext</span><span class="o">,</span> <span class="kc">false</span><span class="o">);</span>

<span class="k">if</span> <span class="o">(</span><span class="n">showNewFeature</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">runNewPaymentLogic</span><span class="o">();</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
    <span class="n">runOldPaymentLogic</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The real power is Targeting Rules. The code is deployed to all servers, but you control who can access the feature through the dashboard, without redeploying:</p>

<ul>
  <li>Enable only for QA: if <code class="language-plaintext highlighter-rouge">user_id = "qa_tester_bob"</code>, return <code class="language-plaintext highlighter-rouge">True</code>.</li>
  <li>Enable for a region: if <code class="language-plaintext highlighter-rouge">user_region = "EU"</code>, return <code class="language-plaintext highlighter-rouge">True</code>.</li>
  <li>Gradual business rollout: 5% of users today, 50% next week, 100% after validation.</li>
</ul>
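<p>Under the hood, rule evaluation is straightforward. A toy version (real SDKs like LaunchDarkly or Unleash implement a much richer form of the same idea):</p>

```python
import hashlib

# Evaluate targeting rules for a flag: explicit users first, then regions,
# then a stable percentage rollout based on hashing the user id into a 0-99 bucket.
def flag_enabled(user_id, region, allow_users=(), allow_regions=(), rollout_pct=0):
    if user_id in allow_users:
        return True
    if region in allow_regions:
        return True
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

<p>Because the bucket is derived from a hash of the user id, each user gets a consistent answer as the rollout grows from 5% to 100%.</p>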

<p>This is different from Canary Deployment. Canary is about infrastructure: deploy to a small percentage of servers to check if the code is stable. Feature Flags are about business logic: the code runs everywhere, but you choose which users can see the feature.</p>

<h2 id="safe-deployment-canary-releases">Safe Deployment: Canary Releases</h2>

<p>To deploy continuously without crashing Prod, teams use Canary Deployments:</p>

<ol>
  <li>Deploy v2.0 alongside v1.0.</li>
  <li>Route a small percentage of traffic to v2.0.</li>
  <li>Automated monitoring checks for errors (HTTP 500s, latency spikes).</li>
  <li>If error rate &lt; threshold, gradually ramp up traffic to 100%.</li>
  <li>If errors spike, automatically rollback to v1.0.</li>
</ol>
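<p>The control loop behind steps 3-5 can be sketched in a few lines (threshold and step size are invented for illustration; a real controller pulls error rates from Prometheus or similar):</p>

```python
# Decide the next traffic percentage for the canary, or signal a rollback.
def canary_step(error_rate_v2, error_rate_v1, current_pct, threshold=0.01, step=25):
    if error_rate_v2 > error_rate_v1 + threshold:
        return None  # errors spiked: route all traffic back to v1
    return min(100, current_pct + step)  # healthy: ramp up gradually
```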

<p>The goal: Make deployment a non-event that happens constantly.</p>

<h2 id="environments-ephemeral-over-static">Environments: Ephemeral over Static</h2>

<p>In the traditional model, you had static environments: Dev, QA, UAT, Prod. These servers were always running, often drifted from Prod configurations, and were bottlenecks.</p>

<p>In Continuous Deployment, this changes to Ephemeral Environments:</p>

<ul>
  <li><strong>Local</strong>: Developer works on their machine (Docker to mimic Prod).</li>
  <li><strong>Preview Environment</strong>: Auto-created when a PR is opened. Tests run here. QA clicks a link to verify. Destroyed after merge.</li>
  <li><strong>Staging (Pre-Prod)</strong>: Single environment mirroring Prod exactly. Auto-deploys to Prod if smoke tests pass.</li>
</ul>

<p>Do you need a permanent QA server? No. You create a fresh one for every feature, test it, and destroy it.</p>

<h2 id="branching-trunk-based-development">Branching: Trunk-Based Development</h2>

<p>The industry standard for Continuous Deployment is Trunk-Based Development.</p>

<p><strong>Old Way (GitFlow):</strong> You have Master, Develop, Feature-X, Release-1.0. Code lives in a feature branch for weeks. Merging is painful.</p>

<p><strong>New Way (Trunk-Based):</strong></p>

<ul>
  <li>One Main Branch, usually called <code class="language-plaintext highlighter-rouge">main</code> or <code class="language-plaintext highlighter-rouge">trunk</code>.</li>
  <li>Short-Lived Feature Branches: Developers create a branch and merge it back to <code class="language-plaintext highlighter-rouge">main</code> within 24 hours.</li>
  <li>No Release Branches: You deploy a specific commit from <code class="language-plaintext highlighter-rouge">main</code>.</li>
</ul>

<p>How can you merge unfinished work? You merge the backend code but hide it behind a Feature Flag. Your code is integrated with everyone else’s code daily. You never have merge conflicts because you never drift far from <code class="language-plaintext highlighter-rouge">main</code>.</p>

<h2 id="measuring-success-dora-metrics">Measuring Success: DORA Metrics</h2>

<p>The Accelerate book introduces four metrics that have become the industry standard (<a href="https://dora.dev/">DORA</a>):</p>

<ul>
  <li><strong>Deployment Frequency</strong>: How often you deploy to production. High performers: multiple times per day.</li>
  <li><strong>Lead Time for Changes</strong>: Time from commit to production. High performers: less than one hour.</li>
  <li><strong>Time to Restore Service</strong>: How quickly you recover from incidents. High performers: less than one hour.</li>
  <li><strong>Change Failure Rate</strong>: Percentage of deployments causing failures. High performers: 0-15%.</li>
</ul>

<p>The counter-intuitive finding: Teams that deploy multiple times a day have lower change failure rates than teams that deploy monthly. Smaller changes mean smaller blast radius and easier rollback.</p>

<h2 id="accelerating-even-faster">Accelerating Even Faster</h2>

<p>Once deploying is no longer the challenge, the focus shifts to making pipelines smarter.</p>
<h3 id="1-predictive-test-selection">1. Predictive Test Selection</h3>

<p>Running 2,000 tests for a one-line CSS change is a waste of resources. If your regression suite takes 30 minutes, that adds up when you deploy multiple times a day.</p>

<p><a href="https://www.cloudbees.com/capabilities/cloudbees-smart-tests"><strong>CloudBees Smart Tests</strong></a> (formerly Launchable) analyzes your Git history and test failures. It tells your pipeline: only run these 50 tests, skip the other 2,000.</p>

<p><a href="https://gradle.com/develocity/"><strong>Gradle Develocity</strong></a> (formerly Gradle Enterprise) is the gold standard for Java/Spring shops. It caches test results and uses ML to skip tests that haven’t been impacted by your code changes.</p>

<p><a href="https://www.harness.io/products/continuous-integration"><strong>Harness Test Intelligence</strong></a> builds a call graph of your code. If you change <code class="language-plaintext highlighter-rouge">Login.java</code>, it knows exactly which tests cover that file.</p>

<p><strong>DORA Impact:</strong> Reduces <strong>Lead Time for Changes</strong>. By cutting feedback time from 30 mins to 5 mins, developers stay in flow, and code moves to staging hours faster.</p>

<h3 id="2-deployment-risk-scoring">2. Deployment Risk Scoring</h3>

<p>Most CD tools like <a href="https://argoproj.github.io/cd/">ArgoCD</a> are dumb. They just sync Git to Cluster. They don’t know if the app is actually working, only that the pod is running.</p>

<p><a href="https://www.opsmx.com/autopilot-overview/"><strong>OpsMx Autopilot</strong></a>: The brain you attach to your muscle (ArgoCD or <a href="https://spinnaker.io/">Spinnaker</a>). It connects to your logs (<a href="https://www.splunk.com/">Splunk</a>, <a href="https://www.datadoghq.com/">Datadog</a>) and metrics (<a href="https://prometheus.io/">Prometheus</a>). When you deploy to Staging, it compares the new version against the old one in real-time and assigns a Risk Score (0-100). If the score drops below 90, it automatically commands ArgoCD to rollback. This automates the Canary Analysis that usually requires a senior engineer staring at a dashboard for 30 minutes.</p>

<p><a href="https://www.harness.io/"><strong>Harness Continuous Verification</strong></a>: Similar approach. Connects to your monitoring. Uses ML to compare versions. Auto-rolls back if errors deviate by more than 1%.</p>

<p>This replaces blind approval rules with smart rules based on actual risk. For regulated industries, these tools also generate the digital paper trail that satisfies compliance.</p>

<p><strong>DORA Impact:</strong> Lowers <strong>Change Failure Rate</strong>. By catching weak signals (like a 2% latency increase) in Staging, you prevent bad code from ever hitting Production, keeping the failure rate close to zero.</p>

<h3 id="3-smart-root-cause-analysis">3. Smart Root Cause Analysis</h3>

<p>When a build fails, someone has to dig through 1,000 lines of logs. When Production alerts fire at 3 AM, someone has to correlate logs, traces, and recent deployments.</p>

<p><a href="https://komodor.com/"><strong>Komodor</strong></a> tracks every single change in Kubernetes (config, deploy, health check) and correlates it to failures. Like a Time Machine for K8s.</p>

<p><a href="https://www.dynatrace.com/platform/artificial-intelligence/"><strong>Dynatrace Davis AI</strong></a> uses deterministic AI (not just ML guessing) to analyze the dependency graph. It can tell you: “The user login failed because the backend SQL database was locked by the Inventory Service.”</p>

<p><a href="https://www.datadoghq.com/product/platform/bits-ai/"><strong>Datadog Bits AI</strong></a> lets you ask in natural language: “Who deployed to the payment service right before the latency spike?” It correlates the Git commit to the error logs.</p>

<p><a href="https://www.harness.io/"><strong>Harness AIDA</strong></a> (AI DevOps Agent) scans logs and Git history, then generates a summary: “Failure likely caused by memory leak in commit 8a4b2 by User X.”</p>

<p><strong>DORA Impact:</strong> Improves <strong>Time to Restore Service</strong>. Instead of spending 4 hours investigating what broke, the AI tells you the root cause in seconds, allowing you to fix (or rollback) immediately.</p>

<h3 id="4-gitops-from-push-to-pull">4. GitOps: From Push to Pull</h3>

<p>This is the standard operating model now. You don’t use a UI like Jenkins to deploy. You commit a change to a config file in Git, and an agent inside the Production cluster pulls the change in.</p>

<p><strong>The Old Way (Push Model, Jenkins style):</strong></p>

<ol>
  <li>Developer commits code.</li>
  <li>Jenkins builds the artifact.</li>
  <li>Jenkins runs: <code class="language-plaintext highlighter-rouge">kubectl apply -f my-app.yaml</code>.</li>
</ol>

<p>The Risk: A debug flag gets enabled directly in the cluster during troubleshooting. The issue gets fixed, but the flag stays on for weeks. Git and Production are now out of sync.</p>

<p><strong>The New Way (Pull Model):</strong></p>

<ol>
  <li>Developer commits code or config to Git.</li>
  <li>CI only updates a Docker image registry.</li>
  <li>An Agent living inside the Production Cluster asks: Does my current state match what is in Git?</li>
  <li>It sees a new image tag in Git. It pulls the change and applies it.</li>
</ol>

<p>Why is this safer?</p>

<ul>
  <li><strong>Drift Detection</strong>: If someone changes a setting in Prod manually, the agent detects the drift immediately and can auto-revert.</li>
  <li><strong>Security</strong>: You don’t give your CI server Admin Access to your Prod cluster. The cluster reaches out to Git; nothing reaches in.</li>
</ul>
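<p>The agent’s core job is a reconcile loop. Stripped to its essence (the dicts stand in for manifests in Git and live objects in the cluster; this is an illustration, not ArgoCD’s actual code):</p>

```python
# Compute the actions needed to make the cluster match what Git declares.
def reconcile(desired, live):
    actions = []
    for name, spec in desired.items():
        if live.get(name) != spec:
            actions.append(("apply", name))  # create, update, or revert drift
    for name in live:
        if name not in desired:
            actions.append(("delete", name))  # prune what Git no longer declares
    return actions
```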

<p><a href="https://argoproj.github.io/cd/"><strong>ArgoCD</strong></a>: Best UI for visualizing Kubernetes. Logs exactly who merged the PR that triggered the sync.</p>

<p><a href="https://fluxcd.io/"><strong>Flux v2</strong></a>: If you want it invisible. No UI; it just works in the background.</p>

<p><a href="https://www.harness.io/"><strong>Harness GitOps</strong></a>: Managed ArgoCD with an enterprise UI and dashboards.</p>

<p>For developers, this complexity is often hidden behind an Internal Developer Portal (IDP) like <a href="https://backstage.io/">Backstage</a>. A junior dev clicks “Deploy to Staging” in a web UI; under the hood, it commits to a GitOps repo and ArgoCD syncs the cluster. They never need to become Kubernetes experts.</p>

<p><strong>DORA Impact:</strong> Increases <strong>Deployment Frequency</strong>. Because deployment is purely declarative (a git commit), it removes the friction of manual deployments, encouraging teams to ship smaller batches more often.</p>

<h3 id="5-finops-integration">5. FinOps Integration</h3>

<p>The boundary between development teams and infrastructure teams is blurring. A feature can change the infrastructure it runs on, so its cost impact should be assessed during the CI/CD phase.</p>

<p>In the cloud, developers have infinite resources. A junior dev can accidentally provision a database that costs $5,000/month, and you won’t know until the bill arrives 30 days later. The fix: shift cost analysis into the Pull Request.</p>

<p><strong>For Terraform:</strong> The industry standard is <a href="https://www.infracost.io/">Infracost</a>. It parses your Terraform code, compares it against a cloud pricing API, and posts a comment on your Pull Request showing the price difference.</p>

<p>Developer changes an AWS EC2 instance from <code class="language-plaintext highlighter-rouge">t3.micro</code> to <code class="language-plaintext highlighter-rouge">m5.large</code>. CI runs <code class="language-plaintext highlighter-rouge">infracost breakdown --path .</code> and comments on the PR:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cost Increase: +$65/month
Create aws_instance.app_server: +$72.00
Remove aws_instance.old_server: -$7.00
</code></pre></div></div>

<p><strong>For Kubernetes/Helm:</strong> Harder because Kubernetes files list generic CPU/RAM requests, not instance types. The cost depends on which node the pod lands on.</p>

<p><a href="https://www.kubecost.com/">Kubecost</a> / <a href="https://opencost.io/">OpenCost</a> handles this with the <code class="language-plaintext highlighter-rouge">kubectl cost predict</code> command. These tools answer the “why” question: which team’s microservice is hoarding RAM, which namespace is over-provisioned. The trick: you cannot scan a raw Helm chart easily. You must render it first with <code class="language-plaintext highlighter-rouge">helm template . &gt; final_manifest.yaml</code>, then run the prediction.</p>

<p><a href="https://www.vantage.sh/"><strong>Vantage</strong></a>: Works at the cloud bill level (AWS/GCP invoices) rather than the cluster level. It tells you how much you owe; Kubecost tells you why. Good for Cost per Tenant views across your entire cloud footprint.</p>

<p><a href="https://www.harness.io/"><strong>Harness Cloud Cost Management</strong></a>: Does both Terraform and Kubernetes natively. Has a policy engine built-in: you can set a rule to block any PR that increases the monthly forecast by more than $500.</p>

<p><strong>DORA Impact:</strong> While cost is not a standard DORA metric, it acts as a stability guardrail. It prevents financial incidents (blowing the budget), giving management the confidence to allow high-frequency deployments without financial risk.</p>

<h2 id="the-bleeding-edge-agentic-devops">The Bleeding Edge: Agentic DevOps</h2>

<p>The industry is moving from Automated Pipelines to AI Agents.</p>

<p><strong>Old Way (Automated):</strong> The pipeline fails. You get an alert. You read the log. You fix it.</p>

<p><strong>New Way (Agentic):</strong> The pipeline fails. An AI Agent reads the log, writes a fix, and opens a PR for you to approve.</p>

<p>This is what high-performing companies are building towards. Tools like OpsMx (Verification) and Komodor (Troubleshooting) are the answers to the 3 AM problem. They use data to fix or revert things before you even open your laptop.</p>

<h2 id="summary">Summary</h2>

<p>Tools I looked at:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Tool</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Predictive Test Selection</td>
      <td><a href="https://www.cloudbees.com/capabilities/cloudbees-smart-tests">CloudBees Smart Tests</a>, <a href="https://gradle.com/develocity/">Gradle Develocity</a></td>
    </tr>
    <tr>
      <td>Deployment Risk Scoring</td>
      <td><a href="https://www.opsmx.com/autopilot-overview/">OpsMx Autopilot</a></td>
    </tr>
    <tr>
      <td>Root Cause Analysis</td>
      <td><a href="https://komodor.com/">Komodor</a>, <a href="https://www.dynatrace.com/platform/artificial-intelligence/">Dynatrace Davis AI</a>, <a href="https://www.datadoghq.com/product/platform/bits-ai/">Datadog Bits AI</a></td>
    </tr>
    <tr>
      <td>GitOps</td>
      <td><a href="https://argoproj.github.io/cd/">ArgoCD</a>, <a href="https://fluxcd.io/">Flux v2</a></td>
    </tr>
    <tr>
      <td>FinOps</td>
      <td><a href="https://www.infracost.io/">Infracost</a> (Terraform), <a href="https://www.kubecost.com/">Kubecost</a> (K8s)</td>
    </tr>
  </tbody>
</table>

<p><a href="https://www.harness.io/">Harness</a> claims to cover all of this in one platform (I have not tested these features myself, as Harness does not offer easy access to trial their advanced capabilities):</p>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Harness Module</th>
      <th>How it works</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Predictive Test Selection</td>
      <td><a href="https://www.harness.io/products/continuous-integration">Test Intelligence</a></td>
      <td>Builds a call graph, runs only relevant tests</td>
    </tr>
    <tr>
      <td>Deployment Risk Scoring</td>
      <td>Continuous Verification</td>
      <td>ML compares new vs old version, auto-rollback if errors spike</td>
    </tr>
    <tr>
      <td>Smart Root Cause Analysis</td>
      <td>AIDA</td>
      <td>Scans logs and Git history, generates failure summary</td>
    </tr>
    <tr>
      <td>GitOps</td>
      <td>Harness GitOps</td>
      <td>Managed ArgoCD with enterprise UI</td>
    </tr>
    <tr>
      <td>FinOps</td>
      <td>Cloud Cost Management</td>
      <td>Calculates cost impact in PR, can block on budget</td>
    </tr>
  </tbody>
</table>

<h2 id="recommended-reading">Recommended Reading</h2>

<p><a href="https://itrevolution.com/product/accelerate/"><strong>Accelerate: The Science of Lean Software and DevOps</strong></a> (Forsgren, Humble, Kim)</p>

<p>This book uses rigorous statistical data to prove that High Performers who deploy multiple times a day have lower change failure rates than Low Performers who deploy monthly. It is essential reading for understanding why Continuous Deployment is actually safer than the traditional approach.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[After reading Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim, I started rethinking what CD actually means. For years, I worked in environments where CD meant Continuous Delivery: code ready to deploy, waiting for approval.]]></summary></entry><entry><title type="html">Running AI Models in Java without Python</title><link href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html" rel="alternate" type="text/html" title="Running AI Models in Java without Python" /><published>2025-11-30T00:00:00+00:00</published><updated>2025-11-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/11/30/AI_Java</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html"><![CDATA[<p>One of the most common misconceptions in AI engineering is that you always need a Python runtime to execute models.</p>

<p>This is where <strong>ONNX (Open Neural Network Exchange)</strong> is critical. In 2017, Microsoft and Facebook realized they had a problem: framework lock-in.</p>

<p>At the time, if you trained a model in PyTorch, you were stuck there. Deploying it to production often meant rewriting code or using slow wrappers.</p>

<p>Their goal was to create a “Universal Interchange Format”: a standard that allowed models to be trained in flexible frameworks (like PyTorch) but run on high-performance inference engines (like ONNX Runtime) without being tied to the original training environment.</p>

<p>Google did not join the ONNX partnership initially, preferring its own ecosystem, but over time ONNX became the standard bridge between PyTorch and TensorFlow.</p>

<p>In the modern AI landscape, 99% of model development happens in one of two places: PyTorch or TensorFlow.</p>

<h2 id="problem">Problem</h2>

<p>Usually, running AI in Java means creating a sidecar Python service or using slow HTTP bridges.</p>

<p>If you are a Java shop building AI features, you might go through the following steps:</p>

<ol>
  <li>
    <p>Data Science team builds a model in PyTorch/TensorFlow.</p>
  </li>
  <li>
    <p>Engineering team has to wrap it in a Flask/FastAPI container.</p>
  </li>
  <li>
    <p>You end up managing two languages, two CI/CD pipelines, and a massive Python runtime (often 3GB+) just to perform a simple calculation.</p>
  </li>
</ol>

<p>Only recently, with the maturation of LangChain4j and better Java bindings for ONNX Runtime, has it become viable to replace Python entirely in enterprise backends.</p>

<h2 id="onnx-is-like-the-pdf-for-machine-learning-models">ONNX is like the “PDF” for machine learning models.</h2>

<ul>
  <li>
    <p>Python (PyTorch/TensorFlow): It’s the editor where you create, train, and tweak the model. It’s heavy and complex.</p>
  </li>
  <li>
    <p>ONNX: the exported, static artifact. It serializes the model into a computation graph (a set of nodes and edges representing mathematical operations).</p>
  </li>
</ul>

<h2 id="the-runtime-architecture">The Runtime Architecture</h2>

<p>When this system runs, it doesn’t spin up a hidden Python process or make HTTP calls to a Flask server. It uses the Microsoft ONNX Runtime (ORT).</p>

<ul>
  <li>
    <p>ORT is a high-performance inference engine written in C++.</p>
  </li>
  <li>
    <p>The Java application communicates with ORT via the Java Native Interface (JNI).</p>
  </li>
  <li>
    <p>This allows us to run the model on the CPU (using AVX2/AVX512 instructions) or GPU directly from Java, often faster than the original Python implementation because we bypass the
Python Global Interpreter Lock (GIL).</p>
  </li>
</ul>

<p><strong>Result</strong>: you get near-metal performance with zero Python interpreter overhead, no CPython Global Interpreter Lock (GIL), and no pip install nightmares in production, which lets you take full advantage of Java’s multithreading capabilities. Java has no GIL: if your server receives 100 requests to vectorize documents, Java can utilize all cores of the server simultaneously to process them. By using Java, you unlock the hardware’s full potential without complex workarounds (like multiprocessing).</p>
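
<p>Java’s GIL-free parallelism is easy to demonstrate. In the sketch below, a plain parallel stream fans 100 documents across every core; <code>embed()</code> is a hypothetical stand-in for a thread-safe model inference call, not a real ONNX Runtime API:</p>

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Because Java has no GIL, a plain parallel stream saturates every core:
// no multiprocessing, no pickling, no worker pools to babysit.
public class ParallelEmbedding {
    // Hypothetical stand-in for a thread-safe model inference call.
    public static float embed(String doc) {
        return doc.length(); // a real call would return an embedding
    }

    public static void main(String[] args) {
        List<String> docs = IntStream.range(0, 100)
                .mapToObj(i -> "document-" + i)
                .collect(Collectors.toList());
        // Each document is processed on the common ForkJoinPool, in parallel.
        double total = docs.parallelStream()
                .mapToDouble(ParallelEmbedding::embed)
                .sum();
        System.out.println((int) total); // prints 1090 (10 docs of length 10 + 90 of length 11)
    }
}
```

The same code with a sequential <code>stream()</code> gives the same answer on one core; switching to <code>parallelStream()</code> is the entire "workaround".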

<h2 id="the-tokenization-challenge">The Tokenization Challenge</h2>

<p>The trickiest part of “Python-free” RAG isn’t the model; it’s the tokenization (converting text “Hello” into numbers [142, 7489]).</p>

<p>At first glance, even with the model in ONNX, you would still seem to need Python to perform tokenization.</p>

<p>In Python, we take transformers.AutoTokenizer for granted, but under the hood, it is performing a complex process:</p>

<ol>
  <li>
    <p>Normalization: Unicode formatting (NFC vs NFD), lowercasing, and stripping accents.</p>
  </li>
  <li>
    <p>Pre-tokenization: splitting text by whitespace or punctuation (e.g., “don’t” -&gt; “don”, “‘t”).</p>
  </li>
  <li>
    <p>Model Mapping: applying algorithms like BPE (Byte-Pair Encoding used by GPT-4) or WordPiece (used by BERT) to merge characters into sub-word tokens and therefore understand the
concept of the word by analyzing its parts, even if it has never seen the full word before.</p>
  </li>
</ol>
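
<p>Step 3 is the subtle part. A toy WordPiece-style greedy longest-match illustrates how sub-word merging works; the three-entry vocabulary here is purely illustrative, while real tokenizers load roughly 30,000 entries from tokenizer.json:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified WordPiece-style tokenizer: greedily match the longest sub-word
// from the vocabulary; "##" marks a continuation piece inside a word.
public class WordPieceSketch {
    public static List<String> tokenize(String word, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            // Shrink the window until we find a vocabulary entry.
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) { piece = candidate; break; }
                end--;
            }
            if (piece == null) return List.of("[UNK]"); // no match: unknown token
            tokens.add(piece);
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##believ", "##able");
        System.out.println(tokenize("unbelievable", vocab));
        // prints [un, ##believ, ##able]
    }
}
```

This is why the model can reason about a word it has never seen in full: the pieces carry the meaning.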

<p>In Java, LangChain4j and the underlying <a href="https://github.com/deepjavalibrary/djl">DJL</a>/<a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> dependencies handle this
natively. They read the standard tokenizer.json file (exported from Hugging Face) for a model and perform the text-to-ID conversion entirely in Java before feeding the tensors to
ONNX.</p>

<h2 id="the-pure-java-pipeline">The Pure Java Pipeline:</h2>

<ol>
  <li>
    <p>Input: String (Java)</p>
  </li>
  <li>
    <p>Tokenization: Native Java implementation (No Python)</p>
  </li>
  <li>
    <p>Inference: ONNX Runtime (C++ via JNI)</p>
  </li>
  <li>
    <p>Output: Vector Embedding (Java float array)</p>
  </li>
</ol>
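
<p>A minimal sketch of those four stages, with stub stages standing in for the real tokenizer.json-backed tokenizer and the JNI-backed ONNX Runtime session (none of these names are mcp_server4j’s actual API):</p>

```java
import java.util.Arrays;

// Illustrative four-stage pipeline: String in, float[] out, no Python anywhere.
public class PurePipeline {
    // Stage 2: tokenization -- text to token IDs (stubbed as code points).
    public static long[] tokenize(String text) {
        return text.chars().asLongStream().toArray();
    }

    // Stage 3: inference -- token IDs to an embedding (stubbed as a 4-dim histogram;
    // the real stage is an ONNX Runtime session call through JNI).
    public static float[] infer(long[] ids) {
        float[] v = new float[4];
        for (long id : ids) v[(int) (id % 4)] += 1.0f;
        return v;
    }

    // Stage 1 -> 4: compose the stages.
    public static float[] embed(String text) {
        return infer(tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(embed("Hello")));
        // prints [3.0, 1.0, 0.0, 1.0]
    }
}
```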

<p>This architecture is what allows mcp_server4j to run as a single, self-contained JAR file with zero external dependencies.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[One of the most common misconceptions in AI engineering is that you always need a Python runtime to execute models.]]></summary></entry><entry><title type="html">Building RAG Systems in Java</title><link href="https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J.html" rel="alternate" type="text/html" title="Building RAG Systems in Java" /><published>2025-11-29T00:00:00+00:00</published><updated>2025-11-29T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J.html"><![CDATA[<p>After the Python version, I wanted to verify whether you could build a Retrieval-Augmented Generation (RAG) system from scratch in Java.</p>

<h2 id="the-challenge">The Challenge</h2>

<p>Python has become the de facto language for AI/ML projects, and for good reason: excellent libraries, rapid prototyping, and a mature ecosystem.</p>

<p>I wanted to explore whether RAG systems could be built with the same effectiveness in Java, particularly for production environments.</p>

<h2 id="the-implementation">The Implementation</h2>

<p>I built MCP Server 4J, a Model Context Protocol server implementing hybrid search (BM25 + vector similarity) with:</p>

<ul>
  <li>Apache Lucene for BM25 keyword indexing</li>
  <li>LangChain4j for vector embeddings and ChromaDB integration</li>
  <li>Spring Boot for dependency injection and configuration management</li>
</ul>
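
<p>The hybrid part boils down to weighted score fusion. A rough sketch, assuming BM25 scores are max-normalized per query and cosine similarities already fall in [0, 1] (the weights and normalization here are illustrative, not the exact fusion used in MCP Server 4J):</p>

```java
import java.util.HashMap;
import java.util.Map;

// Blend keyword (BM25) and vector (cosine) scores with a configurable alpha.
public class HybridFusion {
    public static Map<String, Double> fuse(Map<String, Double> bm25,
                                           Map<String, Double> vector, double alpha) {
        Map<String, Double> fused = new HashMap<>();
        // Max-normalize BM25 so it is comparable to cosine similarity.
        double bm25Max = bm25.values().stream().mapToDouble(d -> d).max().orElse(1.0);
        for (String doc : bm25.keySet())
            fused.merge(doc, alpha * bm25.get(doc) / bm25Max, Double::sum);
        for (String doc : vector.keySet())
            fused.merge(doc, (1 - alpha) * vector.get(doc), Double::sum);
        return fused;
    }

    public static void main(String[] args) {
        Map<String, Double> bm25 = Map.of("doc1", 8.0, "doc2", 2.0);
        Map<String, Double> vector = Map.of("doc2", 0.9, "doc3", 0.5);
        Map<String, Double> fused = fuse(bm25, vector, 0.5);
        // doc2 wins: moderate keyword match plus a strong vector match.
        String top = fused.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        System.out.println(top); // prints doc2
    }
}
```

A document that matches on both channels outranks one that dominates a single channel, which is the whole point of hybrid search.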

<h2 id="key-findings">Key Findings</h2>

<h3 id="what-works-well">What Works Well:</h3>

<ul>
  <li>Type safety catches errors at compile time, not runtime</li>
  <li>Spring Boot’s DI container makes testing straightforward</li>
  <li>Apache Lucene provides native, production-ready BM25 implementation</li>
</ul>

<h3 id="java-has-everything-you-need">Java Has Everything You Need:</h3>

<ul>
  <li>Apache Lucene provides industrial-strength BM25 ranking</li>
  <li>LangChain4j brings vector embeddings and model integrations</li>
  <li><a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> Runtime eliminates Python dependencies entirely, handling model execution natively</li>
  <li>The ecosystem is mature and production-ready</li>
</ul>

<h3 id="the-java-advantage">The Java Advantage:</h3>

<ul>
  <li>Interfaces (KeywordIndexer, DocumentLoader, DocumentChunker) make the system testable and extensible</li>
  <li>Type safety means errors show up in my IDE, not in production</li>
  <li>LangChain4j mitigates the risk of silent tokenization failures: LangChain4j and the underlying
<a href="https://github.com/deepjavalibrary/djl">DJL</a>/<a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> dependencies favor explicit, compiled code with fixed
configurations loaded from a standard asset (tokenizer.json). In Python, a developer has more flexibility (and thus more room for error) to skip or misconfigure the normalization
step.</li>
</ul>
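
<p>To give a flavor of that interface-driven design, here is a hedged sketch: the method signatures are guesses for illustration, not MCP Server 4J’s actual API, but they show why an in-memory stub makes unit testing trivial without Lucene on the classpath:</p>

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical KeywordIndexer contract; the real implementation would wrap Lucene.
public class IndexerSketch {
    interface KeywordIndexer {
        void index(String docId, String text);
        List<String> search(String query, int topK);
    }

    // In-memory stub for tests: naive substring matching instead of BM25.
    static class InMemoryIndexer implements KeywordIndexer {
        private final Map<String, String> docs = new LinkedHashMap<>();
        public void index(String docId, String text) { docs.put(docId, text); }
        public List<String> search(String query, int topK) {
            return docs.entrySet().stream()
                    .filter(e -> e.getValue().toLowerCase().contains(query.toLowerCase()))
                    .map(Map.Entry::getKey)
                    .limit(topK)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        KeywordIndexer indexer = new InMemoryIndexer();
        indexer.index("doc1", "Hybrid search in Java");
        indexer.index("doc2", "Python prototyping notes");
        System.out.println(indexer.search("java", 5)); // prints [doc1]
    }
}
```

Swapping the stub for a Lucene-backed implementation changes nothing in the calling code, which is exactly the extensibility benefit described above.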

<h3 id="the-tradeoffs">The Tradeoffs:</h3>

<ul>
  <li>10x more code than the Python equivalent (~2000 lines vs ~200)</li>
  <li>Longer development cycles for initial implementation</li>
  <li>Higher memory footprint (~500MB vs ~200MB)</li>
  <li>More complex build tooling (Maven vs pip)</li>
</ul>

<p>The performance is essentially identical: 20-30ms query latency with hybrid search combining BM25 and vector similarity. The real difference isn’t runtime performance; it’s
development confidence.</p>

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>RAG is definitely achievable in Java. The ecosystem has matured significantly with LangChain4j, Apache Lucene, and ONNX runtime support.</li>
  <li>Enterprise patterns matter at scale. What feels like over-engineering in Python (factories, interfaces, dependency injection) becomes valuable when you have multiple
teams working on the same codebase.</li>
  <li>Choose the right tool for the job. Python excels at rapid prototyping and research. Java shines in production environments where you need strong contracts, clear
interfaces, and long-term maintainability.</li>
</ol>

<h2 id="the-verdict">The Verdict</h2>

<p>Choose Java if:</p>

<ul>
  <li>You need strong type safety and compile-time guarantees</li>
  <li>You’re building production systems requiring clear interfaces</li>
  <li>Long-term maintainability is a priority</li>
</ul>

<p>Stick with Python if:</p>

<ul>
  <li>You’re in research/prototype phase</li>
  <li>Team expertise is primarily Python</li>
  <li>You need access to cutting-edge model libraries</li>
</ul>

<p>The additional development time is offset by fewer runtime surprises.</p>

<h2 id="technical-details">Technical Details</h2>

<p>The complete implementation includes:</p>

<ul>
  <li>Hybrid search with configurable BM25/vector weights</li>
  <li>Multi-format document support (PDF, Markdown, TXT)</li>
  <li>
    <p>~20-30ms query latency with 100% recall@5 on test queries</p>
  </li>
  <li>
    <p>Embedding Model Specifications: the same <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">All-MiniLM-L6-v2</a> model as in the Python implementation was chosen,
for exactly the same reasons: high efficiency, producing a dense vector of <strong>384 dimensions</strong>. This dimension is automatically respected by LangChain4j when interfacing
with <strong>ChromaDB</strong>.</p>
  </li>
  <li>Chunking Strategy: the <code class="language-plaintext highlighter-rouge">RecursiveDocumentChunker</code> uses the <code class="language-plaintext highlighter-rouge">DocumentSplitters.recursive()</code> method, configured for <strong>character count</strong> (512 characters). This strategy
intentionally keeps chunks safely below the model’s hard limit of <strong>256 tokens</strong> (since 512 characters is roughly equivalent to 128-150 tokens in English), preventing truncation
and maximizing context preservation.</li>
</ul>
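
<p>The recursive strategy can be sketched in plain Java: try coarse separators first (paragraphs, then sentences, then words), recurse on oversized pieces, and greedily re-merge neighbors while they still fit the budget. This is an illustrative approximation of what <code class="language-plaintext highlighter-rouge">DocumentSplitters.recursive()</code> does, not its actual implementation:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy recursive character chunker: split on progressively finer separators,
// then greedily re-merge adjacent pieces up to maxChars.
public class RecursiveChunkerSketch {
    static final String[] SEPS = {"\n\n", ". ", " "};

    public static List<String> chunk(String text, int max) {
        return chunk(text.trim(), max, 0);
    }

    static List<String> chunk(String text, int max, int level) {
        if (text.length() <= max) return text.isEmpty() ? List.of() : List.of(text);
        if (level >= SEPS.length) { // no separator left: hard cut as a last resort
            List<String> out = new ArrayList<>();
            for (int i = 0; i < text.length(); i += max)
                out.add(text.substring(i, Math.min(i + max, text.length())));
            return out;
        }
        // Split on the current separator and recurse on each piece.
        List<String> pieces = new ArrayList<>();
        for (String part : text.split(Pattern.quote(SEPS[level])))
            pieces.addAll(chunk(part.trim(), max, level + 1));
        // Greedily re-merge neighbors (rejoined with a space, a sketch-level
        // simplification) while they still fit under max.
        List<String> merged = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (String p : pieces) {
            if (cur.length() > 0 && cur.length() + p.length() + 1 > max) {
                merged.add(cur.toString());
                cur.setLength(0);
            }
            if (cur.length() > 0) cur.append(' ');
            cur.append(p);
        }
        if (cur.length() > 0) merged.add(cur.toString());
        return merged;
    }

    public static void main(String[] args) {
        String doc = "First paragraph. Quite short.\n\nSecond paragraph that is a bit longer than the limit we set here.";
        for (String c : chunk(doc, 40))
            System.out.println(c.length() + ": " + c); // 3 chunks, all within the 40-char budget
    }
}
```

The production chunker adds what this sketch omits: token-aware budgets, overlap between chunks, and separator-preserving joins.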

<p>The full source is available on GitHub for anyone interested in exploring RAG beyond the Python ecosystem: <a href="https://github.com/jeremylem/mcp_server4j">mcp_server4j</a></p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[After the Python version, I wanted to verify whether you could build a Retrieval-Augmented Generation (RAG) system from scratch in Java.]]></summary></entry></feed>