<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jeremylem.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jeremylem.github.io/" rel="alternate" type="text/html" /><updated>2026-02-18T12:00:37+00:00</updated><id>https://jeremylem.github.io/feed.xml</id><title type="html">Cognitive offloading</title><subtitle>Emptying the mind, one article at a time</subtitle><author><name>Jeremy L.</name></author><entry><title type="html">AWS KMS XKS Without an HSM: Threshold Cryptography Across Cloud Providers</title><link href="https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto.html" rel="alternate" type="text/html" title="AWS KMS XKS Without an HSM: Threshold Cryptography Across Cloud Providers" /><published>2026-02-17T00:00:00+00:00</published><updated>2026-02-17T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/17/AWS_KMS_XKS_Threshold_Crypto.html"><![CDATA[<p>In a <a href="/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html">previous post</a>, I backed AWS KMS with a SoftHSM on my local machine via an SSH tunnel, based on the <a href="https://github.com/aws-samples/aws-kms-xks-proxy">AWS XKS proxy sample</a>. It proved the concept: the encryption key lives on hardware I control, and AWS KMS calls my proxy for every cryptographic operation.</p>

<p>This post takes a different approach. Instead of storing the AES key in one place and guarding it, I split the key so it never exists in any single location. Two independent services on two different cloud providers each hold a share. They cooperate to perform encryption, but neither can derive the key alone. No HSM required.</p>

<h2 id="the-problem-with-a-single-key-in-a-single-place">The Problem with a Single Key in a Single Place</h2>

<p>HSMs are expensive, complex to operate, and tied to a specific vendor and facility. If you want data sovereignty guarantees across European jurisdictions, you need HSMs in those jurisdictions, and you need to double the hardware for resilience, with all the procurement and compliance overhead that entails.</p>

<p>There is a way to make it simpler.</p>

<h2 id="split-key-across-cloud-providers">Split Key Across Cloud Providers</h2>

<p>The scheme is 2-of-3 threshold cryptography over the NIST P-256 elliptic curve. Three key shares are generated in a one-time ceremony. Any two shares can derive the AES-256 encryption key for a given key identifier. One share alone is mathematically useless.</p>
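<p>A minimal Python model of that ceremony (illustrative only – the deployed system is Rust, and the actual ceremony code is not shown in this post): shares are points on a random degree-1 polynomial over the P-256 scalar field, so any two of them interpolate the secret at zero.</p>

```python
# Sketch of the 2-of-3 ceremony (illustrative; not the deployed code).
import secrets

# Order of the P-256 base-point group (the scalar field).
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def make_shares(secret: int) -> dict[int, int]:
    """Split `secret` with threshold 2: share_i = f(i) for f(x) = secret + a*x."""
    a = secrets.randbelow(N - 1) + 1               # random non-zero slope
    return {i: (secret + a * i) % N for i in (1, 2, 3)}

def reconstruct(two_shares: dict[int, int]) -> int:
    """Lagrange-interpolate f(0) from any two shares."""
    (i, si), (j, sj) = two_shares.items()
    li = j * pow(j - i, -1, N) % N                 # coefficient of share i at x=0
    lj = i * pow(i - j, -1, N) % N
    return (li * si + lj * sj) % N

secret = secrets.randbelow(N)
shares = make_shares(secret)
assert reconstruct({1: shares[1], 2: shares[2]}) == secret   # normal pair
assert reconstruct({1: shares[1], 3: shares[3]}) == secret   # failover pairs
assert reconstruct({2: shares[2], 3: shares[3]}) == secret
```

<p>With a degree-1 polynomial, two points determine the line exactly, while a single share is consistent with every possible secret – which is why one share reveals nothing.</p>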

<p>For normal operation, two shares participate:</p>

<ul>
  <li><strong>Share 1</strong> lives in a Scaleway Serverless Container (fr-par, Paris). This is the XKS proxy that AWS KMS calls. Scaleway provides the HTTPS endpoint; the container runs plain HTTP behind Scaleway’s TLS termination.</li>
  <li><strong>Share 2</strong> lives on an Exoscale compute instance (ch-gva-2, Geneva). A minimal service that computes one elliptic curve operation per request, protected by mutual TLS.</li>
  <li><strong>Share 3</strong> is an offline backup. Stored securely, never deployed. Used only if one of the other two services is permanently lost.</li>
</ul>

<p>When AWS KMS needs to encrypt or decrypt, the XKS proxy computes its part locally, asks the EU service for the second part over mTLS, combines the two parts mathematically, and derives the AES-256 key. The key exists in memory for the duration of one request, then is zeroed.</p>
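<p>The last step of that flow – turning the combined curve point into a symmetric key – can be sketched with the standard library. The post only says HKDF then AES-GCM; the salt, info string, and encoded point below are hypothetical stand-ins, not the real parameters.</p>

```python
# Pure-stdlib HKDF-SHA256 (RFC 5869). The inputs marked "hypothetical"
# are illustrative values, not the deployed system's actual parameters.
import hashlib, hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()          # extract step
    okm, block = b"", b""
    for i in range((length + 31) // 32):                        # expand step
        block = hmac.new(prk, block + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

shared_point = bytes.fromhex("04" + "11" * 64)   # hypothetical encoded point
aes_key = hkdf_sha256(shared_point, salt=b"xks-threshold", info=b"encryption/key-1")
assert len(aes_key) == 32                        # AES-256 key material
```

<p>Because HKDF is deterministic, the same combined point always yields the same AES-256 key – the property KMS relies on for decrypt to match encrypt.</p>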

<p>The critical property: the private key is never reconstructed, not even transiently. Each party computes its own partial result on its own share. The proxy combines these partial results using Lagrange interpolation on the elliptic curve. The math guarantees this produces the same shared secret as if the full private key had been used, but no single party ever held that full key.</p>
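<p>Because the map from a scalar <code class="language-plaintext highlighter-rouge">s</code> to the point <code class="language-plaintext highlighter-rouge">s·V</code> is linear, plain modular scalars can stand in for curve points to check the combination step. This is a simplified model of the protocol, not the production code:</p>

```python
# Each party multiplies V by its own share; the proxy scales the partials
# by the public Lagrange coefficients and adds them. The full secret
# (l1*s1 + l2*s2) is never formed as a scalar anywhere.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def lagrange_at_zero(i: int, j: int) -> int:
    """Coefficient for share i when combining with share j (evaluated at x=0)."""
    return j * pow(j - i, -1, N) % N

secret, slope = 0xCAFE, 0xBEEF            # toy ceremony: f(x) = secret + slope*x
s1, s2 = (secret + slope * 1) % N, (secret + slope * 2) % N
V = 0x1234                                # stand-in for the hashed curve point

partial_1 = s1 * V % N                    # computed on Scaleway
partial_2 = s2 * V % N                    # computed on Exoscale
combined = (lagrange_at_zero(1, 2) * partial_1 +
            lagrange_at_zero(2, 1) * partial_2) % N

assert combined == secret * V % N         # same result as using the full key
```

<p>The final assertion is exactly the "math guarantees" claim: scaling and adding the partials gives <code class="language-plaintext highlighter-rouge">(λ₁s₁ + λ₂s₂)·V = secret·V</code>, yet neither party and not even the proxy ever evaluates <code class="language-plaintext highlighter-rouge">λ₁s₁ + λ₂s₂</code> itself.</p>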

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AWS KMS
  |
  |  HTTPS + SigV4
  v
Scaleway (fr-par)                      Exoscale (ch-gva-2)
  XKS Proxy — Holds: Share 1            EU Share Service — Holds: Share 2
  |                                      |
  | 1. V = hash_to_curve(keyId)          |
  | 2. partial_1 = share_1 * V           |
  |                                      |
  |------------ mTLS -----------------&gt; |
  |    "compute share_2 * V"             | 3. partial_2 = share_2 * V
  | &lt;----------------------------------- |
  |                                      |
  | 4. Lagrange combine partials         |
  | 5. HKDF -&gt; AES-256 key               |
  | 6. AES-GCM encrypt/decrypt           |
  | 7. Zeroize key                       |
  |
  v
AWS KMS (response)
</code></pre></div></div>

<p>The virtual point <code class="language-plaintext highlighter-rouge">V</code> is derived deterministically from the key identifier using hash-to-curve (RFC 9380). This means the same <code class="language-plaintext highlighter-rouge">keyId</code> always produces the same AES-256 key – a requirement for KMS, where encrypt and decrypt must use the same key.</p>
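<p>For intuition, here is the simpler “try-and-increment” way to hash a string onto P-256. The deployment uses RFC 9380, which is constant-time and considerably more involved; this sketch only illustrates the determinism:</p>

```python
# Illustrative only: hash the keyId to a candidate x-coordinate and bump
# it until x lies on P-256. Same keyId -> same point, every time.
import hashlib

# P-256 field prime and curve constant b (curve: y^2 = x^3 - 3x + b mod P)
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
B = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B

def hash_to_point(key_id: str) -> tuple[int, int]:
    x = int.from_bytes(hashlib.sha256(key_id.encode()).digest(), "big") % P
    while True:
        rhs = (pow(x, 3, P) - 3 * x + B) % P
        y = pow(rhs, (P + 1) // 4, P)          # sqrt attempt; valid since P % 4 == 3
        if y * y % P == rhs:                   # x is on the curve: done
            return x, y
        x = (x + 1) % P                        # otherwise try the next x

v = hash_to_point("encryption/key-1")
assert v == hash_to_point("encryption/key-1")  # deterministic, as KMS requires
```

<p>Roughly half of all x values have a matching y, so the loop terminates after a couple of tries on average – but its data-dependent timing is one reason real deployments use the RFC 9380 constructions instead.</p>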

<h2 id="what-each-party-sees">What Each Party Sees</h2>

<table>
  <thead>
    <tr>
      <th>Party</th>
      <th>Knows</th>
      <th>Cannot derive</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>XKS Proxy — Scaleway (Share 1)</td>
      <td>Its share, the virtual point, its own partial result</td>
      <td>The AES key without Share 2’s partial</td>
    </tr>
    <tr>
      <td>EU Share Service — Exoscale (Share 2)</td>
      <td>Its share, the virtual point, its own partial result</td>
      <td>The AES key, the plaintext, the ciphertext, anything about the AWS request</td>
    </tr>
    <tr>
      <td>AWS KMS</td>
      <td>Ciphertext, AAD</td>
      <td>The AES key (no shares, no partials)</td>
    </tr>
    <tr>
      <td>Attacker with 1 share</td>
      <td>One share scalar</td>
      <td>The AES key (need 2 of 3)</td>
    </tr>
  </tbody>
</table>

<p>The Exoscale service is deliberately minimal. It receives a point, multiplies it by its share scalar, and returns the result. It never sees plaintext, ciphertext, additional authenticated data, or the derived AES key. It doesn’t even know whether the request is for an encrypt or decrypt operation.</p>
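<p>The computational core of such a service fits in a few lines. This is a hedged Python sketch with textbook affine curve arithmetic – the real service is Rust behind mTLS, and the handler name is invented:</p>

```python
# Minimal model of the share service's work: validate that the incoming
# point is on P-256, multiply it by the share scalar, return the result.
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
A = P - 3
B = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B

def on_curve(pt) -> bool:
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P == 0

def ec_add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        m = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P   # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, -1, P) % P          # chord slope
    x3 = (m * m - x1 - x2) % P
    return x3, (m * (x1 - x3) - y1) % P

def ec_mul(k: int, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

def handle_partial(share: int, point):
    """One request: reject malformed points, return share * point."""
    assert on_curve(point), "reject malformed points"
    return ec_mul(share, point)

# P-256 base point as a sample input
G = (0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296,
     0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5)
assert on_curve(handle_partial(5, G))
```

<p>The on-curve check matters: answering scalar multiplications on attacker-chosen invalid points is a classic way to leak a secret scalar, so even a "deliberately minimal" service must validate its input.</p>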

<h2 id="what-this-protects-and-what-it-doesnt">What This Protects and What It Doesn’t</h2>

<p>Like the previous HSM-based approach, this is about <strong>data at rest</strong>. When you upload a file to S3 with KMS encryption, the file is encrypted before it hits storage. When you download it, S3 asks KMS to decrypt, KMS asks my proxy, my proxy cooperates with Exoscale, and the plaintext is returned through the chain. At each step of that chain, the data passes through systems in the clear – AWS sees the plaintext when it serves the GET request. There is no way around this when using server-side encryption.</p>

<p>The protection is against someone gaining access to the stored data without going through the live system. If S3 storage is exfiltrated, the encrypted objects are useless without the key. And the key cannot be derived without the cooperation of two independent services.</p>

<h3 id="the-monitoring-advantage">The monitoring advantage</h3>

<p>This is where split keys differ meaningfully from a single HSM. Every decryption operation requires the Exoscale service to participate. Every request is logged: which key path was used, when, and from which client. If someone – including the operator of the XKS proxy – tries to mass-decrypt stored data, the Exoscale service sees a sudden flood of requests. This is visible in Exoscale’s logs, independently of anything happening on the AWS or Scaleway side.</p>

<p>Here are the actual logs from the deployed system. First, the Scaleway XKS proxy starting up and handling requests:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Scaleway XKS Proxy (fr-par)
18:07:01  Share store initialized, share_id=1, key_count=5
18:07:01  EU client mTLS configured (client cert + custom CA)
18:07:01  EU client initialized, base_url=https://92.39.60.218:8443
18:07:01  XKS Proxy listening, addr=0.0.0.0:8080
18:20:57  Encrypt   kmsRequestId=f6be4217-...  key_id=test-key-1
18:21:06  Decrypt   kmsRequestId=b836a273-...  key_id=test-key-1
18:34:23  Decrypt   kmsRequestId=544d2f24-...  key_id=test-key-1
18:34:26  Decrypt   kmsRequestId=45b3ecaa-...  key_id=test-key-1
18:35:34  Decrypt   kmsRequestId=1ab54f6b-...  key_id=test-key-1
</code></pre></div></div>

<p>And the Exoscale EU share service, which sees only partial ECDH requests — no plaintext, no ciphertext, no indication of whether it’s an encrypt or decrypt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Exoscale EU Share Service (ch-gva-2)
17:02:54  Share store initialized, share_id=2, key_count=5
17:02:54  EU Share Service listening (mTLS), addr=0.0.0.0:8443
17:20:57  Partial ECDH request  request_id=6a7621ce772b8015  key_path=encryption/key-1
17:21:06  Partial ECDH request  request_id=1d5d05530cdc70a4  key_path=encryption/key-1
17:34:23  Partial ECDH request  request_id=8defb91565227a92  key_path=encryption/key-1
17:34:26  Partial ECDH request  request_id=ed1cf2052a008935  key_path=encryption/key-1
17:35:34  Partial ECDH request  request_id=22c75b772d54ffaa  key_path=encryption/key-1
</code></pre></div></div>

<p>Every operation leaves a trace on both providers independently. The KMS request UUID from AWS flows through to the Scaleway proxy logs. The Exoscale service logs the key derivation path but has no visibility into what the operation is for.</p>

<p>More importantly, the Exoscale operator can act on anomalies. If they see thousands of partial computation requests in a few minutes for key paths that normally see a handful per hour, they can shut down the instance. The XKS proxy still holds its share, but one share alone is useless – mathematically, it reveals nothing about the key.</p>
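<p>A sliding-window counter over the service's own logs is enough to catch that pattern. The sketch below is hypothetical and not part of the deployed system; the window and threshold values are made up:</p>

```python
# Per-key-path request rate check: flag any path whose request count in
# the sliding window exceeds a threshold. (Illustrative values only.)
from collections import deque

class RateMonitor:
    def __init__(self, window_s: float, max_requests: int):
        self.window_s, self.max_requests = window_s, max_requests
        self.events: dict[str, deque] = {}

    def record(self, key_path: str, t: float) -> bool:
        """Record one request at time t; return True if the path looks anomalous."""
        q = self.events.setdefault(key_path, deque())
        q.append(t)
        while q and t - q[0] > self.window_s:   # expire events outside the window
            q.popleft()
        return len(q) > self.max_requests

mon = RateMonitor(window_s=60.0, max_requests=100)
assert not mon.record("encryption/key-1", 0.0)          # a handful per hour: fine
flags = [mon.record("encryption/key-1", 0.1 * i) for i in range(200)]
assert flags[-1]                                        # a flood in seconds: flagged
```

<p>The operator's response to a flag can be as blunt as stopping the instance – which, per the threshold property, immediately revokes the key.</p>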

<p>This is a property that no single-HSM setup provides. When the key is in one place, whoever controls that place can silently decrypt everything. With split keys, any large-scale decryption is necessarily visible to the other party, and either party can pull the plug.</p>

<h2 id="performance-the-250ms-budget">Performance: The 250ms Budget</h2>

<p>AWS KMS requires the XKS proxy to respond within 250ms. The threshold protocol adds a network round-trip to a second cloud provider in a different country. Can it stay within budget?</p>

<p>The XKS proxy runs on Scaleway Serverless Containers (fr-par, Paris). The EU share service runs on Exoscale (ch-gva-2, Geneva). Paris to Geneva is ~340km, with network round-trips typically 5-15ms.</p>

<h3 id="warm-requests">Warm requests</h3>

<p>With connection reuse (the typical case after the first request), the round-trip to Geneva is fast. The elliptic curve operations take about 1ms on each side. The bottleneck is network latency, not computation. All operations stay well within the 250ms budget.</p>

<h3 id="cold-start">Cold start</h3>

<p>The Scaleway container starts in under a second, and periodic KMS health checks keep it warm. Both the XKS proxy and the EU share service are static Rust binaries in distroless Docker images, so there is little to initialize: on startup the proxy loads its share store, configures the mTLS client to Exoscale, and begins listening.</p>

<h3 id="exoscale-service">Exoscale service</h3>

<p>The EU share service runs on an Exoscale compute instance with rustls handling mTLS directly (no TLS termination proxy). It performs a single scalar-point multiplication per request (~1ms of computation). The mTLS connection ensures only the XKS proxy – holding the correct client certificate signed by the pinned CA – can reach it.</p>

<h2 id="the-sovereignty-argument">The Sovereignty Argument</h2>

<p>The real value isn’t performance – it’s the trust model.</p>

<p>With a traditional HSM-backed XKS, the key is in one place. Whoever controls that place controls the key. With threshold cryptography, the key is split across providers:</p>

<ul>
  <li><strong>No single cloud provider</strong> can derive the key. Scaleway holds one share but needs Exoscale’s cooperation. Exoscale holds one share but never sees the key or the data. AWS holds no shares at all.</li>
  <li><strong>Revocation is instant.</strong> Shut down the Exoscale instance, and AWS KMS can no longer encrypt or decrypt. The kill switch is a single command to a Swiss provider that has no relationship with AWS or Scaleway.</li>
  <li><strong>No HSM vendor lock-in.</strong> The system is pure software. The shares are 34-byte values. They can be deployed to any compute platform that runs a Rust binary.</li>
</ul>

<h2 id="three-providers-three-countries">Three Providers, Three Countries</h2>

<p>The current deployment already spans two providers and two countries: Scaleway (Paris, France) for the XKS proxy and Exoscale (Geneva, Switzerland) for the EU share service. But the 2-of-3 threshold scheme is designed for three:</p>

<table>
  <thead>
    <tr>
      <th>Share</th>
      <th>Provider</th>
      <th>Location</th>
      <th>Jurisdiction</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Share 1</td>
      <td>Scaleway Serverless Containers</td>
      <td>fr-par (Paris)</td>
      <td>France</td>
      <td>Active — XKS proxy</td>
    </tr>
    <tr>
      <td>Share 2</td>
      <td>Exoscale Compute</td>
      <td>ch-gva-2 (Geneva)</td>
      <td>Switzerland</td>
      <td>Active — EU share service</td>
    </tr>
    <tr>
      <td>Share 3</td>
      <td>IONOS or Hetzner</td>
      <td>Frankfurt</td>
      <td>Germany</td>
      <td>Cold standby</td>
    </tr>
  </tbody>
</table>

<p>Any two of the three can derive the key. This gives you:</p>

<ul>
  <li><strong>Provider resilience.</strong> If one provider has an outage, the other two can still operate. Reconfigure the XKS proxy to call the remaining provider. Since the Lagrange interpolation is parameterized by share IDs, switching from Share 2 to Share 3 is a configuration change, not a key change. The derived AES key is identical regardless of which two shares are used.</li>
  <li><strong>Jurisdictional resilience.</strong> If one country’s regulator orders a provider to freeze or hand over data, they get one share – which is mathematically useless without a second share from a different jurisdiction. The other two providers can continue operating.</li>
  <li><strong>No single point of legal compulsion.</strong> A French court order to Scaleway yields Share 1. A Swiss court order to Exoscale yields Share 2. Neither alone produces the key. Compelling two jurisdictions simultaneously requires international cooperation – a significantly higher bar than a single domestic order.</li>
  <li><strong>Mutual oversight.</strong> Each share service logs every request. Unusual decryption patterns are visible to at least two independent operators in two different countries. A government quietly compelling mass decryption through one provider is immediately visible to the other.</li>
</ul>

<h3 id="cross-border-latency">Cross-border latency</h3>

<p>Paris to Geneva is ~340km. Paris to Frankfurt is ~480km. Network round-trips between these cities are typically 5-15ms.</p>

<p>The current Paris-Geneva deployment already works within the 250ms budget. Adding a third share in Frankfurt would not change the latency significantly – the XKS proxy only calls one partner per request, and Paris to Frankfurt is a comparable distance.</p>

<h3 id="swiss-neutrality">Swiss neutrality</h3>

<p>Switzerland is particularly interesting as a share location. It’s not an EU member state, so it’s outside the reach of EU-wide data access orders. Swiss data protection law (nDSG/LPD) is independently strong. Providers like Infomaniak and Exoscale specifically market sovereignty and data localization. A share stored with a Swiss provider adds jurisdictional diversity that goes beyond having multiple EU locations.</p>

<h3 id="the-operational-model">The operational model</h3>

<p>In normal operation, only two shares are active. The third is cold standby. If the XKS proxy needs to switch to a different partner:</p>

<ol>
  <li>Activate the standby share service (deploy the container to the standby provider)</li>
  <li>Update the XKS proxy’s configuration: new URL and the new share ID</li>
  <li>The Lagrange coefficients adjust automatically based on the share IDs involved</li>
  <li>The derived AES key is identical – the math guarantees it</li>
</ol>
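<p>Steps 3 and 4 can be checked numerically (toy scalars standing in for curve points, since <code class="language-plaintext highlighter-rouge">s·V</code> is linear in <code class="language-plaintext highlighter-rouge">s</code>): the Lagrange coefficients depend only on which share IDs participate, and any pair of shares derives the same value.</p>

```python
# Failover invariance check: combining shares {1,2} and shares {1,3}
# yields the identical derived value. Toy numbers, not real key material.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def combine(pair: dict[int, int], v: int) -> int:
    """Lagrange-combine two partials share_i * v at x=0."""
    (i, si), (j, sj) = pair.items()
    li = j * pow(j - i, -1, N) % N   # coefficients recomputed from the IDs
    lj = i * pow(i - j, -1, N) % N
    return (li * si * v + lj * sj * v) % N

secret, slope, v = 0x5151, 0xA0A0, 0x7777          # toy ceremony values
shares = {i: (secret + slope * i) % N for i in (1, 2, 3)}

k_12 = combine({1: shares[1], 2: shares[2]}, v)    # normal operation
k_13 = combine({1: shares[1], 3: shares[3]}, v)    # failover to Share 3
assert k_12 == k_13 == secret * v % N              # identical derived value
```

<p>Swapping in the standby share is therefore invisible to AWS KMS: nothing about the external key changes, only the proxy's partner configuration.</p>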

<h2 id="cost">Cost</h2>

<p>The entire system runs on near-free infrastructure:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Scaleway Serverless Container (XKS proxy)</td>
      <td>Free tier (400k GB-s/month)</td>
    </tr>
    <tr>
      <td>Exoscale Compute (EU share service)</td>
      <td>~$15/month (standard.small)</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$15/month</strong></td>
    </tr>
  </tbody>
</table>

<p>Scaleway provides the HTTPS endpoint for free with its serverless containers. The Exoscale VM is the only fixed cost.</p>

<p>Compare this to a Thales Luna HSM ($15-25k/year) or a cloud-managed HSM billed per key per hour. The threshold approach costs almost nothing at low to moderate volumes.</p>

<h2 id="what-this-does-not-replace">What This Does Not Replace</h2>

<p>This is not FIPS 140-3 Level 3. There is no tamper-resistant hardware. If you need a compliance checkbox that says “FIPS”, you need an HSM.</p>

<p>And as with any server-side encryption scheme, the data passes through AWS in the clear during normal operations. AWS could retain plaintext, log decrypted data, or be compelled to do so. This is inherent to the model: you’re asking AWS to encrypt and decrypt on your behalf.</p>

<p>What threshold cryptography adds is control over the key at rest and visibility into its use. The key cannot be derived without active cooperation from two independent parties on two different cloud providers in two different countries. Unusual access patterns are visible to both. And either party can revoke access instantly by shutting down their service.</p>

<p>For the actual security properties most people want from external key management – control over key material, revocation capability, multi-jurisdiction sovereignty, defense against silent key extraction – threshold cryptography provides stronger guarantees than a single HSM. A compromised HSM exposes all its keys. A compromised share exposes nothing.</p>

<h2 id="regulatory-landscape">Regulatory Landscape</h2>

<p>The threshold approach maps to several EU regulations, existing and upcoming.</p>

<h3 id="dora-digital-operational-resilience-act">DORA (Digital Operational Resilience Act)</h3>

<p>DORA Article 7 (RTS) requires key lifecycle management, controls against loss and unauthorized access proportional to risk, key replacement procedures, and certificate registers. The threshold approach addresses each:</p>

<ul>
  <li><strong>Loss protection</strong> – 2-of-3 means losing one share doesn’t lose the key. Offline backup share provides recovery.</li>
  <li><strong>Unauthorized access</strong> – no single party holds the full key. Compromising one provider reveals nothing mathematically useful.</li>
  <li><strong>Key replacement</strong> – re-run the ceremony, generate new shares, redeploy.</li>
  <li><strong>Third-party concentration risk</strong> (DORA Art. 28+) – shares on separate providers in separate jurisdictions directly addresses this. DORA specifically calls out concentration risk with single ICT providers.</li>
  <li><strong>Auditability</strong> – every operation logged independently on each provider.</li>
  <li><strong>Revocability</strong> – either party can shut down their service instantly.</li>
</ul>

<h3 id="nis2-directive">NIS2 Directive</h3>

<p>Article 21.2(h) requires encryption and key management proportional to risk for essential and important entities (energy, healthcare, transport, digital infrastructure). AES-256 meets the minimum. Splitting the key across jurisdictions is a stronger control than storing it in one place.</p>

<h3 id="gdpr">GDPR</h3>

<p>Article 32 requires “appropriate technical measures” including encryption. Article 34 says if data is encrypted and the key is not exposed, breach notification to individuals may not be required. With threshold splitting, the cloud provider never holds the key – they are a processor with no access to key material. If S3 storage is breached, the encrypted objects are useless without cooperation of two independent parties.</p>

<h3 id="eu-data-act">EU Data Act</h3>

<p>Article 28 (applicable since September 2025) requires cloud providers to take “all reasonable measures, including encryption” to prevent unlawful access to data, especially from third-country government requests. This is where threshold splitting has its strongest argument.</p>

<p>The US CLOUD Act allows US law enforcement to compel American companies to hand over data stored abroad. But the CLOUD Act is encryption-neutral – it does not compel providers to decrypt data they cannot decrypt. A subpoena to AWS yields only ciphertext: AWS holds no shares and no key material.</p>

<h3 id="eucs-european-cybersecurity-certification-scheme-for-cloud-services">EUCS (European Cybersecurity Certification Scheme for Cloud Services)</h3>

<p>Still being finalized by ENISA. Earlier drafts included a “high+” level requiring encryption keys to be held outside the cloud provider’s control and within EU jurisdiction. Even though sovereignty requirements were dropped from the latest draft, individual member states can still impose them. The threshold approach is ready for the strictest interpretation: keys are split across European providers, none held by a US-subject entity.</p>

<hr />]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[In a previous post, I backed AWS KMS with a SoftHSM on my local machine via an SSH tunnel, based on the AWS XKS proxy sample. It proved the concept: the encryption key lives on hardware I control, and AWS KMS calls my proxy for every cryptographic operation.]]></summary></entry><entry><title type="html">AWS KMS External Key Store: Keep Your Encryption Keys Out of the Cloud</title><link href="https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html" rel="alternate" type="text/html" title="AWS KMS External Key Store: Keep Your Encryption Keys Out of the Cloud" /><published>2026-02-11T00:00:00+00:00</published><updated>2026-02-11T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/11/AWS_KMS_XKS_SoftHSM.html"><![CDATA[<p>A proof of concept demonstrating AWS KMS encryption backed by a local SoftHSM key, using the <a href="https://docs.aws.amazon.com/kms/latest/developerguide/keystore-external.html">XKS Proxy API</a>. The cryptographic master key lives on a machine I fully own, not in AWS, not in a managed service, but on my personal hardware.</p>

<h2 id="motivation">Motivation</h2>

<p>I wanted to back AWS KMS with a hardware security module under my physical control. SoftHSM on a local machine serves as a stand-in for a real HSM for this POC. The EC2 instance and SSH reverse tunnel exist only because AWS KMS requires an HTTPS endpoint with a valid TLS certificate to reach the XKS proxy. The actual key material never leaves the local machine.</p>

<hr />

<h2 id="architecture">Architecture</h2>

<h3 id="the-intended-model">The intended model</h3>

<p>The idea behind XKS is straightforward: you run an HSM on your premises, front it with a service layer that implements the <a href="https://github.com/aws-samples/aws-kms-xks-proxy">XKS Proxy API</a>, and AWS KMS calls that proxy whenever it needs to encrypt or decrypt. The proxy talks to any Hardware Security Module that supports PKCS#11 v2.40 – Thales Luna, Entrust nShield, or anything else that speaks the standard. AWS publishes the API spec and a reference implementation in Rust, so you can build and run your own.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S3 (SSE-KMS) → KMS → XKS Proxy → HSM (PKCS#11 v2.40)
</code></pre></div></div>

<p>In production, the proxy sits next to the HSM on the same network. Simple.</p>

<h3 id="my-workaround">My workaround</h3>

<p>I don’t have an HSM or a server with a public IP and a TLS certificate. So I stitched together a workaround to validate the concept: SoftHSM on my Mac as the HSM stand-in, an EC2 instance running the XKS proxy behind an ALB for TLS termination, and p11-kit remoting the PKCS#11 calls back to my machine over an SSH reverse tunnel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S3 (SSE-KMS) → KMS → XKS Proxy (EC2) → p11-kit client
                                              ↓
                                     SSH reverse tunnel
                                              ↓
                                     p11-kit server (Mac) → SoftHSM
</code></pre></div></div>

<p>Not what you’d run in production. But it proves the point: when S3 encrypts an object, the actual cryptographic operation happens on my machine, with a key that never leaves it.</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Location</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>xks-proxy</td>
      <td>EC2 (aarch64, t4g.nano)</td>
      <td><a href="https://github.com/aws-samples/aws-kms-xks-proxy">AWS reference XKS proxy</a> (Rust)</td>
    </tr>
    <tr>
      <td>ALB</td>
      <td>AWS</td>
      <td>TLS termination with valid certificate</td>
    </tr>
    <tr>
      <td>p11-kit 0.26.2</td>
      <td>EC2 + Mac</td>
      <td>PKCS#11 remoting over Unix sockets</td>
    </tr>
    <tr>
      <td>SoftHSM v2</td>
      <td>Mac (Homebrew)</td>
      <td>Software HSM holding the AES-256 key</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="setup">Setup</h2>

<h3 id="create-the-softhsm-aes-256-key">Create the SoftHSM AES-256 key</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>softhsm2-util <span class="nt">--init-token</span> <span class="nt">--slot</span> 0 <span class="nt">--label</span> foo <span class="nt">--pin</span> 1234 <span class="nt">--so-pin</span> 0000
softhsm2-util <span class="nt">--show-slots</span>

pkcs11-tool <span class="nt">--module</span> /opt/homebrew/lib/softhsm/libsofthsm2.so <span class="se">\</span>
  <span class="nt">--login</span> <span class="nt">--pin</span> 1234 <span class="se">\</span>
  <span class="nt">--keygen</span> <span class="nt">--key-type</span> AES:32 <span class="nt">--label</span> foo <span class="se">\</span>
  <span class="nt">--token-label</span> foo
</code></pre></div></div>

<p>Expected output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Secret Key Object; AES length 32
  label:      foo
  Usage:      encrypt, decrypt, sign, verify, wrap, unwrap
  Access:     never extractable, local
</code></pre></div></div>

<h3 id="deploy-the-infrastructure">Deploy the infrastructure</h3>

<p>A single <code class="language-plaintext highlighter-rouge">create.sh</code> script handles everything: cross-compiling xks-proxy for aarch64 via <code class="language-plaintext highlighter-rouge">cargo zigbuild</code>, deploying CloudFormation (ALB, EC2, Route53, CloudWatch), SCPing the binary and config to EC2, and starting the systemd service.</p>

<h3 id="the-p11-kit-version-issue">The p11-kit version issue</h3>

<p>This is where I spent most of my time. The AL2023 repo ships p11-kit <strong>0.24.1</strong>, which lacks AES-GCM RPC serialization entirely; every encrypt/decrypt call fails with <code class="language-plaintext highlighter-rouge">CKR_MECHANISM_INVALID</code>.</p>

<table>
  <thead>
    <tr>
      <th>Version</th>
      <th>AES-GCM RPC</th>
      <th>Object Handles</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0.24.1 (AL2023 repo)</td>
      <td>No</td>
      <td>N/A</td>
      <td><code class="language-plaintext highlighter-rouge">CKR_MECHANISM_INVALID</code></td>
    </tr>
    <tr>
      <td>0.25.x</td>
      <td>Yes</td>
      <td>Broken across sessions</td>
      <td><code class="language-plaintext highlighter-rouge">CKR_OBJECT_HANDLE_INVALID</code></td>
    </tr>
    <tr>
      <td><strong>0.26.2</strong></td>
      <td><strong>Yes</strong></td>
      <td><strong>Works</strong></td>
      <td><strong>Working</strong></td>
    </tr>
  </tbody>
</table>

<p>Version 0.26.2 must be built from source on EC2. And it must match on both ends – a version mismatch between client (EC2) and server (Mac) causes RPC protocol errors.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># On EC2 (t4g.nano, 512MB RAM)</span>
<span class="nb">sudo </span>dnf <span class="nb">install</span> <span class="nt">-y</span> meson ninja-build gcc libtasn1-devel libffi-devel
curl <span class="nt">-sL</span> https://github.com/p11-glue/p11-kit/releases/download/0.26.2/p11-kit-0.26.2.tar.xz | <span class="nb">tar </span>xJ
<span class="nb">cd </span>p11-kit-0.26.2
meson setup _build <span class="nt">--prefix</span><span class="o">=</span>/usr <span class="nt">--libdir</span><span class="o">=</span>/usr/lib64
ninja <span class="nt">-C</span> _build <span class="nt">-j1</span>    <span class="c"># -j1 required: t4g.nano OOMs on parallel builds</span>
<span class="nb">sudo </span>ninja <span class="nt">-C</span> _build <span class="nb">install</span>
</code></pre></div></div>

<h3 id="start-the-pkcs11-tunnel">Start the PKCS#11 tunnel</h3>

<p>Two terminals on the Mac:</p>

<p><strong>Terminal 1 – p11-kit server:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p11-kit server <span class="nt">--provider</span> /opt/homebrew/lib/softhsm/libsofthsm2.so <span class="s2">"pkcs11:"</span>
</code></pre></div></div>

<p><strong>Terminal 2 – SSH reverse tunnel:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">P11_KIT_SERVER_ADDRESS</span><span class="o">=</span>unix:path<span class="o">=</span>/var/folders/.../pkcs11-XXXX

ssh <span class="nt">-i</span> ~/Downloads/EC2Tutorial2.pem <span class="se">\</span>
  <span class="nt">-R</span> /home/ec2-user/.p11-kit.sock:<span class="k">${</span><span class="nv">P11_KIT_SERVER_ADDRESS</span><span class="p">#unix</span>:path<span class="p">=</span><span class="k">}</span> <span class="se">\</span>
  ec2-user@&lt;EC2_IP&gt;
</code></pre></div></div>

<p>This forwards the EC2 Unix socket to the Mac’s p11-kit server socket. From EC2’s perspective, PKCS#11 operations on <code class="language-plaintext highlighter-rouge">/home/ec2-user/.p11-kit.sock</code> transparently reach SoftHSM on the Mac.</p>

<h3 id="test-end-to-end">Test end-to-end</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Health check</span>
curl https://xks.lemaire.tel/ping

<span class="c"># Upload with XKS encryption</span>
<span class="nb">echo</span> <span class="s2">"hello from softhsm"</span> <span class="o">&gt;</span> /tmp/test.txt
aws s3 <span class="nb">cp</span> /tmp/test.txt s3://xks-proxy-poc-test/test.txt <span class="se">\</span>
  <span class="nt">--region</span> eu-west-3 <span class="se">\</span>
  <span class="nt">--sse</span> aws:kms <span class="se">\</span>
  <span class="nt">--sse-kms-key-id</span> cd0608a9-0726-4187-b1b1-d0b08370d8f9

<span class="c"># Download (decryption goes through SoftHSM)</span>
aws s3 <span class="nb">cp </span>s3://xks-proxy-poc-test/test.txt /tmp/downloaded.txt <span class="nt">--region</span> eu-west-3
<span class="nb">cat</span> /tmp/downloaded.txt
<span class="c"># hello from softhsm</span>
</code></pre></div></div>

<hr />

<h2 id="a-rust-undefined-behavior-bug-in-the-xks-proxy">A Rust Undefined Behavior Bug in the XKS Proxy</h2>

<p>The interesting problem I hit wasn’t infrastructure: it was an optimization-triggered bug in the AWS reference implementation.</p>

<p>The <code class="language-plaintext highlighter-rouge">GetKeyMetadata</code> handler declares stack variables as immutable (<code class="language-plaintext highlighter-rouge">let key_type = 0;</code>) and passes pointers to them via <code class="language-plaintext highlighter-rouge">set_ck_ulong()</code> for <code class="language-plaintext highlighter-rouge">C_GetAttributeValue</code> to write into. The C function writes the correct values (e.g., <code class="language-plaintext highlighter-rouge">key_type=31</code> for CKK_AES), but the Rust compiler’s release-mode optimizer treats the immutable bindings as compile-time constants and inlines <code class="language-plaintext highlighter-rouge">0</code> wherever they’re subsequently read.</p>

<p>Impact: <code class="language-plaintext highlighter-rouge">keyspec(0, 0)</code> = <code class="language-plaintext highlighter-rouge">"RSA_0"</code> instead of <code class="language-plaintext highlighter-rouge">keyspec(31, 32)</code> = <code class="language-plaintext highlighter-rouge">"AES_256"</code>. KMS rejected the key metadata.</p>

<p>This only manifests in release builds. Debug builds work fine because the optimizer doesn’t inline the constants.</p>
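<p>For intuition, here is a hedged Python reconstruction of the mapping (the <code>keyspec</code> helper and constant names are assumptions about the proxy’s internals; the PKCS#11 values CKK_RSA = 0 and CKK_AES = 0x1F are standard):</p>

```python
# Hypothetical sketch of the proxy's keyspec logic; names are illustrative.
CKK_RSA = 0x00  # PKCS#11 key type for RSA
CKK_AES = 0x1F  # PKCS#11 key type for AES (decimal 31)

def keyspec(key_type: int, key_size: int) -> str:
    """Map a PKCS#11 key type plus size (bytes, for AES) to a KMS-style key spec."""
    if key_type == CKK_AES:
        return f"AES_{key_size * 8}"
    if key_type == CKK_RSA:
        return f"RSA_{key_size * 8}"
    return f"UNKNOWN_{key_type}"

print(keyspec(31, 32))  # AES_256: the metadata KMS expects
print(keyspec(0, 0))    # RSA_0: what the mis-optimized release build produced
```

<p>With the bug, both values read back as <code>0</code>, so KMS saw <code>RSA_0</code> and rejected the key.</p>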

<p><strong>To fix:</strong> Force the compiler to re-read from actual memory after the C call:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">key_type</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">(</span><span class="o">&amp;</span><span class="n">key_type</span><span class="p">)</span> <span class="p">};</span>
<span class="k">let</span> <span class="n">key_size</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">(</span><span class="o">&amp;</span><span class="n">key_size</span><span class="p">)</span> <span class="p">};</span>
</code></pre></div></div>

<p>The root cause is Rust undefined behavior: writing through a raw pointer derived from an immutable reference. The proper long-term fix would be <code class="language-plaintext highlighter-rouge">UnsafeCell</code> or <code class="language-plaintext highlighter-rouge">MaybeUninit</code> in the <code class="language-plaintext highlighter-rouge">rust-pkcs11</code> crate’s <code class="language-plaintext highlighter-rouge">CK_ATTRIBUTE::set_ck_ulong</code> implementation.</p>

<hr />

<h2 id="what-xks-does-and-does-not-protect">What XKS Does and Does Not Protect</h2>

<h3 id="xks-protects-against">XKS protects against</h3>

<ul>
  <li><strong>Future unauthorized access by AWS.</strong> Disconnect the tunnel, shut down the proxy, and AWS cannot decrypt data going forward.</li>
  <li><strong>Regulatory and sovereignty requirements.</strong> Cryptographic master keys remain under your control, with an audit trail of all key operations.</li>
  <li><strong>Cloud provider key management concerns.</strong> The master key material stays entirely outside AWS.</li>
</ul>

<h3 id="xks-does-not-protect-against">XKS does not protect against</h3>

<ul>
  <li><strong>AWS actively retaining or exfiltrating data.</strong> They could copy plaintext before encryption or retain data encryption keys.</li>
  <li><strong>Legal compulsion.</strong> If DEKs were retained in AWS systems, AWS could be compelled to produce them.</li>
  <li><strong>Retrospective decryption.</strong> If AWS legitimately decrypted an object to serve a GET request, they could have retained the plaintext.</li>
</ul>

<h3 id="bottom-line">Bottom line</h3>

<p>External key management is about <strong>governance</strong>, <strong>operational control</strong>, and <strong>trust reduction</strong>. It is not about protecting against a malicious provider – they see plaintext anyway. If the cloud provider itself is your threat model, encrypt client-side before uploading, or don’t use the cloud.</p>

<hr />

<h2 id="cloud-provider-comparison">Cloud Provider Comparison</h2>

<h3 id="aws--external-key-store-xks">AWS – External Key Store (XKS)</h3>

<p><strong>DIY implementation: YES.</strong> AWS publishes the <a href="https://github.com/aws/aws-kms-xksproxy-api-spec">XKS Proxy API specification</a> and a reference implementation. You can build your own proxy backed by any PKCS#11-compatible HSM or key manager.</p>

<h3 id="gcp--cloud-external-key-manager-ekm">GCP – Cloud External Key Manager (EKM)</h3>

<p><strong>DIY implementation: NO.</strong> No public API specification, no reference implementation. You must use certified partner solutions (Thales CipherTrust, Fortanix DSM).</p>

<h3 id="azure--no-external-key-store-equivalent">Azure – No external key store equivalent</h3>

<p><strong>DIY implementation: N/A.</strong> Azure offers BYOK (keys end up in Azure), Managed HSM (single-tenant, still in Azure), and Dedicated HSM (Thales Luna in Azure). No equivalent to XKS or EKM where keys remain outside the cloud.</p>

<hr />

<h2 id="source-code">Source Code</h2>

<p>Available on GitHub: <a href="https://github.com/jlemaire/aws-kms-xks-poc">aws-kms-xks-poc</a></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws-kms-xks-poc/
  cloudformation.yaml          # EC2 + ALB + Route53 + CloudWatch
  configuration/settings.toml  # xks-proxy config
  create.sh                    # One-shot deploy script
</code></pre></div></div>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[A proof of concept demonstrating AWS KMS encryption backed by a local SoftHSM key, using the XKS Proxy API. The cryptographic master key lives on a machine I fully own, not in AWS, not in a managed service, but on my personal hardware.]]></summary></entry><entry><title type="html">Virtual Me 2.0: 21x Faster Cold Starts with Swift and S3 Vectors</title><link href="https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors.html" rel="alternate" type="text/html" title="Virtual Me 2.0: 21x Faster Cold Starts with Swift and S3 Vectors" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/05/Virtual_Me_2_Swift_S3_Vectors.html"><![CDATA[<h2 id="the-evolution">The Evolution</h2>

<p>A couple of weeks ago, I built <a href="https://chat.lemaire.tel">Virtual Me</a>, a RAG-powered chatbot that answers questions about my professional experience. It worked, but I wasn’t fully satisfied.</p>

<p><strong>The v1.0 stack:</strong></p>
<ul>
  <li>Python Lambda with custom DynamoDB vector store</li>
  <li>LangGraph orchestration</li>
  <li>Manual embedding generation and chunking</li>
  <li>4-second cold starts</li>
  <li>1,488 lines of application code</li>
</ul>

<p>It was over-engineered to avoid using ElasticSearch.</p>

<p>So I rebuilt it from scratch. <strong>Virtual Me 2.0</strong> is now simpler, faster, and cheaper.</p>

<hr />

<h2 id="what-changed">What Changed</h2>

<h3 id="architecture-simplification">Architecture Simplification</h3>

<p><strong>Before (v1.0):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lambda (Python) → Custom Vector Search (DynamoDB) → Bedrock
</code></pre></div></div>

<p><strong>After (v2.0):</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Lambda (Swift) → Bedrock Knowledge Base → S3 Vectors → Bedrock
</code></pre></div></div>

<ul>
  <li><strong>S3 Vectors</strong> replaced my custom hacked DynamoDB similarity search</li>
  <li><strong>Bedrock Knowledge Base</strong> handles document ingestion, chunking, and embedding automatically</li>
  <li><strong>Swift runtime</strong> replaced Python for faster cold starts</li>
</ul>

<h3 id="the-numbers">The Numbers</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>v1.0</th>
      <th>v2.0</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Cold Start</strong></td>
      <td>~4000ms</td>
      <td>~190ms</td>
      <td><strong>21x faster</strong></td>
    </tr>
    <tr>
      <td><strong>Application Code</strong></td>
      <td>1,488 lines</td>
      <td>205 lines</td>
      <td><strong>86% reduction</strong></td>
    </tr>
    <tr>
      <td><strong>Memory</strong></td>
      <td>512MB</td>
      <td>256MB</td>
      <td><strong>50% reduction</strong></td>
    </tr>
    <tr>
      <td><strong>Monthly Cost</strong></td>
      <td>~$3-5</td>
      <td>~$2-3</td>
      <td><strong>40% cheaper</strong></td>
    </tr>
  </tbody>
</table>

<p>The cold start improvement is measured via CloudWatch Logs Insights:</p>
<ul>
  <li>P50: 177ms</li>
  <li>P99: 214ms</li>
  <li>Range: 173-214ms</li>
</ul>

<hr />

<h2 id="why-swift">Why Swift?</h2>

<p>I first explored server-side Swift in a previous job, experimenting with Vapor and Kitura while working on a mobile app. I’m really happy to see the language has reached the cloud.</p>

<p>I chose Swift for Lambda because of its compiled nature and minimal runtime overhead. Running memory-efficient code cuts costs, and a strongly typed language is essential for catching errors now that so much of our code is AI-generated.</p>

<p>Swift uses <a href="https://docs.swift.org/swift-book/documentation/the-swift-programming-language/automaticreferencecounting/">Automatic Reference Counting (ARC)</a> for memory management, so there are none of the garbage-collection pauses of JVM languages (Java, Kotlin): memory is deallocated deterministically when the reference count drops to zero.</p>
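<p>CPython happens to use reference counting too (plus a cycle collector), so the deterministic-deallocation behavior is easy to demonstrate even from Python:</p>

```python
# Deterministic deallocation under reference counting (CPython semantics).
log = []

class Resource:
    def __del__(self):
        # Runs the moment the last reference disappears, not at a later GC pause.
        log.append("freed")

r = Resource()
log.append("before del")
del r  # refcount drops to zero; __del__ fires immediately
log.append("after del")
print(log)  # ['before del', 'freed', 'after del']
```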

<p><strong>Cold start breakdown:</strong></p>
<ol>
  <li><strong>Init Duration</strong>: Time to initialize the Lambda execution environment</li>
  <li><strong>Duration</strong>: Actual function execution time</li>
</ol>

<p>Python’s interpreted nature means importing libraries (boto3, langchain, etc.) adds significant overhead to every cold start. Swift compiles to a native binary with all dependencies linked.</p>
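<p>The import cost is easy to observe. This sketch re-imports stdlib modules after evicting them from <code>sys.modules</code> (a re-import is cheaper than a true cold import, since submodules stay cached, but the shape of the cost is visible):</p>

```python
import importlib
import sys
import time

def timed_import(module_name: str) -> float:
    """Return seconds spent importing module_name after evicting its cache entry."""
    sys.modules.pop(module_name, None)  # force the import machinery to run again
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

for mod in ("json", "decimal", "email"):
    print(f"{mod}: {timed_import(mod) * 1000:.3f} ms")
```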

<p><strong>The result exceeded my expectations:</strong> 190ms cold starts vs 4000ms in Python.</p>

<h3 id="swift-lambda-code">Swift Lambda Code</h3>

<p>Here’s the complete Lambda handler (simplified):</p>

<div class="language-swift highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">import</span> <span class="kt">AWSLambdaRuntime</span>
<span class="kd">import</span> <span class="kt">AWSLambdaEvents</span>
<span class="kd">import</span> <span class="kt">SotoBedrockAgentRuntime</span>

<span class="kd">@main</span>
<span class="kd">struct</span> <span class="kt">VirtualMeLambda</span> <span class="p">{</span>
    <span class="kd">static</span> <span class="kd">func</span> <span class="nf">main</span><span class="p">()</span> <span class="k">async</span> <span class="k">throws</span> <span class="p">{</span>
        <span class="k">let</span> <span class="nv">runtime</span> <span class="o">=</span> <span class="kt">LambdaRuntime</span> <span class="p">{</span> 
            <span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="kt">APIGatewayV2Request</span><span class="p">,</span> <span class="nv">context</span><span class="p">:</span> <span class="kt">LambdaContext</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="kt">APIGatewayV2Response</span> <span class="k">in</span>
            <span class="k">try</span> <span class="k">await</span> <span class="nf">handleRequest</span><span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="n">event</span><span class="p">)</span>
        <span class="p">}</span>
        <span class="k">try</span> <span class="k">await</span> <span class="n">runtime</span><span class="o">.</span><span class="nf">run</span><span class="p">()</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kd">func</span> <span class="nf">handleRequest</span><span class="p">(</span><span class="nv">event</span><span class="p">:</span> <span class="kt">APIGatewayV2Request</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="kt">APIGatewayV2Response</span> <span class="p">{</span>
    <span class="k">guard</span> <span class="k">let</span> <span class="nv">body</span> <span class="o">=</span> <span class="n">event</span><span class="o">.</span><span class="n">body</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="nf">errorResponse</span><span class="p">(</span><span class="mi">400</span><span class="p">,</span> <span class="s">"Missing request body"</span><span class="p">)</span>
    <span class="p">}</span>
    
    <span class="k">let</span> <span class="nv">request</span> <span class="o">=</span> <span class="k">try</span> <span class="kt">JSONDecoder</span><span class="p">()</span><span class="o">.</span><span class="nf">decode</span><span class="p">(</span><span class="kt">ChatRequest</span><span class="o">.</span><span class="k">self</span><span class="p">,</span> <span class="nv">from</span><span class="p">:</span> <span class="kt">Data</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">utf8</span><span class="p">))</span>
    <span class="k">let</span> <span class="nv">question</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="n">messages</span><span class="o">.</span><span class="n">last</span><span class="p">?</span><span class="o">.</span><span class="n">text</span> <span class="p">??</span> <span class="s">""</span>
    
    <span class="c1">// Call Bedrock Knowledge Base</span>
    <span class="k">let</span> <span class="nv">answer</span> <span class="o">=</span> <span class="k">try</span> <span class="k">await</span> <span class="nf">retrieveAndGenerate</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
    
    <span class="k">let</span> <span class="nv">responseBody</span> <span class="o">=</span> <span class="k">try</span> <span class="kt">JSONEncoder</span><span class="p">()</span><span class="o">.</span><span class="nf">encode</span><span class="p">(</span><span class="kt">ChatResponse</span><span class="p">(</span><span class="nv">text</span><span class="p">:</span> <span class="n">answer</span><span class="p">))</span>
    <span class="k">return</span> <span class="kt">APIGatewayV2Response</span><span class="p">(</span>
        <span class="nv">statusCode</span><span class="p">:</span> <span class="o">.</span><span class="n">ok</span><span class="p">,</span>
        <span class="nv">headers</span><span class="p">:</span> <span class="p">[</span><span class="s">"Content-Type"</span><span class="p">:</span> <span class="s">"application/json"</span><span class="p">],</span>
        <span class="nv">body</span><span class="p">:</span> <span class="kt">String</span><span class="p">(</span><span class="nv">data</span><span class="p">:</span> <span class="n">responseBody</span><span class="p">,</span> <span class="nv">encoding</span><span class="p">:</span> <span class="o">.</span><span class="n">utf8</span><span class="p">)</span>
    <span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>No custom vector search. No manual embedding generation. Just call Bedrock Knowledge Base and return the result.</p>

<hr />

<h2 id="s3-vectors-native-vector-storage">S3 Vectors: Native Vector Storage</h2>

<p>S3 Vectors is AWS’s managed vector database. It integrates directly with Bedrock Knowledge Base.</p>

<p><strong>What it handles:</strong></p>
<ul>
  <li>Vector indexing (automatic)</li>
  <li>Similarity search (sub-100ms)</li>
  <li>Scaling (automatic)</li>
  <li>Cost (pay per query, not per GB stored)</li>
</ul>

<p><strong>What I don’t have to build:</strong></p>
<ul>
  <li>Cosine similarity calculations</li>
  <li>Vector normalization</li>
  <li>Index management</li>
  <li>Query optimization</li>
</ul>

<p>My v1.0 DynamoDB implementation had ~400 lines of code for vector search. S3 Vectors replaced all of it. Not free like DynamoDB, but affordable for my budget.</p>
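<p>Condensed, the scan-and-score pattern those ~400 lines implemented looks like this (an in-memory list stands in for the DynamoDB scan; all names are illustrative):</p>

```python
# Client-side vector search: scan everything, score everything, keep the top k.
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(items, query_vector, k=2):
    # v1.0 did this over a full DynamoDB scan; S3 Vectors does it server-side
    scored = ((cosine_similarity(query_vector, item["embedding"]), item["id"])
              for item in items)
    return heapq.nlargest(k, scored)

items = [
    {"id": "doc-a", "embedding": [1.0, 0.0]},
    {"id": "doc-b", "embedding": [3.0, 4.0]},
    {"id": "doc-c", "embedding": [0.0, 1.0]},
]
print(top_k(items, [1.0, 0.0]))  # [(1.0, 'doc-a'), (0.6, 'doc-b')]
```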

<hr />

<h2 id="infrastructure-aws-sam">Infrastructure: AWS SAM</h2>

<p>I migrated from Terraform to AWS SAM (Serverless Application Model) for better Lambda development workflow.</p>

<p><strong>Why SAM:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">sam build</code> - Automatic dependency packaging</li>
  <li>Built-in best practices (IAM, X-Ray, CORS)</li>
  <li>Faster iteration cycle</li>
</ul>

<p>Nested stacks keep the templates modular. Each stack is ~150 lines and specialized.</p>

<h3 id="local-testing-with-swift-lambda-runtime">Local Testing with Swift Lambda Runtime</h3>

<p>The Swift AWS Lambda Runtime automatically starts a local HTTP server when not running in a Lambda execution environment. This makes testing incredibly simple:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start local server on http://127.0.0.1:7000/invoke</span>
<span class="nb">cd </span>sam
make run-local
</code></pre></div></div>

<p>Under the hood, this runs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">KNOWLEDGE_BASE_ID</span><span class="o">=</span>RPXCA7UUQN <span class="nv">LLM_MODEL</span><span class="o">=</span>nova-2-lite <span class="nv">LLM_TEMPERATURE</span><span class="o">=</span>0.1 swift run
</code></pre></div></div>

<p><strong>Test with curl:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST http://127.0.0.1:7000/invoke <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "version": "2.0",
    "routeKey": "POST /chat",
    "body": "{\"messages\":[{\"role\":\"user\",\"text\":\"What is your experience?\"}]}"
  }'</span>
</code></pre></div></div>

<p>This is much simpler than <code class="language-plaintext highlighter-rouge">sam local start-api</code>, which requires Docker and emulates the entire API Gateway + Lambda stack. The Swift runtime’s built-in local server connects directly to real AWS services (Bedrock, S3 Vectors) for authentic integration testing.</p>

<hr />

<h2 id="measuring-cold-starts">Measuring Cold Starts</h2>

<p>CloudWatch automatically captures Lambda cold start metrics. I added a Makefile command to query them:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make cold-start-metrics
</code></pre></div></div>

<p>This runs a CloudWatch Logs Insights query:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fields @initDuration, @duration, @memorySize
| filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart,
        max(@initDuration) as maxColdStart,
        pct(@initDuration, 50) as p50ColdStart,
        pct(@initDuration, 99) as p99ColdStart,
        count() as totalColdStarts
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">@initDuration</code> field only appears on cold starts, making it easy to track.</p>
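<p>The same aggregation the Logs Insights query performs can be sketched locally (the REPORT lines below are illustrative samples, not real measurements):</p>

```python
# Parse Lambda REPORT log lines and aggregate cold-start init durations.
import re
import statistics

logs = [
    "REPORT RequestId: a1 Duration: 35.2 ms Init Duration: 190.1 ms",
    "REPORT RequestId: a2 Duration: 33.9 ms",  # warm start: no Init Duration
    "REPORT RequestId: a3 Duration: 36.4 ms Init Duration: 177.0 ms",
    "REPORT RequestId: a4 Duration: 34.7 ms Init Duration: 214.3 ms",
]

init_durations = [
    float(m.group(1))
    for line in logs
    if (m := re.search(r"Init Duration: ([\d.]+) ms", line))
]

p50 = statistics.median(init_durations)
print(f"cold starts: {len(init_durations)}, p50: {p50} ms, max: {max(init_durations)} ms")
```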

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<h3 id="1-s3-vectors">1. S3 Vectors</h3>

<p>My v1.0 custom DynamoDB vector store wasn’t over-engineering; it was a necessary compromise. ElasticSearch was too expensive for a personal project. DynamoDB’s free tier (25GB storage, 25 RCU/WCU) made it the only viable option for vector storage.</p>

<p>The complexity was the price of staying affordable.</p>

<p>When S3 Vectors reached general availability on December 2, 2025, vector search became accessible for personal projects. The managed service eliminated 86% of my code while providing capabilities (sub-100ms similarity search, automatic indexing) that my DynamoDB implementation couldn’t match.</p>

<p><strong>The lesson:</strong> Sometimes complexity is justified by constraints. When those constraints change (new services, pricing models), it’s worth revisiting your architecture.</p>

<h3 id="2-measure">2. Measure</h3>

<p>I assumed Python would be “fast enough” for cold starts. Measuring with CloudWatch proved otherwise. Swift’s 21x improvement was worth the migration effort.</p>

<p><strong>Rule:</strong> Measure before optimizing, but also measure to validate assumptions.</p>

<h3 id="3-compiled--interpreted-for-lambda">3. Compiled &gt; Interpreted for Lambda</h3>

<p>Python’s flexibility comes at a cost: import overhead. Swift’s compiled binary starts instantly.</p>

<p>For Lambda, prefer compiled languages over interpreted ones (Python, Node.js) when cold start matters.</p>

<h3 id="4-infrastructure-as-code">4. Infrastructure as Code</h3>

<p>Terraform worked, but SAM’s Lambda-specific features (automatic packaging) made development faster.</p>

<p><strong>Rule:</strong> Choose IaC tools that match your workload. SAM for serverless, Terraform for multi-cloud.</p>

<hr />

<h2 id="cost-breakdown">Cost Breakdown</h2>

<p><strong>Monthly cost for ~100 conversations:</strong></p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>v1.0</th>
      <th>v2.0</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Lambda</td>
      <td>$0.50</td>
      <td>$0.25</td>
    </tr>
    <tr>
      <td>DynamoDB</td>
      <td>$1.50</td>
      <td>$0</td>
    </tr>
    <tr>
      <td>S3 Vectors</td>
      <td>$0</td>
      <td>$0.75</td>
    </tr>
    <tr>
      <td>Bedrock</td>
      <td>$1.50</td>
      <td>$1.00</td>
    </tr>
    <tr>
      <td>Other (S3, CloudFront, Route53)</td>
      <td>$1.00</td>
      <td>$1.00</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$4.50</strong></td>
      <td><strong>~$3.00</strong></td>
    </tr>
  </tbody>
</table>

<p>The cost reduction comes from:</p>
<ul>
  <li>50% less Lambda memory (512MB → 256MB)</li>
  <li>No DynamoDB scan operations</li>
  <li>Bedrock Knowledge Base efficiency (fewer API calls)</li>
</ul>

<hr />

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The complete source code is on GitHub: <a href="https://github.com/jeremylem/virtualme2">github.com/jeremylem/virtualme2</a></p>

<p><strong>Quick start:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/jeremylem/virtualme2
<span class="nb">cd </span>virtualme2/sam
sam build
sam deploy <span class="nt">--guided</span>
</code></pre></div></div>

<p><strong>Local testing:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make run-local          <span class="c"># Start Swift Lambda locally</span>
make test-local         <span class="c"># Send test request</span>
make cold-start-metrics <span class="c"># View CloudWatch metrics</span>
</code></pre></div></div>

<p>Try the live version: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<hr />

<p><strong>Tech Stack:</strong> Swift 6.0 · AWS Lambda · Amazon Bedrock · S3 Vectors · AWS SAM · CloudFormation</p>

<p><strong>Performance:</strong> 190ms cold starts · 256MB memory · $3/month</p>

<p><strong>Code:</strong> 86% reduction · 62% total codebase reduction</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The Evolution]]></summary></entry><entry><title type="html">Multi-Agent RAG with S3 Vectors and Bedrock AgentCore</title><link href="https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors.html" rel="alternate" type="text/html" title="Multi-Agent RAG with S3 Vectors and Bedrock AgentCore" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/02/01/Multi_Agent_RAG_S3_Vectors.html"><![CDATA[<p>A RAG chatbot for personal notes using S3 Vectors and Bedrock Agents. Features true multi-agent collaboration with critique-driven feedback loop and multi-turn conversations. Built as a learning exercise after re:Invent 2025.</p>

<h2 id="why-i-built-this">Why I Built This</h2>

<p>I already have <a href="https://github.com/ox00004a/virtualme">virtualme</a> running in production. It’s a RAG chatbot using DynamoDB for vector storage. It works, stays in the free tier, does the job.</p>

<p>But re:Invent 2025 announced two things that caught my attention:</p>

<ul>
  <li>
    <p><strong>Amazon S3 Vectors</strong> (December 2025). Native vector storage with server-side similarity search. Supports cosine, euclidean, and dot product metrics. Up to 10,000 dimensions per vector. Metadata filtering built-in. No more client-side cosine calculations or DynamoDB scan-and-compute patterns.</p>
  </li>
  <li>
    <p><strong>Amazon Bedrock AgentCore</strong> (December 2025). Managed agent infrastructure with tool orchestration. Agents get knowledge base access, action groups, and session memory. Multi-agent setups possible: agents can invoke other agents, share context, or work in parallel. Built-in trace for debugging retrieval and reasoning steps.</p>
  </li>
</ul>
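<p>The three distance metrics S3 Vectors supports can be computed by hand for intuition (pure-Python sketch; S3 Vectors does this server-side):</p>

```python
# Cosine distance, Euclidean distance, and dot product on toy 2-D vectors.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [1.0, 1.0]
print(round(cosine_distance(a, b), 4))  # 0.2929
print(euclidean(a, b))                  # 1.0
print(dot(a, b))                        # 1.0
```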

<p>I wanted to understand how these compare to my DynamoDB approach.</p>

<hr />

<h2 id="multi-agent-architecture">Multi-Agent Architecture</h2>

<p>True multi-agent collaboration with specialized roles:</p>

<ul>
  <li><strong>Research Agent</strong>: Searches knowledge base, extracts facts</li>
  <li><strong>Critique Agent</strong>: Evaluates quality, provides feedback (1-10 scoring)</li>
  <li><strong>Formatter Agent</strong>: Creates natural user responses</li>
</ul>

<h3 id="orchestration-pattern">Orchestration Pattern</h3>

<p>Sequential agent handoff with feedback loop:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query → Research Agent → Critique Agent → Score Check
                                      ↓
                              Score ≥ 7? → Formatter Agent → Response
                                      ↓
                              Score &lt; 7? → Feedback to Research (max 3x)
</code></pre></div></div>
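<p>Stripped of the Bedrock calls, the feedback loop above can be sketched with stub agents (all names are illustrative; the real agents are Bedrock invocations):</p>

```python
# Critique-driven orchestration loop with stub agents.
MAX_ITERATIONS = 3
SCORE_THRESHOLD = 7

def research(query, feedback=None):
    # Stub: a real implementation would invoke the Research Agent
    return f"facts about {query}" + (f" (revised: {feedback})" if feedback else "")

def critique(draft):
    # Stub: returns (score, feedback); the real agent scores 1-10
    return (9, None) if "revised" in draft else (5, "add more detail")

def format_answer(draft):
    # Stub for the Formatter Agent
    return f"Answer: {draft}"

def orchestrate(query):
    feedback = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = research(query, feedback)
        score, feedback = critique(draft)
        if score >= SCORE_THRESHOLD:
            break  # good enough, hand off to the formatter
    return format_answer(draft), iteration, score

answer, iterations, score = orchestrate("What is DynamoDB?")
print(iterations, score)  # 2 9
```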

<h3 id="session-management">Session Management</h3>

<ul>
  <li>Shared session IDs across all agent calls</li>
  <li>Each query gets unique session ID</li>
  <li>All 3 agents share conversation memory</li>
  <li>Natural multi-turn conversations</li>
</ul>

<hr />

<h2 id="architecture">Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                        DEPLOYMENT
                        ==========

./deploy.sh
     |
     v
+------------------+
|  S3 Bucket       |  &lt;-- your .md/.txt notes go here
|  (notes/)        |
+--------+---------+
         |
         v
+------------------+
|  Bedrock         |  reads notes, chunks them
|  Knowledge Base  |  calls Titan Embeddings (1024 dims)
+--------+---------+
         |
         v
+------------------+
|  S3 Vectors      |  &lt;-- vectors stored here
|  Index           |      native similarity search
+------------------+
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                         QUERY FLOW
                         ==========

python client.py "What is DynamoDB?"
     |
     | HTTPS + SigV4 signing
     v
+------------------+
|  API Lambda      |
|  Function URL    |
+--------+---------+
         |
         v
+------------------+
|  Orchestrator    |  manages agent workflow
|  Lambda          |  extracts citations from trace
+--------+---------+
         |
         +---&gt; Research Agent ---&gt; Critique Agent
         |           ^                   |
         |           |     score &lt; 7     |
         |           +-------------------+
         |                   |
         |            score &gt;= 7
         |                   v
         +---&gt; Formatter Agent ---&gt; Response with Sources
</code></pre></div></div>

<h3 id="orchestrator-lambda">Orchestrator Lambda</h3>

<p>The orchestrator is a Python Lambda (<code class="language-plaintext highlighter-rouge">orchestrator.handler</code>) that coordinates the multi-agent workflow. It receives the query, manages the critique loop, and assembles the final response.</p>

<p>Key implementation details:</p>

<ul>
  <li>Calls <code class="language-plaintext highlighter-rouge">bedrock.invoke_agent()</code> with <code class="language-plaintext highlighter-rouge">enableTrace=True</code> to capture retrieval metadata</li>
  <li>Parses <code class="language-plaintext highlighter-rouge">knowledgeBaseLookupOutput.retrievedReferences</code> from trace to extract real S3 URIs</li>
  <li>Extracts filenames from URIs and appends them as sources (agents hallucinate filenames, so this is done server-side)</li>
  <li>Shares same <code class="language-plaintext highlighter-rouge">session_id</code> across all agent calls for conversation continuity</li>
  <li>Returns structured response with <code class="language-plaintext highlighter-rouge">iterations</code>, <code class="language-plaintext highlighter-rouge">final_score</code>, and <code class="language-plaintext highlighter-rouge">sources</code> for observability</li>
</ul>
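<p>The citation extraction can be sketched as follows (the sample trace dict mirrors the <code>knowledgeBaseLookupOutput</code> shape described above; the exact field nesting is an assumption):</p>

```python
# Extract source filenames from a retrieval trace instead of trusting the agent.
import os

trace_event = {
    "observation": {
        "knowledgeBaseLookupOutput": {
            "retrievedReferences": [
                {"location": {"s3Location": {"uri": "s3://notes-bucket/notes/dynamodb.md"}}},
                {"location": {"s3Location": {"uri": "s3://notes-bucket/notes/s3-vectors.md"}}},
            ]
        }
    }
}

def extract_sources(event):
    refs = (event.get("observation", {})
                 .get("knowledgeBaseLookupOutput", {})
                 .get("retrievedReferences", []))
    uris = [r["location"]["s3Location"]["uri"] for r in refs]
    # Filenames come from the real S3 URIs, not from the agent's own text,
    # because agents hallucinate filenames.
    return sorted({os.path.basename(u) for u in uris})

print(extract_sources(trace_event))  # ['dynamodb.md', 's3-vectors.md']
```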

<hr />

<h2 id="what-i-wanted-to-learn">What I Wanted to Learn</h2>

<h3 id="1-aws-sam">1. AWS SAM</h3>

<p>I’ve used Terraform before. Never SAM.</p>

<p>SAM handles Lambda packaging automatically. You point it at a directory, it zips and uploads:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">OrchestratorLambda</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::Serverless::Function</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">Runtime</span><span class="pi">:</span> <span class="s">python3.13</span>
    <span class="na">Handler</span><span class="pi">:</span> <span class="s">orchestrator.handler</span>
    <span class="na">CodeUri</span><span class="pi">:</span> <span class="s">../lambda/</span> <span class="c1"># SAM packages this</span>
</code></pre></div></div>

<p>Run <code class="language-plaintext highlighter-rouge">sam build</code>, it creates <code class="language-plaintext highlighter-rouge">.aws-sam/build/</code> with deployment artifacts. Run <code class="language-plaintext highlighter-rouge">sam deploy --resolve-s3</code>, it handles the S3 bucket for you.</p>

<p>Good: Less config than raw CloudFormation.
Bad: Another abstraction layer to debug when things break.</p>

<h3 id="2-s3-vectors-vs-dynamodb">2. S3 Vectors vs DynamoDB</h3>

<p><strong>virtualme approach:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Client-side similarity calculation
</span><span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">dynamodb</span><span class="p">.</span><span class="n">scan</span><span class="p">():</span>
    <span class="n">score</span> <span class="o">=</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">query_vector</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s">'embedding'</span><span class="p">])</span>
</code></pre></div></div>
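<p>Expanded into something self-contained, the client-side scoring looks roughly like this (an in-memory list stands in for the DynamoDB scan results):</p>

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, items, k=3):
    # items: list of {"text": ..., "embedding": [...]} as stored in DynamoDB
    scored = [(cosine_similarity(query_vector, it["embedding"]), it) for it in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:k]]
```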

<p><strong>S3 Vectors approach:</strong></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">VectorIndex</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::S3Vectors::Index</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">Dimension</span><span class="pi">:</span> <span class="m">1024</span>
    <span class="na">DistanceMetric</span><span class="pi">:</span> <span class="s">cosine</span> <span class="c1"># server-side!</span>
</code></pre></div></div>

<p>No Python similarity code. Bedrock handles the vector search natively.</p>

<p>Trade-off: S3 Vectors has a 2048-byte limit on filterable metadata per record (details below).</p>

<h3 id="3-multi-agent-patterns">3. Multi-Agent Patterns</h3>

<p>I learned the difference between:</p>

<ul>
  <li><strong>Bedrock Flow</strong>: Service orchestration, explicit control, no session memory between queries</li>
  <li><strong>Bedrock Agent</strong>: Autonomous decision-making within a role, built-in session management</li>
  <li><strong>AgentCore</strong>: Multi-agent collaboration, agents can delegate to each other</li>
</ul>

<p>For this project, I chose individual agents coordinated by an orchestrator Lambda. This gives full control over the workflow while leveraging agent autonomy within each role.</p>

<hr />

<h2 id="key-technical-choices">Key Technical Choices</h2>

<ol>
  <li><strong>Agent Autonomy</strong>: Each agent makes decisions within its role (search strategy, evaluation criteria, formatting style)</li>
  <li><strong>Shared Sessions</strong>: Context preservation across agent calls via session ID</li>
  <li><strong>Feedback Loops</strong>: Critique-driven improvement (max 3 iterations to limit cost)</li>
  <li><strong>Source Extraction</strong>: Orchestrator extracts real filenames from Bedrock trace (agents were hallucinating sources)</li>
  <li><strong>Modular Architecture</strong>: Infrastructure, agents, and API in separate nested stacks</li>
</ol>

<hr />

<h2 id="things-that-broke">Things That Broke</h2>

<h3 id="the-2048-byte-metadata-limit">The 2048-byte Metadata Limit</h3>

<p>First deployment. Ingestion job fails with <code class="language-plaintext highlighter-rouge">Filterable metadata must have at most 2048 bytes</code>.</p>

<p>Bedrock stores chunk text in filterable metadata by default. Even small paragraphs exceed 2048 bytes. I tried shorter filenames, smaller chunks, different chunking strategies. Nothing worked because the limit is per-record, not total.</p>

<p>The fix is to configure the index to treat Bedrock metadata as non-filterable:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">VectorIndex</span><span class="pi">:</span>
  <span class="na">Type</span><span class="pi">:</span> <span class="s">AWS::S3Vectors::Index</span>
  <span class="na">Properties</span><span class="pi">:</span>
    <span class="na">MetadataConfiguration</span><span class="pi">:</span>
      <span class="na">NonFilterableMetadataKeys</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">AMAZON_BEDROCK_TEXT</span>
        <span class="pi">-</span> <span class="s">AMAZON_BEDROCK_METADATA</span>
</code></pre></div></div>

<p>Catch: This must be set at index creation. I had to destroy the stack and redeploy.</p>

<h3 id="agent-source-hallucination">Agent Source Hallucination</h3>

<p>Agents consistently invented plausible-sounding filenames instead of citing actual sources. Asked about SOLID principles, got “From Design Patterns Basics.md” when the real file was “SOLID &amp; Design pattern.md”.</p>

<p>Tried multiple approaches:</p>

<ul>
  <li>Explicit instructions to copy exact filenames</li>
  <li>Critique agent penalizing invented sources</li>
  <li>Different prompt formats</li>
</ul>

<p>None worked reliably. Nova Lite models don’t seem to have clear visibility into the S3 URIs from retrieval results.</p>

<p>The fix: extract citations in the orchestrator. Bedrock’s <code class="language-plaintext highlighter-rouge">invoke_agent</code> response includes trace data with <code class="language-plaintext highlighter-rouge">knowledgeBaseLookupOutput.retrievedReferences</code>. Each reference has <code class="language-plaintext highlighter-rouge">location.s3Location.uri</code>. Parse the filename, append it to the response. Real sources, no hallucination.</p>
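<p>A sketch of that extraction logic. The key path below reflects the trace structure I observed; treat it as an assumption to verify against your own <code class="language-plaintext highlighter-rouge">invoke_agent</code> trace events:</p>

```python
import os

def extract_sources(trace_events):
    # Each event from invoke_agent() may carry a "trace" payload. Retrieval
    # results live under knowledgeBaseLookupOutput.retrievedReferences, and
    # each reference points at the chunk's real S3 object.
    sources = set()
    for event in trace_events:
        obs = (event.get("trace", {})
                    .get("trace", {})
                    .get("orchestrationTrace", {})
                    .get("observation", {}))
        refs = obs.get("knowledgeBaseLookupOutput", {}).get("retrievedReferences", [])
        for ref in refs:
            uri = ref.get("location", {}).get("s3Location", {}).get("uri", "")
            if uri:
                sources.add(os.path.basename(uri))  # keep just the filename
    return sorted(sources)
```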

<hr />

<h2 id="cold-start-reality">Cold Start Reality</h2>

<p>First query after deployment or idle:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>API Lambda init</td>
      <td>~500ms</td>
    </tr>
    <tr>
      <td>Orchestrator Lambda init</td>
      <td>~500ms</td>
    </tr>
    <tr>
      <td>Bedrock Agent session</td>
      <td>variable</td>
    </tr>
    <tr>
      <td>Knowledge Base connection</td>
      <td>first query slower</td>
    </tr>
  </tbody>
</table>

<p><strong>First query:</strong> 15-30 seconds (everything cold)
<strong>Warm queries:</strong> 3-8 seconds
<strong>After 10 min idle:</strong> Agent session expires, partial cold start</p>

<p>virtualme has the same cold start issues. Serverless trade-off.</p>

<hr />

<h2 id="cost-comparison">Cost Comparison</h2>

<h3 id="this-project-s3-vectors--multi-agent">This Project (S3 Vectors + Multi-Agent)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Monthly Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>S3 Vectors storage</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>S3 Vectors queries</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>Titan Embeddings (ingestion)</td>
      <td>&lt; $0.01</td>
    </tr>
    <tr>
      <td>Nova Lite (3 agents)</td>
      <td>~$0.30-0.70</td>
    </tr>
    <tr>
      <td>Lambda</td>
      <td>free tier</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$0.35-0.75</strong></td>
    </tr>
  </tbody>
</table>

<p>Multi-agent multiplies Bedrock costs: 3 agents per query, up to 3 iterations if critique score &lt; 7. Worst case: 9 model calls per query.</p>

<p><strong>Pricing reference</strong> (Nova Lite): $0.06/1M input tokens, $0.24/1M output tokens.</p>
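<p>A quick back-of-the-envelope check against those rates (the per-call token counts are illustrative guesses, not measurements from this project):</p>

```python
# Nova Lite rates from above; per-call token counts are illustrative guesses.
INPUT_RATE = 0.06 / 1_000_000   # $ per input token
OUTPUT_RATE = 0.24 / 1_000_000  # $ per output token

def query_cost(model_calls=9, input_tokens=2_000, output_tokens=500):
    # Worst case: 3 agents x 3 iterations = 9 model calls per query.
    per_call = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    return model_calls * per_call

# 9 * (2000 * $0.06/1M + 500 * $0.24/1M) = 9 * $0.00024, i.e. roughly
# $0.002 per worst-case query, which is consistent with a monthly bill
# well under a dollar at ~100 queries/month.
```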

<h3 id="virtualme-dynamodb">virtualme (DynamoDB)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Monthly Cost</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DynamoDB</td>
      <td>$0</td>
      <td>Free tier: 25 RCUs/WCUs, 25GB</td>
    </tr>
    <tr>
      <td>Nova Lite</td>
      <td>~$0.10-0.30</td>
      <td>Single agent per query</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>~$0.10-0.30</strong></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<h3 id="session-storage">Session Storage</h3>

<p>Bedrock Agent sessions are configured with <code class="language-plaintext highlighter-rouge">IdleSessionTTLInSeconds: 600</code> (10 minutes). After 10 minutes of inactivity, the session expires and conversation context is lost.</p>

<p>Session storage itself is free. Idle time doesn’t cost anything. What costs tokens is conversation history. The agent includes previous turns in each request:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Turn 1: "What is S3?"           →  ~50 input tokens
Turn 2: "Tell me more"          → ~150 input tokens (includes Turn 1)
Turn 3: "How about pricing?"    → ~300 input tokens (includes Turn 1+2)
</code></pre></div></div>
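<p>The growth pattern above is worth making explicit: per-request input grows linearly with the number of turns, so total input spend over a conversation grows quadratically. A toy model (per-turn token counts are illustrative):</p>

```python
def cumulative_input_tokens(turn_tokens):
    # Each request re-sends the full history, so the input size at turn n
    # is the sum of new tokens from turns 1..n.
    totals, running = [], 0
    for tokens in turn_tokens:
        running += tokens
        totals.append(running)
    return totals
```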

<p>For light usage (~100 queries/month), the difference is negligible.</p>

<h3 id="final-outcome">Final Outcome</h3>

<p>DynamoDB is cheaper because I hacked it. In virtualme, I scan all items and calculate cosine similarity client-side. This works because my notes corpus produces fewer than 1000 chunks. Beyond that, it would get slow and expensive.</p>

<p>S3 Vectors costs slightly more but removes all that custom code. Native similarity search, no client-side calculations, no manual orchestration. For anything larger than a small personal project, managed infrastructure wins.</p>

<p>Multi-agent adds overhead. Each query involves 3+ model calls. Worth it for complex reasoning tasks, overkill for simple Q&amp;A.</p>

<hr />

<h2 id="model-specialization-future-improvement">Model Specialization (Future Improvement)</h2>

<p>Currently all three agents use the same model (Nova 2 Lite). Using different models per role would improve results:</p>

<ul>
  <li><strong>Different Strengths</strong>: Each model brings different capabilities</li>
  <li><strong>Error Correction</strong>: Claude might catch what Nova misses</li>
  <li><strong>Cost Optimization</strong>: Use expensive models only where needed (e.g., Claude for critique, Nova for research)</li>
  <li><strong>Quality Improvement</strong>: Specialized models for specialized tasks</li>
</ul>

<p>I kept it simple for this learning exercise. Model specialization is a next step.</p>

<hr />

<h2 id="next-learning-path">Next Learning Path</h2>

<h3 id="1-peer-to-peer-agent-communication">1. Peer-to-Peer Agent Communication</h3>

<p>Current architecture uses orchestrator-driven coordination. Agents don’t talk to each other directly.</p>

<p>Next exploration:</p>

<ul>
  <li>Research agent directly asks critique agent for guidance mid-search</li>
  <li>Agents negotiate who handles which part of a complex query</li>
  <li>Dynamic task decomposition without central orchestrator</li>
</ul>

<p>This requires AgentCore’s agent-to-agent invocation. Different trade-offs: more autonomous, but less predictable and likely harder to debug.</p>

<h3 id="2-model-specialization-experiment">2. Model Specialization Experiment</h3>

<ul>
  <li>A/B test different model combinations per agent role</li>
  <li>Compare: Claude Haiku vs Nova Lite 2 vs Nova Micro 2</li>
  <li>Track: cost per query, response quality, iteration count</li>
</ul>

<hr />

<h2 id="what-i-learned">What I Learned</h2>

<ol>
  <li>
    <p><strong>S3 Vectors has edge cases.</strong> The 2048-byte metadata limit is not documented prominently. Cost me a few hours.</p>
  </li>
  <li>
    <p><strong>Different LLMs behave differently.</strong> Nova and Claude interpret the same agent prompts differently. Test with your actual model.</p>
  </li>
  <li>
    <p><strong>Agents hallucinate sources.</strong> Even with explicit instructions, models invent plausible filenames. Extract citations from the API trace, not the model output.</p>
  </li>
  <li>
    <p><strong>SAM is convenient but adds abstraction.</strong> When it works, great. When it breaks, you’re debugging two layers.</p>
  </li>
  <li>
    <p><strong>Multi-agent adds complexity and cost.</strong> 3 agents × 3 iterations = 9 model calls worst case. Worth it for quality, overkill for simple queries.</p>
  </li>
  <li>
    <p><strong>Orchestrator gives control.</strong> Lambda-based orchestration lets you extract trace data, manage iterations, and append real sources. Pure agent-to-agent would lose this visibility.</p>
  </li>
  <li>
    <p><strong>Free tier is hard to beat (in my case).</strong> DynamoDB + custom code is cheaper only because I have fewer than 1000 chunks. This hack won’t scale. For anything larger, managed services like S3 Vectors are the right choice.</p>
  </li>
</ol>

<hr />

<h2 id="resources">Resources</h2>

<ul>
  <li><a href="https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-preview/">S3 Vectors announcement</a></li>
  <li><a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html">Bedrock AgentCore docs</a></li>
  <li><a href="https://github.com/jeremylem/virtualme">virtualme (the DynamoDB approach)</a></li>
</ul>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[A RAG chatbot for personal notes using S3 Vectors and Bedrock Agents. Features true multi-agent collaboration with critique-driven feedback loop and multi-turn conversations. Built as a learning exercise after re:Invent 2025.]]></summary></entry><entry><title type="html">Building a Virtual Me: RAG-Powered Resume Chatbot on AWS</title><link href="https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot.html" rel="alternate" type="text/html" title="Building a Virtual Me: RAG-Powered Resume Chatbot on AWS" /><published>2026-01-11T00:00:00+00:00</published><updated>2026-01-11T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2026/01/11/Virtual_Me_RAG_Chatbot.html"><![CDATA[<h2 id="the-resume-problem">The Resume Problem</h2>

<p>I’ve been frustrated with traditional resumes for a while now. Not because mine is bad, but because the format itself is a bit broken.</p>

<p><strong>Two core problems:</strong></p>

<ol>
  <li>
    <p><strong>Resumes don’t translate experience well.</strong> You spend hours crafting bullet points that technically describe what you did, but they fail to capture the <em>why</em> behind your decisions, the trade-offs you considered, or the context that made your work meaningful. A line like “Implemented serverless RAG pipeline” does not tell you much.</p>
  </li>
  <li>
    <p><strong>Resumes are boring.</strong> Reading a resume is like reading a phone book. It’s a static, one-dimensional artifact that forces recruiters to play detective, inferring things that should be explicit. Want to know the why behind a decision, or the trade-offs considered? You’d have to schedule a call.</p>
  </li>
</ol>

<p>I wanted something better. Something that could answer questions about my experience for me.</p>

<p>So I built <strong>Virtual Me</strong>, an AI chatbot that represents me, fed with my actual resume and technical knowledge, using Retrieval Augmented Generation (RAG).</p>

<p>You can try it here: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<p>The rest of this post covers some technical details I learned along the way.</p>

<hr />

<h2 id="requestresponse-flow-with-json-examples">Request/Response Flow with JSON Examples</h2>

<p>Here is the complete end-to-end flow from user input to AI response, including actual JSON payloads exchanged between components, CORS handling, validation, RAG pipeline execution, and error cases.</p>

<h3 id="request-flow-diagram">Request Flow Diagram</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Input
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 1. Web UI (Deep Chat)                                       │
│    POST https://api.lemaire.tel/chat                        │
│    Content-Type: application/json                           │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Route 53                                                  │
│    DNS: api.lemaire.tel → API Gateway Endpoint              │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 3. API Gateway HTTP API                                      │
│    Transforms HTTP → Lambda Event (API Gateway v2 format)   │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 4. Lambda (lambda_function.py)                               │
│    - Validates with Pydantic (ChatRequest)                  │
│    - Truncates to last 20 messages                          │
│    - Invokes run_rag_pipeline(question)                     │
└─────────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────────┐
│ 5. LangGraph Workflow (rag/pipeline.py)                      │
│    ┌──────────────┐      ┌──────────────┐                  │
│    │ retrieve_node│ ───► │ generate_node│                  │
│    └──────────────┘      └──────────────┘                  │
│           │                      │                          │
│           ↓                      ↓                          │
│      DynamoDB              Amazon Bedrock                   │
└─────────────────────────────────────────────────────────────┘
    ↓
Response (reverse path)
</code></pre></div></div>

<h3 id="cors-preflight-handling">CORS Preflight Handling</h3>

<p>Before the actual POST request, browsers send a preflight OPTIONS request when making cross-origin calls (<code class="language-plaintext highlighter-rouge">chat.lemaire.tel</code> → <code class="language-plaintext highlighter-rouge">api.lemaire.tel</code>). The Lambda function explicitly returns 200 OK with CORS headers to allow the request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># lambda_function.py handles OPTIONS preflight
</span><span class="k">if</span> <span class="n">event</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"requestContext"</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">"http"</span><span class="p">,</span> <span class="p">{}).</span><span class="n">get</span><span class="p">(</span><span class="s">"method"</span><span class="p">)</span> <span class="o">==</span> <span class="s">"OPTIONS"</span><span class="p">:</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"statusCode"</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
        <span class="s">"headers"</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">"Access-Control-Allow-Origin"</span><span class="p">:</span> <span class="s">"https://chat.lemaire.tel"</span><span class="p">,</span>
            <span class="s">"Access-Control-Allow-Methods"</span><span class="p">:</span> <span class="s">"POST, OPTIONS"</span><span class="p">,</span>
            <span class="s">"Access-Control-Allow-Headers"</span><span class="p">:</span> <span class="s">"content-type"</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<h3 id="step-1-web-ui-request-deep-chat-format">Step 1: Web UI Request (Deep Chat Format)</h3>

<p><strong>The Lambda backend is completely stateless.</strong> All conversation history is maintained client-side by the Deep Chat UI and sent with every request.</p>

<p>First message:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is your experience with AWS?"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Third message in conversation (includes full history):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What technologies do you work with?"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ai"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I work with Python, AWS Lambda, Terraform, and Amazon Bedrock..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Tell me more about your AWS experience"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Why client-side state?</strong> Zero backend storage cost, instant scaling (no session affinity), privacy (conversations never stored), and simplicity.</p>

<h3 id="step-2-api-gateway-event-lambda-input">Step 2: API Gateway Event (Lambda Input)</h3>

<p>API Gateway transforms the HTTP request into a Lambda event (AWS API Gateway v2 format):</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.0"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"routeKey"</span><span class="p">:</span><span class="w"> </span><span class="s2">"POST /chat"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"rawPath"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/chat"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"requestContext"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"accountId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"123456789012"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"apiId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc123xyz"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"domainName"</span><span class="p">:</span><span class="w"> </span><span class="s2">"api.lemaire.tel"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"requestId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc-123-def-456"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"http"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"POST"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/chat"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"protocol"</span><span class="p">:</span><span class="w"> </span><span class="s2">"HTTP/1.1"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"sourceIp"</span><span class="p">:</span><span class="w"> </span><span class="s2">"203.0.113.42"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"userAgent"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Mozilla/5.0..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"time"</span><span class="p">:</span><span class="w"> </span><span class="s2">"10/Jan/2026:14:23:45 +0000"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"timeEpoch"</span><span class="p">:</span><span class="w"> </span><span class="mi">1736517825000</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"headers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"content-type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/json"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"host"</span><span class="p">:</span><span class="w"> </span><span class="s2">"api.lemaire.tel"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"origin"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://chat.lemaire.tel"</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"body"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">messages</span><span class="se">\"</span><span class="s2">:[{</span><span class="se">\"</span><span class="s2">role</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">user</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">text</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">What is your experience with AWS?</span><span class="se">\"</span><span class="s2">}]}"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"isBase64Encoded"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Note</strong>: Simplified for readability. Actual events include additional fields and more headers from the browser.</p>

<h3 id="step-3-lambda-processing">Step 3: Lambda Processing</h3>

<p>Lambda validates, truncates, and extracts the request:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># lambda_function.py
</span><span class="n">body</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">event</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"body"</span><span class="p">,</span> <span class="s">"{}"</span><span class="p">))</span>

<span class="c1"># Validation: Pydantic ensures schema validity. Malformed requests fail early.
</span><span class="n">chat_request</span> <span class="o">=</span> <span class="n">ChatRequest</span><span class="p">(</span><span class="o">**</span><span class="n">body</span><span class="p">)</span>

<span class="c1"># Truncation: Only last 20 messages retained to limit token usage and costs.
# Older context is irrelevant for immediate queries.
</span><span class="n">messages</span> <span class="o">=</span> <span class="n">chat_request</span><span class="p">.</span><span class="n">messages</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">20</span><span class="p">:</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>

<span class="c1"># Extract question and invoke RAG pipeline
</span><span class="n">question</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">text</span>
<span class="n">answer</span> <span class="o">=</span> <span class="n">run_rag_pipeline</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>  <span class="c1"># Invokes LangGraph workflow
</span></code></pre></div></div>

<h3 id="step-4-langgraph-internal-state">Step 4: LangGraph Internal State</h3>

<p>The RAG pipeline (<code class="language-plaintext highlighter-rouge">rag/pipeline.py</code>) implements a LangGraph state machine with two nodes:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">retrieve_node</code>: Calls DynamoDBRetriever, fetches chunks, flattens to context string</li>
  <li><code class="language-plaintext highlighter-rouge">generate_node</code>: Injects context into System Prompt, calls Bedrock</li>
</ul>

<p>State flows through the workflow:</p>

<p><strong>Initial State (after retrieve_node):</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is your experience with AWS?"</span><span class="p">,</span>
  <span class="s">"context"</span><span class="p">:</span> <span class="s">"Jeremy Lemaire</span><span class="se">\n\n</span><span class="s">AWS Solutions Architect..."</span><span class="p">,</span>
  <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"You are a helpful assistant representing Jeremy..."</span><span class="p">),</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"What is your experience with AWS?"</span><span class="p">)</span>
  <span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>After generate_node:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is your experience with AWS?"</span><span class="p">,</span>
  <span class="s">"context"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">,</span>
  <span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"You are a helpful assistant..."</span><span class="p">),</span>
    <span class="n">HumanMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"What is your experience with AWS?"</span><span class="p">),</span>
    <span class="n">AIMessage</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="s">"I have 8 years of experience as an AWS Solutions Architect..."</span><span class="p">)</span>
  <span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="step-5-dynamodb-query-internal">Step 5: DynamoDB Query (Internal)</h3>

<p>The embedding for “What is your experience with AWS?” is computed:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query_embedding</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.123</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.789</span><span class="p">,</span> <span class="p">...]</span>  <span class="c1"># 1024-dimensional vector
</span></code></pre></div></div>

<p>DynamoDB scan retrieves all items and computes cosine similarity client-side:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Results sorted by similarity score
</span><span class="p">[</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_0_abc123"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Jeremy Lemaire</span><span class="se">\n\n</span><span class="s">AWS Solutions Architect..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.87</span>
  <span class="p">},</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_3_def456"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Certifications:</span><span class="se">\n</span><span class="s">- AWS Certified Solutions Architect..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.82</span>
  <span class="p">},</span>
  <span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="s">"doc_1_ghi789"</span><span class="p">,</span>
    <span class="s">"text"</span><span class="p">:</span> <span class="s">"Technical Skills:</span><span class="se">\n</span><span class="s">- Python, Terraform, AWS Lambda..."</span><span class="p">,</span>
    <span class="s">"similarity"</span><span class="p">:</span> <span class="mf">0.78</span>
  <span class="p">}</span>
<span class="p">]</span>
<span class="c1"># Top 3 are concatenated into the context string
</span></code></pre></div></div>

<h3 id="step-6-bedrock-api-call-internal">Step 6: Bedrock API Call (Internal)</h3>

<p>Generate Node calls Amazon Bedrock:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"modelId"</span><span class="p">:</span><span class="w"> </span><span class="s2">"us.amazon.nova-lite-v2:0"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a Virtual Clone representing Jeremy Lemaire. Answer ONLY using information from the CONTEXT below.</span><span class="se">\n\n</span><span class="s2">CONTEXT:</span><span class="se">\n</span><span class="s2">Jeremy Lemaire</span><span class="se">\n\n</span><span class="s2">AWS Solutions Architect..."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is your experience with AWS?"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"inferenceConfig"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"temperature"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.1</span><span class="p">,</span><span class="w">
    </span><span class="nl">"maxTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">500</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Bedrock Response:</strong></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"output"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"message"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I have 8 years of experience as an AWS Solutions Architect..."</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"usage"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"inputTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">234</span><span class="p">,</span><span class="w">
    </span><span class="nl">"outputTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">47</span><span class="p">,</span><span class="w">
    </span><span class="nl">"totalTokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">281</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="architecture--lifecycle-cold-vs-warm-start">Architecture &amp; Lifecycle: Cold vs Warm Start</h2>

<h3 id="cold-start-4-seconds">Cold Start (~4 seconds)</h3>
<p>Occurs when no active execution environment exists (after ~15 minutes of inactivity).</p>
<ol>
  <li><strong>Python Imports</strong>: <code class="language-plaintext highlighter-rouge">import langchain</code> (~1.5s).</li>
  <li><strong>Module Level Initialization</strong>: Global singleton variables initialized.</li>
  <li><strong>Handler Execution</strong>:
    <ul>
      <li>Detects <code class="language-plaintext highlighter-rouge">_vector_store is None</code>.</li>
      <li><strong>Initialization</strong>: Establishes DynamoDB connection (SSL handshake ~0.5s), compiles graph (~0.5s).</li>
      <li><strong>Total Latency</strong>: ~4s</li>
    </ul>
  </li>
</ol>

<h3 id="warm-start-400ms">Warm Start (&lt;400ms)</h3>
<p>Occurs on subsequent requests to the same container.</p>
<ol>
  <li><strong>Container State</strong>: Memory preserved.</li>
  <li><strong>No Imports</strong>: Modules already loaded.</li>
  <li><strong>Persisted Globals</strong>: <code class="language-plaintext highlighter-rouge">_vector_store</code> already initialized.
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_retriever</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">_vector_store</span>
    <span class="k">if</span> <span class="n">_vector_store</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">_vector_store</span>  <span class="c1"># Immediate return
</span></code></pre></div>    </div>
  </li>
  <li><strong>Execution</strong>: Direct <code class="language-plaintext highlighter-rouge">graph.invoke()</code>.
    <ul>
      <li>Runtime limited to API I/O: ~20ms DynamoDB scan + ~300ms Bedrock generation.</li>
    </ul>
  </li>
</ol>

<p><strong>Optimization</strong>: Module-level global variables reuse connections across invocations.</p>

<hr />

<h2 id="technical-key-points">Technical Key Points</h2>

<h3 id="dynamodb-as-a-vector-store">DynamoDB as a Vector Store</h3>

<p>Vector search requires comparing the query against <strong>every single document</strong> to determine semantic proximity.</p>

<ul>
  <li><strong>Implementation</strong>: Full table scan with client-side cosine similarity</li>
  <li><strong>Process</strong>: Loads all 50 chunks into memory (~10ms) and computes cosine similarity in Python</li>
  <li><strong>Scalability</strong>: Effective up to ~1000 chunks</li>
</ul>

<p><strong>Why this works</strong>: For &lt;1000 items, brute-force scanning is practically free and requires zero maintenance. 1000 chunks × 4KB = 4MB data. DynamoDB scans 1MB per request = 4 round-trips (~60ms) + Python cosine calculation (~40ms) = ~100ms total.</p>
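<p>The ranking step can be sketched in isolation, with the DynamoDB scan replaced by an in-memory list (the chunk data and 4-dimensional vectors here are illustrative stand-ins for the real 1024-dimensional embeddings):</p>

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); assumes non-zero vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_embedding, chunks, k=3):
    # chunks: list of {"id", "text", "embedding"} dicts, as a full scan returns them
    scored = [
        {**c, "similarity": cosine_similarity(query_embedding, c["embedding"])}
        for c in chunks
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored[:k]

chunks = [
    {"id": "doc_0", "text": "AWS experience...", "embedding": [0.9, 0.1, 0.0, 0.2]},
    {"id": "doc_1", "text": "Python skills...", "embedding": [0.1, 0.8, 0.3, 0.0]},
    {"id": "doc_2", "text": "Certifications...", "embedding": [0.7, 0.2, 0.1, 0.3]},
]
results = top_k([1.0, 0.0, 0.0, 0.1], chunks, k=2)
```

Everything happens client-side in plain Python; the only AWS-specific part is where the `chunks` list comes from.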

<p><strong>When it doesn’t</strong>: Beyond 1000 chunks, you need Approximate Nearest Neighbor (ANN) algorithms via AWS OpenSearch, ChromaDB, or PostgreSQL with pgvector. ANN reduces search from O(n) to O(log n).</p>

<h3 id="binary-embedding-optimization">Binary Embedding Optimization</h3>
<p>Embeddings are <strong>1024 floats</strong> (Titan V2 default).</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="mf">0.123456789</span><span class="p">,</span><span class="w"> </span><span class="mf">0.23456789</span><span class="p">,</span><span class="w"> </span><span class="err">...</span><span class="p">]</span><span class="w"> </span><span class="err">//</span><span class="w"> </span><span class="err">JSON</span><span class="w"> </span><span class="err">representation</span><span class="w"> </span><span class="err">≈</span><span class="w"> </span><span class="mi">12</span><span class="err">KB</span><span class="w"> </span><span class="err">per</span><span class="w"> </span><span class="err">row</span><span class="w">
</span></code></pre></div></div>

<p>Packed into binary using <code class="language-plaintext highlighter-rouge">struct.pack</code>:</p>
<ul>
  <li>Standard float = 4 bytes. 1024 * 4 = <strong>4KB</strong></li>
  <li><strong>Result</strong>: 3x reduction in storage size and read throughput costs</li>
  <li><strong>Impact</strong>: Saves ~8GB for 1M rows</li>
</ul>
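<p>A minimal sketch of the packing round-trip (the helper names are mine, not the project’s; the essential part is the <code class="language-plaintext highlighter-rouge">struct</code> format string):</p>

```python
import struct

DIM = 1024  # Titan V2 embedding dimension

def pack_embedding(vector):
    # 1024 float32 values -> 4096-byte binary blob for the DynamoDB item
    return struct.pack(f"{len(vector)}f", *vector)

def unpack_embedding(blob):
    # Inverse: binary blob -> list of floats (at float32 precision)
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

vector = [0.001 * i for i in range(DIM)]
blob = pack_embedding(vector)        # 4KB, vs ~12KB as JSON text
restored = unpack_embedding(blob)
```

Note that `"f"` is single-precision: values round-trip only to float32 accuracy, which is far more precision than cosine similarity needs.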

<h3 id="langgraph-vs-langchain">LangGraph vs LangChain</h3>

<p><strong>LangChain (The “Traditional” Way)</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Hidden state flow. A bit opaque
</span><span class="n">chain</span> <span class="o">=</span> <span class="n">retriever</span> <span class="o">|</span> <span class="n">prompt</span> <span class="o">|</span> <span class="n">llm</span> <span class="o">|</span> <span class="n">parser</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">chain</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="s">"question"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>With LangGraph</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Clearly defined state schema
</span><span class="k">class</span> <span class="nc">GraphState</span><span class="p">(</span><span class="n">TypedDict</span><span class="p">):</span>
    <span class="n">question</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">context</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">messages</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">BaseMessage</span><span class="p">]</span>

<span class="c1"># Pure function nodes
</span><span class="k">def</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="n">retriever</span><span class="p">.</span><span class="n">get_relevant_documents</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"context"</span><span class="p">:</span> <span class="n">format_docs</span><span class="p">(</span><span class="n">docs</span><span class="p">)}</span>

<span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="n">state</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">prompt</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">context</span><span class="o">=</span><span class="n">state</span><span class="p">[</span><span class="s">"context"</span><span class="p">]))</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"messages"</span><span class="p">:</span> <span class="p">[</span><span class="n">response</span><span class="p">]}</span>

<span class="c1"># Explicit control flow
</span><span class="n">workflow</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s">"retrieve"</span><span class="p">,</span> <span class="s">"generate"</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Advantage</strong>: Full visibility into data transformations at every step.</p>

<hr />

<h2 id="few-performance-tricks">A Few Performance Tricks</h2>

<p>A bit of over-engineering for what I need, but it was fun to dig into.</p>

<h3 id="l1-cold-start-optimization">L1 Cold Start Optimization</h3>

<p>AWS uses <strong>Firecracker</strong> micro-VMs for Lambda execution environments. Once created, the environment is frozen and reused. I exploit this by using global variables to persist state:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># src/rag/dynamodb_retriever.py
</span><span class="n">_vector_store</span> <span class="o">=</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">get_vector_store</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">_vector_store</span>
    <span class="k">if</span> <span class="n">_vector_store</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c1"># EXPENSIVE: Runs only on new environment (~4s)
</span>        <span class="n">_vector_store</span> <span class="o">=</span> <span class="n">DynamoDBVectorStore</span><span class="p">(...)</span>
    <span class="k">return</span> <span class="n">_vector_store</span>
</code></pre></div></div>

<p>This avoids re-paying the <strong>~4-second</strong> initialization on every warm start.</p>

<h3 id="context-sliding-window">Context Sliding Window</h3>

<p>Unbounded conversation history causes <strong>Token Explosion</strong>.</p>

<p>If I send 100 messages of history, the 101st request pays for processing all 100 previous turns. Per-request cost grows linearly with conversation length.</p>

<p><strong>Solution</strong>: Strict cap at 20 messages.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">messages</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">20</span><span class="p">:</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="n">messages</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
</code></pre></div></div>

<p>Deterministic cost ceiling: Max cost per turn = (20 msgs × avg_tokens) + new_query</p>
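<p>The cap is easy to sanity-check in isolation (plain strings stand in for LangChain message objects here):</p>

```python
def apply_sliding_window(messages, max_messages=20):
    # Keep only the most recent messages; the oldest are dropped first
    if len(messages) > max_messages:
        return messages[-max_messages:]
    return messages

history = [f"msg_{i}" for i in range(25)]
window = apply_sliding_window(history)
# 25 messages in: the oldest 5 are dropped, the newest 20 kept
```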

<h3 id="rag-hyperparameters">RAG Hyperparameters</h3>

<h4 id="top-k--3">Top K = 3</h4>
<ul>
  <li><strong>Why not 1?</strong> Markdown splitting separates headers from content. Retrieving 3 captures surrounding semantic hierarchy.</li>
  <li><strong>Why not 10?</strong> Signal-to-noise ratio degrades. LLMs are known to ignore information buried in the middle (“Lost in the Middle” problem). 10 chunks = 3x more input tokens.</li>
</ul>

<h4 id="temperature--01">Temperature = 0.1</h4>
<ul>
  <li><strong>Goal</strong>: Determinism.</li>
  <li><strong>Logic</strong>: The LLM should act as a <strong>Retrieval Engine</strong>, not a Creative Writer.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">0.1</code>: “According to the text, Jeremy studied AWS.” (Fact)</li>
      <li><code class="language-plaintext highlighter-rouge">0.9</code>: “Jeremy, a cloud wizard, soared through the AWS skies…” (Hallucination)</li>
    </ul>
  </li>
</ul>

<h3 id="aws-adaptive-retries">AWS Adaptive Retries</h3>

<p>Naive fixed-interval retries make AWS throttling worse (the thundering-herd problem). Adaptive mode backs off dynamically instead:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">botocore.config</span> <span class="kn">import</span> <span class="n">Config</span>

<span class="n">BEDROCK_RETRY_CONFIG</span> <span class="o">=</span> <span class="n">Config</span><span class="p">(</span>
    <span class="n">retries</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'max_attempts'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'mode'</span><span class="p">:</span> <span class="s">'adaptive'</span>  <span class="c1"># Dynamic backoff for HTTP 429
</span>    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>

<h3 id="aws-x-ray-tracing">AWS X-Ray Tracing</h3>

<p>Lambda X-Ray is enabled:</p>
<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">resource</span> <span class="s2">"aws_lambda_function"</span> <span class="s2">"virtual_me"</span> <span class="p">{</span>
  <span class="nx">tracing_config</span> <span class="p">{</span>
    <span class="nx">mode</span> <span class="p">=</span> <span class="s2">"Active"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This automatically traces:</p>
<ul>
  <li>Lambda execution time and cold starts</li>
  <li>DynamoDB Scan operations with latency</li>
  <li>Bedrock InvokeModel calls with token counts and latency</li>
  <li>All boto3 SDK calls</li>
</ul>

<p><strong>No Python SDK required</strong> - Lambda’s X-Ray integration handles it automatically.</p>

<p><strong>API Gateway Limitation</strong>: I use <strong>HTTP API (v2)</strong> instead of REST API (v1) because:</p>
<ul>
  <li><strong>70% cheaper</strong>: $1.00/million vs $3.50/million requests</li>
  <li><strong>Simpler CORS</strong>: Native configuration vs manual OPTIONS handling</li>
</ul>

<p>Trade-off: HTTP API v2 does <strong>not support X-Ray tracing</strong>. Only Lambda traces are captured.</p>

<hr />

<h2 id="resource-utilization">Resource Utilization</h2>

<h3 id="lambda-memory-512mb">Lambda Memory: 512MB</h3>

<p><strong>Breakdown:</strong></p>
<ul>
  <li>Python Runtime + Boto3: ~120MB</li>
  <li>LangChain + Dependencies: ~180MB</li>
  <li>Graph Compilation &amp; Working State: ~50MB</li>
  <li><strong>Total Overhead</strong>: ~350MB</li>
  <li><strong>Safety Margin</strong>: ~160MB (for embedding processing and JSON overhead)</li>
</ul>

<p>Allocating less than 512MB triggers <code class="language-plaintext highlighter-rouge">Memory Limit Exceeded</code> during LangChain initialization.</p>

<hr />

<h2 id="data-storage-strategy">Data Storage Strategy</h2>

<h3 id="dynamodb-schema">DynamoDB Schema</h3>

<p><strong>Item Structure:</strong></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"doc_0_abc123"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I have 8 years..."</span><span class="p">,</span><span class="w">       </span><span class="err">//</span><span class="w"> </span><span class="err">Retrieved</span><span class="w"> </span><span class="err">&amp;</span><span class="w"> </span><span class="err">sent</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">LLM</span><span class="w">
  </span><span class="nl">"embedding"</span><span class="p">:</span><span class="w"> </span><span class="err">&lt;Binary</span><span class="w"> </span><span class="err">Blob&gt;</span><span class="p">,</span><span class="w">    </span><span class="err">//</span><span class="w"> </span><span class="err">Search</span><span class="w"> </span><span class="err">key</span><span class="w"> </span><span class="err">(cosine</span><span class="w"> </span><span class="err">similarity)</span><span class="w">
  </span><span class="nl">"metadata"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">source</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">...</span><span class="se">\"</span><span class="s2">}"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<h3 id="dynamodb-vs-dedicated-vector-db">DynamoDB vs Dedicated Vector DB</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: left">DynamoDB (My Approach)</th>
      <th style="text-align: left">Vector DB (Chroma)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Search Logic</strong></td>
      <td style="text-align: left">Client-side (Python) - fetch everything, compute similarity</td>
      <td style="text-align: left">Server-side - DB engine finds neighbors</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Scalability</strong></td>
      <td style="text-align: left">O(n) - slower as data grows</td>
      <td style="text-align: left">O(log n) - instant even with millions of rows</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Cost</strong></td>
      <td style="text-align: left">High read cost (pay to read every row)</td>
      <td style="text-align: left">Optimized for search</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Maintenance</strong></td>
      <td style="text-align: left">Zero</td>
      <td style="text-align: left">Run dedicated cluster ($hundreds/month)</td>
    </tr>
  </tbody>
</table>

<p><strong>Why DynamoDB</strong>: “Serverless Poor Man’s Vector DB”. For &lt;1000 items, brute-force scanning is practically free and requires zero maintenance.</p>

<hr />

<h2 id="cost-estimate">Cost Estimate</h2>

<p>For personal use (~100 conversations/month): <strong>~$0.60-$1.00/month</strong></p>

<table>
  <thead>
    <tr>
      <th>Service</th>
      <th>Estimated Cost</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Lambda</td>
      <td>$0</td>
      <td>Free tier: 1M requests/month</td>
    </tr>
    <tr>
      <td>API Gateway</td>
      <td>$0</td>
      <td>Free tier: 1M requests/month</td>
    </tr>
    <tr>
      <td>DynamoDB</td>
      <td>$0</td>
      <td>Free tier: 25 RCUs/WCUs, 25GB storage</td>
    </tr>
    <tr>
      <td>Bedrock (Nova Lite)</td>
      <td>$0.10 - $0.30</td>
      <td>~700 queries × 2K input + 300 output tokens</td>
    </tr>
    <tr>
      <td>S3 + CloudFront</td>
      <td>$0.05 - $0.15</td>
      <td>Static frontend hosting</td>
    </tr>
    <tr>
      <td>Route53</td>
      <td>$0.50</td>
      <td>1 hosted zone</td>
    </tr>
  </tbody>
</table>

<p><strong>Idle Cost</strong>: $0. Pure pay-per-request serverless architecture.</p>

<p><strong>Pricing reference</strong> (Nova Lite): $0.06/1M input tokens, $0.24/1M output tokens.</p>
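<p>Plugging the estimates above into those prices confirms the Bedrock line item (query volume and token counts are the rough figures from the table, not measurements):</p>

```python
# Nova Lite pricing, USD per 1M tokens
INPUT_PRICE = 0.06
OUTPUT_PRICE = 0.24

queries = 700
input_tokens = queries * 2_000   # ~2K input tokens per query (prompt + context)
output_tokens = queries * 300    # ~300 output tokens per answer

cost = (input_tokens / 1_000_000) * INPUT_PRICE \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE
# 1.4M input -> $0.084, 0.21M output -> $0.0504, total ≈ $0.13
```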

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Traditional resumes are broken. They don’t capture context, trade-offs, or technical depth. They’re boring documents that force everyone to play guessing games.</p>

<p>The architecture is fully serverless:</p>
<ul>
  <li><strong>Compute</strong>: Lambda (pay-per-request)</li>
  <li><strong>Storage</strong>: DynamoDB (pay-per-request)</li>
  <li><strong>Inference</strong>: Bedrock (pay-per-token)</li>
  <li><strong>Idle Cost</strong>: $0</li>
</ul>

<p>It’s cheap, fast (400ms warm start), and actually represents how I think about my work - with context and technical nuance.</p>

<p>Try it: <a href="https://chat.lemaire.tel">chat.lemaire.tel</a></p>

<p>Source code: <a href="https://github.com/ox00004a/virtualme">github.com/ox00004a/virtualme</a></p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The Resume Problem]]></summary></entry><entry><title type="html">Migrating Spring PetClinic to DDD with Spring Modulith and jMolecules</title><link href="https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith.html" rel="alternate" type="text/html" title="Migrating Spring PetClinic to DDD with Spring Modulith and jMolecules" /><published>2025-12-30T00:00:00+00:00</published><updated>2025-12-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/30/DDD_Spring_Modulith.html"><![CDATA[<p>The classic Spring PetClinic is everyone’s first Spring app. Simple, clean, easy to follow. But as codebases grow, that simplicity tends to erode. Business logic scatters across controllers and services. Changes ripple unpredictably. New developers take months to become productive.</p>

<p>I modernized PetClinic as a reference implementation for Domain-Driven Design using Spring Modulith, jMolecules, ByteBuddy, and ArchUnit. Here’s what I learned.</p>

<h2 id="why-ddd-now">Why DDD Now?</h2>

<p>Eric Evans published the Blue Book in 2003. For years, applying DDD to Spring/Hibernate meant wrestling with anemic domain models: entities reduced to data containers with getters and setters, business logic scattered across service layers.</p>

<p>That’s changed. Thanks to <a href="https://odrotbohm.de/">Oliver Drotbohm</a> and the work on Spring Modulith and jMolecules, we can build rich domain models with proper encapsulation, enforce module boundaries at compile time, and prepare a monolith for eventual microservice extraction, all while keeping Spring productive.</p>

<p>DDD is seeing a renaissance for three reasons:</p>

<ul>
  <li><strong>Microservices need boundaries.</strong> Teams discovered that decomposing monoliths without clear domain boundaries leads to distributed monoliths. DDD’s Bounded Contexts provide a principled way to define service boundaries.</li>
  <li><strong>Event-driven architecture.</strong> Modern systems communicate through events. DDD’s Domain Events pattern maps directly to event sourcing and message-driven microservices.</li>
  <li><strong>Better tooling.</strong> Frameworks like Spring Modulith and jMolecules finally make DDD practical in Java without fighting the framework.</li>
</ul>

<h2 id="the-architecture">The Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────┐
│                    Spring PetClinic                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────┐         ┌──────────────────┐          │
│  │  Owner Module    │         │   Vet Module     │          │
│  │                  │ events  │                  │          │
│  │  - Owner         │────────▶│  - Vet           │          │
│  │  - Pet           │         │  - Specialty     │          │
│  │  - Visit         │         │  - Patient       │          │
│  │  - PetType       │         │    Tracking      │          │
│  └────────┬─────────┘         └────────┬─────────┘          │
│           │                            │                     │
│           ▼                            ▼                     │
│  ┌─────────────────────────────────────────────────┐        │
│  │              Shared Kernel (model)               │        │
│  │         Person, PersonName, NamedEntity          │        │
│  └─────────────────────────────────────────────────┘        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Modules communicate through events, not direct calls. When you’re ready to extract a microservice, the boundaries are already clean.</p>

<h2 id="the-stack">The Stack</h2>

<p>Four pieces work together:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Spring Modulith</strong></td>
      <td>Defines and verifies module boundaries</td>
    </tr>
    <tr>
      <td><strong>jMolecules</strong></td>
      <td>DDD building blocks as interfaces (Entity, ValueObject, AggregateRoot)</td>
    </tr>
    <tr>
      <td><strong>ByteBuddy</strong></td>
      <td>Weaves JPA annotations at compile time so domain classes stay clean</td>
    </tr>
    <tr>
      <td><strong>ArchUnit</strong></td>
      <td>Enforces architecture rules in tests</td>
    </tr>
  </tbody>
</table>

<h2 id="tactical-ddd-building-blocks">Tactical DDD Building Blocks</h2>

<h3 id="type-safe-identifiers">Type-Safe Identifiers</h3>

<p>Primitive obsession, using <code class="language-plaintext highlighter-rouge">Integer</code> or <code class="language-plaintext highlighter-rouge">Long</code> for IDs, leads to accidentally mixing different entity IDs. Wrap identifiers in type-safe records:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">PetId</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"id"</span><span class="o">)</span> <span class="no">UUID</span> <span class="n">value</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">Identifier</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="nf">PetId</span><span class="o">()</span> <span class="o">{</span> <span class="k">this</span><span class="o">(</span><span class="no">UUID</span><span class="o">.</span><span class="na">randomUUID</span><span class="o">());</span> <span class="o">}</span>
<span class="o">}</span>

<span class="kd">public</span> <span class="n">record</span> <span class="nf">OwnerId</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"id"</span><span class="o">)</span> <span class="no">UUID</span> <span class="n">value</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">Identifier</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="nf">OwnerId</span><span class="o">()</span> <span class="o">{</span> <span class="k">this</span><span class="o">(</span><span class="no">UUID</span><span class="o">.</span><span class="na">randomUUID</span><span class="o">());</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Now you can’t pass an <code class="language-plaintext highlighter-rouge">OwnerId</code> where a <code class="language-plaintext highlighter-rouge">PetId</code> is expected. Compile-time safety.</p>
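<p>A quick, self-contained sketch of what this buys you (plain-Java copies of the records above, minus the JPA mapping; <code class="language-plaintext highlighter-rouge">findPet</code> is illustrative):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.UUID;

public class IdDemo {
    // Mirrors the article's records, without @Column / Identifier
    record PetId(UUID value) { PetId() { this(UUID.randomUUID()); } }
    record OwnerId(UUID value) { OwnerId() { this(UUID.randomUUID()); } }

    static String findPet(PetId id) { return "looking up pet " + id.value(); }

    public static void main(String[] args) {
        System.out.println(findPet(new PetId()));   // fine
        // findPet(new OwnerId());                  // does not compile: incompatible types
    }
}
</code></pre></div></div>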

<h3 id="value-objects">Value Objects</h3>

<p>Create immutable Value Objects that encapsulate both data and behavior:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">BirthDate</span><span class="o">(</span><span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"birth_date"</span><span class="o">)</span> <span class="nc">LocalDate</span> <span class="n">date</span><span class="o">)</span> <span class="kd">implements</span> <span class="nc">ValueObject</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="nc">BirthDate</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">date</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Birth date must not be null"</span><span class="o">);</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">date</span><span class="o">.</span><span class="na">isAfter</span><span class="o">(</span><span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">()))</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">IllegalArgumentException</span><span class="o">(</span><span class="s">"Birth date cannot be in the future"</span><span class="o">);</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">int</span> <span class="nf">getAgeInYears</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">return</span> <span class="nc">Period</span><span class="o">.</span><span class="na">between</span><span class="o">(</span><span class="n">date</span><span class="o">,</span> <span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">()).</span><span class="na">getYears</span><span class="o">();</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isElderly</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="n">getAgeInYears</span><span class="o">()</span> <span class="o">&gt;=</span> <span class="mi">7</span><span class="o">;</span> <span class="o">}</span>
    <span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isPuppy</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="n">getAgeInYears</span><span class="o">()</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="o">;</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Self-validating. Behavior lives with the data. No more validation logic scattered across services.</p>
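<p>Here is the record in action as a self-contained sketch (the <code class="language-plaintext highlighter-rouge">@Column</code> annotation is omitted so it runs as plain Java):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.time.LocalDate;
import java.time.Period;

public class BirthDateDemo {
    // Plain-Java copy of the record above, minus the JPA mapping
    record BirthDate(LocalDate date) {
        BirthDate {
            if (date == null) throw new IllegalArgumentException("Birth date must not be null");
            if (date.isAfter(LocalDate.now())) throw new IllegalArgumentException("Birth date cannot be in the future");
        }
        int getAgeInYears() { return Period.between(date, LocalDate.now()).getYears(); }
        boolean isElderly() { return getAgeInYears() &gt;= 7; }
    }

    public static void main(String[] args) {
        BirthDate senior = new BirthDate(LocalDate.now().minusYears(8));
        System.out.println(senior.isElderly()); // prints true

        // Invalid state is unrepresentable: construction itself throws
        try {
            new BirthDate(LocalDate.now().plusDays(1));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
</code></pre></div></div>

<p>The invariant travels with the type: any code holding a <code class="language-plaintext highlighter-rouge">BirthDate</code> can trust it is valid.</p>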

<h3 id="entities-and-aggregates">Entities and Aggregates</h3>

<p>An <strong>Entity</strong> has identity that persists across state changes. An <strong>Aggregate</strong> is a cluster of entities and value objects treated as a single consistency unit. One entity is the <strong>Aggregate Root</strong>—all access goes through it.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Owner is the Aggregate Root</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Owner</span> <span class="kd">extends</span> <span class="nc">Person</span> <span class="kd">implements</span> <span class="nc">AggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">OwnerId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">OwnerId</span> <span class="n">id</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">OwnerId</span><span class="o">();</span>
    <span class="kd">private</span> <span class="nc">Set</span><span class="o">&lt;</span><span class="nc">Pet</span><span class="o">&gt;</span> <span class="n">pets</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">LinkedHashSet</span><span class="o">&lt;&gt;();</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addPet</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">pets</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="nc">Pet</span> <span class="nf">getPet</span><span class="o">(</span><span class="nc">String</span> <span class="n">name</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="n">pets</span><span class="o">.</span><span class="na">stream</span><span class="o">()</span>
            <span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">p</span> <span class="o">-&gt;</span> <span class="n">p</span><span class="o">.</span><span class="na">getName</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">name</span><span class="o">))</span>
            <span class="o">.</span><span class="na">findFirst</span><span class="o">()</span>
            <span class="o">.</span><span class="na">orElse</span><span class="o">(</span><span class="kc">null</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>

<span class="c1">// Pet is an Entity within the Owner aggregate</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">extends</span> <span class="nc">NamedEntity</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">PetId</span> <span class="n">id</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PetId</span><span class="o">();</span>
    <span class="kd">private</span> <span class="nc">BirthDate</span> <span class="n">birthDateValue</span><span class="o">;</span>
    <span class="kd">private</span> <span class="nc">Set</span><span class="o">&lt;</span><span class="nc">Visit</span><span class="o">&gt;</span> <span class="n">visits</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">LinkedHashSet</span><span class="o">&lt;&gt;();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Rules: Only the Aggregate Root has a repository. External objects reference the aggregate by ID only. Invariants are enforced within the aggregate boundary.</p>
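<p>In code, "only the Aggregate Root has a repository" looks roughly like this (a sketch assuming Spring Data JPA; <code class="language-plaintext highlighter-rouge">OwnerRepository</code> is illustrative):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

// The aggregate root gets a repository...
@Repository
public interface OwnerRepository extends JpaRepository&lt;Owner, OwnerId&gt; {
}

// ...but there is deliberately no PetRepository. Pets are loaded and
// saved only through their Owner, so aggregate invariants can't be bypassed:
//
//   Owner owner = owners.findById(ownerId).orElseThrow();
//   owner.addPet(pet);
//   owners.save(owner);
</code></pre></div></div>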

<h3 id="cross-aggregate-references-with-association">Cross-Aggregate References with Association</h3>

<p>Direct references between aggregates create tight coupling. If <code class="language-plaintext highlighter-rouge">Pet</code> holds a direct reference to <code class="language-plaintext highlighter-rouge">PetType</code>, changes to <code class="language-plaintext highlighter-rouge">PetType</code> can break <code class="language-plaintext highlighter-rouge">Pet</code>. Worse, JPA eagerly loads the entire object graph.</p>

<p>Use <code class="language-plaintext highlighter-rouge">Association&lt;T, ID&gt;</code> to hold only the ID reference:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="c1">// Don't do this - crosses aggregate boundary</span>
    <span class="c1">// private PetType type;</span>

    <span class="c1">// Do this - store only the reference</span>
    <span class="kd">private</span> <span class="nc">Association</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">type</span><span class="o">;</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">setType</span><span class="o">(</span><span class="nc">PetType</span> <span class="n">type</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">=</span> <span class="n">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="nc">Association</span><span class="o">.</span><span class="na">forAggregate</span><span class="o">(</span><span class="n">type</span><span class="o">)</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>

    <span class="kd">public</span> <span class="nc">PetTypeId</span> <span class="nf">getTypeId</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span><span class="o">.</span><span class="na">getId</span><span class="o">()</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="resolving-associations-with-associationresolver">Resolving Associations with AssociationResolver</h4>

<p>When you need the actual <code class="language-plaintext highlighter-rouge">PetType</code> object, resolve it explicitly using <code class="language-plaintext highlighter-rouge">AssociationResolver</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Pet</span> <span class="kd">implements</span> <span class="nc">Entity</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">PetId</span><span class="o">&gt;</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="nc">Association</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">type</span><span class="o">;</span>

    <span class="c1">// Resolve when needed - caller provides the resolver</span>
    <span class="kd">public</span> <span class="nc">PetType</span> <span class="nf">resolveType</span><span class="o">(</span><span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">resolver</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">type</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="n">resolver</span><span class="o">.</span><span class="na">resolve</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">type</span><span class="o">).</span><span class="na">orElse</span><span class="o">(</span><span class="kc">null</span><span class="o">)</span> <span class="o">:</span> <span class="kc">null</span><span class="o">;</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The repository implements <code class="language-plaintext highlighter-rouge">AssociationResolver</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Repository</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">PetTypeRepository</span>
        <span class="kd">extends</span> <span class="nc">JpaRepository</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;,</span>
                <span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="o">{</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Usage in application service:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">PetApplicationService</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">PetTypeRepository</span> <span class="n">petTypes</span><span class="o">;</span>

    <span class="kd">public</span> <span class="nc">PetTypeInfo</span> <span class="nf">getPetTypeInfo</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">PetType</span> <span class="n">type</span> <span class="o">=</span> <span class="n">pet</span><span class="o">.</span><span class="na">resolveType</span><span class="o">(</span><span class="n">petTypes</span><span class="o">);</span>  <span class="c1">// Explicit resolution</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">PetTypeInfo</span><span class="o">(</span><span class="n">type</span><span class="o">.</span><span class="na">getName</span><span class="o">());</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="why-this-pattern-matters">Why This Pattern Matters</h4>

<table>
  <thead>
    <tr>
      <th>Benefit</th>
      <th>Explanation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Aggregate boundaries respected</strong></td>
      <td><code class="language-plaintext highlighter-rouge">Pet</code> doesn’t hold a direct reference to <code class="language-plaintext highlighter-rouge">PetType</code>—only its ID</td>
    </tr>
    <tr>
      <td><strong>Type safety</strong></td>
      <td><code class="language-plaintext highlighter-rouge">AssociationResolver&lt;PetType, PetTypeId&gt;</code> ensures you can’t accidentally resolve to wrong type</td>
    </tr>
    <tr>
      <td><strong>Explicit dependencies</strong></td>
      <td>Resolution requires injecting the resolver—no hidden database calls</td>
    </tr>
    <tr>
      <td><strong>Testable</strong></td>
      <td>Easy to mock <code class="language-plaintext highlighter-rouge">AssociationResolver</code> in unit tests</td>
    </tr>
    <tr>
      <td><strong>Lazy loading</strong></td>
      <td>Associations are resolved only when explicitly requested, not eagerly by JPA</td>
    </tr>
    <tr>
      <td><strong>jMolecules standard</strong></td>
      <td>Follows the framework’s best practices for cross-aggregate references</td>
    </tr>
  </tbody>
</table>

<h4 id="testing-with-mocked-resolver">Testing with Mocked Resolver</h4>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kt">void</span> <span class="nf">shouldResolvePetType</span><span class="o">()</span> <span class="o">{</span>
    <span class="nc">PetType</span> <span class="n">dog</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PetType</span><span class="o">(</span><span class="s">"Dog"</span><span class="o">);</span>
    <span class="nc">Pet</span> <span class="n">pet</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Pet</span><span class="o">();</span>
    <span class="n">pet</span><span class="o">.</span><span class="na">setType</span><span class="o">(</span><span class="n">dog</span><span class="o">);</span>

    <span class="c1">// Mock the resolver</span>
    <span class="nc">AssociationResolver</span><span class="o">&lt;</span><span class="nc">PetType</span><span class="o">,</span> <span class="nc">PetTypeId</span><span class="o">&gt;</span> <span class="n">resolver</span> <span class="o">=</span> <span class="n">mock</span><span class="o">(</span><span class="nc">AssociationResolver</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
    <span class="n">when</span><span class="o">(</span><span class="n">resolver</span><span class="o">.</span><span class="na">resolve</span><span class="o">(</span><span class="n">any</span><span class="o">())).</span><span class="na">thenReturn</span><span class="o">(</span><span class="nc">Optional</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">dog</span><span class="o">));</span>

    <span class="nc">PetType</span> <span class="n">resolved</span> <span class="o">=</span> <span class="n">pet</span><span class="o">.</span><span class="na">resolveType</span><span class="o">(</span><span class="n">resolver</span><span class="o">);</span>

    <span class="n">assertThat</span><span class="o">(</span><span class="n">resolved</span><span class="o">.</span><span class="na">getName</span><span class="o">()).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="s">"Dog"</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>No database needed. The domain logic is fully testable in isolation.</p>

<h3 id="domain-events">Domain Events</h3>

<p>Domain Events are the key to decoupling modules. When something significant happens in one module, it publishes an event. Other modules react without knowing about each other.</p>

<h4 id="defining-events">Defining Events</h4>

<p>Events are immutable records of something that happened. Use past tense naming—<code class="language-plaintext highlighter-rouge">PetAdopted</code>, not <code class="language-plaintext highlighter-rouge">AdoptPet</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="n">record</span> <span class="nf">PetAdoptedEvent</span><span class="o">(</span>
    <span class="nc">PetId</span> <span class="n">petId</span><span class="o">,</span>
    <span class="nc">PetTypeId</span> <span class="n">petTypeId</span><span class="o">,</span>
    <span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">,</span>
    <span class="nc">LocalDate</span> <span class="n">adoptionDate</span>
<span class="o">)</span> <span class="kd">implements</span> <span class="nc">DomainEvent</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kd">static</span> <span class="nc">PetAdoptedEvent</span> <span class="nf">of</span><span class="o">(</span><span class="nc">PetId</span> <span class="n">petId</span><span class="o">,</span> <span class="nc">PetTypeId</span> <span class="n">petTypeId</span><span class="o">,</span> <span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">PetAdoptedEvent</span><span class="o">(</span><span class="n">petId</span><span class="o">,</span> <span class="n">petTypeId</span><span class="o">,</span> <span class="n">ownerId</span><span class="o">,</span> <span class="nc">LocalDate</span><span class="o">.</span><span class="na">now</span><span class="o">());</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Include only the data consumers need. Use IDs, not full entities—keeps events lightweight and avoids coupling to aggregate internals.</p>

<h4 id="publishing-events-two-patterns">Publishing Events: Two Patterns</h4>

<p><strong>Pattern 1: From the Aggregate (Pure DDD)</strong></p>

<p>The aggregate registers events internally. Spring Data publishes them when the aggregate is saved:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Owner</span> <span class="kd">extends</span> <span class="nc">AbstractAggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">&gt;</span>
        <span class="kd">implements</span> <span class="nc">AggregateRoot</span><span class="o">&lt;</span><span class="nc">Owner</span><span class="o">,</span> <span class="nc">OwnerId</span><span class="o">&gt;</span> <span class="o">{</span>

    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addPet</span><span class="o">(</span><span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">pets</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
        <span class="c1">// Register event - published after save() completes</span>
        <span class="n">registerEvent</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">(),</span> <span class="n">pet</span><span class="o">.</span><span class="na">getTypeId</span><span class="o">(),</span> <span class="k">this</span><span class="o">.</span><span class="na">id</span><span class="o">));</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The event is published automatically when <code class="language-plaintext highlighter-rouge">ownerRepository.save(owner)</code> commits. No explicit publisher needed.</p>

<p><strong>Pattern 2: From the Application Service (Pragmatic)</strong></p>

<p>The application service publishes events explicitly:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">PetApplicationService</span> <span class="o">{</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">OwnerRepository</span> <span class="n">owners</span><span class="o">;</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">ApplicationEventPublisher</span> <span class="n">events</span><span class="o">;</span>

    <span class="nd">@Transactional</span>
    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">adoptPet</span><span class="o">(</span><span class="nc">OwnerId</span> <span class="n">ownerId</span><span class="o">,</span> <span class="nc">Pet</span> <span class="n">pet</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">Owner</span> <span class="n">owner</span> <span class="o">=</span> <span class="n">owners</span><span class="o">.</span><span class="na">findById</span><span class="o">(</span><span class="n">ownerId</span><span class="o">).</span><span class="na">orElseThrow</span><span class="o">();</span>
        <span class="n">owner</span><span class="o">.</span><span class="na">addPet</span><span class="o">(</span><span class="n">pet</span><span class="o">);</span>
        <span class="n">owners</span><span class="o">.</span><span class="na">save</span><span class="o">(</span><span class="n">owner</span><span class="o">);</span>

        <span class="n">events</span><span class="o">.</span><span class="na">publishEvent</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">(),</span> <span class="n">pet</span><span class="o">.</span><span class="na">getTypeId</span><span class="o">(),</span> <span class="n">ownerId</span><span class="o">));</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>When to use which:</strong></p>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>Use When</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Aggregate</td>
      <td>Event is intrinsic to domain logic; you want pure domain model</td>
    </tr>
    <tr>
      <td>Application Service</td>
      <td>Event depends on application context; you need more control over timing</td>
    </tr>
  </tbody>
</table>

<h4 id="subscribing-to-events">Subscribing to Events</h4>

<p>Use <code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> for cross-module event handling:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">class</span> <span class="nc">VetPatientTrackingService</span> <span class="o">{</span>

    <span class="nd">@ApplicationModuleListener</span>
    <span class="kt">void</span> <span class="nf">onPetAdopted</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"New patient registered: Pet ID="</span> <span class="o">+</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">());</span>
        <span class="c1">// Update vet module's view of patients</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> is Spring Modulith’s annotation that combines <code class="language-plaintext highlighter-rouge">@TransactionalEventListener</code> with <code class="language-plaintext highlighter-rouge">@Async</code> and <code class="language-plaintext highlighter-rouge">@Transactional(propagation = Propagation.REQUIRES_NEW)</code>. The listener runs in a separate thread and a new transaction after the publishing transaction commits.</p>
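<p>Expanded, the meta-annotation is roughly equivalent to this combination (a sketch based on the Spring Modulith reference documentation, not a verbatim copy of its source):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// What @ApplicationModuleListener stands for, spelled out:
@Async                                                  // runs on a separate thread
@Transactional(propagation = Propagation.REQUIRES_NEW)  // in its own transaction
@TransactionalEventListener                             // after the publishing tx commits
void onPetAdopted(PetAdoptedEvent event) {
    // ...
}
</code></pre></div></div>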

<h4 id="transactional-event-handling">Transactional Event Handling</h4>

<p>Understanding transaction boundaries is critical:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────┐
│  Publishing Transaction                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ 1. owner.addPet(pet)                                         │   │
│  │ 2. ownerRepository.save(owner)                               │   │
│  │ 3. Event stored in publication registry (if enabled)        │   │
│  │ 4. COMMIT                                                     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Listener Transaction (separate)                              │   │
│  │ 5. @ApplicationModuleListener receives event                 │   │
│  │ 6. Listener does its work                                    │   │
│  │ 7. COMMIT (or ROLLBACK - doesn't affect publisher)          │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Key insight: listener failures don’t roll back the publishing transaction. The pet is adopted even if the vet notification fails. This is usually what you want—but you need to handle listener failures.</p>

<h4 id="reliable-event-publication">Reliable Event Publication</h4>

<p>What happens if the listener fails? Or the application crashes between publishing and processing? Spring Modulith’s <strong>Event Publication Registry</strong> solves this.</p>

<p>Add the dependency and configure a database-backed registry:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-jpa<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Configuration</span>
<span class="kd">class</span> <span class="nc">ModulithConfig</span> <span class="o">{</span>

    <span class="nd">@Bean</span>
    <span class="nc">ApplicationRunner</span> <span class="nf">eventPublicationRegistrar</span><span class="o">(</span><span class="nc">IncompleteEventPublications</span> <span class="n">incomplete</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="n">args</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="c1">// On startup, retry any incomplete publications</span>
            <span class="n">incomplete</span><span class="o">.</span><span class="na">resubmitIncompletePublications</span><span class="o">(</span><span class="n">publication</span> <span class="o">-&gt;</span> <span class="kc">true</span><span class="o">);</span>
        <span class="o">};</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>How it works:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────────────────┐
│  1. Event Published                                                  │
│     └─&gt; Stored in EVENT_PUBLICATION table with status INCOMPLETE    │
│                                                                      │
│  2. Listener Invoked                                                 │
│     └─&gt; If SUCCESS: Mark publication COMPLETED                      │
│     └─&gt; If FAILURE: Publication remains INCOMPLETE                  │
│                                                                      │
│  3. On Application Restart                                           │
│     └─&gt; resubmitIncompletePublications() retries failed events      │
└─────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>The registry guarantees at-least-once delivery. Listeners must be idempotent—they might receive the same event twice after a crash recovery.</p>
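<p>Idempotency can be as simple as remembering which event ids have already been processed. A plain-Java sketch of the idea (hypothetical names; persistence and Spring wiring omitted so the snippet stands alone):</p>

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical listener: skips events it has already handled,
// so an at-least-once redelivery after a crash is harmless.
class IdempotentPatientRegistrar {

    // In production this would be a database table, not an in-memory set.
    private final Set<UUID> processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean onPetAdopted(UUID eventId, int petId) {
        if (!processed.add(eventId)) {
            return false; // already seen: do nothing on redelivery
        }
        registerPatient(petId);
        return true;
    }

    private void registerPatient(int petId) {
        // side effect goes here, e.g. inserting a vet patient record
    }
}
```

<p>Deduplicating on a stable event id, rather than on payload fields, keeps the check cheap and unambiguous.</p>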

<h4 id="async-vs-sync-processing">Async vs Sync Processing</h4>

<p>By default, <code class="language-plaintext highlighter-rouge">@ApplicationModuleListener</code> is async—listeners run in a separate thread after the transaction commits. For synchronous processing within the same transaction:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@TransactionalEventListener</span><span class="o">(</span><span class="n">phase</span> <span class="o">=</span> <span class="nc">TransactionPhase</span><span class="o">.</span><span class="na">BEFORE_COMMIT</span><span class="o">)</span>
<span class="kt">void</span> <span class="nf">onPetAdoptedSync</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
    <span class="c1">// Runs in same transaction as publisher</span>
    <span class="c1">// If this fails, the whole transaction rolls back</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Use sync listeners sparingly. They couple the modules more tightly—a listener failure affects the publisher.</p>

<h4 id="error-handling-in-listeners">Error Handling in Listeners</h4>

<p>Async listeners need explicit error handling:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModuleListener</span>
<span class="kt">void</span> <span class="nf">onPetAdopted</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span> <span class="n">event</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">try</span> <span class="o">{</span>
        <span class="n">vetPatientService</span><span class="o">.</span><span class="na">registerPatient</span><span class="o">(</span><span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">());</span>
    <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">log</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"Failed to register patient for pet {}"</span><span class="o">,</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">(),</span> <span class="n">e</span><span class="o">);</span>
        <span class="c1">// Options:</span>
        <span class="c1">// 1. Rethrow - event stays INCOMPLETE, retried on restart</span>
        <span class="c1">// 2. Swallow - event marked COMPLETED, lost</span>
        <span class="c1">// 3. Send to dead letter queue for manual handling</span>
        <span class="k">throw</span> <span class="n">e</span><span class="o">;</span>  <span class="c1">// Prefer rethrowing for automatic retry</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h4 id="exposing-events-as-a-module-api">Exposing Events as a Module API</h4>

<p>Events are the public API between modules. Expose them through a named interface:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>owner/
├── package-info.java          # @ApplicationModule
├── Owner.java                 # Internal
├── OwnerRepository.java       # Internal
└── events/
    ├── package-info.java      # Named interface
    └── PetAdoptedEvent.java   # Public API
</code></pre></div></div>

<p>Other modules declare dependency on events only:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Vet Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="o">{</span> <span class="s">"model"</span><span class="o">,</span> <span class="s">"owner::events"</span> <span class="o">}</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.vet</span><span class="o">;</span>
</code></pre></div></div>

<p>The vet module can listen to <code class="language-plaintext highlighter-rouge">PetAdoptedEvent</code> but cannot access <code class="language-plaintext highlighter-rouge">Owner</code>, <code class="language-plaintext highlighter-rouge">OwnerRepository</code>, or any other internal class. True decoupling.</p>

<h4 id="testing-events">Testing Events</h4>

<p>Spring Modulith provides testing support:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModuleTest</span>
<span class="kd">class</span> <span class="nc">OwnerModuleTests</span> <span class="o">{</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">petAdoptionPublishesEvent</span><span class="o">(</span><span class="nc">Scenario</span> <span class="n">scenario</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">scenario</span><span class="o">.</span><span class="na">stimulate</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">petService</span><span class="o">.</span><span class="na">adoptPet</span><span class="o">(</span><span class="n">ownerId</span><span class="o">,</span> <span class="n">pet</span><span class="o">))</span>
            <span class="o">.</span><span class="na">andWaitForEventOfType</span><span class="o">(</span><span class="nc">PetAdoptedEvent</span><span class="o">.</span><span class="na">class</span><span class="o">)</span>
            <span class="o">.</span><span class="na">matching</span><span class="o">(</span><span class="n">event</span> <span class="o">-&gt;</span> <span class="n">event</span><span class="o">.</span><span class="na">petId</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="n">pet</span><span class="o">.</span><span class="na">getId</span><span class="o">()))</span>
            <span class="o">.</span><span class="na">toArriveAndVerify</span><span class="o">(</span><span class="n">event</span> <span class="o">-&gt;</span> <span class="o">{</span>
                <span class="n">assertThat</span><span class="o">(</span><span class="n">event</span><span class="o">.</span><span class="na">ownerId</span><span class="o">()).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="n">ownerId</span><span class="o">);</span>
            <span class="o">});</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Scenario</code> API lets you verify events are published with the right data, without coupling tests to listener implementations.</p>

<h2 id="spring-modulith-defining-boundaries">Spring Modulith: Defining Boundaries</h2>

<p>Use <code class="language-plaintext highlighter-rouge">package-info.java</code> to declare modules and their allowed dependencies:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Owner Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="s">"model"</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.owner</span><span class="o">;</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ApplicationModule</span><span class="o">(</span>
    <span class="n">displayName</span> <span class="o">=</span> <span class="s">"Vet Management"</span><span class="o">,</span>
    <span class="n">allowedDependencies</span> <span class="o">=</span> <span class="o">{</span> <span class="s">"model"</span><span class="o">,</span> <span class="s">"owner::events"</span> <span class="o">}</span>  <span class="c1">// Named interface - only events subpackage</span>
<span class="o">)</span>
<span class="kn">package</span> <span class="nn">org.springframework.samples.petclinic.vet</span><span class="o">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">owner::events</code> syntax is a <strong>named interface</strong>—it exposes only the <code class="language-plaintext highlighter-rouge">events</code> subpackage while keeping <code class="language-plaintext highlighter-rouge">Owner</code>, <code class="language-plaintext highlighter-rouge">OwnerRepository</code>, and other internals hidden. Combined with the event publication registry described above, this creates truly independent modules that communicate only through events.</p>
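<p>On the owner side, the <code class="language-plaintext highlighter-rouge">events</code> subpackage opts in as a named interface via its own <code class="language-plaintext highlighter-rouge">package-info.java</code>. A sketch, assuming Spring Modulith's <code class="language-plaintext highlighter-rouge">@NamedInterface</code> annotation:</p>

```java
// owner/events/package-info.java
@org.springframework.modulith.NamedInterface("events")
package org.springframework.samples.petclinic.owner.events;
```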

<p>Verify structure in tests:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">ModulithStructureTest</span> <span class="o">{</span>
    <span class="nc">ApplicationModules</span> <span class="n">modules</span> <span class="o">=</span> <span class="nc">ApplicationModules</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="s">"org.springframework.samples.petclinic"</span><span class="o">);</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">verifiesModularStructure</span><span class="o">()</span> <span class="o">{</span>
        <span class="n">modules</span><span class="o">.</span><span class="na">verify</span><span class="o">();</span>  <span class="c1">// Fails if boundaries are violated</span>
    <span class="o">}</span>

    <span class="nd">@Test</span>
    <span class="kt">void</span> <span class="nf">generateDocumentation</span><span class="o">()</span> <span class="o">{</span>
        <span class="k">new</span> <span class="nf">Documenter</span><span class="o">(</span><span class="n">modules</span><span class="o">)</span>
            <span class="o">.</span><span class="na">writeModulesAsPlantUml</span><span class="o">()</span>
            <span class="o">.</span><span class="na">writeIndividualModulesAsPlantUml</span><span class="o">();</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h2 id="archunit-enforcing-architecture">ArchUnit: Enforcing Architecture</h2>

<p>Spring Modulith uses ArchUnit under the hood. The <code class="language-plaintext highlighter-rouge">modules.verify()</code> call checks:</p>
<ul>
  <li>No cycles between modules</li>
  <li>Modules only access their declared dependencies</li>
  <li>Internal packages are not accessed from outside</li>
</ul>

<p>jMolecules adds DDD-specific rules and layered architecture enforcement:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────────────────────────────┐
│                   @InterfaceLayer                        │
│              Controllers, REST endpoints                 │
├─────────────────────────────────────────────────────────┤
│                   @ApplicationLayer                      │
│           Application services, use cases                │
├─────────────────────────────────────────────────────────┤
│                     @DomainLayer                         │
│         Entities, Value Objects, Domain Services         │
├─────────────────────────────────────────────────────────┤
│                 @InfrastructureLayer                     │
│            Repositories, external services               │
└─────────────────────────────────────────────────────────┘

        Dependencies flow DOWN only (enforced by ArchUnit)
</code></pre></div></div>

<p>Run all rules in a test:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@AnalyzeClasses</span><span class="o">(</span><span class="n">packages</span> <span class="o">=</span> <span class="s">"org.springframework.samples.petclinic"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">JMoleculesRulesUnitTest</span> <span class="o">{</span>

    <span class="nd">@ArchTest</span>
    <span class="nc">ArchRule</span> <span class="n">dddRules</span> <span class="o">=</span> <span class="nc">JMoleculesDddRules</span><span class="o">.</span><span class="na">all</span><span class="o">();</span>

    <span class="nd">@ArchTest</span>
    <span class="nc">ArchRule</span> <span class="n">layeredArchitecture</span> <span class="o">=</span> <span class="nc">JMoleculesArchitectureRules</span><span class="o">.</span><span class="na">ensureLayering</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>When a rule is violated, your build fails:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] -
Rule 'classes that implement Entity should have identity' was violated (1 times):
    Class Pet does not have an @Id annotated field
</code></pre></div></div>

<h2 id="bytebuddy-keeping-domain-classes-clean">ByteBuddy: Keeping Domain Classes Clean</h2>

<p>This is the magic that makes everything work smoothly. ByteBuddy weaves JPA annotations at compile time based on jMolecules interfaces. Your domain classes stay clean—no <code class="language-plaintext highlighter-rouge">@Entity</code>, no <code class="language-plaintext highlighter-rouge">@Id</code> annotations polluting the model.</p>

<p>Configure the Maven plugin:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;plugin&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>net.bytebuddy<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>byte-buddy-maven-plugin<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;executions&gt;</span>
        <span class="nt">&lt;execution&gt;</span>
            <span class="nt">&lt;goals&gt;&lt;goal&gt;</span>transform-extended<span class="nt">&lt;/goal&gt;&lt;/goals&gt;</span>
        <span class="nt">&lt;/execution&gt;</span>
    <span class="nt">&lt;/executions&gt;</span>
    <span class="nt">&lt;configuration&gt;</span>
        <span class="nt">&lt;classPathDiscovery&gt;</span>true<span class="nt">&lt;/classPathDiscovery&gt;</span>
    <span class="nt">&lt;/configuration&gt;</span>
<span class="nt">&lt;/plugin&gt;</span>
</code></pre></div></div>

<p>Write clean domain classes. ByteBuddy adds the JPA infrastructure.</p>
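<p>What "clean" looks like in practice, as a self-contained sketch (hypothetical names; the jMolecules <code class="language-plaintext highlighter-rouge">AggregateRoot</code>/<code class="language-plaintext highlighter-rouge">Identifier</code> interface declarations are elided so the snippet compiles on its own; in the real project the classes would implement them and ByteBuddy would weave the JPA annotations):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Type-safe identifier: a plain value wrapper, no JPA annotations.
record OwnerId(UUID value) {
    static OwnerId random() { return new OwnerId(UUID.randomUUID()); }
}

// Value object with its invariant enforced at construction.
record PetName(String value) {
    PetName {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Pet name must not be blank");
        }
    }
}

// Aggregate root: pure domain logic, no @Entity or @Id anywhere.
class Owner {
    private final OwnerId id;
    private final List<PetName> pets = new ArrayList<>();

    Owner(OwnerId id) { this.id = id; }

    OwnerId id() { return id; }

    void adoptPet(PetName name) { pets.add(name); }

    List<PetName> pets() { return List.copyOf(pets); }
}
```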

<h2 id="the-path-to-microservices">The Path to Microservices</h2>

<p>This architecture is a stepping stone:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                     Monolith with Modules
┌─────────────────────────────────────────────────────────┐
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Owner     │  │     Vet     │  │   Billing   │     │
│  │  Context    │──│   Context   │──│   Context   │     │
│  │  (Module)   │  │  (Module)   │  │  (Module)   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
│         │                │                │             │
│         └────── Events ──┴──── Events ────┘             │
└─────────────────────────────────────────────────────────┘
                            │
                            │ Extract when ready
                            ▼
                    Microservices
┌─────────────────┐  ┌─────────────┐  ┌─────────────┐
│   Owner         │  │     Vet     │  │   Billing   │
│  Service        │──│   Service   │──│   Service   │
└─────────────────┘  └─────────────┘  └─────────────┘
        │                   │                   │
        └────── Kafka/RabbitMQ ─────────────────┘
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>DDD Concept</th>
      <th>Monolith</th>
      <th>Microservices</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bounded Context</td>
      <td>Spring Modulith Module</td>
      <td>Separate Service</td>
    </tr>
    <tr>
      <td>Aggregate</td>
      <td>Transactional boundary</td>
      <td>Service boundary</td>
    </tr>
    <tr>
      <td>Domain Event</td>
      <td><code class="language-plaintext highlighter-rouge">ApplicationEventPublisher</code></td>
      <td>Kafka/RabbitMQ message</td>
    </tr>
    <tr>
      <td>Anti-Corruption Layer</td>
      <td>Module adapter</td>
      <td>API Gateway / BFF</td>
    </tr>
  </tbody>
</table>

<p>Module boundaries are already defined and verified. Events decouple modules. Aggregate boundaries map naturally to service boundaries. Type-safe IDs prevent accidental coupling. When a module needs independent scaling, extract it—the work is already done.</p>

<h2 id="when-not-to-use-ddd">When NOT to Use DDD</h2>

<p>DDD is not free. It adds concepts, abstractions, and ceremony.</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Why DDD Is Overkill</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Simple CRUD apps</strong></td>
      <td>If your app is mostly forms over data with little business logic, a simple layered architecture suffices.</td>
    </tr>
    <tr>
      <td><strong>Short-lived projects</strong></td>
      <td>Prototypes, MVPs, or throwaway code don’t benefit from the upfront investment.</td>
    </tr>
    <tr>
      <td><strong>Small teams without domain experts</strong></td>
      <td>DDD assumes collaboration with domain experts. Without them, you’re guessing.</td>
    </tr>
    <tr>
      <td><strong>Stable, simple domains</strong></td>
      <td>If the domain is unlikely to change, the flexibility DDD provides isn’t needed.</td>
    </tr>
  </tbody>
</table>

<p>The pragmatic approach: Use DDD <strong>tactically</strong> for complex domain logic. Use DDD <strong>strategically</strong> when you have multiple teams or are planning microservices.</p>

<h2 id="project-setup">Project Setup</h2>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">&lt;!-- Spring Modulith --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-core<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.springframework.modulith<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>spring-modulith-starter-test<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- jMolecules DDD --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-ddd<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-layered-architecture<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-jpa<span class="nt">&lt;/artifactId&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- ByteBuddy for JPA annotation weaving --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-bytebuddy-nodep<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>provided<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>

<span class="c">&lt;!-- ArchUnit for architecture verification --&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>com.tngtech.archunit<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>archunit-junit5<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
<span class="nt">&lt;dependency&gt;</span>
    <span class="nt">&lt;groupId&gt;</span>org.jmolecules.integrations<span class="nt">&lt;/groupId&gt;</span>
    <span class="nt">&lt;artifactId&gt;</span>jmolecules-archunit<span class="nt">&lt;/artifactId&gt;</span>
    <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span>
<span class="nt">&lt;/dependency&gt;</span>
</code></pre></div></div>

<h2 id="quick-reference">Quick Reference</h2>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>jMolecules Type</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Aggregate Root</td>
      <td><code class="language-plaintext highlighter-rouge">AggregateRoot&lt;T, ID&gt;</code></td>
      <td>Entry point to aggregate, owns repository</td>
    </tr>
    <tr>
      <td>Entity</td>
      <td><code class="language-plaintext highlighter-rouge">Entity&lt;AggregateRoot, ID&gt;</code></td>
      <td>Has identity, belongs to aggregate</td>
    </tr>
    <tr>
      <td>Value Object</td>
      <td><code class="language-plaintext highlighter-rouge">ValueObject</code></td>
      <td>Immutable, equality by value</td>
    </tr>
    <tr>
      <td>Identifier</td>
      <td><code class="language-plaintext highlighter-rouge">Identifier</code></td>
      <td>Type-safe ID wrapper</td>
    </tr>
    <tr>
      <td>Association</td>
      <td><code class="language-plaintext highlighter-rouge">Association&lt;T, ID&gt;</code></td>
      <td>Cross-aggregate reference (ID only)</td>
    </tr>
    <tr>
      <td>Domain Event</td>
      <td><code class="language-plaintext highlighter-rouge">DomainEvent</code></td>
      <td>Notification of state change</td>
    </tr>
    <tr>
      <td>Repository</td>
      <td><code class="language-plaintext highlighter-rouge">Repository&lt;T, ID&gt;</code></td>
      <td>Aggregate persistence</td>
    </tr>
  </tbody>
</table>

<h2 id="running-the-project">Running the Project</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/jeremylem/petclinic-exploration.git
<span class="nb">cd </span>petclinic-exploration
./mvnw spring-boot:run
</code></pre></div></div>

<p>Access at http://localhost:8080</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.domainlanguage.com/ddd/">Domain-Driven Design</a> — Eric Evans (2003)</li>
  <li><a href="https://odrotbohm.de/2020/03/Implementing-DDD-Building-Blocks-in-Java/">Implementing DDD Building Blocks in Java</a> — Oliver Drotbohm</li>
  <li><a href="https://docs.spring.io/spring-modulith/reference/">Spring Modulith Reference</a></li>
  <li><a href="https://github.com/xmolecules/jmolecules">jMolecules GitHub</a></li>
  <li><a href="https://github.com/odrotbohm/tactical-ddd-workshop">Tactical DDD Workshop</a></li>
</ul>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[The classic Spring PetClinic is everyone’s first Spring app. Simple, clean, easy to follow. But as codebases grow it might turn differently. Business logic scatters across controllers and services. Changes ripple unpredictably. New developers take months to become productive.]]></summary></entry><entry><title type="html">FusionRAG Just Got Simpler: BM25 is Now in PostgreSQL</title><link href="https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL.html" rel="alternate" type="text/html" title="FusionRAG Just Got Simpler: BM25 is Now in PostgreSQL" /><published>2025-12-30T00:00:00+00:00</published><updated>2025-12-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/30/FusionRAG_PostgreSQL.html"><![CDATA[<p>In my <a href="/2025-10-28-MCP_RAG">previous post about building a Local Knowledge Base MCP Server</a>, I landed on Fusion RAG (BM25 + Vector) as the winning pattern. It caught both keywords and semantics, hitting 100% recall at 23ms.</p>

<p>The stack was: ChromaDB for vectors, the rank-bm25 Python library for keyword search, and custom fusion logic to merge results.</p>

<p>That stack just became simpler. I discovered that two PostgreSQL extensions can handle both pieces.</p>

<h2 id="the-old-architecture">The Old Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────┐     ┌─────────────┐
│  ChromaDB   │     │  rank-bm25  │
│  (Vectors)  │     │  (Keywords) │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └───────┬───────────┘
               │
        ┌──────▼──────┐
        │ Python Code │
        │ (Fusion)    │
        └─────────────┘
</code></pre></div></div>

<p>Two data stores. Sync issues. Custom fusion logic. Works, but more moving parts than necessary.</p>

<h2 id="the-new-architecture">The New Architecture</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────────────────────────────┐
│          PostgreSQL            │
│  ┌──────────┐  ┌────────────┐  │
│  │ pgvector │  │ pg_search  │  │
│  │ (Vectors)│  │   (BM25)   │  │
│  └──────────┘  └────────────┘  │
│         ┌──────────┐           │
│         │   RRF    │           │
│         │  (SQL)   │           │
│         └──────────┘           │
└────────────────────────────────┘
</code></pre></div></div>

<p>One database. One source of truth. Fusion happens in SQL.</p>
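<p>Before looking at the SQL, it helps to see what RRF actually computes: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in, with k conventionally set to 60. A plain-Java illustration (hypothetical data):</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Rrf {
    static final int K = 60; // conventional damping constant

    // Each inner list holds document ids in rank order (best first).
    static Map<String, Double> fuse(List<List<String>> rankedLists) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // 1-based rank, as in the usual RRF formulation
                scores.merge(list.get(rank), 1.0 / (K + rank + 1), Double::sum);
            }
        }
        return scores;
    }
}
```

<p>A document that ranks merely decently in both lists can out-score one that tops a single list, which is exactly the behavior that makes hybrid search robust.</p>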

<h2 id="why-bm25-matters">Why BM25 Matters</h2>

<p>Standard PostgreSQL full-text search (tsvector) is essentially boolean matching: a document either matches or it doesn’t. ts_rank adds basic frequency weighting, but nothing close to real relevance scoring.</p>

<p>BM25 solves four problems:</p>

<ul>
  <li><strong>Term Frequency Saturation</strong>: Mentioning a word 12 times doesn’t make a doc 12x more relevant. After a few mentions, additional repetitions barely help.</li>
  <li><strong>Inverse Document Frequency</strong>: Rare terms get higher weight. “Kubernetes” in a general corpus signals more than “the.”</li>
  <li><strong>Length Normalization</strong>: A focused 15-word answer beats an 80-word doc that mentions your query in passing.</li>
  <li><strong>Ranked Retrieval</strong>: Every result gets a meaningful score, not just match/no-match.</li>
</ul>

<p>BM25 is the algorithm powering Elasticsearch and Apache Lucene.</p>
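<p>All four effects fall out of one formula. A minimal single-term sketch in Java (standard parameters k1 = 1.2, b = 0.75; corpus statistics are hypothetical):</p>

```java
// Single-term BM25:
//   score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
class Bm25 {
    static final double K1 = 1.2;  // controls term-frequency saturation
    static final double B  = 0.75; // controls length normalization

    static double idf(long totalDocs, long docsWithTerm) {
        // Smoothed IDF: the rarer the term, the higher the weight.
        return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    static double score(double tf, double docLen, double avgDocLen,
                        long totalDocs, long docsWithTerm) {
        double lengthNorm = 1 - B + B * (docLen / avgDocLen);
        return idf(totalDocs, docsWithTerm) * (tf * (K1 + 1)) / (tf + K1 * lengthNorm);
    }
}
```

<p>Doubling tf from 1 to 2 does not double the score, and pushing it to 12 helps even less: that is the saturation behavior from the list above, with rare-term weighting and length penalties handled by the other two factors.</p>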

<h2 id="the-extensions">The Extensions</h2>

<p>These are not built into PostgreSQL core. They are extensions you install separately.</p>

<p><a href="https://github.com/pgvector/pgvector"><strong>pgvector</strong></a>: First released April 2021. Adds vector data types and similarity search operators. HNSW indexing (the fast one) arrived in v0.5.0 (August 2023). Now at v0.8.x with broad cloud provider support.</p>

<p><a href="https://github.com/paradedb/paradedb"><strong>pg_search</strong></a>: First stable release November 2023 (originally called pg_bm25). Built on <a href="https://github.com/quickwit-oss/tantivy">Tantivy</a>, the Rust alternative to Lucene. Adds BM25 scoring and full-text search operators.</p>

<h3 id="pgvector-vs-chromadb">pgvector vs ChromaDB</h3>

<p>Here’s how they compare:</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>ChromaDB</th>
      <th>pgvector</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Type</strong></td>
      <td>Standalone vector database</td>
      <td>PostgreSQL extension</td>
    </tr>
    <tr>
      <td><strong>Best for</strong></td>
      <td>Prototyping, up in 5 minutes</td>
      <td>Production, existing PostgreSQL stack</td>
    </tr>
    <tr>
      <td><strong>Concurrency</strong></td>
      <td>Degrades under load</td>
      <td>Handles concurrent queries well</td>
    </tr>
    <tr>
      <td><strong>SQL joins</strong></td>
      <td>Separate data store, needs sync</td>
      <td>Native joins with relational data</td>
    </tr>
    <tr>
      <td><strong>ACID</strong></td>
      <td>No</td>
      <td>Full transactions</td>
    </tr>
    <tr>
      <td><strong>Scaling</strong></td>
      <td>Purpose-built for vectors</td>
      <td>PostgreSQL scaling patterns</td>
    </tr>
  </tbody>
</table>

<p>ChromaDB excels at rapid prototyping. Single queries are fast. But under concurrent load, pgvector tends to handle it better due to PostgreSQL’s mature connection pooling and query optimization.</p>

<p>If you already run PostgreSQL, you eliminate a separate data store. User metadata, document content, and embeddings live in one place. One backup. One connection pool. No sync logic.</p>

<h2 id="setting-it-up">Setting It Up</h2>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">vector</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">pg_search</span><span class="p">;</span>
</code></pre></div></div>

<p>Create your table with both vector and text columns:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">documents</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">SERIAL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">content</span> <span class="nb">TEXT</span><span class="p">,</span>
    <span class="n">embedding</span> <span class="n">vector</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Create both indexes:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Vector index (HNSW for fast approximate search)</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_docs_vector</span> <span class="k">ON</span> <span class="n">documents</span>
<span class="k">USING</span> <span class="n">hnsw</span> <span class="p">(</span><span class="n">embedding</span> <span class="n">vector_cosine_ops</span><span class="p">);</span>

<span class="c1">-- BM25 index</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_docs_bm25</span> <span class="k">ON</span> <span class="n">documents</span>
<span class="k">USING</span> <span class="n">bm25</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">content</span><span class="p">)</span>
<span class="k">WITH</span> <span class="p">(</span><span class="n">key_field</span><span class="o">=</span><span class="s1">'id'</span><span class="p">);</span>
</code></pre></div></div>

<h2 id="hybrid-search-in-pure-sql">Hybrid Search in Pure SQL</h2>

<p>Reciprocal Rank Fusion (RRF) in a single query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">bm25_results</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">paradedb</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">rank</span>
  <span class="k">FROM</span> <span class="n">documents</span>
  <span class="k">WHERE</span> <span class="n">content</span> <span class="o">@@@</span> <span class="s1">'kubernetes deployment strategy'</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">paradedb</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">DESC</span>
  <span class="k">LIMIT</span> <span class="mi">20</span>
<span class="p">),</span>

<span class="n">vector_results</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">embedding</span> <span class="o">&lt;=&gt;</span> <span class="err">$</span><span class="mi">1</span><span class="p">)</span> <span class="k">AS</span> <span class="n">rank</span>
  <span class="k">FROM</span> <span class="n">documents</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">embedding</span> <span class="o">&lt;=&gt;</span> <span class="err">$</span><span class="mi">1</span>
  <span class="k">LIMIT</span> <span class="mi">20</span>
<span class="p">),</span>

<span class="n">fused</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">bm25_results</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">vector_results</span>
<span class="p">)</span>

<span class="k">SELECT</span>
  <span class="n">d</span><span class="p">.</span><span class="n">id</span><span class="p">,</span>
  <span class="n">d</span><span class="p">.</span><span class="n">content</span><span class="p">,</span>
  <span class="k">SUM</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">score</span><span class="p">)</span> <span class="k">AS</span> <span class="n">relevance</span>
<span class="k">FROM</span> <span class="n">fused</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">documents</span> <span class="n">d</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">d</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">d</span><span class="p">.</span><span class="n">content</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">relevance</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>The magic number 60 in RRF controls score decay. Lower values favor top results more aggressively.</p>
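<p>To see that decay in isolation, here is a small standalone Python sketch (purely illustrative, not part of the query):</p>

```python
# Illustrative: the RRF contribution of a result at a given 1-based rank.
def rrf_score(rank, k=60):
    return 1.0 / (k + rank)

# With k=10, rank 1 scores roughly 1.8x rank 10; with k=60 the gap shrinks
# to about 1.15x, so a lower constant rewards top-ranked hits more aggressively.
steep = rrf_score(1, 10) / rrf_score(10, 10)
flat = rrf_score(1, 60) / rrf_score(10, 60)
```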

<h2 id="weighted-fusion">Weighted Fusion</h2>

<p>In my MCP server, I used 30% keywords, 70% semantics. Same thing in SQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fused</span> <span class="k">AS</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">3</span> <span class="o">*</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">bm25_results</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7</span> <span class="o">*</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">/</span> <span class="p">(</span><span class="mi">60</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> <span class="k">AS</span> <span class="n">score</span> <span class="k">FROM</span> <span class="n">vector_results</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Tune based on your data. Technical documentation with exact terms? Bump BM25 weight. Conversational queries? Favor vectors.</p>
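<p>If it helps to experiment with weights before touching the query, the fusion step can be mirrored in a few lines of Python (a hypothetical helper; the function and argument names are invented here):</p>

```python
# Weighted Reciprocal Rank Fusion over two rank maps (doc_id -> 1-based rank).
def fuse(bm25_ranks, vector_ranks, w_bm25=0.3, w_vector=0.7, k=60):
    scores = {}
    for doc_id, rank in bm25_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + w_bm25 / (k + rank)
    for doc_id, rank in vector_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + w_vector / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# At 30/70 the semantically top-ranked doc wins; flip the weights and the
# keyword-ranked doc takes over.
order = fuse({"a": 1, "b": 5}, {"b": 1, "a": 8})
```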

<h2 id="what-this-replaces">What This Replaces</h2>

<table>
  <thead>
    <tr>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ChromaDB</td>
      <td>pgvector</td>
    </tr>
    <tr>
      <td>rank-bm25 (Python)</td>
      <td>pg_search</td>
    </tr>
    <tr>
      <td>Custom fusion code</td>
      <td>SQL CTE (Common Table Expression)</td>
    </tr>
    <tr>
      <td>Two data stores</td>
      <td>One database</td>
    </tr>
    <tr>
      <td>Sync logic</td>
      <td>ACID transactions</td>
    </tr>
  </tbody>
</table>

<h2 id="operational-simplicity">Operational Simplicity</h2>

<p>Fewer dependencies, but also:</p>

<ul>
  <li><strong>Backups</strong>: One database to back up.</li>
  <li><strong>Consistency</strong>: ACID transactions across text and vectors.</li>
  <li><strong>Scaling</strong>: PostgreSQL scaling patterns you already know.</li>
  <li><strong>Monitoring</strong>: One set of metrics.</li>
</ul>

<p>When your documents update, both indexes update atomically. No sync jobs. No eventual consistency headaches.</p>
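<p>From application code, an atomic write looks like this sketch (hypothetical helper; <code>cur</code> would be any DB-API cursor such as psycopg2’s, shown here with a stand-in recorder so the snippet runs without a database):</p>

```python
# Upsert content and embedding together; within one transaction, both the
# HNSW and BM25 indexes see the change atomically.
def upsert_document(cur, doc_id, content, embedding):
    vec_literal = "[" + ",".join(str(x) for x in embedding) + "]"  # pgvector text format
    cur.execute(
        "INSERT INTO documents (id, content, embedding) VALUES (%s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content, "
        "embedding = EXCLUDED.embedding",
        (doc_id, content, vec_literal),
    )

class RecordingCursor:
    """Stand-in that records execute() calls instead of talking to PostgreSQL."""
    def __init__(self):
        self.calls = []
    def execute(self, sql, params=None):
        self.calls.append((sql, params))

cur = RecordingCursor()
upsert_document(cur, 1, "hello world", [0.1, 0.2])
```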

<h2 id="when-to-still-use-elasticsearch">When to Still Use Elasticsearch</h2>

<p>The 1% cases:</p>

<ul>
  <li>Multi-petabyte scale with sub-100ms requirements</li>
  <li>Complex faceted search with dozens of filters</li>
  <li>Geo-spatial + full-text + vector in the same query at massive scale</li>
</ul>

<p>For the rest of us building RAG pipelines, knowledge bases, and semantic search? PostgreSQL handles it.</p>

<h2 id="trying-it-out">Trying It Out</h2>

<p>Easiest path: the <a href="https://hub.docker.com/r/paradedb/paradedb">ParadeDB Docker image</a> comes with both extensions pre-installed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">--name</span> paradedb <span class="nt">-e</span> <span class="nv">POSTGRES_PASSWORD</span><span class="o">=</span>password <span class="nt">-p</span> 5432:5432 paradedb/paradedb
</code></pre></div></div>

<h2 id="next-step-for-the-mcp-server">Next Step for the MCP Server</h2>

<p>The <a href="https://github.com/j3r3myfoobar/knowledge_base_mcp">Knowledge Base MCP Server</a> currently uses ChromaDB + rank-bm25. Migrating to PostgreSQL would:</p>

<ol>
  <li>Remove two dependencies (chromadb, rank-bm25)</li>
  <li>Simplify deployment (just needs a PostgreSQL connection)</li>
  <li>Enable SQL-based analytics on search patterns</li>
  <li>Make it easier to integrate with existing enterprise databases</li>
</ol>

<p>The fusion logic moves from Python to a SQL view. The MCP server becomes a thin query layer.</p>

<p>Same accuracy. Simpler stack.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[In my previous post about building a Local Knowledge Base MCP Server, I landed on Fusion RAG (BM25 + Vector) as the winning pattern. It caught both keywords and semantics, hitting 100% recall at 23ms.]]></summary></entry><entry><title type="html">From Continuous Delivery to Continuous Deployment</title><link href="https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD.html" rel="alternate" type="text/html" title="From Continuous Delivery to Continuous Deployment" /><published>2025-12-29T00:00:00+00:00</published><updated>2025-12-29T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/12/29/CD_to_CD.html"><![CDATA[<p>After reading <a href="https://itrevolution.com/product/accelerate/"><em>Accelerate</em></a> by Nicole Forsgren, Jez Humble, and Gene Kim, I started rethinking what CD actually means. For years, I worked in environments where CD meant Continuous Delivery: code ready to deploy, waiting for approval.</p>

<p>It is still CI/CD; the difference is how fast changes reach production.</p>

<h2 id="delivery-vs-deployment">Delivery vs. Deployment</h2>

<p><strong>Continuous Delivery</strong>: Code is built, tested, and pushed to a staging environment automatically. It can be deployed to production at any time, but a human decision or a scheduled window triggers the go-live.</p>

<p><strong>Continuous Deployment</strong>: Every change that passes the automated test suite is deployed to Production immediately, without human intervention.</p>

<p>In regulated environments, teams simulate Continuous Deployment by automating Change Request ticket creation and approval based on test evidence. The old world was a manager clicking Approve in ServiceNow. The new world is automated governance: the pipeline generates an attestation document proving that tests passed, security scans completed, and peer review happened. The auditor is satisfied without stopping the assembly line.</p>
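<p>The attestation itself can be as simple as a JSON document the pipeline emits and signs. A rough sketch (field names invented; real schemas depend on your ITSM tool):</p>

```python
import json

# Build the evidence record a pipeline could attach to an automated change request.
def build_attestation(commit, tests_passed, scans_clean, reviewers):
    record = {
        "commit": commit,
        "tests_passed": tests_passed,
        "security_scans_clean": scans_clean,
        "peer_reviewers": reviewers,
        # Auto-approve only when every gate is green and at least one human reviewed.
        "auto_approved": tests_passed and scans_clean and len(reviewers) >= 1,
    }
    return json.dumps(record, sort_keys=True)
```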

<h2 id="key-concept-decoupling-deployment-from-release">Key Concept: Decoupling Deployment from Release</h2>

<p>This is the most important concept I took from Accelerate:</p>

<ul>
  <li><strong>Deployment</strong> (Technical Act): Moving code to the production server. Happens continuously.</li>
  <li><strong>Release</strong> (Business Act): Making the feature visible to the customer. Happens when the business is ready.</li>
</ul>

<p>You can deploy on Tuesday at 10 AM, but release on Friday for a Business launch.</p>

<h2 id="the-mechanic-feature-flags">The Mechanic: Feature Flags</h2>

<p>How do you deploy code in the middle of a sprint without breaking the user experience?</p>

<p><strong>Day 3 of Sprint:</strong> You finish the backend for a new payment feature. It deploys to Prod immediately. Safe because the code is wrapped in a Feature Flag set to <code class="language-plaintext highlighter-rouge">False</code>. Users cannot hit it.</p>

<p><strong>Day 7 of Sprint:</strong> The UI is done. It deploys to Prod. Flag is still <code class="language-plaintext highlighter-rouge">False</code>.</p>

<p><strong>End of Sprint (Review):</strong> You toggle the flag to <code class="language-plaintext highlighter-rouge">True</code> only for internal users to demo it in Production.</p>

<p><strong>Release Day:</strong> The business toggles the flag to <code class="language-plaintext highlighter-rouge">True</code> for 100% of users.</p>

<h3 id="feature-flag-frameworks">Feature Flag Frameworks</h3>

<ul>
  <li><a href="https://launchdarkly.com/"><strong>LaunchDarkly</strong></a> (SaaS): Deep audit logging, RBAC, SSO integration. Expensive at scale.</li>
  <li><a href="https://www.getunleash.io/"><strong>Unleash</strong></a> (Open Source, Self-Hosted): Self-host inside your private cloud. No data leaves your network.</li>
  <li><a href="https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html"><strong>AWS AppConfig</strong></a>: Good if you want to avoid buying another tool.</li>
  <li><a href="https://openfeature.dev/"><strong>OpenFeature</strong></a> (CNCF Project): An open specification that lets you swap vendors without rewriting application code.</li>
</ul>

<h3 id="feature-flags-in-code">Feature Flags in Code</h3>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// OLD WAY: Hardcoded or Config File</span>
<span class="k">if</span> <span class="o">(</span><span class="n">config</span><span class="o">.</span><span class="na">isNewPaymentFlowEnabled</span><span class="o">())</span> <span class="o">{</span>
    <span class="n">runNewPaymentLogic</span><span class="o">();</span>
<span class="o">}</span>

<span class="c1">// NEW WAY: Feature Flag SDK</span>
<span class="kt">boolean</span> <span class="n">showNewFeature</span> <span class="o">=</span> <span class="n">featureFlagClient</span><span class="o">.</span><span class="na">boolVariation</span><span class="o">(</span>
    <span class="s">"new-payment-flow"</span><span class="o">,</span> <span class="n">userContext</span><span class="o">,</span> <span class="kc">false</span><span class="o">);</span>

<span class="k">if</span> <span class="o">(</span><span class="n">showNewFeature</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">runNewPaymentLogic</span><span class="o">();</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
    <span class="n">runOldPaymentLogic</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The real power is Targeting Rules. The code is deployed to all servers, but you control who can access the feature through the dashboard, without redeploying:</p>

<ul>
  <li>Enable only for QA: if <code class="language-plaintext highlighter-rouge">user_id = "qa_tester_bob"</code>, return <code class="language-plaintext highlighter-rouge">True</code>.</li>
  <li>Enable for a region: if <code class="language-plaintext highlighter-rouge">user_region = "EU"</code>, return <code class="language-plaintext highlighter-rouge">True</code>.</li>
  <li>Gradual business rollout: 5% of users today, 50% next week, 100% after validation.</li>
</ul>
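<p>Under the hood, rule evaluation is straightforward. A toy version (real SDKs like LaunchDarkly or Unleash implement a much richer form of the same idea):</p>

```python
import hashlib

# Evaluate targeting rules for a flag: explicit users first, then regions,
# then a stable percentage rollout based on hashing the user id into a 0-99 bucket.
def flag_enabled(user_id, region, allow_users=(), allow_regions=(), rollout_pct=0):
    if user_id in allow_users:
        return True
    if region in allow_regions:
        return True
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

<p>Because the bucket is derived from a hash of the user id, each user gets a consistent answer as the rollout grows from 5% to 100%.</p>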

<p>This is different from Canary Deployment. Canary is about infrastructure: deploy to a small percentage of servers to check if the code is stable. Feature Flags are about business logic: the code runs everywhere, but you choose which users can see the feature.</p>

<h2 id="safe-deployment-canary-releases">Safe Deployment: Canary Releases</h2>

<p>To deploy continuously without crashing Prod, teams use Canary Deployments:</p>

<ol>
  <li>Deploy v2.0 alongside v1.0.</li>
  <li>Route a small percentage of traffic to v2.0.</li>
  <li>Automated monitoring checks for errors (HTTP 500s, latency spikes).</li>
  <li>If error rate &lt; threshold, gradually ramp up traffic to 100%.</li>
  <li>If errors spike, automatically rollback to v1.0.</li>
</ol>
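<p>The control loop behind steps 3-5 can be sketched in a few lines (threshold and step size are invented for illustration; a real controller pulls error rates from Prometheus or similar):</p>

```python
# Decide the next traffic percentage for the canary, or signal a rollback.
def canary_step(error_rate_v2, error_rate_v1, current_pct, threshold=0.01, step=25):
    if error_rate_v2 > error_rate_v1 + threshold:
        return None  # errors spiked: route all traffic back to v1
    return min(100, current_pct + step)  # healthy: ramp up gradually
```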

<p>The goal: Make deployment a non-event that happens constantly.</p>

<h2 id="environments-ephemeral-over-static">Environments: Ephemeral over Static</h2>

<p>In the traditional model, you had static environments: Dev, QA, UAT, Prod. These servers were always running, often drifted from Prod configurations, and were bottlenecks.</p>

<p>In Continuous Deployment, this changes to Ephemeral Environments:</p>

<ul>
  <li><strong>Local</strong>: Developer works on their machine (Docker to mimic Prod).</li>
  <li><strong>Preview Environment</strong>: Auto-created when a PR is opened. Tests run here. QA clicks a link to verify. Destroyed after merge.</li>
  <li><strong>Staging (Pre-Prod)</strong>: Single environment mirroring Prod exactly. Auto-deploys to Prod if smoke tests pass.</li>
</ul>

<p>Do you need a permanent QA server? No. You create a fresh one for every feature, test it, and destroy it.</p>

<h2 id="branching-trunk-based-development">Branching: Trunk-Based Development</h2>

<p>The industry standard for Continuous Deployment is Trunk-Based Development.</p>

<p><strong>Old Way (GitFlow):</strong> You have Master, Develop, Feature-X, Release-1.0. Code lives in a feature branch for weeks. Merging is painful.</p>

<p><strong>New Way (Trunk-Based):</strong></p>

<ul>
  <li>One Main Branch, usually called <code class="language-plaintext highlighter-rouge">main</code> or <code class="language-plaintext highlighter-rouge">trunk</code>.</li>
  <li>Short-Lived Feature Branches: Developers create a branch and merge it back to <code class="language-plaintext highlighter-rouge">main</code> within 24 hours.</li>
  <li>No Release Branches: You deploy a specific commit from <code class="language-plaintext highlighter-rouge">main</code>.</li>
</ul>

<p>How can you merge unfinished work? You merge the backend code but hide it behind a Feature Flag. Your code is integrated with everyone else’s code daily. You never have merge conflicts because you never drift far from <code class="language-plaintext highlighter-rouge">main</code>.</p>

<h2 id="measuring-success-dora-metrics">Measuring Success: DORA Metrics</h2>

<p>The Accelerate book introduces four metrics that have become the industry standard (<a href="https://dora.dev/">DORA</a>):</p>

<ul>
  <li><strong>Deployment Frequency</strong>: How often you deploy to production. High performers: multiple times per day.</li>
  <li><strong>Lead Time for Changes</strong>: Time from commit to production. High performers: less than one hour.</li>
  <li><strong>Time to Restore Service</strong>: How quickly you recover from incidents. High performers: less than one hour.</li>
  <li><strong>Change Failure Rate</strong>: Percentage of deployments causing failures. High performers: 0-15%.</li>
</ul>

<p>The counter-intuitive finding: Teams that deploy multiple times a day have lower change failure rates than teams that deploy monthly. Smaller changes mean smaller blast radius and easier rollback.</p>

<h2 id="accelerating-even-faster">Accelerating Even Faster</h2>

<p>Once deploying is no longer the challenge, the focus shifts to making pipelines smarter.</p>
<h3 id="1-predictive-test-selection">1. Predictive Test Selection</h3>

<p>Running 2,000 tests for a one-line CSS change is a waste of resources. If your regression suite takes 30 minutes, that adds up when you deploy multiple times a day.</p>

<p><a href="https://www.cloudbees.com/capabilities/cloudbees-smart-tests"><strong>CloudBees Smart Tests</strong></a> (formerly Launchable) analyzes your Git history and test failures. It tells your pipeline: only run these 50 tests, skip the other 2,000.</p>

<p><a href="https://gradle.com/develocity/"><strong>Gradle Develocity</strong></a> (formerly Gradle Enterprise) is the gold standard for Java/Spring shops. It caches test results and uses ML to skip tests that haven’t been impacted by your code changes.</p>

<p><a href="https://www.harness.io/products/continuous-integration"><strong>Harness Test Intelligence</strong></a> builds a call graph of your code. If you change <code class="language-plaintext highlighter-rouge">Login.java</code>, it knows exactly which tests cover that file.</p>

<p><strong>DORA Impact:</strong> Reduces <strong>Lead Time for Changes</strong>. By cutting feedback time from 30 mins to 5 mins, developers stay in flow, and code moves to staging hours faster.</p>

<h3 id="2-deployment-risk-scoring">2. Deployment Risk Scoring</h3>

<p>Most CD tools like <a href="https://argoproj.github.io/cd/">ArgoCD</a> are dumb. They just sync Git to Cluster. They don’t know if the app is actually working, only that the pod is running.</p>

<p><a href="https://www.opsmx.com/autopilot-overview/"><strong>OpsMx Autopilot</strong></a>: The brain you attach to your muscle (ArgoCD or <a href="https://spinnaker.io/">Spinnaker</a>). It connects to your logs (<a href="https://www.splunk.com/">Splunk</a>, <a href="https://www.datadoghq.com/">Datadog</a>) and metrics (<a href="https://prometheus.io/">Prometheus</a>). When you deploy to Staging, it compares the new version against the old one in real-time and assigns a Risk Score (0-100). If the score drops below 90, it automatically commands ArgoCD to rollback. This automates the Canary Analysis that usually requires a senior engineer staring at a dashboard for 30 minutes.</p>

<p><a href="https://www.harness.io/"><strong>Harness Continuous Verification</strong></a>: Similar approach. Connects to your monitoring. Uses ML to compare versions. Auto-rolls back if errors deviate by more than 1%.</p>

<p>This replaces blind approval rules with smart rules based on actual risk. For regulated industries, these tools also generate the digital paper trail that satisfies compliance.</p>

<p><strong>DORA Impact:</strong> Lowers <strong>Change Failure Rate</strong>. By catching weak signals (like a 2% latency increase) in Staging, you prevent bad code from ever hitting Production, keeping the failure rate close to zero.</p>

<h3 id="3-smart-root-cause-analysis">3. Smart Root Cause Analysis</h3>

<p>When a build fails, someone has to dig through 1,000 lines of logs. When Production alerts fire at 3 AM, someone has to correlate logs, traces, and recent deployments.</p>

<p><a href="https://komodor.com/"><strong>Komodor</strong></a> tracks every single change in Kubernetes (config, deploy, health check) and correlates it to failures. Like a Time Machine for K8s.</p>

<p><a href="https://www.dynatrace.com/platform/artificial-intelligence/"><strong>Dynatrace Davis AI</strong></a> uses deterministic AI (not just ML guessing) to analyze the dependency graph. It can tell you: “The user login failed because the backend SQL database was locked by the Inventory Service.”</p>

<p><a href="https://www.datadoghq.com/product/platform/bits-ai/"><strong>Datadog Bits AI</strong></a> lets you ask in natural language: “Who deployed to the payment service right before the latency spike?” It correlates the Git commit to the error logs.</p>

<p><a href="https://www.harness.io/"><strong>Harness AIDA</strong></a> (AI DevOps Agent) scans logs and Git history, then generates a summary: “Failure likely caused by memory leak in commit 8a4b2 by User X.”</p>

<p><strong>DORA Impact:</strong> Improves <strong>Time to Restore Service</strong>. Instead of spending 4 hours investigating what broke, the AI tells you the root cause in seconds, allowing you to fix (or rollback) immediately.</p>

<h3 id="4-gitops-from-push-to-pull">4. GitOps: From Push to Pull</h3>

<p>This is the standard operating model now. You don’t use a UI like Jenkins to deploy. You commit a change to a config file in Git, and an agent inside the Production cluster pulls the change in.</p>

<p><strong>The Old Way (Push Model, Jenkins style):</strong></p>

<ol>
  <li>Developer commits code.</li>
  <li>Jenkins builds the artifact.</li>
  <li>Jenkins runs: <code class="language-plaintext highlighter-rouge">kubectl apply -f my-app.yaml</code>.</li>
</ol>

<p>The Risk: A debug flag gets enabled directly in the cluster during troubleshooting. The issue gets fixed, but the flag stays on for weeks. Git and Production are now out of sync.</p>

<p><strong>The New Way (Pull Model):</strong></p>

<ol>
  <li>Developer commits code or config to Git.</li>
  <li>CI only updates a Docker image registry.</li>
  <li>An Agent living inside the Production Cluster asks: Does my current state match what is in Git?</li>
  <li>It sees a new image tag in Git. It pulls the change and applies it.</li>
</ol>

<p>Why is this safer?</p>

<ul>
  <li><strong>Drift Detection</strong>: If someone changes a setting in Prod manually, the agent detects the drift immediately and can auto-revert.</li>
  <li><strong>Security</strong>: You don’t give your CI server Admin Access to your Prod cluster. The cluster reaches out to Git; nothing reaches in.</li>
</ul>
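<p>The agent’s core job is a reconcile loop. Stripped to its essence (the dicts stand in for manifests in Git and live objects in the cluster; this is an illustration, not ArgoCD’s actual code):</p>

```python
# Compute the actions needed to make the cluster match what Git declares.
def reconcile(desired, live):
    actions = []
    for name, spec in desired.items():
        if live.get(name) != spec:
            actions.append(("apply", name))  # create, update, or revert drift
    for name in live:
        if name not in desired:
            actions.append(("delete", name))  # prune what Git no longer declares
    return actions
```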

<p><a href="https://argoproj.github.io/cd/"><strong>ArgoCD</strong></a>: Best UI for visualizing Kubernetes. Logs exactly who merged the PR that triggered the sync.</p>

<p><a href="https://fluxcd.io/"><strong>Flux v2</strong></a>: If you want it invisible. No UI; it just works in the background.</p>

<p><a href="https://www.harness.io/"><strong>Harness GitOps</strong></a>: Managed ArgoCD with an enterprise UI and dashboards.</p>

<p>For developers, this complexity is often hidden behind an Internal Developer Portal (IDP) like <a href="https://backstage.io/">Backstage</a>. A junior dev clicks “Deploy to Staging” in a web UI; under the hood, it commits to a GitOps repo and ArgoCD syncs the cluster. They never need to become Kubernetes experts.</p>

<p><strong>DORA Impact:</strong> Increases <strong>Deployment Frequency</strong>. Because deployment is purely declarative (a git commit), it removes the friction of manual deployments, encouraging teams to ship smaller batches more often.</p>

<h3 id="5-finops-integration">5. FinOps Integration</h3>

<p>The boundary between development teams and infrastructure teams is blurring. A feature can change the infrastructure it runs on, so its cost impact should be assessed during the CI/CD phase.</p>

<p>In the cloud, developers have infinite resources. A junior dev can accidentally provision a database that costs $5,000/month, and you won’t know until the bill arrives 30 days later. The fix: shift cost analysis into the Pull Request.</p>

<p><strong>For Terraform:</strong> The industry standard is <a href="https://www.infracost.io/">Infracost</a>. It parses your Terraform code, compares it against a cloud pricing API, and posts a comment on your Pull Request showing the price difference.</p>

<p>Developer changes an AWS EC2 instance from <code class="language-plaintext highlighter-rouge">t3.micro</code> to <code class="language-plaintext highlighter-rouge">m5.large</code>. CI runs <code class="language-plaintext highlighter-rouge">infracost breakdown --path .</code> and comments on the PR:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cost Increase: +$65/month
Create aws_instance.app_server: +$72.00
Remove aws_instance.old_server: -$7.00
</code></pre></div></div>

<p><strong>For Kubernetes/Helm:</strong> Harder because Kubernetes files list generic CPU/RAM requests, not instance types. The cost depends on which node the pod lands on.</p>

<p><a href="https://www.kubecost.com/">Kubecost</a> / <a href="https://opencost.io/">OpenCost</a> handles this with the <code class="language-plaintext highlighter-rouge">kubectl cost predict</code> command. These tools answer the “why” question: which team’s microservice is hoarding RAM, which namespace is over-provisioned. The trick: you cannot scan a raw Helm chart easily. You must render it first with <code class="language-plaintext highlighter-rouge">helm template . &gt; final_manifest.yaml</code>, then run the prediction.</p>

<p><a href="https://www.vantage.sh/"><strong>Vantage</strong></a>: Works at the cloud bill level (AWS/GCP invoices) rather than the cluster level. It tells you how much you owe; Kubecost tells you why. Good for Cost per Tenant views across your entire cloud footprint.</p>

<p><a href="https://www.harness.io/"><strong>Harness Cloud Cost Management</strong></a>: Does both Terraform and Kubernetes natively. Has a policy engine built-in: you can set a rule to block any PR that increases the monthly forecast by more than $500.</p>

<p><strong>DORA Impact:</strong> While cost is not a standard DORA metric, it acts as a stability guardrail. It prevents financial incidents (blowing the budget), giving management the confidence to allow high-frequency deployments without financial risk.</p>

<h2 id="the-bleeding-edge-agentic-devops">The Bleeding Edge: Agentic DevOps</h2>

<p>The industry is moving from Automated Pipelines to AI Agents.</p>

<p><strong>Old Way (Automated):</strong> The pipeline fails. You get an alert. You read the log. You fix it.</p>

<p><strong>New Way (Agentic):</strong> The pipeline fails. An AI Agent reads the log, writes a fix, and opens a PR for you to approve.</p>

<p>This is what high-performing companies are building towards. Tools like OpsMx (Verification) and Komodor (Troubleshooting) are the answers to the 3 AM problem. They use data to fix or revert things before you even open your laptop.</p>

<h2 id="summary">Summary</h2>

<p>Tools I looked at:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Tool</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Predictive Test Selection</td>
      <td><a href="https://www.cloudbees.com/capabilities/cloudbees-smart-tests">CloudBees Smart Tests</a>, <a href="https://gradle.com/develocity/">Gradle Develocity</a></td>
    </tr>
    <tr>
      <td>Deployment Risk Scoring</td>
      <td><a href="https://www.opsmx.com/autopilot-overview/">OpsMx Autopilot</a></td>
    </tr>
    <tr>
      <td>Root Cause Analysis</td>
      <td><a href="https://komodor.com/">Komodor</a>, <a href="https://www.dynatrace.com/platform/artificial-intelligence/">Dynatrace Davis AI</a>, <a href="https://www.datadoghq.com/product/platform/bits-ai/">Datadog Bits AI</a></td>
    </tr>
    <tr>
      <td>GitOps</td>
      <td><a href="https://argoproj.github.io/cd/">ArgoCD</a>, <a href="https://fluxcd.io/">Flux v2</a></td>
    </tr>
    <tr>
      <td>FinOps</td>
      <td><a href="https://www.infracost.io/">Infracost</a> (Terraform), <a href="https://www.kubecost.com/">Kubecost</a> (K8s)</td>
    </tr>
  </tbody>
</table>

<p><a href="https://www.harness.io/">Harness</a> claims to cover all of this in one platform (I have not tested these features myself, as Harness does not offer easy access to trial their advanced capabilities):</p>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Harness Module</th>
      <th>How it works</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Predictive Test Selection</td>
      <td><a href="https://www.harness.io/products/continuous-integration">Test Intelligence</a></td>
      <td>Builds a call graph, runs only relevant tests</td>
    </tr>
    <tr>
      <td>Deployment Risk Scoring</td>
      <td>Continuous Verification</td>
      <td>ML compares new vs old version, auto-rollback if errors spike</td>
    </tr>
    <tr>
      <td>Smart Root Cause Analysis</td>
      <td>AIDA</td>
      <td>Scans logs and Git history, generates failure summary</td>
    </tr>
    <tr>
      <td>GitOps</td>
      <td>Harness GitOps</td>
      <td>Managed ArgoCD with enterprise UI</td>
    </tr>
    <tr>
      <td>FinOps</td>
      <td>Cloud Cost Management</td>
      <td>Calculates cost impact in PR, can block on budget</td>
    </tr>
  </tbody>
</table>

<h2 id="recommended-reading">Recommended Reading</h2>

<p><a href="https://itrevolution.com/product/accelerate/"><strong>Accelerate: The Science of Lean Software and DevOps</strong></a> (Forsgren, Humble, Kim)</p>

<p>This book uses rigorous statistical data to prove that High Performers who deploy multiple times a day have lower change failure rates than Low Performers who deploy monthly. It is essential reading for understanding why Continuous Deployment is actually safer than the traditional approach.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[After reading Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim, I started rethinking what CD actually means. For years, I worked in environments where CD meant Continuous Delivery: code ready to deploy, waiting for approval.]]></summary></entry><entry><title type="html">Running AI Models in Java without Python</title><link href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html" rel="alternate" type="text/html" title="Running AI Models in Java without Python" /><published>2025-11-30T00:00:00+00:00</published><updated>2025-11-30T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/11/30/AI_Java</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html"><![CDATA[<p>One of the most common misconceptions in AI engineering is that you always need a Python runtime to execute models.</p>

<p>This is where <strong>ONNX (Open Neural Network Exchange)</strong> is critical. In 2017, Microsoft and Facebook realized they had a problem: framework lock-in.</p>

<p>At the time, if you trained a model in PyTorch, you were stuck there. Deploying it to production often meant rewriting code or using slow wrappers.</p>

<p>Their goal was to create a “Universal Interchange Format”: a standard that allowed models to be trained in flexible frameworks (like PyTorch) but run on high-performance inference engines (like ONNX Runtime) without being tied to the original training environment.</p>

<p>Google did not join the ONNX partnership initially, preferring its own ecosystem, but over time ONNX became the standard bridge between PyTorch and TensorFlow.</p>

<p>In the modern AI landscape, 99% of model development happens in one of two places: PyTorch or TensorFlow.</p>

<h2 id="problem">Problem</h2>

<p>Usually, running AI in Java means creating a sidecar Python service or using slow HTTP bridges.</p>

<p>If you are a Java shop building AI features, you might go through the following steps:</p>

<ol>
  <li>
    <p>Data Science team builds a model in PyTorch/TensorFlow.</p>
  </li>
  <li>
    <p>Engineering team has to wrap it in a Flask/FastAPI container.</p>
  </li>
  <li>
    <p>You end up managing two languages, two CI/CD pipelines, and a massive Python runtime (often 3GB+) just to perform a simple calculation.</p>
  </li>
</ol>

<p>Only recently, with the maturation of LangChain4j and better Java bindings for ONNX Runtime, has it become viable to replace Python entirely in enterprise backends.</p>

<h2 id="onnx-is-like-the-pdf-for-machine-learning-models">ONNX is like the “PDF” for machine learning models.</h2>

<ul>
  <li>
    <p>Python (PyTorch/TensorFlow): It’s the editor where you create, train, and tweak the model. It’s heavy and complex.</p>
  </li>
  <li>
    <p>ONNX: the exported, static artifact. It serializes the model into a computation graph (a set of nodes and edges representing mathematical operations).</p>
  </li>
</ul>

<h2 id="the-runtime-architecture">The Runtime Architecture</h2>

<p>When this system runs, it doesn’t spin up a hidden Python process or make HTTP calls to a Flask server. It uses the Microsoft ONNX Runtime (ORT).</p>

<ul>
  <li>
    <p>ORT is a high-performance inference engine written in C++.</p>
  </li>
  <li>
    <p>The Java application communicates with ORT via the Java Native Interface (JNI).</p>
  </li>
  <li>
    <p>This allows us to run the model on the CPU (using AVX2/AVX512 instructions) or GPU directly from Java, often faster than the original Python implementation because we bypass the
Python Global Interpreter Lock (GIL).</p>
  </li>
</ul>

<p><strong>Result</strong>: you get near-metal performance with zero Python interpreter overhead, no CPython Global Interpreter Lock (GIL), and no pip install nightmares in production, which lets you take full advantage of Java’s multithreading capabilities. Java has no GIL: if your server receives 100 requests to vectorize documents, Java can utilize all cores of the server simultaneously to process them. By using Java, you unlock the hardware’s full potential without complex workarounds (like multiprocessing).</p>
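
<p>Java’s GIL-free parallelism is easy to demonstrate. In the sketch below, a plain parallel stream fans 100 documents across every core; <code>embed()</code> is a hypothetical stand-in for a thread-safe model inference call, not a real ONNX Runtime API:</p>

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Because Java has no GIL, a plain parallel stream saturates every core:
// no multiprocessing, no pickling, no worker pools to babysit.
public class ParallelEmbedding {
    // Hypothetical stand-in for a thread-safe model inference call.
    public static float embed(String doc) {
        return doc.length(); // a real call would return an embedding
    }

    public static void main(String[] args) {
        List<String> docs = IntStream.range(0, 100)
                .mapToObj(i -> "document-" + i)
                .collect(Collectors.toList());
        // Each document is processed on the common ForkJoinPool, in parallel.
        double total = docs.parallelStream()
                .mapToDouble(ParallelEmbedding::embed)
                .sum();
        System.out.println((int) total); // prints 1090 (10 docs of length 10 + 90 of length 11)
    }
}
```

The same code with a sequential <code>stream()</code> gives the same answer on one core; switching to <code>parallelStream()</code> is the entire "workaround".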

<h2 id="the-tokenization-challenge">The Tokenization Challenge</h2>

<p>The trickiest part of “Python-free” RAG isn’t the model; it’s the tokenization (converting text “Hello” into numbers [142, 7489]).</p>

<p>At first glance, even with the model in ONNX, you would still seem to need Python to perform tokenization.</p>

<p>In Python, we take transformers.AutoTokenizer for granted, but under the hood, it is performing a complex process:</p>

<ol>
  <li>
    <p>Normalization: Unicode formatting (NFC vs NFD), lowercasing, and stripping accents.</p>
  </li>
  <li>
    <p>Pre-tokenization: splitting text by whitespace or punctuation (e.g., “don’t” -&gt; “don”, “‘t”).</p>
  </li>
  <li>
    <p>Model Mapping: applying algorithms like BPE (Byte-Pair Encoding used by GPT-4) or WordPiece (used by BERT) to merge characters into sub-word tokens and therefore understand the
concept of the word by analyzing its parts, even if it has never seen the full word before.</p>
  </li>
</ol>
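
<p>Step 3 is the subtle part. A toy WordPiece-style greedy longest-match illustrates how sub-word merging works; the three-entry vocabulary here is purely illustrative, while real tokenizers load roughly 30,000 entries from tokenizer.json:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified WordPiece-style tokenizer: greedily match the longest sub-word
// from the vocabulary; "##" marks a continuation piece inside a word.
public class WordPieceSketch {
    public static List<String> tokenize(String word, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            // Shrink the window until we find a vocabulary entry.
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) { piece = candidate; break; }
                end--;
            }
            if (piece == null) return List.of("[UNK]"); // no match: unknown token
            tokens.add(piece);
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##believ", "##able");
        System.out.println(tokenize("unbelievable", vocab));
        // prints [un, ##believ, ##able]
    }
}
```

This is why the model can reason about a word it has never seen in full: the pieces carry the meaning.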

<p>In Java, LangChain4j and the underlying <a href="https://github.com/deepjavalibrary/djl">DJL</a>/<a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> dependencies handle this
natively. They read the standard tokenizer.json file (exported from Hugging Face) for a model and perform the text-to-ID conversion entirely in Java before feeding the tensors to
ONNX.</p>

<h2 id="the-pure-java-pipeline">The Pure Java Pipeline:</h2>

<ol>
  <li>
    <p>Input: String (Java)</p>
  </li>
  <li>
    <p>Tokenization: Native Java implementation (No Python)</p>
  </li>
  <li>
    <p>Inference: ONNX Runtime (C++ via JNI)</p>
  </li>
  <li>
    <p>Output: Vector Embedding (Java float array)</p>
  </li>
</ol>
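
<p>A minimal sketch of those four stages, with stub stages standing in for the real tokenizer.json-backed tokenizer and the JNI-backed ONNX Runtime session (none of these names are mcp_server4j’s actual API):</p>

```java
import java.util.Arrays;

// Illustrative four-stage pipeline: String in, float[] out, no Python anywhere.
public class PurePipeline {
    // Stage 2: tokenization -- text to token IDs (stubbed as code points).
    public static long[] tokenize(String text) {
        return text.chars().asLongStream().toArray();
    }

    // Stage 3: inference -- token IDs to an embedding (stubbed as a 4-dim histogram;
    // the real stage is an ONNX Runtime session call through JNI).
    public static float[] infer(long[] ids) {
        float[] v = new float[4];
        for (long id : ids) v[(int) (id % 4)] += 1.0f;
        return v;
    }

    // Stage 1 -> 4: compose the stages.
    public static float[] embed(String text) {
        return infer(tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(embed("Hello")));
        // prints [3.0, 1.0, 0.0, 1.0]
    }
}
```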

<p>This architecture is what allows mcp_server4j to run as a single, self-contained JAR file with zero external dependencies.</p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[One of the most common misconceptions in AI engineering is that you always need a Python runtime to execute models.]]></summary></entry><entry><title type="html">Building RAG Systems in Java</title><link href="https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J.html" rel="alternate" type="text/html" title="Building RAG Systems in Java" /><published>2025-11-29T00:00:00+00:00</published><updated>2025-11-29T00:00:00+00:00</updated><id>https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J</id><content type="html" xml:base="https://jeremylem.github.io/blogging/2025/11/29/MCP_RAG4J.html"><![CDATA[<p>After the Python version, I wanted to verify whether you could build a Retrieval-Augmented Generation (RAG) system from scratch in Java.</p>

<h2 id="the-challenge">The Challenge</h2>

<p>Python has become the de facto language for AI/ML projects, and for good reason: excellent libraries, rapid prototyping, and a mature ecosystem.</p>

<p>I wanted to explore whether RAG systems could be built with the same effectiveness in Java, particularly for production environments.</p>

<h2 id="the-implementation">The Implementation</h2>

<p>I built MCP Server 4J, a Model Context Protocol server implementing hybrid search (BM25 + vector similarity) with:</p>

<ul>
  <li>Apache Lucene for BM25 keyword indexing</li>
  <li>LangChain4j for vector embeddings and ChromaDB integration</li>
  <li>Spring Boot for dependency injection and configuration management</li>
</ul>
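
<p>The hybrid part boils down to weighted score fusion. A rough sketch, assuming BM25 scores are max-normalized per query and cosine similarities already fall in [0, 1] (the weights and normalization here are illustrative, not the exact fusion used in MCP Server 4J):</p>

```java
import java.util.HashMap;
import java.util.Map;

// Blend keyword (BM25) and vector (cosine) scores with a configurable alpha.
public class HybridFusion {
    public static Map<String, Double> fuse(Map<String, Double> bm25,
                                           Map<String, Double> vector, double alpha) {
        Map<String, Double> fused = new HashMap<>();
        // Max-normalize BM25 so it is comparable to cosine similarity.
        double bm25Max = bm25.values().stream().mapToDouble(d -> d).max().orElse(1.0);
        for (String doc : bm25.keySet())
            fused.merge(doc, alpha * bm25.get(doc) / bm25Max, Double::sum);
        for (String doc : vector.keySet())
            fused.merge(doc, (1 - alpha) * vector.get(doc), Double::sum);
        return fused;
    }

    public static void main(String[] args) {
        Map<String, Double> bm25 = Map.of("doc1", 8.0, "doc2", 2.0);
        Map<String, Double> vector = Map.of("doc2", 0.9, "doc3", 0.5);
        Map<String, Double> fused = fuse(bm25, vector, 0.5);
        // doc2 wins: moderate keyword match plus a strong vector match.
        String top = fused.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        System.out.println(top); // prints doc2
    }
}
```

A document that matches on both channels outranks one that dominates a single channel, which is the whole point of hybrid search.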

<h2 id="key-findings">Key Findings</h2>

<h3 id="what-works-well">What Works Well:</h3>

<ul>
  <li>Type safety catches errors at compile time, not runtime</li>
  <li>Spring Boot’s DI container makes testing straightforward</li>
  <li>Apache Lucene provides native, production-ready BM25 implementation</li>
</ul>

<h3 id="java-has-everything-you-need">Java Has Everything You Need:</h3>

<ul>
  <li>Apache Lucene provides industrial-strength BM25 ranking</li>
  <li>LangChain4j brings vector embeddings and model integrations</li>
  <li><a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> Runtime eliminates Python dependencies entirely, handling model execution natively</li>
  <li>The ecosystem is mature and production-ready</li>
</ul>

<h3 id="the-java-advantage">The Java Advantage:</h3>

<ul>
  <li>Interfaces (KeywordIndexer, DocumentLoader, DocumentChunker) make the system testable and extensible</li>
  <li>Type safety means errors show up in my IDE, not in production</li>
  <li>LangChain4j mitigates the risk of silent tokenization failures: LangChain4j and the underlying
<a href="https://github.com/deepjavalibrary/djl">DJL</a>/<a href="https://jeremylem.github.io/blogging/2025/11/30/AI_Java.html">ONNX</a> dependencies favor explicit, compiled code with fixed
configurations loaded from a standard asset (tokenizer.json). In Python, a developer has more flexibility (and thus more room for error) to skip or misconfigure the normalization
step.</li>
</ul>
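
<p>To give a flavor of that interface-driven design, here is a hedged sketch: the method signatures are guesses for illustration, not MCP Server 4J’s actual API, but they show why an in-memory stub makes unit testing trivial without Lucene on the classpath:</p>

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical KeywordIndexer contract; the real implementation would wrap Lucene.
public class IndexerSketch {
    interface KeywordIndexer {
        void index(String docId, String text);
        List<String> search(String query, int topK);
    }

    // In-memory stub for tests: naive substring matching instead of BM25.
    static class InMemoryIndexer implements KeywordIndexer {
        private final Map<String, String> docs = new LinkedHashMap<>();
        public void index(String docId, String text) { docs.put(docId, text); }
        public List<String> search(String query, int topK) {
            return docs.entrySet().stream()
                    .filter(e -> e.getValue().toLowerCase().contains(query.toLowerCase()))
                    .map(Map.Entry::getKey)
                    .limit(topK)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        KeywordIndexer indexer = new InMemoryIndexer();
        indexer.index("doc1", "Hybrid search in Java");
        indexer.index("doc2", "Python prototyping notes");
        System.out.println(indexer.search("java", 5)); // prints [doc1]
    }
}
```

Swapping the stub for a Lucene-backed implementation changes nothing in the calling code, which is exactly the extensibility benefit described above.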

<h3 id="the-tradeoffs">The Tradeoffs:</h3>

<ul>
  <li>10x more code than the Python equivalent (~2000 lines vs ~200)</li>
  <li>Longer development cycles for initial implementation</li>
  <li>Higher memory footprint (~500MB vs ~200MB)</li>
  <li>More complex build tooling (Maven vs pip)</li>
</ul>

<p>The performance is essentially identical: 20-30ms query latency with hybrid search combining BM25 and vector similarity. The real difference isn’t runtime performance; it’s
development confidence.</p>

<h2 id="lessons-learned">Lessons Learned</h2>

<ol>
  <li>RAG is definitely achievable in Java. The ecosystem has matured significantly with LangChain4j, Apache Lucene, and ONNX runtime support.</li>
  <li>Enterprise patterns matter at scale. What feels like over-engineering in Python (factories, interfaces, dependency injection) becomes valuable when you have multiple
teams working on the same codebase.</li>
  <li>Choose the right tool for the job. Python excels at rapid prototyping and research. Java shines in production environments where you need strong contracts, clear
interfaces, and long-term maintainability.</li>
</ol>

<h2 id="the-verdict">The Verdict</h2>

<p>Choose Java if:</p>

<ul>
  <li>You need strong type safety and compile-time guarantees</li>
  <li>You’re building production systems requiring clear interfaces</li>
  <li>Long-term maintainability is a priority</li>
</ul>

<p>Stick with Python if:</p>

<ul>
  <li>You’re in research/prototype phase</li>
  <li>Team expertise is primarily Python</li>
  <li>You need access to cutting-edge model libraries</li>
</ul>

<p>The additional development time is offset by fewer runtime surprises.</p>

<h2 id="technical-details">Technical Details</h2>

<p>The complete implementation includes:</p>

<ul>
  <li>Hybrid search with configurable BM25/vector weights</li>
  <li>Multi-format document support (PDF, Markdown, TXT)</li>
  <li>
    <p>~20-30ms query latency with 100% recall@5 on test queries</p>
  </li>
  <li>
    <p>Embedding Model Specifications: the same <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">All-MiniLM-L6-v2</a> model as in the Python implementation was chosen,
for exactly the same reasons: high efficiency, producing a dense vector of <strong>384 dimensions</strong>. This dimension is automatically respected by LangChain4j when interfacing
with <strong>ChromaDB</strong>.</p>
  </li>
  <li>Chunking Strategy: the <code class="language-plaintext highlighter-rouge">RecursiveDocumentChunker</code> uses the <code class="language-plaintext highlighter-rouge">DocumentSplitters.recursive()</code> method, configured for <strong>character count</strong> (512 characters). This strategy
intentionally keeps chunks safely below the model’s hard limit of <strong>256 tokens</strong> (since 512 characters is roughly equivalent to 128-150 tokens in English), preventing truncation
and maximizing context preservation.</li>
</ul>
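
<p>The recursive strategy can be sketched in plain Java: try coarse separators first (paragraphs, then sentences, then words), recurse on oversized pieces, and greedily re-merge neighbors while they still fit the budget. This is an illustrative approximation of what <code class="language-plaintext highlighter-rouge">DocumentSplitters.recursive()</code> does, not its actual implementation:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy recursive character chunker: split on progressively finer separators,
// then greedily re-merge adjacent pieces up to maxChars.
public class RecursiveChunkerSketch {
    static final String[] SEPS = {"\n\n", ". ", " "};

    public static List<String> chunk(String text, int max) {
        return chunk(text.trim(), max, 0);
    }

    static List<String> chunk(String text, int max, int level) {
        if (text.length() <= max) return text.isEmpty() ? List.of() : List.of(text);
        if (level >= SEPS.length) { // no separator left: hard cut as a last resort
            List<String> out = new ArrayList<>();
            for (int i = 0; i < text.length(); i += max)
                out.add(text.substring(i, Math.min(i + max, text.length())));
            return out;
        }
        // Split on the current separator and recurse on each piece.
        List<String> pieces = new ArrayList<>();
        for (String part : text.split(Pattern.quote(SEPS[level])))
            pieces.addAll(chunk(part.trim(), max, level + 1));
        // Greedily re-merge neighbors (rejoined with a space, a sketch-level
        // simplification) while they still fit under max.
        List<String> merged = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (String p : pieces) {
            if (cur.length() > 0 && cur.length() + p.length() + 1 > max) {
                merged.add(cur.toString());
                cur.setLength(0);
            }
            if (cur.length() > 0) cur.append(' ');
            cur.append(p);
        }
        if (cur.length() > 0) merged.add(cur.toString());
        return merged;
    }

    public static void main(String[] args) {
        String doc = "First paragraph. Quite short.\n\nSecond paragraph that is a bit longer than the limit we set here.";
        for (String c : chunk(doc, 40))
            System.out.println(c.length() + ": " + c); // 3 chunks, all within the 40-char budget
    }
}
```

The production chunker adds what this sketch omits: token-aware budgets, overlap between chunks, and separator-preserving joins.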

<p>The full source is available on GitHub for anyone interested in exploring RAG beyond the Python ecosystem: <a href="https://github.com/jeremylem/mcp_server4j">mcp_server4j</a></p>]]></content><author><name>Jeremy L.</name></author><category term="blogging" /><summary type="html"><![CDATA[After the Python version, I wanted to verify whether you could build a Retrieval-Augmented Generation (RAG) system from scratch in Java.]]></summary></entry></feed>