After the Python version, I wanted to verify whether you can build a Retrieval-Augmented Generation (RAG) system from scratch in Java.

The Challenge

Python has become the de facto language for AI/ML projects, and for good reason: excellent libraries, rapid prototyping, and a mature ecosystem.

I wanted to explore whether RAG systems could be built with the same effectiveness in Java, particularly for production environments.

The Implementation

I built MCP Server 4J, a Model Context Protocol server implementing hybrid search (BM25 + vector similarity) with:

  • Apache Lucene for BM25 keyword indexing
  • LangChain4j for vector embeddings and ChromaDB integration
  • Spring Boot for dependency injection and configuration management
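The core of hybrid search is combining the two score lists into one ranking. Below is a minimal score-fusion sketch in plain Java; the normalization scheme, method names, and weighting are illustrative assumptions, not the actual MCP Server 4J code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of hybrid score fusion: min-max normalize BM25 and vector scores
// to [0, 1], then combine them with a configurable weight. Names and the
// fusion scheme are assumptions for illustration.
class HybridScorer {

    // Min-max normalize a map of docId -> raw score.
    static Map<String, Double> normalize(Map<String, Double> scores) {
        double min = scores.values().stream().min(Double::compare).orElse(0.0);
        double max = scores.values().stream().max(Double::compare).orElse(1.0);
        double range = (max - min) == 0 ? 1.0 : (max - min);
        return scores.entrySet().stream().collect(Collectors.toMap(
                Map.Entry::getKey, e -> (e.getValue() - min) / range));
    }

    // Weighted linear combination; a doc missing from one list scores 0 there.
    static Map<String, Double> fuse(Map<String, Double> bm25,
                                    Map<String, Double> vector,
                                    double bm25Weight) {
        Map<String, Double> nb = normalize(bm25);
        Map<String, Double> nv = normalize(vector);
        Map<String, Double> fused = new HashMap<>();
        for (String id : nb.keySet())
            fused.merge(id, bm25Weight * nb.get(id), Double::sum);
        for (String id : nv.keySet())
            fused.merge(id, (1 - bm25Weight) * nv.get(id), Double::sum);
        return fused;
    }
}
```

A weight of 0.5 gives both retrievers equal say; shifting it toward 1.0 favors exact keyword matches over semantic similarity.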

Key Findings

What Works Well:

  • Type safety catches errors at compile time, not runtime
  • Spring Boot’s DI container makes testing straightforward
  • Apache Lucene provides a native, production-ready BM25 implementation

Java Has Everything You Need:

  • Apache Lucene provides industrial-strength BM25 ranking
  • LangChain4j brings vector embeddings and model integrations
  • ONNX Runtime eliminates Python dependencies entirely by handling model execution natively
  • The ecosystem is mature and production-ready

The Java Advantage:

  • Interfaces (KeywordIndexer, DocumentLoader, DocumentChunker) make the system testable and extensible
  • Type safety means errors show up in my IDE, not in production
  • LangChain4j mitigates the risk of silent tokenization failures: it and the underlying DJL/ONNX dependencies favor explicit, compiled code with fixed configurations loaded from a standard asset (tokenizer.json). In Python, a developer has more flexibility (and thus more room for error) to skip or misconfigure the normalization step.
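Those interface seams can be sketched as follows. Only the interface names (KeywordIndexer, DocumentLoader, DocumentChunker) come from the project; the method signatures and the in-memory test double are my assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the interface seams described above. Signatures are assumed.
interface DocumentLoader {
    String load(String path);            // read a document into plain text
}

interface DocumentChunker {
    List<String> chunk(String text);     // split text into indexable chunks
}

interface KeywordIndexer {
    void index(String docId, String text);
    List<String> search(String query);   // return matching doc ids
}

// Trivial in-memory stand-in: handy as a test double so unit tests
// don't need a real Lucene index on disk.
class InMemoryKeywordIndexer implements KeywordIndexer {
    private final List<String[]> docs = new ArrayList<>();

    @Override public void index(String docId, String text) {
        docs.add(new String[] { docId, text.toLowerCase() });
    }

    @Override public List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (String[] d : docs)
            if (d[1].contains(query.toLowerCase())) hits.add(d[0]);
        return hits;
    }
}
```

This is exactly where Spring's DI pays off: the production wiring injects the Lucene-backed implementation, while tests inject the in-memory one.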

The Tradeoffs:

  • 10x more code than the Python equivalent (~2000 lines vs ~200)
  • Longer development cycles for initial implementation
  • Higher memory footprint (~500MB vs ~200MB)
  • More complex build tooling (Maven vs pip)

The performance is essentially identical: 20-30ms query latency with hybrid search combining BM25 and vector similarity. The real difference isn’t runtime performance; it’s development confidence.

Lessons Learned

  1. RAG is definitely achievable in Java. The ecosystem has matured significantly with LangChain4j, Apache Lucene, and ONNX runtime support.
  2. Enterprise patterns matter at scale. What feels like over-engineering in Python (factories, interfaces, dependency injection) becomes valuable when you have multiple teams working on the same codebase.
  3. Choose the right tool for the job. Python excels at rapid prototyping and research. Java shines in production environments where you need strong contracts, clear interfaces, and long-term maintainability.

The Verdict

Choose Java if:

  • You need strong type safety and compile-time guarantees
  • You’re building production systems requiring clear interfaces
  • Long-term maintainability is a priority

Stick with Python if:

  • You’re in research/prototype phase
  • Team expertise is primarily Python
  • You need access to cutting-edge model libraries

The additional development time is offset by fewer runtime surprises.

Technical Details

The complete implementation includes:

  • Hybrid search with configurable BM25/vector weights
  • Multi-format document support (PDF, Markdown, TXT)
  • ~20-30ms query latency with 100% recall@5 on test queries

  • Embedding Model Specifications: the same All-MiniLM-L6-v2 model as in the Python implementation was chosen, for the same reasons: high efficiency, producing a dense vector of 384 dimensions. This dimension is automatically respected by LangChain4j when interfacing with ChromaDB.

  • Chunking Strategy: the RecursiveDocumentChunker uses the DocumentSplitters.recursive() method, configured with a character budget of 512 characters. This strategy intentionally keeps chunks safely below the model’s hard limit of 256 tokens (512 characters is roughly 128-150 tokens in English), preventing truncation and maximizing context preservation.
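To make the character-budget idea concrete, here is a simplified stand-in chunker in plain Java. It mimics the spirit of a recursive splitter (prefer paragraph boundaries, hard-split oversized paragraphs) but is not LangChain4j's actual DocumentSplitters.recursive() algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified character-budget chunker: split on paragraph boundaries first,
// then hard-split any paragraph that still exceeds the budget. Illustrative
// only; the real recursive splitter also falls back to sentences and words.
class SimpleChunker {
    static List<String> chunk(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        for (String para : text.split("\n\n")) {
            String p = para.strip();
            while (p.length() > maxChars) {
                chunks.add(p.substring(0, maxChars));
                p = p.substring(maxChars);
            }
            if (!p.isEmpty()) chunks.add(p);
        }
        return chunks;
    }
}
```

With a 512-character budget, every emitted chunk stays comfortably under the embedding model's 256-token ceiling.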

The full source is available on GitHub for anyone interested in exploring RAG beyond the Python ecosystem: mcpp_server4j_github
