If you've already built a basic RAG app using Spring AI, Ollama, and a vector database, you're off to a great start. But let's be honest - most RAG demos fail in production.
In this guide, I'll show you how to turn that prototype into a production-grade Customer Support Automation system that actually delivers accurate answers, scales well, and feels intelligent.
If you're new to RAG, I recommend starting here: RAG with Ollama + Spring AI + ChromaDB
What We're Building
A smart customer support assistant that can:
- Answer FAQs instantly
- Search knowledge base (docs, PDFs, help articles)
- Assist support agents with contextual answers
- Reduce ticket load significantly
High-Level Architecture
This is the key difference between a basic RAG app and a production system.
Ingestion Pipeline
Before anything, we need clean and structured data.
Typical sources:
- Support articles
- FAQs
- PDF manuals
- Internal docs
Processing steps: parse -> chunk (with overlap) -> embed -> index.
Tip: Spend time here. Most RAG failures come from poor chunking and missing metadata.
// pseudo flow: parse -> chunk -> embed -> index
List<Document> docs = parser.parse(pdfFiles);
List<Chunk> chunks = chunker.split(docs, 400);   // ~400 tokens per chunk
for (Chunk c : chunks) {
    float[] embedding = embeddingModel.embed(c.getText());
    // store text, vector, and metadata (source, title, section) together
    elasticsearch.index(c.getText(), embedding, c.getMetadata());
}
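To make the chunking step concrete, here's a minimal, dependency-free sketch of fixed-size splitting with overlap. It splits by characters for simplicity; a real pipeline would count tokens, but the sliding-window logic is the same. The class and parameter names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Fixed-size chunking with overlap. Overlap keeps a sentence that straddles
// a boundary retrievable from at least one chunk.
public class Chunker {
    public static List<String> split(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // last chunk reached
        }
        return chunks;
    }

    public static void main(String[] args) {
        String doc = "Refunds are issued within 14 days of cancellation.";
        for (String c : split(doc, 20, 5)) {
            System.out.println(c);
        }
    }
}
```

Note how each chunk repeats the last 5 characters of the previous one - that redundancy is deliberate.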
For deeper implementation: Build AI Knowledge Assistant
Query Rewriting (Huge Accuracy Boost)
Users don't ask perfect questions.
Example:
User: "refund"
That's too vague. Rewrite it using LLM:
"refund policy for cancelled orders and eligibility conditions"
This improves retrieval quality dramatically.
In Spring AI, you can use a lightweight LLM call before retrieval to rewrite queries.
// Spring AI pseudo
String rewritten = chatClient.prompt()
        .user("Rewrite for better search: " + userQuery)
        .call()
        .content();
Hybrid Search (Why Elasticsearch Matters)
Instead of only vector search, combine:
- Keyword search (BM25) -> precise matches
- Vector search -> semantic understanding
This ensures:
- You don't miss exact keyword matches
- You still capture meaning and intent
Example Query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": "refund policy" } },
        {
          "knn": {
            "field": "embedding",
            "query_vector": [...],
            "num_candidates": 50
          }
        }
      ]
    }
  }
}

(Using knn as a query clause inside bool requires a recent Elasticsearch 8.x; older versions only support the top-level knn search option.)
Hybrid search alone can improve results by 30-40%.
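If your Elasticsearch version doesn't support knn inside a bool query, or your keyword and vector stores are separate, you can run the two searches independently and merge the ranked lists client-side. A common technique is reciprocal rank fusion (RRF); here's a minimal sketch (doc IDs are illustrative, and k=60 is the constant conventionally used with RRF):

```java
import java.util.*;

// Reciprocal Rank Fusion: merge a BM25 result list and a vector result list
// into one ranking. score(doc) = sum over lists of 1 / (k + rank).
public class RrfFusion {
    public static List<String> fuse(List<String> bm25Ids, List<String> knnIds, int k) {
        Map<String, Double> scores = new HashMap<>();
        addRanks(scores, bm25Ids, k);
        addRanks(scores, knnIds, k);
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    private static void addRanks(Map<String, Double> scores, List<String> ids, int k) {
        for (int rank = 0; rank < ids.size(); rank++) {
            scores.merge(ids.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }

    public static void main(String[] args) {
        List<String> merged = fuse(
                List.of("doc1", "doc2", "doc3"),   // keyword hits, best first
                List.of("doc1", "doc3", "doc4"),   // vector hits, best first
                60);
        System.out.println(merged); // docs ranked high in both lists win
    }
}
```

RRF is attractive because it only needs ranks, not scores, so you never have to normalize BM25 scores against cosine similarities.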
Reranking
Initial results are noisy. Reranking improves relevance.
// simple LLM reranker idea: sort by relevance score, keep top 5
List<Doc> reranked = docs.stream()
        .sorted(Comparator.comparingDouble(this::score).reversed())
        .limit(5)
        .toList();
This step alone can drastically improve answer quality.
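To make the idea runnable, here's a sketch where a simple term-overlap score stands in for the LLM or cross-encoder score - the score function is exactly the part you'd swap out in production. All names are illustrative.

```java
import java.util.*;

// Reranker sketch: in production, score(query, doc) would come from a
// cross-encoder model or an LLM call; term overlap stands in here so the
// sorting and limiting logic is runnable.
public class Reranker {
    static double score(String query, String doc) {
        Set<String> q = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        long hits = Arrays.stream(doc.toLowerCase().split("\\s+"))
                .filter(q::contains)
                .count();
        return (double) hits / q.size();
    }

    public static List<String> rerank(String query, List<String> docs, int topK) {
        return docs.stream()
                .sorted(Comparator.comparingDouble((String d) -> score(query, d)).reversed())
                .limit(topK)
                .toList();
    }

    public static void main(String[] args) {
        List<String> top = rerank("refund policy",
                List.of("shipping times vary by region",
                        "our refund policy covers cancelled orders",
                        "refund requests take 5 days"),
                2);
        System.out.println(top);
    }
}
```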
Answer Generation (LLM Layer)
Now pass only the top-ranked context to your LLM.
Prompt example:
You are a customer support assistant.
Answer based only on the provided context.
Context:
{top_documents}
Question:
{user_query}
String answer = chatClient.prompt()
        .system("You are a customer support assistant. Answer only from the provided context.")
        .user("Context: " + context + "\nQuestion: " + query)
        .call()
        .content();
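One practical detail when building that context string: cap how much you stuff into the prompt. Here's a minimal sketch that assembles top-ranked chunks under a budget (character-based for simplicity, as a stand-in for a real token count; names are illustrative):

```java
import java.util.List;

// Assemble the prompt context from top-ranked chunks, stopping once the
// next chunk would exceed the budget. Chunks are assumed already sorted
// best-first, so truncation drops the least relevant material.
public class ContextBuilder {
    public static String build(List<String> rankedChunks, int maxChars) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : rankedChunks) {
            if (sb.length() + chunk.length() + 1 > maxChars) break;
            sb.append(chunk).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ctx = build(List.of("chunk-a", "chunk-b", "chunk-c"), 16);
        System.out.print(ctx); // fits chunk-a and chunk-b only
    }
}
```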
For streaming responses: Streaming AI Responses with SSE
Spring AI Orchestration Layer
public String handleQuery(String query) {
    String rewritten = rewrite(query);       // step 1: query rewriting
    List<Doc> docs = search(rewritten);      // step 2: hybrid search
    List<Doc> ranked = rerank(docs);         // step 3: reranking
    return generateAnswer(ranked, query);    // step 4: LLM answer
}
Add Redis for chat memory: AI Chat App with Redis
Production Essentials
Security
- JWT / OAuth2 authentication
- PII masking
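For PII masking, a minimal regex-based sketch gives the idea - redact before logging queries or sending them to the model. The patterns here are illustrative only; a production system should use a dedicated PII-detection library.

```java
import java.util.regex.Pattern;

// PII masking sketch: redact email addresses and long digit runs
// (card or phone numbers) from free text.
public class PiiMasker {
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern DIGITS = Pattern.compile("\\d{7,}");

    public static String mask(String text) {
        return DIGITS.matcher(EMAIL.matcher(text).replaceAll("[EMAIL]"))
                .replaceAll("[NUMBER]");
    }

    public static void main(String[] args) {
        System.out.println(mask("Contact jane@example.com about order 4111111111111111"));
    }
}
```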
Caching
- Cache frequent queries (Redis)
- Cache embeddings
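For query caching, normalize before hashing so trivially different phrasings hit the same entry. A sketch, where the "rag:answer:" key prefix is an arbitrary naming choice:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Cache-key sketch: lowercase, trim, and collapse whitespace before hashing,
// so "Refund policy?" and " refund   POLICY? " map to the same Redis key.
public class CacheKeys {
    public static String forQuery(String query) {
        try {
            String normalized = query.trim().toLowerCase().replaceAll("\\s+", " ");
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            return "rag:answer:" + HexFormat.of().formatHex(digest);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(forQuery("Refund policy?"));
        System.out.println(forQuery(" refund   POLICY? "));
    }
}
```

The same normalization also raises your cache hit rate when you cache embeddings.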
Observability
- Track latency per step
- Log queries and responses
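Per-step latency tracking can be as simple as wrapping each pipeline stage; slow steps (usually retrieval or the LLM call) then show up immediately in logs or metrics. A sketch with illustrative step names:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Observability sketch: time each pipeline stage and keep the timings in
// insertion order, ready to log or export as metrics.
public class StepTimer {
    private final Map<String, Long> timingsMs = new LinkedHashMap<>();

    public <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            timingsMs.put(step, (System.nanoTime() - start) / 1_000_000);
        }
    }

    public Map<String, Long> timings() { return timingsMs; }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        String rewritten = timer.time("rewrite", () -> "refund policy for cancelled orders");
        var docs = timer.time("search", () -> java.util.List.of("doc1", "doc2"));
        System.out.println(timer.timings().keySet()); // [rewrite, search]
    }
}
```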
Feedback Loop
- User thumbs up/down
- Improve ranking over time
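A minimal sketch of turning thumbs up/down into a ranking signal: record votes per document and convert the ratio into a small boost applied at retrieval time. The smoothing and boost range here are arbitrary choices.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Feedback-loop sketch: per-document up/down counts become a multiplicative
// ranking boost in [0.5, 1.5]; unseen docs get a neutral 1.0 thanks to
// Laplace (+1) smoothing.
public class FeedbackStore {
    private final Map<String, int[]> votes = new ConcurrentHashMap<>(); // {up, down}

    public void record(String docId, boolean up) {
        votes.computeIfAbsent(docId, id -> new int[2])[up ? 0 : 1]++;
    }

    public double boost(String docId) {
        int[] v = votes.getOrDefault(docId, new int[2]);
        double ratio = (v[0] + 1.0) / (v[0] + v[1] + 2.0);
        return 0.5 + ratio;
    }

    public static void main(String[] args) {
        FeedbackStore fb = new FeedbackStore();
        fb.record("doc1", true);
        fb.record("doc1", true);
        fb.record("doc2", false);
        System.out.printf("doc1=%.2f doc2=%.2f%n", fb.boost("doc1"), fb.boost("doc2"));
    }
}
```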
Real Impact in Customer Support
With this architecture, you can:
- Reduce support tickets by 40-60%
- Improve response time to seconds
- Assist agents with better answers
This is not just a chatbot - it's a knowledge engine.
Conclusion
If your current RAG is:
Vector Search -> LLM
Upgrade it to:
Query Rewriting -> Hybrid Search -> Reranking -> LLM
That's the difference between:
"Basic bot" -> "Intelligent assistant"
What Next?
In the next post, I'll show:
- Spring AI code structure
- Elasticsearch index mappings
- Reranking implementation