Building a Stateful AI Chat Application with Spring AI, Memory & Semantic Cache

From Document Ingestion to Conversational AI

In the previous part of this series, we built the foundation of our AI Knowledge Assistant and defined key AI terminologies for Java Developers. We implemented a complete RAG (Retrieval-Augmented Generation) pipeline where documents were ingested, chunked, converted into embeddings, and stored inside a vector store.

At that point, our system was already capable of understanding our data, but one thing was still missing - interaction.

What We Are Building in This Part

In this part, we will build a smart AI chat application using Spring AI that goes beyond simple prompt-response interactions.

Instead of calling an LLM directly, we will design a layered system that combines memory, caching, and retrieval to generate high-quality responses.

Here's what our chat system will be capable of:

  • Conversational Memory - The system will remember previous messages in a conversation, making responses more contextual and human-like.
  • Semantic Caching with Redis - Frequently asked or similar questions will be served instantly from Redis cache, reducing latency and improving performance.
  • RAG-based Answering - The chat system will fetch relevant document chunks from the vector store before generating answers.
  • Advisor Chain Execution - We will use a chain of Spring AI advisors to control how each request flows through memory, cache, and retrieval layers.

This approach ensures that our application is not just a chatbot, but a context-aware AI system capable of delivering accurate and efficient responses.

High-Level Architecture of Spring AI Chat Application

As a reminder, below is the overall architecture of the app we are building, as discussed in Part 1.

Spring AI RAG architecture with Ollama and vector store

Now that we understand what we are building, let's take a look at how all the components work together in our Spring AI chat architecture.

At a high level, every user query flows through multiple intelligent layers before reaching the LLM.

Here's the flow:

Spring AI Chat App Architecture

Let's break this down:

  • ChatController - Entry point that receives user queries via REST API.
  • ChatClient - Core abstraction provided by Spring AI to interact with the LLM.
  • SemanticCacheAdvisor - Checks Redis to see if a similar question was already answered.
  • MessageChatMemoryAdvisor - Injects previous conversation history into the prompt.
  • QuestionAnswerAdvisor - Performs similarity search on the vector store and enriches the prompt with relevant context.
  • Ollama LLM - Generates the final response using all the enriched context.

This layered approach makes the system:

  • Faster (thanks to caching)
  • Smarter (thanks to memory)
  • More accurate (thanks to RAG)

Introduction to Spring AI ChatClient

At the heart of our chat application lies the ChatClient, a powerful abstraction provided by Spring AI to interact with Large Language Models (LLMs).

If you've worked with LLM APIs before, you know that integrating them directly often involves handling HTTP calls, request formatting, response parsing, and error handling manually. Spring AI simplifies all of that.

The ChatClient provides a clean and fluent API that allows us to:

  • Send prompts to the LLM
  • Attach system instructions
  • Inject conversation history
  • Integrate custom processing logic using advisors

The key idea here is that instead of calling the LLM directly, we build a processing pipeline around the ChatClient using Advisors. Each advisor modifies or enriches the request before it reaches the model.

In our case, the flow looks like this:

  • Cache advisor checks for existing answers
  • Memory advisor adds past conversation
  • RAG advisor injects relevant document context

By the time the request reaches the LLM, it is no longer a simple prompt - it is a fully enriched query containing context, history, and knowledge. This is what makes Spring AI extremely powerful for building production-grade AI applications in Java.

Below is the Spring implementation of the chat client that we will be using here.

@Configuration
@RequiredArgsConstructor
public class ChatClientConfig {

    private final AppProperties properties;

    private static final String SYSTEM_PROMPT = """
            You are an intelligent Knowledge Assistant with access to a curated document knowledge base.

            RESPONSE GUIDELINES:
            - Answer questions ONLY using information from the provided context documents.
            ...
            """;

    @Bean
    public ChatClient chatClient(ChatModel chatModel) {
        return ChatClient.builder(chatModel)
                .defaultSystem(SYSTEM_PROMPT)
                .build();
    }
}

Implementing Chat Memory for Context-Aware Conversations

If you've ever used tools like ChatGPT, you'll notice that they "remember" what you said earlier in the conversation. This is what makes the interaction feel natural and intelligent. Without this, every request would behave like a completely new query, losing all prior context.

This is where Chat Memory comes into play in Spring AI.

Why Do We Need Chat Memory?

By default, LLMs are stateless - every API call is independent, and the model has no recollection of earlier messages. To solve this, we maintain a conversation history and send it along with every new request. This allows the model to generate responses that are context-aware and more meaningful.

Sliding Window Chat Memory

In our implementation, we are using a sliding window approach for managing chat history. Instead of storing the entire conversation (which can grow indefinitely), we only keep the most recent messages.

This helps in:

  • Controlling memory usage
  • Avoiding token overflow issues
  • Maintaining relevant conversational context

Spring AI provides a built-in implementation called MessageWindowChatMemory that does exactly this.
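Conceptually, a message window behaves like a bounded queue: new messages push the oldest ones out. The sketch below is a minimal plain-Java illustration of that idea - it is not Spring AI's actual implementation, which additionally preserves system messages and partitions history by conversation ID.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Illustrative sliding-window message store (not Spring AI's
// MessageWindowChatMemory, which also handles system messages
// and per-conversation partitioning).
class SlidingWindowMemory {

    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    SlidingWindowMemory(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    void add(String message) {
        messages.addLast(message);
        // Evict the oldest message once the window is full
        while (messages.size() > maxMessages) {
            messages.removeFirst();
        }
    }

    List<String> window() {
        return List.copyOf(messages);
    }
}
```

With a window size of 20, the 21st message silently evicts the 1st, so the prompt sent to the LLM stays bounded no matter how long the conversation runs.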

Chat Memory Configuration

Below is the configuration used to enable chat memory in our application:

@Bean
public ChatMemory chatMemory() {
    return MessageWindowChatMemory.builder()
            .chatMemoryRepository(new InMemoryChatMemoryRepository())
            .maxMessages(properties.getChat().getMemory().getWindowSize())
            .build();
}

  • MessageWindowChatMemory - Maintains a sliding window of recent messages instead of storing the full history.
  • InMemoryChatMemoryRepository - Stores chat history in memory. This is fast and simple, but data is lost on application restart.
  • maxMessages - Defines how many recent messages are retained in the conversation context.

Configuring Window Size

The window size is configurable via application.yml:

app:
  chat:
    memory:
      window-size: 20

This means that at any point in time, only the last 20 messages (user + assistant) will be considered when generating a response.

How It Works Internally

Every time a user sends a message:

  • The message is added to chat memory
  • Previous messages (within the window) are retrieved
  • The entire context is sent to the LLM
  • The response is generated and also stored in memory

Production Consideration

The InMemoryChatMemoryRepository is fine for local development, but all chat history is lost on restart. For production, consider swapping it for:

  • JDBC-based chat memory (database-backed)
  • Redis-based chat memory for distributed systems

With chat memory in place, our application is now capable of handling context-aware conversations. In the next step, we will enhance this further by introducing a semantic cache using Redis to optimize repeated queries and improve performance.

Improving Performance with Semantic Caching using Redis

Now that we have implemented chat memory, our application can handle context-aware conversations. But there's another important challenge to address - performance.

LLM calls are expensive and relatively slower compared to traditional API responses. If users ask the same or similar questions repeatedly, hitting the model every time is inefficient. This is where Semantic Caching becomes a game changer.

What is Semantic Caching?

Unlike traditional caching (which relies on exact key matching), semantic caching works on meaning: two differently worded questions with the same intent can be served the same cached answer. I have already explained it in detail in another article - Semantic caching with Spring Boot and Redis.

How We Implement Semantic Caching

In our application, we implement semantic caching using:

  • Embeddings - to represent questions as vectors
  • SimpleVectorStore - to store and search similar past queries
  • Redis - to store the actual cached responses with TTL

The simple idea here is:

  • Convert incoming user question into an embedding
  • Search for similar past questions in the cache index
  • If similarity is above threshold -> return cached answer
  • Otherwise -> call LLM and cache the new response
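The decision logic above can be sketched in plain Java using cosine similarity between embedding vectors. The names below are illustrative - in the real application the embedding comes from the EmbeddingModel, stored questions live in the SimpleVectorStore index, and answers live in Redis with a TTL.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative semantic cache: maps a question's embedding to a cached answer.
class SemanticCacheSketch {

    private final double threshold;
    private final Map<double[], String> cache = new HashMap<>();

    SemanticCacheSketch(double threshold) {
        this.threshold = threshold;
    }

    // Returns a cached answer if any stored question is similar enough, else null
    String lookup(double[] queryEmbedding) {
        for (Map.Entry<double[], String> entry : cache.entrySet()) {
            if (cosine(queryEmbedding, entry.getKey()) >= threshold) {
                return entry.getValue();
            }
        }
        return null;
    }

    void store(double[] embedding, String answer) {
        cache.put(embedding, answer);
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

A query embedding very close to a stored one (cosine similarity above the threshold) is a cache hit; anything else falls through to the LLM, whose answer is then stored for next time.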

Semantic Cache Configuration

Below is the configuration that enables semantic caching in our application:

@Slf4j
@Configuration
@RequiredArgsConstructor
@ConditionalOnProperty(name = "app.semantic-cache.enabled", havingValue = "true", matchIfMissing = false)
public class SemanticCacheConfig {

    private final AppProperties properties;
    private final EmbeddingModel embeddingModel;

    private SimpleVectorStore questionIndexStore;

    @Bean
    public SimpleVectorStore questionIndexStore() {
        questionIndexStore = SimpleVectorStore.builder(embeddingModel).build();

        File persistenceFile = cacheFile();
        if (persistenceFile.exists()) {
            log.info("Loading semantic cache index from: {}", persistenceFile.getAbsolutePath());
            questionIndexStore.load(persistenceFile);
        } else {
            log.info("No existing semantic cache index at {} - starting fresh",
                    persistenceFile.getAbsolutePath());
        }

        return questionIndexStore;
    }
}

Configuration Properties

We control the behavior of semantic caching via application.yml:

app:
  semantic-cache:
    enabled: true
    similarity-threshold: 0.92
    ttl-seconds: 3600
    persistence-path: ./data/semantic-cache.json

  • similarity-threshold - Defines how close two questions need to be to be considered a match. Higher values (e.g., 0.92+) ensure near-exact matches, while lower values allow more flexible matching.
  • ttl-seconds - Time-to-live for cached responses in Redis. After this duration, entries expire automatically.
  • persistence-path - File used to persist the semantic cache index across application restarts.

Why a Separate Vector Store?

You might be wondering - we already have a vector store for documents, so why create another one?

The reason is separation of concerns:

  • Document Vector Store -> stores knowledge base (RAG data)
  • Question Index Store -> stores past user queries for caching

End-to-End Flow

Here's how semantic caching fits into the request lifecycle:

User Question
    |
Generate Embedding
    |
Search in Question Index Store
    |
Match found (similarity > threshold)?
    |                     |
   Yes                   No
    |                     |
Return cached        Call LLM
response             |
    |                |
                 Store in Redis (TTL)
                 Store embedding in cache index

Building an Advisor Chain in Spring AI

Now that we have chat memory and semantic caching in place, it's time to bring everything together. This is where Spring AI Advisors come into play.

Advisors allow us to build a pipeline (or chain) of responsibilities that process a user's query step by step before reaching the LLM. Think of it like a middleware chain where each component gets a chance to inspect, modify, or even short-circuit the request.
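The middleware analogy can be made concrete with a small chain-of-responsibility sketch. The names here are illustrative, not Spring AI's API, but the pattern is the same one CallAdvisorChain follows: each link may short-circuit with its own result or delegate to the rest of the chain.

```java
import java.util.Iterator;
import java.util.List;

// Minimal chain-of-responsibility: each advisor may short-circuit
// or delegate onward, just like Spring AI advisors.
interface AdvisorSketch {
    String advise(String request, ChainSketch chain);
}

class ChainSketch {
    private final Iterator<AdvisorSketch> advisors;

    ChainSketch(List<AdvisorSketch> advisors) {
        this.advisors = advisors.iterator();
    }

    String next(String request) {
        if (advisors.hasNext()) {
            return advisors.next().advise(request, this);
        }
        return "llm-answer(" + request + ")"; // terminal step: the model call
    }
}
```

A "cache" advisor placed first can answer without ever touching the terminal step, while an "enrich" advisor can rewrite the request before passing it on - exactly the roles SemanticCacheAdvisor and QuestionAnswerAdvisor play in our chain.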

Why Use an Advisor Chain?

  • Clean separation of concerns
  • Controlled execution order
  • Performance optimization via short-circuiting
  • Extensibility for future enhancements

In our application, the advisor chain looks like this:

Semantic Cache -> Chat Memory -> RAG (Vector Store) -> LLM

1. SemanticCacheAdvisor (Performance Layer)

This advisor sits at the very beginning of the chain and decides whether we even need to call the LLM.

  • If a similar question exists -> return cached answer instantly
  • If not -> delegate to the next advisor

public class SemanticCacheAdvisor implements CallAdvisor {

    static final int ORDER = Integer.MIN_VALUE + 100;

    @Override
    public String getName() {
        return "SemanticCacheAdvisor";
    }

    @Override
    public int getOrder() {
        return ORDER; // lowest order value, so it runs before memory and RAG advisors
    }

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {

        String question = extractUserQuestion(request);
        String cachedAnswer = lookupCache(question);

        if (cachedAnswer != null) {
            return buildCacheHitResponse(request, cachedAnswer);
        }

        ChatClientResponse response = chain.nextCall(request);
        storeFromResponse(question, response);

        return response;
    }
}

This ensures repeated or similar queries never reach the LLM, making responses significantly faster.

@Bean
public SemanticCacheAdvisor semanticCacheAdvisor(
        VectorStore questionIndexStore,
        StringRedisTemplate redisTemplate) {
    log.info("Semantic caching ENABLED [threshold={}, ttl={}s, persistence={}]",
            properties.getSemanticCache().getSimilarityThreshold(),
            properties.getSemanticCache().getTtlSeconds(),
            properties.getSemanticCache().getPersistencePath());
    return new SemanticCacheAdvisor(questionIndexStore, redisTemplate, properties);
}

2. MessageChatMemoryAdvisor (Context Layer)

Once a cache miss occurs, the request flows into the chat memory layer.

This advisor ensures that previous conversation messages are included in the prompt, enabling context-aware responses.

We configure chat memory as a sliding window:

@Bean
public ChatMemory chatMemory() {
    return MessageWindowChatMemory.builder()
            .chatMemoryRepository(new InMemoryChatMemoryRepository())
            .maxMessages(properties.getChat().getMemory().getWindowSize())
            .build();
}

Key points:

  • Maintains last N messages (window-based memory)
  • Ensures prompt size stays controlled
  • Can be swapped with persistent storage (JDBC, Redis, etc.)

3. QuestionAnswerAdvisor (RAG Layer)

After enriching the request with memory, the system performs Retrieval-Augmented Generation (RAG).

This is where your ingested documents (from Part 1) come into play.

The vector store is configured as:

@Bean
public SimpleVectorStore vectorStore() {
    vectorStore = SimpleVectorStore.builder(embeddingModel).build();

    File persistenceFile = persistenceFile();
    if (persistenceFile.exists()) {
        log.info("Loading persisted vector store from: {}", persistenceFile.getAbsolutePath());
        vectorStore.load(persistenceFile);
        log.info("Vector store loaded successfully");
    } else {
        log.info("No existing vector store at {} - starting with empty knowledge base. Upload documents via POST /api/documents/upload",
                persistenceFile.getAbsolutePath());
    }

    return vectorStore;
}

This configuration defines the Vector Store used in the application, which acts as the core storage layer for all document embeddings generated during the ingestion process.

It creates a SimpleVectorStore bean backed by the configured EmbeddingModel. This store is later used by the QuestionAnswerAdvisor to perform similarity search and retrieve relevant document chunks for RAG-based responses.

Putting It All Together

We now combine all advisors into a single execution chain:

private List<Advisor> buildAdvisors(String conversationId) {

    List<Advisor> advisors = new ArrayList<>();

    if (semanticCacheAdvisor != null) {
        advisors.add(semanticCacheAdvisor);
    }

    advisors.add(MessageChatMemoryAdvisor.builder(chatMemory)
            .conversationId(conversationId)
            .build());

    advisors.add(QuestionAnswerAdvisor.builder(vectorStore)
            .searchRequest(ragSearchRequest())
            .build());

    return advisors;
}

End-to-End Flow

User Question
    |
SemanticCacheAdvisor
    |        |
   HIT      MISS
    |        |
Return     Chat Memory
cached         |
response       |
           RAG (Vector Store)
                  |
                  |
                 LLM
                  |
             Response
                  |
         Cache for future reuse

Configuring ChatClient with Advisors

Now that we have all the building blocks in place - Chat Memory, Semantic Cache, and Vector Store (RAG) - it's time to bring everything together using the ChatClient.

This is where the real power of Spring AI shines. Instead of manually orchestrating each component, we define an advisor chain that processes every user request in a structured and optimized way.

Notice the buildAdvisors() method that we implemented above.

KnowledgeChatService

@Service
@RequiredArgsConstructor
public class KnowledgeChatService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final ChatMemory chatMemory;
    private final AppProperties properties;

    @Autowired(required = false)
    private SemanticCacheAdvisor semanticCacheAdvisor;

    public ChatResponse chat(ChatRequest request) {

        List<Document> retrieved = similaritySearch(request.getMessage());
        List<DocumentSource> sources = toDocumentSources(retrieved);

        String answer = chatClient.prompt()
                .user(request.getMessage())
                .advisors(buildAdvisors(request.getConversationId()))
                .call()
                .content();

        return ChatResponse.builder()
                .answer(answer)
                .conversationId(request.getConversationId())
                .sources(sources)
                .ragUsed(!sources.isEmpty())
                .timestamp(Instant.now())
                .build();
    }
}

How the Chat Flow Works

Every user request goes through a well-defined pipeline:

  • Step 1: Similarity Search (RAG) - We first query the vector store to retrieve relevant document chunks. These are later shown as sources in the response.
  • Step 2: Advisor Chain Execution - The request is passed through the advisor chain:
    • SemanticCacheAdvisor -> returns cached answer if available
    • MessageChatMemoryAdvisor -> injects conversation history
    • QuestionAnswerAdvisor -> fetches relevant context from vector store
  • Step 3: LLM Invocation - If no cache hit occurs, the LLM generates a response using memory + retrieved context.
  • Step 4: Response Enrichment - The final response includes:
    • AI-generated answer
    • Conversation ID
    • Document sources
    • RAG usage indicator

Similarity Search (RAG Retrieval)

private List<Document> similaritySearch(String query) {
    try {
        return vectorStore.similaritySearch(
                SearchRequest.builder()
                        .query(query)
                        .topK(properties.getRag().getTopK())
                        .similarityThreshold(properties.getRag().getSimilarityThreshold())
                        .build()
        );
    } catch (Exception ex) {
        // Fail soft: an empty list simply means no RAG context or sources are attached
        return List.of();
    }
}

This method retrieves the most relevant document chunks from the vector store based on semantic similarity. These chunks are used both for answer grounding and source attribution.
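The topK and similarityThreshold parameters work together: chunks below the threshold are discarded, and at most K of the best survivors are returned. A plain-Java sketch of those semantics (illustrative names, not Spring AI's API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative retrieval: filter by similarity threshold, then keep the top K.
class RetrievalSketch {

    // scores: chunk text -> similarity score in [0, 1]
    static List<String> retrieve(Map<String, Double> scores, double threshold, int topK) {
        return scores.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)   // drop weak matches
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(topK)                              // keep the K best
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

Raising the threshold trades recall for precision; raising topK adds more context at the cost of a larger prompt.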


RAG Search Configuration

private SearchRequest ragSearchRequest() {
    return SearchRequest.builder()
            .topK(properties.getRag().getTopK())
            .similarityThreshold(properties.getRag().getSimilarityThreshold())
            .build();
}

This configuration is passed to the QuestionAnswerAdvisor, which dynamically injects the user query at runtime and performs retrieval.


Chat Request Model

@Schema(description = "Request payload for the chat endpoint")
public class ChatRequest {

    @NotBlank
    @Size(max = 4000)
    private String message;

    @NotBlank
    private String conversationId;
}

The conversationId is critical - it ensures that chat memory works correctly by linking multiple user messages into the same conversation.


Chat Response Model

@Schema(description = "Response from the AI Knowledge Assistant")
public class ChatResponse {

    private String answer;
    private String conversationId;
    private List<DocumentSource> sources;
    private boolean ragUsed;
    private Instant timestamp;
}

This structured response makes your API production-ready by including not just the answer, but also traceability (sources) and metadata.

Testing the AI Chat Application

Now that our AI chat pipeline is fully wired - with ChatClient, advisor chain, chat memory, and semantic caching - it's time to expose an API endpoint and test everything end-to-end.

We will create a simple REST controller that accepts a user message along with a conversationId and returns an AI-generated response.

@Slf4j
@RestController
@RequestMapping("/api/chat")
@RequiredArgsConstructor
@Validated
public class ChatController {

    private final KnowledgeChatService chatService;

    @PostMapping
    public ResponseEntity<ChatResponse> chat(
            @Valid @RequestBody ChatRequest request) {

        log.info("Chat request [conversationId={}]", request.getConversationId());

        return ResponseEntity.ok(chatService.chat(request));
    }
}

This endpoint acts as the entry point for all chat interactions. It delegates the request to KnowledgeChatService, which internally triggers the entire advisor chain (cache -> memory -> RAG -> LLM).

How to Test

You can test this endpoint using Postman, cURL, or any API client.

Sample Request:

POST /api/chat
Content-Type: application/json

{
  "message": "What is Spring AI?",
  "conversationId": "12345"
}

Sample Response:

{
  "answer": "Spring AI is a framework that simplifies integration with AI models...",
  "conversationId": "12345",
  "sources": [...],
  "ragUsed": true,
  "timestamp": "2026-04-04T10:15:30Z"
}

Make sure to reuse the same conversationId across multiple requests to observe how chat memory maintains context.

What's Coming Next

So far, we've built a powerful AI chat application with:

  • Context-aware conversations using Chat Memory
  • Faster responses using Semantic Caching (Redis)
  • Accurate answers using RAG (Vector Store)
  • A flexible Advisor Chain architecture

In the next part of this series, we will take things to the next level by introducing production-grade scalability and persistence.

Here's what's coming:

  • Replacing SimpleVectorStore with PgVector for scalable vector storage
  • Using Redis Stack for advanced semantic search capabilities
  • Optimizing retrieval pipelines for large datasets
  • Designing a more robust and production-ready AI architecture

Conclusion

In this part, we moved beyond basic document ingestion and built a fully functional AI-powered chat application using Spring AI.

We explored how different components come together:

  • ChatClient to interact with the LLM
  • Chat Memory to maintain conversational context
  • Semantic Cache to improve performance and reduce cost
  • Vector Store (RAG) to provide grounded and accurate responses
  • Advisor Chain to orchestrate the entire flow efficiently

Support Us!

Buying me a coffee helps keep the project running and supports new features.


Thank you for helping this blog thrive!

About The Author

I write about cryptography, web security, and secure software development. Creator of practical crypto validation tools at Devglan.

Further Reading on spring-ai