
Streaming AI Responses with SSE in Spring AI (ChatClient)


If you've been following my previous articles on building AI applications with Spring AI, you already know how to build a solid foundation using RAG and chat services.

In those articles, we focused on building the backend, integrating vector stores, and generating AI responses. But one thing we didn't cover is streaming responses in real-time.

Instead of waiting for the entire response, what if we could stream tokens as they are generated? That's where Server-Sent Events (SSE) comes into play.


Why Streaming Matters

Typical flow:

User -> Request -> Wait... -> Full response

Streaming flow:

User -> Request -> Tokens stream in real-time -> Better UX

Streaming improves perceived performance and makes your AI application feel more interactive.


Controller Layer

Here's the streaming endpoint:

@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter chatStream(@Valid @RequestBody ChatRequest request) {
    log.info("Streaming chat request [conversationId={}]", request.getConversationId());
    return chatService.chatStream(request);
}

Key points:

  • TEXT_EVENT_STREAM_VALUE enables SSE
  • We return SseEmitter instead of a normal response
  • Controller remains thin (as it should be)
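The ChatRequest DTO isn't shown in this article. Here's a minimal plain-Java sketch matching the getters the controller calls; anything beyond the message and conversationId fields (and whatever validation backs @Valid) is an assumption:

```java
// Hypothetical request DTO matching the getters used in the controller.
// Field names beyond message/conversationId are assumptions.
public class ChatRequest {

    private String message;          // the user's prompt
    private String conversationId;   // ties the request to an ongoing conversation

    public ChatRequest() { }         // no-args constructor for JSON deserialization

    public ChatRequest(String message, String conversationId) {
        this.message = message;
        this.conversationId = conversationId;
    }

    public String getMessage() { return message; }
    public String getConversationId() { return conversationId; }

    public void setMessage(String message) { this.message = message; }
    public void setConversationId(String conversationId) { this.conversationId = conversationId; }
}
```

In the real project this is likely a Lombok @Data class with validation annotations such as @NotBlank; the plain version above just keeps the example self-contained.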

Service Layer (Streaming Logic)

public SseEmitter chatStream(ChatRequest request) {
    SseEmitter emitter = new SseEmitter(120_000L);

    Flux<String> tokenStream = chatClient.prompt()
            .user(request.getMessage())
            .advisors(buildAdvisors(request.getConversationId()))
            .stream()
            .content();

    tokenStream.subscribe(
            token -> {
                try {
                    emitter.send(SseEmitter.event().data(token).name("token"));
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            },
            emitter::completeWithError,
            () -> {
                try {
                    emitter.send(SseEmitter.event().name("done").data("[DONE]"));
                    emitter.complete();
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            }
    );

    return emitter;
}

What's Happening Here?

1. Create SSE Connection

SseEmitter emitter = new SseEmitter(120_000L);

This sets the emitter's timeout to two minutes (120,000 ms); if the stream hasn't completed by then, the connection is closed. Adjust it to comfortably exceed your longest expected generation.

2. Stream Tokens from Spring AI

Flux<String> tokenStream = chatClient.prompt()
        .user(request.getMessage())
        .advisors(buildAdvisors(request.getConversationId()))
        .stream()
        .content();

This is where Spring AI shines. Instead of a single response, we get a Flux<String>, which emits tokens as they are generated.

If you've implemented RAG from my previous article, the advisors here are what inject the retrieved context into the prompt.
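The buildAdvisors method itself isn't shown in this article. As a rough sketch of what it might contain, assuming Spring AI 1.0-style advisor builders and injected chatMemory and vectorStore beans (the exact builder methods vary between Spring AI versions):

```java
// Hypothetical sketch of buildAdvisors — not shown in the article.
// chatMemory and vectorStore are assumed injected beans; builder
// method names follow Spring AI 1.0 and may differ in your version.
private Advisor[] buildAdvisors(String conversationId) {
    return new Advisor[] {
            // Replays earlier messages for this conversation into the prompt
            MessageChatMemoryAdvisor.builder(chatMemory)
                    .conversationId(conversationId)
                    .build(),
            // Retrieves relevant documents from the vector store (RAG)
            QuestionAnswerAdvisor.builder(vectorStore)
                    .build()
    };
}
```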

3. Send Tokens to Client

emitter.send(SseEmitter.event().data(token).name("token"));

Each token is pushed immediately to the client.

4. Handle Completion

emitter.send(SseEmitter.event().name("done").data("[DONE]"));
emitter.complete();

We explicitly send a done event so the frontend knows the stream has ended.

Frontend Example

One caveat: the browser's native EventSource API only supports GET requests, so it can't call our POST endpoint directly. Instead, read the stream with fetch:

const response = await fetch("/api/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: "Hello", conversationId: "c-1" }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each chunk contains raw SSE frames ("event: token\ndata: ..."),
  // which you parse before appending tokens to the UI. The "done"
  // event tells you the stream has ended.
  console.log(decoder.decode(value, { stream: true }));
}

Edge Cases You Should Consider

1. Client Disconnect

If the client disconnects mid-stream, emitter.send will throw an IOException. The code above already handles this by calling completeWithError.
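Relatedly, the subscribe call in the service returns a Reactor Disposable. Wiring it into the emitter's lifecycle callbacks lets you cancel the upstream model call as soon as the connection dies, instead of generating tokens nobody will read. A sketch of that wiring:

```java
// Cancel the token stream when the SSE connection ends, so a
// disconnected client doesn't keep the model call running.
Disposable subscription = tokenStream.subscribe(
        token -> { /* send the token event, as shown above */ },
        emitter::completeWithError,
        emitter::complete);

emitter.onCompletion(subscription::dispose);   // stream finished or client closed
emitter.onTimeout(() -> {                      // emitter timeout elapsed
    subscription.dispose();
    emitter.complete();
});
emitter.onError(ex -> subscription.dispose()); // transport-level failure
```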

2. Timeout Handling

If responses are long, increase the timeout. Otherwise, the connection may close prematurely.

3. Token Granularity

Sometimes tokens are too small (like single characters). You can buffer them:

.bufferTimeout(10, Duration.ofMillis(200))
.map(list -> String.join("", list))

4. High Traffic / Scaling

Since each request keeps a connection open:

  • Limit concurrent streams
  • Add rate limiting
  • Monitor thread usage
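The first bullet can be sketched with a plain java.util.concurrent.Semaphore. The StreamLimiter class and the permit count below are illustrative, not from the article; in production you might enforce this at a gateway instead:

```java
import java.util.concurrent.Semaphore;

// Minimal guard that caps how many SSE streams are open at once.
// Acquire a permit before creating the emitter; release it when the
// stream completes, times out, or errors.
public class StreamLimiter {

    private final Semaphore permits;

    public StreamLimiter(int maxConcurrentStreams) {
        this.permits = new Semaphore(maxConcurrentStreams);
    }

    /** Returns true if a new stream may start; the caller must release() later. */
    public boolean tryStart() {
        return permits.tryAcquire();
    }

    public void release() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

In the service, call tryStart() before creating the SseEmitter and return HTTP 429 when it fails; hook release() into the emitter's onCompletion, onTimeout, and onError callbacks.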

5. Context Handling

If you're using RAG (as discussed in my knowledge assistant article), make sure:

  • The conversationId is consistent across requests
  • The advisors correctly inject the retrieved context

This is crucial for maintaining conversational continuity.

Final Thoughts

Streaming AI responses is one of those small changes that dramatically improves user experience.

With Spring AI, it's surprisingly easy:

  • Call ChatClient's stream().content() to get a Flux<String> of tokens
  • Bridge that Flux to an SseEmitter
  • Expose it via a text/event-stream endpoint

If you already have a working chat system from my earlier articles, adding streaming is just a small incremental step.


About The Author

I write about cryptography, web security, and secure software development. Creator of practical crypto validation tools at Devglan.
