If you've been following my previous articles on building AI applications with Spring AI, you already know how to build a solid foundation using RAG and chat services.
In those articles, we focused on building the backend, integrating vector stores, and generating AI responses. But one thing we didn't cover is streaming responses in real-time.
Instead of waiting for the entire response, what if we could stream tokens as they are generated? That's where Server-Sent Events (SSE) comes into play.
Why Streaming Matters
Typical flow:
User -> Request -> Wait... -> Full response
Streaming flow:
User -> Request -> Tokens stream in real-time -> Better UX
Streaming improves perceived performance and makes your AI application feel more interactive.
Controller Layer
Here's the streaming endpoint:
@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter chatStream(@Valid @RequestBody ChatRequest request) {
    log.info("Streaming chat request [conversationId={}]", request.getConversationId());
    return chatService.chatStream(request);
}
Key points:
- TEXT_EVENT_STREAM_VALUE enables SSE
- We return SseEmitter instead of a normal response
- The controller remains thin (as it should be)
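For reference, here's a minimal sketch of what the ChatRequest DTO could look like. The field names (message, conversationId) are assumptions based on how the controller uses them; the real class would also carry Bean Validation annotations (e.g. @NotBlank) for @Valid to enforce.

```java
// Minimal sketch of the request body the controller binds.
// Field names are assumptions inferred from the controller code.
class ChatRequest {
    private String message;        // the user's chat message (would be @NotBlank)
    private String conversationId; // ties requests to one conversation

    ChatRequest() { }

    ChatRequest(String message, String conversationId) {
        this.message = message;
        this.conversationId = conversationId;
    }

    public String getMessage() { return message; }
    public void setMessage(String message) { this.message = message; }
    public String getConversationId() { return conversationId; }
    public void setConversationId(String conversationId) { this.conversationId = conversationId; }
}
```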
Service Layer (Streaming Logic)
public SseEmitter chatStream(ChatRequest request) {
    SseEmitter emitter = new SseEmitter(120_000L);

    Flux<String> tokenStream = chatClient.prompt()
            .user(request.getMessage())
            .advisors(buildAdvisors(request.getConversationId()))
            .stream()
            .content();

    Disposable subscription = tokenStream.subscribe(
            token -> {
                try {
                    emitter.send(SseEmitter.event().data(token).name("token"));
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            },
            emitter::completeWithError,
            () -> {
                try {
                    emitter.send(SseEmitter.event().name("done").data("[DONE]"));
                    emitter.complete();
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            }
    );

    // Stop consuming tokens if the client disconnects or the emitter times out,
    // so we don't keep generating into a dead connection.
    emitter.onCompletion(subscription::dispose);
    emitter.onTimeout(subscription::dispose);

    return emitter;
}
What's Happening Here?
1. Create SSE Connection
SseEmitter emitter = new SseEmitter(120_000L);
This sets the emitter's timeout to 120,000 ms (two minutes). If the stream hasn't completed by then, Spring closes the connection, so adjust it to your longest expected response.
2. Stream Tokens from Spring AI
Flux<String> tokenStream = chatClient.prompt()
        .user(request.getMessage())
        .advisors(buildAdvisors(request.getConversationId()))
        .stream()
        .content();
This is where Spring AI shines. Instead of a single response, we get a Flux<String>, which emits tokens as they are generated.
If you've implemented RAG from my previous article, the advisors here are what inject the retrieved context into the prompt.
3. Send Tokens to Client
emitter.send(SseEmitter.event().data(token).name("token"));
Each token is pushed immediately to the client.
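Under the hood, SseEmitter serializes each named event into the standard SSE wire format: an event: line, a data: line, and a blank-line terminator. Purely to illustrate the protocol, here's a plain-Java sketch of the text frame a browser's EventSource parser receives for one event:

```java
// Builds the raw text frame for one named SSE event.
// SseEmitter generates this for us; shown only to illustrate the protocol.
class SseFrame {
    static String frame(String name, String data) {
        return "event: " + name + "\n"
             + "data: " + data + "\n"
             + "\n"; // blank line terminates the event
    }
}
```

Seeing the wire format also explains why the frontend can subscribe to "token" and "done" by name: the event: line is what EventSource dispatches on.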
4. Handle Completion
emitter.send(SseEmitter.event().name("done").data("[DONE]"));
emitter.complete();
We explicitly send a done event so the frontend knows the stream has ended.
Frontend Example
One caveat: EventSource can only issue GET requests, so it can't call the POST endpoint above directly. You'd either expose a GET variant of /stream that accepts query parameters (as assumed below, with parameter names mirroring the ChatRequest fields), or use fetch() and read the response body as a stream.
const params = new URLSearchParams({ message: "Hello", conversationId: "conv-1" });
const eventSource = new EventSource("/api/chat/stream?" + params);
eventSource.addEventListener("token", (event) => {
console.log("Token:", event.data);
});
eventSource.addEventListener("done", () => {
console.log("Stream completed");
eventSource.close();
});
Edge Cases You Should Consider
1. Client Disconnect
If the client disconnects mid-stream, the next send() throws an IOException. The code above already handles this with completeWithError, which is good.
2. Timeout Handling
If responses are long, increase the timeout. Otherwise, the connection may close prematurely.
3. Token Granularity
Sometimes tokens are too small (a single character, say), which means one SSE event per character. You can batch them on the Flux before subscribing:
.bufferTimeout(10, Duration.ofMillis(200))
.map(list -> String.join("", list))
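To make the batching behavior concrete, here's a plain-Java sketch of the size-based half of what bufferTimeout does. The real Reactor operator also flushes any partial batch once the time window elapses; this sketch deliberately omits the time dimension.

```java
import java.util.ArrayList;
import java.util.List;

class TokenBatcher {
    // Joins tokens into payloads of up to maxSize tokens each, like
    // bufferTimeout(maxSize, ...) followed by String.join. The time-based
    // flush of the real operator is omitted here.
    static List<String> batch(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String token : tokens) {
            current.append(token);
            if (++count == maxSize) {
                out.add(current.toString()); // full batch -> one SSE payload
                current.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            out.add(current.toString()); // flush the final partial batch
        }
        return out;
    }
}
```

With maxSize = 2, five tiny tokens become three SSE events instead of five, which is the whole point of batching.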
4. High Traffic / Scaling
Since each request keeps a connection open:
- Limit concurrent streams
- Add rate limiting
- Monitor thread usage
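One simple way to cap concurrent streams is a Semaphore: acquire a permit before creating the emitter, reject the request (e.g. with a 429) if none is available, and release the permit from the emitter's onCompletion/onTimeout callbacks. A minimal sketch of the permit logic (the limit value and class name are mine, not from the original code):

```java
import java.util.concurrent.Semaphore;

class StreamLimiter {
    private final Semaphore permits;

    StreamLimiter(int maxConcurrentStreams) {
        this.permits = new Semaphore(maxConcurrentStreams);
    }

    // Call before new SseEmitter(...); returning false means "reject".
    boolean tryAcquire() {
        return permits.tryAcquire();
    }

    // Wire into emitter.onCompletion(...) / emitter.onTimeout(...).
    void release() {
        permits.release();
    }
}
```

In a real service you'd also guard against double-releasing when both onTimeout and onCompletion fire for the same emitter.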
5. Context Handling
If you're using RAG (as discussed in my knowledge assistant article), make sure:
- The conversationId is consistent across requests
- The advisors correctly inject the retrieved context
This is crucial for maintaining conversational continuity.
Final Thoughts
Streaming AI responses is one of those small changes that dramatically improves user experience.
With Spring AI, it's surprisingly easy:
- Get a Flux<String> from ChatClient's stream()
- Subscribe to the Flux and forward each token
- Deliver the events via SseEmitter
If you already have a working chat system from my earlier articles, adding streaming is just a small incremental step.