
Streaming AI Responses with SSE in Spring AI (ChatClient)


If you've been following my previous articles on building AI applications with Spring AI, you already know how to build a solid foundation using RAG and chat services.

In those articles, we focused on building the backend, integrating vector stores, and generating AI responses. But one thing we didn't cover is streaming responses in real-time.

Instead of waiting for the entire response, what if we could stream tokens as they are generated? That's where Server-Sent Events (SSE) comes into play.


Why Streaming Matters

Typical flow:

User -> Request -> Wait... -> Full response

Streaming flow:

User -> Request -> Tokens stream in real-time -> Better UX

Streaming improves perceived performance and makes your AI application feel more interactive.


Controller Layer

Here's the streaming endpoint:

@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter chatStream(@Valid @RequestBody ChatRequest request) {
    log.info("Streaming chat request [conversationId={}]", request.getConversationId());
    return chatService.chatStream(request);
}

Key points:

  • TEXT_EVENT_STREAM_VALUE enables SSE
  • We return SseEmitter instead of a normal response
  • Controller remains thin (as it should be)
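The ChatRequest DTO isn't shown in this article. Here's a minimal plain-Java sketch matching the getters the controller calls; anything beyond the message and conversationId fields (and whatever validation backs @Valid) is an assumption:

```java
// Hypothetical request DTO matching the getters used in the controller.
// Field names beyond message/conversationId are assumptions.
public class ChatRequest {

    private String message;          // the user's prompt
    private String conversationId;   // ties the request to an ongoing conversation

    public ChatRequest() { }         // no-args constructor for JSON deserialization

    public ChatRequest(String message, String conversationId) {
        this.message = message;
        this.conversationId = conversationId;
    }

    public String getMessage() { return message; }
    public String getConversationId() { return conversationId; }

    public void setMessage(String message) { this.message = message; }
    public void setConversationId(String conversationId) { this.conversationId = conversationId; }
}
```

In the real project this is likely a Lombok @Data class with validation annotations such as @NotBlank; the plain version above just keeps the example self-contained.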

Service Layer (Streaming Logic)

public SseEmitter chatStream(ChatRequest request) {
    SseEmitter emitter = new SseEmitter(120_000L);

    Flux<String> tokenStream = chatClient.prompt()
            .user(request.getMessage())
            .advisors(buildAdvisors(request.getConversationId()))
            .stream()
            .content();

    tokenStream.subscribe(
            token -> {
                try {
                    emitter.send(SseEmitter.event().data(token).name("token"));
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            },
            emitter::completeWithError,
            () -> {
                try {
                    emitter.send(SseEmitter.event().name("done").data("[DONE]"));
                    emitter.complete();
                } catch (IOException ex) {
                    emitter.completeWithError(ex);
                }
            }
    );

    return emitter;
}

What's Happening Here?

1. Create SSE Connection

SseEmitter emitter = new SseEmitter(120_000L);

This sets the emitter's timeout to two minutes (120,000 ms); if the stream hasn't completed by then, the connection is closed. Adjust it to comfortably exceed your longest expected generation.

2. Stream Tokens from Spring AI

Flux<String> tokenStream = chatClient.prompt()
        .user(request.getMessage())
        .advisors(buildAdvisors(request.getConversationId()))
        .stream()
        .content();

This is where Spring AI shines. Instead of a single response, we get a Flux<String>, which emits tokens as they are generated.

If you've implemented RAG from my previous article, the advisors here are what inject the retrieved context into the prompt.
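The buildAdvisors method itself isn't shown in this article. As a rough sketch of what it might contain, assuming Spring AI 1.0-style advisor builders and injected chatMemory and vectorStore beans (the exact builder methods vary between Spring AI versions):

```java
// Hypothetical sketch of buildAdvisors — not shown in the article.
// chatMemory and vectorStore are assumed injected beans; builder
// method names follow Spring AI 1.0 and may differ in your version.
private Advisor[] buildAdvisors(String conversationId) {
    return new Advisor[] {
            // Replays earlier messages for this conversation into the prompt
            MessageChatMemoryAdvisor.builder(chatMemory)
                    .conversationId(conversationId)
                    .build(),
            // Retrieves relevant documents from the vector store (RAG)
            QuestionAnswerAdvisor.builder(vectorStore)
                    .build()
    };
}
```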

3. Send Tokens to Client

emitter.send(SseEmitter.event().data(token).name("token"));

Each token is pushed immediately to the client.

4. Handle Completion

emitter.send(SseEmitter.event().name("done").data("[DONE]"));
emitter.complete();

We explicitly send a done event so the frontend knows the stream has ended.

Frontend Example

One caveat: the browser's native EventSource API only supports GET requests, so it can't call our POST endpoint directly. Instead, read the stream with fetch:

const response = await fetch("/api/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: "Hello", conversationId: "c-1" }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each chunk contains raw SSE frames ("event: token\ndata: ..."),
  // which you parse before appending tokens to the UI. The "done"
  // event tells you the stream has ended.
  console.log(decoder.decode(value, { stream: true }));
}

Edge Cases You Should Consider

1. Client Disconnect

If the client disconnects mid-stream, emitter.send will throw an IOException. The code above already handles this by calling completeWithError.
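Relatedly, the subscribe call in the service returns a Reactor Disposable. Wiring it into the emitter's lifecycle callbacks lets you cancel the upstream model call as soon as the connection dies, instead of generating tokens nobody will read. A sketch of that wiring:

```java
// Cancel the token stream when the SSE connection ends, so a
// disconnected client doesn't keep the model call running.
Disposable subscription = tokenStream.subscribe(
        token -> { /* send the token event, as shown above */ },
        emitter::completeWithError,
        emitter::complete);

emitter.onCompletion(subscription::dispose);   // stream finished or client closed
emitter.onTimeout(() -> {                      // emitter timeout elapsed
    subscription.dispose();
    emitter.complete();
});
emitter.onError(ex -> subscription.dispose()); // transport-level failure
```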

2. Timeout Handling

If responses are long, increase the timeout. Otherwise, the connection may close prematurely.

3. Token Granularity

Sometimes tokens are too small (like single characters). You can buffer them:

.bufferTimeout(10, Duration.ofMillis(200))
.map(list -> String.join("", list))

4. High Traffic / Scaling

Since each request keeps a connection open:

  • Limit concurrent streams
  • Add rate limiting
  • Monitor thread usage
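The first bullet can be sketched with a plain java.util.concurrent.Semaphore. The StreamLimiter class and the permit count below are illustrative, not from the article; in production you might enforce this at a gateway instead:

```java
import java.util.concurrent.Semaphore;

// Minimal guard that caps how many SSE streams are open at once.
// Acquire a permit before creating the emitter; release it when the
// stream completes, times out, or errors.
public class StreamLimiter {

    private final Semaphore permits;

    public StreamLimiter(int maxConcurrentStreams) {
        this.permits = new Semaphore(maxConcurrentStreams);
    }

    /** Returns true if a new stream may start; the caller must release() later. */
    public boolean tryStart() {
        return permits.tryAcquire();
    }

    public void release() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

In the service, call tryStart() before creating the SseEmitter and return HTTP 429 when it fails; hook release() into the emitter's onCompletion, onTimeout, and onError callbacks.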

5. Context Handling

If you're using RAG (as discussed in my knowledge assistant article), make sure:

  • The conversationId is consistent across requests
  • The advisors correctly inject the retrieved context

This is crucial for maintaining conversational continuity.

Final Thoughts

Streaming AI responses is one of those small changes that dramatically improves user experience.

With Spring AI, it's surprisingly easy:

  • Call ChatClient's stream().content() to get a Flux<String> of tokens
  • Bridge that Flux to an SseEmitter
  • Expose it via a text/event-stream endpoint

If you already have a working chat system from my earlier articles, adding streaming is just a small incremental step.


About The Author

I write about cryptography, web security, and secure software development. Creator of practical crypto validation tools at Devglan.
