
Build a Monitoring System for Token Usage, Latency, and Cost in Spring AI


When building real-world AI applications, generating responses with LLMs is just the beginning. The real challenge starts when the application goes to production: how do you monitor token usage, control cost, and ensure observability?

In this article, we will build an LLM usage monitoring system using Spring Boot and Spring AI. This includes:

  • Token usage tracking (prompt + completion)
  • Response time monitoring
  • Billing-ready persistence layer
  • Decorator pattern (no AOP surprises)
  • Async and scalable architecture

We have already built several Spring AI apps in our previous articles, so let's jump straight into building the monitoring system on top of them.

Architecture Overview of the Monitoring System

High-level flow:

  • Spring AI -> generates responses
  • Custom Advisor -> captures usage
  • MySQL -> persists usage + cost
  • Micrometer -> records metrics
  • Actuator -> exposes metrics endpoint
  • Prometheus -> scrapes metrics

Setting Up Spring Boot with Spring AI and Micrometer

Maven Dependencies


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <scope>runtime</scope>
</dependency>

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
    

micrometer-registry-prometheus: exports application metrics in a format that Prometheus can scrape.

Application.yaml Config

Here is the YAML config for the individual components:

LLM Model Config


spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2
          temperature: 0.7
          num-ctx: 4096
      embedding:
        options:
          model: nomic-embed-text
      init:
        pull-model-strategy: never
        timeout: 60s
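
Note that pull-model-strategy: never assumes the models are already available locally (for example, pulled beforehand with ollama pull llama3.2 and ollama pull nomic-embed-text); Spring AI will not download them at startup.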

Spring Boot Actuator Config


management:
  endpoints:
    web:
      exposure:
        include:
          - health          #http://localhost:8080/actuator/health
          - info
          - metrics
          - prometheus
          - llm-usage
      base-path: /actuator
  endpoint:
    health:
      show-details: always
    prometheus:
      enabled: true
    llm-usage:
      enabled: true
  prometheus:
    metrics:
      export:
        enabled: true
  metrics:
    tags:
      application: ${spring.application.name}

/actuator/prometheus: Exposes all metrics collected via Micrometer in the Prometheus text exposition format.

/actuator/llm-usage: Returns aggregated LLM usage + cost data backed by MySQL tables.

DB Structure for Token Usage Records

We have two main tables: llm_usage_records captures per-call token usage, and model_pricing_configs holds the per-model pricing used for cost calculation. A third table, daily_usage_summaries, stores pre-computed daily summaries.


create table daily_usage_summaries (
    id bigint not null auto_increment,
    application_name varchar(128) not null,
    avg_latency_ms float(53),
    failed_calls bigint not null,
    last_computed_at datetime(6),
    model_name varchar(128) not null,
    p95_latency_ms bigint,
    p99_latency_ms bigint,
    provider varchar(64) not null,
    successful_calls bigint not null,
    summary_date date not null,
    total_calls bigint not null,
    total_completion_tokens bigint not null,
    total_cost decimal(18,10),
    total_prompt_tokens bigint not null,
    total_tokens bigint not null,
    primary key (id)
) engine=InnoDB;

create table llm_usage_records (
    id bigint not null auto_increment,
    application_name varchar(128) not null,
    completion_cost decimal(18,10),
    completion_tokens integer not null,
    created_at datetime(6) not null,
    currency varchar(8),
    endpoint varchar(256),
    error_code varchar(64),
    error_message varchar(512),
    first_token_latency_ms bigint,
    latency_ms bigint,
    model_name varchar(128) not null,
    model_version varchar(64),
    operation_type enum ('CHAT','COMPLETION','EMBEDDING','IMAGE_GENERATION','SPEECH_TO_TEXT','TEXT_TO_SPEECH') not null,
    prompt_cost decimal(18,10),
    prompt_tokens integer not null,
    provider varchar(64) not null,
    request_completed_at datetime(6),
    request_id varchar(64),
    request_started_at datetime(6),
    session_id varchar(128),
    status enum ('FAILURE','RATE_LIMITED','SUCCESS','TIMEOUT') not null,
    tags varchar(512),
    total_cost decimal(18,10),
    total_tokens integer not null,
    trace_id varchar(64),
    user_id varchar(128),
    primary key (id)
) engine=InnoDB;

create table model_pricing_configs (
    id bigint not null auto_increment,
    is_active bit,
    completion_cost_per_million decimal(18,10) not null,
    created_at datetime(6),
    currency varchar(8),
    display_name varchar(256),
    model_name varchar(128) not null,
    notes varchar(512),
    prompt_cost_per_million decimal(18,10) not null,
    provider varchar(64) not null,
    updated_at datetime(6),
    primary key (id)
) engine=InnoDB;
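
These tables are generated from JPA entities. For reference, here is a minimal sketch of the entity behind llm_usage_records, assuming Lombok getters/setters; only a subset of the columns above is mapped here, and the full entity also carries provider, operation type, and the remaining fields:


import jakarta.persistence.*;
import lombok.Getter;
import lombok.Setter;

import java.math.BigDecimal;
import java.time.Instant;

@Entity
@Table(name = "llm_usage_records")
@Getter
@Setter
public class LlmUsageRecord {

    public enum CallStatus { SUCCESS, FAILURE, TIMEOUT, RATE_LIMITED }

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "application_name", length = 128, nullable = false)
    private String applicationName;

    @Column(name = "model_name", length = 128, nullable = false)
    private String modelName;

    @Column(name = "prompt_tokens", nullable = false)
    private int promptTokens;

    @Column(name = "completion_tokens", nullable = false)
    private int completionTokens;

    @Column(name = "total_tokens", nullable = false)
    private int totalTokens;

    @Column(name = "latency_ms")
    private Long latencyMs;

    @Column(name = "total_cost", precision = 18, scale = 10)
    private BigDecimal totalCost;

    // Maps to the MySQL enum column ('FAILURE','RATE_LIMITED','SUCCESS','TIMEOUT')
    @Enumerated(EnumType.STRING)
    @Column(nullable = false)
    private CallStatus status;

    @Column(name = "created_at", nullable = false)
    private Instant createdAt;
}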

Spring AI Chat Client

Wires the UsageTrackingAdvisor into the default ChatClient bean. Any code that injects ChatClient will automatically have token usage tracked and persisted without any extra boilerplate.


@Bean
public ChatClient chatClient(OllamaChatModel chatModel, UsageTrackingAdvisor usageTrackingAdvisor) {
	return ChatClient.builder(chatModel)
			.defaultAdvisors(usageTrackingAdvisor)
			.build();
}

Callers can pass extra context for richer tracking via advisor params such as llm.tracker.applicationName, llm.tracker.userId, and llm.tracker.traceId:


public ChatResponse chat(ChatRequest request) {

    String answer = chatClient.prompt()
            .user(request.getMessage())
            .advisors(a -> a
                    .param("llm.tracker.applicationName", "knowledge-service")
                    .param("llm.tracker.userId", "dummy123")
                    .param("llm.tracker.traceId", tracer.currentSpan().context().traceId()))
            .call()
            .content();

    return ChatResponse.builder()
            .answer(answer)
            .conversationId(request.getConversationId())
            .timestamp(Instant.now())
            .build();
}

Capturing Token Usage and Latency Metrics

Use a custom advisor to capture response metadata:


public class UsageTrackingAdvisor implements CallAdvisor, StreamAdvisor {

    @Override
    public String getName() {
        // Required by the Advisor interface
        return "usageTrackingAdvisor";
    }

    @Override
    public int getOrder() {
        // Run last (after security advisors, RAG, etc.) so we capture final token counts
        return Integer.MAX_VALUE - 100;
    }

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
        Instant start = Instant.now();
        try {
            ChatClientResponse response = callAdvisorChain.nextCall(chatClientRequest);
            long latencyMs = Duration.between(start, Instant.now()).toMillis();
            persistLlmUsage(chatClientRequest, response, start, latencyMs,
                    LlmUsageRecord.CallStatus.SUCCESS, null, null);
            return response;
        } catch (Exception ex) {
            long latencyMs = Duration.between(start, Instant.now()).toMillis();
            persistLlmUsage(chatClientRequest, null, start, latencyMs,
                    LlmUsageRecord.CallStatus.FAILURE, ex.getClass().getSimpleName(), ex.getMessage());
            throw ex;
        }
    }
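
Streaming calls bypass adviseCall, so the StreamAdvisor side needs the same bookkeeping. Here is a sketch (my own, assuming the aggregated usage metadata arrives on the final streamed ChatClientResponse):


// Requires: reactor.core.publisher.Flux, java.time.Duration,
// java.util.concurrent.atomic.AtomicReference
@Override
public Flux<ChatClientResponse> adviseStream(ChatClientRequest chatClientRequest,
                                             StreamAdvisorChain streamAdvisorChain) {
    Instant start = Instant.now();
    // Token usage is only complete on the last chunk, so remember the latest response
    AtomicReference<ChatClientResponse> last = new AtomicReference<>();
    return streamAdvisorChain.nextStream(chatClientRequest)
            .doOnNext(last::set)
            .doOnComplete(() -> persistLlmUsage(chatClientRequest, last.get(), start,
                    Duration.between(start, Instant.now()).toMillis(),
                    LlmUsageRecord.CallStatus.SUCCESS, null, null))
            .doOnError(ex -> persistLlmUsage(chatClientRequest, null, start,
                    Duration.between(start, Instant.now()).toMillis(),
                    LlmUsageRecord.CallStatus.FAILURE, ex.getClass().getSimpleName(), ex.getMessage()));
}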

Extracting Usage and Token Metrics

Let's extract the prompt tokens, completion tokens, and other metadata from Spring AI's ChatClientRequest and ChatClientResponse.


private void persistLlmUsage(ChatClientRequest request, ChatClientResponse response,
                             Instant start, long latencyMs, LlmUsageRecord.CallStatus status,
                             String errorCode, String errorMessage) {

    String appName = (String) request.context().getOrDefault(APPLICATION_NAME_KEY, "unknown");
    String userId = (String) request.context().get(USER_ID_KEY);
    String traceId = (String) request.context().get(TRACE_ID_KEY);

    try {
        int promptTokens = 0;
        int completionTokens = 0;
        int totalTokens = 0;
        String modelName = null;
        // response is null on the failure path, so guard before dereferencing
        var chatResponse = response != null ? response.chatResponse() : null;
        if (chatResponse != null) {
            ChatResponseMetadata metadata = chatResponse.getMetadata();
            modelName = metadata.getModel();
            Usage usage = metadata.getUsage();
            promptTokens = usage.getPromptTokens();
            completionTokens = usage.getCompletionTokens();
            totalTokens = usage.getTotalTokens();
        }
        //Build the UsageRecordRequest record
        usageRecordService.record(usageRequest);
    } catch (Exception ex) {
        // Tracking must never break the actual LLM call
        log.warn("Failed to persist LLM usage", ex);
    }
}
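
The UsageRecordRequest built above is elided; based only on the accessors this article actually calls, a hypothetical shape as a Java record could be:


// Hypothetical carrier record: the field list is inferred from the accessors
// used in this article (provider(), modelName(), promptTokens(),
// completionTokens()) plus the context values captured above, not from the
// actual source.
public record UsageRecordRequest(
        String applicationName,
        String provider,
        String modelName,
        int promptTokens,
        int completionTokens,
        int totalTokens,
        long latencyMs,
        LlmUsageRecord.CallStatus status,
        String userId,
        String traceId) {
}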

Tracking Cost with Custom Model Pricing

The TokenCostCalculator service calculates the cost for a given provider/model and token counts, based on the config stored in the model_pricing_configs table.


public CostBreakdown calculate(String provider, String modelName, int promptTokens, int completionTokens) {
    // Fail fast when no active pricing row exists for this provider/model,
    // instead of calling get() on a possibly empty Optional
    ModelPricingConfig pricing = getPricingConfig(provider, modelName)
            .orElseThrow(() -> new IllegalStateException(
                    "No active pricing config for " + provider + "/" + modelName));

    BigDecimal promptCost = BigDecimal.valueOf(promptTokens)
            .multiply(pricing.getPromptCostPerMillion(), PRECISION)
            .divide(ONE_MILLION, 10, RoundingMode.HALF_UP);

    BigDecimal completionCost = BigDecimal.valueOf(completionTokens)
            .multiply(pricing.getCompletionCostPerMillion(), PRECISION)
            .divide(ONE_MILLION, 10, RoundingMode.HALF_UP);

    BigDecimal totalCost = promptCost.add(completionCost).setScale(10, RoundingMode.HALF_UP);

    return new CostBreakdown(promptCost, completionCost, totalCost, pricing.getCurrency());
}
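
For example, with a hypothetical prompt price of 0.50 USD per million tokens, a call that consumed 1,200 prompt tokens costs 1,200 × 0.50 / 1,000,000 = 0.0006 USD; completion tokens are computed the same way at their own rate, and the two parts are summed into the total.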

The UsageRecordService then persists the token usage, together with the calculated cost, into the DB.


public class UsageRecordService {

    private final LlmUsageRecordRepository recordRepository;
    private final TokenCostCalculator costCalculator;
    private final LlmMetricsService metricsService;

    @Transactional
    public LlmUsageRecord record(UsageRecordRequest request) {
        // Compute cost from stored pricing config
        var cost = costCalculator.calculate(
                request.provider(),
                request.modelName(),
                request.promptTokens(),
                request.completionTokens()
        );
        //Convert UsageRecordRequest to LlmUsageRecord
        LlmUsageRecord saved = recordRepository.save(record);

        // Update Micrometer metrics
        metricsService.recordUsage(saved);
        return saved;
    }
}
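
The overview promised an async design: persisting usage on the request thread adds DB latency to every LLM call. One way to keep it off the hot path is a minimal sketch like the following, assuming @EnableAsync is configured; the wrapper class and the executor name llmUsageExecutor are my own assumptions:


import lombok.RequiredArgsConstructor;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class AsyncUsageRecorder {

    private final UsageRecordService usageRecordService;

    // Runs the @Transactional record() on a dedicated thread pool so that
    // usage tracking never slows down the user-facing LLM call
    @Async("llmUsageExecutor")
    public void recordAsync(UsageRecordRequest request) {
        usageRecordService.record(request);
    }
}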

Integrating Micrometer for Metrics Collection


Inside LlmMetricsService, every recorded call updates a pair of counters and a latency timer:


@Autowired
MeterRegistry meterRegistry;

public void recordUsage(long inputTokens, long outputTokens, long latencyMs) {
    meterRegistry.counter("llm.tokens.input").increment(inputTokens);
    meterRegistry.counter("llm.tokens.output").increment(outputTokens);
    meterRegistry.timer("llm.latency").record(latencyMs, TimeUnit.MILLISECONDS);
}
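
These plain counters aggregate across all models. For per-model dashboards, Micrometer tags work well; a small variation (the tag names here are my own choice):


import io.micrometer.core.instrument.Counter;

// Tagged counter: Prometheus exposes one time series per (model, provider) pair
Counter.builder("llm.tokens.input")
        .tag("model", modelName)
        .tag("provider", provider)
        .register(meterRegistry)
        .increment(inputTokens);


Keep tag values low-cardinality (model names, not user IDs), since every distinct tag combination creates a new series in Prometheus.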

Exposing Metrics via Spring Boot Actuator

This custom actuator endpoint can be accessed over HTTP:

http://localhost:8080/actuator/llm-usage

@Component
@Endpoint(id = "llm-usage")
@RequiredArgsConstructor
public class LlmUsageActuatorEndpoint {

    private final DailyUsageSummaryRepository summaryRepository;
    private final LlmUsageRecordRepository recordRepository;

    @ReadOperation
    public Map<String, Object> today() {
        LocalDate today = LocalDate.now(ZoneOffset.UTC);
        return buildSummary("today", today, today);
    }
}
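
The endpoint above only reports today's numbers. Actuator endpoints also support path selectors, so a possible extension (an assumption, not in the original code) can expose an arbitrary day:


// Hypothetical extension: GET /actuator/llm-usage/{date},
// e.g. /actuator/llm-usage/2025-01-15
@ReadOperation
public Map<String, Object> forDate(@Selector String date) {
    LocalDate day = LocalDate.parse(date);
    return buildSummary(date, day, day);
}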

buildSummary pulls the saved summaries from the DB and aggregates them:


private Map<String, Object> buildSummary(String label, LocalDate from, LocalDate to) {
	var summaries = summaryRepository.findBySummaryDateBetweenOrderBySummaryDateAsc(from, to);

	long totalCalls    = summaries.stream().mapToLong(s -> s.getTotalCalls()).sum();
	long totalTokens   = summaries.stream().mapToLong(s -> s.getTotalTokens()).sum();
	long successCalls  = summaries.stream().mapToLong(s -> s.getSuccessfulCalls()).sum();
	long failedCalls   = summaries.stream().mapToLong(s -> s.getFailedCalls()).sum();
	BigDecimal cost    = summaryRepository.sumTotalCost(from, to, null);

	double avgLatency  = summaries.stream()
			.filter(s -> s.getAvgLatencyMs() != null)
			.mapToDouble(s -> s.getAvgLatencyMs())
			.average().orElse(0.0);

	// Per-model breakdown
	var modelBreakdown = recordRepository.aggregateByModel(
			from.atStartOfDay().toInstant(ZoneOffset.UTC),
			to.plusDays(1).atStartOfDay().toInstant(ZoneOffset.UTC)
	);
	//...
}
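
The repository methods used above aren't shown in the article; a plausible sketch of DailyUsageSummaryRepository, where the JPQL is my own assumption matching the entity fields implied by the DDL, could be:


import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface DailyUsageSummaryRepository extends JpaRepository<DailyUsageSummary, Long> {

    // Derived query: all summaries in the date range, oldest first
    List<DailyUsageSummary> findBySummaryDateBetweenOrderBySummaryDateAsc(LocalDate from, LocalDate to);

    // Sums cost over a date range; the optional model filter is skipped when null
    @Query("""
            select coalesce(sum(s.totalCost), 0)
            from DailyUsageSummary s
            where s.summaryDate between :from and :to
              and (:model is null or s.modelName = :model)
            """)
    BigDecimal sumTotalCost(@Param("from") LocalDate from,
                            @Param("to") LocalDate to,
                            @Param("model") String model);
}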

Access metrics:

http://localhost:8080/actuator/prometheus

Visualizing Metrics in Prometheus


scrape_configs:
  - job_name: 'knowledge-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']
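
Once Prometheus scrapes this target, you can query the exported series. Micrometer's Prometheus naming turns llm.tokens.input into llm_tokens_input_total, so, for example, rate(llm_tokens_input_total[5m]) charts prompt-token throughput over time.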

The visualization flow goes like this: Micrometer records the metrics, Actuator exposes them at /actuator/prometheus, Prometheus scrapes and stores them, and a dashboard tool such as Grafana can chart them.


Putting It All Together

Now our system:

  • Captures token usage and latency
  • Calculates cost dynamically
  • Exports metrics via Actuator
  • Visualizes via Prometheus
  • Persists data in MySQL

This setup gives you full visibility into LLM usage and cost, making your Spring AI application production-ready.
