Modern AI search systems rely heavily on semantic search powered by embeddings. Instead of matching keywords, semantic search understands context and meaning.
However, semantic search systems have one significant drawback: they are computationally expensive. Each query typically requires:
- Embedding generation
- Vector similarity search
- Ranking and filtering
If your application receives thousands of similar queries, running the full pipeline repeatedly becomes inefficient.
This is where semantic caching becomes extremely powerful.
Instead of recomputing results, we cache previous results based on semantic similarity.
In this article we will build a production-grade semantic caching architecture using:
- Local cache (Caffeine)
- Redis LSH cache
- Redis Vector cache
- Elasticsearch semantic search
If you are new to Elasticsearch vector search, you may want to read these first:
- Elasticsearch Semantic Search Tutorial
- Spring Boot Elasticsearch Vector Search
- AI Search Engine using Spring Boot
Why Semantic Caching?
Imagine users searching for:
- "how to fix java memory leak"
- "java memory leak troubleshooting"
- "debug memory leak in java"
All three queries are semantically similar. Running the full semantic search pipeline every time is wasteful.
Semantic caching allows us to reuse results from similar queries.
Production Architecture
Below is the optimized architecture used in this implementation.
Why This Architecture Works Well
- Local Cache handles repeated queries instantly
- LSH cache detects semantically similar queries quickly
- Vector cache performs approximate vector similarity search
- Elasticsearch acts as the final semantic search engine
Request Flow
1. Check the local Caffeine cache for an exact (normalized) query match.
2. On a miss, compute the LSH hash and look it up in the Redis LSH cache.
3. On a miss, reuse a cached embedding (or generate one) and run a vector similarity search in the Redis vector cache.
4. On a miss, run the full semantic search in Elasticsearch and backfill all cache layers with the result.
This layered architecture drastically reduces the load on Elasticsearch.
Project Structure
Maven Dependencies
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>5.1.0</version>
</dependency>
<dependency>
    <groupId>com.github.ben-manes.caffeine</groupId>
    <artifactId>caffeine</artifactId>
    <version>3.1.8</version>
</dependency>
<dependency>
    <groupId>co.elastic.clients</groupId>
    <artifactId>elasticsearch-java</artifactId>
    <version>9.2.5</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-client</artifactId>
    <version>9.2.5</version>
</dependency>
```
Configuration
RedisConfig.java
Creates the Redis client connection used across the application.
```java
@Configuration
public class RedisConfig {

    @Autowired
    private RedisProperties redisProperties;

    @Bean
    public JedisPool jedisPool() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(200);
        config.setMaxIdle(50);
        config.setMinIdle(10);
        config.setTestOnBorrow(true);
        config.setTestWhileIdle(true);
        return new JedisPool(config, redisProperties.getHost(), redisProperties.getPort());
    }

    @Bean
    public UnifiedJedis jedis() {
        // Reuse the configured host/port instead of hardcoding localhost
        return new UnifiedJedis("redis://" + redisProperties.getHost() + ":" + redisProperties.getPort());
    }
}
```
CacheConfig.java
Configures the local Caffeine cache.
```java
@Configuration
public class CacheConfig {

    @Bean
    public Cache<String, String> localCache() {
        return Caffeine.newBuilder()
                .maximumSize(10000)
                .expireAfterWrite(Duration.ofMinutes(30))
                .build();
    }
}
```
Embedding Service
The embedding service converts text queries into vector embeddings using the Elasticsearch inference API.
```java
public float[] generate(String text) {
    List<Map<String, JsonData>> docs = new ArrayList<>();
    docs.add(Map.of("text_field", JsonData.of(text)));
    ...
    ...
    float[] arr = new float[vector.size()];
    for (int i = 0; i < vector.size(); i++) {
        arr[i] = vector.get(i).floatValue();
    }
    return arr;
}
```
LSH Hash Service
Locality Sensitive Hashing (LSH) maps similar vectors into the same hash bucket. For simplicity, this implementation uses a fast polynomial hash over the normalized query text, so only queries that normalize to the same string land in the same bucket:
```java
@Service
public class LshHashService {

    // Simple polynomial hash over the normalized query text.
    // Note: this buckets only identical normalized strings; a true LSH
    // scheme would hash the embedding vector itself.
    public long hash(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        long hash = 1125899906842597L;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return hash;
    }
}
```
This gives fast approximate matching for query variants that collapse to the same normalized text.
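If you want to bucket genuinely similar queries rather than identical strings, the string hash above can be swapped for a random-hyperplane LSH over the embedding vector itself. The sketch below is not part of the original implementation; the dimension, bit count, and seed are illustrative. Each signature bit records which side of a random hyperplane the vector falls on, so vectors with high cosine similarity tend to share most bits:

```java
import java.util.Random;

public class RandomHyperplaneLsh {

    private final float[][] hyperplanes; // one random hyperplane per signature bit

    public RandomHyperplaneLsh(int dims, int bits, long seed) {
        Random rnd = new Random(seed); // fixed seed keeps buckets stable across restarts
        hyperplanes = new float[bits][dims];
        for (int i = 0; i < bits; i++) {
            for (int j = 0; j < dims; j++) {
                hyperplanes[i][j] = (float) rnd.nextGaussian();
            }
        }
    }

    // Build a bit signature: bit i is 1 if the vector lies on the
    // positive side of hyperplane i.
    public long hash(float[] vector) {
        long signature = 0L;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < vector.length; j++) {
                dot += hyperplanes[i][j] * vector[j];
            }
            if (dot >= 0) {
                signature |= (1L << i);
            }
        }
        return signature;
    }
}
```

Because the signature depends only on the sign of each projection, scaling a vector by a positive constant never changes its bucket, which is the behavior you want for cosine similarity.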
Local Cache Layer
The fastest cache layer using Caffeine.
If an identical query was executed recently, the response is returned instantly.
```java
@Service
public class LocalCacheService {

    @Autowired
    private Cache<String, String> cache;

    public String get(String key) {
        return cache.getIfPresent(key);
    }

    public void put(String key, String value) {
        cache.put(key, value);
    }
}
```
Redis LSH Cache
This cache stores mappings from LSH hash -> cached results.
If a new query maps to an existing hash bucket, we can reuse results.
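The LshCacheService itself is a thin wrapper around Redis string commands. A minimal sketch, assuming a hypothetical `lsh:` key prefix and a 30-minute TTL (both illustrative, not from the original implementation):

```java
@Service
public class LshCacheService {

    private static final long TTL_SECONDS = 1800; // illustrative TTL

    @Autowired
    private UnifiedJedis jedis;

    // Hypothetical key scheme: one Redis string per LSH bucket
    static String bucketKey(long hash) {
        return "lsh:" + hash;
    }

    public String get(long hash) {
        return jedis.get(bucketKey(hash));
    }

    public void store(long hash, String response) {
        jedis.setex(bucketKey(hash), TTL_SECONDS, response);
    }
}
```

The TTL matters here: without it, stale search results would be served indefinitely for every query that maps into the bucket.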
Redis Vector Cache
If LSH cache misses, we run a vector similarity search in Redis.
```java
@Service
public class VectorCacheService {

    @Autowired
    private JedisPool jedisPool;

    @Autowired
    private UnifiedJedis jedis;

    public String search(float[] vector) {
        byte[] vec = VectorUtil.toBytes(vector);
        Query q = new Query("*=>[KNN 1 @vector $vec AS score]")
                .addParam("vec", vec)
                .setSortBy("score", true)
                .returnFields("response", "score")
                .dialect(2);
        SearchResult result = jedis.ftSearch("semantic_cache_idx", q);
        if (result.getTotalResults() == 0) {
            return null;
        }
        Document doc = result.getDocuments().get(0);
        return (String) doc.get("response");
    }

    public void store(String query, float[] vector, String response) {
        String key = "cache:" + UUID.randomUUID();
        // Use a pooled connection (named 'conn' to avoid shadowing the UnifiedJedis field)
        try (Jedis conn = jedisPool.getResource()) {
            Map<String, String> map = new HashMap<>();
            map.put("query", query);
            map.put("response", response);
            conn.hset(key, map);
            conn.hset(key.getBytes(), "vector".getBytes(), VectorUtil.toBytes(vector));
        }
    }
}
```
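The `VectorUtil.toBytes` helper referenced above is not shown in the article; a minimal version might look like this, assuming Redis stores the embedding as a raw little-endian FLOAT32 blob:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class VectorUtil {

    private VectorUtil() {}

    // Redis vector fields store FLOAT32 values as a raw byte blob;
    // little-endian byte order is assumed here.
    public static byte[] toBytes(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    public static float[] fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```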
Redis uses an HNSW index for fast approximate nearest-neighbor search.
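The `semantic_cache_idx` index must exist before the KNN query above will return results. A sketch of creating it with Jedis, assuming 768-dimensional embeddings and cosine distance (adjust both to your embedding model):

```java
import java.util.Map;
import redis.clients.jedis.UnifiedJedis;
import redis.clients.jedis.search.FTCreateParams;
import redis.clients.jedis.search.IndexDataType;
import redis.clients.jedis.search.schemafields.TextField;
import redis.clients.jedis.search.schemafields.VectorField;

public class CacheIndexInitializer {

    public static void createIndex(UnifiedJedis jedis) {
        jedis.ftCreate("semantic_cache_idx",
                FTCreateParams.createParams()
                        .on(IndexDataType.HASH)
                        .addPrefix("cache:"), // matches the key prefix used by VectorCacheService
                TextField.of("response"),
                VectorField.builder()
                        .fieldName("vector")
                        .algorithm(VectorField.VectorAlgorithm.HNSW)
                        .attributes(Map.of(
                                "TYPE", "FLOAT32",
                                "DIM", 768, // assumed embedding dimension
                                "DISTANCE_METRIC", "COSINE"))
                        .build());
    }
}
```

With HASH indexing and the `cache:` prefix, every hash written by `store()` is picked up by the index automatically.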
Elasticsearch Semantic Search
If no cache layers return results, we execute the full semantic search in Elasticsearch. This is the same query that we built in our last article.
```java
// The Java client expects the query vector as a List<Float>
List<Float> queryVector = new ArrayList<>(vector.length);
for (float v : vector) {
    queryVector.add(v);
}

SearchResponse<Map> response = client.search(s -> s
        .index("documents")
        .knn(k -> k
            .field("content_vector")
            .queryVector(queryVector)
            .k(5)
            .numCandidates(50)
        ),
    Map.class // document type for the hits
);
```
Hybrid Semantic Search Service
This service orchestrates the entire pipeline.
```java
@Service
public class HybridSemanticSearchService {

    @Autowired
    private LocalCacheService localCache;
    @Autowired
    private LshCacheService lshCache;
    @Autowired
    private EmbeddingCacheService embeddingCache;
    @Autowired
    private VectorCacheService vectorCache;
    @Autowired
    private EmbeddingService embeddingService;
    @Autowired
    private ElasticsearchService esSearch;
    @Autowired
    private QueryNormalizer normalizer;
    @Autowired
    private LshHashService hashService;

    public String search(String query) throws Exception {
        String normalized = normalizer.normalize(query);

        // 1. Exact match in the local Caffeine cache
        String result = localCache.get(normalized);
        if (Objects.nonNull(result)) {
            return result;
        }

        // 2. LSH bucket lookup in Redis
        long hash = hashService.hash(normalized);
        result = lshCache.get(hash);
        if (Objects.nonNull(result)) {
            localCache.put(normalized, result);
            return result;
        }

        // 3. Reuse a cached embedding, or generate a new one
        float[] vector = embeddingCache.get(normalized);
        if (Objects.isNull(vector)) {
            vector = embeddingService.generate(query);
            embeddingCache.store(normalized, vector);
        }

        // 4. Approximate KNN lookup in the Redis vector cache
        result = vectorCache.search(vector);
        if (Objects.nonNull(result)) {
            localCache.put(normalized, result);
            lshCache.store(hash, result);
            return result;
        }

        // 5. Full semantic search in Elasticsearch, then backfill all layers
        result = esSearch.search(query, vector);
        if (Objects.nonNull(result)) {
            localCache.put(normalized, result);
            lshCache.store(hash, result);
            vectorCache.store(normalized, vector, result);
        }
        return result;
    }
}
```
Performance Improvements
| Search Type | Latency |
|---|---|
| Direct Elasticsearch | 120-300 ms |
| Redis Vector Cache | 10-20 ms |
| LSH Cache | 2-5 ms |
| Local Cache | <1 ms |
With layered caching, most queries are served within a few milliseconds.
Conclusion
Semantic caching dramatically improves the scalability of AI search systems.
By combining:
- Local cache
- LSH semantic grouping
- Redis vector similarity search
- Elasticsearch semantic search
we get a highly optimized architecture capable of handling large query volumes with extremely low latency.
This architecture is similar to what many large-scale AI search platforms use internally.