# Milvus Semantic Cache
The Milvus cache backend provides persistent, distributed semantic caching using the Milvus vector database. This is the recommended solution for production deployments requiring high availability, scalability, and data persistence.
## Overview
Milvus cache is ideal for:
- Production environments with high availability requirements
- Distributed deployments across multiple instances
- Large-scale applications with millions of cached queries
- Persistent storage requirements where cache survives restarts
- Advanced vector operations and similarity search optimization
## Architecture

The router acts as a Milvus client: it connects to the Milvus service over gRPC (port 19530 by default), writes each request's embedding and cached response into the configured collection, and answers new requests by running a vector similarity search against those entries.
## Configuration
### Milvus Backend Configuration
Configure in `config/cache/milvus.yaml`:

```yaml
# config/cache/milvus.yaml
connection:
  host: "localhost"
  port: 19530
  auth:
    enabled: false
    username: ""
    password: ""
  tls:
    enabled: false

collection:
  name: "semantic_cache"
  dimension: 384  # Must match embedding model dimension
  index_type: "IVF_FLAT"
  metric_type: "COSINE"
  nlist: 1024

performance:
  search_params:
    nprobe: 10
  insert_batch_size: 1000
  search_batch_size: 100

development:
  drop_collection_on_startup: false
  auto_create_collection: true
  log_level: "info"
```
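The collection and performance settings above map directly onto standard Milvus concepts. As a rough illustration (not the router's actual implementation), the `pymilvus` sketch below shows what a collection with this dimension, index type, metric, and `nprobe` corresponds to; the field names (`id`, `embedding`) are assumptions made only for the example.

```python
# Illustrative sketch of the Milvus objects described by the milvus.yaml settings.
# Field names are assumptions; the router manages its own collection schema.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")  # connection.host / connection.port

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),  # collection.dimension
])
cache = Collection("semantic_cache", schema)  # collection.name

# collection.index_type, metric_type, and nlist
cache.create_index("embedding", {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024},
})
cache.load()

# performance.search_params.nprobe: how many IVF clusters are scanned per query;
# larger values trade speed for recall.
hits = cache.search(
    data=[[0.0] * 384],  # a query embedding
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=1,
)
```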
## Setup and Deployment

### 1. Start Milvus Service

```bash
# Using Docker
make start-milvus

# Verify Milvus is running
curl http://localhost:19530/health
```
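If you also want to verify the gRPC port the router itself will use (19530), a quick `pymilvus` probe such as the one below works; this is a convenience sketch, not part of the project's tooling.

```python
# Connectivity probe against Milvus's gRPC endpoint (the port the router uses).
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")
print("Milvus server version:", utility.get_server_version())
print("Existing collections:", utility.list_collections())
```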
### 2. Configure Semantic Router

**Basic Milvus Configuration:**

- Set `backend_type: "milvus"` in `config/config.yaml`
- Set `backend_config_path: "config/cache/milvus.yaml"` in `config/config.yaml`

```yaml
# config/config.yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8
  ttl_seconds: 7200
```
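The `similarity_threshold` is compared against the cosine similarity between the embedding of an incoming prompt and the embeddings of previously cached prompts; only matches at or above the threshold are served from Milvus. The sketch below illustrates that gating logic. The embedding model (`all-MiniLM-L6-v2`) is an assumption chosen only because it produces 384-dimensional vectors, matching the `dimension` in `milvus.yaml`; it is not necessarily the model the router uses.

```python
# Illustrates how a 0.8 cosine-similarity threshold decides cache reuse.
# Model choice is an assumption; any 384-dim sentence embedding model behaves similarly.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

cached_prompt = "What is machine learning?"
new_prompt = "Explain machine learning"

a, b = model.encode([cached_prompt, new_prompt])
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

SIMILARITY_THRESHOLD = 0.8  # mirrors semantic_cache.similarity_threshold
print(f"cosine similarity = {similarity:.3f}")
print("cache hit" if similarity >= SIMILARITY_THRESHOLD else "cache miss")
```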
### 3. Run the Router and Envoy

**Run Semantic Router:**

```bash
# Start router
make run-router
```

**Run Envoy Proxy:**

```bash
# Start Envoy proxy
make run-envoy
```
### 4. Test Milvus Cache

```bash
# Send identical requests to see cache hits
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Send similar request (should hit cache due to semantic similarity)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain machine learning"}]
  }'
```
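A simple way to confirm the second request was served from the cache is to compare end-to-end latency: a hit skips the upstream model call and returns noticeably faster. The script below is a convenience sketch that sends the same two prompts through the router at `http://localhost:8080`, as in the curl examples above.

```python
# Rough cache-hit check: the semantically similar second request should be much faster.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"

def ask(content: str) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={
        "model": "auto",
        "messages": [{"role": "user", "content": content}],
    })
    resp.raise_for_status()
    return time.perf_counter() - start

first = ask("What is machine learning?")
second = ask("Explain machine learning")  # should be answered from the Milvus cache
print(f"first request:  {first:.2f}s")
print(f"second request: {second:.2f}s (expect a large drop on a cache hit)")
```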
## Next Steps
- In-Memory Cache - Compare with in-memory caching
- Cache Overview - Learn semantic caching concepts
- Observability - Monitor Milvus performance