# In-Memory Semantic Cache

The in-memory cache backend provides fast, local caching for development environments and single-instance deployments. It stores semantic embeddings and cached responses directly in memory for maximum performance.

## Overview

The in-memory cache is ideal for:
- Development and testing environments
- Single-instance deployments
- Quick prototyping and experimentation
- Low-latency requirements where external dependencies should be minimized

## Configuration

### Basic Configuration

```yaml
# config/config.yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"
```

### Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable/disable semantic caching |
| `backend_type` | string | `"memory"` | Cache backend type (must be `"memory"`) |
| `similarity_threshold` | float | `0.8` | Minimum similarity for cache hits (0.0-1.0) |
| `max_entries` | integer | `1000` | Maximum number of cached entries |
| `ttl_seconds` | integer | `3600` | Time-to-live for cache entries (seconds; `0` = no expiration) |
| `eviction_policy` | string | `"fifo"` | Eviction policy: `"fifo"`, `"lru"`, or `"lfu"` |
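
To make `similarity_threshold` and `ttl_seconds` concrete, here is a minimal sketch of a lookup path, assuming cosine similarity over query embeddings; all type and function names are illustrative and do not reflect the router's actual internals:

```go
package cache

import (
	"math"
	"time"
)

// entry is a hypothetical cached item: the stored query's embedding
// plus the response it produced.
type entry struct {
	embedding []float64
	response  string
	storedAt  time.Time
}

// lookup returns a cached response when an entry's similarity to the
// query embedding meets the threshold and the entry has not expired.
func lookup(entries []entry, query []float64, threshold float64, ttl time.Duration) (string, bool) {
	now := time.Now()
	for _, e := range entries {
		if ttl > 0 && now.Sub(e.storedAt) > ttl {
			continue // expired under ttl_seconds; skip
		}
		if cosine(query, e.embedding) >= threshold {
			return e.response, true // semantic cache hit
		}
	}
	return "", false // miss: the caller forwards the request to a model
}

// cosine computes the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

With a threshold of 0.8, a query whose embedding scores 0.86 against a cached entry is served from the cache, while one scoring 0.74 falls through to the model.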

## Environment Examples

### Development Environment

```yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.9  # Strict matching for testing
  max_entries: 500           # Small cache for development
  ttl_seconds: 1800          # 30 minutes
  eviction_policy: "fifo"
```

## Setup and Testing

### 1. Enable In-Memory Cache

Update your configuration file. If `config/config.yaml` already defines a `semantic_cache` block, edit it in place; appending a second copy would produce a duplicate top-level key:

```bash
# Append the cache settings to config/config.yaml
cat >> config/config.yaml << EOF
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
EOF
```
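
As a quick sanity check, you can confirm the file still parses as YAML (this assumes Python 3 with PyYAML is installed; any YAML linter works equally well):

```bash
# Fails with a parse error if the edit broke the file
python3 -c "import yaml; yaml.safe_load(open('config/config.yaml')); print('config OK')"
```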

### 2. Start the Router

```bash
# Start the semantic router
make run-router

# Or run directly
./bin/router --config config/config.yaml
```

### 3. Test Cache Functionality

Send identical, then semantically similar, requests to verify both exact and semantic cache hits:
```bash
# First request (cache miss)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Second identical request (cache hit)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Similar request (semantic cache hit)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain machine learning concepts"}]
  }'
```
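
Cache hits are easiest to observe as a latency drop. The loop below uses curl's built-in `%{time_total}` timer to compare two identical requests; the second should come back noticeably faster once the first has populated the cache:

```bash
BODY='{"model": "auto", "messages": [{"role": "user", "content": "What is machine learning?"}]}'

# Request 1 populates the cache (miss); request 2 should be served from it (hit)
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$BODY"
done
```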

## Advantages

- **Ultra-low latency**: Direct memory access, no network overhead
- **Simple setup**: No external dependencies required
- **High throughput**: Can handle thousands of cache operations per second
- **Immediate availability**: Cache is ready as soon as the router starts

## Limitations

- **Volatile storage**: Cache is lost when the router restarts
- **Single instance**: Cannot be shared across multiple router instances
- **Memory constraints**: Limited by available system memory
- **No persistence**: No data recovery after crashes

## Memory Management

### Automatic Cleanup

The in-memory cache automatically manages memory through the following mechanisms, sketched in code below:

- **TTL expiration**: Entries are removed after `ttl_seconds`
- **Eviction**: When `max_entries` is reached, entries are evicted according to the configured `eviction_policy` (for example, least recently used under `"lru"`)
- **Periodic cleanup**: Expired entries are cleaned up every `cleanup_interval_seconds`
- **Memory pressure**: Aggressive cleanup kicks in when usage approaches `memory_limit_mb`
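
As a rough illustration of how bounded eviction and the periodic sweep typically interact, here is a hypothetical store with FIFO eviction (the default `eviction_policy`); the names mirror the config keys above, not the router's actual code:

```go
package cache

import (
	"container/list"
	"sync"
	"time"
)

// item records when a key was cached so the sweep can expire it.
type item struct {
	key      string
	storedAt time.Time
}

// store keeps insertion order in a queue (oldest at the front), which
// gives FIFO eviction; an LRU policy would additionally move an element
// to the back of the queue on every cache hit.
type store struct {
	mu         sync.Mutex
	maxEntries int           // max_entries
	ttl        time.Duration // ttl_seconds
	queue      *list.List    // *item values, oldest first
	byKey      map[string]*list.Element
}

// add inserts a key, evicting the oldest entry once max_entries is reached.
func (s *store) add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.queue.Len() >= s.maxEntries {
		oldest := s.queue.Front()
		delete(s.byKey, oldest.Value.(*item).key)
		s.queue.Remove(oldest)
	}
	s.byKey[key] = s.queue.PushBack(&item{key: key, storedAt: time.Now()})
}

// sweep drops expired entries; a background goroutine would call it
// on a time.Ticker every cleanup_interval_seconds.
func (s *store) sweep() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for e := s.queue.Front(); e != nil; {
		next := e.Next()
		if s.ttl > 0 && time.Since(e.Value.(*item).storedAt) > s.ttl {
			delete(s.byKey, e.Value.(*item).key)
			s.queue.Remove(e)
		}
		e = next
	}
}
```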

## Next Steps

- Milvus Cache - Set up persistent, distributed caching
- Cache Overview - Learn about semantic caching concepts
- Observability - Monitor cache performance