Overview

MySafeCache employs a dual-caching strategy that combines the speed of exact matching with the intelligence of semantic similarity. You get the best of both worlds: near-instant responses when an exact duplicate exists, and semantically similar matches when one doesn't.

Dual Caching Architecture

Exact Caching

How It Works

Exact caching uses SHA-256 hashing to create unique identifiers for message combinations. When a request comes in, MySafeCache:
  1. Generates a hash from the message array
  2. Checks Redis for an exact match
  3. Returns the cached response if found
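
Conceptually, the exact-cache key is a hash of the serialized message array. The following sketch is illustrative rather than MySafeCache's actual implementation; it assumes messages are serialized as canonical JSON before hashing:
import hashlib
import json

def exact_cache_key(messages):
    # Serialize deterministically (sorted keys, no extra whitespace)
    # so byte-identical inputs always produce the same key
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical content yields identical keys...
assert exact_cache_key([{"role": "user", "content": "What is Docker?"}]) == \
       exact_cache_key([{"role": "user", "content": "What is Docker?"}])

# ...while any byte-level difference (here, capitalization) produces a new key
assert exact_cache_key([{"role": "user", "content": "What is Docker?"}]) != \
       exact_cache_key([{"role": "user", "content": "what is docker?"}])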

Benefits

  • Ultra-fast: 1-5ms response times
  • Deterministic: Same input always returns same output
  • Resource efficient: Minimal computational overhead

Example

// These two requests will hit the exact cache
Request 1: [{"role": "user", "content": "What is Docker?"}]
Request 2: [{"role": "user", "content": "What is Docker?"}]

// This will NOT hit exact cache (different capitalization)
Request 3: [{"role": "user", "content": "what is docker?"}]

When Exact Caching Works Best

Repeated Queries

Applications that frequently ask identical questions

Template-based Prompts

Systems using consistent prompt templates

FAQ Systems

Knowledge bases with standard questions

Fixed Workflows

Automated processes with predictable inputs

Semantic Caching

How It Works

Semantic caching uses vector embeddings to find similar queries:
  1. Converts messages to vector embeddings using OpenAI’s embedding models
  2. Stores embeddings in Qdrant vector database
  3. Performs similarity search using cosine similarity
  4. Returns matches above a configurable similarity threshold (default: 0.85)
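
To make step 3 concrete, cosine similarity between two embedding vectors can be computed as in this generic sketch (using numpy; this is not MySafeCache's internal code):
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    # Cosine similarity: dot product normalized by vector magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_semantic_hit(query_embedding, cached_embedding, threshold=0.85):
    # A cached entry counts as a match when its similarity to the
    # incoming query's embedding clears the configured threshold
    return cosine_similarity(query_embedding, cached_embedding) >= threshold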

Benefits

  • Intelligent matching: Finds similar queries regardless of exact wording
  • Flexible: Handles paraphrasing and variations
  • Learning: Gets better as more data is cached

Example

// These queries will likely hit semantic cache
Original: "What is Docker?"
Variations that match:
- "Can you explain Docker to me?"
- "Tell me about Docker technology"
- "How does Docker work?"
- "What's Docker used for?"

Similarity Thresholds

| Threshold | Use Case | Trade-off |
| --- | --- | --- |
| 0.95+ | High precision | Fewer matches; only very similar queries |
| 0.85-0.94 | Balanced (default) | Good mix of precision and recall |
| 0.75-0.84 | High recall | More matches, but potentially less relevant |
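
If your deployment exposes a per-request threshold on the /check endpoint, a high-precision lookup might look like the sketch below. Note that the similarity_threshold field name here is hypothetical; consult the API reference for the actual parameter:
import requests

response = requests.post(
    "https://api.mysafecache.com/api/v1/check",
    headers={"Authorization": "Bearer your-api-key"},
    json={
        "messages": [{"role": "user", "content": "What is Docker?"}],
        "similarity_threshold": 0.95,  # hypothetical parameter name
    },
)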

When Semantic Caching Works Best

Natural Language Queries

User-generated questions with variations

Chatbots

Conversational AI with paraphrased questions

Search Systems

Knowledge retrieval with flexible queries

Content Generation

Similar creative requests with variations

Optimization Strategies

1. Cache Warming

Pre-populate your cache with common queries:
import requests

common_queries = [
    "What is artificial intelligence?",
    "How does machine learning work?",
    "Explain deep learning",
    "What are neural networks?",
    # Add your common queries
]

def warm_cache(queries, api_key):
    for query in queries:
        # Check if already cached
        check_response = requests.post(
            "https://api.mysafecache.com/api/v1/check",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"messages": [{"role": "user", "content": query}]}
        )
        
        if not check_response.json()["cache_hit"]:
            # Generate response and cache it
            llm_response = get_llm_response(query)  # Your LLM call
            
            requests.post(
                "https://api.mysafecache.com/api/v1/store",
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "messages": [{"role": "user", "content": query}],
                    "answer": llm_response,
                    "model": "gpt-4"
                }
            )

warm_cache(common_queries, "your-api-key")

2. Prompt Standardization

Standardize prompts to increase exact cache hits:
# Ad-hoc variations like these all miss each other's exact cache:
variations = [
    "Summarize this text: {text}",
    "Can you summarize: {text}",
    "Please provide a summary of: {text}",
    "Give me a summary for: {text}",
]

# Route every summarization request through one canonical template instead:
SUMMARIZE_TEMPLATE = "Summarize this text: {text}"

def build_summarize_prompt(text):
    return SUMMARIZE_TEMPLATE.format(text=text)

3. Message Array Consistency

Keep message structures consistent:
# These won't hit the exact cache: the arrays differ, so their hashes differ
messages1 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Docker?"}
]

messages2 = [
    {"role": "user", "content": "What is Docker?"}
]

# Pick one structure (e.g., always include the system message) and use it everywhere

Performance Tuning

Cache Hit Rate Optimization

Monitor and optimize your cache hit rate:
import requests

def analyze_cache_performance(api_key):
    response = requests.get(
        "https://api.mysafecache.com/api/v1/usage",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    
    stats = response.json()
    
    print(f"Total requests: {stats['total_requests']}")
    print(f"Cache hit rate: {stats['hit_rate_percentage']:.1f}%")
    print(f"Exact hits: {stats['exact_hits']}")
    print(f"Semantic hits: {stats['semantic_hits']}")
    
    # Recommendations
    if stats['hit_rate_percentage'] < 30:
        print("💡 Consider standardizing prompts for better exact matching")
    elif stats['semantic_hits'] > stats['exact_hits']:
        print("💡 Semantic cache is working well, consider prompt templates")

Storage Optimization

  • Shorter messages cache more efficiently; consider breaking long prompts into reusable components.
  • Store high-quality responses that will be useful for similar queries; poor responses reduce semantic matching effectiveness.
  • For time-sensitive content, implement your own expiration logic, for example by including timestamps in queries (see the sketch below).
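
One way to approximate expiration (a sketch, not a built-in MySafeCache feature) is to embed a coarse date bucket in the query so the exact-cache key rotates on its own:
from datetime import date

def time_bucketed_messages(query, bucket=None):
    # Embedding a coarse date bucket in the content changes the
    # exact-cache key each day, so stale answers age out naturally
    bucket = bucket or date.today().isoformat()
    return [{"role": "user", "content": f"[as of {bucket}] {query}"}]

messages = time_bucketed_messages("What is the latest Docker release?")

Keep the bucket as coarse as your freshness requirements allow; the extra text also shifts the embedding slightly, which can affect semantic matching.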

Cache Strategy Selection

Choose the right strategy based on your use case:
| Use Case | Recommended Strategy | Why |
| --- | --- | --- |
| FAQ Bot | Focus on exact caching | Repeated identical questions |
| Research Assistant | Balanced approach | Mix of similar and exact queries |
| Content Generation | Semantic-heavy | Creative variations |
| API Documentation | Exact caching | Consistent technical queries |
| Customer Support | Balanced approach | Similar issues, different wording |

Monitoring and Analytics

Track cache performance with built-in analytics:
import requests

def get_detailed_analytics(api_key):
    response = requests.get(
        "https://api.mysafecache.com/api/v1/analytics",
        headers={"Authorization": f"Bearer {api_key}"}
    )
    
    analytics = response.json()
    
    return {
        "cache_efficiency": analytics["hit_rate_percentage"],
        "average_response_time": analytics["average_lookup_time_ms"],
        "cost_savings": analytics["estimated_savings"],
        "top_queries": analytics["popular_queries"]
    }

Best Practices

Standardize Prompts

Use consistent prompt templates to maximize exact cache hits

Monitor Performance

Regularly check analytics to optimize cache strategy

Warm the Cache

Pre-populate with common queries during off-peak hours

Quality Control

Only cache high-quality responses to maintain semantic matching accuracy

Next Steps