0004 LLM Caching Strategy#

Status#

Provisional

Context#

As the number of AI-powered workflows in the platform grows, so does the volume and cost of LLM API calls. Repeated or structurally similar requests — such as summarizing the same content, generating questions for a shared lesson, or running system-prompted workflows — represent an opportunity for caching to reduce both latency and API spend.

Two distinct caching mechanisms are available through LiteLLM, and they operate at different layers:

LiteLLM Proxy Response Caching — LiteLLM intercepts requests and returns a previously stored full response when an identical request is detected (same model, messages, temperature, etc.). This is a client-side cache from the LLM provider’s perspective. Supported backends include: in-memory, disk, Redis, S3, GCS, and semantic (Qdrant/Redis). Cache behaviour can be controlled per-request using ttl, s-maxage, no-cache, no-store, and namespace parameters.

Critical limitation: This mechanism does not support the OpenAI /responses API endpoint (responses call type). It only works for /chat/completions, /completions, /embeddings, and /audio/transcriptions. Any workflow that relies on the Responses API cannot benefit from this cache layer.
Provider-Side Prompt Caching — The LLM provider (OpenAI, Anthropic, Bedrock, Deepseek) caches the KV-tensor computation for the prompt prefix. The API is still called, but token processing cost and latency are reduced for cache hits. LiteLLM transparently surfaces cache usage through a normalised prompt_tokens_details.cached_tokens field in the response.

Provider behaviour differs:
- OpenAI: Automatic for any prompt with 1,024+ tokens. No code changes needed. Optional prompt_cache_key and prompt_cache_retention parameters allow routing hints and TTL control ("in_memory" ≈ 5–10 min vs "24h").
- Anthropic: Opt-in. Specific message blocks must be annotated with "cache_control": {"type": "ephemeral"}. Anthropic charges for cache-write tokens (cache_creation_input_tokens) in addition to normal input tokens, so it is only cost-effective when the same annotated content is reused across multiple requests.
- Bedrock / Deepseek: Follow the same pattern as OpenAI (automatic or opt-in depending on the model).
LiteLLM can automatically inject cache_control annotations via cache_control_injection_points without requiring application code changes.

The combination of these two mechanisms must be evaluated against the call types actually used in the platform. Currently several workflows rely on the OpenAI Responses API, which makes LiteLLM proxy caching unavailable for those paths.

Decision#

No single caching strategy is adopted globally. Instead, the following approach is proposed as a baseline to be refined as usage patterns become clearer:

Proxy response caching (Redis-backed) is enabled only for call types that are compatible with it (acompletion, atext_completion, aembedding). The supported_call_types setting in cache_params must explicitly exclude responses until LiteLLM adds support for that endpoint.
Prompt caching is enabled where possible at the provider level:
- For OpenAI-backed workflows, no code change is needed beyond ensuring system prompts exceed 1,024 tokens or structuring requests so the cacheable prefix is long enough.
- For Anthropic-backed workflows, cache_control annotations are added to large, stable system-prompt blocks. Cost implications (write charges) must be assessed per workflow before enabling.
- LiteLLM’s cache_control_injection_points feature is preferred over manual annotation when the system message is the primary caching target.
Cache TTL defaults to 600 seconds for proxy caching. Workflows with highly dynamic outputs (e.g., chat with real-time context) should set no-store on individual requests.
Cache hits and cached_tokens values are monitored via LiteLLM’s /cache/ping health endpoint and usage objects to validate cost savings.

Consequences#

Workflows using /chat/completions and /embeddings benefit from full response deduplication via Redis, reducing redundant API calls for repeated identical requests.
Workflows using the OpenAI Responses API receive no proxy-level caching benefit until LiteLLM adds responses to its supported cache call types.
Provider-side prompt caching provides latency and cost reductions for large, stable system prompts regardless of call type, including the Responses API.
Enabling Anthropic prompt caching without usage analysis may increase costs due to cache-write token charges; this must be verified per workflow.
Introducing Redis as a cache backend adds an infrastructure dependency that operators must provision and maintain.

0004 LLM Caching Strategy

Contents

0004 LLM Caching Strategy#

Status#

Context#

Decision#

Consequences#