Prompt caching: the 90% cost reduction nobody talks about

Anthropic's ephemeral cache is not new. It has been available since 2024. But most teams using Claude in production have not implemented it correctly, and are paying full price for input tokens that could cost 60-90% less.

This is not because caching is difficult. It is because placement is counterintuitive, and because the documentation is technically correct without being practically useful.

This article explains the mechanics, the placement pattern, and what can and cannot actually be cached.

What is ephemeral cache

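Anthropic's ephemeral cache is a server-side caching mechanism that stores parts of your prompt on Anthropic's infrastructure for up to five minutes. When you send the same content again within those five minutes, the cached tokens are billed at a reduced cache-read rate instead of the full input price. You still pay the normal rate for any input that is not cached and for output tokens, and the first call that writes the cache is billed slightly above the normal input rate.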

Concrete discount: typically 90% on cached input tokens. If you have a 5,000-token system prompt sent on every call, and you cache it, you pay the cache-read price (typically 10% of normal input price) instead of the full input price.
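The arithmetic can be sketched as follows. The base price of $3 per million input tokens is a placeholder rather than a quoted price, and the 1.25x multiplier is an assumption standing in for the premium charged when a five-minute cache entry is first written. Check the current pricing page before relying on either number.

```typescript
// All prices are illustrative assumptions; substitute your model's real rates.
const BASE_PRICE_PER_MTOK = 3.0;     // hypothetical USD per million input tokens
const CACHE_READ_MULTIPLIER = 0.1;   // cache reads at roughly 10% of the base price
const CACHE_WRITE_MULTIPLIER = 1.25; // assumed premium on the call that writes the cache

type CacheState = "uncached" | "write" | "read";

// Cost in USD of sending `tokens` input tokens in the given cache state.
function inputCostUsd(tokens: number, state: CacheState): number {
  const multiplier =
    state === "read" ? CACHE_READ_MULTIPLIER :
    state === "write" ? CACHE_WRITE_MULTIPLIER :
    1;
  return (tokens / 1e6) * BASE_PRICE_PER_MTOK * multiplier;
}

const promptTokens = 5000;
const fullPrice = inputCostUsd(promptTokens, "uncached");
const firstCall = inputCostUsd(promptTokens, "write");
const laterCalls = inputCostUsd(promptTokens, "read");
console.log({ fullPrice, firstCall, laterCalls });
```

Under these assumptions a cache entry that is written but never read costs more than not caching at all, so the saving only starts with the second call inside the window.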

For systems with long, static system prompts, this is transformative.

Caching is not an optimisation. It is a fundamental part of the cost model for AI systems in production. A system that sends 5,000-token instructions on every call without caching is not cost-optimised — it is not set up correctly.

The 1,024-token threshold

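The most important technical detail is the minimum size: the prompt up to your cache marker must be at least 1,024 tokens for caching to activate. The exact minimum depends on the model, and some models require more, so check the threshold for the model you use.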

This means you cannot cache a short 200-token system prompt and expect a discount. You must either:

Have sufficient content in your cached block (1,024+ tokens), or
Structure your prompt so that the repeatable parts are consolidated into one block above the threshold.

In practice, this is rarely a problem for enterprise systems. System prompts with domain context, examples (few-shot), and instructions are typically well above 1,024 tokens.

For simpler systems with short instructions, this is a real limitation — and a signal that you should consider adding few-shot examples to your prompt anyway (they typically improve output consistency and give you caching as a side benefit).

It is worth emphasising: 1,024 tokens is lower than it sounds. A well-constructed system prompt with two pages of instructions and three examples of desired output is typically 1,200-2,000 tokens. Most meaningful system prompts are already above the threshold.
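A quick pre-flight check can tell you whether a block is plausibly above the threshold. The sketch below uses the common rule of thumb of about four characters per token for English prose, which is an approximation and not the provider's tokenizer; the function names are hypothetical. Treat the result as a hint and confirm with the token counts in the API response.

```typescript
// Rough heuristic only: about 4 characters per token for English prose.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// minTokens defaults to the 1,024 threshold discussed above; pass a higher
// value for models with a larger minimum.
function likelyMeetsCacheMinimum(text: string, minTokens = 1024): boolean {
  return estimateTokens(text) >= minTokens;
}

const shortPrompt = "You are a helpful assistant.";
const longPrompt = "Follow the methodology described below. ".repeat(150);

console.log(likelyMeetsCacheMinimum(shortPrompt)); // false
console.log(likelyMeetsCacheMinimum(longPrompt));  // true
```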

The cacheControl placement pattern

This is the part that confuses most teams implementing caching for the first time: you do not cache "your system prompt as a whole." You mark specific parts of your messages array as cacheable.

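import { generateText } from "ai"

// models.deep is this project's own model handle, defined elsewhere
const result = await generateText({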
  model: models.deep,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: STATIC_SYSTEM_INSTRUCTIONS, // 2000+ tokens of static text
          providerOptions: {
            anthropic: {
              cacheControl: { type: "ephemeral" },
            },
          },
        },
        {
          type: "text",
          text: userInput, // dynamic — not cached
        },
      ],
    },
  ],
  experimental_telemetry: {
    isEnabled: true,
    functionId: "my-feature",
  },
})

The critical point: cacheControl is set on providerOptions on the specific TextPart — not on the entire messages array, not in the system field, and not in the text itself.

Setting it in the wrong location is a common mistake, and the result is that caching does not activate — you pay full price and believe you are caching.

Another frequent mistake: using experimental_providerMetadata instead of providerOptions. Both syntaxes have existed across AI SDK versions, but in AI SDK v6 the correct field is providerOptions. Check the AI SDK documentation for the Anthropic provider in the version you are using.
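One way to make wrong placement harder is to build the message in a single place. The helper below is a hypothetical sketch: `buildCachedUserMessage` and the local types are not part of the AI SDK, they only mirror the shape used in the example above so the snippet stands alone.

```typescript
// Local types mirroring the message shape from the example above.
// Declared here for self-containment, not imported from the AI SDK.
type CacheableTextPart = {
  type: "text";
  text: string;
  providerOptions?: { anthropic: { cacheControl: { type: "ephemeral" } } };
};

type UserMessage = { role: "user"; content: CacheableTextPart[] };

// Hypothetical helper: static text first (marked cacheable), dynamic text last.
function buildCachedUserMessage(staticText: string, dynamicText: string): UserMessage {
  return {
    role: "user",
    content: [
      {
        type: "text",
        text: staticText,
        providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
      },
      { type: "text", text: dynamicText }, // no cacheControl on the dynamic part
    ],
  };
}

const message = buildCachedUserMessage("STATIC INSTRUCTIONS ...", "What changed in Q3?");
console.log(JSON.stringify(message, null, 2));
```

The returned object can be placed directly in the messages array passed to generateText.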

What can and cannot be cached

Can be cached: Static content that is identical across calls. Instructions, frameworks, few-shot examples, domain context, role definitions. The more static, the better.

Cannot be cached effectively: User input, real-time data, timestamps, content with unique IDs. Caching only works when the cached content is bit-for-bit identical to what is in the cache.

Placement hierarchy: Place the cached segment as early as possible in the messages array. Caching works from the beginning to the point you have marked — it is not possible to cache a fragment in the middle of a long conversation array.

TTL: By default the cache lives for five minutes, and the window restarts each time the cached content is read. For systems with frequent calls (chat, real-time analysis), this is rarely a problem. For batch jobs with pauses longer than five minutes between calls, the cache expires and the next call pays to write it again.

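Multi-turn conversations: You can cache the initial system context and pay for it once per session window. Because caching covers the prompt from the start up to the marked point, earlier turns that are resent unchanged can also be cached by placing a marker on the latest turn. Only content that is new in the current call, such as the latest user turn, cannot be read from cache.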
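The bit-for-bit requirement is easy to break by accident, for example by interpolating a timestamp or request ID into the static block. A minimal guard, assuming you can inspect the static text before sending it, is to scan for patterns that usually differ per call. The pattern list and `findVolatileContent` are illustrative and far from exhaustive.

```typescript
// Hypothetical pre-flight check: flag content that usually differs per call.
const VOLATILE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "ISO timestamp", pattern: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}/ },
  { label: "UUID", pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i },
];

// Returns the labels of every volatile pattern found in the text.
function findVolatileContent(text: string): string[] {
  return VOLATILE_PATTERNS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ label }) => label);
}

const stablePrompt = "You are an enterprise architecture assistant.";
const leakyPrompt = `Current time: 2026-04-23T10:15:00Z. ${stablePrompt}`;

console.log(findVolatileContent(stablePrompt)); // []
console.log(findVolatileContent(leakyPrompt));  // ["ISO timestamp"]
```

Running a check like this in a unit test against every cached block catches the regression before it shows up as a drop in cache reads.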

Verification: is it actually cached

The most critical mistake is implementing caching and assuming it works. Verification requires checking response metadata.

Via Langfuse or directly via the Anthropic SDK, you can see:

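// Field names as reported by the Anthropic API and shown in Langfuse traces.
// The AI SDK may expose them under different names depending on the version.
const usage = result.usage
// cache_creation_input_tokens: tokens written to the cache (first call)
// cache_read_input_tokens: tokens read from cache (subsequent calls)
// input_tokens: tokens after the last cache marker, billed at the full input price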

If cache_read_input_tokens is 0 on calls that should hit the cache, caching is not activated correctly. This can be due to incorrect placement of cacheControl, tokens below the 1,024 threshold, or the TTL window having expired.

Add observability from day one. You cannot debug caching without seeing what is actually happening at the token level.

A good practice: log cache hit rate as a metric in your AI observability system. If hit rate drops below 60% on calls that should hit the cache, there is a problem — either with TTL, placement, or with system prompts changing unexpectedly.
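A hit-rate metric along these lines can be sketched in a few lines. The `CallUsage` shape and its field names are this sketch's own; map them from whatever your tracing layer reports for the cache token counts. The 0.6 threshold is the 60% floor suggested above.

```typescript
// Field names are assumptions for this sketch, not an SDK type.
type CallUsage = {
  cacheReadInputTokens: number;
  cacheCreationInputTokens: number;
  inputTokens: number;
};

// Share of calls that read at least one token from the cache.
function cacheHitRate(calls: CallUsage[]): number {
  if (calls.length === 0) return 0;
  const hits = calls.filter((call) => call.cacheReadInputTokens > 0).length;
  return hits / calls.length;
}

const HIT_RATE_ALERT_THRESHOLD = 0.6;

// Alert only when there is data and the hit rate is below the floor.
function shouldAlert(calls: CallUsage[]): boolean {
  return calls.length > 0 && cacheHitRate(calls) < HIT_RATE_ALERT_THRESHOLD;
}

const recentCalls: CallUsage[] = [
  { cacheReadInputTokens: 0, cacheCreationInputTokens: 2000, inputTokens: 40 }, // cache write
  { cacheReadInputTokens: 2000, cacheCreationInputTokens: 0, inputTokens: 35 }, // hit
  { cacheReadInputTokens: 2000, cacheCreationInputTokens: 0, inputTokens: 52 }, // hit
  { cacheReadInputTokens: 0, cacheCreationInputTokens: 2000, inputTokens: 38 }, // expired, rewritten
];

console.log(cacheHitRate(recentCalls), shouldAlert(recentCalls)); // 0.5 true
```

Only feed this metric with calls that are expected to hit the cache, otherwise the first call of every session drags the rate down and hides real problems.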

Strategy for system prompts with context

The most powerful use case is caching domain context that is shared across many calls — but specific enough to drive output quality.

Pattern 1 — Static + dynamic:

// Approximate sizes: PRINCIPLES ~500 tokens, EXAMPLES ~800, METHODOLOGY ~600.
// Keep comments outside the template literal, or they become part of the prompt.
const DOMAIN_CONTEXT = `
You are an enterprise architecture assistant. You help analyse
IT portfolios based on the following principles:
${PRINCIPLES}
${EXAMPLES}
${METHODOLOGY}
`
// Total: ~1900 tokens, above the 1,024 threshold, so it can be cached

const result = await generateText({
  model: models.deep, // same model handle as in the earlier example
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: DOMAIN_CONTEXT,
          providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
        },
        { type: "text", text: `Analyse this application: ${userApp}` },
      ],
    },
  ],
})

Pattern 2 — Few-shot caching:

Few-shot examples are ideal for caching. They are static, relatively long (200-300 tokens per example), and substantially improve output consistency. Three examples typically produce 600-900 tokens — close enough to the 1,024 threshold to be supplemented with a shorter instruction.

Pattern 3 — Workspace-specific context:

For multi-tenant systems, workspace-specific context (company terminology, industry-specific vocabulary, configuration) can usefully be cached per session. This requires the context to stay unchanged for the duration of the session, calls to arrive less than five minutes apart so the cache does not expire, and the context to be consolidated enough to exceed the 1,024-token threshold.
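Whether a workspace's cache entry is still alive depends on when it was last used. The tracker below is a hypothetical sketch that records the last call per workspace and assumes the five-minute window restarts on every call that touches the cached prefix. It only estimates warmth on the client side; the authoritative signal is still the cache token counts in the response.

```typescript
const DEFAULT_TTL_MS = 5 * 60 * 1000; // the five-minute window described above

// Hypothetical tracker: remembers when each workspace's cached prefix was last used.
class CacheWarmthTracker {
  private lastUsed = new Map<string, number>();
  private ttlMs: number;

  constructor(ttlMs: number = DEFAULT_TTL_MS) {
    this.ttlMs = ttlMs;
  }

  recordCall(workspaceId: string, nowMs: number): void {
    this.lastUsed.set(workspaceId, nowMs);
  }

  // True if a call at nowMs would plausibly still find the cache entry alive.
  isLikelyWarm(workspaceId: string, nowMs: number): boolean {
    const last = this.lastUsed.get(workspaceId);
    return last !== undefined && nowMs - last < this.ttlMs;
  }
}

const tracker = new CacheWarmthTracker();
tracker.recordCall("workspace-a", 0);

console.log(tracker.isLikelyWarm("workspace-a", 4 * 60 * 1000)); // true
console.log(tracker.isLikelyWarm("workspace-a", 6 * 60 * 1000)); // false
console.log(tracker.isLikelyWarm("workspace-b", 0));             // false
```

A batch job could use a check like this to group work per workspace so that consecutive calls land inside the window.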

Multiple cache breakpoints in a single call

A detail that rarely appears in tutorials: you can set cacheControl at multiple points in the same messages array, up to four per request. This lets you cache in layers.

A practical example: an enterprise analysis call with three layers of static content:

Layer 1 — Global instructions (3,000 tokens): The system's role, methodology, and output format. Rarely changes.

Layer 2 — Workspace context (1,200 tokens): The company's IT principles, strategic themes, and relevant definitions. Changes monthly.

Layer 3 — Session context (800 tokens): The current analysis topic and related data. Changes per session.

With three cache breakpoints, you only pay for the layer that changes. Layer 1 is almost never recreated — you pay the cache-read price on 19 out of 20 calls. Layer 2 only recreates when workspace context is updated. Layer 3 is new per session, but the other two layers still reduce total token consumption by 75-85%.
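The saving can be checked with a short calculation on the three layer sizes above. The multipliers are illustrative assumptions (cache reads at 0.1x and cache writes at 1.25x the normal input price), and the calculation covers only the static layers, not the dynamic user input or output tokens.

```typescript
// Multipliers relative to the normal input price; illustrative assumptions.
const READ = 0.1;   // cache read
const WRITE = 1.25; // cache write (first call for that layer)

type Layer = { name: string; tokens: number; multiplier: number };

// Fraction of input cost saved compared with sending every layer at full price.
function savedFraction(layers: Layer[]): number {
  const full = layers.reduce((sum, layer) => sum + layer.tokens, 0);
  const billed = layers.reduce((sum, layer) => sum + layer.tokens * layer.multiplier, 0);
  return 1 - billed / full;
}

// First call of a new session: layers 1 and 2 are read, layer 3 is written.
const sessionStart: Layer[] = [
  { name: "global", tokens: 3000, multiplier: READ },
  { name: "workspace", tokens: 1200, multiplier: READ },
  { name: "session", tokens: 800, multiplier: WRITE },
];

// Later calls in the same session: all three layers are read from cache.
const midSession: Layer[] = sessionStart.map((layer) => ({ ...layer, multiplier: READ }));

console.log(savedFraction(sessionStart).toFixed(3)); // "0.716"
console.log(savedFraction(midSession).toFixed(3));   // "0.900"
```

Under these assumptions the saving ranges from about 72% on the first call of a session to 90% on later calls, which brackets the figure quoted above. The blended number depends on how many calls each session makes.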

Implementation-wise, you set cacheControl: { type: "ephemeral" } at the end of each layer — not only on the first.
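A layered content array can be built with a small helper. This is a sketch under stated assumptions: `buildLayeredContent` and the local `TextPart` type are hypothetical and only mirror the message shape used earlier in the article, and the limit of four markers per request reflects Anthropic's documentation at the time of writing.

```typescript
// Local type mirroring the text-part shape used earlier; not an SDK import.
type TextPart = {
  type: "text";
  text: string;
  providerOptions?: { anthropic: { cacheControl: { type: "ephemeral" } } };
};

const MAX_BREAKPOINTS = 4; // assumed per-request limit; verify for your API version

// One text part per static layer, each ending in a cache marker,
// followed by the dynamic input with no marker.
function buildLayeredContent(staticLayers: string[], dynamicInput: string): TextPart[] {
  if (staticLayers.length > MAX_BREAKPOINTS) {
    throw new Error(`At most ${MAX_BREAKPOINTS} cache breakpoints per request`);
  }
  const cachedParts = staticLayers.map((text): TextPart => ({
    type: "text",
    text,
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  }));
  return [...cachedParts, { type: "text", text: dynamicInput }];
}

// Order matters: most stable layer first, most volatile last.
const content = buildLayeredContent(
  ["GLOBAL INSTRUCTIONS ...", "WORKSPACE CONTEXT ...", "SESSION CONTEXT ..."],
  "Analyse application X"
);

console.log(content.length); // 4
```

Because caching works on the prefix, a change in an earlier layer invalidates every layer after it. That is why the layers are ordered from least to most frequently changing.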

The consequence of model upgrades

A cache is tied to the specific model version. If you switch from claude-sonnet-4-6 to a newer version, the cache resets, and you pay cache_creation_input_tokens on the first call after the switch.

This is not a problem — just something to plan for. When you deploy a model upgrade, expect a spike in cache_creation_input_tokens in the first hour. It is normal and expected.

This also underscores why AI observability and prompt caching belong together: without cache_read_input_tokens in your traces, you would not know whether cache hit rate dropped after a model upgrade — and you would only discover it from the invoice.

What to do tomorrow

Caching has the lowest implementation cost of any AI cost optimisation — and one of the highest savings potentials. Three steps:

Week 1: Identify the calls in the product that send long, static system prompts. Measure the token volume on these calls. Calculate the potential savings.

Week 2: Implement cacheControl on the static parts of the identified calls. Verify via telemetry that cache_read_input_tokens increases.

Week 3: Measure the actual cost reduction in Langfuse or the Anthropic console. Establish caching as the default practice for all new features with static system context.

Prompt caching is not an advanced optimisation. It is a fundamental part of building AI systems that are cost-sustainable in production.

