Claude streaming with prompt caching

Prompt caching is the single biggest knob in production Claude apps. On repeat calls within the cache's 5-minute TTL (which is refreshed each time the cache is hit), a cached system prompt can cut time to first token (TTFT) by roughly 40% and the cost of the cached prompt tokens by roughly 10×. Below is the minimal pattern for a streaming chat endpoint with caching.

Install

npm i @anthropic-ai/sdk

Streaming chat endpoint

import Anthropic from "@anthropic-ai/sdk";

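// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// LARGE_SYSTEM_CONTEXT (used below) is assumed to be a large static string
// defined or imported elsewhere, e.g. docs or schemas loaded once at startup.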

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = client.messages.stream({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: `You are a helpful coding assistant for Carson's portfolio site.
You answer concisely and prefer code over prose.
${LARGE_SYSTEM_CONTEXT}`,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages,
  });

  // Forward to the client as SSE
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      for await (const event of stream) {
        if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ delta: event.delta.text })}\n\n`)
          );
        }
      }
      const final = await stream.finalMessage();
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ done: true, usage: final.usage })}\n\n`)
      );
      controller.close();
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}

How it works

Two ideas are combined here. The first is streaming: instead of waiting for the whole completion, client.messages.stream(...) returns an async iterator of events. The handler watches for content_block_delta events whose delta is a text_delta, and re-emits each chunk of text to the browser. The endpoint wraps those chunks in a ReadableStream and serves them as Server-Sent Events — each chunk is written as a data: {...}\n\n line with a text/event-stream content type — so the UI can render tokens as they arrive rather than after a long pause. Once the model finishes, stream.finalMessage() resolves to the complete message, and the handler sends one last SSE frame carrying done: true and the usage block before closing the stream.
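To make the wire format concrete, here is a minimal sketch of the client-side parsing for the frames described above. It assumes the endpoint only ever emits single-line data: frames separated by a blank line, as the handler in this article does; SseFrame and parseSseFrames are hypothetical names, not part of any SDK. Because the endpoint is a POST, the browser's EventSource (GET only) does not apply, so the frames have to be split by hand from the fetch response body.

```typescript
// Frames emitted by the endpoint: text deltas, then one final "done" frame.
type SseFrame = { delta: string } | { done: true; usage?: unknown };

// Splits a text buffer into complete SSE frames plus any trailing partial
// frame. Network chunks can end mid-frame, so the caller keeps `rest` and
// prepends it to the next chunk.
function parseSseFrames(buffer: string): { frames: SseFrame[]; rest: string } {
  const parts = buffer.split("\n\n");
  // The last element is "" when the buffer ended on a frame boundary,
  // otherwise it is an incomplete frame that needs more bytes.
  const rest = parts.pop() ?? "";
  const frames: SseFrame[] = [];
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (line.startsWith("data: ")) {
        frames.push(JSON.parse(line.slice("data: ".length)) as SseFrame);
      }
    }
  }
  return { frames, rest };
}

// Simulate two network chunks, the first of which ends mid-frame.
const first = parseSseFrames('data: {"delta":"Hel"}\n\ndata: {"del');
const second = parseSseFrames(first.rest + 'ta":"lo"}\n\ndata: {"done":true}\n\n');
console.log(first.frames, second.frames);
```

In a browser you would read response.body with a reader, decode each chunk with a TextDecoder, append it to a buffer, and pass the buffer through parseSseFrames, rendering each delta as it arrives.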

The second idea is prompt caching. The system field is an array of content blocks, and the single text block carries cache_control: { type: "ephemeral" }. That marker tells the API to cache everything up to and including that block. Caching is a prefix match, so the stable part of the prompt — the instructions plus LARGE_SYSTEM_CONTEXT — must come before the volatile per-request content (the messages, which always differ). On the first call the prompt is written to the cache; on subsequent calls within the cache's lifetime, the same prefix is served from cache instead of being reprocessed from scratch. The usage numbers you log are how you confirm it's actually working.

What's worth caching

Long, stable system prompts (style guide, tool definitions, schema docs).
RAG context that gets reused across turns within the same conversation.
Few-shot examples, which are often the biggest contributor to a large, frequently reused system prompt.
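One way to assemble those pieces is sketched below: the stable parts are concatenated into the system array in a fixed order, with a single cache_control marker on the last block so the whole prefix is cached together. This is an illustration only; buildSystemBlocks and the SystemTextBlock type are hypothetical helpers written for this example, mirroring the block shape used in the endpoint above.

```typescript
// Local mirror of the system text block shape used in the endpoint.
type SystemTextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Turns an ordered list of stable prompt parts into system blocks and
// marks only the final block as the cache breakpoint.
function buildSystemBlocks(parts: string[]): SystemTextBlock[] {
  return parts.map((text, i) => {
    const block: SystemTextBlock = { type: "text", text };
    if (i === parts.length - 1) {
      block.cache_control = { type: "ephemeral" };
    }
    return block;
  });
}

// Placeholder strings; real content would be much longer.
const system = buildSystemBlocks([
  "Style guide: answer concisely and prefer code over prose.",
  "Schema docs: ...",
  "Few-shot examples: ...",
]);
console.log(system);
```

The order of the parts matters as much as their content: reordering them changes the prefix and produces a fresh cache write.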

What you'll see in `usage`

{
  "input_tokens": 12,
  "cache_creation_input_tokens": 4321,
  "cache_read_input_tokens": 0,
  "output_tokens": 287
}

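On the next call within the 5-minute window, the two cache fields swap: cache_read_input_tokens jumps to roughly the size of the cached prefix (4321 here) and cache_creation_input_tokens drops to 0, while input_tokens keeps counting only the uncached tokens after the cache breakpoint. Always log these; they're the most reliable signal that caching is wired up.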
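A small logging helper makes that check routine. The sketch below classifies a usage block as a cache write, a cache hit, or uncached, and estimates the input cost relative to sending everything uncached. The 1.25× write and 0.1× read multipliers are assumptions based on published pricing for the 5-minute cache and should be checked against current pricing; summarizeCacheUsage is a hypothetical name.

```typescript
// Subset of the usage block returned with the final message.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
  output_tokens: number;
};

type CacheSummary = {
  status: "hit" | "write" | "uncached";
  totalInput: number;
  hitRatio: number;          // share of input tokens served from cache
  relativeInputCost: number; // 1.0 means same cost as fully uncached input
};

// writeMultiplier and readMultiplier are ASSUMED pricing ratios relative
// to the base input-token price; pass your own if pricing differs.
function summarizeCacheUsage(
  usage: Usage,
  writeMultiplier = 1.25,
  readMultiplier = 0.1,
): CacheSummary {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const totalInput = usage.input_tokens + created + read;
  const status = read > 0 ? "hit" : created > 0 ? "write" : "uncached";
  const billedUnits =
    usage.input_tokens + created * writeMultiplier + read * readMultiplier;
  return {
    status,
    totalInput,
    hitRatio: totalInput === 0 ? 0 : read / totalInput,
    relativeInputCost: totalInput === 0 ? 1 : billedUnits / totalInput,
  };
}

// First call (cache write), then a repeat call (cache hit).
console.log(summarizeCacheUsage({
  input_tokens: 12, cache_creation_input_tokens: 4321,
  cache_read_input_tokens: 0, output_tokens: 287,
}));
console.log(summarizeCacheUsage({
  input_tokens: 12, cache_creation_input_tokens: 0,
  cache_read_input_tokens: 4321, output_tokens: 287,
}));
```

If status stays at "write" across repeated requests, the prefix is changing between calls; if it stays at "uncached", the prefix is probably too short to cache.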

Notes & gotchas

The cache key is the exact bytes of the prefix, so the most common mistake is silently invalidating it. If anything dynamic creeps into the system prompt — a timestamp, a per-request ID, a non-deterministically serialized JSON blob — the prefix changes on every call and you'll see cache_read_input_tokens sitting at zero no matter how many requests you send. Keep the system prompt frozen and push anything that varies into the messages array, which lives after the cached prefix. The same applies to the model: caches are scoped per model, so switching models mid-conversation starts the cache over.
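As an illustration of keeping the prefix byte-stable, the sketch below serializes embedded JSON with sorted keys, builds the system text once at module load, and routes per-request values into the user turn. All names here (stableStringify, SCHEMA, FROZEN_SYSTEM_TEXT, buildUserTurn) are hypothetical, and the serializer is only meant for plain JSON data.

```typescript
// Serializes plain JSON data with object keys sorted, so the same data
// always yields the same bytes regardless of key insertion order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    // JSON renders undefined array entries as null; do the same here.
    const items = value.map((v) => (v === undefined ? "null" : stableStringify(v)));
    return "[" + items.join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // JSON omits undefined properties
      .sort();
    const entries = keys.map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Hypothetical static data that belongs in the cached prefix.
const SCHEMA = { tables: { users: ["id", "email"], posts: ["id", "title"] } };

// Built once at module load, so every request sends identical bytes.
const FROZEN_SYSTEM_TEXT = "Database schema:\n" + stableStringify(SCHEMA);

// Per-request values (request ID, time) go into the user turn, which sits
// after the cached prefix, instead of into the system prompt.
function buildUserTurn(question: string, requestId: string, now: Date) {
  return {
    role: "user" as const,
    content: `[request ${requestId} at ${now.toISOString()}]\n${question}`,
  };
}

console.log(FROZEN_SYSTEM_TEXT);
console.log(buildUserTurn("How do I join users to posts?", "req-1", new Date()));
```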

There's also a minimum prefix size below which nothing caches at all. The threshold depends on the model (on the order of a thousand tokens or more), so short system prompts won't produce a cache entry, and you'll just see cache_creation_input_tokens: 0 with no error. That's why this pattern pays off specifically when the system prompt is large and reused, which is exactly the chat-endpoint shape.

On the streaming side, the main thing to handle is interruption: if the client disconnects mid-stream you may have only a partial response, so don't treat a closed connection as a completed turn. Stream any request with long input or output, since streaming sidesteps the request timeouts that long non-streaming completions can hit.
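One way to avoid mistaking a dropped connection for a finished turn is to track completion explicitly: a turn only counts as complete once the final done frame has been seen. The sketch below is a small state reducer over the frames this endpoint emits; TurnFrame, TurnState, applyFrame, and closeTurn are hypothetical names for illustration.

```typescript
// Frames emitted by the endpoint: text deltas, then one final "done" frame.
type TurnFrame = { delta: string } | { done: true; usage?: unknown };

type TurnState = {
  text: string;
  status: "streaming" | "complete" | "interrupted";
};

const initialTurn: TurnState = { text: "", status: "streaming" };

// Folds one frame into the turn. Frames after the turn has ended are ignored.
function applyFrame(state: TurnState, frame: TurnFrame): TurnState {
  if (state.status !== "streaming") return state;
  if ("done" in frame) return { ...state, status: "complete" };
  return { ...state, text: state.text + frame.delta };
}

// Call when the connection closes. If no "done" frame arrived, the text is
// only a partial response and the turn is marked as interrupted.
function closeTurn(state: TurnState): TurnState {
  return state.status === "streaming" ? { ...state, status: "interrupted" } : state;
}

// A stream that drops after two deltas versus one that finishes normally.
const dropped = closeTurn(
  [{ delta: "Hel" }, { delta: "lo" }].reduce(applyFrame, initialTurn),
);
const finished = closeTurn(
  [{ delta: "Hello" }, { done: true as const }].reduce(applyFrame, initialTurn),
);
console.log(dropped.status, finished.status);
```

An interrupted turn can then be retried or shown as partial rather than saved to the conversation history as if it were the model's full answer. On the server side, a complementary step would be to give the ReadableStream a cancel() hook that stops the upstream request (the SDK's stream object exposes an abort method for this, as far as I know), so a disconnected client does not keep generating tokens.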