Claude streaming with prompt caching
Anthropic SDK chat with streaming + cache_control on the system prompt — the fastest, cheapest path to a chat product.
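Prompt caching is one of the biggest cost and latency levers in production Claude apps. On repeat calls within the 5-minute TTL, the cached part of the prompt is billed at roughly a tenth of the normal input rate, and a cached system prompt cuts TTFT (time to first token) by around 40%. The first call pays a small premium to write the cache, and every cache hit refreshes the TTL. The sections below show the minimal pattern for a streaming chat endpoint with caching.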
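To make the cost claim concrete, here is a rough back-of-the-envelope sketch. It assumes the 5-minute cache pricing of 1.25x the base input price for cache writes and 0.1x for cache reads, that every call lands inside the TTL, and that the uncached part of each request stays the same size. The function name and the token counts are made up for illustration.

```typescript
// Assumed multipliers relative to the base input-token price (5-minute TTL).
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Relative input cost of `calls` requests that share one prompt prefix,
// expressed in "base-price tokens" so no real price list is needed.
function inputCostUnits(
  prefixTokens: number,
  freshTokensPerCall: number,
  calls: number,
  cached: boolean,
): number {
  if (calls <= 0) return 0;
  if (!cached) return calls * (prefixTokens + freshTokensPerCall);
  // First call writes the cache, every later call reads it.
  const firstCall = prefixTokens * CACHE_WRITE_MULTIPLIER;
  const laterCalls = (calls - 1) * prefixTokens * CACHE_READ_MULTIPLIER;
  return firstCall + laterCalls + calls * freshTokensPerCall;
}

// Ten calls sharing a 4000-token system prompt, 20 fresh tokens per call.
const uncached = inputCostUnits(4000, 20, 10, false); // 10 * (4000 + 20) = 40200
const cached = inputCostUnits(4000, 20, 10, true); // 5000 + 3600 + 200 = 8800
console.log({ uncached, cached });
```

Under these assumptions the ten calls cost about 4.6 times less input overall. The ratio approaches 10x for the prefix as the number of cache hits grows, because the one-time write premium is spread over more calls.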
Install
npm i @anthropic-ai/sdk
Streaming chat endpoint
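import Anthropic from "@anthropic-ai/sdk";

// Picks up ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// Placeholder, not part of the original snippet: substitute your own long,
// stable context (style guide, schema docs, few-shot examples).
const LARGE_SYSTEM_CONTEXT = "<long, stable reference material>";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await client.messages.stream({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: `You are a helpful coding assistant for Carson's portfolio site.
You answer concisely and prefer code over prose.
${LARGE_SYSTEM_CONTEXT}`,
        // Marks the end of the cacheable prefix.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages,
  });

  // Forward to the client as SSE
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      try {
        for await (const event of stream) {
          // Only text deltas are forwarded; other event types are skipped.
          if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ delta: event.delta.text })}\n\n`)
            );
          }
        }
        // The final message carries the usage block, including the cache counters.
        const final = await stream.finalMessage();
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ done: true, usage: final.usage })}\n\n`)
        );
        controller.close();
      } catch (err) {
        // Surface upstream failures instead of leaving the response hanging.
        controller.error(err);
      }
    },
    cancel() {
      // Stop the upstream request if the client disconnects.
      stream.abort();
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}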
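On the receiving side, the frames have to be reassembled, because a network chunk can end in the middle of a frame. Below is a minimal sketch of an incremental parser for the two frame shapes the endpoint emits. The names createSseParser and ChatFrame are hypothetical, and the parser only handles the single-line "data: " frames used here, not the full SSE format.

```typescript
// Frame shapes emitted by the endpoint: text deltas, then one final frame.
type ChatFrame =
  | { delta: string }
  | { done: true; usage: Record<string, unknown> };

// Hypothetical helper: buffers text and emits one frame per "\n\n" boundary.
function createSseParser(onFrame: (frame: ChatFrame) => void) {
  let buffer = "";
  return (chunk: string): void => {
    buffer += chunk;
    let boundary = buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith("data: ")) {
          onFrame(JSON.parse(line.slice("data: ".length)) as ChatFrame);
        }
      }
      boundary = buffer.indexOf("\n\n");
    }
  };
}

// A frame split across two chunks is still delivered exactly once.
const received: ChatFrame[] = [];
const feed = createSseParser((frame) => received.push(frame));
feed('data: {"delta":"Hel');
feed('lo"}\n\ndata: {"done":true,"usage":{"output_tokens":1}}\n\n');
console.log(received);
```

In a browser you would typically read response.body with a reader, decode each chunk with a TextDecoder in streaming mode, and pass the decoded text to feed.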
What's worth caching
- Long, stable system prompts (style guide, tool definitions, schema docs).
- RAG context that gets reused across turns within the same conversation.
- Few-shot examples, which are often the largest part of a frequently reused system prompt.
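Caching works on the prompt prefix, so the order of the blocks matters: everything up to and including the block marked with cache_control is cached, and anything that changes per request should come after it. A small sketch of one way to assemble the system array along those lines follows. The helper buildSystemBlocks and the sample strings are hypothetical, not part of the SDK.

```typescript
// Mirrors the shape of a text block in the `system` array.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Hypothetical helper: stable parts first, cache breakpoint on the last
// stable block, per-request text afterwards so it never changes the prefix.
function buildSystemBlocks(stableParts: string[], perRequestNote?: string): SystemBlock[] {
  const blocks: SystemBlock[] = stableParts
    .filter((part) => part.trim().length > 0)
    .map((text) => ({ type: "text" as const, text }));

  const lastStable = blocks[blocks.length - 1];
  if (lastStable) lastStable.cache_control = { type: "ephemeral" };

  if (perRequestNote) blocks.push({ type: "text", text: perRequestNote });
  return blocks;
}

const system = buildSystemBlocks(
  ["STYLE GUIDE ...", "SCHEMA DOCS ...", "FEW-SHOT EXAMPLES ..."],
  "Current visitor: guest",
);
console.log(system);
```

The resulting array can be passed as the system parameter in place of the single block used in the endpoint. Keeping the stable parts byte-for-byte identical between requests is what lets later calls hit the cache.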
What you'll see in usage
{
  "input_tokens": 12,
  "cache_creation_input_tokens": 4321,
  "cache_read_input_tokens": 0,
  "output_tokens": 287
}
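On the next call within the 5-minute window, cache_read_input_tokens jumps to roughly the size of the cached prefix (about 4321 here) and cache_creation_input_tokens drops to 0. Always log both fields: they are the most reliable signal that caching is wired up. If cache_read_input_tokens stays at 0 on repeat calls, the prefix is probably changing between requests or the prompt is below the model's minimum cacheable length.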
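A small helper makes these fields easier to log and alert on. The sketch below classifies a usage block as a cache hit, a cache write, or neither, and reports what fraction of the input was served from cache. It assumes total input is input_tokens plus both cache counters. The name summarizeCacheUsage and the second sample call are illustrative only.

```typescript
// Subset of the usage block shown above; cache fields may be absent or null.
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type CacheStatus = "hit" | "write" | "none";

// Hypothetical helper for logging: classify the call and compute the share
// of input tokens that came from the cache.
function summarizeCacheUsage(usage: Usage): { status: CacheStatus; cachedFraction: number } {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const totalInput = usage.input_tokens + read + written;
  const status: CacheStatus = read > 0 ? "hit" : written > 0 ? "write" : "none";
  return { status, cachedFraction: totalInput > 0 ? read / totalInput : 0 };
}

// First call from the example above: the cache is written, nothing is read.
const firstCall = summarizeCacheUsage({
  input_tokens: 12,
  cache_creation_input_tokens: 4321,
  cache_read_input_tokens: 0,
  output_tokens: 287,
});

// Illustrative second call within the TTL: the prefix is read from cache.
const secondCall = summarizeCacheUsage({
  input_tokens: 12,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 4321,
  output_tokens: 290,
});
console.log(firstCall, secondCall);
```

A run of "write" or "none" results on calls that should be hits is a cheap early warning that something in the prefix is changing between requests.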