RAG with Algolia + Claude rerank

Pure vector search has a recall problem on real corpora — exact matches and rare tokens slip through. Pure keyword has a synonym problem. Production RAG that I trust = Algolia (BM25 + dense) for top-K, then Claude as a reranker. Below is the whole pipeline.

Retrieve top-K

import { algoliasearch } from "algoliasearch";

const client = algoliasearch(
  process.env.ALGOLIA_APP_ID!,
  process.env.ALGOLIA_API_KEY!
);

export async function retrieve(query: string, k = 20) {
  const { results } = await client.search({
    requests: [
      {
        indexName: "docs",
        query,
        hitsPerPage: k,
        attributesToRetrieve: ["title", "body", "url"],
      },
    ],
  });
  return (results[0] as any).hits as Array<{
    objectID: string;
    title: string;
    body: string;
    url: string;
  }>;
}

Rerank with Claude

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const client = new Anthropic();

const Ranking = z.object({
  ranking: z.array(z.object({ id: z.string(), score: z.number() })),
});

export async function rerank(
  query: string,
  hits: { objectID: string; title: string; body: string }[]
) {
  const corpus = hits
    .map((h, i) => `<doc id="${h.objectID}">\n${h.title}\n\n${h.body.slice(0, 800)}\n</doc>`)
    .join("\n\n");

  const res = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: "Rank docs by relevance to the query. Return JSON: { ranking: [{ id, score }] } where score is 0..1. Output JSON only.",
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      { role: "user", content: `Query: ${query}\n\n${corpus}` },
    ],
  });

  const text = res.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");

  return Ranking.parse(JSON.parse(text)).ranking
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);
}

Compose

const candidates = await retrieve(userQuery, 20);
const ranked = await rerank(userQuery, candidates);
const context = ranked
  .map((r) => candidates.find((c) => c.objectID === r.id)!)
  .map((c) => `[${c.title}](${c.url})\n${c.body}`)
  .join("\n\n---\n\n");

// Now pass `context` into your answering call as cached system content.

Why hybrid + rerank

BM25 catches exact tokens (error codes, function names, library names) that embeddings smear.
Dense catches paraphrase ("how do I sign in" ≈ "login flow").
Rerank is cheap. Haiku on 20 short docs is ~150 ms. Recall@K jumps because the ranker actually understands the query, not just lexical overlap.
Truncate aggressively. Send only the first 800 chars per doc to the reranker. The full body goes to the answering call.

How it works

The pipeline runs in two passes. The first pass, retrieve, asks Algolia for the top k hits (defaulting to 20) for the query and pulls back just the fields it needs — title, body, and url. This is the recall stage: cast a wide net and accept that some of the 20 will be noise.

The second pass, rerank, is where Claude earns its place. It packs the candidate hits into a single prompt, wrapping each document in a <doc id="..."> tag and slicing the body to the first 800 characters so the prompt stays small. The system message instructs the model to return JSON of the form { ranking: [{ id, score }] } with a relevance score from 0 to 1, and it carries a cache_control: { type: "ephemeral" } marker so that fixed instruction block can be cached across calls. After the response comes back, the code concatenates the text blocks, parses the JSON, and — importantly — validates it against the Ranking Zod schema before trusting it. It then sorts by score descending and keeps the top 5.

The Compose step closes the loop: it maps the reranked IDs back to the original candidates (which still hold the full, untruncated body and the URL), and formats them into a single context string of linked titles and bodies, ready to drop into the answering call as cached system content.

When to use it

Use this when neither keyword nor vector search alone is good enough — which, on real documentation corpora, is most of the time. As the notes above explain, BM25 nails exact tokens like error codes and function names that embeddings tend to smear together, while dense retrieval catches paraphrase, and a model-based reranker fixes the ordering because it actually reads the query rather than measuring lexical overlap.

A few practical points. The truncation is deliberate and asymmetric: the reranker only sees the first 800 characters of each doc, but the full body flows to the answering call, so you pay for short context where you only need a ranking signal and reserve the long context for the final answer. Keep the Algolia and Anthropic credentials in environment variables on the server (as the code does) and never ship them to the client. And because the model returns free-form text, the Zod parse around JSON.parse is not optional — it is your guard against a malformed ranking silently corrupting the context. If a parse fails, falling back to the raw Algolia order is a reasonable degradation.