RAG with Algolia + Claude rerank

Hybrid retrieval (BM25 + vector) plus Claude as a relevance judge — the stack I use in production.

RAG with Algolia + Claude rerank

Pure vector search has a recall problem on real corpora — exact matches and rare tokens slip through. Pure keyword has a synonym problem. Production RAG that I trust = Algolia (BM25 + dense) for top-K, then Claude as a reranker. Below is the whole pipeline.

Retrieve top-K

import { algoliasearch } from "algoliasearch";

const client = algoliasearch(
  process.env.ALGOLIA_APP_ID!,
  process.env.ALGOLIA_API_KEY!
);

export async function retrieve(query: string, k = 20) {
  const { results } = await client.search({
    requests: [
      {
        indexName: "docs",
        query,
        hitsPerPage: k,
        attributesToRetrieve: ["title", "body", "url"],
      },
    ],
  });
  return (results[0] as any).hits as Array<{
    objectID: string;
    title: string;
    body: string;
    url: string;
  }>;
}

Rerank with Claude

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const client = new Anthropic();

const Ranking = z.object({
  ranking: z.array(z.object({ id: z.string(), score: z.number() })),
});

export async function rerank(
  query: string,
  hits: { objectID: string; title: string; body: string }[]
) {
  const corpus = hits
    .map((h, i) => `<doc id="${h.objectID}">\n${h.title}\n\n${h.body.slice(0, 800)}\n</doc>`)
    .join("\n\n");

  const res = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: "Rank docs by relevance to the query. Return JSON: { ranking: [{ id, score }] } where score is 0..1. Output JSON only.",
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      { role: "user", content: `Query: ${query}\n\n${corpus}` },
    ],
  });

  const text = res.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");

  return Ranking.parse(JSON.parse(text)).ranking
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);
}

Compose

const candidates = await retrieve(userQuery, 20);
const ranked = await rerank(userQuery, candidates);
const context = ranked
  .map((r) => candidates.find((c) => c.objectID === r.id)!)
  .map((c) => `[${c.title}](${c.url})\n${c.body}`)
  .join("\n\n---\n\n");

// Now pass `context` into your answering call as cached system content.

Why hybrid + rerank

  • BM25 catches exact tokens (error codes, function names, library names) that embeddings smear.
  • Dense catches paraphrase ("how do I sign in" ≈ "login flow").
  • Rerank is cheap. Haiku on 20 short docs is ~150 ms. Recall@K jumps because the ranker actually understands the query, not just lexical overlap.
  • Truncate aggressively. Send only the first 800 chars per doc to the reranker. The full body goes to the answering call.