
Case study · 2024–2026

Phiny.ai.

AI career-prep platform with real-time voice mock interviews — built end-to-end at VoiceQube.

Live · Senior Software Developer · VoiceQube

516 TS files (NestJS) · ~101k LoC backend · ~17k LoC FastAPI WebRTC · <1.2s voice E2E p50

The problem

Career-prep platforms have a credibility gap. Candidates can run a resume tool, watch a YouTube interview, or post on a forum — but the loop between practice and feedback is async, generic, and rarely tied to the actual job description in front of them. Recruiters feel the other half of the same gap: they see thousands of applications, most of them untargeted, and a handful that nail the framing. Every minute of that mismatch is cost on both sides.

Phiny.ai was built to collapse that loop. The product had to do three things at once: (1) tailor a candidate's resume to the specific JD they're applying to, (2) run a real-time voice mock interview with sub-second latency that felt like a live conversation rather than a chat-bot transcript, and (3) feed every signal — interview answers, resume edits, application status — back into a single tracker the candidate could trust. The only acceptable failure mode was "the model said something dumb." Latency hiccups, dropped audio, payment confusion, or a slow application page would all kill the trust the product depends on.

Constraints

Three constraints shaped every decision:

  • Latency budget under 1.2s end-to-end for voice. Anything above that breaks conversational flow. The cascaded ASR → LLM → TTS pipeline routinely overruns this on the naive path, and the budget had to be defended at every stage.
  • Cost per minute had to fit a freemium model. ElevenLabs TTS, Deepgram ASR, and Claude inference all bill per usage. Without aggressive prompt caching, intent routing, and TTS chunking, a single 25-minute mock interview would have eaten the unit economics.
  • One backend serving three product surfaces — web, mobile (eventual), and the voice agent. Smart Recruiter had to be a single source of truth for users, jobs, applications, payments, and interview state. No copies of the data layer per surface.

Architecture

Phiny is a two-process system bound by a shared MongoDB cluster, a Redis cache + queue layer, and a typed event contract.

Smart Recruiter — NestJS unified backend

516 TypeScript files, ~101k LoC of NestJS modules. Every business surface is a module: auth, users, jobs, applications, resume, llm, interview, payments, analytics. Cross-cutting concerns — observability, request guards, rate limits, BullMQ workers — live in dedicated modules and are shared by every feature.
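
A minimal sketch of the root module shape (module names mirror the list above; file paths are illustrative, not the production layout):

```ts
// Illustrative root module: every business surface is its own NestJS
// module, cross-cutting concerns live in shared modules.
import { Module } from '@nestjs/common';

import { AuthModule } from './auth/auth.module';
import { UsersModule } from './users/users.module';
import { JobsModule } from './jobs/jobs.module';
import { ApplicationsModule } from './applications/applications.module';
import { ResumeModule } from './resume/resume.module';
import { LlmModule } from './llm/llm.module';
import { InterviewModule } from './interview/interview.module';
import { PaymentsModule } from './payments/payments.module';
import { AnalyticsModule } from './analytics/analytics.module';
// Cross-cutting: observability, guards, rate limits, BullMQ workers.
import { ObservabilityModule } from './observability/observability.module';
import { WorkersModule } from './workers/workers.module';

@Module({
  imports: [
    AuthModule, UsersModule, JobsModule, ApplicationsModule,
    ResumeModule, LlmModule, InterviewModule, PaymentsModule,
    AnalyticsModule, ObservabilityModule, WorkersModule,
  ],
})
export class AppModule {}
```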

The llm module fronts every model call through OpenRouter, which lets the product route between Claude, GPT-4, and Gemini at the call site without the rest of the codebase knowing which model answered. System prompts and tool definitions sit behind cache_control: ephemeral, so the second call within a 5-minute window pays roughly 10× less. The Anthropic SDK is the default for anything intent-shaped (resume diff explanations, interview question grading) because Haiku is fast enough to keep the UI responsive and Opus is reserved for the deeper resume rewrite pass.
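
A minimal sketch of the cached call shape with the TypeScript Anthropic SDK; the model alias and the grading prompt are placeholders, not production values:

```ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

// Large, stable rubric: this is the cacheable prefix. Placeholder text.
const GRADING_SYSTEM_PROMPT = `You grade mock-interview answers...`;

export async function gradeAnswer(question: string, answer: string) {
  return anthropic.messages.create({
    model: 'claude-3-5-haiku-latest', // fast model for intent-shaped calls
    max_tokens: 512,
    system: [
      {
        type: 'text',
        text: GRADING_SYSTEM_PROMPT,
        // Cache breakpoint: reads inside the ~5-minute TTL bill at a
        // fraction of the normal input-token price.
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [
      // Only the per-call delta sits outside the cached prefix.
      { role: 'user', content: `Question: ${question}\nAnswer: ${answer}` },
    ],
  });
}
```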

Voice agent — Python WebRTC microservice

~17k LoC of Python in a separate FastAPI process so the voice runtime never blocks the API. Pipecat orchestrates the pipeline: LiveKit rooms for transport, Deepgram for streaming ASR with VAD endpointing, Claude for intent + question generation, ElevenLabs for streaming TTS. The interview state machine — current question, previous answers, time-pressure signals — is passed through as Pipecat context, not stored in the LLM's memory. State outlives any one model call.

The two processes talk over a typed REST contract for setup (start-session, end-session) and a Redis pub/sub channel for in-flight events (interview.answer.completed, interview.session.evaluated). The frontend joins the LiveKit room with a short-lived JWT minted server-side; once connected, the audio path is browser ↔ LiveKit ↔ voice-agent — the API server is out of the hot path entirely.
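
The token-minting side is small enough to sketch. This assumes livekit-server-sdk v2 (where toJwt() is async); identity, TTL, and room naming are illustrative:

```ts
import { AccessToken } from 'livekit-server-sdk';

export async function mintInterviewToken(userId: string, sessionId: string) {
  const at = new AccessToken(
    process.env.LIVEKIT_API_KEY!,
    process.env.LIVEKIT_API_SECRET!,
    // Short-lived: the token expires shortly after the session starts.
    { identity: userId, ttl: '10m' },
  );
  at.addGrant({
    room: `interview-${sessionId}`,
    roomJoin: true,
    canPublish: true,   // candidate microphone
    canSubscribe: true, // agent audio back to the browser
  });
  // The browser joins LiveKit directly with this JWT; the API server
  // never touches the audio afterwards.
  return at.toJwt();
}
```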

Resume + RAG pipeline

Resumes come in as PDFs, DOCX, or images. The ingest pipeline runs Tesseract OCR for photographed docs, pdf-parse for native PDFs, and mammoth for DOCX, normalising everything to a structured Markdown skeleton. The skeleton plus the JD plus a small few-shot corpus go to Claude with cache control on the system + few-shots, leaving only the resume + JD as the per-call delta. The output is a side-by-side diff the user can accept or reject section by section.
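
A simplified sketch of the format router, assuming pdf-parse, mammoth, and tesseract.js as the underlying libraries; the Markdown normalisation step is elided:

```ts
import pdfParse from 'pdf-parse';
import mammoth from 'mammoth';
import Tesseract from 'tesseract.js';

export async function extractText(buffer: Buffer, mimeType: string): Promise<string> {
  switch (mimeType) {
    case 'application/pdf': {
      // Native PDFs with a text layer: no OCR needed.
      const { text } = await pdfParse(buffer);
      return text;
    }
    case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': {
      const { value } = await mammoth.extractRawText({ buffer }); // DOCX
      return value;
    }
    case 'image/png':
    case 'image/jpeg': {
      // Photographed resumes: OCR, the slowest path.
      const { data } = await Tesseract.recognize(buffer, 'eng');
      return data.text;
    }
    default:
      throw new Error(`Unsupported resume format: ${mimeType}`);
  }
}
```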

What I built

  • The whole NestJS Smart Recruiter unified backend — module architecture, auth (OAuth + email/password + Apple), payments (Razorpay + Stripe + Apple Pay), the application tracker, the JD scraper (Apify + Playwright), the resume parser, the cached LLM router, BullMQ background workers, AWS S3 / SES / SQS glue, Sentry + PostHog wiring.
  • The FastAPI WebRTC microservice — Pipecat pipeline assembly, LiveKit token issuance, Deepgram + ElevenLabs + Claude integrations, the interview state machine, Redis events, the structured-output question grader.
  • The cross-process contract: a single typed event model (sketched after this list) so the API server, the voice agent, and the frontend all agree on what an "interview answer" looks like and which actor is allowed to mutate it.
  • The performance pass that took the voice loop from a 3.4s p50 baseline to under 1.2s — five separate optimisations described in my Voice AI Latency Optimization paper, applied to this exact pipeline.
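
A hedged sketch of that event model, assuming ioredis on the Node side (the Python voice agent publishes the same JSON shapes); the payload fields are illustrative, the channel names are the real ones from the architecture section:

```ts
import Redis from 'ioredis';

// One payload type per event; only the voice agent may emit these.
interface AnswerCompleted {
  sessionId: string;
  questionIndex: number;
  transcript: string;
}
interface SessionEvaluated {
  sessionId: string;
  overallScore: number;
}

// ioredis needs a dedicated connection in subscriber mode.
const sub = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export async function listenForInterviewEvents() {
  await sub.subscribe('interview.answer.completed', 'interview.session.evaluated');
  sub.on('message', (channel, raw) => {
    if (channel === 'interview.answer.completed') {
      const evt = JSON.parse(raw) as AnswerCompleted;
      // persist evt.transcript against the interview session
    } else if (channel === 'interview.session.evaluated') {
      const evt = JSON.parse(raw) as SessionEvaluated;
      // mark the session complete and surface evt.overallScore in the tracker
    }
  });
}
```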

Trade-offs

Three calls were close-run:

  • OpenRouter vs. direct vendor SDKs. Direct Anthropic SDK calls are slightly faster and let us use prompt caching cleanly; OpenRouter adds 30–80ms per call and supports prompt caching only partially. We kept OpenRouter for non-cached one-shots (resume diffs, JD parsing) so the product can experiment with new models without code changes, and used the Anthropic SDK directly for the hot paths (interview Q-generation, answer grading) where caching paid for itself.
  • Pipecat vs. building the voice loop ourselves. Pipecat is a young framework. Owning the orchestration code would have given us tighter control of every frame. The cost would have been three months of engineering rebuilding what Pipecat already does. We took the framework, contributed back patches, and accepted that some bugs would land in someone else's repo before our fix could ship.
  • MongoDB vs. PostgreSQL. Resumes and interviews are deeply nested, semi-structured documents. Mongo's schema flexibility kept iteration fast in the first six months. The price was weaker analytical queries — anything cross-cutting (cohort retention, conversion by JD source) needed a separate aggregation pipeline. We'd revisit this if Phiny scaled to a multi-tenant B2B surface.

Outcome

Phiny shipped end-to-end on the web, with the voice agent live in production. The interview p50 latency dropped from a baseline 3.4s to under 1.2s — a 65% reduction — by stage-overlapping intent and TTS, compressing prompts behind cache, and chunking TTS at the first sentence boundary. Prompt caching alone cut per-session LLM cost by roughly 10× on the hot paths. The platform now hosts paid interview prep packages, a recruiter-side JD intake, and university partnerships routing students into the same flow. The latency work was written up as a peer-reviewed paper — Latency Optimization in Production Voice AI Pipelines (Rodrigues, 2026).
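
The first-sentence chunking is simple enough to show in isolation. This is an illustrative reconstruction, not the production code; sendToTts stands in for the streaming TTS client:

```ts
// Flush the first complete sentence to TTS as soon as it appears in the
// LLM token stream, instead of waiting for the full reply.
const SENTENCE_END = /[.!?](\s|$)/;

export async function streamWithEarlyTts(
  tokens: AsyncIterable<string>,      // LLM token/delta stream
  sendToTts: (chunk: string) => void, // enqueue text for streaming TTS
) {
  let buffer = '';
  let firstFlushDone = false;
  for await (const token of tokens) {
    buffer += token;
    // TTS starts speaking while the model is still generating the rest.
    if (!firstFlushDone && SENTENCE_END.test(buffer)) {
      const cut = buffer.search(SENTENCE_END) + 1; // include the punctuation
      sendToTts(buffer.slice(0, cut));
      buffer = buffer.slice(cut);
      firstFlushDone = true;
    }
  }
  // Remainder after the stream ends; the production pipeline keeps
  // splitting subsequent sentences the same way.
  if (buffer.trim()) sendToTts(buffer);
}
```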

The system that came out of Phiny is what I keep reaching for now: typed contracts between a small number of single-purpose processes, prompt-cached LLMs at the edges, and an orchestrator that stays out of the audio hot path. It's the same skeleton I now use for every voice product.

Stack

NestJS
FastAPI
Pipecat
LiveKit
Claude
Anthropic SDK
ElevenLabs
Deepgram
OpenRouter
MongoDB
Redis
BullMQ
AWS Lambda
AWS S3
AWS SES
AWS SQS
Stripe
Razorpay
Apple Pay
Sentry
PostHog

Want help shipping something like this? Book a call, or grab the snippets this case study draws from.