Real-time voice agent (Pipecat + LiveKit)

The cascaded ASR → LLM → TTS pipeline is the workhorse of production voice AI. Below is the minimum Pipecat program that joins a LiveKit room and runs a fully-streaming agent.

Install

pip install "pipecat-ai[livekit,deepgram,anthropic,elevenlabs,silero]"

Agent

import asyncio
import os

from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
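# Import paths differ between Pipecat releases; the transport lives under pipecat.transports.
from pipecat.transports.services.livekit import LiveKitParams, LiveKitTransport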
from pipecat.vad.silero import SileroVADAnalyzer


async def main(room: str, token: str):
    transport = LiveKitTransport(
        url=os.environ["LIVEKIT_URL"],
        token=token,
        room_name=room,
        params=LiveKitParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])
    llm = AnthropicLLMService(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        model="claude-opus-4-7",
    )
    tts = ElevenLabsTTSService(
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id=os.environ["ELEVEN_VOICE_ID"],
    )

    context = OpenAILLMContext(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful concierge. Keep replies under 2 sentences.",
            }
        ]
    )

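    # Build the aggregator pair once so both sides come from the same object.
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio from the room
        stt,                             # speech -> text
        context_aggregator.user(),       # record the user turn
        llm,                             # generate the reply
        tts,                             # text -> speech
        transport.output(),              # audio back to the room
        context_aggregator.assistant(),  # record the assistant turn
    ])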

    task = PipelineTask(pipeline)

    @transport.event_handler("on_first_participant_joined")
    async def _on_join(_t, participant):
        await task.queue_frames([LLMMessagesFrame(context.messages)])

    @transport.event_handler("on_participant_left")
    async def _on_leave(_t, _p, _r):
        await task.queue_frame(EndFrame())

    await PipelineRunner().run(task)


if __name__ == "__main__":
    asyncio.run(main(os.environ["ROOM"], os.environ["TOKEN"]))

Latency budget that worked at 40k locations

| Stage | Target |
|---|---|
| ASR partial → final | < 250 ms |
| LLM TTFT | < 350 ms |
| TTS TTFB | < 250 ms |
| End-to-end | < 1.2 s |
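The three stage targets sum to 850 ms, so the 1.2 s end-to-end target leaves 350 ms for everything the table does not list. Reading that remainder as headroom for network and buffering is an assumption, not something stated above. A minimal sketch of checking measured timings against the budget follows; the names `BUDGET_MS`, `headroom_ms`, and `over_budget` and the sample measurements are hypothetical.

```python
from typing import Dict, List

# Per-stage targets from the table, in milliseconds.
BUDGET_MS: Dict[str, int] = {"asr_final": 250, "llm_ttft": 350, "tts_ttfb": 250}
END_TO_END_MS = 1200


def headroom_ms() -> int:
    """Budget left over once the three listed stages hit their targets."""
    return END_TO_END_MS - sum(BUDGET_MS.values())


def over_budget(measured_ms: Dict[str, float]) -> List[str]:
    """Return the stages whose measured latency meets or exceeds its target."""
    return [
        stage
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) >= limit
    ]


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    sample = {"asr_final": 210.0, "llm_ttft": 420.0, "tts_ttfb": 180.0}
    print(headroom_ms())        # 350
    print(over_budget(sample))  # ['llm_ttft']
```

Running a check like this per turn makes it obvious which stage to look at first when the end-to-end number drifts.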

Five optimizations together cut measured end-to-end latency by 41.8% in production:

1. Streaming-first TTS chunking (start synthesizing on the first sentence boundary, not on the full reply).
2. Concurrent intent detection and synthesis (the biggest single win).
3. Prompt compression on the LLM call (cache the long system prompt; trim turn history).
4. Session-state caching to avoid re-priming Claude every turn.
5. Adaptive VAD endpointing (Silero sensitivity tuned per environment noise floor).
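To make the streaming-first chunking idea concrete, here is a minimal sketch that emits each sentence as soon as its boundary has arrived in the token stream. It is an illustration under simple assumptions, not Pipecat's internals and not the implementation from the paper: the function name `sentence_chunks` is hypothetical, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is deliberately naive and will mis-split abbreviations.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (a deliberately naive rule).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its boundary has streamed in."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part is still incomplete; keep accumulating.
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush the remainder when the stream ends


if __name__ == "__main__":
    llm_tokens = ["Hel", "lo there", ". How", " can I", " help?"]
    for chunk in sentence_chunks(llm_tokens):
        print(chunk)
```

With the sample tokens, the first sentence is yielded as soon as the third token arrives, so synthesis could begin while the model is still generating the second sentence.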

Source: my paper Latency Optimization in Production Voice AI Pipelines (Rodrigues, 2026).

How it works

The whole agent is one Pipecat Pipeline — an ordered list of processors that audio frames flow through. At the front, LiveKitTransport joins a WebRTC room using a URL and a token, with audio in and out enabled and voice activity detection wired to a SileroVADAnalyzer. VAD is what lets the agent know when the user has actually started and stopped speaking, rather than guessing on silence timers.

The three services are the cascade itself: DeepgramSTTService turns incoming audio into text, AnthropicLLMService generates the reply, and ElevenLabsTTSService synthesizes that reply back to audio. Each is configured from environment variables for its API key (and, for TTS, a voice ID). The conversation history lives in an OpenAILLMContext seeded with a system message that tells the model to act as a concierge and keep replies under two sentences — a deliberate choice for a voice agent, where long replies feel slow.

The pipeline order matters: transport.input() feeds audio to STT, the user-side context aggregator records what was said, the LLM produces a reply, TTS speaks it through transport.output(), and the assistant-side context aggregator appends the reply back into the running history so the next turn has context. Two event handlers bracket the session — on_first_participant_joined queues the initial messages so the agent greets the caller, and on_participant_left queues an EndFrame to shut the pipeline down cleanly. PipelineRunner().run(task) drives the whole thing, and main is launched with asyncio.run, reading the room name and token from the environment.
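Because the assistant-side aggregator keeps appending to the running history, the message list grows with every turn; trimming it is one of the optimizations listed earlier. The sketch below shows one way to do that on a plain list of role/content dicts like the one passed to OpenAILLMContext. The function name `trim_history` and the default window size are hypothetical, and writing the trimmed list back into the context depends on the Pipecat version, so that step is left out.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_turn_messages: int = 8) -> List[Message]:
    """Keep every system message plus the most recent non-system messages."""
    system = [m for m in messages if m.get("role") == "system"]
    turns = [m for m in messages if m.get("role") != "system"]
    kept = turns[-max_turn_messages:] if max_turn_messages > 0 else []
    # Start the window on a user message so roles still alternate cleanly.
    while kept and kept[0].get("role") != "user":
        kept = kept[1:]
    return system + kept


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are a helpful concierge."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I help?"},
        {"role": "user", "content": "Book a table for two."},
    ]
    print(trim_history(history, max_turn_messages=2))
```

Dropping a leading assistant message costs one extra message of context, but it avoids handing the model a window that opens mid-exchange.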

Notes & gotchas

This is the cascaded ASR → LLM → TTS pattern, and its defining challenge is latency: the numbers in the budget table above exist because every stage adds delay that the caller hears. The single most useful design rule is to keep everything streaming: synthesize on the first sentence boundary instead of waiting for the full reply, and keep the model's replies short so there is less to speak.

On the operational side, the agent reads a token from the environment, which means something has to mint it — a server-side token endpoint (see the companion LiveKit token-server snippet). Never generate that token on the client or embed your LIVEKIT_API_SECRET in the agent's caller; the agent should receive an already-signed, short-lived token. Likewise, every service key here (DEEPGRAM_API_KEY, ANTHROPIC_API_KEY, ELEVENLABS_API_KEY) is read from the environment for a reason: this process runs server-side and the keys must stay there. Tune the Silero VAD sensitivity to the deployment's noise floor — too aggressive and it clips the user mid-sentence, too lax and the agent talks over them.
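Since the agent depends on seven environment variables (the ones read in the agent code), a missing one otherwise surfaces as a bare KeyError partway through startup. A minimal fail-fast check, run before main, might look like the sketch below; the helper names `missing_env` and `require_env` are hypothetical, while the variable names are the ones the agent reads.

```python
import os
from typing import List, Mapping, Sequence

# Every variable the agent reads from the environment.
REQUIRED_ENV = (
    "LIVEKIT_URL",
    "DEEPGRAM_API_KEY",
    "ANTHROPIC_API_KEY",
    "ELEVENLABS_API_KEY",
    "ELEVEN_VOICE_ID",
    "ROOM",
    "TOKEN",
)


def missing_env(
    environ: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_ENV,
) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def require_env(
    environ: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_ENV,
) -> None:
    """Raise one error naming every missing variable, without printing any values."""
    missing = missing_env(environ, required)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

The error lists names only, so secrets never end up in logs.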