1. Establish a WebSocket connection
   Connect to the ElevenAgents WebSocket endpoint with your API key. Configure the agent's voice, model, and system prompt.

2. Stream user audio
   Send the user's microphone audio in PCM or mu-law format. The server runs speech-to-text (STT) in real time.

3. LLM processes the transcript
   The user's speech is transcribed and passed to the configured LLM. The agent generates a contextual response, optionally calling external tools.

4. Receive streamed TTS audio
   The agent's response is synthesized and streamed back as audio chunks. The total round trip is under 300 ms with Flash models.

5. Handle turn-taking and interruptions
   The system automatically manages when to listen and when to speak. Users can interrupt mid-response, and the agent adapts naturally.
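Step 2 accepts audio in PCM or mu-law format. If your capture pipeline produces 16-bit linear PCM and you want the smaller 8-bit mu-law encoding instead, the standard G.711 mu-law algorithm can be done in a few lines of pure Python. This is a standalone sketch of that encoding, not code from any particular SDK:

```python
def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: the position of the highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # mu-law bytes are inverted


def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM to mu-law bytes (halves the size)."""
    samples = (int.from_bytes(pcm[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm), 2))
    return bytes(linear_to_mulaw(s) for s in samples)
```

Mu-law halves the bitrate relative to 16-bit PCM at telephone-grade quality, which is why it is the common choice when bridging phone calls into a voice agent.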
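The five steps above can be sketched as a single client loop. The endpoint URL and the JSON message field names below are illustrative placeholders, not the documented ElevenAgents wire protocol; consult the API reference for the real schema.

```python
import base64
import json

# Hypothetical endpoint, for illustration only.
ENDPOINT = "wss://api.example.com/v1/agents/ws"


def init_message(api_key: str, voice: str, model: str, prompt: str) -> str:
    """Step 1: build a (hypothetical) session-setup message."""
    return json.dumps({
        "api_key": api_key,
        "voice": voice,
        "model": model,
        "system_prompt": prompt,
    })


def audio_message(chunk: bytes) -> str:
    """Step 2: wrap a chunk of microphone audio as base64 inside JSON."""
    return json.dumps({"type": "audio",
                       "data": base64.b64encode(chunk).decode()})


async def run_conversation(mic_chunks, play_audio) -> None:
    """Steps 1-5: connect, stream mic audio, and play streamed TTS replies."""
    import websockets  # pip install websockets; imported lazily on purpose

    async with websockets.connect(ENDPOINT) as ws:
        await ws.send(init_message("MY_API_KEY", "alice", "flash-v2", "Be concise."))
        async for chunk in mic_chunks:           # step 2: user audio in
            await ws.send(audio_message(chunk))
            reply = json.loads(await ws.recv())  # steps 3-4: STT -> LLM -> TTS
            if reply.get("type") == "audio":
                play_audio(base64.b64decode(reply["data"]))
            # Step 5 (turn-taking/interruptions) would cancel playback when
            # new user speech arrives; omitted to keep the sketch short.
```

In a real client the send and receive sides run as concurrent tasks rather than in lockstep, so that TTS chunks can arrive while the microphone is still streaming.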
Concurrency scaling rule of thumb
A concurrency limit of 5 supports roughly 100 simultaneous voice conversations. This works because TTS generation takes far less time than audio playback: while one conversation is playing audio to the user, that slot's API capacity is free to serve other requests. The Scale tier (15 concurrent requests on Multilingual, 30 on Flash) can therefore support hundreds of simultaneous calls.
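The rule of thumb is a duty-cycle calculation: each conversation occupies an API slot only while TTS is generating, not while audio plays back. A minimal sketch, assuming generation takes about 50 ms per second of playback (a 5% duty cycle, an assumed figure chosen to match the "limit of 5, about 100 calls" ratio above, not a published benchmark):

```python
def estimated_calls(concurrency_limit: int,
                    generation_s: float,
                    playback_s: float) -> int:
    """Estimate simultaneous conversations a concurrency limit can serve.

    Each call holds a slot only while TTS is generating, so capacity
    scales by the inverse of the generation/playback duty cycle.
    """
    duty_cycle = generation_s / playback_s
    return round(concurrency_limit / duty_cycle)


# Assumed: ~50 ms of generation per 1 s of playback (5% duty cycle).
print(estimated_calls(5, 0.05, 1.0))   # -> 100
print(estimated_calls(30, 0.05, 1.0))  # -> 600 (Scale tier, Flash)
```

In practice the duty cycle varies with model speed and response length, so treat the output as an order-of-magnitude estimate rather than a capacity guarantee.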