1. Establish a WebSocket connection
   Connect to the ElevenAgents WebSocket endpoint with your API key. Configure the agent's voice, model, and system prompt.

2. Stream user audio
   Send the user's microphone audio in PCM or mu-law format. The server runs speech-to-text (STT) in real time.

3. LLM processes the transcript
   The user's speech is transcribed and passed to the configured LLM. The agent generates a contextual response, optionally calling external tools.

4. Receive streamed TTS audio
   The agent's response is synthesized and streamed back as audio chunks. The total round trip is under 300 ms with Flash models.

5. Handle turn-taking and interruptions
   The system automatically manages when to listen and when to speak. Users can interrupt mid-response, and the agent adapts naturally.
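Step 2 accepts audio in PCM or mu-law format. If your capture pipeline produces 16-bit linear PCM and you want the smaller 8-bit mu-law encoding instead, the standard G.711 mu-law algorithm can be done in a few lines of pure Python. This is a standalone sketch of that encoding, not code from any particular SDK:

```python
def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: the position of the highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # mu-law bytes are inverted


def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM to mu-law bytes (halves the size)."""
    samples = (int.from_bytes(pcm[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm), 2))
    return bytes(linear_to_mulaw(s) for s in samples)
```

Mu-law halves the bitrate relative to 16-bit PCM at telephone-grade quality, which is why it is the common choice when bridging phone calls into a voice agent.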
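The five steps above can be sketched as a single client loop. The endpoint URL and the JSON message field names below are illustrative placeholders, not the documented ElevenAgents wire protocol; consult the API reference for the real schema.

```python
import base64
import json

# Hypothetical endpoint, for illustration only.
ENDPOINT = "wss://api.example.com/v1/agents/ws"


def init_message(api_key: str, voice: str, model: str, prompt: str) -> str:
    """Step 1: build a (hypothetical) session-setup message."""
    return json.dumps({
        "api_key": api_key,
        "voice": voice,
        "model": model,
        "system_prompt": prompt,
    })


def audio_message(chunk: bytes) -> str:
    """Step 2: wrap a chunk of microphone audio as base64 inside JSON."""
    return json.dumps({"type": "audio",
                       "data": base64.b64encode(chunk).decode()})


async def run_conversation(mic_chunks, play_audio) -> None:
    """Steps 1-5: connect, stream mic audio, and play streamed TTS replies."""
    import websockets  # pip install websockets; imported lazily on purpose

    async with websockets.connect(ENDPOINT) as ws:
        await ws.send(init_message("MY_API_KEY", "alice", "flash-v2", "Be concise."))
        async for chunk in mic_chunks:           # step 2: user audio in
            await ws.send(audio_message(chunk))
            reply = json.loads(await ws.recv())  # steps 3-4: STT -> LLM -> TTS
            if reply.get("type") == "audio":
                play_audio(base64.b64decode(reply["data"]))
            # Step 5 (turn-taking/interruptions) would cancel playback when
            # new user speech arrives; omitted to keep the sketch short.
```

In a real client the send and receive sides run as concurrent tasks rather than in lockstep, so that TTS chunks can arrive while the microphone is still streaming.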
Concurrency scaling rule of thumb
A concurrency limit of 5 supports roughly 100 simultaneous voice conversations. This works because TTS generation takes far less time than audio playback: while one conversation is playing audio to the user, that slot's API capacity is free to serve other requests. The Scale tier (15 concurrent requests on Multilingual, 30 on Flash) can therefore support hundreds of simultaneous calls.
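The rule of thumb is a duty-cycle calculation: each conversation occupies an API slot only while TTS is generating, not while audio plays back. A minimal sketch, assuming generation takes about 50 ms per second of playback (a 5% duty cycle, an assumed figure chosen to match the "limit of 5, about 100 calls" ratio above, not a published benchmark):

```python
def estimated_calls(concurrency_limit: int,
                    generation_s: float,
                    playback_s: float) -> int:
    """Estimate simultaneous conversations a concurrency limit can serve.

    Each call holds a slot only while TTS is generating, so capacity
    scales by the inverse of the generation/playback duty cycle.
    """
    duty_cycle = generation_s / playback_s
    return round(concurrency_limit / duty_cycle)


# Assumed: ~50 ms of generation per 1 s of playback (5% duty cycle).
print(estimated_calls(5, 0.05, 1.0))   # -> 100
print(estimated_calls(30, 0.05, 1.0))  # -> 600 (Scale tier, Flash)
```

In practice the duty cycle varies with model speed and response length, so treat the output as an order-of-magnitude estimate rather than a capacity guarantee.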