Building Conversational AI Agents

All plans (Enterprise recommended for production)

Step 1: Establish WebSocket connection

Connect to the ElevenAgents WebSocket endpoint with your API key. Configure the agent's voice, model, and system prompt.
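The connection setup can be sketched as follows. This is a minimal illustration only: the endpoint URL and the field names of the initialization message are placeholders, not the documented schema, so check the API reference for the real values.

```python
import json

# Placeholder endpoint -- substitute the real ElevenAgents WebSocket URL.
AGENT_WS_URL = "wss://api.example.com/v1/agents/ws"

def build_init_message(api_key, voice_id, model, system_prompt):
    """Assemble the first message sent after the WebSocket opens.

    Field names here are illustrative assumptions, not the actual schema.
    """
    return json.dumps({
        "api_key": api_key,
        "voice_id": voice_id,
        "model": model,
        "system_prompt": system_prompt,
    })

init = build_init_message(
    api_key="sk-...",
    voice_id="rachel",
    model="flash-v2",
    system_prompt="You are a helpful voice agent.",
)
```

Sending this message immediately after the socket opens configures the agent for the rest of the session.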

Step 2: Stream user audio

Send user microphone audio in PCM or μ-law format. The server performs speech-to-text (STT) on the stream in real time.
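Microphone audio is typically sent as small base64-encoded frames. The sketch below chunks raw 16 kHz, 16-bit mono PCM into 100 ms frames; the chunk size and encoding step are common practice for audio-over-WebSocket APIs, but the exact message shape is an assumption here.

```python
import base64

# 100 ms of 16 kHz, 16-bit mono PCM: 16000 samples/s * 2 bytes * 0.1 s
CHUNK_SIZE = 3200

def pcm_frames(pcm_bytes, chunk_size=CHUNK_SIZE):
    """Yield base64-encoded audio frames ready to send over the socket."""
    for i in range(0, len(pcm_bytes), chunk_size):
        yield base64.b64encode(pcm_bytes[i:i + chunk_size]).decode("ascii")

# One second of silence as stand-in microphone input.
frames = list(pcm_frames(b"\x00" * 32000))
```

Each frame would then be wrapped in whatever audio-event message the protocol defines and sent over the open connection.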

Step 3: LLM processes transcript

The user's speech is transcribed and passed to the configured LLM. The agent generates a contextual response, optionally calling external tools.
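Tool calling usually means the LLM emits a structured request that your code dispatches to a local handler. Here is a minimal dispatch sketch; the tool name, the `{"name": ..., "arguments": ...}` call shape, and `get_weather` itself are hypothetical.

```python
def get_weather(city):
    # Hypothetical tool; a real agent would call an external API here.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def handle_tool_call(call):
    """Dispatch a tool call emitted by the LLM and return its result."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool: {call['name']}"}
    return {"result": fn(**call["arguments"])}

out = handle_tool_call({"name": "get_weather", "arguments": {"city": "Oslo"}})
```

The result is sent back to the LLM, which folds it into the spoken response.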

Step 4: Receive streamed TTS audio

The agent's response is synthesized and streamed back as audio chunks, with total round-trip latency under 300 ms when using Flash models.
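On the client side, streamed chunks are decoded and appended to a playback buffer as they arrive. A minimal sketch, assuming each audio event carries a base64 `"audio"` field (an assumption; verify against the actual event schema):

```python
import base64

class PlaybackBuffer:
    """Accumulates decoded TTS audio as chunks stream in from the server."""

    def __init__(self):
        self.pcm = bytearray()

    def on_audio_event(self, event):
        # Assumed event shape: {"audio": "<base64 PCM>"}.
        self.pcm.extend(base64.b64decode(event["audio"]))

buf = PlaybackBuffer()
for chunk in (b"abc", b"def"):
    buf.on_audio_event({"audio": base64.b64encode(chunk).decode("ascii")})
```

Feeding the buffer to the audio device as it fills, rather than waiting for the full response, is what keeps perceived latency low.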

Step 5: Handle turn-taking and interruptions

The system automatically manages when to listen and when to speak. Users can interrupt, and the agent adapts naturally.
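The server manages turn-taking, but the client still needs to react when the user barges in. One common pattern, sketched here under the assumption that the server emits a user-speech signal, is to discard any queued agent audio so playback stops at the next chunk boundary:

```python
class TurnManager:
    """Minimal barge-in sketch: drop queued agent audio when the user speaks."""

    def __init__(self):
        self.queued_audio = []
        self.agent_speaking = False

    def on_agent_audio(self, chunk):
        self.agent_speaking = True
        self.queued_audio.append(chunk)

    def on_user_speech_detected(self):
        if self.agent_speaking:
            self.queued_audio.clear()    # discard the rest of the response
            self.agent_speaking = False  # yield the turn back to the user

tm = TurnManager()
tm.on_agent_audio(b"\x00\x01")
tm.on_user_speech_detected()
```

Clearing the local queue is what makes an interruption feel immediate, even though some audio may already be in flight.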

Concurrency scaling rule of thumb

A concurrency limit of 5 supports approximately 100 simultaneous voice conversations. This works because TTS generation time is much shorter than audio playback time: while one conversation is playing audio to the user, its API capacity is free to serve other requests. The Scale tier (concurrency of 15 for Multilingual, 30 for Flash) can therefore support hundreds of simultaneous calls.
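The rule of thumb falls out of a duty-cycle calculation. The specific timings below are illustrative assumptions (generating 10 s of audio in roughly 0.5 s of API time, i.e. each conversation occupies a slot about 5% of the time), chosen so the arithmetic matches the stated 5-to-100 ratio:

```python
# Back-of-the-envelope duty-cycle estimate behind the rule of thumb.
generation_time = 0.5   # seconds of API work per utterance (assumption)
playback_time = 10.0    # seconds the user spends listening (assumption)
duty_cycle = generation_time / playback_time  # fraction of time a slot is busy

def supported_calls(concurrency_limit, duty=duty_cycle):
    """Estimate simultaneous conversations a concurrency limit can serve."""
    return round(concurrency_limit / duty)
```

With these numbers, a limit of 5 yields about 100 conversations, and the Scale-tier Flash limit of 30 yields about 600 in theory, though real traffic is burstier than a uniform duty cycle assumes.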