Try It Live
Call our demo number to experience the AI voice agent in action.
It demonstrates natural conversation and can answer your questions.
How It Works
Phase 1: Connection
Phase 2: Conversation Loop
Total response time: under 1 second
How It's Built
A real-time voice pipeline that chains Twilio telephony, speech recognition, an LLM, and Cartesia text-to-speech, orchestrated on Fly.io infrastructure.
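A rough sketch of how such a loop can be wired together is below. The transcribe(), generateReply(), and queueSpeakText() stubs are hypothetical stand-ins for the STT, LLM, and Cartesia stages, not the project's actual code; only the Twilio media-message shape is from Twilio's documented Media Streams protocol.

```js
// High-level sketch: one WebSocket per call carries Twilio media frames in
// and synthesized audio back out. Stubs below are illustrative stand-ins.
import { WebSocketServer } from 'ws';

// Hypothetical STT stage: buffers frames, returns text at end of utterance.
async function transcribe(base64Audio) { return null; }
// Hypothetical LLM stage: tools + trimmed conversation history.
async function generateReply(text) { return ''; }
// Hypothetical TTS stage: streams Cartesia audio back as Twilio media messages.
async function queueSpeakText(text, ws) {}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', async (frame) => {
    const msg = JSON.parse(frame.toString());
    if (msg.event !== 'media') return; // Twilio also sends start/stop events

    const text = await transcribe(msg.media.payload); // base64 mu-law audio
    if (!text) return; // still mid-utterance

    const reply = await generateReply(text);
    await queueSpeakText(reply, ws);
  });
});
```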
What It Does
Performance
Challenges Faced
Two problems surfaced during live demo testing.

First, after the agent closed the WebSocket the caller stayed on a silent line for nearly a minute, because the TwiML kept a <Pause> verb running after <Connect><Stream> ended. Solution: removed the <Pause> so Twilio ends the PSTN leg as soon as the stream closes.

Second, the original goodbye detector matched natural phrases like "take care" and "goodbye" anywhere in an AI response, so the system hung up on callers mid-question whenever the model happened to use one of those phrases. Solution: replaced it with a silent sentinel token [[END_CALL]] that the model emits only after the caller has explicitly wrapped up. The handler strips the token before sending text to TTS, then closes the connection once the spoken audio finishes. Still tuning the prompt's negative examples so the model does not emit the sentinel proactively.
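A minimal sketch of the sentinel handling, assuming a speakText() helper that resolves when playback finishes; the names and wiring are illustrative, not the project's actual handler:

```js
const END_CALL_TOKEN = '[[END_CALL]]';

// speakText is an assumed helper: resolves once TTS playback has finished.
async function handleAiResponse(responseText, ws, speakText) {
  const shouldHangUp = responseText.includes(END_CALL_TOKEN);

  // Strip the sentinel so it is never sent to TTS or spoken aloud.
  const spokenText = responseText.replaceAll(END_CALL_TOKEN, '').trim();

  if (spokenText) {
    await speakText(spokenText);
  }

  if (shouldHangUp) {
    // With the trailing <Pause> removed from the TwiML, closing the media
    // stream lets Twilio end the PSTN leg right away.
    ws.close();
  }
}
```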
Conversation history grew with each turn (user message + AI response + tool results), increasing token count from ~1,200 to ~2,500+ over 4-5 turns. This caused latency creep and would have eventually hit context limits. Solution: added context window management to keep token usage stable across longer calls.
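One way to implement that management, sketched below, is to keep the system prompt plus only the most recent messages before each LLM call; the limit and message layout here are assumptions, not the project's actual policy:

```js
// Sketch: cap conversation history before each LLM call by keeping the
// system prompt and only the N most recent messages. The limit is an
// illustrative assumption (roughly the last 4-6 turns).
const MAX_HISTORY_MESSAGES = 12;

function trimHistory(messages) {
  const [system, ...rest] = messages; // assume messages[0] is the system prompt
  if (rest.length <= MAX_HISTORY_MESSAGES) return messages;
  return [system, ...rest.slice(-MAX_HISTORY_MESSAGES)];
}
```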
Concurrent TTS requests were hitting Cartesia's WebSocket connection limit (max 2), causing audio failures. Solution: added queueSpeakText() method in cartesia.js that processes TTS requests sequentially, preventing rate limit errors.
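The internals of such a queue might look like the sketch below; only the queueSpeakText() name comes from the write-up, the promise-chaining implementation is an assumption:

```js
// Sketch of a sequential TTS queue: each request chains onto the previous
// one so at most one Cartesia WebSocket request is in flight at a time.
class CartesiaClient {
  constructor() {
    this.ttsQueue = Promise.resolve();
  }

  queueSpeakText(text) {
    // Chain onto the tail of the queue; swallow upstream errors so one
    // failed utterance does not block every later one.
    this.ttsQueue = this.ttsQueue
      .catch(() => {})
      .then(() => this.speakText(text));
    return this.ttsQueue;
  }

  async speakText(text) {
    // Hypothetical: open/reuse the WebSocket, stream audio, resolve when done.
  }
}
```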
The Cartesia TTS SDK v2.2.9 had a critical bug where websocket.send() hung indefinitely in Node.js, causing 10-second timeouts and complete silence. Solution: bypassed the SDK entirely and implemented direct WebSocket connection to Cartesia's API.
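A sketch of that workaround using the ws package; the endpoint, query parameters, model id, and message fields below are assumptions modeled on Cartesia's documented WebSocket API and should be verified against the current reference:

```js
// Sketch: raw WebSocket to Cartesia, bypassing the SDK entirely.
import WebSocket from 'ws';

// Assumed endpoint and auth scheme; check Cartesia's API reference.
const url =
  'wss://api.cartesia.ai/tts/websocket' +
  `?api_key=${process.env.CARTESIA_API_KEY}` +
  '&cartesia_version=2024-06-10';

const socket = new WebSocket(url);

socket.on('open', () => {
  socket.send(JSON.stringify({
    context_id: 'call-123',      // illustrative id
    model_id: 'sonic-english',   // assumed model name
    transcript: 'Hello! How can I help you today?',
    voice: { mode: 'id', id: 'YOUR_VOICE_ID' },
    // mu-law at 8 kHz matches what Twilio media streams expect.
    output_format: { container: 'raw', encoding: 'pcm_mulaw', sample_rate: 8000 },
  }));
});

socket.on('message', (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === 'chunk') {
    // msg.data is base64 audio; forward it to the Twilio media stream.
  } else if (msg.type === 'done') {
    socket.close();
  }
});
```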
LLMs sometimes "leak" internal function call syntax into their spoken responses, saying things like "Note: customer wants..." instead of speaking naturally while silently collecting data. Solution: explicit prompt instructions with examples of bad vs good responses.
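An illustrative excerpt of what such instructions can look like; the wording is assumed, not the project's actual prompt:

```js
// Hypothetical system-prompt excerpt showing the bad-vs-good pattern
// described above. Wording is illustrative only.
const STYLE_RULES = `
Never narrate internal actions, notes, or tool calls. Record data silently.

BAD:  "Note: customer wants a callback tomorrow. I'll record that."
GOOD: "Sure, I'll have someone call you back tomorrow."
`;
```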
The agent was loading 1,286 Q&A entries instead of 7, causing 13-14k tokens per LLM call and growing latency (3s → 4s → 5s). Root cause: a database view was aggregating Q&As from ALL users instead of just the current user. Diagnosis came from noticing the massive token count in logs. Solution: fixed the data source to return only the user's own Q&A entries.
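The shape of that fix, sketched with a hypothetical query helper; the table and column names are assumptions:

```js
// Sketch of the data-source fix: scope the Q&A lookup to the current user
// instead of reading from the view that aggregated all users' entries.
// db.query(), qa_entries, and user_id are hypothetical names.
async function loadQaEntries(db, userId) {
  const { rows } = await db.query(
    'SELECT question, answer FROM qa_entries WHERE user_id = $1',
    [userId],
  );
  return rows; // the user's own entries, not every user's
}
```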