The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet, but a 3s wait for a reply seems rage-inducing if you're actually trying to accomplish something.
One easy way to build voice agents and connect them to Twilio is the Pipecat open source framework. Pipecat supports a wide variety of network transports, including the Twilio MediaStream WebSocket protocol so you don't have to bounce through a SIP server. Here's a getting started doc.[1]
(If you do need SIP, this Asterisk project looks really great.)
Pipecat has 90 or so integrations with all the models/services people use for voice AI these days. NVIDIA, AWS, all the foundation labs, all the voice AI labs, most of the video AI labs, and lots of other people use/contribute to Pipecat. And there's lots of interesting stuff in the ecosystem, like the open source, open data, open training code Smart Turn audio turn detection model [2], and the Pipecat Flows state machine library [3].
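The "conversation as state machine" idea behind a flows library can be sketched in plain Python. This is NOT the Pipecat Flows API, just an illustration of the shape: each node has a prompt for the LLM and a transition table keyed on recognized events:

```python
# A conversation flow as a tiny state machine -- plain Python,
# purely illustrative, not the Pipecat Flows API.
FLOW = {
    "greet":   {"prompt": "Ask for the caller's name.",         "next": {"got_name": "book"}},
    "book":    {"prompt": "Offer available reservation times.", "next": {"time_chosen": "confirm"}},
    "confirm": {"prompt": "Read back the booking and confirm.", "next": {"confirmed": "done"}},
    "done":    {"prompt": "Say goodbye.",                       "next": {}},
}

def advance(state: str, event: str) -> str:
    """Move to the next node, or stay put on an unrecognized event."""
    return FLOW[state]["next"].get(event, state)
```

The practical win of this structure is that each node only exposes the prompt and tools relevant to that step, instead of one giant system prompt for the whole call.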
Disclaimer: I spend a lot of my time working on Pipecat, and writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/
I developed a stack on Cloudflare Workers where latency is super low and it's cheap to run at scale thanks to Cloudflare's pricing.
Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as the LLM, and InWorld.ai as the TTS (for me it's on par with ElevenLabs and super fast).
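A back-of-the-envelope version of that figure. Every per-unit price and usage number below is an illustrative assumption, not a quote; check each provider's current pricing:

```python
# Rough per-hour cost model for a voice agent. All constants are
# assumptions for the sketch, not actual provider prices.
STT_PER_MIN = 0.005       # assumed streaming STT price, $/audio-minute
TTS_PER_1K_CHARS = 0.01   # assumed TTS price, $/1k characters
LLM_PER_1M_IN = 0.10      # assumed Flash-class input price, $/1M tokens
LLM_PER_1M_OUT = 0.40     # assumed output price, $/1M tokens

def cost_per_hour(agent_talk_frac=0.5, chars_per_min=750,
                  in_tokens_per_min=2000, out_tokens_per_min=250):
    stt = 60 * STT_PER_MIN  # the whole hour is transcribed
    tts = 60 * agent_talk_frac * chars_per_min / 1000 * TTS_PER_1K_CHARS
    llm = 60 * (in_tokens_per_min * LLM_PER_1M_IN +
                out_tokens_per_min * LLM_PER_1M_OUT) / 1_000_000
    return stt + tts + llm
```

With these assumed inputs the total lands around $0.54/hour, with STT and TTS dominating and the LLM almost free, which matches the general shape of the claim above.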
Please don't. I had a talk with a shitty AI bot on a FedEx line. It's absolute crap. Just give me a 'Press 1 for x, press 2 for y'. Then I don't need to guess what the possibilities are.
Voice-controlled phone systems are hugely rage-inducing for me. I am often in loud settings with background chatter. Muting my audio and using a touchtone keypad is so much more accurate and easy than having to find a quiet place and worry that somebody nearby will say something the voice response system picks up.
One problem is that once you're deep into building a phone IVR workflow beyond X or Y (yes, these are intentional), callers don't care about some deep and featured input menu. They just mash 0 or pick a random option and demand a human finish the job and transfer them - understandably.
When you're committed to phone intent complexity (hell), the AI-assisted options are sort of less bad, since you don't have to explain the menu to callers; they just make demands.
I’m honestly surprised it hasn’t been more prevalent yet. I still get call centre type spam calls where you can hear all the background noise of the rest of the call centre.
I assume it's to make it seem like an actual call center rather than a scam. I recently got two phone scam attempts (credit card related) that sounded exactly like this.
I built a voice AI stack, and background noise can be really helpful for, say, a restaurant AI. Italian background music or café ambience is part of the brand. It's not meant to make the caller believe this isn't a bot, but only to keep the AI call on brand.
Is that really where SOTA is right now?
500-1000ms is borderline acceptable.
Sub-300ms is closer to SOTA.
2000ms or more means people will hang up.
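Those targets are really a budget spread across pipeline stages, and voice-to-voice latency is roughly their sum. A toy breakdown, where every millisecond figure is an illustrative assumption rather than a benchmark:

```python
# Voice-to-voice latency is roughly the sum of these stages.
# All figures are illustrative assumptions, not measurements.
budget_ms = {
    "telephony + network": 100,
    "endpointing (confirming the user finished speaking)": 200,
    "STT finalization": 100,
    "LLM time-to-first-token": 300,
    "TTS time-to-first-byte": 150,
}

total = sum(budget_ms.values())
print(f"voice-to-voice: ~{total} ms")  # ~850 ms with these assumptions
for stage, ms in budget_ms.items():
    print(f"  {stage}: {ms} ms")
```

Seen this way, sub-300ms totals require cutting several stages at once (colocated models, speculative endpointing, streaming everything), which is why most stacks land closer to the 500-1000ms band.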
[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets [2] - https://github.com/pipecat-ai/smart-turn [3] - https://github.com/pipecat-ai/pipecat-flows/
That's why I created a stack entirely on Cloudflare Workers and Durable Objects, in JavaScript.
Providers like AssemblyAI and Deepgram now integrate VAD into their realtime APIs, so our voice AI only needs networking (no CPU-bound audio processing on our side anymore).
In your opinion, how close is Pipecat + OSS to replacing proprietary infra from Vapi, Retell, Sierra, etc?
p.s. do you do paid consult?
Sort of like how Jira can be a streamlined tool or a prison of 50-step workflows, it's all up to the designer.