
Building AI Voice Agents That Sound Human

73% of callers couldn't tell the difference. The engineering behind natural-sounding AI that handles multi-agent workflows, transfers, and booking.

February 1, 2025

The uncanny valley of voice AI

Most voice AIs sound like robots reading a script. They respond too fast, never pause, and transition between topics with the smoothness of a brick wall. Callers know within 3 seconds that they're not talking to a person.

We set a different bar for getarbol: callers shouldn't be able to tell. Not "maybe it's AI." Not "it's good for AI." Genuinely indistinguishable.

73% of callers in blind tests couldn't tell. Here's how. (Read the full getarbol case study for business impact and technical architecture.)

The three things that break the illusion

1. Latency

Humans respond within 300-800ms in phone conversations. Any faster feels robotic. Any slower feels like a bad connection. Our AI needed to fall in that exact window.

The challenge: an AI voice pipeline has 5 steps (receive audio → transcribe → reason → generate text → synthesize speech), and each one takes time.

Step               Target latency    Actual
Audio reception    50ms              30ms
Transcription      100ms             80ms
LLM reasoning      200ms             150ms
Text generation    100ms             90ms
Speech synthesis   50ms              40ms

Total pipeline: under 400ms. We hit it by streaming every stage — the LLM starts generating while transcription is still finishing, and speech synthesis begins on the first tokens of the response.
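The streaming idea can be sketched with chained async generators: each stage consumes partial output from the one before it, so downstream work begins before upstream work finishes. This is an illustrative sketch with simulated stages and made-up delays, not our production stack.

```python
import asyncio

async def audio_source():
    # Simulated audio frames arriving from the caller every 20 ms.
    for i in range(3):
        await asyncio.sleep(0.02)
        yield f"frame{i}"

async def transcribe(frames):
    # Emit a partial transcript per frame, so reasoning can start
    # before the caller has finished speaking.
    async for frame in frames:
        yield f"word{frame[-1]}"

async def reason(words):
    # Begin generating the reply from the first partial transcript.
    async for word in words:
        yield f"reply-to-{word}"

async def synthesize(tokens):
    # Start speech synthesis on the first response tokens.
    out = []
    async for tok in tokens:
        out.append(f"audio({tok})")
    return out

async def run_pipeline():
    # Stages are composed end-to-end; nothing waits for a full buffer.
    return await synthesize(reason(transcribe(audio_source())))

audio_out = asyncio.run(run_pipeline())
print(audio_out)
```

Because every stage is a generator, the first synthesized audio chunk is ready roughly one frame after the caller starts talking, rather than after the whole utterance.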

2. Conversational rhythm

Real conversations aren't a clean back-and-forth. People interrupt. They say "um." They pause mid-sentence. They laugh. They change topics mid-thought.

We modeled conversational rhythm from 10,000 recorded phone calls (anonymized, of course). Key patterns we replicated:

  • Thinking pauses — When asked a complex question, the AI pauses 200-400ms before responding. Humans do this naturally.
  • Filler acknowledgments — "Mm-hmm," "right," "I see" during the caller's speech. Not silence.
  • Graceful interruption handling — If the caller interrupts, the AI stops immediately and listens. No "please wait while I finish."
  • Topic transitions — A small pause and acknowledgment before switching contexts: "Got it. Now, about your reservation..."
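Two of these rules are simple enough to sketch in code. The 200-400ms pause range comes from this post; the function names and the TTS player interface are hypothetical stand-ins for illustration.

```python
FILLERS = ["Mm-hmm.", "Right.", "I see."]  # acknowledgments during caller speech

def thinking_pause_ms(question_complexity: float) -> int:
    """Scale the pre-response pause (200-400 ms) with question complexity (0..1)."""
    clamped = min(max(question_complexity, 0.0), 1.0)
    return int(200 + 200 * clamped)

def on_caller_interrupt(tts_player) -> None:
    """Stop speaking the moment the caller barges in, then listen."""
    tts_player.stop()          # cut playback mid-sentence, no 'please wait'
    tts_player.flush_queue()   # drop any queued audio so nothing resumes later
```

A simple lookup, "how complex was the question?", buys a surprising amount of perceived humanity: instant answers to hard questions are a robot tell.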

3. Voice quality

Text-to-speech has gotten remarkably good. But "good" TTS still sounds like a news anchor — polished, consistent, emotionally flat.

We fine-tuned voice profiles for warmth. The AI sounds like a friendly, competent person — not a professional announcer. Slight pitch variations. Emphasis on key words. Occasionally speeding up or slowing down based on content.

The single biggest improvement in our human-likeness score came from adding micro-pauses before important information. Instead of "Your reservation is at 7 PM," the AI says "Your reservation is at... 7 PM." That tiny pause makes the information feel deliberate, not scripted.
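One way to implement that micro-pause is to insert an SSML break before key details in the response text. In this sketch a regex stands in for entity detection (the real pipeline would get entities from the NLU layer), and the 250ms value is illustrative.

```python
import re

# Match times like "7 PM" or "7:30 pm" as a stand-in for real entity detection.
TIME_PATTERN = re.compile(r"\b(\d{1,2}(:\d{2})?\s?(AM|PM))\b", re.IGNORECASE)

def add_micro_pauses(text: str, pause_ms: int = 250) -> str:
    """Insert a short SSML break before time-like tokens so they land deliberately."""
    return TIME_PATTERN.sub(rf'<break time="{pause_ms}ms"/> \1', text)

print(add_micro_pauses("Your reservation is at 7 PM."))
# Your reservation is at <break time="250ms"/> 7 PM.
```

The same trick applies to confirmation numbers, prices, and names: pause before anything the caller is likely to write down.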

Multi-agent workflows

A single AI personality isn't enough for complex calls. Real interactions require different "skills" — and the caller shouldn't notice the switch. Learn more about our AI & Voice Agent solutions.

How agent handoffs work

Each specialized agent maintains its own context window and tool access:

  • Greeting agent — Identifies the caller's intent within the first 10 seconds
  • Booking agent — Access to the reservation system, calendar, and availability
  • Information agent — Access to the knowledge base (menu, hours, location, policies)
  • Escalation agent — Recognizes frustration and routes to a human when needed

The handoff happens mid-conversation with no audible gap. The new agent receives the full conversation transcript and continues seamlessly.
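The handoff shape can be sketched as a shared call object whose transcript survives the agent switch. The class and field names here are assumptions mirroring the description above, not our actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Agent:
    name: str
    tools: list  # each agent has its own tool access

@dataclass
class Call:
    transcript: list = field(default_factory=list)
    active_agent: Optional[Agent] = None

def handoff(call: Call, new_agent: Agent) -> None:
    # The incoming agent inherits the complete conversation so far;
    # nothing is re-asked and there is no audible gap.
    call.active_agent = new_agent

greeting = Agent("greeting", tools=[])
booking = Agent("booking", tools=["reservations", "calendar", "availability"])

call = Call(active_agent=greeting)
call.transcript.append(("caller", "I'd like a table for two tonight."))
handoff(call, booking)  # intent identified -> switch to the booking agent
```

The key design choice is that the transcript belongs to the call, not the agent, so a switch never loses context.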

Call volume by type:

  • Booking: 42% of calls
  • Information: 31% of calls
  • Modification: 15% of calls
  • Escalation: 8% of calls
  • Other: 4% of calls

The escalation decision

Knowing when to transfer to a human is as important as handling the call. We built a frustration detector that monitors:

  • Repeated questions (the caller isn't getting what they need)
  • Rising voice volume or speed
  • Explicit requests ("Let me talk to a person")
  • Complex multi-step requests the AI isn't confident about

When the score crosses a threshold, the AI says: "Let me connect you with someone who can help with this directly." No arguing. No "Are you sure?" Just a clean handoff.
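A minimal sketch of that scoring, weighting the four signals listed above. The weights and the threshold here are made up for illustration; the production values are tuned from call outcomes.

```python
# Illustrative signal weights; an explicit request to speak to a person
# weighs the most, and any two signals together should cross the line.
WEIGHTS = {
    "repeated_question": 0.3,         # caller isn't getting what they need
    "rising_volume_or_speed": 0.2,    # acoustic frustration cues
    "explicit_request": 0.5,          # "Let me talk to a person"
    "low_confidence_request": 0.3,    # complex multi-step ask, AI unsure
}

def frustration_score(signals: dict) -> float:
    """Sum the weights of every signal currently firing."""
    return sum(WEIGHTS[k] for k, v in signals.items() if v)

def should_escalate(signals: dict, threshold: float = 0.6) -> bool:
    return frustration_score(signals) >= threshold
```

A threshold on a weighted sum is deliberately dumb: it is auditable, and tuning it is a one-number change when the escalation rate drifts.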

What we measure

Every call generates structured data:

  • Transcription confidence — How sure was the speech-to-text model?
  • Intent classification accuracy — Did we correctly understand what the caller wanted?
  • Response latency — End-to-end pipeline time for each exchange
  • Resolution rate — Did the AI fully handle the call without escalation?
  • Caller satisfaction signals — Tone analysis, explicit feedback, callback rate
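The per-call record implied by that list can be sketched as a dataclass. Field names here are assumptions mirroring the bullets above, not our actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class CallMetrics:
    transcription_confidence: float  # STT model confidence, 0..1
    intent_correct: bool             # did intent classification match the outcome?
    avg_latency_ms: int              # end-to-end pipeline time per exchange
    resolved: bool                   # fully handled without escalation
    satisfaction: float              # tone analysis + explicit feedback, 1..5

record = CallMetrics(0.97, True, 390, True, 4.6)
print(asdict(record))  # ready to ship to the analytics pipeline as a dict
```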
The headline numbers:

  • 92% resolution rate
  • 390ms average latency
  • 8% escalation rate
  • 4.6/5 caller satisfaction

The 73% number in context

The blind test protocol was simple: 200 callers, randomly assigned to AI or human. Post-call survey: "Were you speaking to a person or a computer?"

For the human agents, 91% correctly identified them as human. For the AI, only 27% correctly identified it as AI. The remaining 73% either thought it was human (58%) or were unsure (15%).

The callers who identified the AI correctly cited "too consistent" as the reason — the AI never stumbled, never coughed, never asked a coworker a side question. Ironically, being too perfect was the giveaway.

We're working on that.


Building voice AI that passes for human isn't about better models — it's about understanding how humans actually talk. If you're exploring voice AI for your business, we've shipped this at scale.
