AI Voice Models in 2026: What GPT-Realtime, Gemini Live, and Nova Sonic Actually Cost to Run

Voice model pricing, limits, and session behavior change quickly. The pricing and feature details below were verified against official provider docs and pricing sources on April 6, 2026 UTC and should be rechecked before production rollout.

Most teams still underestimate voice AI cost because they compare vendors as if this were just another chatbot pricing page. It is not. Production voice systems bill for listening time, speech output, tool calls, conversation history, interruption handling, and often a second hidden text or reasoning model behind the live session.

That is why list price alone rarely tells you what a voice agent will cost to run. The useful question is how GPT-Realtime, Gemini Live, and Nova Sonic meter speech versus text, how their realtime architecture affects latency and barge-in behavior, and when the business case is strong enough to justify voice at all.

Key takeaways

  • Voice AI economics are usually driven more by audio output, interruption waste, and backend routing than by a single headline model price.
  • OpenAI’s current gpt-realtime-1.5 is the premium-priced option in this group; Gemini 2.5 Flash Native Audio is materially cheaper; Amazon Nova Sonic is the cheapest on verified list pricing in US East.
  • Speech tokens and text tokens are priced separately for all three providers, which means transcripts, tool calls, and memory can become their own line item.
  • The AI Models app is most useful here for the non-voice layer: comparing the text and reasoning models you pair behind the live voice system and tracking adjacent price changes or deprecations.

Current pricing snapshot

OpenAI gpt-realtime-1.5
  • Realtime architecture: Realtime API over WebRTC, WebSocket, or SIP.
  • Verified list pricing (per 1M tokens): text input $4.00, text output $16.00, audio input $32.00, audio output $64.00.
  • What the bill is really counting: One realtime model handles voice interaction directly, but audio tokens are priced at a steep premium versus text. Good fit when voice quality, tooling, and telephony/browser flexibility matter more than raw cost.

Google Gemini 2.5 Flash Native Audio (Live API preview)
  • Realtime architecture: Stateful Live API sessions over WebSockets with VAD.
  • Verified list pricing (per 1M tokens): text input $0.50, text output $2.00, audio or video input $3.00, audio output $12.00.
  • What the bill is really counting: Google’s native audio Live API is priced much closer to text than OpenAI’s realtime lane. It is compelling on price-performance, but it is still a preview product and should be treated that way operationally.

Amazon Nova Sonic on Bedrock
  • Realtime architecture: Bedrock bidirectional streaming API.
  • Verified list pricing (per 1M tokens, US East, N. Virginia): speech input $3.40, speech output $13.60, text input $0.06, text output $0.24.
  • What the bill is really counting: AWS separates speech and text billing clearly. Text pricing applies to things like transcription, tool calls, grounding, and conversation history, not just visible replies. Region-specific pricing matters.

That table is the starting point, not the full answer. AWS is cheapest on verified list pricing here, Gemini Live sits in the middle, and OpenAI is the premium option. But production cost depends on how much audio you stream, how much text you generate behind the scenes, and how often your system escalates into a second model.

Where voice AI cost actually comes from

  • Listening time: An always-on session can accumulate speech input tokens even when the user is hesitant, noisy, or silent. Disciplined teams use VAD, idle cutoffs, and clear session boundaries instead of leaving sessions open by default.
  • Response length: Audio output is often the most expensive part of a successful turn, especially on premium realtime models. Optimize for short spoken answers first, then expand only when the user asks for more detail.
  • Interruptions and barge-in: Natural interruption handling improves UX, but partial generations that get canceled still consume time, tokens, or both. Tune turn-taking carefully instead of assuming more aggressive interruption handling is always cheaper.
  • Tool calls and retrieval: Voice agents that look up orders, schedules, CRM records, or internal documents create extra hidden text-token spend. Track tool-heavy flows separately from pure conversation flows.
  • Conversation history: Realtime sessions can keep state, but long histories increase text-token usage and can raise latency. Summarize or trim history instead of replaying every turn forever.
  • Fallback reasoning models: Many voice systems route hard turns into a second model for policy, planning, or post-call summarization. Measure voice-front-end cost and backend escalation cost separately so routing economics stay visible.
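The "Conversation history" discipline above can be made concrete with a small sketch. The function and the summary stub below are hypothetical: in production the stub would be a real summarization call to whatever backend model you run, and the right number of verbatim turns to keep depends on your latency and token budgets.

```python
# Illustrative history-trimming helper: keep the most recent turns verbatim
# and collapse older turns into a stub summary, trading recall for lower
# text-token spend. Names and the summary format are hypothetical.

def trim_history(turns: list[str], keep_last: int = 6) -> list[str]:
    """Return a shortened history: one summary line plus the last N turns."""
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    # In production this stub would be a real summarization call.
    summary = f"[summary of {len(older)} earlier turns]"
    return [summary] + recent

history = [f"turn {i}" for i in range(20)]
trimmed = trim_history(history, keep_last=6)
print(len(trimmed))  # 7: one summary line plus six recent turns
```

The point of the sketch is the shape of the tradeoff, not the implementation: every turn you replay verbatim is text-token spend on all three providers.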

The practical formula is simple: live voice cost equals speech input plus speech output plus hidden text work plus any backend model you invoke for the difficult parts. That is why a team can choose a cheap voice model and still end up with an expensive system.
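The formula above can be written down directly. In this sketch, the rates for Nova Sonic are the verified US East list prices from the table; the token volumes, the blended text rate, and the backend-model rate are illustrative assumptions, not measurements.

```python
# Per-session cost sketch for the formula above:
# total = speech in + speech out + hidden text work + backend escalation.
# Token counts and blended rates are illustrative inputs, not provider data.

def session_cost(
    speech_in_tokens: int,
    speech_out_tokens: int,
    text_tokens: int,        # transcripts, tool calls, history, grounding
    backend_tokens: int,     # turns escalated to a second model
    speech_in_rate: float,   # $ per 1M speech input tokens
    speech_out_rate: float,  # $ per 1M speech output tokens
    text_rate: float,        # blended $ per 1M text tokens
    backend_rate: float,     # blended $ per 1M backend-model tokens
) -> float:
    per_m = 1_000_000
    return (
        speech_in_tokens * speech_in_rate / per_m
        + speech_out_tokens * speech_out_rate / per_m
        + text_tokens * text_rate / per_m
        + backend_tokens * backend_rate / per_m
    )

# Example: Nova Sonic US East list rates, with made-up token volumes
# for a single session and a hypothetical fallback reasoning model.
cost = session_cost(
    speech_in_tokens=40_000,
    speech_out_tokens=20_000,
    text_tokens=15_000,
    backend_tokens=8_000,
    speech_in_rate=3.40,
    speech_out_rate=13.60,
    text_rate=0.15,    # between Nova Sonic text input and output, illustrative
    backend_rate=2.00, # hypothetical second-model blended rate
)
print(f"${cost:.4f}")
```

Note how the two speech terms dominate this particular example while the backend term is small: whether that holds for your system depends entirely on how often hard turns escalate, which is exactly why the terms should be measured separately.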

Text pricing and audio pricing are different budgets

One of the biggest mistakes buyers make is assuming that a voice model is basically a text model with a microphone attached. The verified pricing says otherwise. OpenAI’s current realtime model charges $4 per 1M text input tokens but $32 per 1M audio input tokens, and $16 per 1M text output tokens but $64 per 1M audio output tokens: an 8x premium on input and a 4x premium on output for speaking instead of typing.

Google’s current Live API native audio pricing is much tighter, which is part of why Gemini Live looks commercially attractive for high-volume assistants. AWS Nova Sonic is even more aggressive on list price in the Bedrock public price file, but AWS also makes an important architectural point on its pricing page: text-token pricing applies to things like transcription, tool use, grounding, and conversation history. In other words, your voice bill can quietly become a text bill too.

That matters because production voice agents are rarely just audio in and audio out. They usually perform structured backend work between turns. If you only model the speech line items, you will under-budget the system.
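One way to see how differently the three providers price that speech-versus-text gap is to compute the audio-to-text premium directly from the verified list prices in the table above. The rates below are the table's values; the ratios are simple derived arithmetic, not provider claims.

```python
# Derived audio-vs-text premium per provider, from the verified list prices
# in the pricing snapshot. Keys are shorthand labels, not official model IDs.

rates = {
    # (text_in, text_out, audio_in, audio_out), $ per 1M tokens
    "gpt-realtime-1.5": (4.00, 16.00, 32.00, 64.00),
    "gemini-2.5-flash-native-audio": (0.50, 2.00, 3.00, 12.00),
    "nova-sonic-us-east": (0.06, 0.24, 3.40, 13.60),
}

for name, (ti, to, ai, ao) in rates.items():
    print(f"{name}: audio input {ai / ti:.0f}x text, "
          f"audio output {ao / to:.0f}x text")
```

The striking result is Nova Sonic: its speech rates are roughly 57x its text rates, not because speech is expensive there but because its text pricing is so low. That is exactly why AWS's "text billing applies to transcription, tool use, grounding, and history" point cuts the other way for the other two providers, where hidden text work is billed at much higher rates.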

Latency and interruption handling are product decisions, not just model features

Low latency is not only a user experience metric. It changes how long sessions stay open, how often users talk over the assistant, and how much wasted generation you pay for. That makes interruption design an economic decision.

OpenAI’s realtime stack is operationally flexible because the current model page supports WebRTC, WebSocket, and SIP endpoints. That is useful for browser-native apps, call flows, and telephony bridges. The tradeoff is cost discipline. On a premium audio-output model, long voice responses and repeated interruptions become expensive quickly.

Google’s Live API capabilities are more explicit about barge-in behavior. Its Voice Activity Detection lets users interrupt the model at any time, and when VAD detects an interruption, the current generation is canceled and discarded. That is good for natural conversation, but it also means the cheapest implementation is not always the most natural one. If the assistant starts speaking too early or too long, you create interruption waste.

AWS frames Nova Sonic similarly but from an enterprise voice-agent angle. Its official documentation emphasizes low latency, Bedrock bidirectional streaming, and natural handling of pauses, hesitations, and interruptions while keeping conversational context. That makes Nova Sonic particularly relevant to call automation and task-oriented voice agents where turn efficiency matters as much as raw model quality.

The commercial rule is straightforward: if your users interrupt often, shorter first responses usually improve both perceived quality and cost. If your model speaks in long paragraphs, you will pay for verbosity and for the tokens generated right before the user cuts it off.
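That rule can be sanity-checked with a back-of-envelope model of interruption waste. All inputs below are illustrative assumptions: the planned reply lengths, the interruption probability, and the average fraction spoken before the cutoff are placeholders you would replace with your own telemetry; the audio-output rate is OpenAI's verified list price from the table above.

```python
# Expected audio-output cost of one turn when users sometimes barge in.
# Interrupted turns still pay for the tokens generated before the cutoff.
# All parameter values are illustrative assumptions, not measured data.

def expected_turn_cost(
    planned_out_tokens: int,  # tokens if the reply is heard in full
    interrupt_prob: float,    # chance the user cuts the reply off
    spoken_fraction: float,   # average fraction generated before the cut
    audio_out_rate: float,    # $ per 1M audio output tokens
) -> float:
    per_m = 1_000_000
    full = (1 - interrupt_prob) * planned_out_tokens
    wasted = interrupt_prob * planned_out_tokens * spoken_fraction
    return (full + wasted) * audio_out_rate / per_m

# A long first reply vs. a short one, same interruption habits,
# priced at the premium $64/1M audio-output rate.
long_reply = expected_turn_cost(1_200, 0.4, 0.5, 64.00)
short_reply = expected_turn_cost(300, 0.4, 0.5, 64.00)
print(long_reply, short_reply)
```

Under these made-up numbers the long reply costs about four times the short one per turn, and the gap widens as interruptions become more frequent, which is the economic version of "shorter first responses usually improve both perceived quality and cost."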

Why the voice model is rarely the whole stack

Most serious voice systems are really two systems. The first is the live conversation layer that listens, speaks, and manages turn-taking. The second is the hidden intelligence layer that does retrieval, structured decision-making, escalation, summarization, QA, or policy-heavy logic.

That second layer is where many teams should be using the AI Models app. The app does not replace a voice API. It helps you compare the text and reasoning models that sit behind the voice interface by price, context window, benchmarks, compatibility, freshness, and change history. For production voice systems, that is often more valuable than another superficial voice-model ranking because the backend routing choice can change total unit economics more than the live voice model itself.

This is also where the AI Models public endpoints become useful in practice. If your team is already monitoring /api/catalog, /api/changelog, and /api/benchmarks, you can keep the hidden model layer current without manually checking every provider whenever prices, capabilities, or deprecations move.
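A minimal consumer of such a feed might look like the sketch below. The endpoint path /api/changelog comes from the article; the JSON schema, field names, and event labels are invented for illustration, so adapt them to whatever the real API actually returns.

```python
# Hypothetical consumer of a /api/changelog-style feed. The schema below
# (model / event / date fields, event labels) is an assumption made up for
# this sketch, not the documented response format.

import json

sample = json.loads("""
[
  {"model": "gpt-realtime-1.5", "event": "price_change", "date": "2026-03-01"},
  {"model": "old-voice-model", "event": "deprecation", "date": "2026-02-10"},
  {"model": "nova-sonic", "event": "price_change", "date": "2026-01-15"}
]
""")

def events_since(entries: list[dict], date: str, kind: str) -> list[str]:
    """Names of models with a matching event on or after the given date."""
    return [e["model"] for e in entries
            if e["event"] == kind and e["date"] >= date]

print(events_since(sample, "2026-02-01", "price_change"))  # ['gpt-realtime-1.5']
```

The design point is the filter, not the fetch: a nightly job that flags price changes and deprecations in the hidden model layer is cheap insurance compared with discovering a deprecation in production.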

When voice ROI is worth it

  • Customer support triage: Voice usually pays off when the agent can deflect repetitive calls, gather structured details, and route the right cases fast. It usually does not when the conversation is mostly empathy-heavy edge cases that still end up with a human every time.
  • Scheduling and booking: Pays off when tool integration is strong and the task flow is narrow enough to automate reliably. Does not when downstream systems are fragmented and every booking still requires manual cleanup.
  • Status checks and account actions: Pays off when users need quick answers while driving, walking, or calling in. Does not when the same task is already faster in a clean self-serve web flow.
  • Premium concierge or sales assistant: Pays off when higher conversion, larger average order value, or lower handle time clearly justifies the model spend. Does not when voice is added because it sounds modern rather than because it changes revenue or labor economics.

Voice ROI is strongest when speed and convenience are part of the product value, not just an interface preference. If users are happier clicking than talking, voice can become an expensive layer that adds latency, engineering complexity, and monitoring burden without raising revenue or reducing labor enough to matter.

How to evaluate these models more honestly

A better evaluation framework is to compare each provider on four dimensions at once: list price, interruption behavior, tool or telephony integration fit, and the cost of the hidden model layer behind the voice session. That is a much better proxy for production economics than asking which vendor has the lowest speech-token rate.

  • Use OpenAI when premium realtime quality, telephony flexibility, and tooling are worth paying for.
  • Use Gemini Live when you want strong price-performance and are comfortable with a preview-stage native audio stack.
  • Use Nova Sonic when Bedrock fits your architecture and aggressive speech-token pricing materially improves the business case.
  • Use AI Models to choose and monitor the text or reasoning models that power the hard parts behind the live voice layer.

FAQ

Which of these voice models is cheapest right now?

On verified list pricing as of April 6, 2026, Amazon Nova Sonic is the cheapest of the three in US East (N. Virginia), Gemini 2.5 Flash Native Audio sits in the middle, and OpenAI gpt-realtime-1.5 is the premium-priced option. That is a list-price statement, not a guarantee of lowest total task cost.

What is the biggest production voice AI cost mistake?

Letting the assistant talk too much and hiding backend escalation cost. Audio output and premium fallback routing are usually where margins disappear first.

Should one voice model handle everything?

Usually no. Many production stacks use one live voice model for the interaction layer and a separate text or reasoning model for tool planning, policy-sensitive decisions, summaries, or QA.

How does AI Models help if it is not itself a voice API?

It helps with the part of the stack that teams often ignore in budgeting: the non-voice models behind the voice agent. You can compare those models by price, context, benchmarks, compatibility, and recency instead of treating the voice layer as the whole system.

The most useful way to think about voice AI in 2026 is not as a demo problem, but as a routing and margin problem. Once you separate speech cost, text cost, interruption waste, and backend model cost, GPT-Realtime, Gemini Live, and Nova Sonic stop looking like abstract brand choices and start looking like operating decisions.