Voice is the most natural human interface, and multimodal AI is making it programmable. The convergence of high-quality speech recognition (Whisper, Deepgram), expressive text-to-speech (ElevenLabs, Cartesia), and real-time orchestration frameworks (LiveKit, Pipecat) has made it possible to build voice-first conversational AI that feels responsive and natural. Combined with vision capabilities, these systems can see what users see and respond in real time. This section covers the complete voice and multimodal stack, from individual components to production-ready pipelines.
1. Speech-to-Text (STT)
Speech-to-text converts spoken audio into text that the LLM can process. The quality of transcription directly impacts the quality of the conversational experience, because every transcription error propagates through the entire pipeline. Modern STT systems offer near-human accuracy for clear speech, but performance degrades with background noise, accents, domain-specific terminology, and overlapping speakers.
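Transcription quality is usually quantified as word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. A minimal, self-contained implementation for illustration (libraries such as `jiwer` provide a production-grade version):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("please check my order status",
                      "please check my order"))  # 0.2 (one deletion)
```

A WER of 0.2 means one word in five is wrong, which is often enough to derail a downstream LLM turn; measuring WER on your own domain audio is a better provider-selection signal than published benchmarks.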
STT Provider Comparison
| Provider | Model | Latency | Strengths | Best For |
|---|---|---|---|---|
| OpenAI Whisper | whisper-1, whisper-large-v3 | Batch (seconds) | Multilingual, open-source, strong accuracy | Batch processing, self-hosted |
| Deepgram | Nova-2, Nova-3 | Streaming (~300ms) | Low latency, streaming, keyword boosting | Real-time voice AI, call centers |
| AssemblyAI | Universal-2 | Near real-time | Speaker diarization, sentiment, summarization | Meeting transcription, analytics |
| Google Cloud STT | Chirp 2 | Streaming (~200ms) | 100+ languages, medical/telephony models | Enterprise, multilingual |
| Groq (Whisper) | whisper-large-v3-turbo | Very fast batch | Extremely fast inference on Whisper | High-throughput batch transcription |
Using Whisper and Deepgram for Transcription
```python
from openai import OpenAI

client = OpenAI()

def transcribe_audio(audio_path: str, language: str | None = None) -> dict:
    """Transcribe audio using OpenAI's Whisper API."""
    # Only pass `language` if the caller supplied one (ISO 639-1, e.g. "en");
    # otherwise let Whisper auto-detect it.
    kwargs = {"language": language} if language else {}
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
            **kwargs,
        )
    return {
        "text": transcript.text,
        "language": transcript.language,
        "duration": transcript.duration,
        "segments": [
            {"text": seg.text, "start": seg.start, "end": seg.end}
            for seg in (transcript.segments or [])
        ],
    }

def transcribe_with_deepgram(audio_path: str) -> dict:
    """Transcribe audio using Deepgram's Nova-2 model."""
    from deepgram import DeepgramClient, PrerecordedOptions

    deepgram = DeepgramClient()  # Uses DEEPGRAM_API_KEY env var
    with open(audio_path, "rb") as audio_file:
        payload = {"buffer": audio_file.read()}
    options = PrerecordedOptions(
        model="nova-2",
        smart_format=True,   # Adds punctuation and formatting
        utterances=True,     # Detects speaker turns
        diarize=True,        # Speaker identification
        language="en",
    )
    response = deepgram.listen.rest.v("1").transcribe_file(payload, options)
    alternative = response.results.channels[0].alternatives[0]
    return {
        "transcript": alternative.transcript,
        "confidence": alternative.confidence,
        "words": [
            {
                "word": w.word,
                "start": w.start,
                "end": w.end,
                "confidence": w.confidence,
                "speaker": w.speaker,
            }
            for w in alternative.words
        ],
    }

# Example usage
result = transcribe_audio("user_query.wav")
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']:.1f}s")
```
For real-time voice AI, streaming transcription is essential. Batch transcription processes the entire audio file at once, introducing latency proportional to the audio length. Streaming transcription processes audio in chunks as it arrives, producing partial transcripts that update in real time. Deepgram and Google Cloud STT offer true streaming; Whisper is primarily batch-oriented, though Groq's accelerated Whisper inference narrows this gap significantly.
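The streaming pattern looks roughly like the sketch below. It assumes Deepgram's documented `/v1/listen` WebSocket endpoint, the third-party `websockets` package, and a `DEEPGRAM_API_KEY` in the environment; exact result fields and client keyword names vary by API and library version, so treat this as a shape rather than a drop-in client.

```python
import asyncio
import json
import os

DG_URL = ("wss://api.deepgram.com/v1/listen"
          "?model=nova-2&interim_results=true&encoding=linear16&sample_rate=16000")

def extract_transcript(message: str) -> tuple[str, bool]:
    """Pull the transcript text and finality flag out of a Deepgram result."""
    data = json.loads(message)
    alt = (data.get("channel", {}).get("alternatives") or [{}])[0]
    return alt.get("transcript", ""), data.get("is_final", False)

async def stream_audio(audio_chunks):
    """Send 16-bit PCM chunks and print partial/final transcripts as they arrive."""
    import websockets  # pip install websockets

    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            for chunk in audio_chunks:  # e.g. 20 ms frames from a microphone
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                text, is_final = extract_transcript(message)
                if text:
                    print(("FINAL:  " if is_final else "partial: ") + text)

        await asyncio.gather(sender(), receiver())
```

The key property is that partial (`interim`) transcripts arrive while the user is still speaking, so downstream components can begin work before the final transcript lands.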
2. Text-to-Speech (TTS)
Text-to-speech converts the LLM's text response into spoken audio. The quality bar for TTS has risen dramatically; modern systems produce speech that is nearly indistinguishable from human voice in controlled settings. The key differentiators are naturalness, emotional expressiveness, latency (time to first audio byte), and voice cloning capabilities.
TTS Provider Comparison
| Provider | Latency (TTFB) | Voice Quality | Key Features |
|---|---|---|---|
| ElevenLabs | ~300ms | Excellent | Voice cloning, emotional control, 32 languages |
| PlayHT | ~200ms | Very good | Ultra-low latency mode, voice cloning, streaming |
| Cartesia | ~100ms | Very good | Fastest TTFB, emotion/speed control, streaming |
| OpenAI TTS | ~400ms | Good | Simple API, 6 built-in voices, affordable |
| Azure Neural TTS | ~200ms | Very good | SSML support, 400+ voices, enterprise SLAs |
```python
from openai import OpenAI

client = OpenAI()

def text_to_speech_openai(text: str, voice: str = "nova",
                          output_path: str = "response.mp3") -> str:
    """Generate speech from text using OpenAI's TTS API."""
    response = client.audio.speech.create(
        model="tts-1-hd",  # Higher quality; use "tts-1" for lower latency
        voice=voice,       # alloy, echo, fable, onyx, nova, shimmer
        input=text,
        speed=1.0,         # 0.25 to 4.0
    )
    response.stream_to_file(output_path)
    return output_path

def text_to_speech_elevenlabs(
    text: str,
    voice_id: str = "21m00Tcm4TlvDq8ikWAM",  # Rachel
    output_path: str = "response.mp3",
) -> str:
    """Generate speech using ElevenLabs with streaming."""
    from elevenlabs import ElevenLabs

    eleven = ElevenLabs()  # Reads the API key from the environment
    audio_generator = eleven.text_to_speech.convert(
        voice_id=voice_id,
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="mp3_44100_128",
        voice_settings={
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": 0.3,
            "use_speaker_boost": True,
        },
    )
    # Write streaming audio to file as chunks arrive
    with open(output_path, "wb") as f:
        for chunk in audio_generator:
            f.write(chunk)
    return output_path

# Simple usage
output = text_to_speech_openai(
    "Hello! I'd be happy to help you check your order status. "
    "Could you provide me with your order number?",
    voice="nova",
)
print(f"Audio saved to: {output}")
```
In voice AI, time-to-first-byte (TTFB) matters more than total generation time. Users perceive a system as fast if audio starts playing quickly, even if the full response takes several seconds to generate. This is why streaming TTS (where audio chunks are sent as they are generated) is critical for real-time voice applications. A system with 100ms TTFB that streams audio progressively feels much faster than a system with 500ms TTFB that delivers the complete audio at once.
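A common way to exploit this is to chunk the LLM's token stream at sentence boundaries and hand each sentence to TTS as soon as it completes, so synthesis of the first sentence overlaps with generation of the rest. A minimal sketch of that buffering logic (the regex-based splitter is illustrative; production systems handle abbreviations, numbers, and locale-specific punctuation):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def chunk_for_tts(token_stream):
    """Accumulate streamed LLM tokens and yield complete sentences,
    so TTS can start synthesizing before the full response exists."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    if buffer.strip():  # Flush whatever remains at end of stream
        yield buffer.strip()

tokens = ["Sure", ", I can", " help. ", "What is", " your order", " number?"]
print(list(chunk_for_tts(tokens)))
# ['Sure, I can help.', 'What is your order number?']
```

With this pattern, perceived latency is bounded by the time to generate and synthesize the first sentence rather than the whole response.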
3. Real-Time Voice AI Pipelines
A real-time voice AI pipeline connects STT, LLM, and TTS into a seamless flow where the user speaks, the system processes their speech, generates a response, and speaks it back, all with minimal perceptible delay. The total round-trip latency (from when the user finishes speaking to when the first audio of the response plays) is the key performance metric. Users expect sub-second response times for conversational interactions.
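A useful exercise is to write the latency budget down stage by stage and see where the time goes. The numbers below are illustrative assumptions for a well-tuned pipeline, not measurements:

```python
# Illustrative round-trip latency budget (milliseconds).
# These figures are ballpark assumptions for a tuned streaming pipeline.
BUDGET_MS = {
    "vad_endpointing": 300,        # silence needed to decide the user stopped
    "stt_final_transcript": 150,   # streaming STT finalization
    "llm_first_token": 350,        # LLM time-to-first-token
    "tts_first_byte": 150,         # streaming TTS TTFB
    "network_and_playback": 100,   # transport plus audio buffering
}

total = sum(BUDGET_MS.values())
print(f"Estimated voice-to-voice latency: {total} ms")
for stage, ms in sorted(BUDGET_MS.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:<24}{ms:>5} ms  ({ms / total:.0%})")
```

Note that under these assumptions VAD endpointing and LLM time-to-first-token dominate; shaving TTS latency matters, but tuning the endpointing delay and streaming the LLM output usually buys more.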
Building a Voice Pipeline with Pipecat
```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def create_voice_bot():
    """Create a real-time voice AI bot using Pipecat."""
    # Transport layer (handles WebRTC audio)
    transport = DailyTransport(
        room_url="https://your-domain.daily.co/room-name",
        token="your-daily-token",
        bot_name="Assistant",
        params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Speech-to-text
    stt = DeepgramSTTService(api_key="your-deepgram-key")

    # Language model; the system prompt lives in the conversation context
    llm = OpenAILLMService(api_key="your-openai-key", model="gpt-4o")
    context = OpenAILLMContext(messages=[{
        "role": "system",
        "content": (
            "You are a helpful voice assistant. Keep responses "
            "concise (1-2 sentences) since this is a voice conversation. "
            "Be natural and conversational."
        ),
    }])
    context_aggregator = llm.create_context_aggregator(context)

    # Text-to-speech
    tts = CartesiaTTSService(
        api_key="your-cartesia-key",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model_id="sonic-english",
        sample_rate=16000,
    )

    # Build the pipeline: audio in -> STT -> LLM -> TTS -> audio out
    pipeline = Pipeline([
        transport.input(),               # Receive user audio
        stt,                             # Transcribe to text
        context_aggregator.user(),       # Add user turns to the context
        llm,                             # Generate response
        tts,                             # Synthesize speech
        transport.output(),              # Send audio to user
        context_aggregator.assistant(),  # Record assistant turns
    ])

    task = PipelineTask(pipeline)
    await PipelineRunner().run(task)

# Run the voice bot
asyncio.run(create_voice_bot())
```
Voice interfaces impose constraints that text-based chat does not. Responses must be concise (users cannot "scan" audio the way they scan text). Latency above 1.5 seconds feels unresponsive. The system needs voice activity detection (VAD) to know when the user has finished speaking. It must handle interruptions (the user speaking while the system is still talking). And the response text must be optimized for spoken delivery: avoid parenthetical asides, complex lists, URLs, or code snippets that work in text but sound terrible when spoken aloud.
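A lightweight normalization pass can strip the worst text-only artifacts before handing a response to TTS. The rules below are illustrative and would be tuned per application and per voice:

```python
import re

def make_speakable(text: str) -> str:
    """Rewrite a text-optimized response for spoken delivery
    (illustrative rules; real systems tune these per TTS voice)."""
    # Drop markdown emphasis, inline-code, and heading markers
    text = re.sub(r"[*_`#]+", "", text)
    # Replace URLs with a spoken placeholder
    text = re.sub(r"https?://\S+", "the link I'm sending you", text)
    # Remove parenthetical asides, which sound awkward aloud
    text = re.sub(r"\s*\([^)]*\)", "", text)
    # Turn bullet markers into plain sentence flow
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.MULTILINE)
    # Collapse whitespace left behind by the edits
    return re.sub(r"\s+", " ", text).strip()

print(make_speakable(
    "Check the **status page** (updated hourly) at https://status.example.com"
))
```

The same idea extends to expanding numerals, dates, and acronyms; many TTS providers also accept SSML for finer pronunciation control.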
4. Voice-Specific Orchestration Challenges
Beyond the basic STT/LLM/TTS pipeline, real-time voice AI requires solving several orchestration challenges that do not arise in text-based chat.
Interruption Handling
Users may interrupt the system while it is speaking. The system needs to detect the interruption, stop the current audio playback, process the new input, and respond without losing context. This requires coordination between the STT, TTS, and transport layers.
```python
class InterruptionHandler:
    """Manages user interruptions during system speech."""

    def __init__(self):
        self.is_speaking = False
        self.current_utterance: str = ""
        self.spoken_so_far: str = ""

    async def on_speech_started(self, text: str):
        """Called when the system starts speaking."""
        self.is_speaking = True
        self.current_utterance = text
        self.spoken_so_far = ""

    async def on_speech_chunk_played(self, chunk_text: str):
        """Track how much of the response has been spoken."""
        self.spoken_so_far += chunk_text

    async def on_user_interruption(self, user_audio_detected: bool):
        """Handle the user interrupting system speech."""
        if not self.is_speaking or not user_audio_detected:
            return None
        self.is_speaking = False
        # Calculate what was and was not heard
        unspoken = self.current_utterance[len(self.spoken_so_far):]
        return {
            "action": "interrupted",
            "spoken_portion": self.spoken_so_far.strip(),
            "unspoken_portion": unspoken.strip(),
            "context_note": (
                f"System was saying: '{self.spoken_so_far.strip()}' "
                f"but was interrupted. The rest ('{unspoken.strip()[:50]}...') "
                "was not heard by the user."
            ),
        }

    async def on_speech_completed(self):
        """Called when the system finishes speaking without interruption."""
        self.is_speaking = False
        self.spoken_so_far = ""
        self.current_utterance = ""
```
5. Vision in Conversations
Multimodal LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini) can process images alongside text, enabling conversational AI systems that can see. Users can share photos, screenshots, documents, or live camera feeds, and the system can discuss what it sees. This capability transforms many use cases: visual troubleshooting ("what is wrong with this error message?"), product identification ("what plant is this?"), accessibility assistance, and interactive tutoring with visual materials.
```python
import base64

from openai import OpenAI

client = OpenAI()

def encode_image_to_base64(image_path: str) -> str:
    """Read an image file and encode it as base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

class MultimodalConversation:
    """Conversational AI with vision capabilities."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history: list[dict] = []

    def send_text(self, user_message: str) -> str:
        """Send a text-only message."""
        self.history.append({"role": "user", "content": user_message})
        return self._get_response()

    def send_image(self, image_path: str,
                   question: str = "What do you see?") -> str:
        """Send an image with an optional question."""
        b64_image = encode_image_to_base64(image_path)
        # Determine MIME type from the file extension
        ext = image_path.rsplit(".", 1)[-1].lower()
        mime_map = {"jpg": "jpeg", "jpeg": "jpeg",
                    "png": "png", "gif": "gif", "webp": "webp"}
        mime_type = f"image/{mime_map.get(ext, 'jpeg')}"
        self.history.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{b64_image}",
                        "detail": "high",
                    },
                },
            ],
        })
        return self._get_response()

    def send_image_url(self, url: str, question: str) -> str:
        """Send an image via URL with a question."""
        self.history.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": url, "detail": "auto"},
                },
            ],
        })
        return self._get_response()

    def _get_response(self) -> str:
        """Get a response from the multimodal LLM."""
        messages = [
            {"role": "system", "content": self.system_prompt},
            *self.history,
        ]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=1000,
        )
        assistant_msg = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_msg})
        return assistant_msg

# Example: Visual troubleshooting assistant
troubleshooter = MultimodalConversation(
    system_prompt=(
        "You are a technical support assistant that can analyze "
        "screenshots and photos to help users troubleshoot issues. "
        "When shown an image, describe what you see and provide "
        "specific, actionable solutions."
    )
)

# User shares a screenshot of an error
response = troubleshooter.send_image(
    "error_screenshot.png",
    "I keep getting this error when I try to start the application. "
    "What should I do?"
)
print(response)
```
The voice AI landscape is evolving rapidly. OpenAI's GPT-4o natively processes audio without a separate STT step, significantly reducing latency and enabling the model to understand tone, emotion, and non-verbal cues. Google's Gemini 2.0 offers similar native multimodal processing. These "speech-native" models are beginning to replace the traditional STT/LLM/TTS pipeline with a single model that hears, thinks, and speaks. However, the component-based pipeline remains important for customization, cost control, and vendor flexibility.
6. Comparing Voice AI Orchestration Frameworks
| Framework | Type | Key Strengths | Best For |
|---|---|---|---|
| LiveKit Agents | Open-source framework | WebRTC transport, plugin system, self-hostable | Custom voice bots, self-hosted deployments |
| Pipecat | Open-source framework | Composable pipelines, multi-provider, Python-native | Rapid prototyping, flexible architectures |
| Vapi | Managed platform | Turnkey API, phone integration, low-code setup | Phone bots, rapid deployment |
| Retell AI | Managed platform | Telephony focus, call analytics, enterprise features | Call center automation, enterprise voice |
| Custom WebSocket | DIY | Full control, no vendor lock-in | Specialized requirements, existing infrastructure |
Section 20.5 Quiz
Key Takeaways
- STT accuracy is the foundation: Every transcription error propagates through the entire pipeline. Choose STT providers based on your specific requirements: Deepgram for low-latency streaming, Whisper for multilingual batch processing, AssemblyAI for analytics-rich transcription.
- TTFB trumps total latency: In voice AI, time-to-first-byte is the most important latency metric. Streaming TTS that starts playing quickly creates a perception of responsiveness even if total generation takes longer.
- Orchestration is the hard part: The individual components (STT, LLM, TTS) are well-solved problems. The engineering challenge is orchestrating them with proper VAD, turn-taking, interruption handling, and transport. Frameworks like Pipecat and LiveKit Agents abstract much of this complexity.
- Voice requires different response design: Text responses do not translate well to voice. Keep responses short, avoid visual formatting, optimize for spoken delivery, and use conversational phrasing. Add filler phrases to indicate processing.
- Multimodal is more than voice: Vision-capable models enable powerful new interaction patterns where users share images, screenshots, or camera feeds. The conversation history must track both text and image references to maintain context across multimodal interactions.