Voice is the most natural human interface, and multimodal AI is making it programmable. The convergence of high-quality speech recognition (Whisper, Deepgram), expressive text-to-speech (ElevenLabs, Cartesia), and real-time orchestration frameworks (LiveKit, Pipecat) has made it possible to build voice-first conversational AI that feels responsive and natural. Combined with vision capabilities, these systems can see what users see and respond in real time. This section covers the complete voice and multimodal stack, from individual components to production-ready pipelines.
1. Speech-to-Text (STT)
Speech-to-text converts spoken audio into text that the LLM can process. The quality of transcription directly impacts the quality of the conversational experience, because every transcription error propagates through the entire pipeline. Modern STT systems offer near-human accuracy for clear speech, but performance degrades with background noise, accents, domain-specific terminology, and overlapping speakers.
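Transcription quality is usually quantified as word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. A minimal, self-contained implementation for illustration (libraries such as `jiwer` provide a production-grade version):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("please check my order status",
                      "please check my order"))  # 0.2 (one deletion)
```

A WER of 0.2 means one word in five is wrong, which is often enough to derail a downstream LLM turn; measuring WER on your own domain audio is a better provider-selection signal than published benchmarks.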
STT Provider Comparison
| Provider | Model | Latency | Strengths | Best For |
|---|---|---|---|---|
| OpenAI Whisper | whisper-1, whisper-large-v3 | Batch (seconds) | Multilingual, open-source, strong accuracy | Batch processing, self-hosted |
| Deepgram | Nova-2, Nova-3 | Streaming (~300ms) | Low latency, streaming, keyword boosting | Real-time voice AI, call centers |
| AssemblyAI | Universal-2 | Near real-time | Speaker diarization, sentiment, summarization | Meeting transcription, analytics |
| Google Cloud STT | Chirp 2 | Streaming (~200ms) | 100+ languages, medical/telephony models | Enterprise, multilingual |
| Groq (Whisper) | whisper-large-v3-turbo | Very fast batch | Extremely fast inference on Whisper | High-throughput batch transcription |
Using Whisper and Deepgram for Transcription
```python
from openai import OpenAI

client = OpenAI()

def transcribe_audio(audio_path: str, language: str | None = None) -> dict:
    """Transcribe audio using OpenAI's Whisper API."""
    # Only pass `language` if the caller supplied one (ISO 639-1, e.g. "en");
    # otherwise let Whisper auto-detect it.
    kwargs = {"language": language} if language else {}
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
            **kwargs,
        )
    return {
        "text": transcript.text,
        "language": transcript.language,
        "duration": transcript.duration,
        "segments": [
            {"text": seg.text, "start": seg.start, "end": seg.end}
            for seg in (transcript.segments or [])
        ],
    }

def transcribe_with_deepgram(audio_path: str) -> dict:
    """Transcribe audio using Deepgram's Nova-2 model."""
    from deepgram import DeepgramClient, PrerecordedOptions

    deepgram = DeepgramClient()  # Uses DEEPGRAM_API_KEY env var
    with open(audio_path, "rb") as audio_file:
        payload = {"buffer": audio_file.read()}
    options = PrerecordedOptions(
        model="nova-2",
        smart_format=True,   # Adds punctuation and formatting
        utterances=True,     # Detects speaker turns
        diarize=True,        # Speaker identification
        language="en",
    )
    response = deepgram.listen.rest.v("1").transcribe_file(payload, options)
    alternative = response.results.channels[0].alternatives[0]
    return {
        "transcript": alternative.transcript,
        "confidence": alternative.confidence,
        "words": [
            {
                "word": w.word,
                "start": w.start,
                "end": w.end,
                "confidence": w.confidence,
                "speaker": w.speaker,
            }
            for w in alternative.words
        ],
    }

# Example usage
result = transcribe_audio("user_query.wav")
print(f"Transcription: {result['text']}")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']:.1f}s")
```
For real-time voice AI, streaming transcription is essential. Batch transcription processes the entire audio file at once, introducing latency proportional to the audio length. Streaming transcription processes audio in chunks as it arrives, producing partial transcripts that update in real time. Deepgram and Google Cloud STT offer true streaming; Whisper is primarily batch-oriented, though Groq's accelerated Whisper inference narrows this gap significantly.
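The streaming pattern looks roughly like the sketch below. It assumes Deepgram's documented `/v1/listen` WebSocket endpoint, the third-party `websockets` package, and a `DEEPGRAM_API_KEY` in the environment; exact result fields and client keyword names vary by API and library version, so treat this as a shape rather than a drop-in client.

```python
import asyncio
import json
import os

DG_URL = ("wss://api.deepgram.com/v1/listen"
          "?model=nova-2&interim_results=true&encoding=linear16&sample_rate=16000")

def extract_transcript(message: str) -> tuple[str, bool]:
    """Pull the transcript text and finality flag out of a Deepgram result."""
    data = json.loads(message)
    alt = (data.get("channel", {}).get("alternatives") or [{}])[0]
    return alt.get("transcript", ""), data.get("is_final", False)

async def stream_audio(audio_chunks):
    """Send 16-bit PCM chunks and print partial/final transcripts as they arrive."""
    import websockets  # pip install websockets

    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            for chunk in audio_chunks:  # e.g. 20 ms frames from a microphone
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                text, is_final = extract_transcript(message)
                if text:
                    print(("FINAL:  " if is_final else "partial: ") + text)

        await asyncio.gather(sender(), receiver())
```

The key property is that partial (`interim`) transcripts arrive while the user is still speaking, so downstream components can begin work before the final transcript lands.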
2. Text-to-Speech (TTS)
Text-to-speech converts the LLM's text response into spoken audio. The quality bar for TTS has risen dramatically; modern systems produce speech that is nearly indistinguishable from human voice in controlled settings. The key differentiators are naturalness, emotional expressiveness, latency (time to first audio byte), and voice cloning capabilities.
TTS Provider Comparison
| Provider | Latency (TTFB) | Voice Quality | Key Features |
|---|---|---|---|
| ElevenLabs | ~300ms | Excellent | Voice cloning, emotional control, 32 languages |
| PlayHT | ~200ms | Very good | Ultra-low latency mode, voice cloning, streaming |
| Cartesia | ~100ms | Very good | Fastest TTFB, emotion/speed control, streaming |
| OpenAI TTS | ~400ms | Good | Simple API, 6 built-in voices, affordable |
| Azure Neural TTS | ~200ms | Very good | SSML support, 400+ voices, enterprise SLAs |
```python
from openai import OpenAI

client = OpenAI()

def text_to_speech_openai(text: str, voice: str = "nova",
                          output_path: str = "response.mp3") -> str:
    """Generate speech from text using OpenAI's TTS API."""
    response = client.audio.speech.create(
        model="tts-1-hd",  # Higher quality; use "tts-1" for lower latency
        voice=voice,       # alloy, echo, fable, onyx, nova, shimmer
        input=text,
        speed=1.0,         # 0.25 to 4.0
    )
    response.stream_to_file(output_path)
    return output_path

def text_to_speech_elevenlabs(
    text: str,
    voice_id: str = "21m00Tcm4TlvDq8ikWAM",  # Rachel
    output_path: str = "response.mp3",
) -> str:
    """Generate speech using ElevenLabs with streaming."""
    from elevenlabs import ElevenLabs

    eleven = ElevenLabs()  # Reads the API key from the environment
    audio_generator = eleven.text_to_speech.convert(
        voice_id=voice_id,
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="mp3_44100_128",
        voice_settings={
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": 0.3,
            "use_speaker_boost": True,
        },
    )
    # Write streaming audio to file as chunks arrive
    with open(output_path, "wb") as f:
        for chunk in audio_generator:
            f.write(chunk)
    return output_path

# Simple usage
output = text_to_speech_openai(
    "Hello! I'd be happy to help you check your order status. "
    "Could you provide me with your order number?",
    voice="nova",
)
print(f"Audio saved to: {output}")
```
In voice AI, time-to-first-byte (TTFB) matters more than total generation time. Users perceive a system as fast if audio starts playing quickly, even if the full response takes several seconds to generate. This is why streaming TTS (where audio chunks are sent as they are generated) is critical for real-time voice applications. A system with 100ms TTFB that streams audio progressively feels much faster than a system with 500ms TTFB that delivers the complete audio at once.
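A common way to exploit this is to chunk the LLM's token stream at sentence boundaries and hand each sentence to TTS as soon as it completes, so synthesis of the first sentence overlaps with generation of the rest. A minimal sketch of that buffering logic (the regex-based splitter is illustrative; production systems handle abbreviations, numbers, and locale-specific punctuation):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def chunk_for_tts(token_stream):
    """Accumulate streamed LLM tokens and yield complete sentences,
    so TTS can start synthesizing before the full response exists."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    if buffer.strip():  # Flush whatever remains at end of stream
        yield buffer.strip()

tokens = ["Sure", ", I can", " help. ", "What is", " your order", " number?"]
print(list(chunk_for_tts(tokens)))
# ['Sure, I can help.', 'What is your order number?']
```

With this pattern, perceived latency is bounded by the time to generate and synthesize the first sentence rather than the whole response.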
3. Real-Time Voice AI Pipelines
A real-time voice AI pipeline connects STT, LLM, and TTS into a seamless flow where the user speaks, the system processes their speech, generates a response, and speaks it back, all with minimal perceptible delay. The total round-trip latency (from when the user finishes speaking to when the first audio of the response plays) is the key performance metric. Users expect sub-second response times for conversational interactions.
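A useful exercise is to write the latency budget down stage by stage and see where the time goes. The numbers below are illustrative assumptions for a well-tuned pipeline, not measurements:

```python
# Illustrative round-trip latency budget (milliseconds).
# These figures are ballpark assumptions for a tuned streaming pipeline.
BUDGET_MS = {
    "vad_endpointing": 300,        # silence needed to decide the user stopped
    "stt_final_transcript": 150,   # streaming STT finalization
    "llm_first_token": 350,        # LLM time-to-first-token
    "tts_first_byte": 150,         # streaming TTS TTFB
    "network_and_playback": 100,   # transport plus audio buffering
}

total = sum(BUDGET_MS.values())
print(f"Estimated voice-to-voice latency: {total} ms")
for stage, ms in sorted(BUDGET_MS.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:<24}{ms:>5} ms  ({ms / total:.0%})")
```

Note that under these assumptions VAD endpointing and LLM time-to-first-token dominate; shaving TTS latency matters, but tuning the endpointing delay and streaming the LLM output usually buys more.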
Building a Voice Pipeline with Pipecat
```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def create_voice_bot():
    """Create a real-time voice AI bot using Pipecat."""
    # Transport layer (handles WebRTC audio)
    transport = DailyTransport(
        room_url="https://your-domain.daily.co/room-name",
        token="your-daily-token",
        bot_name="Assistant",
        params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    # Speech-to-text
    stt = DeepgramSTTService(api_key="your-deepgram-key")

    # Language model; the system prompt lives in the conversation context
    llm = OpenAILLMService(api_key="your-openai-key", model="gpt-4o")
    context = OpenAILLMContext(messages=[{
        "role": "system",
        "content": (
            "You are a helpful voice assistant. Keep responses "
            "concise (1-2 sentences) since this is a voice conversation. "
            "Be natural and conversational."
        ),
    }])
    context_aggregator = llm.create_context_aggregator(context)

    # Text-to-speech
    tts = CartesiaTTSService(
        api_key="your-cartesia-key",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        model_id="sonic-english",
        sample_rate=16000,
    )

    # Build the pipeline: audio in -> STT -> LLM -> TTS -> audio out
    pipeline = Pipeline([
        transport.input(),               # Receive user audio
        stt,                             # Transcribe to text
        context_aggregator.user(),       # Add user turns to the context
        llm,                             # Generate response
        tts,                             # Synthesize speech
        transport.output(),              # Send audio to user
        context_aggregator.assistant(),  # Record assistant turns
    ])

    task = PipelineTask(pipeline)
    await PipelineRunner().run(task)

# Run the voice bot
asyncio.run(create_voice_bot())
```
Voice interfaces impose constraints that text-based chat does not. Responses must be concise (users cannot "scan" audio the way they scan text). Latency above 1.5 seconds feels unresponsive. The system needs voice activity detection (VAD) to know when the user has finished speaking. It must handle interruptions (the user speaking while the system is still talking). And the response text must be optimized for spoken delivery: avoid parenthetical asides, complex lists, URLs, or code snippets that work in text but sound terrible when spoken aloud.
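A lightweight normalization pass can strip the worst text-only artifacts before handing a response to TTS. The rules below are illustrative and would be tuned per application and per voice:

```python
import re

def make_speakable(text: str) -> str:
    """Rewrite a text-optimized response for spoken delivery
    (illustrative rules; real systems tune these per TTS voice)."""
    # Drop markdown emphasis, inline-code, and heading markers
    text = re.sub(r"[*_`#]+", "", text)
    # Replace URLs with a spoken placeholder
    text = re.sub(r"https?://\S+", "the link I'm sending you", text)
    # Remove parenthetical asides, which sound awkward aloud
    text = re.sub(r"\s*\([^)]*\)", "", text)
    # Turn bullet markers into plain sentence flow
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.MULTILINE)
    # Collapse whitespace left behind by the edits
    return re.sub(r"\s+", " ", text).strip()

print(make_speakable(
    "Check the **status page** (updated hourly) at https://status.example.com"
))
```

The same idea extends to expanding numerals, dates, and acronyms; many TTS providers also accept SSML for finer pronunciation control.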
4. Voice-Specific Orchestration Challenges
Beyond the basic STT/LLM/TTS pipeline, real-time voice AI requires solving several orchestration challenges that do not arise in text-based chat.
Interruption Handling
Users may interrupt the system while it is speaking. The system needs to detect the interruption, stop the current audio playback, process the new input, and respond without losing context. This requires coordination between the STT, TTS, and transport layers.
```python
class InterruptionHandler:
    """Manages user interruptions during system speech."""

    def __init__(self):
        self.is_speaking = False
        self.current_utterance: str = ""
        self.spoken_so_far: str = ""

    async def on_speech_started(self, text: str):
        """Called when the system starts speaking."""
        self.is_speaking = True
        self.current_utterance = text
        self.spoken_so_far = ""

    async def on_speech_chunk_played(self, chunk_text: str):
        """Track how much of the response has been spoken."""
        self.spoken_so_far += chunk_text

    async def on_user_interruption(self, user_audio_detected: bool):
        """Handle the user interrupting system speech."""
        if not self.is_speaking or not user_audio_detected:
            return None
        self.is_speaking = False
        # Calculate what was and was not heard
        unspoken = self.current_utterance[len(self.spoken_so_far):]
        return {
            "action": "interrupted",
            "spoken_portion": self.spoken_so_far.strip(),
            "unspoken_portion": unspoken.strip(),
            "context_note": (
                f"System was saying: '{self.spoken_so_far.strip()}' "
                f"but was interrupted. The rest ('{unspoken.strip()[:50]}...') "
                "was not heard by the user."
            ),
        }

    async def on_speech_completed(self):
        """Called when the system finishes speaking without interruption."""
        self.is_speaking = False
        self.spoken_so_far = ""
        self.current_utterance = ""
```
5. Vision in Conversations
Multimodal LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini) can process images alongside text, enabling conversational AI systems that can see. Users can share photos, screenshots, documents, or live camera feeds, and the system can discuss what it sees. This capability transforms many use cases: visual troubleshooting ("what is wrong with this error message?"), product identification ("what plant is this?"), accessibility assistance, and interactive tutoring with visual materials.
```python
import base64

from openai import OpenAI

client = OpenAI()

def encode_image_to_base64(image_path: str) -> str:
    """Read an image file and encode it as base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

class MultimodalConversation:
    """Conversational AI with vision capabilities."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history: list[dict] = []

    def send_text(self, user_message: str) -> str:
        """Send a text-only message."""
        self.history.append({"role": "user", "content": user_message})
        return self._get_response()

    def send_image(self, image_path: str,
                   question: str = "What do you see?") -> str:
        """Send an image with an optional question."""
        b64_image = encode_image_to_base64(image_path)
        # Determine MIME type from the file extension
        ext = image_path.rsplit(".", 1)[-1].lower()
        mime_map = {"jpg": "jpeg", "jpeg": "jpeg",
                    "png": "png", "gif": "gif", "webp": "webp"}
        mime_type = f"image/{mime_map.get(ext, 'jpeg')}"
        self.history.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{b64_image}",
                        "detail": "high",
                    },
                },
            ],
        })
        return self._get_response()

    def send_image_url(self, url: str, question: str) -> str:
        """Send an image via URL with a question."""
        self.history.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": url, "detail": "auto"},
                },
            ],
        })
        return self._get_response()

    def _get_response(self) -> str:
        """Get a response from the multimodal LLM."""
        messages = [
            {"role": "system", "content": self.system_prompt},
            *self.history,
        ]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=1000,
        )
        assistant_msg = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_msg})
        return assistant_msg

# Example: Visual troubleshooting assistant
troubleshooter = MultimodalConversation(
    system_prompt=(
        "You are a technical support assistant that can analyze "
        "screenshots and photos to help users troubleshoot issues. "
        "When shown an image, describe what you see and provide "
        "specific, actionable solutions."
    )
)

# User shares a screenshot of an error
response = troubleshooter.send_image(
    "error_screenshot.png",
    "I keep getting this error when I try to start the application. "
    "What should I do?"
)
print(response)
```
The voice AI landscape is evolving rapidly. OpenAI's GPT-4o natively processes audio without a separate STT step, significantly reducing latency and enabling the model to understand tone, emotion, and non-verbal cues. Google's Gemini 2.0 offers similar native multimodal processing. These "speech-native" models are beginning to replace the traditional STT/LLM/TTS pipeline with a single model that hears, thinks, and speaks. However, the component-based pipeline remains important for customization, cost control, and vendor flexibility.
6. Comparing Voice AI Orchestration Frameworks
| Framework | Type | Key Strengths | Best For |
|---|---|---|---|
| LiveKit Agents | Open-source framework | WebRTC transport, plugin system, self-hostable | Custom voice bots, self-hosted deployments |
| Pipecat | Open-source framework | Composable pipelines, multi-provider, Python-native | Rapid prototyping, flexible architectures |
| Vapi | Managed platform | Turnkey API, phone integration, low-code setup | Phone bots, rapid deployment |
| Retell AI | Managed platform | Telephony focus, call analytics, enterprise features | Call center automation, enterprise voice |
| Custom WebSocket | DIY | Full control, no vendor lock-in | Specialized requirements, existing infrastructure |
Section 20.5 Quiz
Key Takeaways
- STT accuracy is the foundation: Every transcription error propagates through the entire pipeline. Choose STT providers based on your specific requirements: Deepgram for low-latency streaming, Whisper for multilingual batch processing, AssemblyAI for analytics-rich transcription.
- TTFB trumps total latency: In voice AI, time-to-first-byte is the most important latency metric. Streaming TTS that starts playing quickly creates a perception of responsiveness even if total generation takes longer.
- Orchestration is the hard part: The individual components (STT, LLM, TTS) are well-solved problems. The engineering challenge is orchestrating them with proper VAD, turn-taking, interruption handling, and transport. Frameworks like Pipecat and LiveKit Agents abstract much of this complexity.
- Voice requires different response design: Text responses do not translate well to voice. Keep responses short, avoid visual formatting, optimize for spoken delivery, and use conversational phrasing. Add filler phrases to indicate processing.
- Multimodal is more than voice: Vision-capable models enable powerful new interaction patterns where users share images, screenshots, or camera feeds. The conversation history must track both text and image references to maintain context across multimodal interactions.