Module 23 · Section 23.2

Audio, Music & Video Generation

Text-to-speech, voice cloning, real-time conversational audio, music generation, text-to-video, and 3D generation with modern generative models
★ Big Picture

Generative AI has expanded beyond text and images into audio, music, and video. Modern text-to-speech systems produce natural-sounding voices from seconds of reference audio. Music models compose original songs in specified genres and styles. Video generation models create cinematic clips from text descriptions. These modalities share core architectural ideas with image generation (diffusion, transformers, flow matching) but introduce unique challenges: temporal coherence, audio waveform synthesis, and the enormous computational demands of high-resolution video. Together with image generation, they form the complete stack of multimodal generative AI.

1. Text-to-Speech (TTS) Systems

Text-to-speech has undergone a revolution in the past few years. Traditional concatenative and parametric systems have been replaced by neural models that produce speech nearly indistinguishable from human recordings. The modern TTS pipeline typically involves a text encoder that converts input text to phonemes or tokens, an acoustic model that generates mel-spectrograms or audio tokens, and a vocoder that converts these representations into audible waveforms.
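The three-stage pipeline can be sketched with placeholder stages. Everything here is illustrative: the function names, shapes, and the 256-sample hop length are stand-ins, not any specific library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(text):
    # Stand-in text encoder: map characters to integer "phoneme" ids.
    return np.array([ord(c) % 64 for c in text])

def acoustic_model(tokens, n_mels=80, frames_per_token=4):
    # Stand-in acoustic model: a few mel-spectrogram frames per input token.
    return rng.normal(size=(n_mels, len(tokens) * frames_per_token))

def vocoder(mel, hop_length=256):
    # Stand-in vocoder: hop_length waveform samples per spectrogram frame.
    return rng.normal(size=mel.shape[1] * hop_length)

tokens = encode_text("hello world")   # (11,) token ids
mel = acoustic_model(tokens)          # (80, 44) mel-spectrogram
waveform = vocoder(mel)               # (11264,) audio samples
```

Real systems differ in where they draw these boundaries (VITS collapses all three stages into one model; token-based systems replace the mel-spectrogram with discrete codec tokens), but the text-to-intermediate-to-waveform flow is the common skeleton.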

VITS: End-to-End Speech Synthesis

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) combines a variational autoencoder, normalizing flows, and adversarial training into a single end-to-end model. Unlike earlier two-stage approaches (text to spectrogram, then spectrogram to waveform), VITS generates raw audio directly from text, producing high-quality speech with natural prosody. It remains one of the most efficient architectures for real-time TTS.

# Using Coqui TTS (open-source VITS implementation)
from TTS.api import TTS

# List available models
print(TTS().list_models())

# Load a VITS model for English
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Generate speech from text
tts.tts_to_file(
    text="Neural text-to-speech has made enormous progress in recent years.",
    file_path="output_vits.wav",
)

# Multi-speaker model with voice cloning
tts_multi = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts_multi.tts_to_file(
    text="Voice cloning requires only a few seconds of reference audio.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",    # 6+ seconds of target speaker
    language="en",
)

Bark: Generative Audio with Paralinguistics

Bark, developed by Suno, takes a different approach by modeling speech as a sequence of audio tokens using an autoregressive transformer (similar to how GPT models text). This token-based approach naturally handles not just speech but also laughter, music, background noise, and paralinguistic cues. Bark generates semantic tokens from text, converts them to coarse acoustic tokens, then refines them to fine acoustic tokens, with each stage handled by a separate transformer.

from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile


# Load Bark model
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")
model = model.to("cuda")

# Generate speech with paralinguistic cues
text = "Hello! [laughs] This is an example of Bark generating speech with emotion."

inputs = processor(text, voice_preset="v2/en_speaker_6")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

# Save the generated audio
sample_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)

F5-TTS and Zero-Shot Voice Cloning

F5-TTS represents the latest generation of TTS models built on flow matching (the same technique behind Flux for image generation). It uses a diffusion transformer (DiT) architecture to generate mel-spectrograms from text, conditioned on a reference speech sample. The flow matching approach enables fast, high-quality generation with natural prosody. F5-TTS achieves remarkable zero-shot voice cloning from as little as 3 seconds of reference audio, making it one of the most accessible voice cloning systems available.
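The flow matching objective behind F5-TTS can be illustrated with a toy NumPy sketch. The straight-line probability path and constant-velocity target shown here are the standard conditional flow matching setup; no F5-TTS internals are implied, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mel" batch: (batch, n_mels, frames)
x1 = rng.normal(size=(4, 80, 100))   # data sample
x0 = rng.normal(size=x1.shape)       # Gaussian noise
t = rng.uniform(size=(4, 1, 1))      # random time in [0, 1]

# Straight-line path from noise to data; the model is trained to
# predict the constant velocity (x1 - x0) at every point on the path.
xt = (1 - t) * x0 + t * x1
target_velocity = x1 - x0

# At inference, integrating the learned velocity field from t=0 to t=1
# recovers the data; with the exact velocity, one Euler step suffices.
generated = x0 + 1.0 * target_velocity
```

The straight paths are what make flow matching fast at inference: a trained model can take large integration steps because the trajectory it learned is close to linear.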

Figure 23.3: Modern TTS architecture. Text and speaker embeddings condition a diffusion or flow matching model that generates mel-spectrograms, which a vocoder converts to audio.

Real-Time Conversational Audio

GPT-4o introduced native audio input and output, meaning the model can listen, understand, and respond with natural speech in real time. Rather than the traditional pipeline of speech-to-text followed by LLM processing followed by text-to-speech, GPT-4o processes audio tokens directly within the transformer, preserving nuances like intonation, emotion, and speaking pace. This enables response latencies around 300 ms, comparable to human conversational turn-taking. Kyutai's Moshi follows a similar approach as an open-source alternative, using a multi-stream architecture that processes both the user's speech and its own generated speech simultaneously, enabling natural turn-taking and even interruption handling.

📘 Token-Based vs. Spectrogram-Based TTS

Two paradigms dominate modern TTS. Spectrogram-based approaches (VITS, F5-TTS) generate mel-spectrograms that a vocoder converts to waveforms. They offer fine-grained control over prosody and are well-understood. Token-based approaches (Bark, VALL-E, GPT-4o) discretize audio into tokens using neural codecs like EnCodec, then model speech as a sequence prediction problem. Token-based systems naturally handle non-speech sounds and enable unified multimodal models, but may produce artifacts at token boundaries. The field is converging toward token-based representations as codec quality improves.
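The neural codecs behind token-based TTS (EnCodec, SoundStream) rest on residual vector quantization: each codebook quantizes the residual left over by the previous one. A minimal NumPy sketch of the encoding step, with toy codebooks rather than any real codec's parameters:

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantization: each stage picks the nearest
    codeword for the residual left by the previous stage."""
    residual = frames
    codes = []
    for cb in codebooks:  # cb: (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))                          # 50 latent frames, dim 8
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 quantizer stages
codes, residual = rvq_encode(frames, codebooks)
```

Each frame ends up represented by one token id per codebook, giving several parallel token streams per audio clip; these discrete streams are what autoregressive models like Bark and VALL-E learn to predict.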

2. Music Generation

Music generation with AI has progressed from simple MIDI patterns to full song production with vocals, instrumentation, and complex arrangements. The core challenge is modeling long-range temporal structure: music has hierarchical patterns from individual notes (milliseconds) to phrases (seconds) to sections (minutes) that must all be coherent.

MusicLM and MusicGen

Google's MusicLM was the first model to generate high-fidelity music from text descriptions at 24kHz. It uses a hierarchical sequence-to-sequence approach: MuLan embeddings (a music-text joint embedding model) condition semantic tokens from w2v-BERT, which then condition acoustic tokens from SoundStream. Meta's MusicGen simplified this into a single autoregressive transformer that generates EnCodec audio tokens directly, conditioned on text or melody. MusicGen introduced efficient codebook interleaving patterns that allow generating multiple codec streams with a single model pass.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the medium-sized MusicGen model
model = MusicGen.get_pretrained("facebook/musicgen-medium")
model.set_generation_params(
    duration=15,           # Generate 15 seconds of audio
    top_k=250,
    top_p=0.0,             # 0.0 disables nucleus sampling, so top_k is used
    temperature=1.0,
    cfg_coef=3.0,          # Classifier-free guidance strength
)

# Text-conditional music generation
descriptions = [
    "An upbeat electronic dance track with driving bass and synth arpeggios",
    "A peaceful acoustic guitar melody with gentle fingerpicking",
]

wav = model.generate(descriptions)

# Save each generated track
for idx, one_wav in enumerate(wav):
    audio_write(
        f"music_{idx}",
        one_wav.cpu(),
        model.sample_rate,
        strategy="loudness",
    )
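The codebook interleaving mentioned above can be sketched in a few lines. This shows the "delay" pattern with toy token ids; the real model uses special padding tokens rather than -1.

```python
def delay_interleave(codes, pad=-1):
    """Offset codebook k by k steps so that one autoregressive pass can
    emit all K streams: at step t the model predicts codebook 0 for
    frame t, codebook 1 for frame t-1, and so on."""
    K, T = len(codes), len(codes[0])
    out = [[pad] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

streams = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]  # 3 codebooks, 3 frames
print(delay_interleave(streams))
# [[10, 11, 12, -1, -1], [-1, 20, 21, 22, -1], [-1, -1, 30, 31, 32]]
```

The delay means each codebook's prediction can condition on the coarser codebooks' tokens for the same frame, which were emitted at earlier steps, without needing a separate model per stream.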

Suno and Udio: Full Song Generation

Suno and Udio represent the current state of the art in full song generation, producing complete songs with vocals, instrumentation, and lyrics. These commercial systems accept text descriptions of genre, mood, and style, along with optional lyrics, and generate radio-quality tracks of 2 to 4 minutes. While the exact architectures are proprietary, they likely combine text-conditioned audio generation with separate vocal synthesis and mixing stages. The quality has reached a point where generated music is difficult to distinguish from human-produced tracks in blind listening tests.

⚠ Copyright and Music Generation

Music generation raises significant legal and ethical questions. Models trained on copyrighted music may reproduce recognizable melodies, chord progressions, or production styles. Suno and Udio face ongoing lawsuits from major record labels. When deploying music generation, consider: the training data provenance, whether outputs could constitute derivative works, the legal landscape in your jurisdiction, and whether your use case requires royalty-free generation. Using models trained exclusively on licensed or public domain music reduces legal exposure.

3. Text-to-Video Generation

Text-to-video is arguably the most challenging generative modality, requiring the model to produce temporally coherent sequences of frames that are individually high quality and collectively tell a consistent visual story. A single second of 24fps 1080p video contains roughly 50 million pixels, compared to about 1 million for a single 1024x1024 image.

Architecture: Diffusion Transformers (DiT) for Video

Most modern video generation models extend the Diffusion Transformer (DiT) architecture to handle spatiotemporal data. Instead of processing a single image's latent representation, the model processes a 3D latent volume (height, width, time). Attention operates across both spatial and temporal dimensions, either through factored attention (separate spatial and temporal attention layers) or full 3D attention. The latent space comes from a video VAE that compresses frames both spatially and temporally.
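Factored attention can be sketched in NumPy: the same attention routine runs first over space within each frame, then over time at each spatial location. This is a single-head sketch with no learned projections, purely to show the axis swapping.

```python
import numpy as np

def attention(x):
    # x: (batch, seq, dim) -> scaled dot-product self-attention
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ x

T, H, W, D = 8, 4, 4, 16             # frames, latent height/width, channels
rng = np.random.default_rng(0)
latents = rng.normal(size=(T, H * W, D))

# Spatial attention: tokens within one frame attend to each other.
latents = attention(latents)                     # batch axis = time

# Temporal attention: each spatial location attends across frames.
latents = attention(latents.transpose(1, 0, 2))  # batch axis = space
latents = latents.transpose(1, 0, 2)             # back to (T, H*W, D)
```

Full 3D attention instead flattens all T · H · W tokens into one sequence, which is more expressive but quadratically more expensive; factoring is the common compromise at high resolutions.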

Figure 23.4: Video DiT architecture. A 3D diffusion transformer with spatial, temporal, and cross-attention layers denoises video latents, which the video VAE decodes into frames.

Leading Video Generation Models

OpenAI's Sora demonstrated that scaling DiT architectures to video produces remarkable results, generating up to 60-second clips with consistent characters, realistic physics, and cinematic camera movements. Runway's Gen-3 Alpha focuses on controllable commercial video generation with features like motion brush and camera control. Kuaishou's Kling 2 achieves strong temporal consistency using a 3D VAE that compresses video in both space and time. Google's Veo 2 generates high-definition video with excellent prompt adherence and physical realism.

Model              | Developer | Max Duration | Resolution | Key Strength
Sora               | OpenAI    | 60 sec       | 1080p      | Temporal coherence, physics
Runway Gen-3 Alpha | Runway    | 10 sec       | 1080p      | Controllability, motion brush
Kling 2            | Kuaishou  | 10 sec       | 1080p      | 3D VAE, consistency
Veo 2              | Google    | 8 sec        | 4K         | Resolution, prompt adherence
CogVideoX          | Zhipu AI  | 6 sec        | 720p       | Open-source, extensible
Wan 2.1            | Alibaba   | 5 sec        | 720p       | Open-source, image-to-video

# Using CogVideoX (open-source) via diffusers
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A golden retriever running through a sunlit meadow, slow motion, cinematic"

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "golden_retriever.mp4", fps=8)
🔍 Key Insight

Video generation's hardest problem is not frame quality (which borrows from image generation) but temporal coherence. Objects must maintain consistent appearance across frames, physics must be plausible (gravity, reflections, shadows), and camera motion must feel natural. Models achieve this through temporal attention layers, 3D latent spaces that encode time alongside space, and training on large-scale video datasets. The rapid quality improvements from Sora onward suggest that scaling compute and data for video DiT architectures is a reliable path to better results.

4. 3D Generation

3D content generation from text or images is an emerging frontier. Current approaches include score distillation sampling (SDS), which uses a pre-trained 2D diffusion model to optimize a 3D representation (NeRF or Gaussian splatting) so that it looks correct from every viewing angle. DreamFusion pioneered this approach, and subsequent work like Instant3D and LRM (Large Reconstruction Model) has made 3D generation faster and more reliable.
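The SDS update can be shown on a toy problem where the "renderer" is the identity map and the "diffusion critic" is a function whose noise prediction pulls renderings toward a fixed target image. All of this is a deliberately simplified stand-in for the real NeRF renderer and 2D diffusion model; only the shape of the update rule is faithful.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 8))   # what the toy "diffusion prior" prefers
theta = np.zeros((8, 8))           # the "3D" parameters being optimized

def eps_hat(x_noisy, alpha, sigma):
    # Toy critic: the eps prediction of a model whose ideal output is `target`.
    return (x_noisy - alpha * target) / sigma

for step in range(200):
    t = rng.uniform(0.02, 0.98)
    alpha, sigma = np.sqrt(1 - t), np.sqrt(t)
    eps = rng.normal(size=theta.shape)
    render = theta                          # identity "renderer"
    x_noisy = alpha * render + sigma * eps  # forward diffusion of the render
    # SDS gradient: (eps_hat - eps) times d render / d theta (identity here)
    grad = eps_hat(x_noisy, alpha, sigma) - eps
    theta -= 0.1 * grad
```

Note that the sampled noise `eps` cancels inside the gradient, leaving a pull from the current rendering toward what the critic prefers; in real SDS the renderings come from random camera viewpoints, so this pull shapes the 3D representation from every angle at once.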

Multimodal Composition Pipelines

Real-world applications often combine multiple generative modalities into a single pipeline. A film production tool might use an LLM to write a script, a video model to generate scenes, a TTS model for dialogue, and a music model for the soundtrack. Orchestrating these components requires careful attention to temporal synchronization, style consistency, and resource management. Frameworks like ComfyUI provide node-based interfaces for building such pipelines, while programmatic approaches use Python to chain models together.

Figure 23.5: Multimodal composition pipeline. An LLM generates the script, which drives parallel video, speech, music, and SFX generation before final composition.
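The fan-out/fan-in structure of such a pipeline can be expressed with concurrent.futures. The generator stages here are stub functions standing in for real model or API calls; only the orchestration pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub stages; in a real pipeline each would call a generative model or API.
def write_script(prompt):  return f"SCRIPT for: {prompt}"
def gen_video(script):     return f"video({script})"
def gen_dialogue(script):  return f"speech({script})"
def gen_music(script):     return f"music({script})"
def gen_sfx(script):       return f"sfx({script})"
def composite(video, speech, music, sfx):
    return f"final[{video} + {speech} + {music} + {sfx}]"

script = write_script("a short nature documentary")

# The four generation stages only depend on the script, so they can run
# in parallel before the compositor synchronizes and mixes the results.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, script)
               for fn in (gen_video, gen_dialogue, gen_music, gen_sfx)]
    video, speech, music, sfx = (f.result() for f in futures)

final = composite(video, speech, music, sfx)
```

For API-backed stages, threads are a reasonable fit because the work is I/O-bound; locally hosted models would instead contend for GPU memory, which is where the resource management mentioned above becomes the hard part.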
📘 Compute Requirements for Video

Video generation is extremely compute-intensive. Generating a single 5-second clip at 720p can take 5 to 15 minutes on a high-end GPU. Training video models requires thousands of GPUs running for weeks. This computational cost is the primary reason video generation lags behind image generation in quality and accessibility. Cloud APIs (Runway, Sora, Kling) abstract this cost into per-second pricing, while open-source models like CogVideoX and Wan let researchers experiment on smaller scales with reduced resolution and frame counts.

Knowledge Check

1. What are the two main paradigms for modern TTS, and how do they differ?
The two paradigms are spectrogram-based (VITS, F5-TTS) and token-based (Bark, VALL-E, GPT-4o). Spectrogram-based systems generate mel-spectrograms that a vocoder converts to waveforms, offering fine-grained prosody control. Token-based systems discretize audio into tokens using neural codecs like EnCodec and model speech as sequence prediction, naturally handling non-speech sounds and enabling unified multimodal models.
2. How does GPT-4o's approach to audio differ from traditional speech-to-text plus text-to-speech pipelines?
Traditional pipelines convert speech to text, process it with an LLM, then convert the response back to speech, losing paralinguistic information (tone, emotion, pacing) at each conversion. GPT-4o processes audio tokens directly within the transformer, preserving these nuances and achieving response latencies around 300 ms by eliminating the separate conversion stages.
3. Why is temporal coherence the hardest challenge in video generation?
Temporal coherence requires maintaining consistent object appearance, plausible physics (gravity, reflections, shadows), and natural camera motion across dozens to hundreds of frames. Unlike single images, video generation must ensure that each frame is not only individually high-quality but also consistent with all neighboring frames, creating an exponentially larger constraint space.
4. How does MusicGen use codebook interleaving to efficiently generate audio?
Neural audio codecs like EnCodec produce multiple parallel streams (codebooks) of tokens at different frequency resolutions. MusicGen uses codebook interleaving patterns to flatten these parallel streams into a single sequence that one autoregressive transformer can generate. This avoids the need for separate models per codebook while maintaining audio quality across frequency bands.
5. What is score distillation sampling (SDS) and how does it enable text-to-3D generation?
SDS uses a pre-trained 2D diffusion model as a critic to optimize a 3D representation (NeRF or Gaussian splatting). The 3D object is rendered from random viewpoints, and the diffusion model provides gradients indicating how to make each rendering look more like the text description. By optimizing across many viewpoints, the 3D object converges to something that looks correct from every angle, effectively lifting 2D image generation knowledge into 3D.

Key Takeaways