Generative AI has expanded beyond text and images into audio, music, and video. Modern text-to-speech systems produce natural-sounding voices from seconds of reference audio. Music models compose original songs in specified genres and styles. Video generation models create cinematic clips from text descriptions. These modalities share core architectural ideas with image generation (diffusion, transformers, flow matching) but introduce unique challenges: temporal coherence, audio waveform synthesis, and the enormous computational demands of high-resolution video. Together with image generation, they form the complete stack of multimodal generative AI.
1. Text-to-Speech (TTS) Systems
Text-to-speech has undergone a revolution in the past few years. Traditional concatenative and parametric systems have been replaced by neural models that produce speech nearly indistinguishable from human recordings. The modern TTS pipeline typically involves a text encoder that converts input text to phonemes or tokens, an acoustic model that generates mel-spectrograms or audio tokens, and a vocoder that converts these representations into audible waveforms.
VITS: End-to-End Speech Synthesis
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) combines a variational autoencoder, normalizing flows, and adversarial training into a single end-to-end model. Unlike earlier two-stage approaches (text to spectrogram, then spectrogram to waveform), VITS generates raw audio directly from text, producing high-quality speech with natural prosody. It remains one of the most efficient architectures for real-time TTS.
```python
# Using Coqui TTS (open-source VITS implementation)
from TTS.api import TTS

# List available models
print(TTS().list_models())

# Load a VITS model for English
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Generate speech from text
tts.tts_to_file(
    text="Neural text-to-speech has made enormous progress in recent years.",
    file_path="output_vits.wav",
)

# Multi-speaker model with voice cloning
tts_multi = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts_multi.tts_to_file(
    text="Voice cloning requires only a few seconds of reference audio.",
    file_path="cloned_output.wav",
    speaker_wav="reference_voice.wav",  # 6+ seconds of target speaker
    language="en",
)
```
Bark: Generative Audio with Paralinguistics
Bark, developed by Suno, takes a different approach by modeling speech as a sequence of audio tokens using an autoregressive transformer (similar to how GPT models text). This token-based approach naturally handles not just speech but also laughter, music, background noise, and paralinguistic cues. Bark generates semantic tokens from text, converts them to coarse acoustic tokens, then refines them to fine acoustic tokens, with each stage handled by a separate transformer.
```python
from transformers import AutoProcessor, BarkModel
import scipy

# Load Bark model
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")
model = model.to("cuda")

# Generate speech with paralinguistic cues
text = "Hello! [laughs] This is an example of Bark generating speech with emotion."
inputs = processor(text, voice_preset="v2/en_speaker_6")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

# Save the generated audio
sample_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)
```
F5-TTS and Zero-Shot Voice Cloning
F5-TTS represents the latest generation of TTS models built on flow matching (the same technique behind Flux for image generation). It uses a diffusion transformer (DiT) architecture to generate mel-spectrograms from text, conditioned on a reference speech sample. The flow matching approach enables fast, high-quality generation with natural prosody. F5-TTS achieves remarkable zero-shot voice cloning from as little as 3 seconds of reference audio, making it one of the most accessible voice cloning systems available.
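To make the flow matching idea concrete, the sketch below trains a toy velocity predictor on the conditional flow matching objective such models use: sample noise, interpolate linearly toward the clean target (here a mel frame), and regress the constant velocity along that path. The `TinyDiT` module and tensor shapes are illustrative stand-ins, not the actual F5-TTS architecture.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy stand-in for the diffusion transformer that predicts the velocity field."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 256),
            nn.GELU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy mel frame, conditioning features, and the timestep
        t = t.expand(x_t.shape[0], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: clean mel frames (batch, dim); cond: text/reference conditioning (batch, dim)."""
    x0 = torch.randn_like(x1)       # noise sample
    t = torch.rand(1)               # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1     # point on the straight-line path from noise to data
    target_velocity = x1 - x0       # constant velocity along that path
    pred = model(x_t, t, cond)
    return ((pred - target_velocity) ** 2).mean()

model = TinyDiT()
loss = flow_matching_loss(model, torch.randn(4, 80), torch.randn(4, 80))
loss.backward()
```

At inference, the trained velocity field is integrated from noise to data in a handful of steps, which is why flow matching models can generate quickly.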
Real-Time Conversational Audio
GPT-4o introduced native audio input and output, meaning the model can listen, understand, and respond with natural speech in real time. Rather than the traditional pipeline of speech-to-text followed by LLM processing followed by text-to-speech, GPT-4o processes audio tokens directly within the transformer, preserving nuances like intonation, emotion, and speaking pace. This brings response latency down to a few hundred milliseconds, close to human conversational turn-taking. Kyutai's Moshi follows a similar approach as an open-source alternative, using a multi-stream architecture that processes both the user's speech and its own generated speech simultaneously, enabling natural turn-taking and even interruption handling.
Two paradigms dominate modern TTS. Spectrogram-based approaches (VITS, F5-TTS) generate mel-spectrograms that a vocoder converts to waveforms. They offer fine-grained control over prosody and are well-understood. Token-based approaches (Bark, VALL-E, GPT-4o) discretize audio into tokens using neural codecs like EnCodec, then model speech as a sequence prediction problem. Token-based systems naturally handle non-speech sounds and enable unified multimodal models, but may produce artifacts at token boundaries. The field is converging toward token-based representations as codec quality improves.
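As a brief sketch of what the token-based paradigm looks like in practice, the example below uses the EnCodec checkpoint available in Hugging Face transformers to encode a waveform into discrete codebook indices and decode them back. The one-second sine wave is a toy stand-in for recorded speech.

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Toy 1-second waveform at the codec's sampling rate, standing in for real speech
sample_rate = processor.sampling_rate
waveform = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate).astype(np.float32)

inputs = processor(raw_audio=waveform, sampling_rate=sample_rate, return_tensors="pt")

# Encode to discrete tokens: one integer sequence per codebook
with torch.no_grad():
    encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoder_outputs.audio_codes.shape)

# Decode the token sequences back into a waveform
with torch.no_grad():
    audio = model.decode(
        encoder_outputs.audio_codes,
        encoder_outputs.audio_scales,
        inputs["padding_mask"],
    )[0]
```

A token-based TTS or music model treats those codebook indices like text tokens, predicting them autoregressively before the codec decoder turns them back into audio.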
2. Music Generation
Music generation with AI has progressed from simple MIDI patterns to full song production with vocals, instrumentation, and complex arrangements. The core challenge is modeling long-range temporal structure: music has hierarchical patterns from individual notes (milliseconds) to phrases (seconds) to sections (minutes) that must all be coherent.
MusicLM and MusicGen
Google's MusicLM was the first model to generate high-fidelity music from text descriptions at 24kHz. It uses a hierarchical sequence-to-sequence approach: MuLan embeddings (a music-text joint embedding model) condition semantic tokens from w2v-BERT, which then condition acoustic tokens from SoundStream. Meta's MusicGen simplified this into a single autoregressive transformer that generates EnCodec audio tokens directly, conditioned on text or melody. MusicGen introduced efficient codebook interleaving patterns that allow generating multiple codec streams with a single model pass.
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the medium-sized MusicGen model
model = MusicGen.get_pretrained("facebook/musicgen-medium")
model.set_generation_params(
    duration=15,      # Generate 15 seconds of audio
    top_k=250,
    top_p=0.0,
    temperature=1.0,
    cfg_coef=3.0,     # Classifier-free guidance strength
)

# Text-conditional music generation
descriptions = [
    "An upbeat electronic dance track with driving bass and synth arpeggios",
    "A peaceful acoustic guitar melody with gentle fingerpicking",
]
wav = model.generate(descriptions)

# Save each generated track
for idx, one_wav in enumerate(wav):
    audio_write(
        f"music_{idx}",
        one_wav.cpu(),
        model.sample_rate,
        strategy="loudness",
    )
```
Suno and Udio: Full Song Generation
Suno and Udio represent the current state of the art in full song generation, producing complete songs with vocals, instrumentation, and lyrics. These commercial systems accept text descriptions of genre, mood, and style, along with optional lyrics, and generate radio-quality tracks of 2 to 4 minutes. While the exact architectures are proprietary, they likely combine text-conditioned audio generation with separate vocal synthesis and mixing stages. The quality has reached a point where generated music is difficult to distinguish from human-produced tracks in blind listening tests.
Music generation raises significant legal and ethical questions. Models trained on copyrighted music may reproduce recognizable melodies, chord progressions, or production styles. Suno and Udio face ongoing lawsuits from major record labels. When deploying music generation, consider: the training data provenance, whether outputs could constitute derivative works, the legal landscape in your jurisdiction, and whether your use case requires royalty-free generation. Using models trained exclusively on licensed or public domain music reduces legal exposure.
3. Text-to-Video Generation
Text-to-video is arguably the most challenging generative modality, requiring the model to produce temporally coherent sequences of frames that are individually high quality and collectively tell a consistent visual story. A single second of 24fps 1080p video contains roughly 50 million pixels, compared to about 1 million for a single 1024x1024 image.
Architecture: Diffusion Transformers (DiT) for Video
Most modern video generation models extend the Diffusion Transformer (DiT) architecture to handle spatiotemporal data. Instead of processing a single image's latent representation, the model processes a 3D latent volume (height, width, time). Attention operates across both spatial and temporal dimensions, either through factored attention (separate spatial and temporal attention layers) or full 3D attention. The latent space comes from a video VAE that compresses frames both spatially and temporally.
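As a rough illustration of factored attention, the block below runs spatial attention within each frame and temporal attention across frames on a patchified latent volume. It is a bare-bones sketch (no conditioning, normalization, or MLP sublayers), not the layout of any particular production model.

```python
import torch
import torch.nn as nn

class FactoredSpatioTemporalBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim) -- a patchified 3D latent volume
        b, t, s, d = x.shape

        # Spatial attention: each frame attends over its own patches
        xs = x.reshape(b * t, s, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]

        # Temporal attention: each spatial location attends across frames
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]

        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

block = FactoredSpatioTemporalBlock()
latent = torch.randn(1, 8, 64, 512)  # 8 frames, 64 patches per frame
out = block(latent)                   # same shape, now mixed across space and time
```

Full 3D attention replaces the two passes with a single attention over all frames and patches at once, which is more expressive but scales quadratically in the total token count.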
Leading Video Generation Models
OpenAI's Sora demonstrated that scaling DiT architectures to video produces remarkable results, generating up to 60-second clips with consistent characters, realistic physics, and cinematic camera movements. Runway's Gen-3 Alpha focuses on controllable commercial video generation with features like motion brush and camera control. Kuaishou's Kling 2 achieves strong temporal consistency using a 3D VAE that compresses video in both space and time. Google's Veo 2 generates high-definition video with excellent prompt adherence and physical realism.
| Model | Developer | Max Duration | Resolution | Key Strength |
|---|---|---|---|---|
| Sora | OpenAI | 60 sec | 1080p | Temporal coherence, physics |
| Runway Gen-3 Alpha | Runway | 10 sec | 1080p | Controllability, motion brush |
| Kling 2 | Kuaishou | 10 sec | 1080p | 3D VAE, consistency |
| Veo 2 | Google | 8 sec | 4K | Resolution, prompt adherence |
| CogVideoX | Zhipu AI | 6 sec | 720p | Open-source, extensible |
| Wan 2.1 | Alibaba | 5 sec | 720p | Open-source, image-to-video |
```python
# Using CogVideoX (open-source) via diffusers
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A golden retriever running through a sunlit meadow, slow motion, cinematic"

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "golden_retriever.mp4", fps=8)
```
Video generation's hardest problem is not frame quality (which borrows from image generation) but temporal coherence. Objects must maintain consistent appearance across frames, physics must be plausible (gravity, reflections, shadows), and camera motion must feel natural. Models achieve this through temporal attention layers, 3D latent spaces that encode time alongside space, and training on large-scale video datasets. The rapid quality improvements from Sora onward suggest that scaling compute and data for video DiT architectures is a reliable path to better results.
4. 3D Generation
3D content generation from text or images is an emerging frontier. Current approaches include score distillation sampling (SDS), which uses a pre-trained 2D diffusion model to optimize a 3D representation (NeRF or Gaussian splatting) so that it looks correct from every viewing angle. DreamFusion pioneered this approach, and subsequent work like Instant3D and LRM (Large Reconstruction Model) has made 3D generation faster and more reliable.
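The sketch below shows the core of one SDS update under heavy simplification: the 3D "scene" is just a learnable image tensor, and a frozen random convolution stands in for the pretrained 2D diffusion model (the timestep weighting is also omitted). Real implementations render the scene from a sampled camera pose and query an actual text-conditioned UNet, but the gradient flow is the same: diffuse the render, ask the frozen prior to denoise it, and push the residual back into the scene parameters.

```python
import torch

scene = torch.nn.Parameter(torch.rand(1, 3, 64, 64))   # stand-in for NeRF / Gaussian splat parameters
prior = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).requires_grad_(False)  # stand-in for the frozen 2D UNet
optimizer = torch.optim.Adam([scene], lr=1e-2)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)      # toy noise schedule

for step in range(10):
    image = scene                                        # real SDS: image = render(scene, random_camera)
    t = torch.randint(50, 950, (1,))
    noise = torch.randn_like(image)
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noisy = a_t.sqrt() * image + (1 - a_t).sqrt() * noise  # forward-diffuse the rendered view

    with torch.no_grad():
        noise_pred = prior(noisy)                        # real SDS: frozen, text-conditioned noise prediction

    grad = noise_pred - noise                            # SDS gradient direction (noise residual)
    loss = (grad.detach() * image).sum()                 # surrogate loss whose gradient w.r.t. the scene is `grad`
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```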
Multimodal Composition Pipelines
Real-world applications often combine multiple generative modalities into a single pipeline. A film production tool might use an LLM to write a script, a video model to generate scenes, a TTS model for dialogue, and a music model for the soundtrack. Orchestrating these components requires careful attention to temporal synchronization, style consistency, and resource management. Frameworks like ComfyUI provide node-based interfaces for building such pipelines, while programmatic approaches use Python to chain models together.
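As a small programmatic example, the sketch below chains two of the models shown earlier, Coqui TTS for narration and MusicGen for a music bed, and mixes them with pydub. The file names, the 12 dB duck, and the naive mixing are illustrative choices; a production pipeline would handle timing, loudness, and sample-rate alignment more carefully.

```python
from TTS.api import TTS
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
from pydub import AudioSegment

script = "In a quiet meadow, a golden retriever chases the morning light."

# 1. Narration for the scene
tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text=script, file_path="narration.wav")

# 2. Background music matched to the scene's mood
music_model = MusicGen.get_pretrained("facebook/musicgen-medium")
music_model.set_generation_params(duration=10)
music = music_model.generate(["gentle acoustic guitar, warm and calm"])
audio_write("bed_music", music[0].cpu(), music_model.sample_rate, strategy="loudness")

# 3. Duck the music by 12 dB and lay the narration on top
bed = AudioSegment.from_wav("bed_music.wav") - 12
narration = AudioSegment.from_wav("narration.wav")
mix = bed.overlay(narration)
mix.export("scene_audio.wav", format="wav")
```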
Video generation is extremely compute-intensive. Generating a single 5-second clip at 720p can take 5 to 15 minutes on a high-end GPU. Training video models requires thousands of GPUs running for weeks. This computational cost is the primary reason video generation lags behind image generation in quality and accessibility. Cloud APIs (Runway, Sora, Kling) abstract this cost into per-second pricing, while open-source models like CogVideoX and Wan let researchers experiment on smaller scales with reduced resolution and frame counts.
Key Takeaways
- Modern TTS uses either spectrogram-based (VITS, F5-TTS) or token-based (Bark, VALL-E) approaches, with zero-shot voice cloning from seconds of reference audio now readily available.
- Real-time conversational audio (GPT-4o, Moshi) processes audio tokens directly in the transformer, enabling natural voice interaction with latencies of a few hundred milliseconds.
- Music generation has progressed from instrumental snippets to full songs with vocals. MusicGen provides an open-source baseline, while Suno and Udio push commercial quality.
- Video generation extends image diffusion to spatiotemporal volumes. DiT architectures with spatial and temporal attention achieve remarkable quality, though compute costs remain very high.
- Temporal coherence is the defining challenge of video generation, requiring consistent object appearance, plausible physics, and natural motion across frames.
- Multimodal composition pipelines combine text, image, audio, video, and music generation into end-to-end production workflows, orchestrated by LLMs or visual programming tools.