Module 23 · Section 23.1

Image Generation & Vision-Language Models

Diffusion models, flow matching, controlled generation, vision encoders, and multimodal LLMs that see and reason about images
★ Big Picture

Image generation and visual understanding have converged into a unified multimodal AI stack. On the generation side, diffusion models and flow matching produce photorealistic images from text prompts, while controlled generation techniques (ControlNet, IP-Adapter) give fine-grained creative control. On the understanding side, vision encoders like CLIP and SigLIP bridge the gap between pixels and language, enabling vision-language models (GPT-4V, LLaVA, Gemini) that can see, describe, and reason about images. Together, these technologies form the foundation of multimodal AI.

1. Diffusion Models for Image Generation

Diffusion models generate images by learning to reverse a noise-addition process. During training, the model learns to predict and remove noise from progressively corrupted images. During inference, it starts from pure random noise and iteratively denoises it into a coherent image, guided by a text prompt. This elegant framework has become the dominant paradigm for high-quality image generation, surpassing earlier approaches like GANs and VAEs in both quality and controllability.

The Forward and Reverse Process

The forward process adds Gaussian noise to a clean image over T timesteps until the image becomes pure noise. The reverse process, parameterized by a neural network (typically a U-Net or transformer), learns to denoise at each step. The key insight is that predicting the noise added at each step is equivalent to learning the score function (gradient of the log probability) of the data distribution.
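The closed-form forward process can be sketched in a few lines. This is a minimal illustration, assuming a linear beta schedule over T = 1000 steps (the schedule values and tensor shapes here are illustrative, not tied to any particular model):

```python
import torch

# Closed-form forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

def add_noise(x0: torch.Tensor, t: int):
    """Sample x_t from q(x_t | x_0); returns the noisy image and the noise target."""
    eps = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps
    return xt, eps  # the network is trained to predict eps given (xt, t)

x0 = torch.randn(1, 3, 64, 64)  # stand-in for a clean image
xt, eps = add_noise(x0, t=500)
```

Because alphas_bar shrinks toward zero as t grows, x_T is essentially pure Gaussian noise, which is what makes sampling from random noise at inference time valid.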

Figure 23.1: The diffusion process. Forward: progressively add noise until the image becomes pure noise. Reverse: learn to denoise, generating images from random noise.

Latent Diffusion (Stable Diffusion)

Running diffusion directly in pixel space is computationally expensive. Stable Diffusion solves this by operating in the latent space of a pre-trained variational autoencoder (VAE). The image is first encoded into a compact latent representation (typically 64x64 instead of 512x512), diffusion happens in this smaller space, and the result is decoded back to pixel space. This reduces computation by roughly 50x while maintaining image quality.

import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load Stable Diffusion XL with an optimized scheduler
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Generate with classifier-free guidance
prompt = "A serene Japanese garden with cherry blossoms, watercolor style"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,       # Higher = more prompt adherence
    width=1024,
    height=1024,
).images[0]

image.save("japanese_garden.png")
📘 Classifier-Free Guidance

Classifier-free guidance (CFG) is the key technique that makes text-conditional diffusion work well. During training, the text condition is randomly dropped a fraction of the time, teaching the model to generate both conditionally and unconditionally. At inference, the model output is extrapolated away from the unconditional prediction: output = uncond + scale * (cond - uncond). Higher guidance scales produce images that more closely match the prompt but with less diversity. Typical values range from 5 to 15.
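The extrapolation formula from the callout is simple enough to write out directly. A minimal sketch (the tensors here are placeholders standing in for the model's conditional and unconditional noise predictions):

```python
import torch

def cfg_combine(uncond: torch.Tensor, cond: torch.Tensor, scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return uncond + scale * (cond - uncond)

# scale = 1.0 recovers the conditional prediction; scale = 0.0 the unconditional one.
uncond = torch.zeros(1, 4, 64, 64)  # placeholder unconditional noise prediction
cond = torch.ones(1, 4, 64, 64)     # placeholder conditional noise prediction
guided = cfg_combine(uncond, cond, scale=7.5)  # pushed past the conditional output
```

Note that scales above 1.0 move the output *beyond* the conditional prediction, which is why high guidance scales over-saturate images and reduce diversity.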

Flow Matching and Rectified Flows

Flow matching is a newer paradigm that learns a velocity field transporting samples from a noise distribution to the data distribution along straight paths. Unlike diffusion models that follow curved trajectories through noise space, flow matching creates straighter paths that require fewer sampling steps. Stable Diffusion 3 and Flux use this approach, achieving higher quality with fewer inference steps (often 4 to 8 steps instead of 20 to 50 for traditional diffusion).
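The straight-path idea can be sketched with pure tensor math. Under the common conditional flow-matching setup, a training point is a linear interpolant between a noise sample and a data sample, and the regression target is the constant velocity along that line (function and variable names here are illustrative):

```python
import torch

def fm_training_pair(x1: torch.Tensor, t: float):
    """Sample a point on the straight noise->data path and its velocity target."""
    x0 = torch.randn_like(x1)    # noise sample
    xt = (1 - t) * x0 + t * x1   # linear interpolant between noise and data
    v_target = x1 - x0           # constant velocity along the straight path
    return xt, v_target

# With the exact velocity, one Euler step of size (1 - t) lands exactly on the data:
x1 = torch.randn(1, 3, 8, 8)    # stand-in for a data sample
xt, v = fm_training_pair(x1, t=0.25)
x_reconstructed = xt + 0.75 * v  # Euler integration of dx/dt = v from t=0.25 to t=1
```

Because the learned velocity field approximates these straight paths, a handful of large Euler steps suffices at inference, whereas a curved diffusion trajectory needs many small steps.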

# Using Flux (flow matching model) via the diffusers library
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Flux-schnell needs only 4 steps thanks to flow matching
image = pipe(
    prompt="A cyberpunk cityscape at sunset, neon reflections on wet streets",
    num_inference_steps=4,
    guidance_scale=0.0,   # Schnell uses guidance distillation
    height=1024,
    width=1024,
).images[0]

DALL-E and Midjourney

OpenAI's DALL-E 3 generates images by first using GPT-4 to rewrite user prompts into detailed descriptions, then feeding those descriptions to a diffusion model. This prompt rewriting step significantly improves output quality, because users often write terse prompts that lack the detail needed for good generation. Midjourney, while proprietary, is notable for its aesthetic quality and has become the benchmark for artistic image generation. Both systems demonstrate that engineering around the core model (prompt rewriting, aesthetic fine-tuning, safety filtering) is as important as the model architecture itself.

from openai import OpenAI

client = OpenAI()

# DALL-E 3 via the OpenAI API
response = client.images.generate(
    model="dall-e-3",
    prompt="An isometric illustration of a cozy bookshop with warm lighting",
    size="1024x1024",
    quality="hd",
    n=1,
)

image_url = response.data[0].url
revised_prompt = response.data[0].revised_prompt
print(f"Revised prompt: {revised_prompt}")
print(f"Image URL: {image_url}")

2. Controlled Image Generation

While text prompts offer creative flexibility, many applications require more precise spatial and stylistic control. ControlNet, IP-Adapter, and related techniques add conditioning signals beyond text, enabling edge-guided generation, pose transfer, style reference, and inpainting.

ControlNet

ControlNet adds spatial conditioning to a pre-trained diffusion model by attaching a trainable copy of the encoder blocks. The original model weights are frozen, and the ControlNet learns to inject spatial information (edges, depth maps, poses, segmentation masks) into the generation process. This allows precise control over the structure of generated images while preserving the model's learned generative capabilities.
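The "zero convolution" mentioned below is just a convolution whose weights and bias start at zero, so the ControlNet branch contributes nothing until training pushes it away from zero. A minimal sketch (the channel count 320 is illustrative):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 conv initialized to zero, so the control branch is a no-op at init."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

zc = zero_conv(320)
control_features = torch.randn(1, 320, 32, 32)  # stand-in for ControlNet encoder output
injected = zc(control_features)                 # all zeros before any gradient updates
```

This initialization is what lets ControlNet training start from the frozen model's exact behavior and gradually learn how much spatial conditioning to inject.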

Figure 23.2: ControlNet architecture. A trainable copy of the encoder processes the condition input (edges, depth, or pose) and injects spatial conditions into the frozen U-Net via zero convolutions, alongside the CLIP-encoded text prompt.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import cv2
import numpy as np
from PIL import Image

# Load a Canny edge ControlNet
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Extract edges from a reference image
ref_image = load_image("reference_building.jpg")
ref_np = np.array(ref_image)
edges = cv2.Canny(ref_np, 100, 200)
control_image = Image.fromarray(edges)

# Generate with edge guidance
result = pipe(
    prompt="A futuristic glass skyscraper, photorealistic",
    image=control_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,
).images[0]

IP-Adapter: Style and Subject Transfer

IP-Adapter (Image Prompt Adapter) enables using images as prompts alongside text. It works by encoding a reference image through CLIP's image encoder and injecting those features into the cross-attention layers of the diffusion model. This enables style transfer (generating images in the style of a reference) and subject consistency (maintaining the same character or object across multiple generations).
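The cross-attention injection can be sketched directly. IP-Adapter's core mechanism is decoupled cross-attention: the image prompt gets its own key/value projections, and its attention output is added to the text attention output with a tunable weight. This is a minimal sketch of that idea, assuming pre-projected K/V tensors (all shapes and the `ip_scale` default here are illustrative):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, text_kv, image_kv, ip_scale=0.6):
    """IP-Adapter-style attention: separate K/V for text and image prompts,
    summed with a tunable image weight (0 = text only)."""
    text_k, text_v = text_kv
    img_k, img_v = image_kv
    text_out = F.scaled_dot_product_attention(q, text_k, text_v)
    img_out = F.scaled_dot_product_attention(q, img_k, img_v)
    return text_out + ip_scale * img_out

B, H, Lq, D = 1, 8, 64, 64
q = torch.randn(B, H, Lq, D)                                    # latent image queries
text_kv = (torch.randn(B, H, 77, D), torch.randn(B, H, 77, D))  # 77 text tokens
img_kv = (torch.randn(B, H, 4, D), torch.randn(B, H, 4, D))     # CLIP image tokens
out = decoupled_cross_attention(q, text_kv, img_kv)
```

In practice you rarely implement this yourself: the diffusers library exposes it through `pipe.load_ip_adapter(...)` and `pipe.set_ip_adapter_scale(...)`, after which you pass the reference image as `ip_adapter_image=` when calling the pipeline.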

🔍 Key Insight

The shift from text-only to multi-signal conditioning (text + edges + depth + style reference) transforms diffusion models from creative toys into production tools. ControlNet preserves spatial structure, IP-Adapter transfers style and identity, and combining them gives designers the precise control needed for commercial workflows. The same base model serves radically different use cases through different conditioning signals.

3. Vision Encoders: Bridging Pixels and Language

While diffusion models generate images from text, vision encoders work in the opposite direction: they convert images into representations that language models can understand. The evolution from ViT to CLIP to SigLIP has produced increasingly powerful visual representations that form the backbone of modern multimodal AI.

Vision Transformer (ViT)

The Vision Transformer applies the transformer architecture directly to images by splitting them into fixed-size patches (typically 16x16 or 14x14 pixels), flattening each patch into a vector, and processing the sequence of patch embeddings with standard transformer layers. A special [CLS] token aggregates information across all patches, producing a single vector representation of the entire image. ViT demonstrated that transformers, originally designed for text, work remarkably well for vision when trained on enough data.
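The patch-splitting step can be expressed as a pure tensor reshape. A minimal sketch for 16x16 patches on a 224x224 RGB image, which yields the familiar 196-token sequence (the linear projection and [CLS] token that follow in a real ViT are omitted):

```python
import torch

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into a (B, N, patch*patch*C) sequence of flat patches."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x

imgs = torch.randn(2, 3, 224, 224)
tokens = patchify(imgs)  # (2, 196, 768): 14x14 grid of flattened 16x16x3 patches
```

Each of these 768-dimensional vectors is then linearly projected and fed to standard transformer layers, exactly as token embeddings are in a text model.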

CLIP: Contrastive Language-Image Pre-training

CLIP jointly trains a vision encoder and a text encoder to embed images and their textual descriptions into a shared vector space. Matching image-text pairs are pushed close together while non-matching pairs are pushed apart. Trained on 400 million image-text pairs scraped from the internet, CLIP learns visual concepts from natural language supervision, enabling zero-shot image classification, image search, and serving as the text encoder for diffusion models like Stable Diffusion.
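The contrastive objective is a symmetric cross-entropy over the batch's similarity matrix, where the diagonal entries are the matching pairs. A minimal sketch with random stand-in embeddings (the embedding size and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (B, B) cosine similarities
    targets = torch.arange(len(logits))          # diagonal = matching pairs
    loss_i = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i + loss_t) / 2

img = torch.randn(8, 512)  # stand-in image embeddings
txt = torch.randn(8, 512)  # stand-in text embeddings
loss = clip_loss(img, txt)
```

Minimizing this loss pulls each image toward its own caption and away from the other captions in the batch, which is why large batch sizes matter for CLIP-style training.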

Figure 23.3: CLIP architecture. Image and text encoders are trained contrastively to produce aligned embeddings in a shared space, maximizing similarity for matching image-text pairs.

Image Classification with CLIP

CLIP's shared embedding space enables zero-shot image classification without task-specific training. You encode the image and a set of candidate text labels, then find the label whose embedding is closest to the image embedding. This generalizes across domains without fine-tuning.

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Cosine similarity between image and each label
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob.item():.3f}")
a photo of a cat: 0.923
a photo of a dog: 0.065
a photo of a car: 0.012

4. Vision-Language Models

Vision-language models (VLMs) go beyond CLIP's contrastive matching to enable full visual reasoning. These models accept images alongside text in their input and generate free-form text responses. They can describe images, answer questions about visual content, extract structured data from screenshots, and reason about spatial relationships. The rapid evolution of VLMs represents one of the most impactful recent developments in AI.

Comparison of Vision-Language Models

Model              | Vision Encoder           | Architecture        | Key Strength                                      | Access
GPT-4V / GPT-4o    | Proprietary              | Native multimodal   | Best general reasoning                            | API
Gemini 1.5 / 2.0   | Native                   | Natively multimodal | Long context (1M tokens), interleaved modalities  | API
Claude 3.5 Sonnet  | Proprietary              | Native multimodal   | Document/chart analysis                           | API
LLaVA 1.6          | CLIP ViT-L               | Projection + LLM    | Open-source, customizable                         | Open weights
Qwen-VL / Qwen2-VL | ViT + dynamic resolution | Projection + LLM    | Multi-image, video, grounding                     | Open weights
InternVL 2.5       | InternViT-6B             | Projection + LLM    | Leading open benchmark scores                     | Open weights

Using GPT-4V for Visual Reasoning

from openai import OpenAI
import base64

client = OpenAI()

# Encode image to base64
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this chart. What trends do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    "detail": "high"  # "high" for detailed analysis
                }
            }
        ]
    }],
    max_tokens=1000,
)

print(response.choices[0].message.content)

Open-Source VLMs with LLaVA

LLaVA (Large Language and Vision Assistant) is the most influential open-source VLM architecture. It connects a CLIP vision encoder to a language model through a simple projection layer (either a linear layer or a small MLP). The model is trained in two stages: first aligning the vision-language features using image-caption pairs, then fine-tuning on visual instruction data. This simple and effective design has spawned many variants and remains the template for most open-source VLMs.
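The projection layer at the heart of this design is remarkably small. A minimal sketch of a LLaVA-1.5-style two-layer MLP projector (the dimensions below are illustrative: 1024 for CLIP ViT-L features, 4096 for a 7B-class LLM's embedding size):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's token-embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (B, num_patches, vision_dim) -> (B, num_patches, llm_dim)
        return self.proj(patch_features)

projector = VisionProjector()
clip_features = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from CLIP ViT-L @ 336px
visual_tokens = projector(clip_features)   # ready to interleave with text embeddings
```

The projected patch features are simply inserted into the LLM's input sequence in place of the `<image>` placeholder, after which the language model treats them like any other tokens.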

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("receipt.jpg")
prompt = "[INST] <image>\nExtract the total amount and date from this receipt. [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)
result = processor.decode(output[0], skip_special_tokens=True)
print(result)
⚠ Important Limitation

Vision-language models can hallucinate visual details that are not present in the image. They may confidently describe objects that do not exist, misread text in images, or fabricate numerical values from charts. For high-stakes applications (medical imaging, document extraction, autonomous driving), always validate VLM outputs against ground truth or use them as suggestions that require human verification rather than trusted outputs.

Gemini: Natively Multimodal

Google's Gemini models represent a different approach to multimodal AI. Rather than bolting a vision encoder onto a language model, Gemini is trained from the ground up on interleaved text, image, audio, and video data. This native multimodal training enables capabilities that are difficult to achieve with adapter-based approaches, such as understanding spatial relationships across multiple images, processing long videos, and handling interleaved image-text sequences. Gemini 2.0 extends this to native image generation, combining understanding and generation in a single model.

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Multi-image reasoning
img1 = Image.open("before.jpg")
img2 = Image.open("after.jpg")

response = model.generate_content([
    "Compare these two images and describe what changed:",
    img1,
    img2,
])
print(response.text)
📘 Native vs. Adapter Multimodal

There are two philosophies for building multimodal models. The adapter approach (LLaVA, Qwen-VL) takes a strong text-only LLM and adds vision through projection layers. This is modular and leverages existing LLM capabilities. The native approach (Gemini, GPT-4o) trains on all modalities from the start. This theoretically allows deeper cross-modal understanding but requires enormously more training data and compute. In practice, both approaches produce strong results, and the best choice depends on your specific requirements for customization, deployment, and supported modalities.

Knowledge Check

1. Why does Stable Diffusion operate in latent space rather than pixel space?
Answer:
Operating in the latent space of a pre-trained VAE reduces the spatial dimensions dramatically (e.g., from 512x512 to 64x64), cutting computation by roughly 50x while maintaining image quality. The VAE encoder compresses images into a compact representation where diffusion can happen efficiently, and the VAE decoder reconstructs full-resolution images from the denoised latents.
2. What is classifier-free guidance and why does it improve prompt adherence?
Answer:
Classifier-free guidance trains the diffusion model to generate both conditionally (with the text prompt) and unconditionally (without it) by randomly dropping the text condition during training. At inference, the output is extrapolated away from the unconditional prediction: output = uncond + scale * (cond - uncond). Higher guidance scales amplify the difference between conditional and unconditional outputs, producing images that more closely match the prompt at the cost of reduced diversity.
3. How does ControlNet add spatial conditioning without degrading the base model?
Answer:
ControlNet creates a trainable copy of the diffusion model's encoder blocks while keeping the original model weights completely frozen. The copy processes the conditioning signal (edges, depth, poses) and injects its outputs into the frozen model through zero-initialized convolution layers. Because the zero convolutions start at zero output, they initially have no effect on the base model, and the training process gradually learns how much spatial conditioning to inject.
4. How does CLIP enable zero-shot image classification?
Answer:
CLIP trains an image encoder and a text encoder to produce embeddings in a shared vector space where matching image-text pairs are close together. For zero-shot classification, you encode the image and a set of candidate text labels (e.g., "a photo of a cat"), compute cosine similarity between the image embedding and each label embedding, and select the label with the highest similarity. No task-specific training or fine-tuning is needed.
5. What is the key architectural difference between LLaVA-style and Gemini-style multimodal models?
Answer:
LLaVA uses an adapter approach: it takes a pre-trained text LLM and connects a CLIP vision encoder through a learned projection layer, training only the projection and (optionally) fine-tuning the LLM. Gemini uses a native approach: it is trained from scratch on interleaved text, image, audio, and video data, so multimodal understanding is built into the model's core rather than added as an afterthought. The adapter approach is modular and customizable; the native approach potentially enables deeper cross-modal reasoning.

Key Takeaways