Theoretical Background

1. Document Role

This document provides the minimum theoretical grounding needed to understand the system design without assuming deep specialization in diffusion modeling.

It explains:

Related documents:

2. Diffusion at a High Level

A text-to-image diffusion model generates an image by starting from noise and iteratively denoising a latent representation while conditioning each step on text-derived features.

A simplified pipeline is:

  1. tokenize the prompt
  2. encode tokens into text embeddings
  3. sample a latent noise tensor
  4. iteratively denoise using a conditional U-Net
  5. decode the final latent with a VAE

The important design consequence is that the model is conditioned on embeddings, not directly on raw text.
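
The data flow of the five steps above can be sketched with toy stand-ins. Everything in this block is an assumption made for illustration: the function names are hypothetical, the "encoder", "U-Net", and "VAE" are trivial arithmetic placeholders, and no real model is involved. The only point is that the denoising loop sees embeddings and never sees the prompt string.

```python
import random

def tokenize(prompt):
    # Toy tokenizer: lowercase whitespace split (real systems use subword tokenizers).
    return prompt.lower().split()

def encode(tokens, dim=4):
    # Toy text encoder: one repeatable pseudo-random vector per token.
    embeddings = []
    for tok in tokens:
        rng = random.Random(tok)  # seeded by the token string
        embeddings.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return embeddings

def sample_latent(seed, dim=4):
    # Initial latent noise, fully determined by the seed.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def denoise_step(latent, embeddings, rate=0.2):
    # Placeholder for the conditional U-Net: move the latent toward the
    # mean token embedding. Assumes a non-empty prompt.
    dim = len(latent)
    target = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return [x + rate * (t - x) for x, t in zip(latent, target)]

def decode(latent):
    # Placeholder for the VAE decoder: squash values into [0, 1] "pixels".
    return [min(1.0, max(0.0, 0.5 + 0.5 * x)) for x in latent]

def generate(prompt, seed, steps=20):
    embeddings = encode(tokenize(prompt))   # steps 1-2
    latent = sample_latent(seed)            # step 3
    for _ in range(steps):                  # step 4
        latent = denoise_step(latent, embeddings)
    return decode(latent)                   # step 5
```

Because `generate` only passes `embeddings` into the loop, any system that can edit those embeddings can influence the output without touching the prompt text.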

3. Why Prompt Rewriting Is Hard

The visible prompt string is discrete, but the model's control signal is continuous.

That mismatch matters:

Prompt rewriting remains useful, but it is a blunt control surface for iterative local search.

4. Prompt Embeddings as Control Objects

For this project, the prompt-conditioned embedding is treated as the editable object of interest.

A prompt embedding is usually a sequence of token-level vectors rather than one single vector. Practical steering may therefore operate on:

This distinction matters because different representations trade off expressiveness, cost, interpretability, and stability.

5. Low-Dimensional Steering

Searching the full embedding tensor directly is a high-dimensional problem and tends to be unstable. A lower-dimensional steering parameterization gives the system a controllable search space.

One useful formulation is:

Advantages of searching over z:
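
To make the idea of a steering vector z concrete, here is a minimal sketch of one possible parameterization, e(z) = e0 + B z. This specific linear form is an assumption for illustration and is not claimed to be the formulation this project uses. The symbols e0 (the flattened baseline prompt embedding) and B (a fixed random basis), and the names `make_basis` and `steer`, are hypothetical; only z comes from the text above.

```python
import random

def make_basis(embed_dim, steer_dim, seed=0):
    # Hypothetical fixed basis B with shape (embed_dim, steer_dim).
    # Entries are scaled so that a unit-sized z gives a moderate offset.
    rng = random.Random(seed)
    scale = 1.0 / embed_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(steer_dim)]
            for _ in range(embed_dim)]

def steer(e0, basis, z):
    # Assumed parameterization: e(z) = e0 + B z.
    # e0 is the baseline embedding flattened to a plain list of floats.
    return [base + sum(b * zj for b, zj in zip(row, z))
            for base, row in zip(e0, basis)]
```

Under this assumption, z = 0 reproduces the original embedding exactly, and the search runs over `steer_dim` numbers rather than over every entry of the embedding tensor.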

6. Why Local Search Is Reasonable

The system is not attempting unrestricted global optimization over all possible prompts. It is attempting local improvement around a user-provided intent.

That makes local search reasonable because:

The purpose is controlled adaptation, not unconstrained generation.
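
A minimal sketch of local improvement around a starting point is shown below. It assumes a hypothetical `score` callable that returns a number for a steering vector. In the real setting that signal would come from human feedback and would be noisy, so this shows only the shape of the loop, not the system's update rule.

```python
import random

def local_search(score, z0, step_size=0.1, iterations=50, seed=0):
    # Simple hill climbing: propose a small random perturbation of the
    # current best point and keep it only if the score improves.
    rng = random.Random(seed)
    best_z, best_score = list(z0), score(z0)
    for _ in range(iterations):
        candidate = [v + rng.gauss(0.0, step_size) for v in best_z]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_z, best_score = candidate, candidate_score
    return best_z, best_score
```

Because every proposal is a small step from the current best point, the search stays near the starting intent unless the feedback keeps pulling it further away.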

7. Preference Learning Framing

The system is not predicting a ground-truth target image. Instead, it tries to infer a latent user utility function from observed responses.

This is naturally a preference-learning problem.

Common feedback forms include:

The reward is never directly observed. It must be inferred from noisy, partial, and sometimes inconsistent feedback.
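
One common way to infer a latent utility from pairwise choices is the Bradley-Terry model, in which the probability that item a is preferred over item b is sigmoid(u_a - u_b). The sketch below fits such utilities by gradient ascent. The choice of this model, the function names, and the small L2 penalty are all assumptions for illustration; the document does not state which preference model the system uses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(n_items, comparisons, lr=0.1, epochs=500, l2=0.1):
    # comparisons: list of (winner_index, loser_index) pairs.
    # Returns one utility per item. The L2 term keeps utilities finite
    # when some item never wins or never loses.
    u = [0.0] * n_items
    for _ in range(epochs):
        grad = [-l2 * ui for ui in u]
        for winner, loser in comparisons:
            p = sigmoid(u[winner] - u[loser])  # modelled P(winner preferred)
            grad[winner] += 1.0 - p            # d log p / d u_winner
            grad[loser] -= 1.0 - p             # d log p / d u_loser
        u = [ui + lr * g for ui, g in zip(u, grad)]
    return u
```

Inconsistent feedback, such as a user who prefers item 0 over item 1 twice and the reverse once, simply produces a smaller utility gap instead of breaking the fit.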

8. Exploration and Exploitation

Interactive steering is a sequential decision problem.

A system that only exploits can converge prematurely. A system that only explores may waste user attention.

One research objective of the platform is to compare policies for balancing these pressures under real human feedback.
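
As an illustration of what such a policy can look like, the sketch below implements epsilon-greedy selection over a set of candidates with running mean reward estimates. Epsilon-greedy is chosen here only because it is among the simplest policies; it is an assumption for the example, and the names `choose_candidate` and `update_estimate` are hypothetical.

```python
import random

def choose_candidate(estimates, epsilon, rng):
    # With probability epsilon, explore: pick any candidate uniformly.
    # Otherwise exploit: pick the candidate with the highest estimate.
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update_estimate(estimates, counts, index, reward):
    # Incremental running mean of the rewards observed for one candidate.
    counts[index] += 1
    estimates[index] += (reward - estimates[index]) / counts[index]
```

Setting epsilon to 0 gives a purely exploiting policy and setting it to 1 gives a purely exploring one, which are the two failure modes described above.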

9. Seed Sensitivity

Diffusion generation is stochastic. Two images produced from the same embedding can still differ substantially because of seed variation.

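That introduces a core identification problem: when two images differ, a single comparison cannot show whether the steering change or the seed caused the difference.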

This is why the system must support:
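
One standard way to separate a steering effect from seed noise is to evaluate both conditions on the same set of seeds and average the per-seed differences. The sketch below demonstrates this on a toy score. The score function, the additive noise model, and both function names are assumptions for illustration and do not describe the system's actual evaluation procedure.

```python
import random
import statistics

def toy_score(z, seed):
    # Hypothetical score: a steering effect (z itself) plus noise that
    # depends only on the seed.
    rng = random.Random(seed)
    return z + rng.gauss(0.0, 1.0)

def paired_difference(z_a, z_b, seeds):
    # Compare the two steering values on identical seeds so that the
    # seed-dependent part of the score is shared between conditions.
    diffs = [toy_score(z_b, s) - toy_score(z_a, s) for s in seeds]
    return statistics.mean(diffs)
```

In this toy the noise is purely additive, so sharing seeds cancels it exactly. With a real diffusion model the seed and the embedding interact, so sharing seeds reduces variance but does not remove it.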

10. Trust Regions and Anchoring

Large movements in steering space may drift far from the user's original intent.

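Two stabilizers are therefore important: a trust region, which bounds how far an update may move in steering space, and anchoring, which keeps the search tied to the user's original prompt.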

Together they help the system search locally while preserving semantic coherence.
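
Both ideas can be sketched in a few lines. The sketch assumes a steering vector z in which z = 0 corresponds to the original prompt, a Euclidean ball as the trust region, and a quadratic anchoring penalty. These choices and the function names are assumptions for illustration, not the system's specified mechanisms.

```python
import math

def clip_to_trust_region(z_current, z_proposed, radius):
    # Shrink the proposed step so it never moves further than `radius`
    # (Euclidean distance) from the current point.
    step = [p - c for p, c in zip(z_proposed, z_current)]
    norm = math.sqrt(sum(s * s for s in step))
    if norm <= radius:
        return list(z_proposed)
    scale = radius / norm
    return [c + scale * s for c, s in zip(z_current, step)]

def anchored_score(raw_score, z, anchor_weight):
    # Penalise squared distance from z = 0, assumed here to represent
    # the embedding of the original prompt.
    return raw_score - anchor_weight * sum(v * v for v in z)
```

The trust region limits each individual step, while the anchor term limits cumulative drift across many steps, so the two controls complement each other.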

11. Multiple Representation and Update Choices

There is no reason to assume one representation or one update rule is universally best.

The platform should compare alternatives such as:

This comparative framing is central to the research value of the system.

12. Practical Implications for Design

The theoretical framing leads directly to several engineering consequences:

These are not implementation conveniences. They are necessary to make the research claims testable.

13. Limits of the Theory

This document does not claim that embedding movement is always semantically smooth or always easier than prompt editing.

Known limitations include:

The system exists partly to measure these limits rather than assume them away.

14. Summary

The theoretical justification for the project is straightforward: diffusion models consume continuous text-conditioning embeddings, user intent is best modeled through preference feedback, and controlled local search in a low-dimensional steering space provides a plausible way to study interactive steering rigorously.