Theoretical Background
1. Document Role
This document provides the minimum theoretical grounding needed to understand the system design without assuming deep specialization in diffusion modeling.
It explains:
- how text conditioning affects image generation
- why embedding-space steering is plausible
- why low-dimensional control is useful
- why preference learning is the right framing
2. Diffusion at a High Level
A text-to-image diffusion model generates an image by starting from noise and iteratively denoising a latent representation while conditioning each step on text-derived features.
A simplified pipeline is:
- tokenize the prompt
- encode tokens into text embeddings
- sample a latent noise tensor
- iteratively denoise using a conditional U-Net
- decode the final latent with a VAE
The important design consequence is that the model is conditioned on embeddings, not directly on raw text.
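The pipeline above can be sketched with toy stand-ins to show where the embedding enters. Everything here is an assumption for illustration: the vocabulary, the random embedding table, the `denoise_step` function standing in for a conditional U-Net, and the reshape standing in for a VAE decoder are not real model components.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"a": 0, "red": 1, "fox": 2, "<unk>": 3}       # hypothetical toy vocabulary
EMBED_TABLE = rng.standard_normal((len(VOCAB), 8))     # stand-in for a text encoder
W_COND = rng.standard_normal((8, 16))                  # stand-in conditioning projection


def tokenize(prompt):
    """Map whitespace-separated words to token ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in prompt.lower().split()]


def encode(token_ids):
    """Return one embedding vector per token, shape (num_tokens, 8)."""
    return EMBED_TABLE[token_ids]


def denoise_step(latent, text_emb, t):
    """Toy stand-in for a conditional U-Net step: pull the latent toward
    a target that depends only on the text embedding."""
    target = text_emb.mean(axis=0) @ W_COND
    return latent + (target - latent) / (t + 2)


def decode(latent):
    """Toy stand-in for a VAE decoder."""
    return latent.reshape(4, 4)


def generate(text_emb, seed, steps=10):
    """Note the signature: generation sees embeddings and a seed, never raw text."""
    latent = np.random.default_rng(seed).standard_normal(16)
    for t in reversed(range(steps)):
        latent = denoise_step(latent, text_emb, t)
    return decode(latent)


emb = encode(tokenize("a red fox"))
image = generate(emb, seed=42)
```

Because `generate` takes an embedding rather than a string, any edit to `emb` changes the output without touching the prompt text.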
3. Why Prompt Rewriting Is Hard
The visible prompt string is discrete, but the model control signal is continuous.
That mismatch matters:
- a tiny wording change can move the conditioning signal in a non-smooth way
- many visual intents do not map cleanly to words
- the same prompt may behave differently under different seeds
- users often discover their intent only after seeing generated outputs
Prompt rewriting remains useful, but it is a blunt control surface for iterative local search.
4. Prompt Embeddings as Control Objects
For this project, the prompt-conditioned embedding is treated as the editable object of interest.
A prompt embedding is usually a sequence of token-level vectors rather than one single vector. Practical steering may therefore operate on:
- the full token embedding tensor
- a pooled representation
- a structured low-rank offset
- selected token subsets
This distinction matters because different representations trade off expressiveness, cost, interpretability, and stability.
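As a rough illustration of that trade-off, the sketch below builds each of the four kinds of offset on a random stand-in embedding and counts its free parameters. The shape (77 tokens, 768 dimensions) is an assumed CLIP-like size, and treating the pooled option as one shared vector broadcast to every token is only one possible reading.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 77, 768            # assumed shape, for illustration only
E0 = rng.standard_normal((num_tokens, dim))   # stand-in for the encoder output

# 1. Full token embedding tensor: every entry is free.
full_offset = 0.01 * rng.standard_normal((num_tokens, dim))
E_full = E0 + full_offset

# 2. Pooled: a single vector, broadcast to all tokens.
pooled_offset = 0.01 * rng.standard_normal(dim)
E_pooled = E0 + pooled_offset

# 3. Structured low-rank offset: A (tokens x r) times B (r x dim).
rank = 4
A = 0.01 * rng.standard_normal((num_tokens, rank))
B = rng.standard_normal((rank, dim))
E_lowrank = E0 + A @ B

# 4. Selected token subset: only some rows move.
token_ids = [3, 4, 5]                # hypothetical token positions
E_subset = E0.copy()
E_subset[token_ids] += 0.01 * rng.standard_normal((len(token_ids), dim))

free_params = {
    "full": full_offset.size,
    "pooled": pooled_offset.size,
    "low_rank": A.size + B.size,
    "subset": len(token_ids) * dim,
}
print(free_params)
```

The parameter counts differ by more than an order of magnitude, which is one concrete way the cost and expressiveness trade-off shows up.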
5. Low-Dimensional Steering
Direct search over the full embedding tensor is high-dimensional and unstable. A lower-dimensional steering parameterization gives the system a controllable search space.
One useful formulation is:
- E0: the base prompt embedding
- U: a basis of steering directions
- z: a low-dimensional steering code
- E(z) = E0 + U z: the active conditioned embedding
Advantages of searching over z:
- fewer degrees of freedom
- easier optimization
- easier uncertainty estimation
- more interpretable trajectories
- simpler comparison between update rules
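A minimal sketch of the formulation E(z) = E0 + U z, assuming the embedding is flattened so that U has one column per steering direction. The random orthonormal basis built with QR is an illustrative choice, not a claim about which basis the system uses; the helper names `make_basis` and `steer` are hypothetical.

```python
import numpy as np


def make_basis(flat_dim, k, rng):
    """Return a (flat_dim, k) matrix with orthonormal columns.
    A random basis is used here; a structured basis could be swapped in."""
    q, _ = np.linalg.qr(rng.standard_normal((flat_dim, k)))
    return q


def steer(E0, U, z):
    """Compute E(z) = E0 + U z, reshaped back to the shape of E0."""
    return E0 + (U @ z).reshape(E0.shape)


rng = np.random.default_rng(0)
E0 = rng.standard_normal((8, 16))          # toy embedding: 8 tokens, 16 dims
U = make_basis(E0.size, k=4, rng=rng)
z = np.array([0.5, 0.0, -0.25, 0.0])       # the search happens over these 4 numbers

E = steer(E0, U, z)
```

With orthonormal columns, the distance moved in embedding space equals the norm of z, which makes step sizes in z directly interpretable.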
6. Why Local Search Is Reasonable
The system is not attempting unrestricted global optimization over all possible prompts. It is attempting local improvement around a user-provided intent.
That makes local search reasonable because:
- the user already provides a semantic starting point
- nearby movements are more likely to preserve intent
- smaller steps are easier to interpret
- local updates are easier to constrain and replay
The purpose is controlled adaptation, not unconstrained generation.
7. Preference Learning Framing
The system is not predicting a ground-truth target image. Instead, it tries to infer a latent user utility function from observed responses.
This is naturally a preference-learning problem.
Common feedback forms include:
- scalar rating
- pairwise preference
- partial ranking
- shortlist selection
- free-text critique
The reward is never directly observed. It must be inferred from noisy, partial, and sometimes inconsistent feedback.
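To make the inference step concrete, here is a sketch for the pairwise-preference case. It assumes a linear utility u(z) = w . z and a Bradley-Terry (logistic) choice model, fitted by gradient ascent on a regularized log-likelihood. Both the model and the simulated user are assumptions chosen for brevity.

```python
import numpy as np


def fit_bradley_terry(winners, losers, steps=500, lr=0.5, l2=1e-2):
    """Fit w so that P(a preferred over b) = sigmoid(w . (z_a - z_b))."""
    diffs = winners - losers                       # (n_pairs, k)
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p_win = 1.0 / (1.0 + np.exp(-diffs @ w))   # predicted P(observed winner wins)
        grad = diffs.T @ (1.0 - p_win) / len(diffs) - l2 * w
        w += lr * grad
    return w


# Simulate a noisy user whose hidden utility is linear in z.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
z_a = rng.standard_normal((400, 3))
z_b = rng.standard_normal((400, 3))
p_a = 1.0 / (1.0 + np.exp(-(z_a - z_b) @ true_w))
a_wins = rng.random(400) < p_a                     # choices are stochastic, not exact

winners = np.where(a_wins[:, None], z_a, z_b)
losers = np.where(a_wins[:, None], z_b, z_a)

w_hat = fit_bradley_terry(winners, losers)
cosine = w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w))
```

The fit only ever sees which option won each pair, never the utility values themselves, which mirrors the point that the reward is inferred rather than observed.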
8. Exploration and Exploitation
Interactive steering is a sequential decision problem.
- Exploitation means sampling near the currently estimated best direction.
- Exploration means sampling uncertain or diverse directions that may reveal better outcomes.
A system that only exploits can converge prematurely. A system that only explores may waste user attention.
One research objective of the platform is to compare policies for balancing these pressures under real human feedback.
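One candidate policy for such a comparison is an upper-confidence-bound score, sketched below. It assumes the preference model exposes a mean and covariance over linear utility weights; `w_mean`, `w_cov`, and `beta` are illustrative names, and this is only one of many possible policies.

```python
import numpy as np


def ucb_scores(candidates, w_mean, w_cov, beta):
    """Score = estimated utility + beta * uncertainty of that estimate.
    beta = 0 is pure exploitation; larger beta favors uncertain directions."""
    mean = candidates @ w_mean
    var = np.einsum("ij,jk,ik->i", candidates, w_cov, candidates)
    return mean + beta * np.sqrt(var)


w_mean = np.array([1.0, 0.0])            # current estimate: direction 0 is good
w_cov = np.diag([0.01, 4.0])             # direction 1 is barely explored
candidates = np.array([
    [1.0, 0.0],                          # along the estimated best direction
    [0.0, 1.0],                          # along the uncertain direction
    [0.5, 0.5],                          # a mix of both
])

exploit_pick = int(np.argmax(ucb_scores(candidates, w_mean, w_cov, beta=0.0)))
explore_pick = int(np.argmax(ucb_scores(candidates, w_mean, w_cov, beta=2.0)))
```

Changing `beta` alone switches the selected candidate from the known-good direction to the uncertain one, so it is a natural knob to vary across experimental conditions.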
9. Seed Sensitivity
Diffusion generation is stochastic. Two images produced from the same embedding can still differ substantially because of seed variation.
That introduces a core identification problem:
- some quality changes come from embedding movement
- some quality changes come from random seed variation
This is why the system must support:
- same-seed within-round comparisons
- alternate-seed validation
- robustness metrics across seeds
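The toy sketch below illustrates why same-seed comparison helps. `toy_score` is a hypothetical stand-in for "generate an image and rate it", and it assumes the seed effect is purely additive; in a real generator, seed and embedding interact, which is why alternate-seed validation is still needed.

```python
import numpy as np


def toy_score(z, seed):
    """Hypothetical quality score: an embedding term plus a seed term."""
    seed_effect = np.random.default_rng(seed).normal(scale=1.0)
    return -np.sum((z - 1.0) ** 2) + seed_effect


z_a = np.zeros(2)
z_b = np.array([0.5, 0.5])
seeds = [11, 12, 13, 14]

# Same seed for both candidates: the seed term cancels in the difference.
same_seed_diffs = np.array([toy_score(z_b, s) - toy_score(z_a, s) for s in seeds])

# Different seed per candidate: seed noise contaminates the comparison.
cross_seed_diffs = np.array(
    [toy_score(z_b, s) - toy_score(z_a, s + 100) for s in seeds]
)

# A simple robustness metric: fraction of seeds on which z_b beats z_a.
win_rate = float(np.mean(same_seed_diffs > 0))
```

In this additive toy the same-seed differences are identical across seeds, while the cross-seed differences vary; a real system would report both, plus the win rate on held-out seeds.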
10. Trust Regions and Anchoring
Large movements in steering space may drift far from the user's original intent.
Two stabilizers are therefore important:
- trust region: limit step size per round
- anchor penalty: discourage excessive movement away from the origin z = 0
Together they help the system search locally while preserving semantic coherence.
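A minimal sketch of how the two stabilizers might combine in one update. It assumes a quadratic anchor penalty, whose gradient pulls z toward the origin, and a hard norm clip as the trust region. The function name and parameters are hypothetical.

```python
import numpy as np


def constrained_step(z, direction, step_radius, anchor_weight):
    """Combine the proposed direction with the anchor pull, then clip
    the total step so that it never exceeds step_radius."""
    step = direction - anchor_weight * z       # anchor term points back to z = 0
    norm = np.linalg.norm(step)
    if norm > step_radius:
        step = step * (step_radius / norm)     # trust region: rescale, keep direction
    return z + step


z = np.array([3.0, 4.0])

# A large proposed move is cut down to the trust-region radius.
z_clipped = constrained_step(z, np.array([10.0, 0.0]), step_radius=0.5, anchor_weight=0.1)

# With no proposed move, the anchor alone pulls z back toward the origin.
z_anchored = constrained_step(z, np.zeros(2), step_radius=0.5, anchor_weight=0.1)
```

Applying the clip after adding the anchor term keeps the per-round step bound intact regardless of how strong the anchor is.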
11. Multiple Representation and Update Choices
There is no reason to assume one representation or one update rule is universally best.
The platform should compare alternatives such as:
- pooled versus token-level steering
- random versus structured bases
- winner-copy versus averaged updates
- linear versus pairwise probabilistic preference models
- deterministic versus uncertainty-aware updates
This comparative framing is central to the research value of the system.
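As an example of one such pair, the sketch below contrasts a winner-copy update with an averaged update. The softmax weighting used for the averaged variant is an assumption; the source does not specify how the average is formed.

```python
import numpy as np


def winner_copy(candidates, ratings):
    """Move to the single highest-rated candidate."""
    return candidates[int(np.argmax(ratings))].copy()


def averaged_update(candidates, ratings, temperature=1.0):
    """Move to a rating-weighted mean of all candidates (softmax weights)."""
    logits = (ratings - np.max(ratings)) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ candidates


candidates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
ratings = np.array([1.0, 5.0, 4.0])

z_winner = winner_copy(candidates, ratings)
z_avg = averaged_update(candidates, ratings)
```

Winner-copy discards the information that the third candidate was rated almost as highly; the averaged update keeps it, at the cost of landing on a point the user never actually saw.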
12. Practical Implications for Design
The theoretical framing leads directly to several engineering consequences:
- the system must log seed policy explicitly
- the system must preserve full round trajectories
- the system must support interchangeable samplers and updaters
- the system must track uncertainty where possible
- the system must constrain steering movement
These are not implementation conveniences. They are necessary to make the research claims testable.
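One way these requirements could surface in code is a per-round log record, sketched below. Every field name and value is hypothetical; the point is only that seed policy, component choices, constraints, and the full trajectory are stored together and can be replayed.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RoundRecord:
    """Hypothetical schema for one round of interactive steering."""
    round_index: int
    seed_policy: str        # for example "same-seed" or "alternate-seed"
    seeds: list
    sampler: str            # name of the interchangeable sampler used
    updater: str            # name of the interchangeable updater used
    z_before: list
    candidates: list
    feedback: dict
    z_after: list
    step_radius: float      # trust-region size in effect this round
    anchor_weight: float    # anchor-penalty strength in effect this round


record = RoundRecord(
    round_index=0,
    seed_policy="same-seed",
    seeds=[42, 42],
    sampler="gaussian",
    updater="winner-copy",
    z_before=[0.0, 0.0],
    candidates=[[0.1, 0.0], [0.0, 0.1]],
    feedback={"type": "pairwise", "winner": 0},
    z_after=[0.1, 0.0],
    step_radius=0.5,
    anchor_weight=0.1,
)

line = json.dumps(asdict(record))       # one JSON line per round
restored = RoundRecord(**json.loads(line))
```

Appending one such line per round preserves the trajectory in a form that survives process restarts and can be diffed between runs.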
13. Limits of the Theory
This document does not claim that embedding movement is always semantically smooth or always easier than prompt editing.
Known limitations include:
- entangled latent directions
- non-linear effects in the generator
- instability under different prompts or checkpoints
- user inconsistency in expressed preference
The system exists partly to measure these limits rather than assume them away.
14. Summary
The theoretical justification for the project is straightforward: diffusion models consume continuous text-conditioning embeddings, user intent is best modeled through preference feedback, and controlled local search in a low-dimensional steering space provides a plausible way to study interactive steering rigorously.