Research System Specification: Interactive Prompt-Embedding Steering for Stable Diffusion

1. Purpose

This document specifies a research system for studying human-in-the-loop steering of text-to-image diffusion models by modifying prompt-conditioning embeddings rather than only rewriting the visible text prompt.

The system is intended for research, not production. It must make it easy to compare alternative steering representations (Section 11), candidate-sampling strategies (Section 12), feedback mechanisms (Section 13), update rules (Section 14), and seed policies (Section 15).

The system should expose these choices through a simple HTML interface and a modular backend so that experiments can be repeated, logged, and compared.


2. Motivation

2.1 Problem

Text-to-image diffusion models such as Stable Diffusion are highly sensitive to prompt wording, negative prompts, seed, guidance scale, scheduler, and other generation parameters. Small changes in wording can cause discontinuous changes in output, which makes user steering difficult.

2.2 Central research idea

Instead of treating the user prompt as a fixed string, the system treats the prompt-conditioned embedding as a searchable control object. It then performs an iterative loop:

  1. start from the initial prompt embedding
  2. generate several modified embedding candidates
  3. generate images for those candidates
  4. collect user feedback
  5. estimate a preferred direction or region in embedding space
  6. update the steering state
  7. repeat

This enables a controlled study of whether user preference can be learned through local search in a low-dimensional steering space.
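Concretely, the loop can be sketched in a few lines. All names here are illustrative, and a simulated user (who secretly prefers codes near a hidden target) stands in for real feedback:

```python
import random

random.seed(0)  # reproducibility for the sketch

TARGET = [0.8, -0.3]  # hidden user preference, unknown to the system

def dist2(z):
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))

def propose_candidates(z, n, radius=0.5):
    # Step 2: sample modified steering codes near the incumbent z.
    return [[zi + random.uniform(-radius, radius) for zi in z] for _ in range(n)]

def collect_feedback(candidates):
    # Step 4: the simulated user picks the candidate closest to the hidden target.
    scores = [-dist2(c) for c in candidates]
    return scores.index(max(scores))

def steer(rounds=10, n=4):
    z = [0.0, 0.0]                               # step 1: start from the base code
    for _ in range(rounds):
        cands = [z] + propose_candidates(z, n)   # steps 2-3 (incumbent kept in batch)
        winner = collect_feedback(cands)         # step 4
        z = cands[winner]                        # steps 5-6: update steering state
    return z
```

Keeping the incumbent in the batch makes the greedy loop monotone: the preferred point never gets worse between rounds.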

2.3 Research value

This system supports research questions such as:

2.4 Why a research platform is needed

Most image-generation interfaces optimize for convenience, not controlled experimentation. A research platform must instead provide:


3. Theoretical Background

This section is self-contained and written for readers who understand machine learning at a basic level but do not necessarily specialize in diffusion models.

3.1 How text-to-image diffusion works at a high level

A text-to-image diffusion model learns to generate an image by progressively denoising a random latent representation while being conditioned on a text prompt.

A simplified pipeline is:

  1. tokenize the text prompt
  2. encode the tokens into text embeddings
  3. sample a random latent noise tensor
  4. iteratively denoise the latent using a U-Net conditioned on the text embeddings
  5. decode the final latent into an image using a VAE decoder

The important point for this system is that the text prompt does not directly control the image. The model actually consumes a tensor of embeddings derived from the prompt.
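A minimal sketch of these five stages, with stub functions and toy tensor sizes in place of the real tokenizer, U-Net, and VAE, makes the data flow explicit; note that only the embeddings, never the raw prompt string, reach the denoising loop:

```python
import random

SEQ_LEN, EMB_DIM, LATENT = 4, 8, 16  # toy sizes; real models are far larger

def tokenize(prompt):                      # stage 1
    return prompt.lower().split()[:SEQ_LEN]

def encode(tokens):                        # stage 2: tokens -> embedding vectors
    random.seed(0)
    return [[random.gauss(0, 1) for _ in range(EMB_DIM)] for _ in tokens]

def sample_noise():                        # stage 3: random latent
    return [random.gauss(0, 1) for _ in range(LATENT)]

def denoise(latent, embeddings, steps=5):  # stage 4: stub for the U-Net loop
    cond = sum(map(sum, embeddings)) / (len(embeddings) * EMB_DIM)
    for _ in range(steps):
        latent = [0.9 * x + 0.1 * cond for x in latent]
    return latent

def decode(latent):                        # stage 5: stub for the VAE decoder
    return [max(0.0, min(1.0, 0.5 + 0.1 * x)) for x in latent]

def generate(prompt):
    tokens = tokenize(prompt)
    embeddings = encode(tokens)  # the tensor the model actually consumes
    return decode(denoise(sample_noise(), embeddings))
```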

3.2 Prompt text versus prompt embedding

The visible prompt string is discrete. The embedding is continuous.

That distinction matters:

A prompt embedding is typically a sequence of token-level vectors, not a single vector. For research purposes, the system may work with:

3.3 Why low-dimensional steering is useful

The full embedding tensor is high-dimensional, so searching it directly is expensive and unstable. A better approach is to define a low-dimensional steering code z together with a fixed mapping from z into embedding space (see Section 11, Mode A).

Now the system searches over z rather than over the full embedding space.
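A sketch of this parameterization with toy dimensions; in practice E0 would be the text-encoder output and the basis U would come from one of the strategies in Section 11.2:

```python
# Toy dimensions: a sequence of 3 token vectors of width 4, steering code of size 2.
SEQ, DIM, K = 3, 4, 2

E0 = [[0.1 * (i + j) for j in range(DIM)] for i in range(SEQ)]  # base embedding
U = [[[0.01 * (i + j + k) for j in range(DIM)] for i in range(SEQ)]
     for k in range(K)]                                         # K basis directions

def steered_embedding(z):
    # E(z) = E0 + sum_k z[k] * U[k]; the search happens over z, never over E.
    return [[E0[i][j] + sum(z[k] * U[k][i][j] for k in range(K))
             for j in range(DIM)] for i in range(SEQ)]
```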

Advantages:

3.4 Human preference learning

The system is not trying to predict a ground-truth target image. It is trying to infer what the user prefers.

This is a preference-learning problem. A preference model estimates a hidden reward or utility function from observed feedback.

Examples:

The reward function is not directly known. It is inferred from user responses.

3.5 Exploration versus exploitation

This is the core sequential decision problem.

If the system only exploits, it may converge too early to a mediocre local optimum. If it only explores, it wastes user effort and does not improve quickly.

A research goal of the platform is to compare policies for balancing the two.

3.6 Seed sensitivity

Text-to-image diffusion is stochastic. The random seed can change image content substantially even when the prompt embedding stays fixed.

Therefore, preference learning must separate the effect of the embedding from the effect of the seed.

The system must support explicit seed-control policies, specified in Section 15.

3.7 Trust region and anchoring

Large moves in embedding space may drift away from the user’s initial intention. The system therefore uses two stabilizers: a trust region that bounds how far each round may move, and an anchor that pulls the search back toward the original prompt embedding.

This lets the system search locally without losing semantic coherence.
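Both stabilizers reduce to simple vector operations on the steering code (function names and the interpolation form are illustrative):

```python
import math

def clip_to_trust_region(z, center, radius):
    # Keep the proposed code within `radius` of the round's starting point.
    delta = [a - b for a, b in zip(z, center)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return z
    scale = radius / norm
    return [b + scale * d for b, d in zip(center, delta)]

def anchor(z, z0, strength):
    # Pull the code back toward the original prompt code z0 (strength 0 = no pull).
    return [(1 - strength) * a + strength * b for a, b in zip(z, z0)]
```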

3.8 Why multiple update mechanisms should be compared

There is no reason to assume one update rule is best. A research platform should compare:


4. Research Goals and Non-Goals

4.1 Goals

The system must support controlled experiments on:

4.2 Non-goals

The initial version is not required to:


5. High-Level System Overview

The system consists of six major parts:

  1. Frontend: simple HTML interface for prompt entry, image display, feedback collection, and experiment controls
  2. Experiment Controller: orchestrates rounds, policies, and logs
  3. Generation Engine: calls the diffusion pipeline and handles prompt embeddings
  4. Sampling Module: proposes candidate steering codes or embedding offsets
  5. Preference / Update Module: learns from feedback and computes the next state
  6. Storage and Evaluation Layer: records experiments, metrics, artifacts, and replays

Data flow:

  1. user creates or loads an experiment
  2. prompt is encoded to base embedding
  3. current strategy proposes candidates
  4. engine generates images
  5. frontend displays them
  6. user provides feedback
  7. backend updates preference state
  8. next round starts

6. Core Research Abstractions

6.1 Experiment

An experiment is a fully specified configuration and all data generated under it.

Fields:

6.2 Session

A session is one interactive run of one experiment with one prompt and one user.

Fields:

6.3 Round

A round is one propose-generate-display-feedback-update cycle.

Fields:

6.4 Candidate

A candidate is one proposed point in steering space.

Fields:

6.5 Feedback event

A feedback event records one user action.

Fields:


7. Frontend Specification (Simple HTML Interface)

7.1 Design principles

The interface should be intentionally simple:

7.2 Main pages

A. Home / Experiment Dashboard

Purpose:

Main elements:

B. Session Setup Page

Inputs:

Actions:

C. Interactive Steering Page

Main layout:

Controls:

Image grid requirements:

Feedback widgets must be switchable by experiment mode:

D. Replay / Analysis Page

Purpose:

7.3 Frontend state model

The frontend should maintain:

7.4 Accessibility requirements

7.5 Frontend technology recommendation

Preferred initial stack:


8. Backend Specification

8.1 Technology stack

A suitable baseline stack is Python with FastAPI for the backend and plain HTML/CSS/JavaScript for the frontend, consistent with the project structure in Section 20 and the deliverables in Section 25.

8.2 Backend modules

A. API layer

Responsibilities:

B. Orchestrator

Responsibilities:

C. Embedding manager

Responsibilities:

D. Sampling manager

Responsibilities:

E. Generation manager

Responsibilities:

F. Preference/update manager

Responsibilities:

G. Evaluation manager

Responsibilities:


9. Data Model Specification

9.1 Core tables or collections

experiments

sessions

rounds

candidates

feedback_events

artifacts

9.2 File artifacts

Artifacts to store:


10. API Specification

10.1 Example REST endpoints

POST /experiments

Create a new experiment.

Request body: - name - description - config

Response: - experiment ID

GET /experiments

List experiments.

POST /sessions

Create a session from an experiment config.

Request body: - experiment ID or full config - prompt - negative prompt

Response: - session ID - initial state
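To make the request/response contract concrete, an illustrative exchange for POST /sessions is shown below; every field name beyond those listed above (in particular the shape of the initial state) is an assumption, not part of the specification.

```json
{
  "experiment_id": "exp-001",
  "prompt": "a watercolor fox in a misty forest",
  "negative_prompt": "blurry, low quality"
}
```

A successful response might then look like:

```json
{
  "session_id": "sess-42",
  "initial_state": {
    "round_index": 0,
    "steering_code": [0.0, 0.0, 0.0, 0.0],
    "seed_policy": "fixed_per_round"
  }
}
```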

POST /sessions/{session_id}/rounds/next

Generate the next round of candidates.

Response: - round ID - candidate metadata - image URLs - state summary

POST /rounds/{round_id}/feedback

Submit feedback for the round.

Request body: - feedback type - payload - optional critique text

Response: - update summary - next incumbent state

GET /sessions/{session_id}

Get full session summary.

GET /sessions/{session_id}/replay

Get ordered rounds and artifacts.

GET /sessions/{session_id}/export

Export logs, metrics, and artifacts manifest.


11. Steering Representation Specification

11.1 Required modes

The system must support at least three steering representations.

Mode A: Low-dimensional latent code

E(z) = E0 + U z

Recommended default for research.

Mode B: Token-level offset mode

Apply learned or sampled offsets to selected token embeddings.

Useful for analyzing more local control.

Mode C: Pooled embedding mode

Apply a simplified offset to a pooled representation.

Useful as a baseline, even if weaker.

11.2 Basis construction strategies

The system should support:

11.3 Constraints

Steering representation must support:


12. Sampling Strategy Specification

This is a main experimental axis. The system must make samplers plug-in based.

12.1 Sampler interface

Each sampler must implement:
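One possible shape for this plug-in interface, with the random local sampler of Section 12.2.A as a minimal implementation (all names here are illustrative, not prescribed):

```python
import random
from abc import ABC, abstractmethod

class Sampler(ABC):
    """Plug-in interface for candidate proposers."""

    @abstractmethod
    def propose(self, state, n):
        """Return n candidate steering codes given the current session state."""

class RandomLocalSampler(Sampler):
    """Section 12.2.A: uniform samples within a trust ball around the incumbent."""

    def __init__(self, radius=0.5, seed=None):
        self.rng = random.Random(seed)
        self.radius = radius

    def propose(self, state, n):
        z = state["incumbent"]
        return [[zi + self.rng.uniform(-self.radius, self.radius) for zi in z]
                for _ in range(n)]
```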

12.2 Required baseline samplers

A. Random local sampler

Sample directions uniformly or Gaussian within a trust ball.

Purpose: - sanity baseline

B. Exploit-plus-orthogonal sampler

Batch composition: - exploit near estimated best direction - refine around that direction - explore orthogonal directions - optional mirror check

Purpose: - strong baseline for interactive search
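A sketch of this batch composition, using Gram-Schmidt to obtain an exploration direction orthogonal to the estimated best direction (the step sizes and function names are illustrative):

```python
import math
import random

rng = random.Random(0)

def _norm(v):
    return math.sqrt(sum(x * x for x in v)) or 1.0

def _unit(v):
    n = _norm(v)
    return [x / n for x in v]

def exploit_orthogonal_batch(z, best_dir, step=0.3):
    """Compose one round's batch around the estimated best direction."""
    d = _unit(best_dir)
    # Random direction made orthogonal to d via Gram-Schmidt.
    r = [rng.gauss(0, 1) for _ in z]
    proj = sum(a * b for a, b in zip(r, d))
    ortho = _unit([a - proj * b for a, b in zip(r, d)])
    return [
        [zi + step * di for zi, di in zip(z, d)],        # exploit best direction
        [zi + 0.5 * step * di for zi, di in zip(z, d)],  # refine with smaller step
        [zi + step * oi for zi, oi in zip(z, ortho)],    # explore orthogonally
        [zi - step * di for zi, di in zip(z, d)],        # mirror check
    ]
```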

C. Uncertainty-guided sampler

Prefer candidates with high estimated uncertainty and adequate predicted utility.

Purpose: - active learning baseline

D. Thompson-style sampler

Sample from the posterior over reward parameters, then optimize under that sample.

Purpose: - principled exploration/exploitation tradeoff

E. Quality-diversity sampler

Generate candidates that are both strong and diverse across simple descriptors.

Purpose: - preserve multiple promising modes

12.3 Optional advanced samplers

12.4 Batch composition controls

Per round, the system should log and optionally enforce:


13. Feedback Mechanism Specification

This is another major experimental axis.

13.1 Unified internal feedback schema

All feedback must be normalized to a common event format. Even if the frontend collects ratings or rankings, the backend should be able to derive pairwise comparisons when useful.
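For example, a derivation of strict pairwise comparisons from scalar ratings might look like this (a sketch; ties produce no pair rather than being modeled):

```python
def ratings_to_pairs(ratings):
    """Derive pairwise comparisons (winner_id, loser_id) from scalar ratings.

    `ratings` maps candidate id -> score on any fixed scale. Ties are dropped,
    so downstream pairwise models only ever see strict preferences.
    """
    items = sorted(ratings.items(), key=lambda kv: -kv[1])
    pairs = []
    for i, (wid, wscore) in enumerate(items):
        for lid, lscore in items[i + 1:]:
            if wscore > lscore:
                pairs.append((wid, lid))
    return pairs
```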

13.2 Required feedback modes

A. Scalar ratings

User rates each image on a fixed scale.

Pros: - easy to collect

Cons: - noisy calibration

B. Pairwise comparison

User chooses preferred image between two candidates.

Pros: - clean signal

Cons: - may require many comparisons

C. Partial ranking

User ranks top-k candidates.

Pros: - more informative than single winner

D. Winner + critique

User selects best candidate and provides a short natural-language reason.

Pros: - can support directional interpretation later

E. Select-all-that-fit

User marks all acceptable candidates.

Pros: - useful when multiple modes are valid

13.3 Feedback quality controls

The platform should support:


14. Update Mechanism Specification

The update module takes session state and normalized feedback, and computes the next incumbent and preference state.

14.1 Update interface

Each updater must implement:

14.2 Required baseline updaters

A. Winner-copy updater

Set next incumbent to the winning candidate.

Purpose: - simplest baseline

B. Winner-average updater

Move partially toward top-rated or top-ranked candidates.

Purpose: - simple smooth update baseline
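A sketch, assuming the step size eta and the choice of top candidates come from the experiment config:

```python
def winner_average_update(z, top_candidates, eta=0.5):
    """Move the incumbent a fraction eta toward the mean of the top candidates."""
    k = len(top_candidates)
    mean = [sum(c[i] for c in top_candidates) / k for i in range(len(z))]
    return [(1 - eta) * zi + eta * mi for zi, mi in zip(z, mean)]
```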

C. Linear preference model updater

Fit a linear model on steering features and move along estimated reward gradient.

Purpose: - practical baseline

D. Bradley-Terry / pairwise logistic updater

Fit pairwise preference probabilities and derive next step from estimated utility.

Purpose: - strong baseline for pairwise data
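A minimal sketch of the fit, using gradient ascent on the Bradley-Terry log-likelihood over steering-space features (the learning rate, epoch count, and feature map are assumptions):

```python
import math

def fit_bradley_terry(pairs, features, dim, lr=0.5, epochs=200):
    """Fit w so that P(a beats b) = sigmoid(w . (x_a - x_b)).

    `pairs` is a list of (winner_id, loser_id); `features` maps id -> vector.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for win, lose in pairs:
            d = [a - b for a, b in zip(features[win], features[lose])]
            p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, d))))
            for i in range(dim):
                grad[i] += (1.0 - p) * d[i]   # gradient of the log-likelihood
        w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w
```

The next incumbent can then be derived by stepping along w, the estimated utility direction in steering space, subject to the stabilization controls of Section 14.4.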

E. Bayesian updater

Maintain posterior uncertainty and update using preference observations.

Purpose: - enables uncertainty-based sampling

14.3 Optional advanced updaters

14.4 Stabilization controls

Each updater should optionally support:


15. Seed Policy Specification

15.1 Required seed modes

A. Fixed-per-round

All candidates in the round share the same seed.

Purpose: - isolate embedding effect

B. Fixed-per-candidate-role

Validation candidates use alternate seeds while main comparison candidates remain fixed.

C. Multi-seed averaging

A candidate is rendered under multiple seeds and summarized.

Purpose: - robustness analysis
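Two of these modes can be sketched as a seed-assignment function (the mixing of master seed and round index is an arbitrary illustrative choice):

```python
import random

def assign_seeds(mode, n_candidates, round_index, n_seeds=3, master_seed=1234):
    """Return one list of seeds per candidate, according to the seed policy."""
    rng = random.Random(master_seed * 100003 + round_index)
    if mode == "fixed_per_round":
        shared = rng.randrange(2**32)
        return [[shared] for _ in range(n_candidates)]     # isolate embedding effect
    if mode == "multi_seed":
        seeds = [rng.randrange(2**32) for _ in range(n_seeds)]
        return [list(seeds) for _ in range(n_candidates)]  # robustness via averaging
    raise ValueError(f"unknown seed mode: {mode}")
```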

15.2 Seed logging requirements

For every candidate, log:


16. Evaluation and Metrics Specification

The platform must support both online and offline evaluation.

16.1 Interaction-level metrics

16.2 Optimization metrics

16.3 Robustness metrics

16.4 Diversity metrics

16.5 Drift metrics

16.6 Human-centered metrics


17. Logging and Reproducibility Specification

17.1 Mandatory logging

Every experiment must log:

17.2 Replay requirements

A session replay must reconstruct:

17.3 Versioning

Version the following independently:


18. Error Handling and Fault Tolerance

The system must handle:

Behavioral requirements:


19. Security and Privacy Notes

Because this is a research prototype, security requirements are modest but should not be ignored.

Minimum requirements:


20. Suggested Project Structure

project/
  app/
    api/
      routes_experiments.py
      routes_sessions.py
      routes_rounds.py
      routes_exports.py
    core/
      config.py
      logging.py
      schema.py
    engine/
      prompt_encoder.py
      steering_basis.py
      generation.py
      seeds.py
    samplers/
      base.py
      random_local.py
      exploit_orthogonal.py
      uncertainty.py
      thompson.py
      quality_diversity.py
    feedback/
      normalization.py
      validation.py
    updaters/
      base.py
      winner_copy.py
      winner_average.py
      linear_pref.py
      bradley_terry.py
      bayesian.py
    evaluation/
      metrics.py
      replay.py
      reports.py
    storage/
      db.py
      models.py
      repository.py
    frontend/
      templates/
        index.html
        setup.html
        session.html
        replay.html
      static/
        styles.css
        app.js
  tests/
    unit/
    integration/
    e2e/
    fixtures/
  scripts/
    run_dev.py
    export_session.py
    replay_session.py
  docs/
    specification.md

21. Test Suite Specification

The test suite is a required part of the research platform because correctness, comparability, and reproducibility are central.

21.1 Test categories

The system must include:

21.2 Unit tests

A. Steering representation tests

Verify:

B. Sampler tests

Verify:

C. Feedback normalization tests

Verify:

D. Updater tests

Verify:

E. Seed policy tests

Verify:

21.3 Integration tests

A. Session lifecycle test

Flow:

  1. create experiment
  2. create session
  3. request first round
  4. submit feedback
  5. request next round
  6. verify state progression and persistence

B. Generation pipeline test

Use a lightweight mock or tiny test pipeline when full image generation is too expensive.

Verify:

C. Replay integrity test

Verify:

D. Strategy plug-in test

Verify:

21.4 End-to-end tests

Using browser automation or HTTP-level testing, verify:

21.5 Deterministic replay tests

These tests are critical.

Given:

The replay must reproduce:

21.6 Schema regression tests

Verify that old experiment exports can still be loaded or migrated.

21.7 Test fixtures

Required fixtures:

21.8 Acceptance test criteria

The prototype is acceptable when:


22. Minimal Viable Research Prototype

The first working version should include only the following mandatory features.

22.1 Mandatory capabilities

22.2 Nice-to-have but optional for v1


23. Example Experimental Matrix

A useful first matrix for research comparison:

Axis 1: Sampling

Axis 2: Feedback

Axis 3: Update

Axis 4: Seed policy

This creates a manageable but meaningful comparison grid.


24. Research Risks and Confounds

The system specification must explicitly acknowledge the main risks.

24.1 Seed confounding

A candidate may look best because of seed luck rather than embedding quality.

24.2 Human inconsistency

User preference may change or become inconsistent as they see more options.

24.3 Entangled directions

One steering move may affect multiple visual properties at once.

24.4 Interface bias

The layout or labeling of images may influence choice.

24.5 Fatigue effects

Long sessions may reduce feedback quality.

The system should log enough metadata to study these confounds later.


25. Deliverables for AI-Generated Implementation

An AI code generator receiving this specification should produce:

  1. a Python FastAPI backend
  2. a simple HTML/CSS/JS frontend
  3. modular sampler and updater interfaces
  4. one working diffusion generation wrapper
  5. experiment persistence layer
  6. replay/export support
  7. a complete automated test suite following Section 21
  8. documentation for local setup and running experiments

25.1 Code-generation constraints

Generated code should:

25.2 Output artifacts expected from implementation


26. Final Summary

This system is a controlled research platform for studying iterative user-guided image generation by steering prompt embeddings in a diffusion model.

Its core design principles are:

The platform is valuable not because it assumes one best method, but because it creates a clean environment for discovering which combinations of steering representation, candidate sampling, feedback collection, and update logic actually work.