Research Improvement Roadmap
1. Purpose
This document tracks the highest-value research improvements for StableSteering as a study platform.
It focuses on:
- research design
- experimental validity
- evaluation quality
- interpretability
- study operations
- comparative baselines
It does not focus on core engineering execution; that work is tracked separately.
2. Current Research Baseline
The current system already supports:
- iterative steering sessions
- multiple samplers and updaters
- multiple feedback modes at the schema level
- deterministic test paths
- replay and trace capture
- real GPU-backed image generation
This is enough for exploratory pilot work, but not yet enough for strong claims about algorithm quality, usability, or scientific validity.
3. Main Research Gaps
The largest current gaps are:
- limited comparative baselines
- limited human-study instrumentation
- no formal study protocols in the repo
- limited analysis automation
- weak coverage of confounds like seed sensitivity and user inconsistency
4. Priority Levels
- R0: Needed before making strong research claims.
- R1: Strongly improves study quality and interpretability.
- R2: Valuable expansions once the core research loop is stable.
5. R0: Research Validity Priorities
5.1 Establish a baseline comparison matrix
Why it matters:
- steering only matters scientifically if it beats simpler alternatives
- without baselines, improvements can be mistaken for ordinary prompt iteration or random luck
Implementation notes:
- define a minimum comparison set:
- prompt rewriting only
- prompt-only manual iteration without steering state
- no-update random sampling
- updater comparisons such as winner-copy, winner-average, and linear-preference
- lock the task set and evaluation rules before collecting data
- require every new strategy claim to include at least one baseline comparison
Success signal:
- every reported result can be compared against a non-steering or weaker-steering baseline
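A minimal sketch of what "lock the task set and evaluation rules before collecting data" could look like in practice: the comparison matrix recorded as data, so every strategy claim can mechanically name the conditions it was compared against. All condition names and fields below are illustrative, not an existing StableSteering configuration.

```python
# Hypothetical baseline comparison matrix, kept under version control so the
# comparison set is locked before data collection begins.
BASELINE_MATRIX = {
    "prompt_rewrite_only":       {"steering": False, "updater": None},
    "manual_prompt_iteration":   {"steering": False, "updater": None},
    "random_sampling_no_update": {"steering": True,  "updater": "none"},
    "winner_copy":               {"steering": True,  "updater": "winner-copy"},
    "winner_average":            {"steering": True,  "updater": "winner-average"},
    "linear_preference":         {"steering": True,  "updater": "linear-preference"},
}

def required_baselines(strategy: str) -> list[str]:
    """Return the comparison conditions a new strategy claim must report against."""
    return [name for name in BASELINE_MATRIX if name != strategy]
```

The point of encoding this as data rather than prose is that a reporting script can refuse to emit a result table unless every condition in `required_baselines(...)` is present.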
5.2 Add explicit study protocols
Why it matters:
- informal operator habits create hidden study drift
- protocols turn one-off demos into repeatable experiments
Implementation notes:
- define pilot-study templates with prompt selection rules, stopping criteria, and operator instructions
- version the protocol documents alongside the code
- standardize how prompts, negative prompts, configuration presets, and success criteria are recorded
- separate exploratory studies from claim-bearing studies in documentation and reporting
Success signal:
- two different operators can run the same study and produce comparable artifacts
5.3 Improve confound logging
Why it matters:
- seed effects, fatigue, UI bias, and interruptions can dominate outcomes if they are not measured
- better logging makes null or mixed results more interpretable
Implementation notes:
- log hidden repeats for agreement checks
- add user confidence and decision time fields where appropriate
- log interruptions, retries, abandonment, and mid-session config changes
- distinguish runtime failures from preference uncertainty in analysis exports
Success signal:
- a disappointing or surprising session can be explained in terms of recorded confounds rather than guesswork
5.4 Define research success criteria
Why it matters:
- qualitative enthusiasm is useful for exploration but too weak for evaluating methods
- explicit criteria make it possible to stop, compare, and reject hypotheses honestly
Implementation notes:
- define expected effect sizes or directional improvements for key tasks
- define acceptable operator burden and session length
- add replay-based success checks such as incumbent improvement rate and convergence stability
- specify robustness thresholds across alternate seeds and repeated runs
Success signal:
- the team can say clearly when a strategy is better, worse, or inconclusive
6. R1: Better Measurement and Analysis
6.1 Add stronger outcome metrics
Why it matters:
- raw win counts are not enough to explain why a strategy helped or failed
- richer metrics reveal speed, stability, and user burden tradeoffs
Implementation notes:
- compute incumbent win rate, rounds-to-satisfaction, preference consistency, and seed robustness
- collect user-reported controllability and fatigue
- separate outcome quality metrics from interaction-cost metrics
- report uncertainty and sample counts together with aggregate values
Success signal:
- strategy comparisons show both effectiveness and operator cost
6.2 Build analysis-ready exports
Why it matters:
- analysis should not begin with manual cleaning of raw session files
- structured exports reduce friction for notebooks, dashboards, and papers
Implementation notes:
- export tidy CSV or parquet tables for candidates, feedback events, rounds, and sessions
- include join keys and session metadata in each table
- preserve references to replay bundles and trace reports
- version the export schema and document it clearly
Success signal:
- a researcher can load a session corpus into a notebook with minimal preprocessing
6.3 Add notebook-based analysis templates
Why it matters:
- reusable notebooks turn collected traces into repeatable analysis rather than one-off custom scripts
Implementation notes:
- create starter notebooks for trajectory analysis, seed robustness, sampler comparisons, and updater comparisons
- make notebooks read the official export schema rather than private ad hoc data layouts
- include plots for round progression, incumbent lineage, and preference stability
- keep example notebooks small enough to run on a local workstation
Success signal:
- the same notebooks can be rerun across new study cohorts without structural edits
6.4 Strengthen replay as a research asset
Why it matters:
- replay is already one of the most information-dense artifacts in the project
- it should support comparative analysis, not only debugging
Implementation notes:
- derive structured summaries from replay automatically
- compute change-over-round plots and candidate-lineage views
- highlight baseline prompt images, incumbent carry-forward steps, and final winners
- support side-by-side replay comparisons across strategies or prompts
Success signal:
- replay becomes a first-class analysis surface, not just a development convenience
7. R1: Better Human Interaction Research
7.1 Move beyond rating-only interaction
Why it matters:
- different preference interfaces capture different kinds of user intent
- pairwise, top-k, winner-only, and approve/reject modes may produce very different noise and speed profiles
Implementation notes:
- run controlled comparisons of rating, pairwise, top-k, winner-only, and approve/reject flows
- measure speed, confidence, consistency, and subjective burden for each mode
- separate interface effects from updater effects in analysis
- align frontend instrumentation with the true interaction type rather than rating-derived shortcuts
Success signal:
- the project can justify when one feedback mode is preferable to another
7.2 Evaluate user consistency and fatigue
Why it matters:
- user preference data is only valuable if it remains stable enough to interpret
- long sessions may degrade data quality even if users continue to participate
Implementation notes:
- add hidden repeat judgments and calibration rounds
- measure round count versus confidence, time-to-decision, and critique quality
- look for fatigue patterns such as faster but less consistent later-round judgments
- test whether some feedback modes resist fatigue better than others
Success signal:
- session length recommendations are based on observed behavior rather than guesswork
7.3 Study interface bias
Why it matters:
- layout, ordering, and displayed metadata can shift user choices independently of the underlying model behavior
- UI bias can easily be mistaken for algorithmic improvement
Implementation notes:
- randomize candidate order in controlled experiments
- compare metadata-hidden versus metadata-visible variants
- compare different grid densities and spacing
- test whether richer replay context changes future judgments
Success signal:
- the influence of interface design on measured outcomes is quantified rather than ignored
7.4 Study richer elicitation modes and UI patterns
Why it matters:
- preference quality depends not only on the model but also on how the system asks for user judgments
- some workflows may benefit more from shortlist, critique, or incumbent-comparison interactions than from one-shot winner choice
Implementation notes:
- compare shortlist selection, best-versus-incumbent, approve/reject, pairwise tournament, and critique-assisted workflows
- study when structured reason tags improve preference consistency or update quality
- compare forced-choice interfaces against interfaces that allow uncertainty or "cannot decide" responses
- measure whether region-aware or attribute-aware elicitation helps for inpainting and image-prompt tasks
- analyze whether elicitation mode changes the apparent value of a sampler or preference model
Success signal:
- the research program can explain which UI/elicitation modes work best for which task families and user goals
8. R1: Synthetic Data Research Direction
8.1 Build realistic synthetic steering trajectories toward an anchor
Why it matters:
- real user studies are expensive, slow, and noisy
- anchor-seeking synthetic users can stress-test algorithms under known hidden targets
Implementation notes:
- define anchor types such as latent steering anchors, reference images, attribute vectors, and text-derived targets
- build a synthetic user model that prefers candidates closer to the anchor while still showing uncertainty and bounded inconsistency
- model near ties, occasional reversals, fatigue-like noise, and critique text aligned with choices
- compare synthetic trajectories against real traces on round count, winner stability, and path geometry
Success signal:
- synthetic anchor-seeking sessions look structurally similar to real steering sessions and support meaningful ablations
8.2 Build diversity-oriented synthetic sampling around one or more steered locations
Why it matters:
- users often want controlled variation around a good region, not only convergence to one hidden optimum
- diversity-seeking synthetic tasks open a richer class of evaluation problems
Implementation notes:
- define one-center diversity tasks that preserve core concept while varying composition, lighting, pose, or background
- define multi-center tasks where preference rewards both desirability and coverage
- formalize diversity objectives using center distance, inter-candidate distance, coverage, and duplicate avoidance
- test policies such as shortlist preference, winner-plus-diversity bonus, and coverage-seeking ranking
Success signal:
- diversity-seeking synthetic trajectories are clearly distinguishable from pure anchor-seeking trajectories
8.3 Use synthetic data to pretrain and stress-test steering algorithms
Why it matters:
- synthetic corpora can accelerate algorithm iteration before expensive human studies
- controlled hidden targets make failure analysis much easier
Implementation notes:
- generate corpora with known hidden targets and varied difficulty settings
- use those corpora for regression testing of samplers, updaters, and feedback interpreters
- build challenge sets containing misleading local optima, seed-sensitive candidates, near ties, and quality-diversity tradeoffs
- compare sim-to-real transfer by tuning on synthetic traces and evaluating on later human sessions
Success signal:
- synthetic data reduces wasted human-study cycles and catches weak strategies earlier
8.4 Treat synthetic-user realism itself as a research problem
Why it matters:
- a poor simulator can bias the whole research program
- realism should be measured and improved, not assumed
Implementation notes:
- define realism metrics across win/loss structure, rating distributions, critique patterns, stop-time distributions, and path geometry
- fit simulator parameters to match observed human behavior more closely
- compare multiple simulator families rather than searching for one universal synthetic user
- document which simulator simplifications are believed to be harmless and which remain risky
Success signal:
- synthetic-user realism can be discussed and improved with evidence, not intuition alone
8.5 Extend steering research to richer diffusion workflows
Why it matters:
- many practical workflows use reference images, masks, or structural controls rather than only text prompts
- a steering method that works only for plain text-to-image may not generalize to real creative work
Implementation notes:
- study image-prompt, image-variation, inpainting, and ControlNet-guided steering as distinct task families
- compare whether the same preference-update logic transfers across those workflows
- define pipeline-specific metrics such as structure adherence, local edit faithfulness, and reference-image faithfulness
- study cross-workflow transfer, including whether synthetic-user models calibrated in one workflow transfer to another
Success signal:
- the research program can say where iterative steering helps most across diffusion workflow families
9. R2: Strategy Research Expansions
9.1 Study steering-dimension selection methods
Why it matters:
- steering dimension controls the size and geometry of the search space the user is trying to navigate
- too few dimensions may cap attainable quality or controllability, while too many may increase noise, redundancy, and cognitive burden
- strong results on one fixed dimension do not automatically transfer to other tasks, prompts, or diffusion workflows
Implementation notes:
- compare fixed low dimensions such as 2, 3, 5, 8, and 16 across matched prompts, seeds, and feedback budgets
- test adaptive dimension schedules, for example start low for stability and expand capacity only after early convergence
- compare basis-construction methods such as random orthogonal axes, PCA-style data-driven axes, learned steering dictionaries, and semantically aligned attribute axes
- measure not only final preference outcome but also rounds-to-satisfaction, duplicate rate, diversity, preference consistency, and subjective user burden
- analyze whether optimal dimension depends on task family, prompt complexity, feedback mode, sampler family, or diffusion workflow
- study whether dimension interacts strongly with incumbent carry-forward and trust-radius policies
Success signal:
- the project can recommend steering-dimension choices for specific task classes with empirical justification
9.2 Add richer steering representations
Why it matters:
- low-dimensional steering is interpretable and simple, but it may be too limited for some tasks
Implementation notes:
- compare low-dimensional steering against token-level, pooled-embedding, and hybrid representations
- measure whether richer representations improve controllability or only add instability
- preserve interpretability metrics while expanding representation capacity
Success signal:
- representation changes are justified by measurable gains rather than novelty alone
9.3 Add stronger samplers
Why it matters:
- sampler quality strongly shapes the candidate set a user can choose from
- better samplers may improve both convergence and exploration efficiency
Implementation notes:
- compare current samplers with Thompson-style, quality-diversity, critique-conditioned, and adaptive trust-region methods
- evaluate both human-judged quality and synthetic benchmark performance
- test whether some samplers pair better with specific feedback modes or update rules
- include incumbent-versus-challenger and shortlist-aware samplers that explicitly optimize for comparison quality, not only search quality
- compare one-step greedy samplers against multi-round planning or lookahead samplers
Success signal:
- sampler comparisons reveal clear tradeoffs in exploration, stability, and user burden
9.4 Add stronger preference and reward models
Why it matters:
- simple winner heuristics are easy to inspect, but they may waste information from rankings, approvals, uncertainty, and critiques
- richer preference models can become the bridge between elicitation design and smarter candidate proposal
Implementation notes:
- compare Bradley-Terry, Plackett-Luce, Bayesian preference, and listwise reward models
- test models that incorporate confidence, near ties, or "cannot decide" outcomes instead of discarding them
- test critique-aware models that combine discrete selections with structured or free-text reasons
- compare models that infer absolute latent quality against models that only learn relative preference
- study whether posterior uncertainty from preference models improves downstream samplers
Success signal:
- the project has evidence about which preference-model family best converts user judgments into useful steering signals
9.5 Add stronger updaters
Why it matters:
- the update rule determines how user judgments become steering-state movement
- weak updaters can waste high-quality feedback
Implementation notes:
- compare current simple updaters with Bradley-Terry, Bayesian preference, contextual bandit, and critique-aware approaches
- compare direct latent-state update rules against preference-model-plus-policy decompositions
- evaluate update sensitivity, robustness to noisy feedback, and stability over multiple rounds
- test how updater choice interacts with sampler choice and feedback modality
Success signal:
- updater research produces concrete guidance on which feedback-to-state mapping works best under which conditions
10. Study Program Milestones
Milestone R-A: Pilot Validity
- establish baseline comparison tasks
- define prompt set
- define study protocol
- log confounds more explicitly
Milestone R-B: Reliable Measurement
- add stronger metrics
- add analysis exports
- add notebooks and replay summaries
- build first realistic anchor-seeking synthetic-user pipeline
- build first diversity-seeking synthetic-user pipeline
Milestone R-C: Comparative Research
- compare steering-dimension selection methods
- compare samplers
- compare preference and reward models
- compare updaters
- compare feedback modalities
- compare elicitation/UI modes
- compare representation strategies
- compare synthetic-user regimes against real-user outcomes
- compare steering behavior across text-to-image, image-prompt, inpainting, and ControlNet workflows
11. Suggested Execution Order
- define the baseline comparison matrix
- define pilot protocols and prompt/task sets
- improve confound logging
- define explicit research success criteria
- build analysis-ready exports
- create notebook-based analysis templates
- strengthen replay as an analysis asset
- compare feedback modalities with real users
- study richer elicitation modes and UI patterns
- evaluate consistency, fatigue, and interface bias
- define anchor-seeking synthetic-user tasks
- define diversity-seeking synthetic-user tasks
- build synthetic stress-test corpora
- evaluate synthetic-user realism
- extend studies to image-prompt, inpainting, and ControlNet workflows
- compare steering-dimension selection methods
- compare richer representations, samplers, preference models, and updaters
12. Summary
The next research phase should shift from “can the system run?” to “can the system support credible conclusions?”
That means focusing on:
- better baselines
- better measurement
- better confound control
- better analysis workflows
- better human-study structure
- realistic synthetic-user generation
- diversity-aware synthetic data regimes
- research coverage beyond text-only generation into richer diffusion pipeline families