System Test Specification
1. Document Role
This document defines the verification contract for the research platform. Its purpose is to ensure correctness, comparability, and reproducibility rather than only basic functional behavior.
Related documents:
2. Test Objectives
The test suite must demonstrate that:
- the platform behaves correctly under normal flows
- plug-in components can be swapped without breaking orchestration
- deterministic replay is trustworthy
- schema evolution remains manageable
- failures are surfaced in a controlled and recoverable way
3. Test Categories
The system must include:
- unit tests
- integration tests
- end-to-end tests
- deterministic replay tests
- regression tests for schemas and exports
4. Test Environment Strategy
The suite should distinguish between:
- pure logic tests with no model dependency
- service tests with mocked generation
- limited end-to-end runs with lightweight fixtures
- explicit real-model smoke tests run separately from the default test suite
Real image generation should not be required for most tests.
5. Unit Tests
5.1 Steering representation tests
Verify:
- prompt encoding returns expected shape
- basis construction returns correct dimensions
- the steering map E(z) = E0 + U z applies valid tensor shape rules
- trust-region clipping behaves correctly
- anchor penalties reduce drift where expected
- invalid steering dimensions fail clearly
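The steering-map checks above can be sketched as follows. This is a minimal illustration assuming NumPy; the function names (steer, clip_to_trust_region) and shapes are assumptions, not the platform's actual API.

```python
import numpy as np

def steer(E0: np.ndarray, U: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Apply the low-rank steering offset U z to the base embedding E0."""
    assert U.shape == (E0.shape[0], z.shape[0]), "basis dims must match"
    return E0 + U @ z

def clip_to_trust_region(z: np.ndarray, radius: float) -> np.ndarray:
    """Scale z back onto the trust-region sphere if it exceeds the radius."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

E0 = np.zeros(8)
U = np.eye(8)[:, :3]           # rank-3 steering basis
z = np.array([3.0, 4.0, 0.0])  # norm 5, outside a radius-1 trust region

z_clipped = clip_to_trust_region(z, radius=1.0)
assert np.isclose(np.linalg.norm(z_clipped), 1.0)  # clipped onto the boundary
assert steer(E0, U, z_clipped).shape == (8,)       # shape rule holds
```

A mismatched basis (e.g. U with the wrong column count) trips the shape assertion, which is exactly the "invalid steering dimensions fail clearly" case.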
5.2 Sampler tests
Verify:
- candidate count is correct
- candidates respect trust radius
- orthogonal exploration reduces alignment with exploit direction
- deterministic sampling works under fixed RNG state
- diversity filtering removes near duplicates
- role tags are assigned consistently
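A few of these sampler properties can be checked against a toy sampler. The implementation below is an illustrative stand-in (the role-alternation scheme and parameter names are assumptions), but the assertions mirror the determinism, trust-radius, and orthogonality checks listed above.

```python
import numpy as np

def sample_candidates(rng, center, exploit_dir, n, radius):
    """Toy sampler: Gaussian steps, odd-indexed 'explore' steps projected
    orthogonal to the exploit direction, all clipped to the trust radius."""
    exploit_dir = exploit_dir / np.linalg.norm(exploit_dir)
    out = []
    for i in range(n):
        step = rng.normal(size=center.shape)
        if i % 2 == 1:  # explore role: remove component along exploit_dir
            step = step - (step @ exploit_dir) * exploit_dir
        norm = np.linalg.norm(step)
        if norm > radius:
            step = step * (radius / norm)
        out.append(center + step)
    return out

center = np.zeros(4)
d = np.array([1.0, 0.0, 0.0, 0.0])
a = sample_candidates(np.random.default_rng(7), center, d, 4, radius=0.5)
b = sample_candidates(np.random.default_rng(7), center, d, 4, radius=0.5)

# deterministic sampling under a fixed RNG state
assert all(np.allclose(x, y) for x, y in zip(a, b))
# candidates respect the trust radius
assert all(np.linalg.norm(c - center) <= 0.5 + 1e-9 for c in a)
# explore candidates carry no component along the exploit direction
assert abs((a[1] - center) @ d) < 1e-9
```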
5.3 Feedback normalization tests
Verify:
- ratings normalize correctly
- rankings derive pairwise preferences correctly
- invalid ranking payloads are rejected
- duplicate selections are rejected where required
- optional critique text is preserved
- skip or uncertain actions normalize correctly
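The rating and ranking checks can be made concrete with a small sketch. The 1-5 star scale and the expansion of a ranking into pairwise (winner, loser) preferences are assumptions for illustration; the rejection behavior matches the bullets above.

```python
def normalize_rating(stars: int, lo: int = 1, hi: int = 5) -> float:
    """Map a star rating onto [0, 1]; out-of-range values are rejected."""
    if not lo <= stars <= hi:
        raise ValueError(f"rating {stars} outside [{lo}, {hi}]")
    return (stars - lo) / (hi - lo)

def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into pairwise preferences,
    rejecting duplicate selections."""
    if len(set(ranked_ids)) != len(ranked_ids):
        raise ValueError("duplicate candidate in ranking")
    return [(w, l) for i, w in enumerate(ranked_ids) for l in ranked_ids[i + 1:]]

assert normalize_rating(1) == 0.0 and normalize_rating(5) == 1.0
assert ranking_to_pairs(["a", "b", "c"]) == [("a", "b"), ("a", "c"), ("b", "c")]
try:
    ranking_to_pairs(["a", "a"])
except ValueError:
    pass
else:
    raise AssertionError("duplicates should be rejected")
```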
5.4 Updater tests
Verify:
- winner-copy selects the winning candidate exactly
- averaging updater interpolates correctly
- linear updater moves in the expected direction
- pairwise updater handles symmetric cases correctly
- Bayesian updater changes uncertainty as expected
- trust-region clipping constrains updates
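Minimal updater sketches matching the first two behaviors under test; the function names and the interpolation parameter alpha are assumptions, not the platform's API.

```python
import numpy as np

def winner_copy(current, candidates, winner_idx):
    """Replace the current point with the winning candidate exactly."""
    return candidates[winner_idx].copy()

def averaging_update(current, winner, alpha=0.5):
    """Interpolate between the current point and the winner."""
    return (1 - alpha) * current + alpha * winner

cur = np.zeros(3)
cands = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]

assert np.array_equal(winner_copy(cur, cands, 1), cands[1])
# averaging interpolates: alpha=0.5 lands halfway to the winner
assert np.allclose(averaging_update(cur, cands[0], alpha=0.5), [0.5, 0.0, 0.0])
```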
5.5 Seed policy tests
Verify:
- fixed-per-round uses the same seed
- validation candidates receive alternate seeds when configured
- seed manifests are stored for all candidates
- missing seed metadata is treated as a failure
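A fixed-per-round seed policy with alternate validation seeds can be sketched as below. The policy interface and the derivation of validation seeds are assumptions for illustration.

```python
def seeds_for_round(round_seed: int, n_candidates: int, n_validation: int = 0):
    """All regular candidates share the round seed; validation candidates
    get deterministic alternates derived from it."""
    regular = [round_seed] * n_candidates
    validation = [round_seed + 1 + i for i in range(n_validation)]
    return regular + validation

s = seeds_for_round(42, n_candidates=3, n_validation=2)
assert s[:3] == [42, 42, 42]  # fixed-per-round: same seed for all candidates
assert s[3:] == [43, 44]      # validation candidates receive alternate seeds
```

A seed manifest test would then assert that every persisted candidate record carries exactly one entry from this list, and fail when the entry is absent.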
5.6 Persistence and schema tests
Verify:
- sessions persist immutable config snapshots
- rounds persist in correct order
- candidate and feedback foreign-key relationships remain valid
- replay exports serialize required fields
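The foreign-key check can be exercised against an in-memory SQLite database; the two-table schema here is purely illustrative, not the platform's real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE candidate (id INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE feedback (
    id INTEGER PRIMARY KEY,
    candidate_id INTEGER NOT NULL REFERENCES candidate(id))""")
con.execute("INSERT INTO candidate (id) VALUES (1)")
con.execute("INSERT INTO feedback (candidate_id) VALUES (1)")  # valid reference

try:
    con.execute("INSERT INTO feedback (candidate_id) VALUES (99)")  # dangling
    ok = False
except sqlite3.IntegrityError:
    ok = True
assert ok, "dangling foreign keys must be rejected"
```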
6. Integration Tests
6.1 Session lifecycle test
Flow:
- create experiment
- create session
- request first round
- submit feedback
- request next round
- verify progression and persistence
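The lifecycle ordering above can be captured by a small state machine; the class and its transitions are assumptions standing in for the real session controller.

```python
class Session:
    """Toy session: each round must receive feedback before the next."""
    def __init__(self):
        self.round = 0
        self.awaiting_feedback = False

    def next_round(self):
        if self.awaiting_feedback:
            raise RuntimeError("feedback still pending")
        self.round += 1
        self.awaiting_feedback = True

    def submit_feedback(self):
        if not self.awaiting_feedback:
            raise RuntimeError("no round awaiting feedback")
        self.awaiting_feedback = False

s = Session()
s.next_round()        # first round issued
s.submit_feedback()
s.next_round()        # progression allowed only after feedback
assert s.round == 2

s2 = Session()
s2.next_round()
try:
    s2.next_round()   # premature: feedback still pending
    premature_blocked = False
except RuntimeError:
    premature_blocked = True
assert premature_blocked
```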
6.2 Generation pipeline test
Use a lightweight mock or tiny test pipeline when full generation is too expensive.
Verify:
- embeddings flow from encoder through steering to generator
- generation failures are captured and surfaced
- successful candidates still persist when one candidate fails
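A mock generator that fails on demand makes the partial-failure check concrete. The helper names and return shape are illustrative assumptions.

```python
def mock_generate(embedding, fail=False):
    """Stand-in for the image generator; raises when asked to fail."""
    if fail:
        raise RuntimeError("render failed")
    return {"image": b"\x89PNG...", "embedding": embedding}

def run_round(embeddings, fail_at=None):
    """Generate all candidates, capturing per-candidate failures
    instead of aborting the round."""
    results, errors = [], []
    for i, e in enumerate(embeddings):
        try:
            results.append(mock_generate(e, fail=(i == fail_at)))
        except RuntimeError as exc:
            errors.append((i, str(exc)))  # surfaced, not fatal
    return results, errors

results, errors = run_round(["e0", "e1", "e2"], fail_at=1)
assert len(results) == 2                     # siblings still persist
assert errors == [(1, "render failed")]      # the failure is captured
```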
6.3 Replay integrity test
Verify:
- exported replay matches stored rounds and feedback
- images and metadata align correctly
- round order is stable
6.4 Strategy plug-in test
Verify:
- samplers can be swapped by config
- updaters can be swapped by config
- controller logic does not depend on one concrete strategy implementation
6.5 API contract test
Verify:
- endpoints accept expected payloads
- structured errors are returned on invalid input
- response schemas remain stable
7. End-to-End Tests
Using browser automation or HTTP-level testing, verify:
- a user can create an experiment from the UI
- a user can start a session
- a user can provide at least two feedback modes
- a user can proceed to the next round
- a user can open replay for a completed session
- replay export API returns the expected round and feedback history
- recoverable errors are shown clearly
8. Deterministic Replay Tests
These tests are critical.
Given:
- fixed prompt
- fixed experiment configuration
- fixed RNG seeds
- mocked or deterministic generation backend
The replay must reproduce:
- the same candidate proposals
- the same candidate order
- the same update steps
- the same persisted metrics
- the same round summaries
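The core replay assertion is that two runs from identical inputs are byte-for-byte comparable. The run_round helper below is an assumption standing in for the real orchestrator; the comparison pattern is the point.

```python
import numpy as np

def run_round(seed, center, n=3, radius=0.5):
    """Deterministic toy round: seeded proposals plus a summary metric."""
    rng = np.random.default_rng(seed)
    cands = []
    for _ in range(n):
        step = rng.normal(size=center.shape)
        norm = np.linalg.norm(step)
        if norm > radius:
            step = step * (radius / norm)
        cands.append(center + step)
    metrics = {"mean_step": float(np.mean(
        [np.linalg.norm(c - center) for c in cands]))}
    return cands, metrics

c1, m1 = run_round(seed=123, center=np.zeros(4))
c2, m2 = run_round(seed=123, center=np.zeros(4))

# same proposals, in the same order
assert all(np.array_equal(a, b) for a, b in zip(c1, c2))
# same persisted metrics
assert m1 == m2
```

Any nondeterminism (unseeded RNG use, dict-ordering dependence, wall-clock timestamps inside metrics) breaks this comparison, which is exactly what these tests exist to catch.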
9. Regression Tests
Regression coverage should include:
- historical export loading
- schema migration behavior
- known edge-case prompts
- known edge-case feedback payloads
- previously fixed replay bugs
10. Failure-Mode Tests
The test suite should verify controlled behavior for:
- one-candidate render failure
- duplicate feedback submission
- premature next-round generation while feedback is still pending
- invalid ranking payloads
- export generation failure
- database write interruption
- resume after crash
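The duplicate-feedback case, for example, reduces to a simple idempotency guard; the class below is an illustrative sketch, not the platform's API.

```python
class Round:
    """A round accepts feedback exactly once; resubmission is rejected
    and the first submission is preserved."""
    def __init__(self):
        self.feedback = None

    def submit(self, payload):
        if self.feedback is not None:
            raise ValueError("feedback already submitted for this round")
        self.feedback = payload

r = Round()
r.submit({"rating": 3})
try:
    r.submit({"rating": 5})   # duplicate submission
    rejected = False
except ValueError:
    rejected = True

assert rejected
assert r.feedback == {"rating": 3}  # first submission preserved
```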
11. Test Fixtures
Required fixtures:
- deterministic prompt embedding fixture
- synthetic candidate set fixture
- fake user feedback fixture
- mock image generator fixture
- small replay log fixture
- schema snapshot fixture
- frontend/backend trace capture fixture where needed
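The deterministic fixtures can be plain factory functions, usable directly or wrapped as pytest fixtures. Shapes and field names below are assumptions for illustration.

```python
import numpy as np

def prompt_embedding_fixture(dim=8, seed=0):
    """Deterministic prompt embedding: same seed, same vector."""
    return np.random.default_rng(seed).normal(size=dim)

def feedback_fixture():
    """Fake user feedback covering rating, ranking, and critique fields."""
    return {"rating": 4, "ranking": ["c2", "c1", "c3"], "critique": "too dark"}

# deterministic: two calls with the same seed agree exactly
assert np.array_equal(prompt_embedding_fixture(), prompt_embedding_fixture())
assert feedback_fixture()["ranking"][0] == "c2"
```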
12. Acceptance Criteria
The prototype is acceptable when:
- all unit tests pass
- core integration tests pass
- deterministic replay tests pass
- one sampler and one updater can be swapped by configuration only
- the UI supports at least two feedback modes
- exports can be generated and replayed
- browser smoke coverage includes replay export retrieval
- failure-mode behavior is covered for the major recoverable errors
13. Test Reporting Expectations
Test reporting should make it easy to identify:
- failing component area
- failing scenario
- whether the failure breaks replay trustworthiness
- whether the failure is isolated or systemic
14. Summary
The test suite is part of the research method, not an implementation afterthought. If replay, schema stability, and strategy interchangeability are not verified, the platform cannot support reliable experimental conclusions.