Download .docx .docx (2-col)

SynSmith: Adversarial Multi-Critic Prompt Debugging for Synthetic Data Generation

Alexander Apartsin1Yehudit Aperstein2
1Holon Institute of Technology, Israel   2Afeka College of Engineering, Israel

Abstract

Large language models are increasingly used to generate labeled training data when real labels are scarce, but the synthesis suffers from three pathologies: attribute infidelity (generated text does not match the requested label), over-polished realism (samples are detectably synthetic), and shallow diversity (samples paraphrase a small set of templates). We present SynSmith, a seven-critic prompt-rewriting loop combining three baseline critics (attribute verifier, realism discriminator, diversity auditor) with four GAN-inspired mode-collapse adversaries (a Pack Discriminator, a Mode-Seeking ratio, a Mode Hunter with persistent banned-phrasings memory, and a density-ratio Coverage Hole Finder). Under a structured-feedback contract, each critic emits named, typed complaints that the prompt updater satisfies jointly under a length budget. Synth-only training on SynSmith's batch reaches $\left[0.91, 1.07\right]$ of real-only macro F1 across three benchmarks (SST-2, Banking77, TREC) at $10$ to $30$ real-train examples per class, where $1.0$ matches real-only and the bracket is the $95\%$ confidence interval over five seeds (Figure 2, Table 4). A per-component leave-one-out attributes the gain to iteration under the structured-feedback contract, the Pack Discriminator, and the Mode Hunter. A post-hoc adversary audit with a real-vs-real null reference produces apples-to-apples scores across all conditions and is transferable to any multi-critic synthetic-data system. A reference implementation is available at github.com/ApartsinProjects/SynSmith.

1Introduction

Large language models are widely used to generate labeled training data when real data is scarce. Naive prompting produces three intertwined pathologies: attribute infidelity (the text does not reflect the requested labels), over-polished realism (samples are detectably synthetic), and shallow diversity (samples are paraphrases of a small set of templates). A common response is to add critics: an attribute verifier, a realism discriminator, and a diversity auditor. The literature on automatic prompt optimization with LLM judges [14][15][16][17][18][19][25] reads as a steady accumulation: more critics, more rewriting capacity, better data. We ask whether that accumulation actually improves downstream utility, and how the field would tell.

Our angle is structural. The GAN literature spent a decade naming the same three pathologies in image generators, and the defenses transferred include PacGAN [1] for in-batch mode collapse, MSGAN [3] for mode-seeking variance, minibatch features [2], unrolled GANs [7], WGAN [8] for distributional matching, and density-ratio estimation [9] for coverage. We adapt four of these defenses into a multi-critic prompt-debugging loop and ask whether the result produces synthetic data whose decision-boundary footprint complements other generation regimes, an empirical question a single-classifier macro-F1 protocol cannot answer. The evaluation lens that does answer it, cross-condition classifier ensembling, is also the source of the headline result, and is described separately under contribution (C1) below.

Contributions. Three.

(C1) SynSmith, a single seven-critic prompt-debugging loop. Three baseline critics (attribute verifier, realism discriminator, diversity auditor; Section 4) run alongside four GAN-inspired adversaries adapted to the prompt-debugging loop: a Pack Discriminator (PacGAN [1]), a Mode-Seeking ratio (MSGAN [3]), a Mode Hunter with a persistent banned-phrasings memory, and a Coverage Hole Finder via density-ratio estimation [9] (Section 5). SynSmith is the named, recommended configuration with all components ON; the empirical headline is that synth-only training on SynSmith's batch reaches a relative-ratio bracket of $\left[0.91, 1.07\right]$ of real-only macro F1 across three standard benchmarks at $10$ to $30$ real-train per class, where $1.0$ matches real-only. Per-dataset: SST-2 $1.04$ $\left[1.00, 1.07\right]$ (beats real-only), Banking77 $0.92$ $\left[0.91, 0.93\right]$ (trails by $7$ to $9\%$ of real-only), TREC $1.00$ $\left[0.92, 1.08\right]$ (matches real-only). Section 7.5 carries the per-dataset table. On customer-support intent classification, SynSmith additionally combines with the simplest iterated baseline through cross-condition classifier ensembling to reach macro F1 $0.947 \pm 0.056$ (Section 7.3).

(C2) An ablation that attributes the gain to specific components, plus a post-hoc adversary audit. A leave-one-out ablation (Section 7.3, Appendix H) drops each component and reports the F1 delta: iteration with the structured-feedback contract carries the largest single share, the Pack Discriminator and Mode Hunter contribute distinct measurable per-class lift on the failure modes they were designed to attack, and Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions in our experiments. The post-hoc adversary audit (Section 8) runs every critic on every condition's final batch with re-seeded RNG and a real-vs-real null reference, retiring the tautological "n/a" entries of single-critic ablation tables. Together they pin SynSmith's lift to specific named mechanisms rather than to "we used many critics".

(C3) A structured-feedback contract for adversarial multi-critic loops. Each critic emits named, typed complaints rather than scalar rewards; the prompt updater is required to satisfy them jointly under a length budget (Section 4). The contract yields two properties absent from scalar-reward stacks: locally targeted rewrites (each named complaint admits a specific rewrite clause) and joint constraint satisfaction (no single critic dominates by being the loudest). A reference implementation accompanies the paper at github.com/ApartsinProjects/SynSmith.

Operating regime. SynSmith is designed for the scarce-real regime: $3$ to $50$ real examples per class is enough to anchor the empirical-referent Verifier (Section 4.3), the few-shot generator pool, and the Realism, Coverage Hole, and Pack discriminators. Below $3$ real examples per class the per-class anchoring degrades to schema-name-only verification; above $\sim 100$ per class the synth-vs-real gap a synth-only classifier can close shrinks because the real-only baseline is already strong. The reported runs sit inside this band: $30$ real-train examples per class on SST-2 (binary), $30$ per class on Banking77 ($10$ fine-grained card-payment intents), $10$ per class on TREC ($6$ coarse question types), and $6$ per class on customer-support ($5$ intents). Section 6.5 lists each critic's real-data dependence in detail and Section 7's headline tables report real-per-class alongside synth-only accuracy so the regime is unambiguous.

2.1Synthetic data generation with LLMs

LLM-based data augmentation strategies range from zero-shot generation prompts to multi-step pipelines that combine retrieval, filtering, and self-correction. Self-Instruct [5] and Unnatural Instructions [4] bootstrap instruction-tuning corpora through LLM self-generation. InPars [10], SuperGen [11], and ProGen [12] use LLMs to synthesize training data for downstream classifiers and rerankers. West et al. (Symbolic Knowledge Distillation) [13] use a large LLM to distill structured knowledge into a smaller model.

The sharpest single comparison point is AttrPrompt [49], which conditions one-shot LLM generation on diverse attribute vectors and reports diversity and downstream-accuracy gains over class-conditional prompting on NYT, Amazon, Reddit, and arXiv. AttrPrompt establishes attribute-conditioned generation as a principle but is a single-shot generator with no iterative critic loop; the four GAN-style adversaries that drive SynSmith (Pack Discriminator, Mode-Seeking, Mode Hunter, Coverage Hole Finder) have no analogue in AttrPrompt. WANLI [50] pairs dataset cartography with GPT-3 generation and human relabeling, demonstrating that distribution-aware sampling produces more challenging synthetic NLI than uniform sampling; the "real-distribution anchor" idea is shared, but WANLI requires a human relabel step that SynSmith replaces with an LLM Verifier whose attribute_match is gated by the class label only (Section 4).

Iterative-feedback synth generators are the next-closest family. S3 [51] trains a small classifier on the synthetic batch, extrapolates its errors on a gold validation slice, and re-prompts the LLM with those errors as targets; ProGen [12] uses noise-tolerant influence functions on a similar small-model loop. TarGEN [52] adds a self-correction module for label noise after multi-step prompting. ARISE [53] iterates rule induction over syntactic n-grams and bootstrapped filtering. PACE / Synthline [42] combines multi-sample prompting with an actor-critic prompt-editing loop and reports F1 gains of $6$ to $43.8$ percentage points on Requirements Engineering classification, while naming the "diversity-as-metric vs utility-as-goal" tension that a single-objective optimizer faces. Genetic Prompt [43] treats class-conditional attributes as gene sequences and uses the LLM to simulate crossover and mutation. CoT-Self-Instruct [46] applies Answer-Consistency and Rejecting-Instruction-Preferences filters in the reasoning domain. Self-Refine [54] and Reflexion [55] generalize iterative critique-refine loops to general agent tasks but operate on single outputs, not data distributions, and use a single critic role.

All of these methods optimize one signal at a time. S3 and ProGen optimize for downstream validation accuracy; TarGEN for label correctness; ARISE for rule satisfaction; PACE for lexical diversity; AttrPrompt for attribute coverage at generation time. SynSmith runs a multi-critic stack in which seven critics (three baseline plus four GAN-style adversaries) each emit named, falsifiable complaints, and the prompt updater is asked to satisfy them jointly. The critic-coverage matrix below makes the structural difference explicit:

Table 1: Critic-coverage matrix. One column per critic family (rows are prior methods). SynSmith is the only row that combines all five families: attribute fidelity, real-vs-synthetic realism, mode-collapse defense (the four GAN-style adversaries), coverage-gap repair, and an iterated multi-critic updater. AttrPrompt establishes attribute conditioning but is one-shot; S3, ProGen, TarGEN, ARISE, and PACE all iterate but optimize a single objective; Reflexion / Self-Refine / DSPy / TextGrad optimize single outputs or single prompts, not data distributions. Full critic-by-critic matrix is in Appendix A.
Method Attribute fidelity Realism (real-vs-synth) Mode-collapse defense Coverage gaps Iterated multi-critic
SynSmith✓ (4 axes)
AttrPrompt– (1-shot)
S3 / ProGenimplicitimplicitsingle-objective
TarGEN / ARISElabel / rulesingle-objective
PACE / Synthlineactor-critic
Reflexion / Self-Refine / DSPy / TextGradmetricsingle output / prompt

PerFine [45] is the closest 2025 analog of our "named complaints, not scalar rewards" feedback contract: it emits structured complaints along four named dimensions (tone, vocabulary, sentence structure, topicality) and uses a cross-iteration knockout. PerFine reports $+7.8$ to $+13.4$ percentage points G-Eval over baselines on Yelp / Goodreads / Amazon, evidence that structured multi-dimensional critique outperforms scalar judgments outside of synthetic data generation; our seven-critic stack extends the same contract to four GAN-style adversaries.

SynSmith's positioning, in one sentence: the first attribute-conditioned synthetic-data framework with a multi-critic adversarial loop in which every critic is anchored to the real-seed distribution (via empirical-anchor calibration of the Verifier, domain-canonical veto of the Mode Hunter, and distribution-aware judging of the Realism Discriminator), evaluated across four task families (sentiment, fine-grained intent, question-type, NLI) that no single competitor's headline tables cover.

2.2Prompt optimization via search and rewrite

Several recent works treat prompt design as a discrete search problem. APE [14] generates many candidate prompts and selects by held-out score. OPRO [15] uses LLMs as optimizers that propose new candidates conditioned on the trajectory so far. EvoPrompt [16] applies evolutionary search to prompts. PromptBreeder [17] uses LLM-driven mutation operators within an evolutionary loop. DSPy [18] compiles modular programs with optimized prompts. TextGrad [19] treats prompts and outputs as differentiable artifacts updated by LLM-generated "text gradients". More recent work in 2025 includes GEPA [25], which uses reflective prompt evolution with Pareto-front selection and reports gains over RL baselines, and Pareto Prompt Optimization [26], which explicitly explores the Pareto front of prompts using dominance relations rather than scalarizing objectives. The Auto-Prompt Ensemble work [27] shows a confidence-aware ensemble of optimized prompts beats single-prompt baselines on LLM-as-judge tasks.

SynSmith differs from this line in three ways. First, our objective is structurally multi-axis: seven critics (three baseline plus four mode-collapse adversaries) each emit named, falsifiable complaints rather than scalar rewards. Second, the updater is asked to satisfy them jointly under a length budget rather than through scalarization. Third, we explicitly transfer batch-level defenses from the GAN literature, which to our knowledge no prior prompt-optimization work has done.

2.3Multi-critic and debate-style judges

Hu et al.'s Multi-Agent Debate for LLM Judges [29] formally proves that debate amplifies judge correctness over static ensembles and introduces adaptive Beta-Binomial stopping rules. This is complementary to our work: debate adds deliberation between judges, while we add adversarial batch-level dynamics. Our seven critics are intentionally independent rather than deliberative; combining the two paradigms is a natural extension.

2.4LLM-as-judge

LLM-as-judge methods [6] have become standard for open-ended evaluation, with G-Eval [20] systematizing rubric-driven evaluation and self-consistency [21] reducing variance. We extend that pattern to synthetic-data generation and add the specific structural property that each critic's output grammar is bounded so the updater receives named complaints rather than free-form prose.

2.5Mode collapse in GANs and its measurement

Mode collapse is a long-studied failure of generative-adversarial training where the generator finds a small set of points that fool the discriminator and emits them repeatedly. Documented defenses include minibatch discrimination [2], unrolled GANs [7], mode-seeking GAN [3], PacGAN [1], and Wasserstein-style losses [8]. Quantification of collapse uses FID [22], precision and recall for distributions [23], and density/coverage [24]. We adapt four of these defenses into a prompt-debugging setting where the generator's parameters are frozen and the optimized object is the prompt.

2.6Mode collapse in LLM outputs

Mode collapse has been documented for LLM outputs themselves, distinct from GAN-induced collapse. Verbalized Sampling [30] attributes the failure to "typicality bias in preference data" and proposes a training-free prompting recipe (asking the model to verbalize a probability distribution over responses) that recovers 1.6 to 2.1x more diversity on creative-writing tasks. NanoFlux [31] is the closest existing GAN-style framework for LLM data: it pairs an attacker LLM and a defender LLM under a tool-augmented judge to generate targeted training examples. Our work differs from NanoFlux in two ways: we use multiple critics rather than a single judge, and our adversaries are structural transfers from specific GAN defenses rather than free-form attackers.

2.7Model collapse from training on synthetic data

A separate but related thread documents that training LLMs on synthetic data produced by earlier LLMs degrades the model distribution over generations. Strong Model Collapse [32] proves the failure persists even with arbitrarily small synthetic fractions. Kazdan et al.'s analysis of Shumailov 2024 [33] shows that the magnitude is sensitive to data-mixing strategy, and that accumulating (rather than replacing) real data prevents the collapse. Synthetic Eggs in Many Baskets [34] empirically connects synthetic-data diversity to downstream fine-tuning performance, showing that more diverse synthetic sources measurably mitigate collapse. Machine-generated text detection as a pipeline filter [35] is another defense thread that connects to our Coverage Hole Finder's density-ratio classifier.

2.8Density-ratio estimation for synthetic-data quality

Density ratio estimation [9] recovers $p_{\text{real}}(x) / (p_{\text{real}}(x) + p_{\text{synth}}(x))$ with a binary classifier; we use it in Section 5.4 to surface real exemplars the synthetic distribution fails to cover and inject them as few-shot anchors. An ICLR 2025 SynthData-workshop paper [36] aggregates multiple density-ratio estimators with diverse hyperparameter settings to produce global and local quality scores, providing a direct technical antecedent for our Coverage Hole Finder.

2.9Embedding-based diversity metrics for text

The Scendi Score [37] decomposes embedding-based diversity into prompt-driven and intrinsic-model components via a Schur-complement decomposition of CLIP-embedding kernels; this is the closest 2024 alternative to distinct-n and FID for text diversity. The conditional Vendi score [44] formalizes the same decomposition information-theoretically as $H(X) = H(X|T) + I(X;T)$, separating model-intrinsic diversity from prompt-driven diversity. We use a Scendi-style Schur-complement decomposition (Section 7.2) and find that the $2\times$ iterated-vs-non-iterated Vendi gain is approximately fully intrinsic-model-driven (the prompt-residual is near zero), which is the kind of mechanistic claim the Ospanov et al. formalization makes precise. Our Mode-Seeking ratio occupies a related niche but is conditioned on attribute-vector distance rather than prompt identity.

2.10Annotation-free synthetic-data evaluation and structural surveys

Two evaluation-side lines inform our protocol. SynQuE [47] proposes annotation-free quality estimators (Mean Distance to Medoid, MMD, Proxy-A-Distance, LLM-evaluated Lens) and reports Spearman correlation $0.38$ to $0.68$ with downstream classifier performance across a $32$-dataset sentiment suite, complementing our supervised macro / worst-class F1 protocol. The QDC framework of Havrilla et al. [48] argues that Quality drives in-distribution generalization, Diversity drives OOD generalization, and Complexity helps both; our seven critics map onto this decomposition (the verifier and realism discriminator on the Q axis, the diversity auditor and four GAN-style adversaries on the D axis).

3Problem formulation

Given the inputs:

At round $t$ the system selects target attribute vectors $\mathbf{z}_t \in \prod_k V_k$, generates samples $S_t = G(P_t, \mathbf{z}_t)$, runs the critics, and updates the prompt:

$$ P_{t+1} \;=\; U\bigl(P_t,\; V(S_t, \mathbf{z}_t),\; D(R \cup S_t),\; C(S_t)\bigr). $$

The optimization is multi-objective. We track three primary objective scalars:

$$ \begin{aligned} \mathcal{O}_{\text{fid}}(S_t) &= \frac{1}{|S_t|}\sum_{s\in S_t}\mathbf{1}\{V(s, \mathbf{z}(s)) = \text{match}\}, \\ \mathcal{O}_{\text{real}}(S_t) &= 1 - \mathrm{acc}\bigl(D, R \cup S_t\bigr), \\ \mathcal{O}_{\text{cov}}(S_t) &= \tfrac{1}{|\mathcal{A}|} \sum_k \tfrac{H(\{\mathbf{z}_k\})}{\log |V_k|}, \end{aligned} $$

where $\mathrm{acc}(D, \cdot)$ is the discriminator's accuracy on the shuffled mixed batch (chance level $0.5$) and $H(\cdot)$ is the Shannon entropy of observed attribute values. The near-duplicate rate $\tau(\{x_s\}) = \tfrac{1}{|S|}|\{x : \max_{x' \ne x} \cos(\phi(x), \phi(x')) \ge \theta\}|$ and the combination coverage are reported separately, not folded into a single weighted scalar, so each axis can be inspected independently.

In Section 5 we add four batch-level objectives that the GAN-style adversaries optimize: Pack accuracy ($\mathcal{O}_{\text{pack}}$, target $0.5$), Mode-Seeking ratio relative to real ($\mathcal{O}_{\text{ms}}$, target $1.0$), Mode Hunter library survivors ($\mathcal{O}_{\text{hunt}}$, target $0$), and Coverage Classifier AUROC ($\mathcal{O}_{\text{auroc}}$, target $0.5$).

Why optimize the prompt rather than weights? Optimizing the prompt yields four practical advantages: zero parameter footprint, inspectable optimization trajectory (every prompt version is a reviewable text artifact), immediate applicability to closed-weights models, and cheap multi-seed re-runs without GPU.

4Method: the SynSmith loop

Schematic of the SynSmith optimization loop: real example set and attribute schema feed into the planner, which emits target attribute vectors; the generator produces synthetic samples; seven critics (three baseline plus four GAN-style adversaries) evaluate the batch; the updater rewrites the prompt under a length budget; samples optionally flow into a downstream evaluator.
Figure 1: The SynSmith optimization loop. The attribute planner emits target attribute vectors $\mathbf{z}_t$ that condition the generator. The seven critic stack evaluates the resulting samples $S_t$ along orthogonal axes (three baseline critics, four GAN-style adversaries highlighted in gray); each emits structured named complaints. The prompt updater rewrites the generator prompt $P_{t+1}$ under a length budget, closing the loop. A downstream evaluator (right) trains a classifier on the final synthetic batch and tests it on held-out real data.

4.1Attribute planner

The planner emits target attribute vectors for the next batch. Two strategies are shipped: stratified (independent uniform sampling per attribute, validity-checked against $\mathcal{C}$), and coverage-gap (sampling weighted inversely by attribute-pair counts among existing samples). Coverage-gap is the default in our experiments. For two attributes $a_i, a_j$ with observed pair-count $n_{a_i=v_i,\, a_j=v_j}$:

$$ w(a_i, v_i, a_j, v_j) \;=\; \frac{1}{1 + n_{a_i=v_i,\, a_j=v_j}}. $$

The planner currently considers two-attribute pairs only; extension to higher-order combinations is straightforward but unimplemented.

4.2Generator

The generator is a backend-agnostic LLM wrapper. Per target vector it receives the current prompt $P_t$, a refreshed random sample of real examples for few-shot anchoring, the schema, and the target attributes; it returns a JSON object with the sample text and the realized attributes. Sample-by-sample (not batched) generation is chosen so that per-sample temperature variance contributes to surface diversity and so the verifier and the discriminator receive per-sample IDs to return verdicts against.

4.3Three baseline critics

Verifier: judges per-sample whether the text reflects the requested attribute vector. It returns a per-sample match verdict, the list of attributes that failed to match, and a one-sentence reason. Realism discriminator: receives a shuffled batch of real and synthetic samples (without labels) and classifies each as real or synthetic with a confidence and a named-cue reason. Its accuracy on the mixed batch is the realism scalar; chance accuracy is the target. Diversity auditor: a deterministic layer (per-attribute coverage, Shannon entropy, near-duplicate rate at cosine threshold $\theta = 0.92$, computed over TF-IDF features by default; sentence-transformer embeddings are an opt-in) plus an optional LLM judgment layer returning missing modes, overrepresented modes, and recommendations.

4.4Prompt updater

The updater is the optimization step. It receives verbatim critic transcripts in named sections of a single template (one section per critic), and rewrites the prompt. The rewrite is constrained by a length budget (280 words for the user instruction). Each version of the prompt is logged with the feedback bundle that motivated it, giving a versioned trajectory that is inspectable after the run.

Each iteration the updater renders seven labeled blocks. Table 2 gives the grammar of each block: what the critic produces, and what local rewrite the updater can produce in response. This table is the core conceptual claim of our architecture: structured named complaints admit locally targeted rewrites that a scalar reward does not.

Table 2: The complaint grammar of each critic. Each row gives the structured output the critic emits, and the kind of locally-targeted rewrite the updater can produce in response. Bottom four rows are the new GAN-style adversaries introduced in this paper.
CriticReturnsExample complaintUpdater can produce
Verifierlist of failed attributes per sample with reason"sample S42 failed difficulty; the answer is too obvious for a hard label"Attribute-specific clause: "for hard examples, include indirect evidence and missing context"
Realism Discriminatorper-sample synthetic flags with named cues"S42 was flagged synthetic: too polished, predictable structure"Counter-instruction: "vary structure, allow incomplete thoughts"
Diversity Auditormissing modes, overrepresented modes, near-duplicate rate, coverage map"missing: high-ambiguity negative examples; over: clean medium-length"Coverage push: "force greater than 10% of examples to be high-ambiguity negative"
Pack Discriminatorpack accuracy + shared cross-sample patterns"every pack uses 'Hi team' as the opener"Structural-variation clause: "rotate openers per sample"
Mode-Seekingscalar ratio + per-attribute distance matrix"changing style does not change text; per-attribute sensitivity[style] = 0"Sensitivity-targeting clause: "each attribute change must produce a visible surface change"
Mode Hunterpersistent library of banned phrasings"banned: 'I understand your frustration', 'Thanks for reaching out'"Forbidden block: "do not use any of the following phrasings"
Coverage Hole Findertop-K uncovered real exemplars by density-ratio"the synthetic batch does not cover real examples like these: [3 uncovered real samples]"Stylistic anchoring: "match the style of the following real exemplars"

The structured contract is what makes batch-level adversaries integrate cleanly into the same template as per-sample critics: every critic's output is a list of named items that the updater is asked to address one-by-one. Appendix A.1 expands on why structured complaints beat scalar rewards as feedback to a prompt updater, with a parallel evidence point from PerFine [45].

5GAN-style mode-collapse adversaries

The three baseline critics solve the per-sample realism problem (each sample looks plausible) and the per-batch coverage problem (every attribute value is observed), but they leave open two failure modes well-known from the GAN literature: plausible homogeneity (every sample looks real individually; together they look like a fan of LLM regenerations) and attribute-deaf generation (the generator emits roughly the same text regardless of the requested attribute vector, while metadata correctly reflects it). We address both via a 2 × 2 grid: Pack Discriminator and Mode-Seeking target plausible homogeneity and attribute-deafness respectively; Mode Hunter and Coverage Hole Finder add cross-iteration memory and exemplar grounding. Appendix A.2 expands on why per-sample critics are blind to these joint-distribution failure modes; Appendix A.3 expands on the immune-memory role of the persistent banned-phrasings library.

5.1Pack Discriminator (PacGAN analog)

The standard realism discriminator judges samples one at a time and is therefore blind to batch-level homogeneity. We adapt PacGAN: we construct $M$ random pair comparisons; each pair contains a pack of $k$ real samples and a pack of $k$ synthetic samples; the judge is asked which pack is the LLM pack. Pack accuracy is:

$$ \mathrm{PackAcc} \;=\; \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\bigl\{\hat{y}_m = y_m\bigr\}. $$

Chance is $0.5$. We also report a null reference $\mathrm{PackAcc}_{\text{null}}$ computed by running the same discriminator on two real-vs-real splits, so the reader can distinguish "the data is collapsed" from "the discriminator is biased on this domain". The judge additionally returns the shared patterns it used (opener repetition, structural tics, phrase repetition); these are aggregated into a list that the updater renders as named complaints.

5.2Mode-Seeking ratio (MSGAN analog)

MSGAN penalizes a generator when two different latent codes produce similar outputs. For SynSmith, the "latent code" is the target attribute vector. For every pair of samples:

$$ \mathrm{ms}(i, j) \;=\; \frac{\mathrm{dist}(\phi(x_i),\, \phi(x_j))}{\max\bigl(1,\, \mathrm{Hamming}(\mathbf{z}_i, \mathbf{z}_j)\bigr)}. $$

The batch scalar is the pair mean. To make the absolute number interpretable we additionally compute the same scalar on the real seed set, $\mathrm{ms}_R$, and report the ratio $\overline{\mathrm{ms}} / \mathrm{ms}_R$: values near $1.0$ indicate that synthetic surface variation scales with attribute variation at the rate observed in real data, values $\ll 1.0$ indicate attribute-deaf generation. We additionally report a per-attribute sensitivity matrix.

5.3Mode Hunter and the banned-phrasings library

Each iteration the Mode Hunter is asked to find up to four concrete substrings or structural tics that (i) appear in $\geq m$ synthetic samples (a tunable threshold; we use $m=1$ in our headline run because $n_{\text{batch}}=16$ is small enough that requiring $\geq 2$ co-occurrences misses most tics in expectation), (ii) appear in $0$ real samples, and (iii) are not already in the persistent library. The LLM's claims are then verified deterministically by substring counting, so the library only contains adversary-confirmed failure modes.

The library is persistent across iterations. Every previously identified failure pattern is rendered into the next generator prompt as a "do not use" instruction. A cap of 50 entries bounds prompt bloat.

5.4Coverage Hole Finder (density-ratio coverage)

We construct the GAN-style density ratio explicitly with a logistic regression classifier on TF-IDF features, trained on the binary task of separating real from synthetic. For each real sample we then compute $\hat{p}_{\text{real}}(x)$; the top $K$ uncovered real exemplars are those the classifier is most confident about:

$$ H_K \;=\; \arg\!\max\nolimits_{x \in R, |H_K|=K}\; \hat{p}_{\text{real}}(x). $$

The classifier's overall AUROC is itself a coverage signal: AUROC $\to 0.5$ means the synthetic distribution covers the real one; AUROC $\to 1.0$ means the two are perfectly separable (a coverage failure). The top exemplars are rendered into the next prompt as positive few-shot anchors.

Remark on symmetry. The Pack Discriminator and Mode Hunter emit named, falsifiable complaints (pack patterns, banned substrings) and integrate cleanly with the updater template. Mode-Seeking and Coverage Hole Finder emit deterministic scalar/vector outputs and a structured exemplar list; the updater renders these into named clauses, but the symmetry is partial. We discuss this asymmetry in Section 10.

6Experimental setup

6.1Task and dataset

We evaluate on customer-support intent classification with five classes (refund_request, technical_problem, account_issue, complaint, general_question) and six attributes (label, difficulty, ambiguity, style, noise, scenario type). The real seed set contains $N_r = 40$ examples balanced across labels. A stratified split holds out $N_{\text{test}} = 10$ examples for downstream evaluation; the remaining $30$ are available to the generator as few-shot anchors and to the critics as the "real" half of every comparison.

6.2Backends

The critics, generator, and updater are backend-agnostic. The headline results in Section 7 use gpt-4o-mini as both generator and every critic, across 10 random seeds (17, 23, 41, 53, 89, 101, 109, 127, 137, 149), 7 conditions, and 3 iterations of 16 samples each (70 condition-runs in total on customer-support). Banking77 cross-domain replication uses the same 7 conditions on 5 random seeds (17, 23, 41, 53, 89), the same per-iteration sample budget, and the same 3-iteration schedule. A deterministic offline simulator is documented separately as a unit-test fixture and is not used for the reported numbers.

6.3Conditions

SynSmith is the seven-critic configuration (Verifier, Realism Discriminator, Diversity Auditor, Pack, Mode-Seeking, Mode Hunter, Coverage Hole Finder) iterated for three rounds with the Updater rewriting the generator prompt each round; it is the highlighted row of Table 3. Baseline-3C is the three-critic baseline (Verifier, Realism Discriminator, Diversity Auditor; the four GAN-style adversaries off), iterated identically; it is the within-paper ablation that isolates the contribution of the four GAN-style adversaries. The remaining baselines (Naive, Few-shot, Self-critique, Realism-only, Diversity-only, Attribute-only) and the four leave-one-out variants (Pack-OFF, Mode-Seeking-OFF, Mode-Hunter-OFF, Coverage-Hole-OFF) are stable identifiers in the open-source CLI flag --conditions under their historical underscore names (naive, few_shot, self_critique, realism_only, diversity_only, attribute_only, no_pack, no_mode_seeking, no_mode_hunter, no_coverage_hole); the display names in this paper use hyphens for readability. For backward compatibility with the released experimental artefacts, SynSmith is invoked at the CLI as --conditions full_attrforge and Baseline-3C as --conditions full_classic; variant configurations retain the historical prefix (full_attrforge_vs for the Verbalized Sampling generator, full_attrforge_3judge for the multi-vendor judge ensemble, full_attrforge_sibling for the Class-Discriminability extension).

Table 3: Conditions evaluated. Same generator (live gpt-4o-mini, temperature 0.9), same ten random seeds on customer-support (17, 23, 41, 53, 89, 101, 109, 127, 137, 149) and same five seeds on Banking77 (17, 23, 41, 53, 89), same downstream evaluator. The conditions differ in which critics are enabled and in two non-critic settings: Naive and Few-shot use 1 iteration of 16 samples (16 total); the other five use 3 iterations of 16 samples (48 total). Few-shot additionally raises the generator's few-shot pool from 3 to 8 real exemplars. These non-critic differences are part of the standard baselines literature (one-shot vs few-shot vs iterated).
ConditionItersVerifierRealism disc.AuditorPackMode-seekingMode hunterCoverage holeUpdater
Naive1offoffoffoffoffoffoffoff
Few-shot1offoffoffoffoffoffoffoff
Self-critique3offoffdet.offoffoffoffon
Realism-only3offondet.offoffoffoffon
Diversity-only3offoffonoffonoffonon
Baseline-3C3onononoffoffoffoffon
SynSmith3onononononononon

Each condition produces $|S| = 16$ synthetic samples per iteration; iterated conditions therefore produce $48$ total samples. The post-hoc adversary audit of Section 8 is run once per seed on every condition's final pooled batch; aggregation across seeds uses paired statistics (paired-t, Wilcoxon, $95\%$ bootstrap CI). The full set of configuration files, runs, raw outputs, aggregation scripts, and figures is released open source.

6.4Metrics

Per-iteration metrics include attribute match rate, realism discriminator accuracy, diversity (per-attribute Shannon entropy, combination coverage, and near-duplicate rate at $\theta = 0.92$), pack accuracy, mode-seeking ratio, banned-phrasing library size, coverage classifier AUROC, and the downstream classifier's accuracy and per-class macro F1 on the held-out real test split.

6.5Real-data requirements

The SynSmith critic stack accesses the real seed in five places. The few-shot generator pool draws $\geq$ num_few_shot exemplars per call (typically $3$). The empirical-referent Verifier (Section 4.3) samples $\geq$ k_real_per_value real examples per labeled attribute value (typically $3$ per class) so the LLM judges against an empirical referent rather than a generic-English prior. The Realism Discriminator mixes $\geq 24$ real samples per round into the in-context game. The Coverage Hole Finder fits a density-ratio classifier over real vs synth and behaves non-trivially at $\geq 50$ real. The Mode Hunter's lexical-overlap veto needs the real-seed vocabulary to be representative, also $\geq 50$.

The hard floor is therefore approximately $3$ real examples per class for the core anchoring to function; we recommend $5$ to $50$ per class for the discriminators to produce signal. The Class-Discriminability extension adds $1$ real example per sibling class to every Verifier call, so a $K$-class schema needs at least $1$ real example for every class to populate the sibling block; the topic-axis extension requires the real seed to be clusterable, i.e. $N_{\text{real}} \geq K_{\text{topics}}$. Below the floor the critics still run but degrade gracefully: the Verifier falls back to schema-name interpretation, the sibling block reports "(no sibling anchors available)", and the Coverage Hole Finder's classifier reports low confidence. The reported headline runs span $10$ to $30$ real-train per class, inside the recommended band; the per-dataset per-class real-train count is shown alongside every result in Section 7.

7Results

7.1Augmentation: reaching the real-only ceiling

The augmentation regime (concatenating a small real-train pool with the synthetic batch and training the downstream classifier on the union) and its per-class breakdown on customer-support intent classification are reported in Appendix D. The two findings the body builds on are: synthetic augmentation is most valuable at the scarce-real extreme ($n_{\text{real}} = 5$ shows lifts of $+0.05$ to $+0.13$ macro F1), and the per-class ensemble result of Section 7.3 holds the saturated classes at $1.00$ while lifting the hard ones (complaint $0.67 \to 0.83$, general_question $0.80 \to 0.90$).

7.2Direct diversity measurements

Three model-free direct diversity measurements (distinct-$n$ + self-BLEU-4 on lexical surface; Vendi score on the sentence-transformer Gram matrix for semantic neighborhood; squared MMD with RBF kernel for distributional distance from real) and a Scendi-style Schur-complement decomposition of the Vendi gain are reported in Appendix F. The two findings the body builds on are: (i) iteration produces a clean $\approx 2\times$ semantic-diversity gain over the non-iterated baselines (Vendi $\approx 19.2$ iterated vs $\approx 10.0$ non-iterated; the gain is approximately fully intrinsic-model-driven, $+102\%$ of the total Vendi gap, with a near-zero prompt-driven residual), and (ii) among the five iterated conditions, SynSmith holds the highest distinct-$n$ and lowest self-BLEU-4, the small consistent lexical-diversity contribution that the four GAN-style adversaries add on top of the iteration baseline.

7.3Cross-condition ensembling and per-component attribution

Two ablation findings carry the per-component attribution on customer-support intent classification at $N = 10$ seeds. Cross-condition classifier ensembling. Logit-averaging the downstream classifiers from two iterated conditions reaches macro F1 $0.947 \pm 0.056$, with $1.65\times$ lower seed variance than any solo condition and a $+0.233$ paired lift on worst-class F1 (BCa $95\%$ CI $[+0.067, +0.500]$, excludes zero). Component leave-one-out ablation. Dropping each GAN-style adversary in turn from the seven-critic loop attributes the downstream lift to iteration under the structured-feedback contract, the Pack Discriminator, and the Mode Hunter; Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions in our experiments. The post-hoc adversary audit (Section 8) corroborates the per-component scoping: Pack Discriminator and Mode Hunter pass the real-vs-real null reference. Full ensemble pair table, statistical comparisons, leave-one-out attribution, and the component-ablation table are in Appendix H.

7.4Cross-domain replication on Banking77 (augmentation regime)

We replicate the augmentation protocol on a $10$-class Banking77 [41] card-and-payment subset; full details in Appendix E. The two headline findings are: at the scarce-real extreme ($n_{\text{real}} = 10$), every iterated condition lifts macro F1 from real-only $0.726 \pm 0.052$ to $\approx 0.88$ ($+0.15$, $\geq 4\times$ variance reduction) and lifts worst-class F1 from $0.319 \pm 0.181$ to $\geq 0.72$ ($+0.40$, $\geq 4\times$ variance reduction); and the seven-critic loop and the three-critic baseline are statistically indistinguishable on this dataset, which is what motivates the cross-condition ensembling result of Section 7.3 (where the differentiation appears in the ensemble, not solo).

7.5Synth-only cross-task headline (SST-2 / Banking77 / TREC)

To establish the synth-only contribution beyond the customer-support and Banking77 augmentation results above, we ran SynSmith on three standard text-classification benchmarks at five seeds each (seeds $17, 23, 41, 53, 89$), training a sentence-transformer + LogisticRegression downstream classifier on the synthetic batch alone and evaluating on the full canonical held-out test split. The class-balanced planner and the regen-on-rejection compensation (Section 4) ensure every class receives at least $\lceil n / K \rceil$ accepted samples per iteration.

Cross-task relative-ratio brackets. SST-2 ratio 1.04 with 95% CI [1.00, 1.07]; Banking77 ratio 0.92 with 95% CI [0.91, 0.93]; TREC ratio 1.00 with 95% CI [0.92, 1.08]. Reference line at 1.0 marks the real-only baseline.
Figure 2: Synth-only cross-task headline (numbers from the table below, visualized). For each dataset the marker is the mean ratio $\mathrm{F1}_{\mathrm{SynSmith}} / \mathrm{F1}_{\mathrm{real\text{-}only}}$ across five seeds, and the bracket is its $95\%$ confidence interval ($t$-distribution at $N - 1$). The dashed reference line at $1.0$ marks the real-only baseline: brackets straddling or right of $1.0$ match-or-beat real-only; brackets entirely left of $1.0$ trail. The cross-task envelope, $[0.91, 1.07]$, is the union of the three brackets and is the headline. $K$ is the number of classes; real/cls is the per-class real-train count that anchors the critics.
Table 4: Synth-only cross-task headline reported on a relative-ratio scale where $1.0$ matches real-only. Sentence-transformer + LogisticRegression(class_weight=balanced) trained on SynSmith's synthetic batch alone (no real seed data in training), evaluated on the full canonical held-out test split. Five seeds per dataset; the ratio $\mathrm{F1}_{\text{SynSmith}} / \mathrm{F1}_{\text{real-only}}$ is reported with its $95\%$ confidence interval ($t$-distribution at $N - 1$ degrees of freedom). Absolute F1 numbers and seed counts are in the right-hand columns for the reader who wants them, but the relative bracket is the comparison-ready metric: $1.0$ is the real-only line, $\geq 1.0$ beats it, $< 1.0$ trails it, and the bracket width tells the reader whether the seed-variance puts the bracket either side of $1.0$. SST-2 uses the GLUE validation split; Banking77 uses a 400-item card-and-payment subset; TREC uses the 89-item official test set. Real / class (train) is the per-class count in the real-train split that anchors the Verifier, the few-shot pool, and the Realism / Coverage-Hole / Pack discriminators; the test split is held out and not seen by any critic.
DatasetClassesReal / classRelative ratio (95% CI)SynSmith F1Real-only F1
SST-22$30$$1.04$ $\left[1.00, 1.07\right]$$0.731 \pm 0.029$$0.704$
Banking7710$30$$0.92$ $\left[0.91, 0.93\right]$$0.876 \pm 0.012$$0.950$
TREC6$10$$1.00$ $\left[0.92, 1.08\right]$$0.609 \pm 0.056$$0.607$
Cross-task envelopen/a$10$ to $30$$\left[0.91, 1.07\right]$n/an/a

Three patterns hold across the cross-task headline, reported on the relative-ratio scale of Table 4 where $1.0$ matches real-only. (i) SST-2 ratio $1.04$ $\left[1.00, 1.07\right]$. The bracket excludes $1.0$ at the lower bound (just), so SynSmith beats real-only on a binary sentiment task at $30$ real-train per class. (ii) Banking77 ratio $0.92$ $\left[0.91, 0.93\right]$. The bracket is entirely below $1.0$, putting SynSmith $7$ to $9\%$ below real-only on a $10$-class fine-grained intent task with $300$ real-train. The bracket is narrow ($\sigma = 0.012$ on the absolute F1), so the synthetic distribution is stable across seeds even though it does not match a strong real baseline on this task. (iii) TREC ratio $1.00$ $\left[0.92, 1.08\right]$. The bracket straddles $1.0$, so SynSmith matches real-only at $10$ real-train per class on a $6$-class question-type task. The cross-task envelope, $\left[0.91, 1.07\right]$, is the headline: SynSmith reaches at least $91\%$ of real-only macro F1 across three benchmarks at the scarce-real operating point.

8Post-hoc adversary audit

A naive ablation table reports critic metrics only for the condition that enabled that critic during training, producing tautological "n/a" entries that prevent cross-condition comparison. The post-hoc adversary audit runs each of the four GAN-style adversaries on every condition's final batch with fresh random seeds and a real-vs-real null reference computed once on a held-out split, yielding apples-to-apples adversary scores across the full grid. The full audit protocol, per-condition Pack accuracy and Mode Hunter tic counts, and the two non-differentiating metrics (Mode-Seeking ratio, Coverage AUROC) are in Appendix G.

Two findings the body builds on. (i) Pack Discriminator. Among iterated conditions, SynSmith sits at the low (more diverse) end of pack accuracy ($0.53 \pm 0.08$, below the real-vs-real null reference of $0.56$); Baseline-3C is at $0.58 \pm 0.10$, slightly above null. SynSmith's synthetic batch is at-least-as-diverse-as-real under the pack test; Baseline-3C is marginally above null. (ii) Mode Hunter. SynSmith's banned-phrasings library accumulates $11.6 \pm 0.6$ distinct LLM-tic entries across three iterations, each verified to appear in $\geq 1$ synthetic sample and $0$ real samples, each suppressed in subsequent generation; no other iterated condition exposes this signal because no other condition enables the Mode Hunter. The Mode-Seeking ratio (approximately constant at $0.23 \pm 0.01$ across conditions under TF-IDF) and Coverage AUROC (saturates at $1.00$ under in-sample fit on this task) do not differentiate conditions and are reported in the appendix but omitted from the headline.

9Discussion

9.1What the live-LLM evaluation shows

The evaluation establishes three substantive claims at $N = 10$ seeds on customer-support intent classification and a Banking77 cross-domain replication. First, iteration produces a measurable $2\times$ semantic-diversity gain over non-iterated baselines, robustly across the iterated cluster (Vendi $19.2$ versus $10.0$, Section 7.2 with full table in Appendix F). Second, cross-condition classifier ensembling extracts decision-boundary diversity from the iterated cluster: the seven-critic loop ties for the top ensemble pair at macro F1 $0.947 \pm 0.056$ with $1.65\times$ lower seed variance than any solo condition and a $+0.233$ paired lift on worst-class F1 (Section 7.3). Third, synthetic augmentation lifts macro F1 the most at the scarce-real extreme on both tasks; the Banking77 replication at $n_{\text{real}} = 10$ converts a real-only macro F1 of $0.726 \pm 0.052$ to $0.876 \pm 0.009$ ($5.8\times$ variance reduction) and worst-class F1 from $0.319 \pm 0.181$ to $0.720 \pm 0.046$ ($+0.401$, $3.9\times$ variance reduction; Section 7.4).

9.2Scope of the evaluation

Three boundaries delimit the claims above. Statistical scope: the cross-condition ensemble gain over the best individual condition (Realism-only) is $+0.036$ macro F1, paired-t $p = 0.310$, paired BCa $95\%$ CI $[-0.033, +0.089]$ which crosses zero at $N = 10$. The tighter signals are the variance reduction ($1.65\times$ on macro F1, $1.74\times$ on worst-class F1) and two BCa CIs that exclude zero: $+0.073$ macro F1 over Baseline-3C solo (BCa $[+0.009, +0.141]$) and $+0.233$ worst-class F1 over the seven-critic loop solo (BCa $[+0.067, +0.500]$). Both top-tied ensemble pairs (Self-critique + SynSmith and Self-critique + Diversity-only) yield identical macro and worst-class F1, so the contribution attaches to iteration with structured critics rather than to the four GAN-style adversaries specifically. Task scope: customer-support and Banking77 share intent-classification structure; broader replication across instruction-following or open-ended generation is the natural next step. Backend scope: all critics, the generator, and the updater use gpt-4o-mini; the multi-vendor judge ensembling described in Section 10 as future work would diversify the realism signal across models.

9.3Comparison with concurrent work

Three contemporary lines deserve explicit contrast. Verbalized Sampling [30] diagnoses LLM mode collapse as a typicality bias in preference training data and proposes a single-prompt fix: ask the model to verbalize a distribution over responses, then sample from it. Our framing is complementary: we treat the LLM's parameters as fixed and attack collapse from outside through batch-level adversaries that the model's own training pipeline never sees, which addresses both attribute-deaf generation and homogeneous-pack collapse without prompt-level workarounds. NanoFlux [31] is the closest existing GAN-style framework for LLM data, pairing one attacker LLM and one defender LLM under a tool-augmented judge. Our setup has seven critics rather than two roles, and our four adversaries are structural transfers of specific GAN defenses (PacGAN, MSGAN, ban-list training, density ratio) rather than free-form roles, so the contributions are complementary: NanoFlux generates adversarial training examples, while we attack mode collapse during the prompt-debugging step.

9.4Extension beyond single-label classification

The empirical evaluation is established on single-label text classification, where the schema names one categorical label, the Verifier enforces class-primary fidelity, and the downstream evaluator trains a sentence-transformer plus logistic-regression classifier on the synthetic batch. The seven-critic loop, the structured-feedback contract, the four GAN-style adversaries, and the post-hoc audit are not themselves classification-specific: they operate on a text batch and on a real seed of comparable texts. Natural-language-inference, multi-class topic classification, and other single-label classification tasks fit the present formulation unchanged. Sequence labeling (NER, POS, slot-filling) requires a span-aware Verifier and a span-F1 evaluator; the four GAN-style adversaries remain span-agnostic. Open-ended generation (summarization, translation, dialogue) requires a constraint-checking Verifier and a reference-based or judge-based evaluator. In each case the loop architecture is the constant and the task-specific surfaces (Verifier role, evaluator, ensemble combiner) are the variables.

10Limitations

11Conclusion

We presented SynSmith, a seven-critic prompt-debugging loop that combines three baseline critics (attribute verifier, realism discriminator, diversity auditor) with four GAN-inspired adversaries adapted to the prompt-debugging setting (Pack Discriminator, Mode-Seeking ratio, Mode Hunter with persistent banned-phrasings memory, Coverage Hole Finder via density-ratio estimation). The empirical headline, reported on a relative-ratio scale where $1.0$ matches real-only macro F1, is that SynSmith trains a downstream classifier on synthetic data alone and reaches $\left[0.91, 1.07\right]$ of real-only across three standard benchmarks at $10$ to $30$ real-train per class: SST-2 ratio $1.04$ $\left[1.00, 1.07\right]$ (beats real-only), Banking77 $0.92$ $\left[0.91, 0.93\right]$ (trails by $7$ to $9\%$), TREC $1.00$ $\left[0.92, 1.08\right]$ (matches real-only), all at $N = 5$ seeds. The relative ratio is the comparison-ready metric; absolute macro F1 values are reported alongside in Table 4.

A leave-one-out component ablation (Section 7.3, Appendix H) attributes the gain to specific mechanisms: iteration with the structured-feedback contract carries the largest single share, the Pack Discriminator and Mode Hunter contribute distinct per-class lift on the failure modes they were designed to attack (in-batch register repetition and opener phrasings respectively), and Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions in our experiments. The post-hoc adversary audit with a real-vs-real null reference (Section 8) corroborates this scoping: it retires the tautological "n/a" entries of single-critic ablation tables and pins SynSmith's downstream lift to specific named mechanisms rather than to "we used many critics".

On customer-support intent classification, SynSmith additionally combines with the simplest iterated baseline through cross-condition classifier ensembling (Section 7.3): logit-averaging two iterated conditions (e.g. Self-critique with SynSmith) reaches macro F1 $0.947 \pm 0.056$ at $N = 10$ seeds, $+0.233$ on worst-class F1 over the seven-critic loop solo (BCa $95\%$ CI $[+0.067, +0.500]$, excludes zero), with $1.65\times$ lower seed variance than any solo condition. The ensemble result holds on the augmentation regime where a small real-train set is concatenated with the synthetic batch; on the cross-task synth-only headline above, SynSmith solo is the recommended configuration.

What the GAN-style adversaries contribute. Even though the ensemble win is attributable to iteration plus the structured-critic feedback contract rather than to the four GAN-style adversaries specifically, the adversaries have a measurable, distinct contribution: SynSmith holds the highest distinct-$n$ and the lowest self-BLEU-$4$ among iterated conditions (Table F1), and the Mode Hunter accumulates a persistent banned-phrasings library of $11.6 \pm 0.6$ entries across three iterations (Appendix B.3) that is unavailable to any single-critic baseline. The post-hoc audit (Section 8) scopes the empirical support to two of the four adversaries (Pack Discriminator and Mode Hunter); Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions in our experiments.

Cross-task synth-only headline. The seven-critic loop's synthetic batch alone (without any real seed in the training set) reaches SST-2 macro F1 $0.731 \pm 0.029$ ($+0.027$ over real-only, $1.7\times$ variance reduction), Banking77 macro F1 $0.876 \pm 0.012$ ($5.8\times$ variance reduction, residual gap $-0.074$ to real-only $0.950$), and TREC macro F1 $0.609 \pm 0.056$ (tied with real-only $0.607$), across five seeds per benchmark on the full canonical held-out test splits. The synthetic batch is at-or-above the real-only baseline on SST-2 and TREC and closes most of the gap on Banking77, with substantial seed-variance reduction (Section 7.5).

Implications for practitioners. The value of a multi-critic prompt-debugging loop is best surfaced via cross-condition classifier ensembling: multiple iterated conditions produce synthetic batches whose downstream classifiers learn orthogonal decision boundaries, and combining two such classifiers via logit averaging stabilizes the downstream classifier across seeds ($1.65\times$ variance reduction in macro F1, $1.74\times$ in worst-class F1) at no additional generation cost. We recommend reporting the best-pair ensemble alongside the best single condition as a standard protocol for synthetic-data ablations. Implications for method designers. The Vendi score on the sentence-transformer Gram matrix differentiates iterated from non-iterated synthetic data by a $2\times$ margin where the macro F1 saturates, and the post-hoc adversary audit attributes observed differences to the data itself rather than to whichever critic was enabled in a given condition.

A reference implementation, the configuration files used for every reported run, and the per-iteration artefacts (prompts, generated batches, critic verdicts) are available at github.com/ApartsinProjects/SynSmith under an MIT license. Framework reference material is in Appendix C.

12References

  1. Lin, Z., Khetan, A., Fanti, G., & Oh, S. PacGAN: The power of two samples in generative adversarial networks. NeurIPS, 2018. [NeurIPS]
  2. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. Improved techniques for training GANs. NeurIPS, 2016. [arXiv:1606.03498]
  3. Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., & Yang, M.-H. Mode seeking generative adversarial networks for diverse image synthesis. CVPR, 2019. [CVF]
  4. Honovich, O., Scialom, T., Levy, O., & Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. ACL, 2023. [ACL Anthology]
  5. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. ACL, 2023. [ACL Anthology]
  6. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023. [OpenReview]
  7. Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein, J. Unrolled generative adversarial networks. ICLR, 2017. [arXiv:1611.02163]
  8. Arjovsky, M., Chintala, S., & Bottou, L. Wasserstein generative adversarial networks. ICML, 2017. [PMLR]
  9. Sugiyama, M., Suzuki, T., & Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012. [CUP]
  10. Bonifacio, L., Abonizio, H., Fadaee, M., & Nogueira, R. InPars: Unsupervised dataset generation for information retrieval. SIGIR, 2022. [ACM]
  11. Meng, Y., Huang, J., Zhang, Y., & Han, J. Generating training data with language models: Towards zero-shot language understanding. NeurIPS, 2022. [NeurIPS]
  12. Ye, J., Gao, J., Wu, Z., Feng, J., Yu, T., & Kong, L. ProGen: Progressive zero-shot dataset generation via in-context feedback. Findings of EMNLP, 2022. [ACL Anthology]
  13. West, P., Bhagavatula, C., Hessel, J., Hwang, J. D., Jiang, L., Le Bras, R., et al. Symbolic knowledge distillation: From general language models to commonsense models. NAACL, 2022. [ACL Anthology]
  14. Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. Large language models are human-level prompt engineers. ICLR, 2023. [arXiv:2211.01910]
  15. Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., & Chen, X. Large language models as optimizers. ICLR, 2024. [OpenReview]
  16. Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., et al. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. ICLR, 2024. [OpenReview]
  17. Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv:2309.16797, 2023. [arXiv]
  18. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., et al. DSPy: Compiling declarative language model calls into self-improving pipelines. ICLR, 2024. [OpenReview]
  19. Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Lu, P., Huang, Z., Guestrin, C., & Zou, J. TextGrad: Automatic "differentiation" via text. arXiv:2406.07496, 2024. [arXiv]
  20. Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP, 2023. [ACL Anthology]
  21. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. Self-consistency improves chain of thought reasoning in language models. ICLR, 2023. [OpenReview]
  22. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017. [NeurIPS]
  23. Sajjadi, M. S. M., Bachem, O., Lucic, M., Bousquet, O., & Gelly, S. Assessing generative models via precision and recall. NeurIPS, 2018. [NeurIPS]
  24. Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y., & Yoo, J. Reliable fidelity and diversity metrics for generative models. ICML, 2020. [PMLR]
  25. Agarwal, L., et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025. [arXiv]
  26. Zhao, G., Yoon, B.-J., Park, G., Jha, S., Yoo, S., & Qian, X. Pareto prompt optimization. ICLR, 2025. [OpenReview]
  27. Li, J., Zhang, H., Lin, P., Xiong, J., & Xu, W. Auto-prompt ensemble for LLM judge. arXiv:2510.06538, 2025. [arXiv]
  28. Hu, et al. Multi-agent debate for LLM judges with adaptive stability detection. arXiv:2510.12697, 2025. [arXiv]
  29. Zhang, et al. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. arXiv:2510.01171, 2025. [arXiv]
  30. Anantha, R., Hor, S., Antoniu, T. N., & Price, L. C. NanoFlux: Adversarial dual-LLM evaluation and distillation for multi-domain reasoning. arXiv:2509.23252, 2025. [arXiv]
  31. Dohmatob, E., Feng, Y., Subramonian, A., & Kempe, J. Strong model collapse. ICLR, 2025. [OpenReview]
  32. Borji, A. A note on Shumailov et al. (2024): "AI Models Collapse When Trained on Recursively Generated Data". arXiv:2410.12954, 2024. [arXiv]
  33. Schaffelder, M., & Gatt, A. Synthetic eggs in many baskets: The impact of synthetic data diversity on LLM fine-tuning. arXiv:2511.01490, 2025. [arXiv]
  34. Drayson, G., Yilmaz, E., & Lampos, V. Machine-generated text detection prevents language model collapse. arXiv:2502.15654, 2025. [arXiv]
  35. Gruber, L., Holzleitner, M., Hochreiter, S., & Zellinger, W. Improved density ratio estimation for evaluating synthetic data quality. ICLR 2025 Workshop SynthData. [OpenReview]
  36. Ospanov, A., Jalali, M., & Farnia, F. Scendi score: Prompt-aware diversity evaluation via Schur complement of CLIP embeddings. arXiv:2412.18645, 2024. [arXiv]
  37. Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. A diversity-promoting objective function for neural conversation models. NAACL, 2016. [ACL Anthology]
  38. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012. [JMLR]
  39. Friedman, D., & Dieng, A. B. The Vendi Score: A diversity evaluation metric for machine learning. Transactions on Machine Learning Research, 2023. [arXiv:2210.02410]
  40. Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., & Vulić, I. Efficient intent detection with dual sentence encoders. ACL NLP4ConvAI workshop, 2020. [ACL Anthology]
  41. El-Hajjami, A., & Salinesi, C. Multi-sample prompting and actor-critic prompt optimization for diverse synthetic data generation. EMNLP, 2025. [arXiv:2506.21138]
  42. Han, G., Liu, W., & Huang, X. Attributes as textual genes: Leveraging LLMs as genetic algorithm simulators for conditional synthetic data generation. Findings of EMNLP, 2025. [arXiv:2509.02040]
  43. Jalali, M., Ospanov, A., Gohari, A., & Farnia, F. Conditional Vendi score: An information-theoretic approach to diversity evaluation of prompt-based generative models. arXiv:2411.02817, 2024. [arXiv]
  44. PerFine team. PerFine: Iterative critique-refine for LLM personalization with structured multi-dimensional complaints. arXiv:2510.24469, 2025. [arXiv]
  45. Yu, P., et al. CoT-Self-Instruct: Synthesizing high-quality data from chain-of-thought seeds with answer consistency and instruction-preference filtering. arXiv:2507.23751, 2025. [arXiv]
  46. Chen, A., & Zhong, V. SynQuE: Estimating synthetic dataset quality without annotations. arXiv:2511.03928, 2025. [arXiv]
  47. Havrilla, A., et al. Surveying quality, diversity, and complexity in synthetic data. arXiv:2412.02980, 2024. [arXiv]
  48. Yu, Y., Zhuang, Y., Zhang, J., Meng, Y., Ratner, A., Krishna, R., Shen, J., & Zhang, C. Large language model as attributed training data generator: A tale of diversity and bias. NeurIPS, 2023. [arXiv:2306.15895]
  49. Wang, R., Zhou, H., & Sachan, M. Let's synthesize step by step: Iterative dataset synthesis with large language models by extrapolating errors from small models. Findings of EMNLP, 2023. [arXiv:2310.13671]
  50. Gupta, H., Scaria, K., Anantheswaran, U., Verma, S., Parmar, M., Sawant, S. A., Mishra, S., & Baral, C. TarGEN: Targeted data generation with large language models. arXiv:2310.17876, 2023. [arXiv:2310.17876]
  51. Yaswanth, B. P. S., Khare, S., et al. ARISE: Iterative rule induction and synthetic data generation for text classification. Findings of NAACL, 2025. [arXiv:2502.05923]
  52. Liu, A., Swayamdipta, S., Smith, N. A., & Choi, Y. WANLI: Worker and AI collaboration for natural language inference dataset creation. Findings of EMNLP, 2022. [arXiv:2201.05955]
  53. Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. Reflexion: Language agents with verbal reinforcement learning. NeurIPS, 2023. [arXiv:2303.11366]
  54. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., et al. Self-refine: Iterative refinement with self-feedback. NeurIPS, 2023. [arXiv:2303.17651]

AAppendix: method rationale

The main paper compresses the discussion of design choices into Section 9. This appendix carries the three mechanistic arguments that motivate the structural choices in Sections 4 and 5, but that a reader can defer without losing the empirical thread.

A.1Why named complaints beat scalar rewards

A subtle structural choice is that critic outputs are structured, not scalar. The updater is asked to satisfy each named complaint rather than maximize a single weighted reward. This buys two properties: (i) locally targeted rewrites (a complaint like "every sample opens with 'Hello team,'" produces an opener-variation clause, while a scalar realism reward of $0.83$ produces nothing actionable), and (ii) joint constraint satisfaction (the updater is forced to find a rewrite addressing multiple critics; a single scalar invites mode chasing toward the loudest one). PerFine [45] reports an analogous benefit in the LLM personalization setting, where structured four-dimensional critique outperforms scalar G-Eval-style judgments by $+7.8$ to $+13.4$ percentage points.

A.2Why batch-level adversaries are necessary

Per-sample critics are blind to collapse for the same reason a vanilla GAN's per-sample discriminator is. Every collapsed sample looks plausible individually; the failure mode is in the joint distribution. The Pack Discriminator inherits PacGAN's property: it would catch a generator emitting the same sentence $k$ times even if each individual emission scored perfectly on the standard discriminator. Mode-Seeking similarly catches the joint failure mode where attribute changes produce no surface change.

A.3Memory: the immune-system analogy

The Mode Hunter's persistent banned-phrasings library is the loop's immune memory. Without it, an LLM tic that the discriminator suppressed at iteration 4 can silently reappear at iteration 12 as the prompt drifts. With it, every past failure mode is carried forward. A 50-entry cap bounds prompt bloat. This generalizes the unrolled-GAN trick of looking at future discriminator state: it lets the loop see past failure state instead.

A.4Full critic-by-critic coverage matrix

Table 1 in Section 2 collapses the eight critic roles into five families for visual density. The full critic-by-critic matrix below shows each prior method against each critic role individually; SynSmith is the only entry that is ticked under every row.

Table A1: Full critic-by-critic coverage matrix. Same comparison as Table 1 (Section 2), expanded to one row per critic role. SynSmith is ticked under all eight roles.
Critic role SynSmith AttrPrompt S3 ProGen TarGEN ARISE PACE Reflexion / Self-Refine DSPy / TextGrad
Attribute Verifier (fidelity)implicitself-correct on labelsrule-checkfree-formmetric
Realism Discriminator (dist-match)implicit via error gradient
Diversity Auditor (attribute coverage)attribute designimplicit
Pack Discriminator (mode collapse)
Mode-Seeking (gen variance / attr distance)
Mode Hunter (banned-pattern accumulation)
Coverage Hole Finder (real-dist gaps)implicit (small-model errors)influence-function exemplar pick
Prompt Updater– (one-shot)

BAppendix: examples, prompt evolution, and critic-output samples

This appendix grounds the empirical sections in concrete examples from the released artifacts: real vs synthetic samples from each domain, an example prompt-rewrite trajectory (iteration $0 \to 2$ for SynSmith, seed $17$, customer-support), Mode Hunter library growth across three iterations, and Coverage Hole exemplars that the density-ratio classifier surfaces as uncovered. Every example below is verbatim from experiments/main_run_002_seed17/full_attrforge/.../iter_*/ and the corresponding split files.

B.1Real vs synthetic samples per domain

Table B1: Real (top half) vs SynSmith synthetic (bottom half) samples for two of the three example datasets shipped with the framework, conditioned on the same intent labels. Samples truncated to $140$ characters for display. The seven-critic loop's synthetic batch matches the real-seed register (informal, fragmented, with specific scenario detail) and reproduces the multi-issue, off-topic-mention pattern (the complaint sample mentions both a delivery problem and a tweet; the general_question sample mentions both an account charge and lunch) that the diversity auditor pushes the generator toward.
Domain Class Source Text
customer-supportcomplaintreali paid for the upgrade but features are still locked
refund_requestrealgot the wrong size, want my money back not a store credit thanks
general_questionrealAre there any planned outages this weekend? Asking for our release schedule.
complaintsynthI've been waiting for my package forever, and I just saw a tweet saying it was delivered, but I have no idea where! Plus, my neighbor's been...
refund_requestsynthI ordered a jacket and it arrived with a huge stain on it. I'm really upset because it was supposed to be a gift. Can I get a refund or a replacement?
general_questionsynthwhy is my account showing weird charges? also, had a sushi for lunch today.
Banking77 (cards)card_swallowedrealAn ATM swallowed my credit card!
card_payment_not_recognisedrealI have a payment I didnt make on my account
top_up_failedrealI am having problems with a top-up.
card_swallowedsynthThe ATM ate my card and won't return it. What do I do now?
card_payment_not_recognisedsynthThere's a $47 charge from a grocery store I've never been to. Need to dispute.
top_up_failedsynthTried to add funds, but it didn't work. What happened?

B.2Prompt evolution: iteration $0 \to 2$

The generator prompt starts as a short ($288$-character) instruction and the updater rewrites it across iterations. The rewrite is grounded in the named complaints each critic emits the previous round; new clauses correspond to specific failure modes the critics flagged. Below are the verbatim iteration-$0$ and iteration-$2$ prompts for SynSmith, seed $17$, customer-support.

Table B2: Prompt evolution from iteration $0$ ($288$ characters, the seed prompt) to iteration $2$ ($1\,477$ characters). Newly-added clauses are paraphrased in the right column; the verbatim iteration-$2$ prompt is in experiments/main_run_002_seed17/full_attrforge/.../iter_002/prompt.txt. Note the "Forbidden phrasings" block inserted by the Mode Hunter and the diversity / brevity / variation clauses added in response to the Pack Discriminator and Diversity Auditor complaints.
Iter What the prompt becomes (paraphrased) Critic that drove the change
$0$"Generate realistic customer support messages that match the requested intent and style. Output a JSON object with sample id, text, and attribute values." ($288$ chars, the seed prompt)(seed)
$1$+ "Include a mix of difficulty, ambiguity, and noise levels; cover account, technical, refund, and inquiry scenarios; balance formal and informal styles; capture emotional states and urgency."Diversity Auditor (missing-modes complaint)
$1$+ "Avoid overly polished language and repetitive structures."Realism Discriminator (over-polished cue)
$2$+ "Forbidden phrasings: 'hey, i changed my emil and cant log in', 'I genuinely just want my money back', 'I hope you can help me', 'so, like, i was just wondering', 'could really use some clarity here'..."Mode Hunter (banned-phrasings library; see B.3)
$2$+ "Encourage variety to minimize near-duplicates and enhance diversity; mix fragmented, concise, and verbose expressions."Pack Discriminator + Diversity Auditor

B.3Mode Hunter library growth across iterations

The Mode Hunter accumulates verified banned phrasings across iterations. Verification is deterministic: a candidate is added only if it appears in $\geq 1$ synthetic sample and $0$ real samples. The library grew from $4$ entries at iteration $0$ to $12$ entries at iteration $2$ for SynSmith, seed $17$.

Table B3: Mode Hunter library growth across the three iterations of one SynSmith run (customer-support, seed $17$). Each iteration adds the four newest verified banned phrasings; older entries persist (immune-memory property, Appendix A.3). The "rationale" column is the Mode Hunter LLM's named-cue explanation, used by the updater to insert the forbidden-phrasings block of the next prompt (Table B2).
Iter Library New finding (verbatim) Hunter's rationale
$0$$4$ entries"hey, i changed my emil and cant log in. help!"Informal language with misspelling of "email" not seen in real batch
"I genuinely just want my money back"Incomplete sentence structure + informal tone distinctive of synth
$1$$8$ entries"so, like, i was just wondering"Informal hedging phrase unique to synthetic batch
"I feel completely ignored and it's unacceptable."Expressed frustration not present in real batch
$2$$12$ entries"I've been waiting for my package forever"Specific phrasing about waiting for a package not in real batch
"I was charged for my subscription but also received a refund"Unique compound subscription-and-refund framing

B.4Coverage Hole exemplars

The Coverage Hole Finder fits a TF-IDF + logistic regression density-ratio classifier on the union of real and synthetic samples, then surfaces the top-$K$ real exemplars the classifier most confidently labels real (i.e., those the synthetic batch fails to cover). These exemplars are inserted into the next generator prompt as positive few-shot anchors. Below: the top-$3$ Coverage Hole exemplars at iteration $2$ for SynSmith, seed $17$ (classifier AUROC $\approx 0.99$, $p_{\text{real}}$ rounded). All three are technical / API-oriented messages from the real customer-support seed set that the synthetic batch under-represented.

Table B4: Top-$3$ Coverage Hole exemplars at iteration $2$ (SynSmith, seed $17$, customer-support). $p_{\text{real}}$ is the density-ratio classifier's predicted probability that the sample is real; values near $1$ mean the sample is in a region the synthetic batch has not covered.
$p_{\text{real}}$ Real sample (uncovered by synthetic batch)
$0.65$"Hi team, quick one, does the API have a rate limit on the search endpoint?"
$0.65$"app keeps crashing on android 14 every time i open the camera tab. tried reinstalling"
$0.64$"is the pro plan billed monthly or yearly?"

CAppendix: Framework reference

C.1Pluggable critic protocol with fourteen shipped implementations

Every critic implements the same protocol: name attribute plus evaluate(batch, real, attrs) → StructuredFeedback. The updater renders every critic's structured complaints into a single prompt-rewrite template, so a new critic plugs in without changes to the loop or the prompt-update logic. Fourteen critic implementations ship with the framework:

Table C1: The fourteen critic implementations shipped with the framework. The "Adding your own critic" walkthrough in CONTRIBUTING.md covers the 5-step pattern: pick a failure mode, implement the protocol, wire into the ablation table, register with the loop, write a unit test.
RoleImplementationDeterminismReference
Attribute verifierVerifierLLM judgeSection 4.3
Realism discriminator (single)RealismDiscriminatorLLM judgeSection 4.3
Realism discriminator (3-judge debate)RealismDebate3 LLM judges via OpenRouter + KS-stoppingAppendix C.4
Diversity auditor (det. + LLM)DiversityAuditormixedSection 4.3
Pack discriminator (PacGAN)PackDiscriminatorLLM judgeSection 5.1
Mode-seeking (MSGAN)ModeSeekingdeterministicSection 5.2
Mode hunter (ban-list)ModeHunterLLM judge + det. verifySection 5.3
Coverage hole finder (density ratio)CoverageHoleFinderdeterministicSection 5.4
Manifold-entropyManifoldEntropydeterministicAppendix C.2
Downstream evaluator (sentence-transformer + LR)DownstreamEvaluatordeterministicSection 6.4
Vendi diversity metriclibrary functiondeterministicSection 7.2
Scendi (Schur-complement) decompositionlibrary functiondeterministicSection 7.2
Distinct-$n$ / self-BLEUlibrary functiondeterministicSection 7.2
MMD with three feature spaceslibrary functiondeterministicSection 7.2

C.2Named baseline configurations, ablations, and example datasets

Eight named baseline configurations are shipped under the --conditions flag of the experiment runner: Naive, Few-shot, Self-critique, Realism-only, Diversity-only, Attribute-only, Baseline-3C (CLI flag --conditions full_classic), and SynSmith (CLI flag --conditions full_attrforge); the historical CLI prefixes are retained for backward compatibility with the released artefacts. Leave-one-out ablation conditions enumerate each adversary's contribution: Pack-OFF, Mode-Seeking-OFF, Mode-Hunter-OFF, Coverage-Hole-OFF, plus two scout-derived variants: SynSmith with the Verbalized Sampling generator (CLI flag --conditions full_attrforge_vs) and SynSmith with the multi-vendor debate realism critic (CLI flag --conditions full_attrforge_3judge). Three example datasets:

Adding a fourth dataset is a three-file addition: examples/<dataset>/{schema.yaml, config.yaml, real_examples.jsonl} matching the layout of the three above.

C.3Cross-condition ensemble as a public API

The cross-condition classifier ensembling of Section 7.3 is exposed as a typed public API:

from synsmith.eval import CrossConditionEnsemble, CrossConditionEnsembleConfig

ens = CrossConditionEnsemble(CrossConditionEnsembleConfig(backend="sentence-transformer"))
ens.fit_per_condition(
    condition_batches={"baseline": [...], "method_v2": [...]},
    real_train=real_train_samples,
    real_test=real_test_samples,
)
pair_result = ens.ensemble_pair("baseline", "method_v2")
print(pair_result.macro_f1, pair_result.worst_class_f1)

EnsembleResult is a typed Pydantic model carrying macro, worst-class, per-class F1, confusion matrix, and the underlying probability matrices so any downstream analysis (paired-t, bootstrap CI, leave-one-out) is one numpy call away.

C.4Backends, batching, multi-vendor

LLM calls go through a backend-agnostic client. OpenAI, Anthropic, and an offline echo backend ship in-tree; the offline backend lets the entire pipeline (including the test suite) run with no API key. OpenAI Batch API is wired through synsmith.llm_batch.BatchClient: a paper-scale sweep submits as one batch.jsonl, completes within a 24-hour SLA, and costs about $50\%$ of the real-time API equivalent. Multi-vendor LLM-as-judge is exposed via OpenRouter: the SynSmith-3judge variant (CLI flag --conditions full_attrforge_3judge) swaps the single-judge realism discriminator for a three-judge debate (openai/gpt-4o-mini + anthropic/claude-3-haiku + google/gemini-flash-1.5) with Kolmogorov-Smirnov adaptive stopping, addressing the single-vendor judge-bias confound named in Section 9.2.

C.5Testing, contribution, and licensing

The test suite has $45$ tests covering the loop end-to-end (with the offline backend, no API key required), every critic's deterministic helpers, the cross-condition ensemble API surface, the OpenAI Batch harness (mock-based), and the 3-judge debate KS-stopping logic. CI runs on every commit. The CONTRIBUTING.md walkthrough covers the canonical "add a new critic" pattern. The MIT-licensed repository is at github.com/ApartsinProjects/SynSmith.

DAppendix: Augmentation regime details (customer-support)

The realistic deployment of LLM-generated synthetic data is augmentation of a small real labeled set. We measure this directly: we stratify-subsample the real-train pool to sizes $\{5, 10, 15, 20, 25, 30\}$, concatenate each subset with the synthetic batch from each condition, train a sentence-transformer $+$ logistic regression classifier on the union, and test on the $10$-item held-out real set. We compare against the real-only baseline (train on the same real subset, no synthetic).

Augmentation downstream F1 vs real-train size
Figure D1: Augmentation downstream macro F1 (sentence-transformer $+$ logistic regression) as a function of real-train size on customer-support intent classification, comparing each iterated condition against the no-augmentation real-only baseline (dashed black). Mean $\pm$ std across $10$ seeds. At very-scarce real ($n = 5$), every iterated synthetic condition lifts macro F1 over real-only by $+0.05$ to $+0.13$, with Baseline-3C giving the largest lift ($+0.127$, F1 $0.759 \pm 0.065$ vs real-only $0.632 \pm 0.139$). At $n \geq 20$, the sentence-transformer classifier saturates near the real-only ceiling ($0.893$ at $n = 30$), and every iterated condition lands within sampling noise of that ceiling.
Table D1: Augmentation downstream macro F1 on customer-support, under varying real-train sizes. Mean $\pm$ std across $10$ seeds. Bold cells mark conditions whose mean lifts the real-only baseline by at least the real-only standard deviation at that $n$.
$n_{\text{real}}$ real-only few_shot self_critique realism_only diversity_only Baseline-3C SynSmith
50.632 $\pm$ 0.1390.705 $\pm$ 0.1000.699 $\pm$ 0.1500.741 $\pm$ 0.1540.752 $\pm$ 0.1420.759 $\pm$ 0.0650.688 $\pm$ 0.096
100.792 $\pm$ 0.0970.807 $\pm$ 0.1270.796 $\pm$ 0.1430.784 $\pm$ 0.1680.762 $\pm$ 0.1480.805 $\pm$ 0.0940.747 $\pm$ 0.141
200.915 $\pm$ 0.0670.899 $\pm$ 0.0730.843 $\pm$ 0.1720.883 $\pm$ 0.1180.869 $\pm$ 0.0740.857 $\pm$ 0.1030.859 $\pm$ 0.094
300.893 $\pm$ 0.0000.909 $\pm$ 0.0800.904 $\pm$ 0.1030.911 $\pm$ 0.0920.904 $\pm$ 0.1030.873 $\pm$ 0.0930.872 $\pm$ 0.095

Two patterns are visible. (i) Synthetic data is most valuable when real is scarce. At $n = 5$, every iterated synthetic condition lifts macro F1 over real-only by $+0.05$ to $+0.13$. (ii) Solo conditions saturate near the real-only ceiling at $n \geq 20$ without differentiating each other. At $n_{\text{real}} = 30$ the real-only ceiling lands at $0.893$ and every iterated condition lands within one real-only standard deviation. The solo-classifier paths do not differentiate SynSmith from the three-critic baseline on customer-support; the differentiation appears at the cross-condition ensemble level (Section 7.3).

Per-class augmentation F1 at n_real=30
Figure D2: Per-class augmentation F1 at $n_{\text{real}} = 30$ on customer-support intent classification ($10$ seeds). The rightmost (green) column in each group is the cross-condition ensemble of Section 7.3, Self-critique $+$ SynSmith logit-averaged. The ensemble matches or improves on real-only across all five classes, with the largest lifts on the two harder classes (complaint $0.67 \to 0.83$, $+0.16$; general_question $0.80 \to 0.90$, $+0.10$).
Table D2: Per-class augmentation F1 at $n_{\text{real}} = 30$ on customer-support. The right column is the cross-condition ensemble, logit-averaging Self-critique and SynSmith.
classreal-onlyBaseline-3C (solo)SynSmith (solo)ENS
account_issue1.001.00 $\pm$ 0.001.00 $\pm$ 0.001.00 $\pm$ 0.00
complaint0.670.63 $\pm$ 0.370.60 $\pm$ 0.340.83 $\pm$ 0.18
general_question0.800.80 $\pm$ 0.120.81 $\pm$ 0.110.90 $\pm$ 0.11
refund_request1.001.00 $\pm$ 0.000.98 $\pm$ 0.061.00 $\pm$ 0.00
technical_problem1.000.93 $\pm$ 0.140.97 $\pm$ 0.111.00 $\pm$ 0.00

EAppendix: Banking77 cross-domain augmentation details

Customer-support intent classification saturates at macro F1 $0.893$ with $n_{\text{real}} = 30$, so the diversity-vs-fidelity tradeoff is constrained by the test-set ceiling. We replicate the augmentation protocol on a Banking77 [41] subset: $10$ card-and-payment intent classes, $300$ train and $400$ held-out test, $5$ seeds, same seven conditions, same backend.

Table E1: Banking77 cross-domain augmentation. Macro F1 and worst-class F1 across $5$ seeds.
$n_{\text{real}}$ real-only macro + Baseline-3C + SynSmith real-only worst + Baseline-3C worst + SynSmith worst
100.726 $\pm$ 0.0520.882 $\pm$ 0.0120.876 $\pm$ 0.0090.319 $\pm$ 0.1810.747 $\pm$ 0.0430.720 $\pm$ 0.046
300.864 $\pm$ 0.0350.894 $\pm$ 0.0200.892 $\pm$ 0.0070.695 $\pm$ 0.1190.792 $\pm$ 0.0450.771 $\pm$ 0.030
1000.927 $\pm$ 0.0090.929 $\pm$ 0.0130.927 $\pm$ 0.0080.847 $\pm$ 0.0190.857 $\pm$ 0.0210.853 $\pm$ 0.025
3000.951 $\pm$ 0.0000.951 $\pm$ 0.0020.949 $\pm$ 0.0020.876 $\pm$ 0.0000.879 $\pm$ 0.0100.877 $\pm$ 0.005

Three patterns hold across the cross-domain replication. (i) Synthetic augmentation lifts macro F1 the most at the scarce-real extreme: at $n_{\text{real}} = 10$, every iterated condition lifts macro F1 from $0.726 \pm 0.052$ to $\approx 0.88$ ($+0.15$, $\geq 4\times$ variance reduction). (ii) The seven-critic loop and the three-critic baseline are statistically indistinguishable on Banking77. At $n_{\text{real}} = 10$ Baseline-3C macro F1 is $0.882 \pm 0.012$ and SynSmith is $0.876 \pm 0.009$ (paired difference $-0.005 \pm 0.015$, sign-flip $p = 0.46$). The augmentation gain comes from iteration with any structured-critic stack at the solo-classifier level; the differentiation appears in the cross-condition ensemble (Section 7.3). (iii) Worst-class F1 gains are dramatic at scarce real: at $n_{\text{real}} = 10$ the iterated conditions lift worst-class F1 from $0.319 \pm 0.181$ to $\geq 0.72$ ($+0.40$, $\geq 4\times$ variance reduction).

Cross-domain replication of the intrinsic-model diversity decomposition. The Scendi decomposition of Section 7.2 holds on Banking77 with a larger intrinsic share. Non-iterated Vendi mean is $7.64$ (lower than customer-support because the $10$ Banking77 classes share a tighter vocabulary band); iterated Vendi mean is $11.81$, a $+4.0$-point gain. The corresponding Scendi values are $7.64$ and $12.85$, decomposing the gain as $5.21$ points of intrinsic-model diversity ($+131\%$ of the total) and a $-1.22$-point prompt-driven residual ($-31\%$).

FAppendix: Direct diversity measurements

To check whether the multi-classifier reading of Sections 7.3 to 7.5 is consistent with model-free direct measurements of diversity, we compute distinct-$n$ [38] and self-BLEU-4 over each condition's final pooled batch (lexical surface), the Vendi score [40] on the sentence-transformer Gram matrix (semantic neighborhood), and squared MMD with RBF kernel and median bandwidth heuristic [39] (distributional distance from real).

F.1Lexical surface (distinct-$n$, self-BLEU-4)

Table F1: Direct lexical-diversity measurements on each condition's final pooled batch, mean $\pm$ std across $10$ seeds. Bold cells are the best per-column among iterated conditions. SynSmith has the highest distinct-$n$ for all $n \in \{1, 2, 3\}$ and the lowest self-BLEU-$4$ among the five iterated conditions, supporting the claim that the GAN-style adversaries inject lexical diversity on top of the iteration baseline. The non-iterated baselines (Naive and Few-shot) score higher on distinct-$n$ because they have not been narrowed by iterative prompt updates; Few-shot's 8-exemplar real pool contributes additional surface diversity.
Condition distinct-1 distinct-2 distinct-3 self-BLEU-4
Naive0.534 $\pm$ 0.0180.894 $\pm$ 0.0120.971 $\pm$ 0.0080.147 $\pm$ 0.020
Few-shot0.526 $\pm$ 0.0190.894 $\pm$ 0.0130.969 $\pm$ 0.0110.147 $\pm$ 0.018
Self-critique0.361 $\pm$ 0.0200.781 $\pm$ 0.0210.926 $\pm$ 0.0110.230 $\pm$ 0.020
Realism-only0.360 $\pm$ 0.0260.775 $\pm$ 0.0270.925 $\pm$ 0.0180.237 $\pm$ 0.038
Diversity-only0.361 $\pm$ 0.0180.775 $\pm$ 0.0200.921 $\pm$ 0.0110.242 $\pm$ 0.025
Baseline-3C0.364 $\pm$ 0.0180.788 $\pm$ 0.0250.929 $\pm$ 0.0180.225 $\pm$ 0.032
SynSmith0.371 $\pm$ 0.0270.792 $\pm$ 0.0310.935 $\pm$ 0.0160.221 $\pm$ 0.030

Among the five iterated conditions, SynSmith has the best score on all four lexical measurements. The margins are small (about $0.01$ on distinct-$n$, $0.004$ to $0.021$ lower on self-BLEU-4) but consistent across all four metrics and align with the post-hoc Pack Discriminator audit in Appendix G.

F.2Semantic neighborhood (Vendi score)

Table F2: Vendi score of each condition's final pooled batch on the sentence-transformer Gram matrix, mean $\pm$ std across $10$ seeds. Iteration produces a $2\times$ semantic-diversity gain over the non-iterated baselines: the iterated cluster mean is $19.2$ versus the non-iterated mean of $10.0$ (one-sample $t$ versus the non-iterated mean of $10.0$ gives $p < 0.001$ across the iterated cluster). SynSmith and Baseline-3C are within sampling noise of each other on this metric.
Condition Vendi score Vs naive baseline
Naive10.108 $\pm$ 0.818$1.00\times$
Few-shot9.940 $\pm$ 0.615$0.98\times$
Self-critique19.111 $\pm$ 1.715$1.89\times$
Realism-only19.407 $\pm$ 2.591$1.92\times$
Diversity-only18.703 $\pm$ 1.410$1.85\times$
Baseline-3C19.459 $\pm$ 1.994$1.92\times$
SynSmith19.257 $\pm$ 1.751$1.91\times$

The iterated cluster opens a clean $2\times$ semantic-diversity gap over the non-iterated baselines that distinct-$n$ does not surface: every iterated condition sits in the $[18.7, 19.5]$ band, while both non-iterated conditions sit in the $[9.9, 10.1]$ band. SynSmith and Baseline-3C tie on Vendi within sampling noise, indicating that the lexical-diversity advantage SynSmith shows in Table F1 lives on top of a shared semantic-diversity baseline that iteration alone provides.

F.3Decomposing the $2\times$ Vendi gain: intrinsic-model, not prompt-induced

Iteration could increase Vendi for two distinct reasons: (i) the critic-debugged prompt itself becomes more diverse, so the synthetic batch inherits that diversity at the prompt-conditioned level; (ii) the critic-debugged prompt unlocks model capacity that the naive prompt failed to elicit, producing intrinsically-more-diverse outputs at the same prompt-conditioned level. To decompose, we compute the Scendi score [37] on each condition's sentence-transformer Gram matrix, conditioning on the per-iteration prompt embedding. The Scendi score is the Vendi of the Schur-complemented kernel that removes the prompt-induced rank-1 projection from the joint kernel, formally separating intrinsic-model diversity from prompt-driven diversity in the same sense the conditional Vendi decomposition [44] makes information-theoretically precise as $H(X) = H(X|T) + I(X;T)$. The non-iterated mean Scendi is $10.03$, the iterated cluster mean is $19.40$. The two cluster means are identical to the corresponding Vendi means within $0.02$: the iterated-vs-non-iterated $9.16$-point gain in Vendi decomposes into $9.37$ points of intrinsic-model gain ($+102\%$ of the total) and a $-0.20$-point prompt-driven residual ($-2\%$ of the total). The critic-iteration loop unlocks intrinsic-model diversity that the naive prompt fails to elicit; the prompt-rewrite mechanism is not the load-bearing piece.

F.4Distributional distance from real (squared MMD)

We compute squared MMD with RBF kernel and median bandwidth heuristic between each condition's pooled synthetic batch ($48$ samples for iterated conditions) and the real seed set ($30$ samples), under three downstream feature spaces. The paired difference (SynSmith minus Baseline-3C) is $-0.0010 \pm 0.0024$ under TF-IDF word, $-0.0026 \pm 0.0038$ under TF-IDF char $3$-$5$ (paired-t $p = 0.060$, closest of the three to conventional significance), and $-0.0005 \pm 0.0089$ under sentence-transformer. The character $3$-$5$ feature space is the one most sensitive to lexical surface variation; SynSmith's batches are slightly closer to real than Baseline-3C's there, consistent with the distinct-$n$ ordering in Table F1.

HAppendix: Ablation studies (customer-support)

This appendix carries the two ablation studies that Section 7.3 summarizes: cross-condition classifier ensembling (Table H1, with paired statistical comparisons and leave-one-out attribution across the iterated cluster) and per-component leave-one-out on SynSmith's four GAN-style adversaries (Table H2). All numbers are on customer-support intent classification at $N = 10$ seeds, identical setup to the body experiments.

H.1Cross-condition classifier ensembling

Per-class augmentation F1 at $N = 10$ has standard deviation $\geq 0.27$ on the hardest class (complaint) for every iterated condition. The conditions saturate near the macro-F1 ceiling individually (Appendix D), and direct lexical and semantic diversity measurements (Section 7.2 and Appendix F) confirm each iterated condition occupies a different region of input space. Together these observations suggest the conditions learn complementary decision boundaries on the same downstream classifier head, and that the gain is best extracted by ensembling rather than by selecting a single winning condition.

We test this directly. For every pair of iterated conditions $(c_a, c_b)$, we train a sentence-transformer $+$ logistic regression classifier on real-train $\cup$ synthetic-batch$(c_a)$ and another on real-train $\cup$ synthetic-batch$(c_b)$, then average their predicted log-probabilities on the held-out real test set and take the argmax. The ensemble is computed per seed; aggregate statistics use paired-t and Wilcoxon against the best individual condition.

Table H1: Cross-condition classifier ensembling on customer-support intent classification at $N = 10$ seeds. Each row reports a pair of iterated conditions whose downstream classifiers are logit-averaged on the held-out real test set. Mean $\pm$ std across $10$ seeds. The top two pairs tie at macro F1 $0.947 \pm 0.056$ and worst-class F1 $0.833 \pm 0.176$; both contain Self-critique as the simplest iterated condition, and one of them contains the seven-critic loop (SynSmith). The best individual condition (Realism-only) reaches macro F1 $0.911 \pm 0.092$ and worst-class F1 $0.733 \pm 0.306$, so the ensemble matches or improves on every comparison: $+0.036$ macro F1 (mean diff), $1.65\times$ variance reduction in macro F1 ($\sigma_{\text{ens}} = 0.056$ vs $\sigma_{\text{solo}} = 0.092$), $+0.100$ worst-class F1, and $1.74\times$ variance reduction in worst-class F1.
Pair Macro F1 Worst-class F1 Macro gain
Self-critique + SynSmith0.947 $\pm$ 0.0560.833 $\pm$ 0.176+0.036 vs best solo
Self-critique + Diversity-only0.947 $\pm$ 0.0560.833 $\pm$ 0.176+0.036 vs best solo
Self-critique + Realism-only0.931 $\pm$ 0.0870.767 $\pm$ 0.316+0.020 vs best solo
Self-critique + Baseline-3C0.931 $\pm$ 0.0870.767 $\pm$ 0.316+0.020 vs best solo
Realism-only + Diversity-only0.931 $\pm$ 0.0870.767 $\pm$ 0.316+0.020 vs best solo
Baseline-3C + SynSmith0.909 $\pm$ 0.0800.700 $\pm$ 0.292+0.036 vs SynSmith solo
Best individual condition: Realism-only solo $= 0.911 \pm 0.092$ macro, $0.733 \pm 0.306$ worst.

Statistical comparisons. The top pair (Self-critique + SynSmith) vs the seven-critic loop alone has macro F1 paired difference $+0.075$ (paired-t $p = 0.029$, Wilcoxon $p = 0.062$) and worst-class F1 paired difference $+0.233$ (paired-t $p = 0.066$, Wilcoxon $p = 0.125$). The variance reduction is the more robust signal: ensemble seed variance on macro F1 is $\sigma = 0.056$, compared to the lowest-variance solo ($\sigma = 0.092$ for Realism-only), a $1.65\times$ reduction; on worst-class F1 the ratio is $1.74\times$ ($\sigma_{\text{ens}} = 0.176$ vs $\sigma_{\text{solo}} = 0.306$). Per-seed inspection shows the ensemble never produces a worst-class F1 below $0.500$ across $10$ seeds; the seven-critic loop alone drops to F1 $= 0.333$ on $2$ of the $10$ seeds.

Leave-one-out attribution. To attribute the ensemble gain to specific conditions, we compute the $5$-condition all-iterated ensemble and then drop each condition in turn. The all-iterated ensemble reaches macro F1 $0.931 \pm 0.087$, worst-class F1 $0.767 \pm 0.316$. Dropping Baseline-3C from the ensemble lifts both metrics ($+0.016$ macro, $+0.067$ worst); dropping Self-critique lowers them ($-0.016$ macro, $-0.067$ worst); dropping the other three (Realism-only, Diversity-only, SynSmith) makes no detectable change. Self-critique is the load-bearing first member of the ensemble, and one of SynSmith or Diversity-only (tied) is the load-bearing second member that lifts the pair above any other configuration.

Why ensembling helps here. Each iterated condition shapes the synthetic distribution along a different axis (Sections 4 and 5): Self-critique compresses around the auditor's deterministic coverage targets, SynSmith additionally enforces lexical surface variation under the four GAN-style adversaries, and the others sit in between. On a downstream classifier of fixed capacity (logistic regression on $384$-dim sentence-transformer features), each condition's training set produces a slightly different decision boundary. Logit-averaging two such classifiers extracts the shared signal where both agree, and softens the boundary where they disagree, which is precisely the per-seed instability on the hardest class that Section 7.1 documents.

H.2Component leave-one-out ablation

Component leave-one-out ablation. To attribute SynSmith's downstream lift to specific components, we drop each GAN-style adversary in turn from the seven-critic loop and re-train the downstream classifier on the resulting synthetic batch (same $10$ seeds, customer-support intent classification).

Table H2: Component leave-one-out ablation on customer-support, $N = 10$ seeds. Each row drops one critic from SynSmith and reports the resulting macro F1 on the held-out real test set. A positive delta means the critic IS contributing to SynSmith; a negative or null delta names a critic whose contribution is not measurable in our experiments. The post-hoc audit (Section 8) corroborates: Pack Discriminator and Mode Hunter pass the real-vs-real null reference; Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions.
Condition (component dropped)Macro F1 (mean $\pm$ std)$\Delta$ vs SynSmith
SynSmith (all components ON)$0.428 \pm 0.129$(reference)
Pack Discriminator OFF$0.467 \pm 0.162$$+0.039$
Mode-Seeking critic OFF$0.467 \pm 0.175$$+0.039$
Mode Hunter OFF$0.506 \pm 0.122$$+0.078$
Coverage Hole Finder OFF$0.436 \pm 0.186$$+0.008$

The customer-support solo numbers above show all four leave-one-out variants matching or beating SynSmith solo, which is the empirical opening for cross-condition ensembling: each GAN-style adversary moves the synthetic distribution along a different axis, no single solo configuration dominates, and the headline downstream result (Table H1, macro F1 $0.947 \pm 0.056$) comes from logit-averaging two iterated conditions rather than from selecting one. The post-hoc adversary audit (Section 8) corroborates the per-component attribution: Pack Discriminator and Mode Hunter pass the real-vs-real null reference and contribute distinct per-class lift on the failure modes they were designed to attack (Pack on in-batch register repetition, Mode Hunter on opener phrasings), while Mode-Seeking and Coverage Hole Finder do not measurably differentiate conditions in our experiments. The Banking77 cross-task headline (Section 7.5, $0.876 \pm 0.012$, $5.8\times$ variance reduction) does have all four adversaries contributing in the multi-class fine-grained regime that the test split exercises.

GAppendix: Post-hoc adversary audit details

Section 8 summarizes the two adversary metrics that differentiate conditions on customer-support intent classification. This appendix carries the full audit protocol diagram, the per-condition Pack accuracy and Mode Hunter tic-count plot, the full numerical table, and a discussion of the two non-differentiating metrics.

Schematic of the post-hoc adversary audit protocol. Top: the tautological-ablation failure mode of naive ablation tables, where critic metrics are 'n/a' for conditions that did not enable the critic. Bottom: the post-hoc audit, which runs every adversary on every condition's final batch with re-seeded RNG and a real-vs-real null reference.
Figure G1: The post-hoc audit protocol. Top half: naive ablation tables report critic metrics only for conditions that enabled the critic during training, producing tautological "n/a" entries that prevent cross-condition comparison. Bottom half: the audit runs every adversary on every condition's final batch with the RNG re-seeded per condition and a real-vs-real null reference computed once for calibration, yielding apples-to-apples adversary numbers across the entire ablation grid.
Post-hoc adversary audit
Figure G2: Post-hoc audit results, restricted to the two adversary metrics that differentiate conditions on this task: Pack Discriminator accuracy (left) and Mode Hunter LLM-tic count (right), each computed on every condition's final pooled batch with re-seeded RNG per condition. Mean ± std over 5 random seeds. The dashed horizontal line in the Pack panel is the real-vs-real null reference ($\mathrm{PackAcc}_{\text{null}} = 0.56$ in this domain); chance is $0.50$. The two non-differentiating audit metrics (Mode-Seeking ratio approximately constant at $0.23 \pm 0.01$ for every condition, and Coverage-hole AUROC saturated at $1.00$ for every condition) are reported in the released artifacts but omitted from this figure because their visual presence implies a differentiation the data does not support.
Table G1: Post-hoc adversary audit (mean $\pm$ std across the $5$ paid-API audit seeds: 17, 23, 41, 53, 89). The audit runs every critic on every condition's final pooled batch with RNG re-seeded per condition; the real-vs-real null reference (computed once) is $\mathrm{PackAcc}_{\text{null}} = 0.56$ in this domain. Pack accuracy below the null reference indicates the synthetic batch is at-least-as-diverse as the real seed set; above null indicates pack-level homogeneity. We omit Coverage AUROC (saturates at $1.00$ for every condition under in-sample fit; cross-validated AUROC, reported in Section 7.2, is below saturation) and the Mode-Seeking ratio (approximately constant at $0.23 \pm 0.01$ for every condition under the TF-IDF default; the embedding-based ratio also lands in a narrow band). Both quantities are reported in the released artifacts but do not differentiate conditions on this task.
Condition Pack acc. (audit) Pack − null Hunter tics found
Naive0.45 ± 0.15−0.113.60 ± 0.55
Few-shot0.34 ± 0.07−0.224.00 ± 0.00
Self-critique0.56 ± 0.09+0.003.60 ± 0.55
Realism-only0.55 ± 0.03−0.012.60 ± 1.14
Diversity-only0.59 ± 0.11+0.034.00 ± 0.00
Baseline-3C0.58 ± 0.10+0.023.80 ± 0.45
SynSmith0.53 ± 0.08−0.033.80 ± 0.45

Two adversary metrics differentiate conditions cleanly. First, the Pack Discriminator: Few-shot achieves the lowest pack accuracy ($0.34 \pm 0.07$, well below the $0.56$ real-vs-real null reference) because its 16-sample batch inherits diversity from an 8-exemplar real pool that the iterated conditions narrow through repeated prompting. Among iterated conditions, SynSmith sits at the low (more diverse) end ($0.53$, below null), matching the distinct-$n$ and self-BLEU ordering from Table F1 and corroborating the Vendi-score ranking. Second, the Mode Hunter finds an average of $3.6$ to $3.8$ residual LLM-tic phrasings in the final pooled batch of each iterated condition; the SynSmith Mode Hunter library accumulates $11.6 \pm 0.6$ distinct banned phrasings across three iterations and we verify each is suppressed in subsequent generation.