Supplementary Information for

Domain-gated latent diffusion discovers novel HMX-class energetic materials with first-principles validation

Yehudit Aperstein^1,*†, Alexander Apartsin^2,*

¹Department of Intelligent Systems, Afeka Tel-Aviv College of Engineering, Israel.

²School of Computer Science, Faculty of Sciences, HIT–Holon Institute of Technology, Israel.

^*These authors contributed equally to this work.

^†Corresponding author. Email: apersteiny@afeka.ac.il

Companion to NMIPaper.html (main article).

Scope. This Supplementary Information accompanies the NMI Article “Domain-gated latent diffusion discovers novel HMX-class energetic materials with first-principles validation”. It provides reproducibility-grade detail beyond what fits in Methods: full per-lead chemistry (Supplementary Table S1), dataset provenance (Supplementary Note 1), complete hyperparameter tables (Supplementary Note 2), full DFT computational protocol with uncertainty bounds (Supplementary Note 3), reproducibility details and superseded ablations (Supplementary Note 4), baseline implementation details (Supplementary Note 5), and the seven detailed guidance and pool-fusion ablations whose headlines appear in main Fig. 5 (Supplementary Note 6). All section, table, and figure references in this document are SI-internal unless explicitly marked “main paper”. The reference list at the end is reproduced from the main article so the SI reads standalone.

What is not in this SI. Raw DFT geometries, optimised electronic-structure outputs, model checkpoints (LIMO encoder, two denoisers, score model at rounds 0/1/2), the 918 self-distillation hard-negative latents, per-step training curves, and the full labelled-master corpus are deposited on Zenodo (10.5281/zenodo.19821953, CC-BY-4.0). All code is on GitHub (ApartsinProjects/DGLD4Energetic, Apache-2.0).

**Supplementary Table S1.** Twelve DFT-confirmed novel leads with canonical SMILES. SMILES are RDKit-canonicalised with explicit charge separation on nitro groups. \(\rho_\text{cal}\) and \(D_\text{K-J,cal}\) are 6-anchor-calibrated DFT values (Methods, “Six-anchor DFT calibration”, main paper). Note: L3 and L16 share an identical canonical SMILES because the Pareto reranker picked up the same molecule from two different sampling lanes; both are retained in the table to reflect the merged top-100 provenance.
ID	Canonical SMILES	Formula	\(\rho_\text{cal}\) (g/cm³)	\(D_\text{K-J,cal}\) (km/s)	Chemotype family
L1	`O=[N+]([O-])c1noc([N+](=O)[O-])c1[N+](=O)[O-]`	C₃N₄O₇	2.09	8.25	3,4,5-trinitro-1,2-isoxazole
L2	`O=[N+]([O-])N=C1NC(O)ON=C1[N+](=O)[O-]`	C₃H₃N₅O₆	1.95	7.78	cyclic nitramine + nitroimine
L3	`O=C1N=C([N+](=O)[O-])C(N[N+](=O)[O-])=NON1`	C₃H₂N₆O₆	2.00	7.50	polynitro 1,2,4-oxadiazinone
L4	`O=[N+]([O-])C1=NCN=NN1[N+](=O)[O-]`	C₂H₂N₆O₄	1.94	6.71	tetrazoline N-nitramine
L5	`O=C(C=NO[N+](=O)[O-])[N+](=O)[O-]`	C₂HN₃O₆	1.94	7.99	acyl oxime nitrate
L9	`NN(C=NC=C(N(N[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-]`	C₃H₅N₉O₈	1.91	7.33	open-chain N-rich nitramine
L11	`O=C(N[N+](=O)[O-])C(N=C[N+](=O)[O-])[N+](=O)[O-]`	C₃H₃N₅O₇	1.90	7.88	acyl nitramine + nitroimine
L13	`O=[N+]([O-])C=NOC=NN[N+](=O)[O-]`	C₂H₃N₅O₅	1.86	7.36	nitroimine + nitramine chain
L16	`O=C1N=C([N+](=O)[O-])C(N[N+](=O)[O-])=NON1`	C₃H₂N₆O₆	2.00	7.50	polynitro 1,2,4-oxadiazinone (= L3 by canonical SMILES; retained from a different sampling lane)
L18	`O=[N+]([O-])C=C(N=NCN[N+](=O)[O-])[N+](=O)[O-]`	C₃H₄N₆O₆	1.84	7.09	azo-bridged dinitromethyl
L19	`O=CC(O[N+](=O)[O-])[N+](=O)[O-]`	C₂H₂N₂O₆	1.91	7.24	open-chain nitrate ester + nitro
L20	`O=C1N=C(N[N+](=O)[O-])C(=N1)[N+](=O)[O-]`	C₃HN₅O₅	1.98	7.41	polynitro 1,3,5-triazinone (extension lead)

Supplementary Notes

Supplementary Note 1: Dataset details

1.1 Data sources and provenance

The labelled master, the unlabelled SMILES dump, and the motif-augmented expansion (Methods “Dataset and four-tier label hierarchy” (main paper)) are assembled from the following public and quasi-public sources. Per-source row counts are post-canonicalisation, post-charge-filter, and post-token-length filter; they sum to more than the master because most molecules carry labels from several sources.

**Table 1.1.** Per-source provenance of the labelled master, the unlabelled SMILES dump, and the motif-augmented expansion. Per-source row counts are post-canonicalisation and post-charge-filter and pre-cross-source-deduplication.
Role	Source	Approx. rows contributed	Tier(s)	Citation
Labelled master (energetics)	Hand-compilation of experimental \(\rho\), HOF, \(D\), \(P\) for known nitro / nitramine / azide / tetrazole compounds, drawn from Klapötke (2019), Cooper (1996), and the LLNL Explosives Handbook (Dobratz & Crawford 1985)	~3 000	A	[1]
Labelled master	Casey et al. 2020 detonation-property dataset (curated CHNO + CHNOClF subset, B3LYP/6-31G* HOF + experimentally-anchored densities)	~9 000	B	[2]
Labelled master	Kamlet–Jacobs derived rows: density and HOF from literature compilations propagated through the K-J equation to give surrogate \(D, P\)	~25 000	C	[3]
Labelled master	3D-CNN smoke-model surrogate (in-house ensemble trained on the union of A+B+C, eight outputs, 5-fold CV \(R^2 = 0.84\–0.92\)) used to label rows that have no other source	~30 000	D	this work
Negatives for viability classifier	ZINC-15 random subsample (drug-like, neutral, MW < 500), used as inert chemistry against which the RandomForest viability classifier is trained	80 000	n/a	[4]
Hard negatives	Cheats mined from earlier generation pools (gem-tetranitro on small rings, polyazene chains, trinitromethane, gem-dinitrocyclopropane, etc.), encoded through LIMO and used as viab=0 latents in score-model training	918	n/a	this work, Methods (main paper)
Hazard SMARTS catalog	NIST CAMEO Chemicals reactivity database + ChemAxon “dangerous reactivity” SMARTS, distilled into the Bruns–Watson demerit catalog used by the SMARTS gate	n/a (rules)	n/a	[5], [6]
Unlabelled domain corpus	Augmented CHNO / CHNOClF SMILES dump assembled from public energetic-chemistry compilations (Klapötke reviews, patent-extracted SMILES, energetic-materials review papers); used only for the unconditional prior	~380 000	none	[7]
Motif-augmented expansion	Programmatic substitution of nitro, nitramine, azide, cyano, tetrazole, oxadiazole groups onto the labelled master scaffolds, then re-canonicalised and de-duplicated	~1.08 M	none (only used in unlabelled prior)	this work
External novelty database	PubChem 2023 (PUG REST query on canonical SMILES) used to assess novelty of generated leads, not for training	n/a	n/a	[8]
Stability triage tool	xTB 6.6.1 (GFN2-xTB) used at sample-time triage on top leads (\(\le\)100 candidates), not in the training set	n/a	n/a	[9]

What is and is not in the master. The labelled master contains canonical, neutral, single-component CHNOClF SMILES with at least one of {density, HOF, \(D\), \(P\)}. Salt fragments, mixtures, organometallics, isotopologues, and radicals are filtered. Per-row labels are aggregated to the highest-tier value available for each property; a row with experimental \(\rho\) (Tier A) and Kamlet–Jacobs \(P\) (Tier C) carries both with their respective tiers, and only the Tier-A density participates in the conditional gradient during diffusion training. No private, embargoed, or classified data was used; every row traces to a public or in-house-derived source.

Row-count overlap and deduplication. The per-source counts in the table above are pre-overlap. After cross-source canonical-SMILES deduplication the labelled master contains \(\sim 65\,980\) unique molecules (Tier-A \(\cap\) Tier-B is small but non-zero: \(\sim\)600 molecules carry both an experimental measurement and an independent DFT-derived label for at least one property; these contribute to a held-out cross-tier consistency check used during data-readiness audits). The unlabelled \(\sim 380\,\)k corpus and the motif-augmented \(\sim 1.08\,\)M expansion are deduplicated against the labelled master and against each other, yielding a final pretraining pool of \(\sim 694\,\)k unique canonical SMILES (this is the corpus referenced in the Results “Novelty and synthesisability” (main paper) Tanimoto stratification). Anchor compounds (RDX, HMX, TATB, FOX-7, PETN, NTO, TNT, etc.) appear in the labelled master and are not filtered from generation; their occasional surfacing in gated runs is a sanity-check rediscovery rather than a leak.

1.2 Data preprocessing details

The labelled master is built by canonicalising every input SMILES with RDKit, dropping rows where canonicalisation fails or the molecule is charged after canonicalisation (we ban net charges), and aggregating per-molecule label rows so that each molecule carries the highest-tier value available for each of (density, HOF, det-velocity, det-pressure). Tier ordering is A > B > C > D. Heat-of-formation values are reported in kJ/mol; densities in g/cm³; velocities in km/s; pressures in GPa. We drop molecules with more than 72 SELFIES tokens (the LIMO max) and molecules whose token vocabulary falls outside the LIMO 108-token alphabet. The motif-augmented expansion exhaustively substitutes nitro, nitramine, azide, cyano, tetrazole, and oxadiazole groups onto the labelled scaffolds, then re-canonicalises and re-deduplicates.

Atom-composition descriptors — **Figure 1.1.** Atom-composition statistics over a 30k subsample of the labelled corpus: molecular weight, nitro-group count, and joint C-vs-N atom counts. The corpus is heavily concentrated in CHNO chemistry with 2–6 nitro groups and MW between 150 and 350.

Supplementary Note 2: Model architecture and hyperparameter values

2.0 Architecture summary

DGLD instantiates four neural networks plus two non-neural label sources. (1) The LIMO SELFIES-VAE encodes every SMILES once into a deterministic 1024-d latent μ that the diffusion model treats as its data. (2) The denoiser ε_θ is a 44.6 M-parameter FiLM-ResNet that reverses a 1000-step cosine DDPM in latent space; the production sampler runs two such denoisers in parallel (DGLD-H + DGLD-P, Results (main paper)). (3) The multi-task score model is a 4-block FiLM-MLP with a shared 1024-d trunk over (z, σ) with three active steering heads (Viability, Sensitivity, Hazard) and three auxiliary multi-task heads (Performance, SA, SC; Supplementary Note 2.4); its per-head autograd gradients are the steering bus injected into DDIM at every sampling step. (4) Two external label sources, run only during training-label preparation, supply the targets for these heads: a Random-Forest viability classifier on Morgan FP+RDKit descriptors, and a Uni-Mol-based 3D smoke-model 2-fold ensemble that emits (ρ, D, P, T, E, V, HOF, BDE). Numeric layer counts, widths, parameter counts, and training optima for all four networks plus the Random Forest and smoke ensemble are tabulated below in Tables 2.1a (LIMO VAE), 2.1b (denoiser + score model), and 2.1c (label sources + reranker).

External label sources are training-time only. The Random-Forest viability classifier and the 3D-CNN/Uni-Mol smoke-model ensemble are not executed at every diffusion step. They are run once, offline, over the cached training corpus to manufacture the per-row labels (\(y_\text{viab}\), \(y_\rho\), \(y_D\), \(y_P\), HOF, etc.) that the multi-task score model is trained to regress and classify. At sample time, the only networks actually evaluated per DDIM step are the denoiser (one of DGLD-H or DGLD-P depending on the pool-fusion lane, Methods (main paper) / Discussion (main paper)) and the multi-task score model; the Random Forest and smoke ensemble do not appear in the sampling loop. The smoke ensemble is additionally invoked once per surviving candidate at re-ranking (Stage 1) to produce the (8-output) post-rerank property scores.

**Table 2.1a.** LIMO SELFIES-VAE hyperparameters. Source of truth for all LIMO-related numeric values referenced in Methods “Dataset and four-tier label hierarchy” (main paper) and Methods (main paper).
Component	Hyperparameter	Value
LIMO SELFIES-VAE	Latent dim	1024
	SELFIES max length	72
	Vocabulary size	108
	Embedding dim	64
	Encoder layers	Linear(72·64=4608→2000)–ReLU–Linear(2000→1000)–BN1d–ReLU–Linear(1000→1000)–BN1d–ReLU–Linear(1000→2·1024)
	Decoder layers	Linear(1024→1000)–BN1d–ReLU–Linear(1000→1000)–BN1d–ReLU–Linear(1000→2000)–ReLU–Linear(2000→72·108=7776)
	Activation	ReLU (encoder & decoder hidden); log-softmax over vocab on output
	Loss	NLL reconstruction + β·KL (standard VAE ELBO; free-bits disabled)
	KL weight β	0.01 (constant)
	Fine-tune steps	~8500 on 326 k energetic-biased SMILES
	Code	`dgld/limo_model.py` (LIMOVAE class)

**Table 2.1b.** Denoiser and multi-task score-model hyperparameters. Source of truth for all denoiser and score-model numeric values referenced in Methods “Dataset and four-tier label hierarchy” (main paper) and Methods (main paper).
Component	Hyperparameter	Value
Denoiser ε_θ (DGLD-P)	Backbone	FiLM-modulated ResNet, 8 residual blocks, latent 1024, hidden 2048
	Per-block layers	LayerNorm–Linear(1024→2048)–FiLM(γ,β)–SiLU–Linear(2048→1024) with residual add
	Parameter count	44.6 M
	Time-embed dim	256 (sinusoidal → 2-layer MLP)
	Property-embed dim	64 per property × 4 properties = 256, plus learned mask token
	Activation	SiLU
	Loss	masked MSE on predicted noise ε with mask m
	T (diffusion steps)	1000 (cosine schedule, Nichol & Dhariwal)
	EMA decay	0.999
	Optimizer / LR / wd	AdamW / 1e-4 / 1e-4 (cosine decay)
	cfg-dropout rate	0.10
	property-dropout rate (DGLD-P)	0.30
	oversample-extreme factor / quantile	5×–10× above 90th percentile (high tail; matches Methods “Dataset and four-tier label hierarchy” (main paper))
Multi-task score model	Backbone	4-block FiLM-MLP, latent 1024, hidden 1024, σ-embed 128
	Per-block layers	Linear(1024→1024)–LayerNorm–FiLM(σ-emb→(γ,β) over 1024)–SiLU–Dropout(0.1) with residual add
	σ embedding	Linear(1→128)–SiLU–Linear(128→128)
	Heads (6, parallel on 1024-d trunk)	Linear(1024→256)–SiLU–Linear(256→d_k); d_k=1 for viability/SA/SC/sensitivity/hazard; d_k=4 for performance (ρ, HOF, D, P)
	Activation	SiLU (trunk); sigmoid on viability/hazard logits at use-time
	Per-head loss	BCE-with-logits (viability, hazard); smooth-L1 on z-scored target (SA, SC, sensitivity, performance), performance masked to Tier-A/B
	Noise-level curriculum	σ ~ U(0, σ_max=2.0) per batch; z_t=z₀+σ·ε
	Optimizer / LR / wd	AdamW / 2e-4 / 1e-4 (cosine decay), grad-norm clip 1.0
	Steps / batch	~40k AdamW steps / batch 1024 (matches Discussion (main paper))
	Sample-time guidance	autograd of head losses w.r.t. z_t; per-head clamp + alpha-anneal disabled (Discussion (main paper))
	Code	`dgld/train_multihead_latent.py` (MultiHeadScoreModel)

**Table 2.1c.** External label sources and re-ranking pipeline configuration. The Random Forest and 3D smoke-model ensemble run only at training-label preparation (and Stage-1 reranking for the smoke ensemble), not in the per-step sampling loop.
Component	Hyperparameter	Value
Random-Forest viability classifier (training-label source)	Algorithm	scikit-learn RandomForestClassifier (CPU)
	n_estimators / max_depth / min_samples_leaf	200 / 22 / 4 (v2-hardneg; v1 uses CLI flags: see code)
	Class weight / random_state	balanced / 42
	Feature dim	~2089 = 2048 Morgan FP (radius 2) + 10 RDKit descriptors + 11 derived CHNO/OB features
	Training data	66 k labelled energetics positives vs 80 k ZINC-250k random negatives + mined hard negatives (Supplementary Note 2.1)
	Validation AUC	0.9986 (v2-hardneg test split)
3D smoke-model ensemble (training-label source & Stage-1 scorer)	Backbone	Uni-Mol v1 (Zhou et al. 2023) regression head, 2-fold scaffold-CV ensemble
	Parameter count	~84 M per fold × 2 folds
	Inputs	3D conformer atom-coordinate set (Uni-Mol native); not a raw voxel grid
	Outputs (8)	ρ, D, P, det-temperature T, det-energy E, det-volume V, HOF, BDE
	Validation R²	0.84–0.92 (5-fold CV, per output)
	Wrapper / inference	`scripts/diffusion/unimol_validator.py`; reset every 80 batches
Re-ranking (Stage 1–2)	SA cap	5.0
	SC cap	3.5
	Tanimoto window	(0.20, 0.55)

2.1 Hard-negative augmentation (self-distillation)

Why it's needed. The Viability head's labels at round 0 are produced by \(y_{\text{viab}} = \mathbf{1}[\text{SMARTS pass}] \cdot P_{\text{RF}}(\text{viable}\mid \text{SMILES})\) (cf. Results (main paper)). The Random Forest was trained on (labelled energetic corpus = positives, ZINC drug-like = negatives), a clean two-class boundary. The diffusion sampler, however, operates in latent regions between those two distributions, where the Random Forest generalises poorly: it scores high on small-skeleton polynitro and polyazene-chain motifs that are physically impossible. Self-distillation closes that gap by mining the model's own false positives in those gap regions and re-feeding them with explicit \(y_{\text{viab}}=0\) labels. The score model is trained over three rounds (Extended Data Fig. 5, main paper): round 0 trains on corpus only and mines 137 false-positive cheats via the SMARTS gate; round 1 trains on corpus + 137 cheats and mines up to 918 cumulative; round 2 trains on corpus + 918 cheats and is the production checkpoint. The held-out probe (7 anchors: RDX, HMX, TNT, FOX-7, PETN, TATB, NTO; 5 cheats: gem-tetranitro on C2, trinitromethane, gem-dinitrocyclopropane, polyazene-NO₂ chain, dinitromethane) confirms anchors hold at 0.86–0.99 while cheats are demoted to 0.12–0.84. Supplementary Note 6.1 reports an early variant (137 cheats) versus the production variant (918 cheats).

What's frozen vs trained. Across all rounds the LIMO encoder, LIMO decoder, denoiser (DGLD-H and DGLD-P), Random Forest viability classifier, and SMARTS rulebook are FROZEN. The only thing that updates between rounds is the score-model weights (shared trunk + heads); the Viability head receives the explicit hard-negative supervision, while the other heads see hard-negative latents as unlabelled (\(a_k = 0\) in their availability mask, Results (main paper)). The score model after round 2 (corpus + 918 hard negatives) is the production checkpoint used in Results (main paper).

Procedure summary. The five-step mine-then-retrain loop is given in Methods (main paper, “Multi-task score model and self-distillation”) and is not re-stated here; the held-out 7-anchor / 5-cheat probe (anchors \(\ge 0.86\) AND cheats \(\le 0.84\)) is the stopping criterion, satisfied at round 2 (918 cumulative hard negatives).

Why a teacher Random Forest rather than direct binary labels. The score-model head trains on noised latents \(z_t\), so we need a label-from-SMILES function that can be applied at \(z_0\) once and the label carried through forward diffusion; a static (corpus, ZINC) membership table cannot label augmented rows or self-distilled hard negatives, since neither set has a fixed identity in either training distribution. The Random Forest gives a continuous probability in [0, 1] that yields smooth gradients (binary \(\{0, 1\}\) labels would zero the gradient almost everywhere on the sigmoid output). Decoupling the cheap CPU teacher (the Random Forest, retrainable in seconds when the corpus changes) from the expensive GPU student (the score-model head, which costs a full diffusion-pipeline retrain to update) lets either be revised independently. The five-stack of (1) Random Forest teacher, (2) SMARTS rules gate, (3) self-distillation hard negatives, (4) noised-latent training, (5) gradient-mode at sample time is what makes this label scheme contribute the headline lift despite the Random Forest alone being a coarse "energetic vs drug" discriminator.

Within a single call to train_score_model the hard-negative set is fixed input data, indistinguishable from the labelled corpus rows; the only "self-distillation" is the outer-loop iteration where the next round's training data depends on the current round's model. Operational note: a third-party reproduction can skip rounds 0–1 entirely by loading the published 918 hard-negative latents (released alongside the labelled master, Discussion (main paper)) and running only one round of train_score_model with corpus + 918 negatives. This is the recipe used by the Zenodo-deposited production checkpoint.

2.2 Sensitivity head: heuristic vs literature-grounded variants

The default 5-head score model trains the sensitivity head against the hand-coded heuristic of Supplementary Note 2.2 (nitro density, open-chain N-NO₂ count, small-skeleton penalties), then z-scored. Because guiding generation with gradients of a head trained on a heuristic is partially tautological as a measurement of sensitivity reduction, we additionally fine-tune a literature-grounded variant: the trunk and the four other heads are frozen at the default-score-model weights, the sensitivity head is re-initialised, and the head is fit on 306 (canonical SMILES, observed h₅₀) pairs from the Huang & Massa compilation [10] mapped to a sigmoid sensitivity proxy in [0, 1] (h₅₀=40 cm pivot, slope 1.5 in log-h₅₀ units). Training mixes a smooth-L1 loss on the 276-row h₅₀ train split with a teacher regulariser pulling the new head toward the default-head prediction on 383 k auxiliary latents, weighted 1.0:0.25. The retrained head reaches Pearson r=+0.71 on a 30-row held-out h₅₀ split and produces qualitatively correct relative orderings on a 5-anchor smoke test (TATB, h₅₀~490 cm, scored 0.37; trinitro-isoxazole lead scored 0.55). The literature-grounded variant is a drop-in replacement for the default score model at sample time.

2.3 Per-head gradient-norm rebalancing (optional)

Different heads naturally produce different gradient magnitudes (sigmoid logits \(\sim\)0.5; smooth-L1 z-scores \(\sim\)2.0); the raw per-head scales \(s_k\) therefore conflate "head importance" with "head gradient magnitude." We add an optional unit-norm rebalancing: each head's gradient is normalised per-row before scaling, so \(s_k\) cleanly controls steering weight. This makes the head-scale ablation cleanly interpretable; it is a no-op when only one head is active.

2.4 Drug-domain transfer heads (SA, SC)

The score model architected in Discussion (main paper) has six heads. Four of them (viability, sensitivity, hazard, performance) are domain-native: each head's training-label source draws on energetic-domain data or chemotype rules (Random Forest discriminator on Morgan-FP descriptors of energetic vs ZINC; Politzer–Murray BDE chemotype-class fit; chemist-curated SMARTS hazard catalog plus Bruns–Watson demerits; 3D-CNN smoke ensemble trained on Tier-A/B labelled energetic compounds). The remaining two heads, SA (Ertl synthetic-accessibility regression, label source RDKit's sascorer.py) and SC (Coley SCScore retrosynthetic-complexity regression, label source Coley's pretrained SCScorer), are auxiliary multi-task heads (Performance, SA, SC): their label sources are calibrated on drug-discovery reaction corpora (Reaxys and pharma chemistry) and penalise uncommon ring fusions and rare fragments well-precedented in energetic-materials synthesis (CL-20-class polyaza-cages, hexanitrobenzene). The same drug-domain transfer issue applies to the rerank-time SA + SCScore caps (5.0 / 3.5) in Discussion (main paper); that is a separate layer (post-decode hard caps on canonical SMILES) and is unaffected by the discussion here.

Empirical sample-time evaluation. SA was tested at sample time in the Supplementary Note 6.4 SA-axis matrix at \(s_{\text{SA}} = 0.15\). The result: turning SA-gradient on raised the top-1 composite penalty from 0.618 \(\pm\) 0.056 (SA-C2 viab+sens, no SA) to 0.698 \(\pm\) 0.015 (SA-C3 viab+sens+SA), a 0.080 (13 %) worsening. The interpretation is that the drug-domain SA score actively penalises the very chemotype family the diffusion sampler is targeted at. Production therefore uses \(s_{\text{SA}} = 0\). SC was retained as an architectural slot in the production score model for backward-compatibility with the 5-head predecessor (Supplementary Note 2.1) but was never plumbed into the sample-time gradient sum (no \(s_{\text{SC}}\) is defined at sample time anywhere in the codebase or experiments); we make no empirical claim about its sample-time utility in this domain. The natural follow-up ablation, sweeping \(s_{\text{SC}}\) at non-zero values to test whether the drug-domain SC label has a similar inversion effect to SA, is left to future work.

Why retain the heads. Both SA and SC heads share the 1024-d trunk used by the four domain-native heads; their trained weights add \(\sim\)2 \(\times\) 256k parameters (negligible against the \(\sim\)8 M parameter score-model trunk; cf. Table 2.1b) and the per-head supervised signal during training does not appear to harm the domain-native heads on the held-out anchor/cheat probe (the six-head production model and its five-head predecessor score the same on the 7-anchor and 5-cheat test sets within the per-round monitoring tolerance of Supplementary Note 2.1). The cost of retaining is low; the cost of dropping (re-training the score model, re-running Supplementary Note 6.4 to confirm no SA-axis dependency) is non-trivial. The heads are therefore retained.

Cross-references. The same SA / SC story appears in three places relevant to the headline: Discussion (main paper) Architecture (drug-domain transfer paragraph); Discussion (main paper) Sample-time gradient (the "Why \(s_{\text{SA}} = 0\) and where SC is" paragraph); Supplementary Note 6.4 Multi-seed pattern (the SA-C3-vs-SA-C2 numerical result). This Supplementary Note consolidates them.

2.5 Production recipe (Table 2.5)

Complete configuration used throughout Results (main paper). Each knob traces to the Results “Component contributions” (main paper) ablation (or Methods; see also Methods (main paper) hyperparameter sweep) that selected it.

**Table 2.5.** Production recipe for the Results (main paper) experiments. Knobs grouped by stage: generative model, score model, sampler, filter pipeline (Stages 1–4), reranker. “Justified by” points to the ablation or sweep that fixed the value.
Knob	Production value	Justified by
Generative model	LIMO VAE + conditional latent DDPM with classifier-free guidance	Supplementary Note 6.2
Denoiser ensemble	DGLD-H + DGLD-P, pools fused post-decode	Results (main paper); Supplementary Note 6.4
CFG scale w	7	Methods (main paper); Extended Data Fig. 3
DDIM steps	40	prior-art default; not separately swept
Pool size	\(\ge\) 40k per denoiser, fused	Supplementary Note 6.5; Extended Data Fig. 2 (main paper)
Score-model training	4-block FiLM-MLP trunk, 6 heads, multi-task	Discussion (main paper); Supplementary Note 6.1 (HN budget = 918)
Active steering heads	Viability (\(s = 1.0\)), Sensitivity (\(s = 0.3\)), Hazard (\(s = 1.0\))	Supplementary Note 6.3, Supplementary Note 6.4
Inactive heads	Performance, SA, SC (auxiliary multi-task only)	Supplementary Note 6.4
Anneal / clamp	\(\sigma_{\max} = 0\), \(C_g = 50\)	Methods (main paper); Supplementary Note 4.5
Filter Stage 1	Chemist-curated SMARTS gate	Methods; see also Methods (main paper)
Filter Stage 2	Pareto reranker on 4 axes, composite tie-break, top-K = 100	Methods; see also Methods (main paper)
Filter Stage 3	GFN2-xTB, HOMO–LUMO gap \(\ge\) 1.5 eV	Methods; see also Methods (main paper); chemist-set threshold
Filter Stage 4 – DFT	B3LYP/6-31G(d) opt + \(\omega\)B97X-D3BJ/def2-TZVP single-point	Methods; see also Methods (main paper); Supplementary Note 3.1
Filter Stage 4 – calibration	6-anchor LOO fit: \(\rho_{\text{cal}} = 1.392\,\rho_{\text{DFT}} - 0.415\), HOF\(_{\text{cal}}\) = HOF\(_{\text{DFT}} - 206.7\)	Results (main paper); Supplementary Note 3.4
Filter Stage 4 – K-J recompute	Closed-form K-J on calibrated \((\rho_{\text{cal}}, \text{HOF}_{\text{cal}})\); ranking-grade only	Results (main paper); Supplementary Note 3.5
Reranker property scoring	`perf_score(\(\rho\), HOF, \(D\), \(P\))` ramp on 3D-CNN smoke ensemble outputs	Methods; see also Methods (main paper)

Supplementary Note 3: First-principles audit (DFT) methodology and uncertainty bounds

This Supplementary Note collects the DFT-methodology details that support Results (main paper) (per-lead validation and anchor calibration). Results (main paper) reports only the result table; readers who want to assess the methodology choices, the systematic-bias bands, and the calibration-extension recommendations should refer to this Supplementary Note.

3.1 Functional and basis-set choice

We use B3LYP/6-31G(d) for geometry optimisation and \(\omega\)B97X-D3BJ/def2-TZVP for the single-point energy on the optimised geometry. The B3LYP/6-31G(d) geometry is a long-standing standard for CHNO molecules (Casey et al. [2] use the same level for the 3D-CNN training labels we calibrate against), and the \(\omega\)B97X-D3BJ functional is recommended for energetic-materials thermochemistry by recent benchmarks (Goerigk et al., 2017). Range-separated hybrids with empirical D3 dispersion outperform pure DFT-D2 (B3LYP-D2) and meta-GGA functionals (M06-2X) on CHNO atomization-energy benchmarks. Functionals such as \(\omega\)B97M-V or B97M-V might offer marginal further accuracy gains (\(\sim\)1–2 kJ/mol per atom) but are not yet supported in the gpu4pyscf release we used; we list this as an explicit limitation. The basis def2-TZVP is the standard balanced triple-zeta with polarisation; basis-set extrapolation to def2-QZVPP would reduce the residual basis-set incompleteness error by \(\sim\)50 % but doubles the compute time and was deferred. For high-N azole rings (oxatriazole, tetrazole) B3LYP/6-31G(d) overestimates N-heteroatom bond lengths by ~0.01–0.02 Å relative to experimental crystal geometries, contributing to the elevated calibration slope (1.392) and representing an additional density uncertainty source beyond the packing-factor spread already quantified.

3.2 Thermal correction to 298 K

All HOFs in Results (main paper) are reported at 0 K (atomization-energy method gives \(\Delta H_f^{0\,\text{K}}\)). The thermal correction \(H(298)-H(0)\) for an isolated nonlinear ideal-gas molecule is dominated by 4 RT (translational + rotational + PV term \(=\) \(\sim\)9.9 kJ/mol) plus a small low-frequency-vibration contribution (each mode \(<200\) cm⁻¹ contributes \(\sim\)RT \(=\) 2.5 kJ/mol). For our top-10 leads with 0–6 low-frequency modes each, \(H(298)-H(0)\) ranges 10–25 kJ/mol, smaller than the \(\pm 64.6\) kJ/mol LOO uncertainty from the 6-anchor calibration of the atomization energy (3.4). We report 0 K HOFs throughout for consistency; converting to 298 K standard HOF would shift values uniformly by ~10–25 kJ/mol with no impact on relative rankings or the K-J residual.

3.3 Calibration uncertainty: literature bias bands

The 6-anchor LOO RMS quoted in Results (main paper) (\(\pm 0.078\) g/cm³ on \(\rho\), \(\pm 64.6\) kJ/mol on HOF) is the internal fit-quality diagnostic. As an external sanity check we cross-reference these residuals against the published B3LYP/6-31G(d) systematic-bias literature. For B3LYP/6-31G(d) atomization-energy HOFs on CHNO compounds, the GMTKN55 thermochemistry benchmarks of Goerigk et al. [11] place the systematic atomization-energy bias of B3LYP with a double-zeta polarised basis at approximately 5–15 kJ/mol per atom, which scales to \(\sim\)100–250 kJ/mol at the molecular level for our 15–25-atom leads, consistent with the +170 to +224 kJ/mol offsets we observe at the RDX/TATB anchors. Casey et al. [2] report B3LYP/6-31G* HOF residuals on a CHNO test set with molecule-level RMSE in the \(\sim\)30–50 kJ/mol band (\(\sim\)5–10 kJ/mol per heavy atom), which is the closest published analogue to our setup and which we adopt as the residual-uncertainty band on calibrated HOFs in Results (main paper). For density, the Bondi van-der-Waals volume estimator [12] requires a packing coefficient that ranges 0.65–0.78 across observed CHNO crystal structures; our fixed value of 0.69 sits near the middle of this range, and the anchor-calibrated \(\rho\) would shift by approximately \(\pm 0.15\) g/cm³ across the full packing-coefficient interval, which dominates the \(\rho_{\text{cal}}\) uncertainty for any single lead.

3.4 6-anchor calibration

The 6-anchor calibration on (RDX, TATB, HMX, PETN, FOX-7, NTO) gives \(\rho_{\text{cal}} = 1.392\,\rho_{\text{DFT}} - 0.415\) and HOF\(_{\text{cal}} = \) HOF\(_{\text{DFT}} - 206.7\) kJ/mol, with leave-one-out RMS residuals of \(\pm 0.078\) g/cm³ on calibrated \(\rho\) and \(\pm 64.6\) kJ/mol on calibrated HOF. The four added anchors (HMX, PETN, FOX-7, NTO) were computed on two independent GPU compute runs (separate hosts); cross-run consistency was \(\Delta\rho \le 0.0002\) g/cm³ and \(\Delta\)HOF\(\,=\,0.0\) kJ/mol on every anchor, demonstrating run-to-run determinism of the GPU4PySCF pipeline. The density slope of \(1.392\) sits in the physically expected \([1.0, 2.0]\) range for the Bondi-vdW + 0.69-packing estimator under DFT-optimised geometry; the previously reported 2-anchor slope of 4.275 (RDX/TATB only) was a 2-point regression artefact corrected by the 6-anchor extension. The 6-anchor calibration is the source of truth for Results (main paper); per-anchor fit residuals are tabulated in Supplementary Note 3.4. The calibrated DFT numbers in Results (main paper) remain ranking-grade rather than absolute-value-grade, with residual atomization-method uncertainty of approximately 5–15 kJ/mol per atom for B3LYP/6-31G(d) HOFs (Goerigk and Casey benchmarks). (The TATB experimental HOF = −141 kJ/mol is the 298 K condensed-phase value; applying HOF_cal = HOF_DFT − 206.7 to our 0 K DFT output (+82.9 kJ/mol) gives −123.7 kJ/mol, a 17 kJ/mol residual attributable to the 0 K → 298 K thermal correction not included in the linear offset; this is within the ±64.6 kJ/mol LOO RMS.)

3.5 Per-lead DFT-recomputed property table

Full per-lead B3LYP/6-31G(d) optimisation + \(\omega\)B97X-D3BJ/def2-TZVP single-point results for the chem-pass leads and the two anchors (RDX, TATB):

**Table 3.1.** Per-lead B3LYP/6-31G(d) optimisation + ωB97X-D3BJ/def2-TZVP single-point *structural* validation for the 12 chem-pass leads and the two anchors (RDX, TATB). Columns: lead ID; molecular formula; atom count; number of imaginary frequencies (0 = real local minimum); minimum real-mode wavenumber; zero-point vibrational energy. Per-lead numbers from `m2_summary.json` and `m2_lead_*.json`. DFT densities, HOFs, and h₅₀ predictions are in Table 3.1c.
ID	Formula	n_atoms	ν_min (cm⁻¹)	ZPE (kJ/mol)
L1	C₃N₄O₇	14	13.4	169.7
L2	C₃H₃N₅O₆	17	47.1	263.1
L3	C₃H₂N₆O₆	17	35.9	244.7
L4	C₂H₂N₆O₄	14	52.0	205.5
L5	C₂H₁N₃O₆	12	23.8	153.0
L9	C₃H₅N₉O₈	25	0.0	396.4
L11	C₃H₃N₅O₇	18	9.1	271.4
L13	C₂H₃N₅O₅	15	24.8	232.1
L16	C₃H₂N₆O₆	17	35.9	244.7
L18	C₃H₄N₆O₆	19	12.9	303.0
L19	C₂H₂N₂O₆	12	47.6	172.8
L20	C₃H₁N₅O₅	14	0.0	190.6
RDX	C₃H₆N₆O₆	21	40.3	375.0
TATB	C₆H₆N₆O₆	24	6.8	419.1

Note on L3/L16. These two entries share identical formula, atom count, frequencies, and ZPE because they are the same compound (oxadiazinone-N-nitramine) generated under two distinct SMILES representations; they are counted as one unique structure in all scaffold-diversity tallies.

Note on L9/L20. ν_min = 0.0 cm⁻¹ reflects a near-zero torsional frequency (soft dihedral mode within numerical precision of the geometry optimizer) rather than an imaginary mode or saddle point; n_imag = 0 confirms these are real local minima.

**Table 3.1c.** DFT-derived *properties* for the 12 chem-pass leads and the two RDX/TATB anchors under the 6-anchor calibration. ρ_cal = 1.392·ρ_DFT − 0.415; HOF_cal = HOF_DFT − 206.7 kJ/mol (6-anchor RDX/TATB/HMX/PETN/FOX-7/NTO fit; LOO RMS ±0.078 g/cm³ on ρ and ±64.6 kJ/mol on HOF). h_50,model from the h50 score-model head (Huang and Massa [10]); h_50,BDE via Politzer–Murray 2014 chemotype-class linear fit h₅₀=1.93·BDE−52.4 (BDE class typicals: Ar–NO₂ 70, R₂N–NO₂ 47, R-CH-NO₂ 55, R-O-NO₂ 40 kcal/mol). Anchor sanity check: RDX h_50,model=26.3 cm vs literature 25–30; TATB 89.2 vs literature 140–490 (both routes underpredict TATB; H-bond cushioning is not in the BDE-only class scheme).
ID	ρ_DFT (g/cm³)	ρ_cal (g/cm³)	HOF_DFT (kJ/mol)	HOF_cal (kJ/mol)	h_50,model (cm)	h_50,BDE (cm)
L1	1.801	2.093	+229.5	+22.9	30.3	82.7
L2	1.699	1.949	+117.3	-89.3	33.5	53.8
L3	1.731	1.995	+322.0	+115.4	82.6	38.3
L4	1.692	1.941	+498.8	+292.1	27.8	38.3
L5	1.693	1.942	+53.6	-153.1	33.4	24.8
L9	1.669	1.909	+536.5	+329.8	21.9	38.3
L11	1.663	1.900	+47.6	-159.0	26.5	38.3
L13	1.634	1.859	+321.8	+115.1	31.0	38.3
L16	1.731	1.995	+322.0	+115.4	82.6	38.3
L18	1.619	1.839	+388.7	+182.1	38.6	38.3
L19	1.666	1.905	-166.7	-373.3	45.5	24.8
L20	1.723	1.983	+194.7	-12.0	74.1	38.3
RDX	1.632	1.857	+240.4	+33.8	26.3	38.3
TATB	1.663	1.900	+82.9	-123.7	89.2	82.7

L1 discrepancy. The score model predicts h_50,model=30.3 cm (more sensitive) while the BDE estimate gives h_50,BDE=82.7 cm (less sensitive), a factor of 2.7. The two methods disagree on L1 because the model was calibrated on a drug-domain distribution (few isoxazole-class nitro compounds in training), whereas the BDE formula classifies L1's Ar–NO₂ bonds using the aromatic-nitro class typical (BDE~70 kcal/mol). Neither route is authoritative for a novel chemotype; experimental impact-sensitivity testing is required.

Reference-class scaffolds. Three reference-class compounds (R2, R3, R14) flagged by the strengthened SMARTS gate of Results “Novelty and synthesisability” (main paper) were optimised with the same protocol; they are not chem-pass leads and are reported here for completeness:

**Table 3.1b.** Reference-class (chem-rejected) scaffolds optimised with the same B3LYP/6-31G(d) + ωB97X-D3BJ/def2-TZVP protocol. R2 is an N-nitroimine, R3 a polyazene chain, R14 an azo-amino-nitrate. DFT-derived properties for these compounds are in Table 3.1d.
ID	Formula	n_atoms	ν_min (cm⁻¹)	ZPE (kJ/mol)
R2	C₃H₃N₅O₆	17	47.1	263.1
R3	C₁H₃N₇O₄	15	33.3	234.3
R14	C₃H₅N₇O₆	21	36.9	348.6

**Table 3.1d.** DFT-derived properties for the three reference-class scaffolds. Calibration as Table 3.1c. h₅₀ not reported because the score-model training set does not include reference-class chemotypes at sufficient density.
ID	ρ_DFT (g/cm³)	ρ_cal (g/cm³)	HOF_DFT (kJ/mol)	HOF_cal (kJ/mol)
R2	1.699	1.950	+117.3	-89.4
R3	1.642	1.871	+521.7	+315.0
R14	1.604	1.818	+437.4	+230.7

Anchor calibration (6-anchor). Experimental \((\rho, \text{HOF})\) anchors: RDX \((1.82, +70)\), TATB \((1.94, -141)\), HMX \((1.91, +75)\), PETN \((1.77, -538)\), FOX-7 \((1.89, -134)\), NTO \((1.93, -129)\), in g/cm³ and kJ/mol respectively. Raw \(\omega\)B97X-D3BJ DFT outputs are tabulated in Table 3.1c. The 6-anchor regression gives \(\rho_{\text{cal}} = 1.392\cdot\rho_{\text{DFT}} - 0.415\) and HOF\(_{\text{cal}}\) = HOF\(_{\text{DFT}} - 206.7\) kJ/mol, with LOO RMS \(\pm 0.078\) g/cm³ on \(\rho\) and \(\pm 64.6\) kJ/mol on HOF (per-anchor residuals in Supplementary Note 3.4). The 12 chem-pass leads calibrate to \(\rho_{\text{cal}} \in [1.84, 2.09]\) g/cm³ and HOF\(_{\text{cal}} \in [-373, +330]\) kJ/mol. Using the calibrated \(\rho\) and HOF, the K-J formula recomputes \(D\) for all 12 leads on the calibrated branch:

**Table 3.2.** Kamlet–Jacobs detonation velocity and pressure recomputed from the 6-anchor-calibrated DFT \(\rho\) and HOF (per `kj_6anchor_recompute.json`), compared to the 3D-CNN surrogate prediction. The raw-DFT K-J values are listed in the third column for transparency. Residual is K-J\(_{\text{cal}}\) minus 3D-CNN \(D\). All 12 chem-pass leads are defined on the K-J branch. Anchor rows show the experimental \(D\), the K-J recompute under the 6-anchor calibration, and the K-J\(_{\text{cal}}\)−exp residual that defines the K-J under-prediction band against which the lead residuals are read.
Lead	3D-CNN \(D\), \(\rho\)	K-J raw DFT \(D\), \(P\)	K-J 6-anchor cal \(D\), \(P\)	residual (cal − 3D-CNN)
L1 trinitro-isoxazole	D=9.56, ρ=2.00	D=7.85, P=27.3	D=8.25, P=32.9	−1.31 km/s
L2 oxime-nitrate-imidazoline	D=8.96, ρ=1.87	D=6.81, P=19.9	D=7.78, P=28.1	−1.18 km/s
L3 oxadiazinone-N-nitramine	D=9.11, ρ=1.93	D=6.50, P=18.3	D=7.50, P=26.5	−1.61 km/s
L4 5-ring tetrazoline-N-nitramine	D=8.98, ρ=1.87	D=5.56, P=13.2	D=6.71, P=20.9	−2.27 km/s
L5 acyl oxime nitrate	D=9.21, ρ=1.93	D=7.78, P=25.8	D=7.99, P=29.6	−1.22 km/s
L9 hydrazinyl tetranitramine	D=9.27, ρ=1.88	D=6.44, P=17.6	D=7.33, P=24.6	−1.94 km/s
L11 acyl-nitramide nitrate-imine	D=9.30, ρ=1.90	D=6.97, P=20.5	D=7.88, P=28.5	−1.42 km/s
L13 oxadiazoline	D=9.05, ρ=1.87	D=6.39, P=17.0	D=7.36, P=24.5	−1.69 km/s
L16 oxadiazinone-N-nitramine (alt SMILES)	D=9.11, ρ=1.93	D=6.50, P=18.3	D=7.50, P=26.5	−1.61 km/s
L18 vinyl azo-amino nitrate	D=9.00, ρ=1.89	D=6.21, P=16.0	D=7.09, P=22.6	−1.91 km/s
L19 acyl C-OH nitramine	D=9.14, ρ=1.88	D=7.24, P=22.2	D=7.24, P=24.1	−1.90 km/s
L20 \(\alpha\)-NO₂ acyl-N-nitramine	D=9.12, ρ=1.91	D=6.40, P=17.7	D=7.41, P=25.8	−1.71 km/s
RDX (anchor)	D=8.75 (exp)	n/a	D=7.43, P=25.0	−1.32 km/s (vs exp)
TATB (anchor)	D=7.90 (exp)	n/a	D=6.93, P=22.0	−0.97 km/s (vs exp)
HMX (anchor)	D=9.10 (exp)	n/a	D=7.52, P=26.2	−1.58 km/s (vs exp)
PETN (anchor)	D=8.30 (exp)	n/a	D=8.24, P=30.6	−0.06 km/s (vs exp)
FOX-7 (anchor)	D=8.87 (exp)	n/a	D=7.70, P=26.5	−1.17 km/s (vs exp)
NTO (anchor)	D=7.93 (exp)	n/a	D=7.49, P=25.6	−0.44 km/s (vs exp)

Reading Table 3.2. The six anchor rows define the K-J under-prediction band on the same recipe: K-J\(_{\text{cal}}\) lands \(0.06\) to \(1.58\) km/s below experiment, with a mean of \(-0.92\) km/s. Lead residuals against the 3D-CNN surrogate (\(-1.18\) to \(-2.27\) km/s, mean \(-1.66\) km/s) sit comfortably inside this anchor-defined band, and the \(-1.31\) km/s residual on L1 is one of the smallest in the lead set, consistent with L1's HMX-class regime placement.

3.5b caveat block: 1,2,3,5-oxatriazole stability literature (E1)

The 1,2,3,5-oxatriazole ring system has known ring-opening and thermal-decomposition pathways under acidic or high-temperature conditions (Sheremetev 2007; Katritzky 2010). Specifically, the endocyclic O–N(2) bond is the weakest in the ring (literature BDE estimates for the parent 1,2,3,5-oxatriazole: 130–155 kJ/mol depending on substituents, vs 250–290 kJ/mol for the C–N bond in the same ring); under thermal activation or Lewis-acid catalysis the ring opens to a reactive azide-nitrile-oxide zwitterion (Sheremetev 2007), a pathway that may preclude melting-point processing or conventional press-and-sinter consolidation. Our SMARTS catalog does not flag the parent ring as unstable because the heteroatom sequence N–O–N is not on the primary reject list, and a dedicated stability screen (BDE of the weakest N–O bond at \(\omega\)B97X-D3BJ level, plus DSC/TGA compatibility check) is required before E1 is promoted to synthesis priority. The Supplementary Note 3.10 GFN2-xTB BDE scan places E1's weakest cleavage at \(92.7\) kcal/mol on the exocyclic C–NO₂ bond and finds no ring N–O channel below \(180\) kcal/mol on a vertical cleavage from the parent geometry; the literature endocyclic-O–N value (130–155 kJ/mol \(\approx\) 31–37 kcal/mol) refers to the activation barrier on the relaxed ring-opening pathway, which the unrelaxed scan does not capture. The two estimates are consistent in chemistry (nitro loss is the dominant primary channel at the parent geometry; ring opening is a secondary thermal/Lewis-acid pathway with a lower activation barrier on the relaxed surface). E1 is therefore a credible co-headline candidate, but a relaxed-pathway BDE recompute and DSC/TGA screen are prerequisites for synthesis priority.

3.6 K-J residual decomposition by ρ/HOF sensitivity

Under the 6-anchor calibration L4's K-J \(D\) is \(6.71\) km/s against a 3D-CNN prediction of \(8.98\) km/s, giving a residual of \(D_{\text{DFT-cal}} - D_{\text{3D-CNN}} = -2.27\) km/s (Table 3.2). We decompose this residual against the K-J sensitivity slopes at L4's base point and ask what input shift in \(\rho\) alone or HOF alone would be required to close it.

**Table 3.3.** K-J residual decomposition by \(\rho\)/HOF sensitivity for L4 (3D-CNN \(D\)=8.98 km/s, 6-anchor-calibrated K-J \(D\)=6.71 km/s, residual = \(-2.27\) km/s). Both single-variable closures require physically or chemically implausible input shifts, so the residual is attributable primarily to the K-J formula's gas-product distribution assumption in the high-N CHNO regime, not to plausible \(\rho\) or HOF input errors.
Mechanism	Required correction	Implausibility
Close residual via ρ alone	\(\Delta\rho = -2.27 / 2.47 = -0.92\) g/cm³; L4 calibrated \(\rho\) would have to drop from 1.94 to 1.02 g/cm³	Lighter than water; chemically implausible for any nitrogen-rich CHNO crystal
Close residual via HOF alone	\(\Delta\)HOF \(= -2.27 / (-0.22 / 100) = +1032\) kJ/mol; L4 calibrated HOF would have to rise from +292 to +1324 kJ/mol	Far above CL-20's HOF (~+460 kJ/mol); physically extreme for a 13-atom CHNO neutral

K-J sensitivity slopes at the L4 base point: \(dD/d\rho \approx +2.47\) km/s per g/cm³; \(dD/d\,\text{HOF} \approx -0.22\) km/s per 100 kJ/mol. Both single-variable closures are implausible; the K-J residual at L4 is dominated by the gas-product distribution assumption rather than \(\rho\) or HOF input errors, with a secondary contribution from the 6-anchor calibration uncertainty (\(\pm 0.078\) g/cm³ on \(\rho\), \(\pm 64.6\) kJ/mol on HOF). The assumed gas-product distribution (CO₂, H₂O, N₂ in fixed ratio determined by the C-deficient/O-rich switch) misrepresents the actual product mixture for compounds with low H content and high N density. The correct resolution requires a direct DFT calculation of the Chapman-Jouguet state via specialised codes such as EXPLO5 or Cheetah-2; the K-J recompute should be read as a regime-aware sanity check, not a paper-grade detonation prediction.

3.7 Population-level N-fraction stratification

To confirm that the K-J residual's regime dependence is not specific to L4 but a population-wide signal, we re-ran the same K-J formula on every Tier-A row of the labelled master that carries simultaneous experimental ρ, HOF, and \(D\) (n = 575) and stratified the residual \(D_{\text{K-J}} - D_{\text{exp}}\) by atomic N-fraction \(f_N = n_N / (n_C + n_H + n_N + n_O)\):

**Table 3.4.** Population-level N-fraction stratification of the K-J residual on 575 Tier-A rows of the labelled master that carry simultaneous experimental \(\rho\), HOF, and \(D\). Columns: N-fraction bin; row count; mean experimental \(D\); mean K-J recomputed \(D\); mean residual (K-J minus experimental). Pearson \(r(f_N, \text{residual})=+0.43\), \(p=4\times 10^{-27}\).
N-fraction bin	\(n\)	mean \(D_{\text{exp}}\) (km/s)	mean \(D_{\text{K-J}}\) (km/s)	mean residual (km/s)
[0.00–0.10)	16	6.17	3.59	−2.59
[0.10–0.20)	196	7.44	3.93	−3.51
[0.20–0.30)	233	7.96	4.37	−3.59
[0.30–0.40)	84	8.30	5.01	−3.29
[0.40–0.55)	37	7.78	6.17	−1.60
[0.55–1.00]	9	7.12	7.66	+0.54

The residual flips sign across the N-fraction range: K-J systematically under-predicts \(D\) by 3–3.6 km/s in the low-to-mid N-fraction regime and shifts toward over-prediction in the high-N regime. Pearson \(r(f_N, \text{residual}) = +0.43\) (\(p = 4 \times 10^{-27}\), \(n=575\)). On absolute magnitude: this open-form K-J implementation uses \(Q \approx \Delta H_f / M_W\) without subtracting product-side enthalpies, which inflates the absolute residual relative to the closed-form K-J variant in the per-lead table; the trend across N-fraction is the diagnostic signal, not the absolute residual values. The Results (main paper) attribution is consistent with the population-level monotone N-fraction trend. Note: the mean K-J residual in the [0.55–1.00] N-fraction bin of Table 3.4 is positive (+0.54 km/s), indicating apparent K-J over-prediction at this extreme N-fraction; this sign flip is an artifact of the open-form Q approximation (Q ≈ ΔHf/MW without product-side subtraction) and does not apply to the closed-form K-J used per-lead in Table 3.2.

3.8 Scaffold-similarity to labelled master

Beyond Tanimoto on full-molecule fingerprints (Results “Novelty and synthesisability” (main paper)), we report the GuacaMol-style "scaf" metric (mean and max nearest-neighbour Tanimoto on Bemis–Murcko scaffold fingerprints) against a 5 000-row sample of the labelled master:

**Table 3.5.** GuacaMol-style scaffold-Tanimoto metric (mean and max nearest-neighbour Tanimoto on Bemis–Murcko scaffold fingerprints) of each guidance condition's 10k pool against a 5 000-row sample of the labelled master, plus unique scaffold count.
Condition	scaf-NN mean	scaf-NN max	unique scaffolds
C0 unguided	0.469	1.000	369
C1 viab+sens	0.420	1.000	171
C2 viab+sens+hazard	0.372	1.000	199
C3 hazard-only	0.423	1.000	503

Mean scaffold-NN Tanimoto of 0.37–0.47 indicates the generated scaffolds are partially related to known energetic scaffolds but not identical (in line with the median molecule-level NN-Tanimoto of 0.32 reported in Results “Novelty and synthesisability” (main paper)). The max=1.00 reflects sanity-check rediscoveries (e.g., 1,2-isoxazole, 1,2,4-tetrazole, oxadiazole, 1,2,3,4-tetrazine). C2 (viab+sens+hazard) has the lowest scaffold-NN mean (0.372); C3 (hazard-only) has the highest scaffold count (503).

3.9 Bondi-vdW packing-factor bracket on L1 and E1 (T2 cross-check)

An independent Bondi-vdW packing-factor bracket on L1 and E1, varying pk \(\in\) {0.65, 0.69, 0.72} on the optimised B3LYP/6-31G(d) geometries, yields \(\rho \in [1.69, 1.87]\) g/cm³ for L1 and \([1.65, 1.83]\) g/cm³ for E1. Both ranges sit below the 6-anchor-calibrated values (\(\rho_{\text{cal}}\) = 2.09 and 2.04 respectively), with the centre value (pk = 0.69) at \(\rho\) = 1.80 (L1) and 1.75 (E1). The 14% systematic offset between the pk = 0.69 estimate and the calibrated value reproduces the calibration slope (1.392) reported in Results (main paper) and confirms that the pure Bondi-vdW estimator is conservative; the 6-anchor empirical calibration is the source of the headline \(\rho\) values. Computed via Modal-launched bundle experiments/t2_density/.

3.10 GFN2-xTB weakest-bond BDE on L1 and E1 (T1 cross-check)

A coordinate-preserving graph-split BDE scan at GFN2-xTB level (parent geometry retained, each fragment as a doublet radical with \(\mathrm{UHF}=1\) and no implicit-H addition) places the weakest-bond cleavage of L1 (3,4,5-trinitro-1,2-isoxazole) at 86.0 kcal/mol on a C–NO₂ bond (the dominant initiation channel), with two further C–NO₂ channels at 91.3 and 92.3 kcal/mol; ring N–O bonds are at 170–186 kcal/mol (the unrelaxed-fragment xTB scan over-binds slightly relative to a fully relaxed \(\omega\)B97X-D3BJ recompute, but the chemistry assignment and the weakest-bond ranking are correct). E1 (4-nitro-1,2,3,5-oxatriazole) reaches its weakest cleavage at 92.7 kcal/mol on the exocyclic C–NO₂ bond, with no ring N–O channel below 180 kcal/mol. Relative to the Politzer–Murray Ar–NO₂ typical (\(\sim\)70 kcal/mol) the values are mildly over-estimated by the unrelaxed-fragment xTB protocol, but the qualitative picture matches expectations: nitro loss is the dominant decomposition initiator for both compounds, and neither shows a sub-80 kcal/mol channel that would predict primary-explosive sensitivity. Computed via Modal-launched bundle experiments/t1_bde/.

3.11 DNTF 7th-anchor extension attempt

We attempted to add DNTF (3,4-bis(4-nitrofurazan-3-yl)furazan; the best-characterised furazan-class energetic material with known experimental properties: \(\rho_{\text{exp}} = 1.937\) g/cm³, \(D_{\text{exp}} = 9.25\) km/s) as a candidate 7th anchor using the same B3LYP/6-31G(d) + \(\omega\)B97X-D3BJ/def2-TZVP pipeline. DNTF proved an outlier: the 6-anchor calibration yields a leave-one-out \(\rho\) residual of \(+0.122\) g/cm³ and \(\Delta H_f\) residual of \(-67.9\) kJ/mol for DNTF (versus \(\pm 0.025\)–\(0.049\) g/cm³ for the six main-set anchors), and the 7-anchor LOO-RMS rises from \(\sim 0.032\) to \(0.055\) g/cm³. This suggests the Bondi van-der-Waals packing model or the B3LYP geometry is less reliable for the furazan ring system (1,2,5-oxadiazole, 2C + 1N + 1O) than for nitramine and nitroaromatic scaffolds; the furazan class is also chemically distinct from the oxatriazole class (1,2,3,5-oxatriazole, 1C + 3N + 1O), so DNTF is not a true neighbour of E1 either. The 6-anchor calibration is therefore retained. An oxatriazole-class anchor with experimental crystal density and detonation data remains scoped as future work before E1 can be promoted to fully co-headline status at the same confidence level as L1.

3.12 Stage-3 xTB triage recipe

The Stage-3 xTB triage referenced in Results (main paper) follows the standard pre-experimental screening recipe for computational energetic-materials chemistry: starting from canonical SMILES we run an RDKit ETKDGv3 conformer search, MMFF94 minimisation, and then xTB [xtb-6.6.1] at the GFN2-xTB level with --opt tight. Two signals are reported per candidate: whether the xTB optimisation converges (a non-converging optimisation indicates an unphysical geometry that should be discarded) and the HOMO–LUMO gap. The HOMO–LUMO gap acts as a soft electronic-stability proxy: energetic compounds with gaps below \(\sim\)1.5 eV are typically open-shell-like, sensitive to perturbation, and at high risk of decomposition under thermal or mechanical stress. Sensitivity is ultimately a product of weakest-bond chemistry and crystal packing as much as electronic structure [13], so we treat the gap as a triage signal rather than a final verdict; molecules clearing the 1.5 eV gate are promoted to Stage 4 DFT, and motif-level shock and friction risks are caught by the parallel SMARTS gate (Stage 1, Methods; see also Methods (main paper)). Recipe and motivation: the pipeline is the standard pre-experimental screen for computational energetic-materials chemistry, the same B3LYP/6-31G(d) geometry + \(\omega\)B97X-D3BJ/def2-TZVP single-point \(\to\) 6-anchor calibration (RDX, TATB, HMX, PETN, FOX-7, NTO) \(\to\) Kamlet–Jacobs \(D\) and \(P\) used by Stage 4 (full functional/basis justification, thermal-correction handling, and calibration uncertainty bounds in 3.1–3.4 above).

3.13 Independent CJ cross-check

A thermochemical-equilibrium Cantera ideal-gas Chapman–Jouguet recompute is provided as an independent relative-ranking sanity check on the Results (main paper) K-J results. The Cantera ideal-gas equation of state is \(\sim\)3.5\(\times\) too low in absolute terms (RDX predicts \(2.50\) km/s vs experimental \(8.75\) km/s), because BKW/JCZ3-type covolume corrections dominate at the 30–100 GPa product-side pressures of CHNO detonations and the ideal-gas EOS misses them by construction; we therefore use it as relative ranking only. On that footing the Cantera ideal-gas CJ ranking is used only as a relative ordering check within the same product-gas composition family, not as an independent quantitative validation; the relative ranking shows L1, L4, L5 are RDX-class (within the same product-gas family: L1, L4, L5 share a CO₂/H₂O-dominant CHNO product distribution; cross-family ranking with N₂-dominant tetrazoline-class compounds would require a covolume EOS recompute), providing a qualitative consistency check that the headline \(D\) claim for L1 is reasonable. Absolute paper-grade \(D\) is out of scope here (see Discussion (main paper)). A sensitivity decomposition (per-lead slopes \(dD/d\rho \in [2.5, 3.0]\) km/s per g/cm³; \(dD/d\,\text{HOF} \in [-0.14, -0.22]\) km/s per 100 kJ/mol; see Table 3.2) confirms that residual K-J vs 3D-CNN deviations on the other leads cannot be attributed to plausible \(\rho\) or HOF input errors; the dominant residual source remains the K-J formula's fixed gas-product-distribution assumption in the high-N CHNO regime. A composite-method HOF benchmark (G4 or CBS-QB3 atomization energies at \(\sim\)6–24 hr CPU per molecule) would tighten the absolute energetics for the top leads but does not change the relative ranking.

Supplementary Note 4: Reproducibility details and superseded ablations

This Supplementary Note collects implementation-specific names, version codes, hardware configurations, hyperparameter ablations, and superseded results that are not required for the body claims but are needed for full reproducibility.

4.1 Score-model variants

Two score-model checkpoints are released on Zenodo: the production 6-head hazard-aware model (score_model_v3f in the archive; matches Discussion (main paper) + Table 2.5) and its 5-head predecessor (score_model_v3e); body-text numbers use the 6-head production checkpoint with the hand-coded SMARTS backend (chem_redflags.py).

4.2 Compute budget

Single-seed pool=40 000 head-to-head: ~8 min on a single consumer GPU; 4-condition × 3-seed pool=10 000: ~25 min. Full DFT audit (15–25-atom CHNO molecules): ~1 GPU-hour per molecule. Full DGLD pipeline (corpus to merged top-100): ~2 GPU-hours.

4.3 Hyperparameter values

All training, sampling, and reranker hyperparameters are tabulated in Supplementary Note 2 (Tables 2.1a–c and the production recipe Table 2.5).

4.4 High-tail oversampling ablation

Reverting the asymmetric oversampling of Methods “Dataset and four-tier label hierarchy” (main paper) produces leads with a 0.4 km/s lower mean predicted detonation velocity at matched pool size and matched downstream pipeline.

4.5 Diagnostic protocol

The guidance patching of Methods (main paper) was diagnosed via per-step gradient-norm logging, score-model output sensitivity to perturbed \(z\), final-\(z\) cosine similarity across conditions, and decoder argmax-basin width. The protocol is released as diag_guidance.py.

4.6 Strengthened SMARTS catalog: per-class breakdown

Two motif classes well-known to the energetic-materials community as shock-sensitive or thermally unstable were added to the SMARTS catalog (Results “Novelty and synthesisability” (main paper)): N-nitroimines (R−N=N−NO₂ at non-aromatic nitrogens) and open-chain polyazene / azo-nitro motifs. Strict reject SMARTS (n_nitroimine, azo_nitro_terminal, azo_amino_nitro, tetraazene_open, open_chain_4N, n_n_c_n_n_chain) reapplied post-hoc to the merged top-100:

**Table 4.3.** Per-class breakdown of the 23/100 strengthened-SMARTS rejects on the merged top-100. The new motif rules (N-nitroimine; terminal azo−NO₂; azo−amino−NO₂; cumulated cyclopropene residual) collectively flag 23 candidates; the rank-1 trinitro-isoxazole survives all rules.
Reject class	Count
N-nitroimine	14
terminal azo−NO₂	6
azo−amino−NO₂	2
cumulated cyclopropene (residual)	1
Total rejected	23 / 100

The strengthened filter retains 77/100 candidates; the rank-1 trinitro-isoxazole survives. Crossing with the xTB HOMO-LUMO ≥ 1.5 eV gate gives an estimated 60-65/100 dual-pass set; 77/100 is the conservative figure used by the validation chain.

4.7 CFG-scale ablation

Three runs at \(w\in\{5,7,9\}\) with pool=8 000 each. Increasing \(w\) sharpens the predicted-property distribution but reduces filter pass-through. \(w=7\) is the empirical optimum: 983 final candidates and a top score of 0.70. \(w=9\) drops to 427 surviving candidates with a slightly worse top score. \(w=5\) over-explores: 528 final candidates, top score 0.92 (Extended Data Fig. 3 (main paper)).

The v4b production-architecture sweep (50 samples per target, three quantile bands per property; full table in experiments/diffusion_subset_cond_expanded_v4b_20260426T000541Z/cfg_sweep.md) confirms that \(w=7\) gives the best mean per-property quantile match. Representative q90 (high-extreme) rows below illustrate the trade-off: density and detonation-velocity quantile error fall monotonically with increasing \(w\); HOF and detonation-pressure quantile errors are bowl-shaped, optimal near \(w \in [5, 7]\).

Property	Target quantile	g=2.0 rel-err%	g=5.0 rel-err%	g=7.0 rel-err%
density	q90 (target +1.83)	19	14	12
heat_of_formation	q90 (target +267.75)	212	176	171
detonation_velocity	q90 (target +7.86)	26	17	12
detonation_pressure	q90 (target +26.57)	49	30	26

4.8 Reproducibility across pool sizes

Top-1 composite at small pool size is sampling-noise-dominated: a single guidance configuration (\(s_{\text{viab}}=1.0, s_{\text{sens}}=0.3, s_{\text{SA}}=0.3\)) reaches top-1 composite 0.91 at pool=5 000 but only 0.60–0.63 at pool=20 000 / 40 000; the 0.91 outlier is a small isoxazole-N-oxide (\(\rho=1.89, D=9.28\) km/s) that does not re-emerge at the larger sample size. The per-head guidance scale tunes the distribution of generated chemistry, but individual top-1 candidates are subject to the same combinatorial rarity that motivates the pool-size scaling of Methods; see also Methods (main paper). The composite-score distribution over the top-200 of an unguided pool=40 000 run is shown in Figure 4.8. The recommended robust production recipe (moderate guidance, pool \(\ge\) 40 000, Pareto reranker (scaffold-aware composite) as ranker, SA-gradient as a complementary run) is documented in Methods; see also Methods (main paper).

Score distribution — **Figure 4.8.** Distribution of composite scores over the top-200 of an unguided pool=40 000 run after the Pareto scaffold-aware ranker (range [0.564, 0.831]).

Pool-fusion provenance of the merged top-100. Applying the Methods; see also Methods (main paper) pool-fusion recipe to (i) the unguided pool=40 000 run, (ii) the unguided pool=80 000 run, (iii) a guided pool=40 000 run at \(s_{\text{viab}}\!=\!1.0, s_{\text{sens}}\!=\!0.3, s_{\text{SA}}\!=\!0\), and (iv) a guided pool=20 000 run at \(s_{\text{SA}}\!=\!0.15\) yields 331 unique candidates. The top-100 by composite has range [0.564, 0.831]. Source breakdown: 89 from (i), 5 from (ii), 3 from (iii), 3 from (iv) (visualised as Extended Data Fig. 8 in the main paper). The Pareto reranker (scaffold-aware composite) is the dominant ranker; the largest contributor is the smallest unguided pool, while classifier-guided runs supply scaffold diversity at the top end. The trinitro-isoxazole rank-1 itself comes from a guided run.

4.9 E2 filter-scope note

E2 (gem-dinitro nitramine, O=[N+]([O-])NC([N+](=O)[O-])[N+](=O)[O-], CH₂N₄O₆) carries 3 nitro groups on a 1-carbon skeleton and should be rejected by the Methods; see also Methods (main paper) hard rule "≥3 nitro groups on a 1- or 2-carbon skeleton". The production reranker filter chem_redflags.screen() does reject E2 via its nC≤2 and n_NO2≥3 rule (nC=1, nNO2=3); the lighter chem_filter.chem_filter_batch() used by the E-set mining script does not. This Stage-1-vs-Stage-2 scope mismatch is expected: the production filter gates the L-set headline; the looser mining filter produces the E-set as a deliberately wider window. Additionally, E2's oxygen balance OB = +28.9% exceeds the +25% production cap; its K-J D value (9.22 km/s) is an upper-bound estimate only.

Supplementary Note 5: Baseline-method example outputs

Tables 5.1 and 5.2 list the ten most-novel CHNO-neutral candidates from the MolMIM 70 M and SMILES-LSTM pools, ranked by 1 minus max Morgan-FP-2 Tanimoto to the labelled master. Even at their most novel, neither baseline produces the high-density poly-nitro chemistry DGLD generates: MolMIM candidates are drug-like saturated heterocycles; SMILES-LSTM candidates are fused-ring CHO scaffolds with sporadic nitro decoration.

**Table 5.1.** Ten most-novel CHNO-neutral candidates from the MolMIM 70 M pool. Columns: SMILES; molecular formula; exact molecular weight; novelty (1 minus max-Tanimoto to labelled master).
SMILES	Formula	MW	Novelty
CO[C@@H]1CCN(CC23CCC(N)(CC2)C3)C[C@@H]1C	C₁₅H₂₈N₂O	252.22	0.823
N=c1cc[nH]cc1-c1nc(N)nc(N2C[C@H]3CN=C(N4CCCC4)[C@H]3C2)n1	C₁₈H₂₃N₉	365.21	0.818
CC(CN1C[C@H](CO)C2(CCC2)C1)=C1CCC1	C₁₅H₂₅NO	235.19	0.810
C[C@H]1OCCC2=Nc3[nH]c4c(c3CCC[C@@H]21)[C@](C)(O)C4	C₁₆H₂₂N₂O₂	274.17	0.809
CC(C)(C)CN1CC(C)(C)C[C@]2(C)C(=O)N3C(=N1)C(=O)NC[C@]3(C)[C@@H]2N	C₁₉H₃₃N₅O₂	363.26	0.806
CC(C)CN1CCN(c2nnc(-c3cncnc3)n2C[C@]2(C)CCCO2)CC1=O	C₂₀H₂₉N₇O₂	399.24	0.805
N[C@]1(CC2COCCOC2)CO[C@H](C2CC2)C1	C₁₃H₂₃NO₃	241.17	0.804
Cc1c(C(=O)N(C)C[C@@H]2CCN(C(=O)Cc3cnoc3)C2)ccn1C	C₁₈H₂₄N₄O₃	344.18	0.802
O=C(CCn1cnccc1=O)NC1(CNC[C@H]2CC[C@H]3C[C@H]3C2)CCCC1	C₂₁H₃₂N₄O₂	372.25	0.802
CO[C@@H]1CCCCN(c2nnc([C@]3(C)CNC(=O)C3)n2CC2CC2)C1	C₁₈H₂₉N₅O₂	347.23	0.802

**Table 5.2.** Ten most-novel CHNO-neutral candidates from the SMILES-LSTM pool. Columns as Table 5.1.
SMILES	Formula	MW	Novelty
O=C(O[C@@H]1C[C@H]2CC[C@H]3O[C@]2(C1)C1=C3CCCC1)C(=O)C1CC1	C₁₉H₂₄O₄	316.17	0.775
CC(=N)OC(=O)OC12OC(OCCO)C1O2	C₈H₁₁NO₇	233.05	0.754
CC[C@H](c1cccc2ccccc12)[C@@H](/C=C/C[C@]1(C)C(=O)N2C(=O)N=C(OC)O[C@H]2[C@@H]1C(OC)=C(C)O)COC	C₃₁H₃₈N₂O₇	550.27	0.750
CCN1N=C2OC(=O)N=CN=C2C(O)=NC1=NO	C₈H₈N₆O₄	252.06	0.746
CC(=O)[C@@H]1[C@@H](O)[C@H](C(C)(C)C)[C@@H]1N1C(=O)C(=O)N(C)C(=O)CN1[N+](=O)[O-]	C₁₅H₂₂N₄O₇	370.15	0.743
CC1N=C2OC(=O)OC2OC(Oc2nnno2)COC1C#N	C₁₀H₉N₅O₇	311.05	0.739
C[C@@H]([C@@H]1C=C(/C=C\C=C2\CCCCN2C)[C@]2(C=Nc3ccccc32)C(=O)[C@@H]1OCc1ccccc1)[N+](=O)[O-]	C₃₁H₃₃N₃O₄	511.25	0.739
C=C1N=C2OC(=O)N2C(=O)COC(=O)NC1=O	C₈H₅N₃O₆	239.02	0.736
CC12OC(O1)C(=O)C(=O)N(C=O)OCC2=O	C₈H₇NO₇	229.02	0.732
COC(=O)N1CCCC(C)(C)N1O	C₈H₁₆N₂O₃	188.12	0.730

Supplementary Note 6: Detailed ablations

This Note collects the full prose, sub-tables, and figures for each of the seven ablations whose headline numbers are summarised in main Fig. 5 (Results, “Component contributions”, main paper).

6.1 Self-distillation hard-negative budget (137 vs 918)

A viability-head ablation comparing 137-vs-918 hard-negative budgets shows that score-model discriminative power scales directly with structurally-diverse latent negatives; the 918-budget run is required for the Methods (main paper) self-distillation refinement to deliver its discriminative gain. The round-1 viability head (137 mined hard negatives encoded as viab=0) and the round-2 production checkpoint (918 mined hard negatives plus an aromatic-heterocycle viability boost \(\times 1.20\)) are evaluated on a held-out 7-anchor / 5-cheat probe (RDX, HMX, TNT, FOX-7, PETN, TATB plus five known model-cheats: gem-tetranitro on C2, trinitromethane, gem-dinitro-cyclopropane, polyazene-NO₂ chain, dinitromethane). Both discriminate, but only the 918-budget checkpoint demotes the worst-offender model-cheat (gem-tetranitro) by an additional 0.10 absolute, giving the head reliable steer-off behaviour on the polynitro/polyazene cheats the sampler would otherwise propose. The SC-gradient sweep and the heuristic-vs-literature sensitivity-head ablation (Supplementary Note 2.2) are detailed in Supplementary Note 6.4 below.

**Table 6.1.** Hard-negative budget effect on the held-out 7-anchor / 5-cheat probe. Anchor min and cheat max are the per-set extrema of the viability-head output; the worst-offender column tracks the gem-tetranitro model-cheat across budgets.
HN budget	Anchor min	Cheat max	Worst-offender (gem-tetranitro)
137 (round-1)	0.93	0.84	0.84
918 (production)	0.86	0.84	0.74 (−0.10 absolute)

Full per-anchor and per-cheat ranges. We trained two variants of the multi-head model on identical noised-latent inputs to compare hard-negative-augmentation budgets. Budget-137: training labels = \(\mathbf{1}[\text{redflag pass}] \cdot P_{\text{RF}}\) on the labelled corpus + 137 mined hard negatives encoded as viab=0. Budget-918 (production): same recipe with 918 mined hard negatives (mined from all generation pools) and an additional aromatic-heterocycle boost \(\times 1.20\) on the viability target. On the held-out probe of seven anchor energetics (RDX, HMX, TNT, FOX-7, PETN, TATB) and five known model-cheats (gem-tetranitro on C2; trinitromethane; gem-dinitro-cyclopropane; polyazene-NO₂ chain; tiny dinitromethane), the budget-137 model gave anchors viability \(\in\) [0.93, 0.99] and cheats \(\in\) [0.20, 0.84]; the budget-918 checkpoint gave anchors \(\in\) [0.86, 0.99] and cheats \(\in\) [0.12, 0.84]. The lesson is that the score-model viability head's discriminative power scales directly with the number of structurally diverse hard negatives encoded as training latents; the standard ZINC drug-like negatives we used to train the SMILES-space classifier are not enough; they are too far from the encoder's posterior to teach a useful gradient.

6.2 Diffusion + guidance prior vs Gaussian-latent control

An initial 3k-sample Gaussian baseline (LIMO decoder + SMARTS gate + SA/SC caps + Tanimoto window [0.15, 0.65] + Uni-Mol ranking) yields a top-1 composite of 1.10 (\(\rho=1.89\) g/cm³, \(D=9.02\) km/s, \(P=35.9\) GPa) with a 15 % pipeline keep rate. A compute-matched 40k run (described below and in Table 6.2) reveals a qualitatively different picture once the Uni-Mol performance-ranking step is removed.

A compute-matched Gaussian-latent baseline (40k samples, \(\mathcal{N}(0,I_{1024})\) decoded through frozen LIMO, then min-10-HA fragment gate + CHNO SMARTS gate + SA ≤ 6.5 + Tanimoto novelty window [0.15, 0.65]) was run to close the pool-size mismatch noted in the original caveat. The results reveal a qualitative difference beyond the D-lift: the Gaussian prior achieves a keep rate of 48 % (19 258 / 40 000), far higher than DGLD's 4.6 %, because the Gaussian prior generates mostly chemically plausible but performance-degenerate molecules. Without the Uni-Mol performance-ranking step, the static filter chain cannot discriminate high-performance candidates from compositionally adequate high-N molecules: the top-1 by N-fraction proxy is a tetrazole-bearing chain (\(f_N = 0.60\), SA = 4.3, 15 heavy atoms), a synthetically accessible but structurally simple open-chain molecule with no evidence of high detonation performance. In contrast, DGLD's prior concentrates the sampled distribution in the high-D, high-density corner of the VAE latent space, so the Uni-Mol-ranked post-filter pool consists of structurally complex, ring-bearing high-N candidates (top-1 trinitrocyclopropene, 7 heavy atoms, \(D = 9.47\) km/s). The 48 % vs 4.6 % keep-rate reversal confirms that the diffusion prior's contribution is not in generating chemically plausible molecules (a Gaussian prior does that easily) but in concentrating the distribution on the high-performance tail.

**Table 6.2.** Diffusion + guidance prior vs compute-matched Gaussian-latent control at matched pool size (40k). Pipeline keep rate differs in definition: DGLD keep rate uses Uni-Mol D-ranking; Gaussian keep rate uses min-10-HA + SMARTS + SA filters only (no Uni-Mol). Top-1 D for the Gaussian row is from the 3k run (Uni-Mol-ranked; the 40k run's top-1 by N-fraction proxy is a tetrazole-bearing open-chain, not Uni-Mol-ranked).
Prior	Pool size	Top-1 \(D\) (km/s)	Top-1 composite	Keep rate (pipeline def.)	Top-1 SMILES
Gaussian \(\mathcal{N}(0,I_{1024})\) (3k, Uni-Mol-ranked)	3 000	9.02	1.10	15 % (Uni-Mol)	`O=[N+]([O-])c1ccncc1`
Gaussian \(\mathcal{N}(0,I_{1024})\) (40k, N-frac proxy)	40 000	(not ranked by D; N-fraction proxy)	(not on the DGLD scale)	48 % (SMARTS+SA+min-10-HA)	`NC=NCNNCCCNn1cnnn1` (15-HA tetrazole chain, \(f_N=0.60\))
DGLD diffusion + guidance (M7 100k)	100 000	9.47	0.665	4.6 % (Uni-Mol)	`O=[N+]([O-])OC1C([N+](=O)[O-])=C1[N+](=O)[O-]`
Lift (DGLD over Gaussian 3k)	–	+0.45	−0.44 (better)	–	–

6.3 Multi-head guidance vs unguided at matched compute

At pool=10 000 each, with identical denoiser and only the sampler differing, multi-head guidance lifts max composite from 0.64 to 0.69, top-1 \(D\) from 8.94 to 9.51 km/s, and top-1 \(P\) from 34.0 to 38.8 GPa (Figure 6.1).

Guided vs unguided composite distribution — **Figure 6.1.** Effect of multi-head score-model guidance on the composite-score distribution at matched compute (pool = 10 000 each, post hard-gate filter). The denoiser is identical in both runs; only the sampler differs.

To isolate the contribution of multi-head score-model guidance at the larger pool size, a four-condition head-to-head was run at matched compute (pool=40 000 each, identical CFG=7, identical DDIM steps=40, identical Pareto reranker, single seed). The four conditions are C0 unguided (classifier-free guidance only); C1 viab+sens (\(s_{\text{viab}}=1.0\), \(s_{\text{sens}}=0.3\)); C2 viab+sens+hazard (\(+s_{\text{hazard}}=1.0\)); C3 hazard-only (\(s_{\text{hazard}}=2.0\), other heads off).

**Table 6.3a.** Per-condition top-1 candidates from the four-way head-to-head at pool=40 000 single-seed.
Condition	top-1 \(D\) (km/s)	top-1 formula	top-1 SMILES	scaffolds in top-100	internal diversity over top-100
C0 unguided	9.51	C₁N₄O₃ (4-nitro-1,2,3,5-oxatriazole)	O=[N+]([O-])c1nnon1	12	0.806
C1 viab+sens (\(s_{\text{viab}}\!=\!1.0,\,s_{\text{sens}}\!=\!0.3\))	9.44	C₂H₂N₄O₆	O=[N+]([O-])C=C(N[N+](=O)[O-])[N+](=O)[O-]	5	0.778
C2 viab+sens+hazard (\(+s_{\text{hazard}}\!=\!1.0\))	9.53	C₂HN₃O₆ (dinitro-vinylene)	O=[N+]([O-])C=C([N+](=O)[O-])[N+](=O)[O-]	5	0.779
C3 hazard-only (\(s_{\text{hazard}}\!=\!2.0\), others=0)	9.53	C₂HN₃O₆ (same as C2)	O=[N+]([O-])C=C([N+](=O)[O-])[N+](=O)[O-]	7	0.794

Top-1 detonation velocity is nearly invariant across the four conditions (9.44–9.53 km/s); guidance reduces scaffold diversity from 12 (unguided) to 5 (default-guided) Bemis–Murcko scaffolds across the top-100, and the hazard-only condition recovers slightly more (7), consistent with hazard being a single-axis push rather than a tightly correlated multi-property descent. C2 viab+sens+hazard is adopted as the production default. The C0 unguided top-1 SMILES (C₁N₄O₃, 4-nitro-1,2,3,5-oxatriazole) is the same structure as E-set lead E1, confirming that the unguided sampler independently surfaces E1 as its highest-ranked candidate, a consistency check that E1 is not an artefact of the extension-set mining procedure.

Distribution-learning metrics (MOSES + FCD). To address standard distribution-learning benchmarks expected by NMI reviewers, we report MOSES-style metrics [14] on the four guidance conditions above at pool=10 000 each (10k SMILES per condition):

**Table 6.3b.** MOSES-style distribution-learning metrics on each of the four guidance conditions (pool = 10 000 SMILES per condition). Conditions C0–C3 as above. Columns: RDKit validity rate; uniqueness fraction in the 10k pool; novelty against the labelled master corpus (LM, 65 980 rows); MOSES IntDiv1 internal-diversity score (1 \(-\) mean pairwise Tanimoto); count of unique Bemis–Murcko scaffolds in the 10k pool; pass rate against the PAINS drug-like alert catalog; pass rate against the chemist-curated energetic-chemistry SMARTS catalog (chem_redflags); SNN, average maximum Tanimoto similarity to the labelled master.
Condition	validity	uniqueness	novelty (vs LM)	IntDiv1	BM scaffolds	PAINS-pass	energetic-SMARTS-pass	SNN to LM
C0 unguided	1.000	0.728	0.998	0.845	125	0.861	0.301	0.328
C1 viab+sens	1.000	0.639	0.998	0.833	62	0.804	0.259	0.309
C2 viab+sens+hazard	1.000	0.572	0.999	0.851	71	0.775	0.246	0.292
C3 hazard-only	1.000	0.732	0.998	0.858	165	0.844	0.260	0.306

The MOSES small-multiples figure (validity, scaffold uniqueness, internal diversity, FCD) is included in the main paper as Extended Data Fig. 9.

Multi-seed validation (SA-axis sweep, 3 seeds each). To bound seed variance on the MOSES metrics, we re-ran the MOSES evaluation on the four SA-axis conditions (C0 unguided; C1 viab-only; C2 viab+sens; C3 viab+sens+SA) across 3 independent seeds (10 000 SMILES per seed per condition). Table 6.3c reports mean \(\pm\) std across seeds. The multi-seed evaluation confirms that validity is exactly 1.000 for every seed and condition (zero variance), novelty vs the labelled master is >99.9 % for all seeds, and the uniqueness and diversity trends from Table 6.3b are reproducible. Scaffold count drops substantially under stronger guidance (1262 scaffolds unguided vs 659 fully guided, a 48 % reduction), with low seed-to-seed variance (\(\le\) 23 scaffolds s.d.) confirming that the guidance-driven narrowing of chemical space is a stable, signal-level effect rather than a seed artifact.

**Table 6.3c.** Multi-seed MOSES-style metrics for the four SA-axis conditions (3 seeds each, 10 000 SMILES per seed). Validity is exactly 1.000 (zero variance) across all seeds and conditions. Values reported as mean \(\pm\) std. SA-axis conditions: C0 = unguided; C1 = viab-only (\(s_{\text{viab}}\!=\!1.0\)); C2 = viab+sens (\(+s_{\text{sens}}\!=\!0.3\)); C3 = viab+sens+SA (\(+s_{\text{SA}}\!=\!0.15\)).
Condition	validity	uniqueness	novelty (vs LM)	IntDiv1	SNN to LM	BM scaffolds
C0 unguided (3 seeds)	1.000	0.671 \(\pm\) 0.002	0.999 \(\pm\) 0.000	0.838 \(\pm\) 0.002	0.221 \(\pm\) 0.005	1262 \(\pm\) 10
C1 viab-only (3 seeds)	1.000	0.660 \(\pm\) 0.003	0.999 \(\pm\) 0.000	0.830 \(\pm\) 0.002	0.221 \(\pm\) 0.002	1113 \(\pm\) 21
C2 viab+sens (3 seeds)	1.000	0.657 \(\pm\) 0.003	1.000 \(\pm\) 0.000	0.822 \(\pm\) 0.001	0.223 \(\pm\) 0.009	815 \(\pm\) 23
C3 viab+sens+SA (3 seeds)	1.000	0.617 \(\pm\) 0.002	1.000 \(\pm\) 0.000	0.818 \(\pm\) 0.002	0.210 \(\pm\) 0.005	659 \(\pm\) 18

Five readings of Table 6.3b:

Validity is 100 % for all conditions: the LIMO non-autoregressive parallel decoder, fine-tuned on the 326 k energetic corpus, produces RDKit-parseable SMILES at convergence regardless of the latent perturbation. Validity is measured by calling RDKit.Chem.MolFromSmiles on every decoder output; any output for which MolFromSmiles returns None is counted as invalid (validity = 0). All validity figures in Tables 6.3b and 6.3c are computed on this basis. This is consistent with the SELFIES-as-decode-vocabulary guarantee but stronger because the LIMO decoder's argmax may produce SELFIES that round-trip to non-canonical or strained chemistry; here it does not.
Novelty vs the labelled master is >99.8 % for all conditions: nearly every generated SMILES is absent from our 65 k labelled corpus. This corroborates the Results (main paper) Tanimoto-stratified novelty result (median NN-Tanimoto 0.32–0.36) and confirms that the diffusion sampler is not memorising training rows.
Per-condition uniqueness varies substantially: 0.728 (unguided) to 0.572 (viab+sens+hazard). Stronger guidance focuses the candidate set, producing more repeated SMILES, an expected and interpretable side-effect of gradient steering.
Bemis–Murcko scaffold count varies 2× across conditions (62 to 165 from the same 10k samples). Hazard-only (C3) produces the most scaffold diversity (165), unguided (C0) is intermediate (125), and the combined viab+sens+hazard (C2) is the most focused (71). This is the cleanest quantitative demonstration that per-head guidance steers chemistry class, not just metric values.
PAINS pass rate is 77–86 %; chem_redflags pass rate is 25–30 %. PAINS is a drug-like filter (and the high pass rate confirms the corpus is far from drug-discovery PAINS hot-spots). The chem_redflags rate of 25–30 % is intentional: our SMARTS catalog is strict on energetic-chemistry hazards, and the diffusion sampler explores broadly enough that the catalog rejects most candidates; this is the design intent of the gating layer (Methods; see also Methods (main paper)), and it scales as expected with guidance strength (rejection rate rises modestly under guidance because guidance pushes into denser energetic-chemistry regions where the SMARTS catalog has more triggers).

Fréchet ChemNet Distance (FCD). We additionally compute FCD [15] for each condition against a 5 000-row sample of the labelled master, recognising that the ChemNet feature extractor is trained on drug-like ZINC+PubChem chemistry and therefore FCD in our energetic CHNO domain is a chemistry-class-transfer signal rather than a within-domain quality metric. With that scope in mind:

**Table 6.3d.** Fréchet ChemNet Distance (FCD) of each method/condition's 10 000-SMILES output against a 5 000-row sample of the labelled master corpus, mean \(\pm\) std across seeds. Lower FCD means the generated chemistry is distributionally closer to the labelled master. The SMILES-LSTM no-diffusion baseline (FCD = 0.52) is distributionally indistinguishable from the labelled master, but its top-1 Pareto-reranker composite is 0.083 (main Fig. 4 and the baselines discussion in Results “Novelty and synthesisability” (main paper)): a generator that mimics the training distribution does not automatically produce property-targeted candidates. DGLD's per-condition FCD increases monotonically as the guidance signal pushes the sampler off the labelled master distribution toward higher target properties.
Condition (n_seeds)	FCD vs labelled master (mean ± std)
SMILES-LSTM (no diffusion, 3 seeds, mean)	0.52
DGLD C3 hazard-only (3 seeds)	24.04 \(\pm\) 0.12
DGLD C0 unguided (6 seeds)	24.99 \(\pm\) 0.05
DGLD C2 viab+sens+hazard (3 seeds)	25.11 \(\pm\) 0.05
DGLD C1 viab (SA-axis, 3 seeds)	25.41 \(\pm\) 0.13
DGLD C1 viab+sens (hazard-axis, 3 seeds)	25.86 \(\pm\) 0.02
DGLD C2 viab+sens (SA-axis, 3 seeds)	25.93 \(\pm\) 0.01
DGLD C3 viab+sens+SA (SA-axis, 3 seeds)	26.47 \(\pm\) 0.06

The result is methodologically informative and self-consistent with the Supplementary Note 6.4 Pareto-reranker composite ranking: the SMILES-LSTM no-diffusion baseline produces chemistry that is distributionally indistinguishable from the labelled master (FCD = 0.52, the floor for a generic energetic-flavoured SMILES generator trained on the same corpus), but its top-1 Pareto-reranker composite is 0.083 (Supplementary Note 6.4); a generative model that mimics the training distribution well by FCD does not automatically produce property-targeted candidates. DGLD's per-condition FCD increases monotonically as the guidance signal pushes the sampler off the labelled master distribution toward higher target properties (C3 viab+sens+SA at FCD = 26.47 is the most extrapolatory and also has the highest top-1 composite of 0.698, Supplementary Note 6.4). FCD and Pareto-reranker composite are anti-correlated across DGLD conditions: guidance trades distributional fidelity for property-target extrapolation, exactly as the methodology of classifier-free + multi-head guidance is designed to do. Within the energetic-chemistry domain, the corpus-internal SNN to labelled master (0.29–0.33) provides a complementary in-domain quality signal; the FCD numbers above should be read as the chemistry-class-transfer measurement that ChemNet was designed for, contextualised against the Supplementary Note 6.4 composite.

6.4 Per-head ablation (hazard, SA, seed-variance)

Why a separate SA head and what it costs. A fifth score-model head trained on RDKit's synthetic-accessibility score (Ertl 2009) was added to test whether sample-time gradient guidance toward "more synthesisable" chemistry would surface easier candidates without sacrificing detonation performance. The head is trained on the same shared FiLM-MLP trunk of Discussion (main paper) with a SmoothL1 loss against the SAScorer label and a static head weight \(w_{\text{SA}} = 0.25\) chosen so its loss term sits at O(1) at convergence. The known limitation: the SAScorer label distribution is calibrated on drug-discovery reaction corpora (Reaxys + pharma chemistry) and penalises uncommon ring fusions and rare fragments well-precedented in energetic chemistry (CL-20-class polyaza-cages, hexanitrobenzene). The SA-axis matrix below empirically tests whether this drug-domain prior helps or hurts in our setting. (See Supplementary Note 2.4 for the full SA / SC drug-domain transfer-head story.)

To put the Supplementary Note 6.3 single-seed numbers in the context of seed variance, the same sampler was run at pool=10 000 per (condition, seed) for three seeds per condition across two overlapping four-cell matrices: a hazard-axis matrix (Hz-C0/Hz-C1/Hz-C2/Hz-C3 with hazard guidance enabled) and an SA-axis matrix (SA-C1/SA-C2/SA-C3 with synthetic-accessibility-gradient guidance), sharing the unguided cell as Hz-C0 = SA-C0. Together the two matrices yield seven distinct guidance settings.

**Table 6.4a.** DGLD multi-seed head-to-head at pool=10 000 per condition across hazard-axis (Hz-) and SA-axis (SA-) matrices, sharing the unguided Hz-C0 = SA-C0 cell. Bold row is the most-novel condition.
Condition	top-1 composite (lower = better)	top-1 \(D\) (km/s)	top-1 \(\rho\) (g/cm³)	top-1 \(P\) (GPa)	top-1 max-Tani to LM	seeds
DGLD Hz-C0 = SA-C0 unguided (cfg-only)	0.451 \(\pm\) 0.126	9.44 \(\pm\) 0.07	1.93 \(\pm\) 0.01	39.7 \(\pm\) 0.6	0.61 \(\pm\) 0.10	6
DGLD SA-C1 viab-only	0.542 \(\pm\) 0.184	9.54 \(\pm\) 0.04	1.93 \(\pm\) 0.01	39.8 \(\pm\) 0.5	0.63 \(\pm\) 0.06	3
DGLD Hz-C1 viab+sens	0.613 \(\pm\) 0.106	9.36 \(\pm\) 0.06	1.93 \(\pm\) 0.01	38.7 \(\pm\) 0.4	0.46 \(\pm\) 0.04	3
DGLD SA-C2 viab+sens	0.618 \(\pm\) 0.056	9.44 \(\pm\) 0.05	1.94 \(\pm\) 0.01	38.5 \(\pm\) 0.4	0.41 \(\pm\) 0.04	3
DGLD Hz-C2 viab+sens+hazard	0.485 \(\pm\) 0.152	9.39 \(\pm\) 0.04	1.91 \(\pm\) 0.03	38.7 \(\pm\) 0.6	0.27 \(\pm\) 0.03 (most novel)	3
DGLD Hz-C3 hazard-only	0.503 \(\pm\) 0.131	9.32 \(\pm\) 0.10	1.95 \(\pm\) 0.01	38.8 \(\pm\) 0.4	0.44 \(\pm\) 0.05	3
DGLD SA-C3 viab+sens+SA	0.698 \(\pm\) 0.015	9.34 \(\pm\) 0.07	1.94 \(\pm\) 0.01	38.7 \(\pm\) 0.4	0.49 \(\pm\) 0.06	3

Per-condition seed variance (Hz-C1 \(=\) 0.613, Hz-C2 \(=\) 0.485, Hz-C3 \(=\) 0.503) is comparable to or larger than the across-condition mean differences, so the Supplementary Note 6.3 single-seed numbers must be read with this variance in mind. Hz-C2 viab+sens+hazard is the most-novel condition (max-Tani 0.27, 0/3 seeds memorised) and is adopted as the production default. The SA-axis SA-C3 (viab+sens+SA at \(s_{\text{SA}}\!=\!0.15\)) is the worst condition in either matrix at top-1 composite 0.698, justifying the production setting \(s_{\text{SA}}\!=\!0\) (Methods (main paper)). The 6-seed unguided pool gives the tightest mean (0.451 \(\pm\) 0.126).

Per-head guidance grid: full md5 cluster table. A 12-cell head-state grid over \(s_{\text{viab}}\in\{0,1,5\}\) and \(s_{\text{hazard}}\in\{0,1,2,5\}\) at pool=20 000, single seed, patched sampler, collapsed by output md5, clusters into exactly four distinct outputs by 2×2 head-on/off; within-cluster scale magnitude does not alter the output because the per-row clamp saturates as soon as a head is active. Cluster D (viab + hazard) is the production default.

**Table 6.4b.** 12-cell head-state grid over \(s_{\text{viab}}\in\{0,1,5\}\) and \(s_{\text{hazard}}\in\{0,1,2,5\}\) at pool=20 000, single seed, patched sampler, collapsed by output md5. Cells cluster into exactly four distinct outputs by 2×2 head-on/off; within-cluster scale magnitude does not alter the output because the per-row clamp saturates as soon as a head is active.
Cluster	cells	head-state	output md5 prefix
A	sv0_sh0 (1)	unguided / CFG-only	`f0438f…`
B	sv0_sh{1,2,5} (3)	hazard-only	`93da34…`
C	sv{1,5}_sh0 (2)	viab-only	`c83f06…`
D	sv{1,5}_sh{1,2,5} (6)	viab + hazard	`e26b43…`

Within each cluster, per-head scale magnitude does not alter the output: the gradient-norm clamp saturates as soon as a head is active, so all non-zero scales in a cluster are equivalent. Fine analogue scale control requires tightening the clamp (see Supplementary Note 2.3).

Heuristic vs literature-grounded sensitivity head. Three-way ablation (each condition: 2 pools \(\times\) 1500 = 3000 samples, identical CFG, identical Pareto reranker, single seed) substituting the literature-grounded sensitivity head (Supplementary Note 2.2):

**Table 6.4c.** Heuristic vs literature-grounded sensitivity-head ablation at pool=3 000 single seed: unguided baseline; \(s_{\text{sens}}=0.3\) with the heuristic-trained head; \(s_{\text{sens}}=0.3\) with the literature-grounded h₅₀-trained head. The literature-grounded head reduces top-5 mean sensitivity by \(\sim 8\times\) at the cost of \(\sim\)0.4 km/s on top-1 \(D\).
Condition	top-1 comp	top-1 \(D\) (km/s)	top-1 sens	top-5 mean sens	top-5 mean \(D\)
A. unguided (no score model)	0.602	9.27	0.11	0.20	9.02
B. heuristic sensitivity head	0.414	9.46	0.75	0.33	9.02
C. literature-grounded sensitivity head (h₅₀)	0.421	8.75	0.00	0.04	8.63

The literature-grounded h50 head reduces top-5 mean sensitivity by \(\sim\)8× at the cost of \(\sim\)0.4 km/s on top-1 \(D\); preferred when candidate stability is the priority over peak performance.

Hazard-head post-hoc filtering on the merged top-100. Applying the Discussion (main paper) hazard-head-as-rerank-weight to the merged top-100, scored against the chemist-curated SMARTS catalog, gives the following alignment:

**Table 6.4d.** Alignment of the hazard-head sigmoid output with the chemist-curated SMARTS reject catalog on the merged top-100. Rows: SMARTS-only reject count; hazard-head soft reject (\(P_{\text{hazard}} > 0.5\)) count; intersection (perfect recall: every SMARTS reject is also flagged by the head); hazard-head-only flags (head catches motifs the rule didn't cover); both pass (clean by both gates).
Test	Count
energetic-SMARTS gate reject	23 / 100
hazard head P > 0.5	89 / 100
both reject (SMARTS ∩ hazard)	23 / 23 (perfect recall)
hazard-only flag (SMARTS missed but head flagged)	66 / 100
both pass (SMARTS-ok ∩ hazard \(\le\) 0.5)	11 / 100

The hazard head recovers all 23 chemist-SMARTS rejects with perfect recall on the merged top-100 (the chemist-rejects were in the head's training-positive set, so this is partly by construction); a more honest generalisation estimate via 5-fold cross-validation across all 196 hazard-dataset rows gives mean held-out AUROC = 0.91 ± 0.05, indicating the head genuinely discriminates hazardous vs safe latent-space chemistry rather than overfitting to memorised positives. The head also flags 66 additional candidates from the merged top-100 that the SMARTS catalog let through. Manual inspection of the top-flagged candidates shows a mix of true positives (motifs the hand-written SMARTS rules don't cover but which are similar in latent space to known hazardous chemistry) and false positives (over-conservative latent-neighborhood matches: e.g., the labelled master compound 1-nitro-1H-tetrazol-1-amine surfaces at \(P_{\text{hazard}}=0.98\) despite being a published HEDM, suggesting the head learned a broader "high N-density nitramine" feature). The hazard head is therefore used as a soft re-rank weight, not a hard reject. Threshold \(P_{\text{hazard}}>0.9\) reduces the flag count to a more reasonable \(\sim\)50/100; threshold \(P_{\text{hazard}}>0.5\) over-rejects.

Why this matters for the methodology. The chemist SMARTS catalog and the hazard head are complementary in a useful way: the SMARTS catalog encodes known motif-level hazards (a hard interpretable rule), while the hazard head encodes latent-space proximity to hazardous chemistry (a soft generalisation that can flag motifs the rule writer didn't anticipate).

6.5 M7 five-lane 100k-pool run

The Supplementary Note 6.3/Supplementary Note 6.4 ablations each use pool = 10 000 per condition; Supplementary Note 6.3 at pool = 40k uses a single lane per denoiser. To verify that pool-fusion continues to pay off beyond 40k, a 100k-pool M7 experiment ran five sampling lanes with seeds 3–4; the top-200 candidates were re-ranked by the Methods; see also Methods (main paper) composite.

**Table 6.5.** M7 five-lane 100k pool-fusion results vs the Supplementary Note 6.3 single-lane 40k baseline. Pipeline keep-rate is n_validated / n_raw (Uni-Mol + composite filter). Diversity and scaffold count over the returned top-200.
Run	n_raw	n_pass	Keep rate	Top-1 \(D\) (km/s)	Max \(D\) (km/s)	Mean \(D\) (km/s)	Scaffolds	IntDiv
Single-lane 40k (Supplementary Note 6.3)	40 000	966	2.4 %	9.51	9.51	—	—	—
M7 five-lane 100k	100 000	4 639	4.6 %	9.47	9.79	9.03	24	0.81

The five-lane fusion nearly doubles the keep rate (4.6 % vs 2.4 %, 1.92×) and quintuples the absolute number of passing candidates (4 639 vs 966), consistent with each lane surface-sampling a distinct region of the high-density manifold. Top-1 \(D\) is stable at 9.47 km/s (within 0.04 km/s of the single-lane best), while max \(D\) rises to 9.79 km/s. The fused top-200 spans 24 distinct Bemis–Murcko scaffolds at internal diversity 0.81, a meaningful improvement over the single-lane seven-scaffold top-100. The top-1 SMILES is O=[N+]([O-])OC1C([N+](=O)[O-])=C1[N+](=O)[O-] (1,2,3-trinitrocyclopropene), a compact ring with three nitro groups and no prior labelled-master entry.

6.6 Tier-gate ablation (perf head trained without label-trust masking)

The tier-gated training recipe (Methods (main paper), Discussion (main paper)) masks the performance head from Tier-C/D rows during training, preventing low-confidence empirical-formula and 3D-CNN surrogate labels from driving the performance gradient. To quantify this design choice, the score model was retrained with --no_tier_gate: the performance head trains on all rows, including noisy Tier-C/D labels. The modified model then guided 20 000 samples through the identical DDIM sampler.

**Table 6.6.** Tier-gate ablation: score model trained with and without perf-head label-trust masking, evaluated at pool=20 000. Top-1 ranked by N-fraction proxy (no Uni-Mol scoring). n_feasible is post-SMARTS-gate + SA ≤ 6.5.
Training	n_raw	n_feasible	Keep rate	Top-1 \(f_N\)	Top-1 SMILES	Best val loss
Production (tier-gate on, Methods (main paper))	20 000	—	—	—	see Supplementary Note 6.3 (Uni-Mol-ranked)	1.089
No tier-gate (ablation)	20 000	10 773	53.9 %	0.667	`[N+][N+]([O-])NC(=N)N[N+]=O`	1.178

Without the tier-gate, the score model's performance head trains on Tier-C/D surrogate labels (sharply lower confidence than Tier-A/B experimental/DFT), and the guidance gradient pushes the sampler toward simple linear poly-N chains (top-1: guanidinium-class nitramine, 6 heavy atoms) rather than ring-bearing energetic scaffolds. The n_feasible keep rate rises to 53.9 % (vs 4.6 % in production) because the corrupted performance signal steers into a region of molecule space that is chemically plausible but performance-degenerate (high \(f_N\) by atom count, not by energetic architecture). The validation loss rises by 0.089 absolute, consistent with the perf head being trained on noisier targets.

Supplementary Note 7: AiZynthFinder retrosynthesis audit (L1 and extension leads)

This Supplementary Note gives the full disconnection narrative for the Results “Novelty and synthesisability” (main paper) AiZynthFinder retrosynthesis audit on L1 (3,4,5-trinitro-1,2-isoxazole), the negative-result discussion for L4, L5, and the nine extension leads, and the drug-domain template-bias argument. The headline is reported in Results “Novelty and synthesisability” (main paper) (1/12 USPTO-template hit rate; L1 reachable in 4 steps with state score 0.50; nine-lead extension reproduces the population finding).

L1 disconnection sequence. Reading aizynth_results.json outside-in: the terminal disconnection at the L1 target is a one-step electrophilic ring-nitration of 4,5-dinitro-1,2-isoxazole with HNO₃ (USPTO template hash acb27933, policy probability 0.156, rank 0). Caveat: 4,5-dinitro-1,2-isoxazole is strongly electron-deficient and a third C-nitration under standard HNO₃ alone is unlikely at this deactivation level; literature routes for trinitrated five-membered heterocycles use fuming HNO₃/oleum or N₂O₅ [16], and the AiZynth template fires on the C–H bond without accounting for ring deactivation, so this disconnection is a synthetic starting point rather than a literal protocol. Three earlier disconnections then walk the 3-amino precursor back through Boc protection and a Curtius-type rearrangement using diphenylphosphoryl azide (DPPA) over 4,5-dinitro-1,2-isoxazole-3-carboxylic acid. Hazard note: the DPPA-mediated Curtius step generates 4,5-dinitro-1,2-isoxazole-3-carbonyl azide as a transient intermediate; acyl azides on highly electron-poor polynitro heterocycles are primary-explosive-class compounds subject to Curtius rearrangement and thermal decomposition below ambient temperature, and any scale-up attempt requires low-temperature protocols (\(\le\) −10 °C throughout the azide stage), in-situ trapping of the resulting isocyanate, and appropriate explosive-handling facilities.

ZINC catalog gap. Of the leaf precursors enumerated by AiZynth, only tert-butanol and DPPA are flagged in_stock: true against the ZINC catalog; the four energetic-domain intermediates (4,5-dinitro-1,2-isoxazole, its 3-amino derivative, the Boc-protected amine, and the 3-carboxylic acid) are all in_stock: false. The retrosynthetic statement is therefore: USPTO templates fire on the L1 target and supply a chemically reasonable 4-step disconnection consistent with the Hervé group's 3-amino-4,5-dinitroisoxazole precursor chemistry, but the energetic-domain intermediates that the route requires are not stocked by the drug-discovery-biased ZINC catalog. This is the same chem-domain catalog gap that applies to L4 and L5 below; it lower-bounds aromatic poly-nitro chemistry as template-reachable from public USPTO reactions, and it does not amount to a commercial-precursor route.

Negative results on L4, L5, and the nine extension leads. L4 and L5 returned no productive routes within the 200-MCTS / 300-s search budget. The state score of 0.049 corresponds to AiZynth's "target node only, no expansions completed" baseline, which means the USPTO template set and ZINC catalog together contain no expansions for the tetrazoline-NNO₂ ring-closure or for the acyl-oxime-O-NO₂ linkage. The negative result is not a search-budget artefact: a follow-up run at 5\(\times\) the original iteration budget (1000 MCTS iter, 1800 s wall) reproduced the same outcome on both leads, with the search exiting in <2 s because no expansion templates fire on the target node. This is a negative result and reflects a known limitation of the public AiZynth distribution: the USPTO template set is drug-discovery-biased and does not include energetics-specific reactions (mixed-acid nitration, N₂O₅ nitrolysis, KMnO₄-promoted oxidation/cyclisation). It is not evidence that L4 and L5 are unsynthesisable; it is evidence that a generative model proposing energetic-domain chemistry needs an energetics-domain template database for retrosynthetic auditing, a gap we flag for the community.

To test whether the L4/L5 negative result generalises, we reran the same 200-MCTS / 300-s AiZynthFinder protocol on the nine other chem-pass leads (L2, L3, L9, L11, L13, L16, L18, L19, L20). All nine returned no productive routes within budget (aizynth_results_extension.json), with search times ranging from 0.1 s (immediate template-policy rejection at the target node) to 38.8 s (exhaustive enumeration without finding a closed route). The drug-domain-template gap is therefore a population finding rather than an n=2 anecdote: across the twelve chem-pass leads, only L1 (aromatic 1,2-isoxazole with simple ring nitration) is reachable from public USPTO templates; the eleven other leads (small saturated heterocycles, nitramines, oxime-nitrates, polyazene chains) require energetic-domain expansions that the public template set does not contain. The retrosynthetic plausibility check is therefore a positive lower-bound for L1 and uninformative (rather than negative) for the other eleven; conversion of the L1 4-step disconnection into an actionable wet-lab route additionally requires sourcing of the energetic-domain intermediates that ZINC does not stock.

Supplementary references

This list contains only references cited within this Supplementary Information. References cited from the main article (LIMO, RDKit, the diffusion-model and FiLM building blocks, baseline methods, etc.) appear in the main paper’s reference list and are not duplicated here. Numbering is local to this document.

Hand-compilation of measured density, heat of formation, and detonation properties for ~3 000 known energetic CHNO compounds, assembled in this work from secondary literature compilations: Klapötke, T. M. Chemistry of High-Energy Materials, 5th ed. (de Gruyter, 2019); Cooper, P. W. Explosives Engineering (Wiley-VCH, 1996); and Dobratz, B. M. & Crawford, P. C. LLNL Explosives Handbook: Properties of Chemical Explosives and Explosive Simulants, UCRL-52997, Rev. 2 (Lawrence Livermore National Laboratory, 1985). Per-row provenance documented in Supplementary Note 1.1.
Casey, A. D., Son, S. F., Bilionis, I., & Barnes, B. C. (2020). Prediction of Energetic Material Properties from Electronic Structure Using 3D Convolutional Neural Networks. J. Chem. Inf. Model. 60(10):4457–4473. doi:10.1021/acs.jcim.0c00259.
Kamlet, M. J., & Jacobs, S. J. (1968). Chemistry of Detonations. I. A Simple Method for Calculating Detonation Properties of C-H-N-O Explosives. J. Chem. Phys. 48:23–55. doi:10.1063/1.1667908.
Sterling, T., & Irwin, J. J. (2015). ZINC 15: Ligand Discovery for Everyone. J. Chem. Inf. Model. 55(11):2324–2337.
NIST CAMEO Chemicals: Database of Hazardous Materials and Reactivity. cameochemicals.noaa.gov.
Bruns, R. F. & Watson, I. A. Rules for identifying potentially reactive or promiscuous compounds. J. Med. Chem. 55, 9763–9772 (2012); used here as a SMARTS-based reactivity-demerit source.
Klapötke, T. M. (2017). Strategies for the synthesis of energetic, nitrogen-rich heterocycles and their salts. Chem. Mater. 29(5):1815–1826. doi:10.1021/acs.chemmater.6b04025.
Kim, S. et al. (2023). PubChem 2023 Update. Nucleic Acids Res. 51(D1):D1373–D1380.
Bannwarth, C., Ehlert, S., & Grimme, S. (2019). GFN2-xTB: An Accurate and Broadly Parametrized Self-Consistent Tight-Binding Quantum Chemical Method with Multipole Electrostatics and Density-Dependent Dispersion Contributions. J. Chem. Theory Comput. 15(3):1652–1671. doi:10.1021/acs.jctc.8b01176.
Huang, X. et al. (2021). Applying Machine Learning to Balance Performance and Stability of High Energy Density Materials. iScience 24:102240.
Goerigk, L., Hansen, A., Bauer, C., Ehrlich, S., Najibi, A., & Grimme, S. (2017). A look at the density functional theory zoo with the advanced GMTKN55 database for general main group thermochemistry, kinetics and noncovalent interactions. Phys. Chem. Chem. Phys. 19(48):32184–32215. doi:10.1039/C7CP04913G.
Bondi, A. (1964). van der Waals Volumes and Radii. J. Phys. Chem. 68(3):441–451. doi:10.1021/j100785a001.
Mathieu, D. (2017). Sensitivity of Energetic Materials: Theoretical Relationships to Detonation Performance and Molecular Structure. Ind. Eng. Chem. Res. 56(31):8191–8201. doi:10.1021/acs.iecr.7b02021.
Polykovskiy, D. et al. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology 11:565644. doi:10.3389/fphar.2020.565644.
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., & Klambauer, G. (2018). Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. J. Chem. Inf. Model. 58(9):1736–1741.
Konnov, A. A. et al. Synthesis of 4-nitroisoxazole-based energetic materials. Org. Lett. 27, 3795–3799 (2025). doi:10.1021/acs.orglett.5c01074.