Acting on the Unseen: Collaborative Filtering for Decentralized Multi-Robot Task Allocation under Minimal Assumptions

Streamlined draft (v2). Companion tutorial: tutorial.html. Repo: ApartsinProjects/ZKDroneSwarm. All numbers regenerable from saved per-seed data.

Abstract. We study multi-robot task allocation under the LEAST possible information: a swarm with NO prior knowledge (no labels, no task models, no known latent structure, not even the problem's rank), ZERO communication (no messages, no parameter sharing, no coordinator), and only PARTIAL, NOISY observation of a public outcome stream, with every decision made independently and distributedly. We show that, even so, each agent can act well on tasks it has NEVER tried, onboard brand-new tasks, and welcome new agents, by running online collaborative filtering over the public broadcast. Against any structure-free learner this is a CATEGORICAL separation, not a constant factor (the structure-free learner is at the error floor on unseen pairs by construction), and we prove a matching per-agent sample-complexity theory ($\Theta(d)$ vs $\Theta(n)$) plus an anytime (cumulative-reward) separation. Against the field of low-rank methods our online, masking-robust, anytime-optimal estimators dominate throughout the limited-observability regime that defines the problem.

Naming. Our methods form the SwarmCF family: SwarmCF is the core online weighted-ALS filter (code name RewardCF), SwarmCF-B the Bayesian-confidence variant (EMCF), SwarmCF-D/D+ the contention variants (ContentionCF/AdaCF), SwarmCF-batch the batch hybrid (PTF), and so on. Figures and tables use the SwarmCF names; running prose and code listings retain the original class names for traceability to the implementation.

1. Introduction

Minimal assumptions. No prior knowledge. Zero communication. Only partial, noisy observables. Fully distributed decisions. Can a swarm still act intelligently, in particular on the unseen? We answer yes, and prove the advantage is categorical.

A swarm repeatedly chooses which target to engage. Compatibility between agents and targets is hidden but low-rank; there are far more targets than rounds (n >> T); agents neither coordinate nor share parameters and each senses only a noisy, partial slice of the public outcomes. The crux is GENERALIZATION: a structure-free learner that estimates a target only from its own engagements is, on any target it never touched, at the prior-mean error floor, which dominates when targets outnumber rounds.

Contributions. (1) A minimal-assumption, decentralized, online, broadcast-only formulation of CF for MRTA. (2) The categorical unseen-pair and onboarding/cold-start results (Theta(d) vs Theta(n)); the cold-start holds even for a strictly zero-knowledge newcomer that recovers the target basis from its OWN masked broadcast rather than receiving any peer's model, with the separation degrading gracefully as its visibility falls. (3) An anytime separation under sample starvation. (4) A masking-model dichotomy (durable vs transient decentralization). (5) A method family that is masking-robust and anytime-optimal, dominating the prior-art low-rank competitors; and collective active exploration via the shared broadcast.

2. Setting

World. $m$ drones, $n$ targets, with $K_1,K_2$ latent types; the reward is the cosine of unit latent factors $p_i,u_j\in\mathbb{R}^d$,

$R_{ij} = \langle p_i, u_j\rangle \in [-1,1], \qquad R = PU^\top, \qquad \operatorname{rank}(R)=\min(d,K_1,K_2)$ (no nonlinear link).

Robotics grounding. Concretely, $p_i$ is robot $i$'s (hidden) CAPABILITY profile along $d$ latent traits (payload, sensor suite, speed, endurance, software skills) and $u_j$ is task $j$'s REQUIREMENT profile along the same traits, so $R_{ij}$ is the capability-requirement match (how well robot $i$ performs task $j$, e.g. inspection / coverage / pickup quality). Neither profile is known a priori, robots learn them only from outcomes. Crucially, the observation channel is LIMITED-RANGE SENSING, not radio: a robot perceives a teammate's engagement (which task, with what outcome) only when the engagement falls inside its sensing footprint, and more noisily the farther away it is. This is what makes the masking $\rho$ and noise $\sigma_{\mathrm{obs}}$ PHYSICAL rather than free parameters; we validate this directly by deriving observability from 2-D sensing geometry (range + distance-noise) and showing the categorical result survives (Section on sensing-grounded observability), the regime a real decentralized swarm operates in.

Observability. A public stream of (action, outcome); each agent senses its own outcome cleanly (noise $\sigma_{\mathrm{own}}$) and a per-agent masked (rate $\rho$), noisy (noise $\sigma_{\mathrm{obs}}$) slice of teammates' outcomes (passive sensing, not transmission, so communication-free). Sensing is independent per observer: agent $i$ applies its own persistent mask $M_{ik}$ and draws its own noise on each engagement, recording $\tilde R^{(i)}_{k}=R_{k,a_k}+\eta_{ik}^t$ with $\eta_{ik}^t\sim\mathcal{N}(0,\sigma_{\mathrm{obs}}^2)$ fresh per observer and round, so $\eta_{ik}^t\perp\eta_{i'k}^t$ for $i\neq i'$. Different agents thus read the SAME action differently and no single value is ever transmitted, which is what keeps their internal states distinct (state-uniqueness) and the channel communication-free. The observation precision is $\beta=1/\sigma^2$, with masked events absent (weight $0$). Baselines. Random, UCBIndep, UCBHomo, Tabular (no-structure); MFSGD, ESTR, PTF, BPMF (low-rank). Fairness/ZK. Every method gets a guessed rank (or none), an independent per-agent instance with no parameter sharing, and the same noisy partial stream; the Oracle (centralized, complete information) only normalizes scores. Metrics. normalized skill $=$ (method $-$ random)/(oracle $-$ random) (the standard skill score of forecast verification and the random-normalized score of RL benchmarking; an affine rescaling of regret with random$=0$, oracle$=1$), unseen-pair skill, anytime cumulative-reward skill, state-uniqueness.

3. Method

Each agent runs its own ONLINE weighted alternating least squares over the events it senses. An unobserved/masked event gets zero WEIGHT, not an imputed zero VALUE; this is what makes the estimator masking-robust (batch fill-in methods impute zeros for missing entries and decay under masking). Among OBSERVED events, inverse-variance (1/sigma^2) weighting is an optional noise-adaptive refinement that pays off when the broadcast is noisy; at low noise uniform weighting generalizes slightly better to unseen pairs (we ablate this honestly). Estimation is separated from the decision policy. The family: RewardCF (teammates' rewards; anytime-optimal), ChoiceZK (teammates' actions only; noise-immune fallback), HybridCFconv (probe, SVD warm-start, then converged online ALS; best final-policy unseen), and ActiveCFconv (latent-space UCB with a broadcast count-based uncertainty bonus, so one agent's probe lowers everyone's uncertainty; best balanced).

One core, scoped extensions (read this if nothing else). The hero is a SINGLE estimator, online noise-weighted ridge ALS over the public stream, with an $\epsilon$-greedy policy. Everything else, Bayesian confidence (EMCF), directed exploration, rank self-determination (ARD), contention de-confliction, and the choice channel, is a SCOPED EXTENSION of that one estimator: each is applied only when a stated condition holds and otherwise reduces to the core. We present them as a single theorem-backed menu (Section 5) rather than a method zoo, so the whole design space collapses to one method plus a short list of conditional add-ons.

4. Theory

(Full proofs in THEORY_FORMAL.md; step-by-step in the tutorial.)

Tabular floor. A structure-free learner has $\Omega(1)$ error and $0$ skill on any unseen pair, and the broadcast is provably useless to it; pooling recovers only the rank-1 popularity term. CF row completion. Given $U$, an agent's whole row is recovered exactly by an $O(d)$ least-squares fold-in; per-agent $\Theta(d)$ vs tabular $\Theta(n)$, categorical on unseen pairs. Anytime separation. Under $n \gg T$ any structure-free learner's anytime skill $\le g(cT/n) \to 0$ (even with full broadcast); CF reaches near-oracle in $O(d)$ rounds. Masking model. i.i.d. loss makes decentralization transient; persistent masking makes it durable; the categorical results are invariant to the choice. Additive ceiling. An additive predictor $a+b_i+c_j$ has rank $\le 2$ and reduces to the popularity order, so it cannot personalize and is strictly below CF for $d > 1$.

Recent results and honest scope. Precise scope (iff): the CATEGORICAL unseen gap vs a structure-free learner needs only a shared channel ($\rho>0$) and some structure; $d>1$ is what additionally beats a POPULARITY baseline (the additive ceiling); sample-starvation is what makes the unseen advantage OPERATIONALLY dominant (the anytime separation), three conditions with three distinct roles, not one AND. Confidence: precision fit-weighting is suboptimal for unseen prediction (it over-weights the drone's own, here irrelevant, data); the Bayesian posterior is the right object, and a sharper analysis shows bounded ratio-capped precision dominates uniform exactly when teammate sources DIFFER in reliability, while unbounded $1/\sigma^2$ starves coverage. Exploration: EMCF's predictive-variance UCB is LinUCB on the recovered latent features, giving $\tilde O(d\sqrt T)$ per-drone regret given $U$. Contention = fixed private offset; rank = ARD recovers the identifiable rank, removing the hyperparameter; choice channel = held-out informativeness is identifiable where in-sample is not. The keystone, now bounded: the per-drone recovery of $U$ from the persistent-masked broadcast, formerly invoked from centralized completion theory, is now pinned to an explicit per-drone condition. Drone $i$ recovers a target's factor $u_j$ (and so predicts the unseen pair) EXACTLY iff its own factor $p_i$ lies in the span of the factors of the visible teammates that engaged $j$ (an anchor block fixing the frame plus per-target spanning coverage); it is provably at the prior floor otherwise; under noise the error is bounded by the fold-in bound with the coverage degree controlling the noise term; and under non-adaptive exploration the condition self-achieves in $O((nd/\rho m)\log n)$ rounds, faster as the swarm grows. An oracle-reconstruction test on the actual coverage patterns confirms the threshold exactly (recovered pairs: error $0.000$; below the span condition: the prior floor). The one remaining open piece is the finite-time coverage rate under an adaptive exploiting policy (full statements, proofs, and the validation in THEORY_FORMAL.md and P15_VALIDATION.md).

Foundational results. Fold-in error bound (exact): the cold-start / onboarding prediction error splits cleanly into three sources, the swarm's basis-recovery error $\varepsilon$, own-probe noise $\sigma\sqrt{d/k}$, and ridge bias, and is EXACT once $k\ge d$; this is the first self-contained bound for our fold-in and it QUANTITATIVELY explains the sensing experiment (sparser sensing $\Rightarrow$ larger $\varepsilon \Rightarrow$ lower cold-start skill). Collective broadcast speedup: an isolated agent CANNOT recover a rank-$d{>}1$ column space (one matrix row is unconstrained, so it stays at the floor), whereas the broadcast lets the swarm cross the completion threshold in $\tilde O(d(1{+}n/m))$ rounds, a $\Theta(m)$-fold COLLECTIVE speedup with no communication; this is the theoretical content behind the $\rho>0$ condition. Minimax tightness: $\Omega(d)$ own probes are necessary even given the exact basis (with $k

5. Experiments

How to read the comparison. The ZK-MRTA setting itself is new, so this is not 'our method vs external rivals' but a CONTROLLED SWEEP across the low-rank design space we instantiate in the setting, measured against the genuinely external structure-free paradigm and full-information reference ceilings. Provenance is explicit: our invented methods form ONE family, SwarmCF (the core RewardCF plus Bayesian, choice, de-confliction, unified and batch variants, shown as 'SwarmCF-variant'), and PTF (SwarmCF-batch) is ours too, a strong batch-flavored hybrid we built from standard parts; MFSGD, ESTR, BPMF, SoftImpute, KNNCF and CLUB are named prior-work estimators we ADAPT to the setting (the algorithms are theirs, the decentralized broadcast-only application is part of our framework); the per-arm UCB and tabular learners are the structure-free paradigm we beat; and three full-information upper bounds (NOT competitors) bracket the top: the Oracle (knows the true reward), CentralClean (a centralized low-rank matcher that observes everything with NO noise and NO masking and assigns optimally, the strongest a centralized low-rank system can do without the true factors), and the noisy comms-full CTDE matcher. Harness fact: the evaluation builds one estimator PER drone, each seeing only the broadcast it observes and acting on its own row, so every method in the sweep is decentralized and communication-free in-harness; the low-rank methods differ in the UPDATE RULE (online vs batch-refit vs explore-then-commit) and the refinements, not the information regime (ESTR is spectral/centralized only in its origin and is run here as a per-drone explore-then-commit reduction). Our online CF sits in the hardest cell and any method that beats it in some regime relaxes one axis (a batch refit at full broadcast, full information for the ceilings).

Notation [dist | comm | obs | prior | compute]: dist D=decentralized, C=centralized; comm 0=none (passive), B=broadcast params, full=message-passing; obs full / ρ=masked / ρσ=masked+noisy / self; prior –=none, d̂=guessed rank (itself removable: the ARD variant learns it), ARD=rank self-determined, d=true rank, U*=oracle factors; compute online / batch / ETC=explore-then-commit / mem. In-harness every method is decentralized and communication-free (one estimator per drone on the passive broadcast); the low-rank methods differ in the update rule and refinements, not the information regime. Our flagship sits in the hardest cell [D | 0 | ρσ | d̂ | online]. Our invented methods are ONE family, SwarmCF (shown as 'SwarmCF-variant (code name)'), differing only in the menu axes of the mechanism table; everything else is a standard estimator we adapt, the structure-free paradigm, or a reference ceiling.

Method	Provenance	Dist	Comm	Observability	Prior	Compute	Profile
SwarmCF (RewardCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-B (EMCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-RC (BothCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-Ch (ChoiceCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-D+ (ContentionAdaCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-U (UnifiedCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-H (HybridCF)	ours	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
SwarmCF-B-ARD (ARD-EMCF)	ours	decentralized	0	ρσ	ARD	online	`D\|0\|ρσ\|ARD\|online`
SwarmCF-batch (PTF)	ours (hybrid)	decentralized	0	ρσ	d̂	batch	`D\|0\|ρσ\|d̂\|batch`
MFSGD	standard, adapted	decentralized	0	ρσ	d̂	online	`D\|0\|ρσ\|d̂\|online`
KNNCF	standard, adapted	decentralized	0	ρσ	–	mem	`D\|0\|ρσ\|–\|mem`
BiasModel	standard, adapted	decentralized	0	ρσ	–	online	`D\|0\|ρσ\|–\|online`
BPMF	standard, adapted	decentralized	0	ρσ	d̂	batch	`D\|0\|ρσ\|d̂\|batch`
SoftImpute	standard, adapted	decentralized	0	ρσ	d̂	batch	`D\|0\|ρσ\|d̂\|batch`
CLUB	standard, adapted	decentralized	0	ρσ	–	batch	`D\|0\|ρσ\|–\|batch`
ESTR	standard, adapted	decentralized	0	ρ	d̂	ETC	`D\|0\|ρ\|d̂\|ETC`
UCBIndep	structure-free baseline	decentralized	0	ρσ	–	online	`D\|0\|ρσ\|–\|online`
UCBHomo	structure-free baseline	decentralized	0	ρσ	–	online	`D\|0\|ρσ\|–\|online`
Tabular	structure-free baseline	decentralized	0	ρσ	–	online	`D\|0\|ρσ\|–\|online`
Random	structure-free baseline	decentralized	0	–	–	–	`D\|0\|–\|–\|–`
CentralClean-ceiling	reference (upper bound)	centralized	full	full	d̂	online	`C\|full\|full\|d̂\|online`
CTDE-ceiling	reference (upper bound)	centralized	full	fullσ	d̂	online	`C\|full\|fullσ\|d̂\|online`
Oracle	reference (upper bound)	centralized	full	full	U*	–	`C\|full\|full\|U*\|–`

Categorical (acting on the unseen). CF acts well on never-observed pairs at every observation density while structure-free learners sit at the floor (Fig. 1); new targets onboard for the whole swarm from ~d shared probes and new agents from ~d own probes (vs ~n for tabular).

**Fig. 1.** Unseen-pair skill vs observation density: CF acts on the unseen at every density; structure-free is pinned at the floor.

Operational (anytime). On reward actually earned over time, our online methods dominate at every horizon; per-arm bandits are stuck near random because, with n >> T, they never stop exploring untried arms (Fig. 2).

**Fig. 2.** Anytime cumulative reward: online CF earns from round one; explore-then-commit pays a probe phase; UCBIndep is stuck.

Robustness. Results are invariant to the masking model (i.i.d. vs persistent), while decentralization is durable only under persistent masking (Fig. 3); and they hold across rank, horizon, target/agent counts, guessed rank, cluster structure, and latent spread.

**Fig. 3.** Persistent vs i.i.d. masking: unseen/anytime invariant; state-uniqueness durable (persistent) vs transient (i.i.d.).

Out-of-harness transfer (independent simulator). Beyond our generative code, we drop the method as a drop-in policy into ZK-MRTA-Sim, an independently-built, spatially-explicit simulator of the same Zero-Knowledge / Zero-prior / Zero-communication MRTA setting (PettingZoo package tabula_drone; not to be confused with our Tabular baseline), which has dynamics our model lacks: spatial geometry, target health that DEPLETES as engaged (a soft contention/matching pressure), episodic resets, and a Hungarian optimal-assignment oracle. We benchmark against the environment's own policies (Fig. 3b). On the efficiency metric (reward per step), our method reaches skill ~0.81 with low variance, beating the environment's SGD matrix-factorization (~0.25) and UCBIndep (~0.72) and approaching the matching oracle. This tests TRANSFER and partial-contention robustness, not real-world realism (it is still a simulator, and itself latent-based, so it is not an independent test of low-rankness). For the low-rank assumption itself we stress it directly (approximate low-rank via entrywise noise; a nonlinear link that raises the effective rank) and observe graceful degradation with the ranking preserved.

F13 — **Fig. 3b.** ZK-MRTA-Sim, the independent simulator (spatial, depleting HP, episodic; matching oracle): converged skill and learning curves; ours beats the env's SGD-MF and UCBIndep, approaching the oracle.

Versus the field. The categorical unseen win is shared by all low-rank methods over no-structure ones (a property of structure). Our specific edge is masking-robustness and anytime optimality; with a converged estimator our methods dominate the strongest competitor (PTF) on every metric and density (Fig. 4). We further compare against a broader, fair set, run under the same limits: SoftImpute (nuclear-norm convex completion), kNN-CF (memory-based, model-free), and BiasModel (additive, no interaction). SoftImpute is the strongest at full broadcast (it tops unseen at rho=1) but, like all batch fill-in methods, decays under masking and loses on anytime; kNN-CF generalizes but earns little; BiasModel sits at the additive ceiling. Under masking and on anytime, our methods beat the best of these with non-overlapping CIs.

F11 — **Fig. 4.** Pareto frontier (anytime vs unseen): our methods dominate; PTF trades all anytime for unseen and is dominated under masking.

Headline numbers (20 seeds, bootstrap 95%% CI). The unseen-pair and anytime skill of the no-structure floor (UCBIndep), the strongest competitor (PTF), and our methods, at full broadcast (rho=1) and heavy masking (rho=0.25):

skill = (method - random)/(oracle - random); 0 = random, 1 = oracle. rho = fraction of broadcast observed. Ours in bold.

UNSEEN-pair skill

method	rho=1.0	rho=0.25
UCBIndep	0.000 [-0.004, 0.004]	-0.001 [-0.005, 0.004]
PTF	0.490 [0.458, 0.517]	0.272 [0.251, 0.293]
RewardCF	0.377 [0.355, 0.398]	0.326 [0.303, 0.349]
HybridCFconv	0.488 [0.468, 0.509]	0.379 [0.354, 0.401]
ActiveCFconv	0.476 [0.461, 0.493]	0.339 [0.317, 0.358]

ANYTIME skill

method	rho=1.0	rho=0.25
UCBIndep	0.002 [-0.006, 0.010]	-0.004 [-0.010, 0.002]
PTF	0.273 [0.257, 0.290]	0.226 [0.215, 0.236]
RewardCF	0.383 [0.366, 0.399]	0.342 [0.332, 0.352]
HybridCFconv	0.340 [0.327, 0.354]	0.300 [0.288, 0.311]
ActiveCFconv	0.436 [0.419, 0.451]	0.341 [0.323, 0.357]

Method ablation

We foreground ONE recommended method (ActiveCFconv) and present the rest as ablations rather than a method zoo. The table below isolates each design choice (online vs batch, precision weighting, SVD warm-start, active vs eps-greedy exploration) and the robustness to a guessed rank, under one identical config (12 seeds, 95%% CI):

Recommended method = ActiveCFconv (online weighted-ALS + latent-UCB active exploration; best balanced: top anytime, strong unseen). Each row below toggles ONE design choice; skill: 0 = random, 1 = oracle.

Design-choice ablation (unseen-pair and anytime skill)

variant (knob)	unseen rho=1.0	unseen rho=0.25	anytime rho=1.0	anytime rho=0.25
ActiveCFconv _{online wALS + precision + active-UCB (RECOMMENDED)}	0.485 [0.466, 0.506]	0.360 [0.345, 0.376]	0.440 [0.424, 0.455]	0.348 [0.330, 0.367]
HybridCFconv _{- active; + probe->SVD warm-start (eps-greedy)}	0.501 [0.478, 0.526]	0.396 [0.371, 0.421]	0.353 [0.337, 0.367]	0.308 [0.295, 0.321]
RewardCFconv _{- active, - warm-start (plain online wALS, eps-greedy)}	0.450 [0.427, 0.473]	0.372 [0.347, 0.395]	0.390 [0.373, 0.405]	0.332 [0.321, 0.342]
RewardCFconv_noprec _{- precision weighting (uniform obs weights)}	0.584 [0.574, 0.594]	0.392 [0.371, 0.413]	0.438 [0.426, 0.452]	0.339 [0.328, 0.349]
PTF _{batch SVD + SGD finetune (explore-then-commit)}	0.505 [0.482, 0.528]	0.276 [0.248, 0.305]	0.277 [0.260, 0.295]	0.231 [0.217, 0.245]
ESTR _{explore-then-spectral commit}	0.224 [0.197, 0.251]	0.047 [0.034, 0.059]	0.218 [0.196, 0.241]	0.185 [0.173, 0.197]

Rank sensitivity of ActiveCFconv (rho=0.25; true d=5, default d_hat=8)

guessed rank d_hat	unseen skill	anytime skill
2	0.222 [0.198, 0.245]	0.270 [0.247, 0.293]
5	0.327 [0.296, 0.356]	0.351 [0.331, 0.369]
8	0.360 [0.345, 0.376]	0.348 [0.330, 0.368]
12	0.359 [0.328, 0.390]	0.347 [0.329, 0.365]
20	0.344 [0.313, 0.369]	0.340 [0.324, 0.354]

Takeaways: (i) online weighted-ALS dominates batch / explore-then-commit (PTF, ESTR) under masking and on anytime (anytime rho=1.0: RewardCFconv 0.39 vs PTF 0.28; unseen rho=0.25: 0.37 vs 0.28). (ii) HONEST FINDING: precision weighting 1/sigma^2 does NOT help at the default broadcast noise (sigma_obs=0.3) -- UNIFORM weighting (noprec) is better on both unseen (0.584 vs 0.450 at rho=1.0) and anytime, because it uses the abundant broadcast fully (the broadcast carries the cross-target structure that unseen generalization needs). Precision weighting only pays off once the broadcast is high-noise; see PRECISION_SWEEP.md for the crossover. (iii) the SVD warm-start (HybridCFconv) lifts unseen, most at dense rho, at an anytime cost. (iv) active (latent-UCB) exploration improves anytime and dense-rho unseen over eps-greedy. (v) robust across guessed rank for d_hat >= true d (5..20 all ~0.35 unseen at rho=0.25), degrading only when d_hat=2 < true d=5.

Scoped refinements: one core, a menu of conditional extensions

To avoid a method zoo, every refinement beyond the core estimator is presented as a single MENU of scoped, theorem-backed extensions. Each helps ONLY under a stated condition and reduces to the core otherwise, so the practitioner adds an extension exactly when its condition holds. This is the entire design space in one view.

Refinement	Helps when	Result (8-20 seeds, CIs)	Theory
Bayesian confidence (EMCF)	directed exploration; coverage-scarce or noisy broadcast	predictive uncertainty is DISCRIMINATIVE (actual RMSE rises monotonically Q1$\to$Q5), and beats uniform when the broadcast is noisy	Bayesian posterior
Bounded precision weighting	observation sources DIFFER in noise	ratio-capped precision wins under heterogeneous noise ($+0.09$ unseen, non-overlapping CIs); UNBOUNDED $1/\sigma^2$ over-concentrates and loses everywhere	precision weighting
Collective info-directed exploration	anytime sample efficiency	one agent's broadcast probe lowers everyone's uncertainty; best FINAL anytime skill	n/a
ARD rank adaptation	rank unknown	self-determines the IDENTIFIABLE rank ($\le d$) and is INVARIANT to the guessed $\hat d$, which is what removes the rank hyperparameter (the recovered value reflects SNR/coverage, not the raw true rank)	rank identifiability
Fixed private offset ($+$ adaptive)	capacity-1 contention	$\approx 2\times$ earned reward vs greedy at severe contention ($0.10$ vs $0.059$), and BEATS both recognized no-comms de-confliction primitives, the CBBA auction-with-backoff ($0.064$) and multiplayer-MAB re-seating ($0.028$), with non-overlapping CIs; the adaptive scarcity-gated version extends the win across the contested range ($\|S\|\le m$)	private-offset result
Choice channel $+$ held-out $\gamma$	extreme reward-noise; mixed-reliability teammates	noise-immune fallback (overtakes the reward channel at $\sigma_{\mathrm{obs}}{=}2$); held-out $\gamma_k$ correctly ranks teammates (oracle $0.49\gg$ random $0.10$)	choice identifiability

What does NOT help (honest, diagnosed nulls): UNBOUNDED precision-ALS at homogeneous noise (coverage, not noise, binds); a LEARNED per-teammate choice gate vs a simple competence schedule when teammates are homogeneous (it ties, and in-sample it over-trusts random teammates, the bug held-out $\gamma$ fixes); confidence-GATED reward$+$choice fusion. Each is a documented null with a stated cause, several surfaced by deliberate SANITY experiments whose answer is obvious (oracle-vs-random teammates; homogeneous-vs-heterogeneous noise).

Capstone: the menu composes into ONE method (UnifiedCF). The refinements are not just individually scoped, they COMPOSE: UnifiedCF runs the Bayesian-confidence core (EMCF: predictive-variance UCB, ARD-capable) with the de-confliction offset and the exploration gate, where each extension SELF-GATES on an observable so it switches itself off when its condition is absent. The offset scales with the agent's own observed loss rate (0 until it actually loses contests); exploration anneals when the agent is losing AND when the offer is abundant ($|S| > 4m$, nothing to explore-for-coverage). With one config and NO per-regime tuning, on the same 8 seeds UnifiedCF is best-or-statistically-tied in EVERY regime-metric: standard anytime $0.437$ (vs EMCF $0.433$); churn active $0.851$ / recent $0.347$ (vs $0.842$ / $0.371$); contention earned reward at severe $|S|{=}15$ $0.104$ ($\approx 2\times$ greedy $0.059$, ties the contention specialist) and at no-contention $|S|{=}240$ $0.425$ (ties greedy $0.439$ / adaptive-offset $0.448$, overlapping CIs). The abundance gate fires ONLY at $|S|{>}4m$, so the small-offer regimes are untouched. So the design space does not just reduce to a core plus a menu, it collapses to a single self-tuning policy that a practitioner can deploy without knowing the regime in advance.

Sensing-grounded observability (robotics check). To confirm the masking/noise are not a convenient abstraction, we place the $m$ robots and $n$ tasks in a 2-D arena and DERIVE observability from sensing geometry: a robot perceives a teammate's engagement only if the task is within its sensing radius $R$, with read-off noise growing with distance ($\sigma(d)=\sigma_0(1+d/R)$). This yields a persistent, structured per-robot visibility (the persistent-masking regime) emergent from physics. Sweeping $R$ (effective coverage $0.11\to0.86$): the categorical unseen-pair win SURVIVES once coverage exceeds $\approx0.3$, with RewardCF unseen skill $0.15\to0.30$ versus the structure-free floor ($\approx0$) at EVERY radius, and degrades gracefully only at very sparse sensing (10\% coverage, where there is too little data to recover the shared structure). So the generalization result holds under geometry-limited, distance-noisy sensing, the operating regime of a real decentralized swarm, not merely under an injected mask.

Why a swarm: collaboration value and positive scaling (Fig. 5). Two experiments isolate what the swarm and the broadcast actually buy. (a) Collaboration value. We sweep the broadcast rate from $\rho{=}0$ (ISOLATED: each robot sees ONLY its own outcomes) to $\rho{=}1$ (full passive sensing). A lone robot cannot recover the shared low-rank structure from its single matrix row, so isolated unseen skill is $\approx0$; the broadcast UNLOCKS generalization, worth $+0.39$ unseen skill to our online RewardCF but only $-0.00$ / $+0.01$ to the structure-free UCBIndep / Tabular, which have no model linking tasks and so cannot use the broadcast at all. (The batch-refit baseline PTF gains the most from a broadcast, $+0.51$, because a one-shot refit on a near-fully-observed matrix is the best case for batch low-rank completion; it accordingly overtakes us at full broadcast $\rho{=}1$, whereas our online method leads in the masked regime $\rho\le0.5$ that defines the operational problem. This crossover near $\rho\approx0.6$ reproduces across the bake-off and crossover sweeps with near-identical numbers and is exactly what matrix-completion theory predicts: a one-shot SVD is the near-optimal estimator once the matrix is densely observed.) This is the operational answer to 'why share?': communication-free sensing turns $m$ isolated, near-useless learners into one effective swarm. (b) Positive scaling with team size. Holding $n$, horizon $T$, and $\rho$ fixed and growing the swarm $m{=}5\to80$, RewardCF unseen skill RISES monotonically $0.08\to0.43$ (non-overlapping CIs): more robots feed $\sim mT\rho$ broadcast observations into the SHARED structure, so each robot's skill on tasks it never serviced grows with the team (the collective-recovery effect). The structure-free learner is FLAT ($m$ cannot help an agent that only learns its own row). Positive scaling with team size, the swarm getting SMARTER as it grows, is unique to structure-sharing and the opposite of the usual interference penalty as agents are added.

F18 — **Fig. 5.** (a) Collaboration value: unseen-pair skill vs broadcast rate (left edge $\rho{=}0$ is isolated, self-only); the comms-free broadcast unlocks generalization for low-rank CF (RewardCF, PTF) and is worthless to structure-free (UCBIndep, Tabular). (b) Positive scaling: unseen skill RISES with swarm size $m$ for CF (the swarm gets smarter as it grows), flat for structure-free.

Operational mission (application). The abstract result is an operational target-servicing / dispatch mission: the latent factors are robot CAPABILITY traits $p_i$ and task REQUIREMENT traits $u_j$ (sensor modalities, payload, effector type, endurance), low-rank because few traits govern fit; effectiveness $\langle p_i,u_j\rangle$ is the match. Over the mission each robot repeatedly services an offered target and earns that match (our standard reward), under range-limited, distance-noisy passive sensing. On the SAME metric (servicing skill), against the FULL field (low-rank PTF/BPMF/SoftImpute/MFSGD and the explore-then-commit ESTR, AND structure-free) under limited observability ($\rho{=}0.25$): ours $0.36$ vs the best of the entire field $0.29$ (SoftImpute), non-overlapping CIs, structure-free at the random-dispatch floor; at full broadcast the structured field is competitive and the separation opens under masking. The swarm dispatches the right asset to targets it never personally serviced because it recovers the shared trait structure from the broadcast. This generalizes trait-based MRTA (capability-vs-requirement matching) to UNKNOWN traits learned online, decentralized, communication-free. Applications: limited-resource servicing under degraded comms, search-and-service, coverage-with-capability, SEAD-style sensor-target matching.

Operational metrics (mission language, Fig. 6). Restating the same runs in operational terms makes the swarm value concrete. Time-to-competence: under limited observability our online CF reaches a quarter of oracle dispatch in about 35 rounds (every seed), while the structure-free and batch-refit competitors mostly never reach it within the mission. Cumulative lost value: summed over the run, our methods accumulate the least regret (oracle minus earned) at every broadcast rate. New-asset readiness: a freshly added drone becomes effective on targets it never personally serviced after only a handful of engagements (on the order of the latent rank), independent of the total target count, whereas a structure-free newcomer never becomes ready. Resilience to attrition: under continuous turnover, confidence-directed CF retains the highest skill on both the standing set and (especially) the newest arrivals, while structure-free collapses on newcomers. Two further metrics need careful reading rather than a raw number. Redundancy: raw collision rate is minimized trivially by random dispatch (spreading avoids collisions but earns nothing), so we read it as an efficiency frontier instead: among the reward-seeking de-confliction methods at severe contention, our private-offset variants are the top earners and the lowest-collision method overall, dominating the CBBA-backoff, MAB-re-seating, and batch baselines on the earned-vs-collision frontier (a genuine coordination win once collisions are read against earned value). Information efficiency splits in two: for OFFLINE estimation (skill per observed entry) all low-rank methods turn a tiny budget into real skill while structure-free turns it into none, with the batch PTF marginally ahead; for ONLINE earning (reward per round while learning) we win, because the batch methods pay an explore/probe phase, so our cumulative lost value is the lowest.

F19 — **Fig. 6.** Operational metrics (re-analysis of existing runs). (a) Time-to-competence: rounds to reach 25% of oracle dispatch under partial broadcast; ours reach it fastest, most competitors never do. (b) Cumulative lost mission value over the run (lower is better). (c) New-asset readiness: a new drone's skill on never-tried targets vs its own engagements; a structure-free newcomer never becomes ready. (d) Resilience to attrition under continuous turnover. All methods run decentralized and communication-free in-harness; they differ in the update rule (online vs batch-refit vs explore-then-commit).

Performance scorecard. One view of the design-space sweep on a single canonical masked harness: our online CF leads the masked-observation column and both operational columns (lowest cumulative regret, fastest to competence); our batch hybrid PTF wins only the full-broadcast column (its one-shot refit is best when the matrix is densely observed); the standard low-rank estimators trail; and the structure-free learners sit at the floor on every column.

Method	Provenance	unseen skill (ρ=0.25, masked)	unseen skill (ρ=1, full)	cumulative regret (ρ=0.25, lower=better)	rounds to 25% of oracle	profile
SwarmCF (RewardCF)	ours	0.336	0.376	41.2	35	`D\|0\|rho,sig\|dhat\|online`
SwarmCF-RC (BothCF)	ours	0.349	0.372	41.2	36	`D\|0\|rho,sig\|dhat\|online`
SwarmCF-batch (PTF)	ours (hybrid)	0.280	0.516	46.0	never	`D\|0\|rho,sig\|dhat\|batch`
MFSGD	standard, adapted	-0.019	0.042	46.5	never	`D\|0\|rho,sig\|dhat\|online`
BPMF	standard, adapted	0.126	0.233	49.9	never	`D\|0\|rho,sig\|dhat\|batch`
ESTR	standard, adapted	0.058	0.232	46.4	never	`D\|0\|rho\|dhat\|ETC`
UCBHomo	structure-free baseline	0.070	0.167	50.3	never	`D\|0\|rho,sig\|none\|online`
UCBIndep	structure-free baseline	0.003	0.004	50.4	never	`D\|0\|rho,sig\|none\|online`
Tabular	structure-free baseline	0.003	-0.001	43.0	46	`D\|0\|rho,sig\|none\|online`
Random	structure-free baseline	0.004	0.007	50.0	never	`D\|0\|-\|none\|-`

One canonical masked harness (m=30, n=240, 8 seeds): unseen skill from the bake-off, regret and time-to-competence from the anytime trajectories. Our online CF leads the masked column and the operational columns; the batch-refit PTF (also ours) wins only the full-broadcast column; structure-free sits at the floor. The confidence/contention/unified refinements (EMCF, ContentionAdaCF, UnifiedCF) and the SoftImpute / KNNCF baselines are evaluated in their dedicated experiments.

Scope: when does CF beat structure-free? The advantage is not universal. Across a structure-by-observability grid it holds precisely when THREE conditions co-occur. (1) Low-rank but PERSONALIZED (1 < d << min(m,n)): the rank-1 part of the reward is shared 'popularity' that a pooling/bias baseline already captures, so at d=1 there is nothing personal to transfer (the additive-ceiling result); at the other extreme, d near full means no structure to exploit (graceful collapse, Fig. on stress). Personalization is the interaction BEYOND popularity that only a rank>1 model can represent. (2) SAMPLE-STARVED (n >> cT): if instead sample-rich, a tabular learner eventually measures every entry directly and the unseen advantage vanishes. (3) SHARED reward channel (rho > 0): without a broadcast each agent has only its own outcomes and CF degenerates to tabular. This scopes the contribution and explains why structure-free methods are competitive in sample-rich, low-personalization, or unshared settings, our earlier null results.

6. Related work

The table below locates the major MRTA and decentralized-learning paradigms on the four axes that define our problem (prior knowledge, communication, distribution, observability). Each established family relaxes at least one axis we hold fixed; our cell, low-rank with only a GUESSED rank, no communication, decentralized, masked and noisy, is the one left open.

Paradigm	Prior knowledge	Communication	Distribution	Observability	Note
Auction / CBBA (consensus bundle)	known task values/costs	message-passing (bids)	decentralized	full task info	needs communication AND known utilities
DCOP / consensus MRTA	known constraints/utilities	message-passing	decentralized	full	needs communication AND a known objective
Cooperative MARL (CTDE: MAPPO/QMIX/VDN)	none (learned)	centralized training	centralized-train / decentralized-exec	full (in training)	central critic; not comms-free
Learned-communication MARL (CommNet/TarMAC)	none (learned)	learned messages	decentralized	partial + messages	broadcast is learned message-passing, not passive
No-comms multiplayer bandits (SIC-MMAB, musical chairs)	none (per-arm)	none	decentralized	own pulls + collisions	comms-free but STRUCTURE-FREE (no unseen generalization)
Matrix completion (nuclear norm, spectral)	low-rank	centralized (one matrix)	centralized	partial (uniform)	centralized estimation, not online/decision
Low-rank / bilinear bandits (ESTR, etc.)	low-rank	centralized	centralized	partial	centralized and/or explore-then-commit
Federated / gossip CF	low-rank	broadcast of factors/gradients	decentralized	partial	shares PARAMETERS, not a passive stream
Multi-user RL, low-rank rewards (Nagaraj-Agarwal)	low-rank	centralized aggregation	centralized	partial	closest prior; centralizes trajectory aggregation
Trait-based MRTA (Prorok et al.)	KNOWN traits	varies	decentralized	full	capability/requirement traits are GIVEN, not learned
OURS (broadcast CF for ZK-MRTA)	low-rank, GUESSED rank only	none (passive sensing)	decentralized	masked + noisy	the hardest cell: no comms, no known utilities/traits, guessed rank

Matrix completion (Candes-Recht; Keshavan-Montanari-Oh; Recht) gives centralized estimation bounds; we make them decentralized, online, and broadcast-only, with the unseen-pair floor making the gap categorical and an anytime (decision-reward) separation the completion theory does not address. Low-rank / bilinear bandits (explore-then-commit spectral ESTR; probe-then-fit; rank-1 and generalized low-rank matrix bandits) are centralized and/or phase-structured; we are anytime and distributed. Bayesian PMF over-explores in the anytime regime. Federated/gossip CF shares factors; we share nothing but a passively-sensed public stream. The CLOSEST neighbor is multi-user RL with low-rank rewards (Nagaraj/Agarwal 2022), which shares the low-rank-across-agents structure but CENTRALIZES trajectory aggregation, whereas we keep parameters private. CF-bandits / clustering-of-bandits (CLUB, COFIBA, BanditMF) cluster users centrally and optimize regret on PULLED arms, not unseen-pair generalization. For contention our communication-free symmetry-breaking parallels no-communication multiplayer bandits but acts on a LEARNED low-rank preference; we benchmark the two standard no-comms de-confliction primitives DIRECTLY, a CBBA-style auction with public-loss backoff and SIC-MMAB / musical-chairs randomized re-seating (both given our SAME CF utility, so the comparison isolates the primitive), and the fixed private offset beats both at severe contention with non-overlapping CIs ($0.10$-$0.105$ vs $0.064$ vs $0.028$). MRTA (auction / consensus CBBA / DCOP) and cooperative MARL (CTDE: MAPPO/QMIX/VDN) assume communication, known utilities, or centralized training, so we compare against their admissible comms-free reductions (the CBBA-lite and multiplayer-MAB baselines above).

Honest positioning. Every individual ingredient (low-rank bandits, online MF, observe-neighbors bandits, no-comms multiplayer matching, Dawid-Skene reliability) exists in prior work; the contribution is COMPOSITIONAL, this exact combination (decentralized + private parameters + broadcast-only + online + sample-starved, with an unseen-pair categorical separation) is the open niche, and the genuinely new mechanisms are the masking-robust zero-weight (not zero-value) estimator, the anytime separation under starvation, the comms-free de-confliction on a learned preference, and held-out choice-informativeness. We do not claim a new estimator.

MARL view. In multi-agent-RL terms this is a COOPERATIVE, decentralized, communication-free multi-agent BANDIT with low-rank structure (a multiplayer matrix/low-rank bandit), not a sequential Markov game (reward is stateless). This locates each MARL family exactly: (i) independent learners (IQL / independent-UCB) are our UCBIndep, with no shared structure, hence the categorical unseen floor (the tabular-floor result); (ii) CTDE (MAPPO/QMIX/VDN) needs centralized training or parameter sharing, and learned-communication methods (CommNet/TarMAC/DIAL) need a message channel, both inadmissible here, our broadcast is PASSIVE sensing, not transmission; (iii) the admissible no-comms multiplayer-bandit matchers (SIC-MMAB / musical chairs) are exactly the de-confliction baselines we beat. Read positively, broadcast-CF is DECENTRALIZED MODEL-BASED MARL whose shared world model is the low-rank reward matrix, estimated privately by each agent; the active-exploration and de-confliction mechanisms are emergent coordination WITHOUT communication; and the choice channel is decentralized teammate modeling. The differentiator over standard MARL is generalization to UNSEEN tasks, which independent learners provably cannot do and structured single-agent bandits do not address in the no-comms multi-agent regime.

7. Limitations and conclusion

Contention. Our formal results assume non-contention (a target may be engaged by more than one drone). Hard contention turns allocation into matching, where explicit coordination may add value beyond preference quality. We test this directly with a capacity-1 matching variant (each engaged target is awarded to one drone; collisions earn nothing; earned reward is normalized by the per-round Hungarian optimum). Two things hold: the CATEGORICAL preference quality (unseen-pair skill, evaluated contention-free) is contention-invariant because it is a property of the learned model; and on EARNED reward our methods still lead, because diverse, accurate preferences spread drones across targets and reduce collisions, though the operational gap narrows as collisions, not preferences, become the bottleneck. And the operational loss is recoverable WITHOUT communication: keeping the CF estimate but swapping greedy argmax for a FIXED PRIVATE per-target offset de-conflicts look-alike drones and roughly DOUBLES earned reward at severe contention; an adaptive, scarcity-gated version of that offset extends the win across the contested range and reduces to plain CF when targets are plentiful.

Shared offer pool of size `pool` (smaller = more contention; m=30 drones, n=240). Each target is awarded to one random contender; losers earn 0. anytime = earned reward normalized by the per-round MATCHING optimum (Hungarian), 8 seeds, bootstrap 95% CI. unseen = preference quality evaluated contention-free AFTER training under contention (the categorical claim). rho=1.00 (full broadcast, to isolate contention from masking).

Operational: earned-reward skill (matching-normalized)

method	pool=240	pool=60	pool=30	pool=15
ContentionAdaCF	0.448 [0.422, 0.472]	0.205 [0.186, 0.223]	0.153 [0.132, 0.173]	0.100 [0.086, 0.113]
ContentionCF	0.314 [0.290, 0.337]	0.178 [0.160, 0.200]	0.134 [0.115, 0.155]	0.105 [0.096, 0.114]
CBBAlite	0.464 [0.434, 0.487]	0.216 [0.204, 0.226]	0.127 [0.112, 0.142]	0.064 [0.047, 0.081]
MusicalChairs	0.310 [0.279, 0.344]	0.124 [0.108, 0.141]	0.077 [0.053, 0.103]	0.028 [0.019, 0.040]
ActiveCFconv	0.439 [0.412, 0.464]	0.230 [0.206, 0.252]	0.125 [0.109, 0.145]	0.046 [0.022, 0.071]
RewardCFconv	0.439 [0.418, 0.463]	0.199 [0.186, 0.214]	0.121 [0.102, 0.138]	0.059 [0.044, 0.079]
PTF	0.243 [0.224, 0.260]	0.151 [0.136, 0.168]	0.096 [0.083, 0.110]	0.057 [0.040, 0.075]
UCBIndep	0.005 [-0.004, 0.014]	-0.002 [-0.009, 0.005]	-0.002 [-0.010, 0.006]	0.004 [-0.004, 0.013]
Random	-0.010 [-0.023, -0.000]	0.002 [-0.010, 0.014]	-0.003 [-0.010, 0.005]	-0.010 [-0.026, 0.007]

Categorical: unseen-pair skill (learned under contention, eval contention-free)

method	pool=240	pool=60	pool=30	pool=15
ContentionAdaCF	0.320 [0.282, 0.356]	0.302 [0.283, 0.323]	0.352 [0.325, 0.382]	0.295 [0.260, 0.326]
ContentionCF	0.023 [0.002, 0.048]	0.290 [0.268, 0.313]	0.312 [0.270, 0.345]	0.273 [0.233, 0.311]
CBBAlite	0.339 [0.311, 0.365]	0.385 [0.351, 0.421]	0.355 [0.334, 0.379]	0.288 [0.256, 0.322]
MusicalChairs	0.120 [0.090, 0.150]	0.257 [0.215, 0.304]	0.298 [0.264, 0.326]	0.275 [0.251, 0.297]
ActiveCFconv	0.113 [0.079, 0.145]	0.350 [0.319, 0.383]	0.310 [0.287, 0.330]	0.195 [0.165, 0.221]
RewardCFconv	0.323 [0.291, 0.361]	0.383 [0.351, 0.414]	0.339 [0.314, 0.367]	0.287 [0.242, 0.324]
PTF	0.343 [0.333, 0.356]	0.338 [0.302, 0.375]	0.262 [0.232, 0.293]	0.141 [0.114, 0.169]
UCBIndep	-0.007 [-0.014, -0.000]	0.001 [-0.006, 0.005]	-0.003 [-0.009, 0.002]	-0.006 [-0.015, 0.003]
Random	-0.007 [-0.026, 0.014]	0.014 [-0.002, 0.031]	-0.009 [-0.027, 0.009]	-0.005 [-0.028, 0.019]

Collision rate (fraction of engagements lost to contention)

method	pool=240	pool=60	pool=30	pool=15
ContentionAdaCF	0.126	0.392	0.465	0.581
ContentionCF	0.245	0.466	0.535	0.625
CBBAlite	0.091	0.356	0.507	0.658
MusicalChairs	0.391	0.555	0.625	0.697
ActiveCFconv	0.207	0.376	0.503	0.685
RewardCFconv	0.147	0.371	0.526	0.660
PTF	0.187	0.323	0.448	0.624
UCBIndep	0.054	0.209	0.363	0.561
Random	0.061	0.211	0.369	0.563

Takeaway: the CATEGORICAL preference quality (unseen) is contention-invariant (CF stays high, structure-free at the floor) because it is a property of the learned model; the OPERATIONAL earned-reward gap narrows as collisions, not preferences, become the bottleneck, yet CF still leads by spreading drones across targets via diverse, accurate preferences (lower collision rate).

De-confliction primitive (a recognized MRTA baseline). CBBAlite is the canonical consensus-based auction (CBBA) with the comms/consensus step REMOVED: each drone bids its OWN CF-predicted utility on the public pool (identical model class to RewardCF) and de-conflicts by a reactive, public-loss BACKOFF. It isolates the de-confliction primitive against our fixed PRIVATE per-drone offset (Theorem 7). At severe contention (pool=15): ContentionAdaCF 0.100 vs CBBAlite 0.064 vs greedy RewardCFconv 0.059 earned. Reading: our proactive private-offset de-confliction BEATS the reactive auction-with-backoff -- a proactive STATIC private offset spreads drones once and for all, whereas a reactive SHARED backoff makes all colliders flee the same target together, re-synchronizing them.

Non-stationarity (churn). When targets continuously turn over (departures plus fresh arrivals), PLAIN exploitative CF does not suffice: fold-in's order-$d$ probe latency is outpaced by churn on the freshest arrivals, where exploitative CF even trails a structure-free optimist ($0.074$ vs $0.132$). The fix is the SAME confidence-directed exploration from the refinements menu: a CF estimator that PROBES uncertain (fresh) targets and folds them in (ActiveCF count-bonus, or EMCF predictive-variance UCB) DOMINATES under churn on both the active set and the freshest arrivals (EMCF active $0.842$ vs $0.619$ structure-free; fresh arrivals $0.371$ vs $0.132$, non-overlapping CIs). The swarm stays adapted under continuous change, provided it unites low-rank fold-in with directed newcomer-probing, neither alone suffices.

Under minimal assumptions, collaborative filtering over the public broadcast lets a swarm act on the unseen, onboard new tasks and agents, and earn from the first round, with a categorical, theory-backed advantage over structure-free learning and dominance over prior-art low-rank methods in the limited-observability regime.

Reproducibility: complete per-seed data in results/pilots/; figures via experiments/make_figures.py; proofs in docs/THEORY_FORMAL.md; ZK audit in docs/ZK_COMPLIANCE.md; full v1 draft in docs/PAPER_DRAFT.md.