Streamlined draft (v2). Companion tutorial: tutorial.html. Repo: ApartsinProjects/ZKDroneSwarm. All numbers regenerable from saved per-seed data.
A swarm repeatedly chooses which target to engage. Compatibility between agents and targets is hidden but low-rank; there are far more targets than rounds (n >> T); agents neither coordinate nor share parameters and each senses only a noisy, partial slice of the public outcomes. The crux is GENERALIZATION: a structure-free learner that estimates a target only from its own engagements is, on any target it never touched, at the prior-mean error floor, which dominates when targets outnumber rounds.
Contributions. (1) A minimal-assumption, decentralized, online, broadcast-only formulation of CF for MRTA. (2) The categorical unseen-pair and onboarding/cold-start results (Theta(d) vs Theta(n)); the cold-start holds even for a strictly zero-knowledge newcomer that recovers the target basis from its OWN masked broadcast rather than receiving any peer's model, with the separation degrading gracefully as its visibility falls. (3) An anytime separation under sample starvation. (4) A masking-model dichotomy (durable vs transient decentralization). (5) A method family that is masking-robust and anytime-optimal, dominating the prior-art low-rank competitors; and collective active exploration via the shared broadcast.
World. $m$ drones, $n$ targets, with $K_1,K_2$ latent types; the reward is the cosine of unit latent factors $p_i,u_j\in\mathbb{R}^d$,
$R_{ij} = \langle p_i, u_j\rangle \in [-1,1], \qquad R = PU^\top, \qquad \operatorname{rank}(R)=\min(d,K_1,K_2)$ (no nonlinear link).
Robotics grounding. Concretely, $p_i$ is robot $i$'s (hidden) CAPABILITY profile along $d$ latent traits (payload, sensor suite, speed, endurance, software skills) and $u_j$ is task $j$'s REQUIREMENT profile along the same traits, so $R_{ij}$ is the capability-requirement match (how well robot $i$ performs task $j$, e.g. inspection / coverage / pickup quality). Neither profile is known a priori, robots learn them only from outcomes. Crucially, the observation channel is LIMITED-RANGE SENSING, not radio: a robot perceives a teammate's engagement (which task, with what outcome) only when the engagement falls inside its sensing footprint, and more noisily the farther away it is. This is what makes the masking $\rho$ and noise $\sigma_{\mathrm{obs}}$ PHYSICAL rather than free parameters; we validate this directly by deriving observability from 2-D sensing geometry (range + distance-noise) and showing the categorical result survives (Section on sensing-grounded observability), the regime a real decentralized swarm operates in.
Observability. A public stream of (action, outcome); each agent senses its own outcome cleanly (noise $\sigma_{\mathrm{own}}$) and a per-agent masked (rate $\rho$), noisy (noise $\sigma_{\mathrm{obs}}$) slice of teammates' outcomes (passive sensing, not transmission, so communication-free). Sensing is independent per observer: agent $i$ applies its own persistent mask $M_{ik}$ and draws its own noise on each engagement, recording $\tilde R^{(i)}_{k}=R_{k,a_k}+\eta_{ik}^t$ with $\eta_{ik}^t\sim\mathcal{N}(0,\sigma_{\mathrm{obs}}^2)$ fresh per observer and round, so $\eta_{ik}^t\perp\eta_{i'k}^t$ for $i\neq i'$. Different agents thus read the SAME action differently and no single value is ever transmitted, which is what keeps their internal states distinct (state-uniqueness) and the channel communication-free. The observation precision is $\beta=1/\sigma^2$, with masked events absent (weight $0$). Baselines. Random, UCBIndep, UCBHomo, Tabular (no-structure); MFSGD, ESTR, PTF, BPMF (low-rank). Fairness/ZK. Every method gets a guessed rank (or none), an independent per-agent instance with no parameter sharing, and the same noisy partial stream; the Oracle (centralized, complete information) only normalizes scores. Metrics. normalized skill $=$ (method $-$ random)/(oracle $-$ random) (the standard skill score of forecast verification and the random-normalized score of RL benchmarking; an affine rescaling of regret with random$=0$, oracle$=1$), unseen-pair skill, anytime cumulative-reward skill, state-uniqueness.
Each agent runs its own ONLINE weighted alternating least squares over the events it senses. An unobserved/masked event gets zero WEIGHT, not an imputed zero VALUE; this is what makes the estimator masking-robust (batch fill-in methods impute zeros for missing entries and decay under masking). Among OBSERVED events, inverse-variance (1/sigma^2) weighting is an optional noise-adaptive refinement that pays off when the broadcast is noisy; at low noise uniform weighting generalizes slightly better to unseen pairs (we ablate this honestly). Estimation is separated from the decision policy. The family: RewardCF (teammates' rewards; anytime-optimal), ChoiceZK (teammates' actions only; noise-immune fallback), HybridCFconv (probe, SVD warm-start, then converged online ALS; best final-policy unseen), and ActiveCFconv (latent-space UCB with a broadcast count-based uncertainty bonus, so one agent's probe lowers everyone's uncertainty; best balanced).
One core, scoped extensions (read this if nothing else). The hero is a SINGLE estimator, online noise-weighted ridge ALS over the public stream, with an $\epsilon$-greedy policy. Everything else, Bayesian confidence (EMCF), directed exploration, rank self-determination (ARD), contention de-confliction, and the choice channel, is a SCOPED EXTENSION of that one estimator: each is applied only when a stated condition holds and otherwise reduces to the core. We present them as a single theorem-backed menu (Section 5) rather than a method zoo, so the whole design space collapses to one method plus a short list of conditional add-ons.
(Full proofs in THEORY_FORMAL.md; step-by-step in the tutorial.)
Tabular floor. A structure-free learner has $\Omega(1)$ error and $0$ skill on any unseen pair, and the broadcast is provably useless to it; pooling recovers only the rank-1 popularity term. CF row completion. Given $U$, an agent's whole row is recovered exactly by an $O(d)$ least-squares fold-in; per-agent $\Theta(d)$ vs tabular $\Theta(n)$, categorical on unseen pairs. Anytime separation. Under $n \gg T$ any structure-free learner's anytime skill $\le g(cT/n) \to 0$ (even with full broadcast); CF reaches near-oracle in $O(d)$ rounds. Masking model. i.i.d. loss makes decentralization transient; persistent masking makes it durable; the categorical results are invariant to the choice. Additive ceiling. An additive predictor $a+b_i+c_j$ has rank $\le 2$ and reduces to the popularity order, so it cannot personalize and is strictly below CF for $d > 1$.
Recent results and honest scope. Precise scope (iff): the CATEGORICAL unseen gap vs a structure-free learner needs only a shared channel ($\rho>0$) and some structure; $d>1$ is what additionally beats a POPULARITY baseline (the additive ceiling); sample-starvation is what makes the unseen advantage OPERATIONALLY dominant (the anytime separation), three conditions with three distinct roles, not one AND. Confidence: precision fit-weighting is suboptimal for unseen prediction (it over-weights the drone's own, here irrelevant, data); the Bayesian posterior is the right object, and a sharper analysis shows bounded ratio-capped precision dominates uniform exactly when teammate sources DIFFER in reliability, while unbounded $1/\sigma^2$ starves coverage. Exploration: EMCF's predictive-variance UCB is LinUCB on the recovered latent features, giving $\tilde O(d\sqrt T)$ per-drone regret given $U$. Contention = fixed private offset; rank = ARD recovers the identifiable rank, removing the hyperparameter; choice channel = held-out informativeness is identifiable where in-sample is not. The keystone, now bounded: the per-drone recovery of $U$ from the persistent-masked broadcast, formerly invoked from centralized completion theory, is now pinned to an explicit per-drone condition. Drone $i$ recovers a target's factor $u_j$ (and so predicts the unseen pair) EXACTLY iff its own factor $p_i$ lies in the span of the factors of the visible teammates that engaged $j$ (an anchor block fixing the frame plus per-target spanning coverage); it is provably at the prior floor otherwise; under noise the error is bounded by the fold-in bound with the coverage degree controlling the noise term; and under non-adaptive exploration the condition self-achieves in $O((nd/\rho m)\log n)$ rounds, faster as the swarm grows. An oracle-reconstruction test on the actual coverage patterns confirms the threshold exactly (recovered pairs: error $0.000$; below the span condition: the prior floor). The one remaining open piece is the finite-time coverage rate under an adaptive exploiting policy (full statements, proofs, and the validation in THEORY_FORMAL.md and P15_VALIDATION.md).
Foundational results. Fold-in error bound (exact): the cold-start / onboarding prediction error splits cleanly into three sources, the swarm's basis-recovery error $\varepsilon$, own-probe noise $\sigma\sqrt{d/k}$, and ridge bias, and is EXACT once $k\ge d$; this is the first self-contained bound for our fold-in and it QUANTITATIVELY explains the sensing experiment (sparser sensing $\Rightarrow$ larger $\varepsilon \Rightarrow$ lower cold-start skill). Collective broadcast speedup: an isolated agent CANNOT recover a rank-$d{>}1$ column space (one matrix row is unconstrained, so it stays at the floor), whereas the broadcast lets the swarm cross the completion threshold in $\tilde O(d(1{+}n/m))$ rounds, a $\Theta(m)$-fold COLLECTIVE speedup with no communication; this is the theoretical content behind the $\rho>0$ condition. Minimax tightness: $\Omega(d)$ own probes are necessary even given the exact basis (with $k How to read the comparison. The ZK-MRTA setting itself is new, so this is not 'our method vs external rivals' but a CONTROLLED SWEEP across the low-rank design space we instantiate in the setting, measured against the genuinely external structure-free paradigm and full-information reference ceilings. Provenance is explicit: our invented methods form ONE family, SwarmCF (the core RewardCF plus Bayesian, choice, de-confliction, unified and batch variants, shown as 'SwarmCF-variant'), and PTF (SwarmCF-batch) is ours too, a strong batch-flavored hybrid we built from standard parts; MFSGD, ESTR, BPMF, SoftImpute, KNNCF and CLUB are named prior-work estimators we ADAPT to the setting (the algorithms are theirs, the decentralized broadcast-only application is part of our framework); the per-arm UCB and tabular learners are the structure-free paradigm we beat; and three full-information upper bounds (NOT competitors) bracket the top: the Oracle (knows the true reward), CentralClean (a centralized low-rank matcher that observes everything with NO noise and NO masking and assigns optimally, the strongest a centralized low-rank system can do without the true factors), and the noisy comms-full CTDE matcher. Harness fact: the evaluation builds one estimator PER drone, each seeing only the broadcast it observes and acting on its own row, so every method in the sweep is decentralized and communication-free in-harness; the low-rank methods differ in the UPDATE RULE (online vs batch-refit vs explore-then-commit) and the refinements, not the information regime (ESTR is spectral/centralized only in its origin and is run here as a per-drone explore-then-commit reduction). Our online CF sits in the hardest cell and any method that beats it in some regime relaxes one axis (a batch refit at full broadcast, full information for the ceilings). Notation [dist | comm | obs | prior | compute]: dist D=decentralized, C=centralized; comm 0=none (passive), B=broadcast params, full=message-passing; obs full / ρ=masked / ρσ=masked+noisy / self; prior –=none, d̂=guessed rank (itself removable: the ARD variant learns it), ARD=rank self-determined, d=true rank, U*=oracle factors; compute online / batch / ETC=explore-then-commit / mem. In-harness every method is decentralized and communication-free (one estimator per drone on the passive broadcast); the low-rank methods differ in the update rule and refinements, not the information regime. Our flagship sits in the hardest cell [D | 0 | ρσ | d̂ | online]. Our invented methods are ONE family, SwarmCF (shown as 'SwarmCF-variant (code name)'), differing only in the menu axes of the mechanism table; everything else is a standard estimator we adapt, the structure-free paradigm, or a reference ceiling. Categorical (acting on the unseen). CF acts well on never-observed pairs at every observation density while structure-free learners sit at the floor (Fig. 1); new targets onboard for the whole swarm from ~d shared probes and new agents from ~d own probes (vs ~n for tabular). Operational (anytime). On reward actually earned over time, our online methods dominate at every horizon; per-arm bandits are stuck near random because, with n >> T, they never stop exploring untried arms (Fig. 2). Robustness. Results are invariant to the masking model (i.i.d. vs persistent), while decentralization is durable only under persistent masking (Fig. 3); and they hold across rank, horizon, target/agent counts, guessed rank, cluster structure, and latent spread. Out-of-harness transfer (independent simulator). Beyond our generative code, we drop the method as a drop-in policy into ZK-MRTA-Sim, an independently-built, spatially-explicit simulator of the same Zero-Knowledge / Zero-prior / Zero-communication MRTA setting (PettingZoo package Versus the field. The categorical unseen win is shared by all low-rank methods over no-structure ones (a property of structure). Our specific edge is masking-robustness and anytime optimality; with a converged estimator our methods dominate the strongest competitor (PTF) on every metric and density (Fig. 4). We further compare against a broader, fair set, run under the same limits: SoftImpute (nuclear-norm convex completion), kNN-CF (memory-based, model-free), and BiasModel (additive, no interaction). SoftImpute is the strongest at full broadcast (it tops unseen at rho=1) but, like all batch fill-in methods, decays under masking and loses on anytime; kNN-CF generalizes but earns little; BiasModel sits at the additive ceiling. Under masking and on anytime, our methods beat the best of these with non-overlapping CIs. Headline numbers (20 seeds, bootstrap 95%% CI). The unseen-pair and anytime skill of the no-structure floor (UCBIndep), the strongest competitor (PTF), and our methods, at full broadcast (rho=1) and heavy masking (rho=0.25): skill = (method - random)/(oracle - random); 0 = random, 1 = oracle. rho = fraction of broadcast observed. Ours in bold. We foreground ONE recommended method (ActiveCFconv) and present the rest as ablations rather than a method zoo. The table below isolates each design choice (online vs batch, precision weighting, SVD warm-start, active vs eps-greedy exploration) and the robustness to a guessed rank, under one identical config (12 seeds, 95%% CI): Recommended method = ActiveCFconv (online weighted-ALS + latent-UCB active exploration; best balanced: top anytime, strong unseen). Each row below toggles ONE design choice; skill: 0 = random, 1 = oracle. Takeaways: (i) online weighted-ALS dominates batch / explore-then-commit (PTF, ESTR) under masking and on anytime (anytime rho=1.0: RewardCFconv 0.39 vs PTF 0.28; unseen rho=0.25: 0.37 vs 0.28). (ii) HONEST FINDING: precision weighting 1/sigma^2 does NOT help at the default broadcast noise (sigma_obs=0.3) -- UNIFORM weighting (noprec) is better on both unseen (0.584 vs 0.450 at rho=1.0) and anytime, because it uses the abundant broadcast fully (the broadcast carries the cross-target structure that unseen generalization needs). Precision weighting only pays off once the broadcast is high-noise; see PRECISION_SWEEP.md for the crossover. (iii) the SVD warm-start (HybridCFconv) lifts unseen, most at dense rho, at an anytime cost. (iv) active (latent-UCB) exploration improves anytime and dense-rho unseen over eps-greedy. (v) robust across guessed rank for d_hat >= true d (5..20 all ~0.35 unseen at rho=0.25), degrading only when d_hat=2 < true d=5. To avoid a method zoo, every refinement beyond the core estimator is presented as a single MENU of scoped, theorem-backed extensions. Each helps ONLY under a stated condition and reduces to the core otherwise, so the practitioner adds an extension exactly when its condition holds. This is the entire design space in one view. What does NOT help (honest, diagnosed nulls): UNBOUNDED precision-ALS at homogeneous noise (coverage, not noise, binds); a LEARNED per-teammate choice gate vs a simple competence schedule when teammates are homogeneous (it ties, and in-sample it over-trusts random teammates, the bug held-out $\gamma$ fixes); confidence-GATED reward$+$choice fusion. Each is a documented null with a stated cause, several surfaced by deliberate SANITY experiments whose answer is obvious (oracle-vs-random teammates; homogeneous-vs-heterogeneous noise). Sensing-grounded observability (robotics check). To confirm the masking/noise are not a convenient abstraction, we place the $m$ robots and $n$ tasks in a 2-D arena and DERIVE observability from sensing geometry: a robot perceives a teammate's engagement only if the task is within its sensing radius $R$, with read-off noise growing with distance ($\sigma(d)=\sigma_0(1+d/R)$). This yields a persistent, structured per-robot visibility (the persistent-masking regime) emergent from physics. Sweeping $R$ (effective coverage $0.11\to0.86$): the categorical unseen-pair win SURVIVES once coverage exceeds $\approx0.3$, with RewardCF unseen skill $0.15\to0.30$ versus the structure-free floor ($\approx0$) at EVERY radius, and degrades gracefully only at very sparse sensing (10\% coverage, where there is too little data to recover the shared structure). So the generalization result holds under geometry-limited, distance-noisy sensing, the operating regime of a real decentralized swarm, not merely under an injected mask. Why a swarm: collaboration value and positive scaling (Fig. 5). Two experiments isolate what the swarm and the broadcast actually buy. (a) Collaboration value. We sweep the broadcast rate from $\rho{=}0$ (ISOLATED: each robot sees ONLY its own outcomes) to $\rho{=}1$ (full passive sensing). A lone robot cannot recover the shared low-rank structure from its single matrix row, so isolated unseen skill is $\approx0$; the broadcast UNLOCKS generalization, worth $+0.39$ unseen skill to our online RewardCF but only $-0.00$ / $+0.01$ to the structure-free UCBIndep / Tabular, which have no model linking tasks and so cannot use the broadcast at all. (The batch-refit baseline PTF gains the most from a broadcast, $+0.51$, because a one-shot refit on a near-fully-observed matrix is the best case for batch low-rank completion; it accordingly overtakes us at full broadcast $\rho{=}1$, whereas our online method leads in the masked regime $\rho\le0.5$ that defines the operational problem. This crossover near $\rho\approx0.6$ reproduces across the bake-off and crossover sweeps with near-identical numbers and is exactly what matrix-completion theory predicts: a one-shot SVD is the near-optimal estimator once the matrix is densely observed.) This is the operational answer to 'why share?': communication-free sensing turns $m$ isolated, near-useless learners into one effective swarm. (b) Positive scaling with team size. Holding $n$, horizon $T$, and $\rho$ fixed and growing the swarm $m{=}5\to80$, RewardCF unseen skill RISES monotonically $0.08\to0.43$ (non-overlapping CIs): more robots feed $\sim mT\rho$ broadcast observations into the SHARED structure, so each robot's skill on tasks it never serviced grows with the team (the collective-recovery effect). The structure-free learner is FLAT ($m$ cannot help an agent that only learns its own row). Positive scaling with team size, the swarm getting SMARTER as it grows, is unique to structure-sharing and the opposite of the usual interference penalty as agents are added. Operational mission (application). The abstract result is an operational target-servicing / dispatch mission: the latent factors are robot CAPABILITY traits $p_i$ and task REQUIREMENT traits $u_j$ (sensor modalities, payload, effector type, endurance), low-rank because few traits govern fit; effectiveness $\langle p_i,u_j\rangle$ is the match. Over the mission each robot repeatedly services an offered target and earns that match (our standard reward), under range-limited, distance-noisy passive sensing. On the SAME metric (servicing skill), against the FULL field (low-rank PTF/BPMF/SoftImpute/MFSGD and the explore-then-commit ESTR, AND structure-free) under limited observability ($\rho{=}0.25$): ours $0.36$ vs the best of the entire field $0.29$ (SoftImpute), non-overlapping CIs, structure-free at the random-dispatch floor; at full broadcast the structured field is competitive and the separation opens under masking. The swarm dispatches the right asset to targets it never personally serviced because it recovers the shared trait structure from the broadcast. This generalizes trait-based MRTA (capability-vs-requirement matching) to UNKNOWN traits learned online, decentralized, communication-free. Applications: limited-resource servicing under degraded comms, search-and-service, coverage-with-capability, SEAD-style sensor-target matching. Operational metrics (mission language, Fig. 6). Restating the same runs in operational terms makes the swarm value concrete. Time-to-competence: under limited observability our online CF reaches a quarter of oracle dispatch in about 35 rounds (every seed), while the structure-free and batch-refit competitors mostly never reach it within the mission. Cumulative lost value: summed over the run, our methods accumulate the least regret (oracle minus earned) at every broadcast rate. New-asset readiness: a freshly added drone becomes effective on targets it never personally serviced after only a handful of engagements (on the order of the latent rank), independent of the total target count, whereas a structure-free newcomer never becomes ready. Resilience to attrition: under continuous turnover, confidence-directed CF retains the highest skill on both the standing set and (especially) the newest arrivals, while structure-free collapses on newcomers. Two further metrics need careful reading rather than a raw number. Redundancy: raw collision rate is minimized trivially by random dispatch (spreading avoids collisions but earns nothing), so we read it as an efficiency frontier instead: among the reward-seeking de-confliction methods at severe contention, our private-offset variants are the top earners and the lowest-collision method overall, dominating the CBBA-backoff, MAB-re-seating, and batch baselines on the earned-vs-collision frontier (a genuine coordination win once collisions are read against earned value). Information efficiency splits in two: for OFFLINE estimation (skill per observed entry) all low-rank methods turn a tiny budget into real skill while structure-free turns it into none, with the batch PTF marginally ahead; for ONLINE earning (reward per round while learning) we win, because the batch methods pay an explore/probe phase, so our cumulative lost value is the lowest. Performance scorecard. One view of the design-space sweep on a single canonical masked harness: our online CF leads the masked-observation column and both operational columns (lowest cumulative regret, fastest to competence); our batch hybrid PTF wins only the full-broadcast column (its one-shot refit is best when the matrix is densely observed); the standard low-rank estimators trail; and the structure-free learners sit at the floor on every column. One canonical masked harness (m=30, n=240, 8 seeds): unseen skill from the bake-off, regret and time-to-competence from the anytime trajectories. Our online CF leads the masked column and the operational columns; the batch-refit PTF (also ours) wins only the full-broadcast column; structure-free sits at the floor. The confidence/contention/unified refinements (EMCF, ContentionAdaCF, UnifiedCF) and the SoftImpute / KNNCF baselines are evaluated in their dedicated experiments. The table below locates the major MRTA and decentralized-learning paradigms on the four axes that define our problem (prior knowledge, communication, distribution, observability). Each established family relaxes at least one axis we hold fixed; our cell, low-rank with only a GUESSED rank, no communication, decentralized, masked and noisy, is the one left open. Matrix completion (Candes-Recht; Keshavan-Montanari-Oh; Recht) gives centralized estimation bounds; we make them decentralized, online, and broadcast-only, with the unseen-pair floor making the gap categorical and an anytime (decision-reward) separation the completion theory does not address. Low-rank / bilinear bandits (explore-then-commit spectral ESTR; probe-then-fit; rank-1 and generalized low-rank matrix bandits) are centralized and/or phase-structured; we are anytime and distributed. Bayesian PMF over-explores in the anytime regime. Federated/gossip CF shares factors; we share nothing but a passively-sensed public stream. The CLOSEST neighbor is multi-user RL with low-rank rewards (Nagaraj/Agarwal 2022), which shares the low-rank-across-agents structure but CENTRALIZES trajectory aggregation, whereas we keep parameters private. CF-bandits / clustering-of-bandits (CLUB, COFIBA, BanditMF) cluster users centrally and optimize regret on PULLED arms, not unseen-pair generalization. For contention our communication-free symmetry-breaking parallels no-communication multiplayer bandits but acts on a LEARNED low-rank preference; we benchmark the two standard no-comms de-confliction primitives DIRECTLY, a CBBA-style auction with public-loss backoff and SIC-MMAB / musical-chairs randomized re-seating (both given our SAME CF utility, so the comparison isolates the primitive), and the fixed private offset beats both at severe contention with non-overlapping CIs ($0.10$-$0.105$ vs $0.064$ vs $0.028$). MRTA (auction / consensus CBBA / DCOP) and cooperative MARL (CTDE: MAPPO/QMIX/VDN) assume communication, known utilities, or centralized training, so we compare against their admissible comms-free reductions (the CBBA-lite and multiplayer-MAB baselines above). Honest positioning. Every individual ingredient (low-rank bandits, online MF, observe-neighbors bandits, no-comms multiplayer matching, Dawid-Skene reliability) exists in prior work; the contribution is COMPOSITIONAL, this exact combination (decentralized + private parameters + broadcast-only + online + sample-starved, with an unseen-pair categorical separation) is the open niche, and the genuinely new mechanisms are the masking-robust zero-weight (not zero-value) estimator, the anytime separation under starvation, the comms-free de-confliction on a learned preference, and held-out choice-informativeness. We do not claim a new estimator. MARL view. In multi-agent-RL terms this is a COOPERATIVE, decentralized, communication-free multi-agent BANDIT with low-rank structure (a multiplayer matrix/low-rank bandit), not a sequential Markov game (reward is stateless). This locates each MARL family exactly: (i) independent learners (IQL / independent-UCB) are our UCBIndep, with no shared structure, hence the categorical unseen floor (the tabular-floor result); (ii) CTDE (MAPPO/QMIX/VDN) needs centralized training or parameter sharing, and learned-communication methods (CommNet/TarMAC/DIAL) need a message channel, both inadmissible here, our broadcast is PASSIVE sensing, not transmission; (iii) the admissible no-comms multiplayer-bandit matchers (SIC-MMAB / musical chairs) are exactly the de-confliction baselines we beat. Read positively, broadcast-CF is DECENTRALIZED MODEL-BASED MARL whose shared world model is the low-rank reward matrix, estimated privately by each agent; the active-exploration and de-confliction mechanisms are emergent coordination WITHOUT communication; and the choice channel is decentralized teammate modeling. The differentiator over standard MARL is generalization to UNSEEN tasks, which independent learners provably cannot do and structured single-agent bandits do not address in the no-comms multi-agent regime. Contention. Our formal results assume non-contention (a target may be engaged by more than one drone). Hard contention turns allocation into matching, where explicit coordination may add value beyond preference quality. We test this directly with a capacity-1 matching variant (each engaged target is awarded to one drone; collisions earn nothing; earned reward is normalized by the per-round Hungarian optimum). Two things hold: the CATEGORICAL preference quality (unseen-pair skill, evaluated contention-free) is contention-invariant because it is a property of the learned model; and on EARNED reward our methods still lead, because diverse, accurate preferences spread drones across targets and reduce collisions, though the operational gap narrows as collisions, not preferences, become the bottleneck. And the operational loss is recoverable WITHOUT communication: keeping the CF estimate but swapping greedy argmax for a FIXED PRIVATE per-target offset de-conflicts look-alike drones and roughly DOUBLES earned reward at severe contention; an adaptive, scarcity-gated version of that offset extends the win across the contested range and reduces to plain CF when targets are plentiful. Shared offer pool of size `pool` (smaller = more contention; m=30 drones, n=240). Each target is awarded to one random contender; losers earn 0. anytime = earned reward normalized by the per-round MATCHING optimum (Hungarian), 8 seeds, bootstrap 95% CI. unseen = preference quality evaluated contention-free AFTER training under contention (the categorical claim). rho=1.00 (full broadcast, to isolate contention from masking). Takeaway: the CATEGORICAL preference quality (unseen) is contention-invariant (CF stays high, structure-free at the floor) because it is a property of the learned model; the OPERATIONAL earned-reward gap narrows as collisions, not preferences, become the bottleneck, yet CF still leads by spreading drones across targets via diverse, accurate preferences (lower collision rate). De-confliction primitive (a recognized MRTA baseline). CBBAlite is the canonical consensus-based auction (CBBA) with the comms/consensus step REMOVED: each drone bids its OWN CF-predicted utility on the public pool (identical model class to RewardCF) and de-conflicts by a reactive, public-loss BACKOFF. It isolates the de-confliction primitive against our fixed PRIVATE per-drone offset (Theorem 7). At severe contention (pool=15): ContentionAdaCF 0.100 vs CBBAlite 0.064 vs greedy RewardCFconv 0.059 earned. Reading: our proactive private-offset de-confliction BEATS the reactive auction-with-backoff -- a proactive STATIC private offset spreads drones once and for all, whereas a reactive SHARED backoff makes all colliders flee the same target together, re-synchronizing them. Non-stationarity (churn). When targets continuously turn over (departures plus fresh arrivals), PLAIN exploitative CF does not suffice: fold-in's order-$d$ probe latency is outpaced by churn on the freshest arrivals, where exploitative CF even trails a structure-free optimist ($0.074$ vs $0.132$). The fix is the SAME confidence-directed exploration from the refinements menu: a CF estimator that PROBES uncertain (fresh) targets and folds them in (ActiveCF count-bonus, or EMCF predictive-variance UCB) DOMINATES under churn on both the active set and the freshest arrivals (EMCF active $0.842$ vs $0.619$ structure-free; fresh arrivals $0.371$ vs $0.132$, non-overlapping CIs). The swarm stays adapted under continuous change, provided it unites low-rank fold-in with directed newcomer-probing, neither alone suffices. Under minimal assumptions, collaborative filtering over the public broadcast lets a swarm act on the unseen, onboard new tasks and agents, and earn from the first round, with a categorical, theory-backed advantage over structure-free learning and dominance over prior-art low-rank methods in the limited-observability regime. Reproducibility: complete per-seed data in 5. Experiments
Method Provenance Dist Comm Observability Prior Compute Profile SwarmCF (RewardCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-B (EMCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-RC (BothCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-Ch (ChoiceCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-D+ (ContentionAdaCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-U (UnifiedCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-H (HybridCF) ours decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineSwarmCF-B-ARD (ARD-EMCF) ours decentralized 0 ρσ ARD online D|0|ρσ|ARD|onlineSwarmCF-batch (PTF) ours (hybrid) decentralized 0 ρσ d̂ batch D|0|ρσ|d̂|batchMFSGD standard, adapted decentralized 0 ρσ d̂ online D|0|ρσ|d̂|onlineKNNCF standard, adapted decentralized 0 ρσ – mem D|0|ρσ|–|memBiasModel standard, adapted decentralized 0 ρσ – online D|0|ρσ|–|onlineBPMF standard, adapted decentralized 0 ρσ d̂ batch D|0|ρσ|d̂|batchSoftImpute standard, adapted decentralized 0 ρσ d̂ batch D|0|ρσ|d̂|batchCLUB standard, adapted decentralized 0 ρσ – batch D|0|ρσ|–|batchESTR standard, adapted decentralized 0 ρ d̂ ETC D|0|ρ|d̂|ETCUCBIndep structure-free baseline decentralized 0 ρσ – online D|0|ρσ|–|onlineUCBHomo structure-free baseline decentralized 0 ρσ – online D|0|ρσ|–|onlineTabular structure-free baseline decentralized 0 ρσ – online D|0|ρσ|–|onlineRandom structure-free baseline decentralized 0 – – – D|0|–|–|–CentralClean-ceiling reference (upper bound) centralized full full d̂ online C|full|full|d̂|onlineCTDE-ceiling reference (upper bound) centralized full fullσ d̂ online C|full|fullσ|d̂|onlineOracle reference (upper bound) centralized full full U* – C|full|full|U*|–tabula_drone; not to be confused with our Tabular baseline), which has dynamics our model lacks: spatial geometry, target health that DEPLETES as engaged (a soft contention/matching pressure), episodic resets, and a Hungarian optimal-assignment oracle. We benchmark against the environment's own policies (Fig. 3b). On the efficiency metric (reward per step), our method reaches skill ~0.81 with low variance, beating the environment's SGD matrix-factorization (~0.25) and UCBIndep (~0.72) and approaching the matching oracle. This tests TRANSFER and partial-contention robustness, not real-world realism (it is still a simulator, and itself latent-based, so it is not an independent test of low-rankness). For the low-rank assumption itself we stress it directly (approximate low-rank via entrywise noise; a nonlinear link that raises the effective rank) and observe graceful degradation with the ranking preserved.UNSEEN-pair skill
method rho=1.0 rho=0.25 UCBIndep 0.000 [-0.004, 0.004] -0.001 [-0.005, 0.004] PTF 0.490 [0.458, 0.517] 0.272 [0.251, 0.293] RewardCF 0.377 [0.355, 0.398] 0.326 [0.303, 0.349] HybridCFconv 0.488 [0.468, 0.509] 0.379 [0.354, 0.401] ActiveCFconv 0.476 [0.461, 0.493] 0.339 [0.317, 0.358] ANYTIME skill
method rho=1.0 rho=0.25 UCBIndep 0.002 [-0.006, 0.010] -0.004 [-0.010, 0.002] PTF 0.273 [0.257, 0.290] 0.226 [0.215, 0.236] RewardCF 0.383 [0.366, 0.399] 0.342 [0.332, 0.352] HybridCFconv 0.340 [0.327, 0.354] 0.300 [0.288, 0.311] ActiveCFconv 0.436 [0.419, 0.451] 0.341 [0.323, 0.357] Method ablation
Design-choice ablation (unseen-pair and anytime skill)
variant (knob) unseen rho=1.0 unseen rho=0.25 anytime rho=1.0 anytime rho=0.25 ActiveCFconv
online wALS + precision + active-UCB (RECOMMENDED)0.485 [0.466, 0.506] 0.360 [0.345, 0.376] 0.440 [0.424, 0.455] 0.348 [0.330, 0.367] HybridCFconv
- active; + probe->SVD warm-start (eps-greedy)0.501 [0.478, 0.526] 0.396 [0.371, 0.421] 0.353 [0.337, 0.367] 0.308 [0.295, 0.321] RewardCFconv
- active, - warm-start (plain online wALS, eps-greedy)0.450 [0.427, 0.473] 0.372 [0.347, 0.395] 0.390 [0.373, 0.405] 0.332 [0.321, 0.342] RewardCFconv_noprec
- precision weighting (uniform obs weights)0.584 [0.574, 0.594] 0.392 [0.371, 0.413] 0.438 [0.426, 0.452] 0.339 [0.328, 0.349] PTF
batch SVD + SGD finetune (explore-then-commit)0.505 [0.482, 0.528] 0.276 [0.248, 0.305] 0.277 [0.260, 0.295] 0.231 [0.217, 0.245] ESTR
explore-then-spectral commit0.224 [0.197, 0.251] 0.047 [0.034, 0.059] 0.218 [0.196, 0.241] 0.185 [0.173, 0.197] Rank sensitivity of ActiveCFconv (rho=0.25; true d=5, default d_hat=8)
guessed rank d_hat unseen skill anytime skill 2 0.222 [0.198, 0.245] 0.270 [0.247, 0.293] 5 0.327 [0.296, 0.356] 0.351 [0.331, 0.369] 8 0.360 [0.345, 0.376] 0.348 [0.330, 0.368] 12 0.359 [0.328, 0.390] 0.347 [0.329, 0.365] 20 0.344 [0.313, 0.369] 0.340 [0.324, 0.354] Scoped refinements: one core, a menu of conditional extensions
Refinement Helps when Result (8-20 seeds, CIs) Theory Bayesian confidence (EMCF) directed exploration; coverage-scarce or noisy broadcast predictive uncertainty is DISCRIMINATIVE (actual RMSE rises monotonically Q1$\to$Q5), and beats uniform when the broadcast is noisy Bayesian posterior Bounded precision weighting observation sources DIFFER in noise ratio-capped precision wins under heterogeneous noise ($+0.09$ unseen, non-overlapping CIs); UNBOUNDED $1/\sigma^2$ over-concentrates and loses everywhere precision weighting Collective info-directed exploration anytime sample efficiency one agent's broadcast probe lowers everyone's uncertainty; best FINAL anytime skill n/a ARD rank adaptation rank unknown self-determines the IDENTIFIABLE rank ($\le d$) and is INVARIANT to the guessed $\hat d$, which is what removes the rank hyperparameter (the recovered value reflects SNR/coverage, not the raw true rank) rank identifiability Fixed private offset ($+$ adaptive) capacity-1 contention $\approx 2\times$ earned reward vs greedy at severe contention ($0.10$ vs $0.059$), and BEATS both recognized no-comms de-confliction primitives, the CBBA auction-with-backoff ($0.064$) and multiplayer-MAB re-seating ($0.028$), with non-overlapping CIs; the adaptive scarcity-gated version extends the win across the contested range ($|S|\le m$) private-offset result Choice channel $+$ held-out $\gamma$ extreme reward-noise; mixed-reliability teammates noise-immune fallback (overtakes the reward channel at $\sigma_{\mathrm{obs}}{=}2$); held-out $\gamma_k$ correctly ranks teammates (oracle $0.49\gg$ random $0.10$) choice identifiability Method Provenance unseen skill
(ρ=0.25, masked)unseen skill
(ρ=1, full)cumulative regret
(ρ=0.25, lower=better)rounds to 25%
of oracleprofile SwarmCF (RewardCF) ours 0.336 0.376 41.2 35 D|0|rho,sig|dhat|onlineSwarmCF-RC (BothCF) ours 0.349 0.372 41.2 36 D|0|rho,sig|dhat|onlineSwarmCF-batch (PTF) ours (hybrid) 0.280 0.516 46.0 never D|0|rho,sig|dhat|batchMFSGD standard, adapted -0.019 0.042 46.5 never D|0|rho,sig|dhat|onlineBPMF standard, adapted 0.126 0.233 49.9 never D|0|rho,sig|dhat|batchESTR standard, adapted 0.058 0.232 46.4 never D|0|rho|dhat|ETCUCBHomo structure-free baseline 0.070 0.167 50.3 never D|0|rho,sig|none|onlineUCBIndep structure-free baseline 0.003 0.004 50.4 never D|0|rho,sig|none|onlineTabular structure-free baseline 0.003 -0.001 43.0 46 D|0|rho,sig|none|onlineRandom structure-free baseline 0.004 0.007 50.0 never D|0|-|none|-6. Related work
Paradigm Prior knowledge Communication Distribution Observability Note Auction / CBBA (consensus bundle) known task values/costs message-passing (bids) decentralized full task info needs communication AND known utilities DCOP / consensus MRTA known constraints/utilities message-passing decentralized full needs communication AND a known objective Cooperative MARL (CTDE: MAPPO/QMIX/VDN) none (learned) centralized training centralized-train / decentralized-exec full (in training) central critic; not comms-free Learned-communication MARL (CommNet/TarMAC) none (learned) learned messages decentralized partial + messages broadcast is learned message-passing, not passive No-comms multiplayer bandits (SIC-MMAB, musical chairs) none (per-arm) none decentralized own pulls + collisions comms-free but STRUCTURE-FREE (no unseen generalization) Matrix completion (nuclear norm, spectral) low-rank centralized (one matrix) centralized partial (uniform) centralized estimation, not online/decision Low-rank / bilinear bandits (ESTR, etc.) low-rank centralized centralized partial centralized and/or explore-then-commit Federated / gossip CF low-rank broadcast of factors/gradients decentralized partial shares PARAMETERS, not a passive stream Multi-user RL, low-rank rewards (Nagaraj-Agarwal) low-rank centralized aggregation centralized partial closest prior; centralizes trajectory aggregation Trait-based MRTA (Prorok et al.) KNOWN traits varies decentralized full capability/requirement traits are GIVEN, not learned OURS (broadcast CF for ZK-MRTA) low-rank, GUESSED rank only none (passive sensing) decentralized masked + noisy the hardest cell: no comms, no known utilities/traits, guessed rank 7. Limitations and conclusion
Operational: earned-reward skill (matching-normalized)
method pool=240 pool=60 pool=30 pool=15 ContentionAdaCF 0.448 [0.422, 0.472] 0.205 [0.186, 0.223] 0.153 [0.132, 0.173] 0.100 [0.086, 0.113] ContentionCF 0.314 [0.290, 0.337] 0.178 [0.160, 0.200] 0.134 [0.115, 0.155] 0.105 [0.096, 0.114] CBBAlite 0.464 [0.434, 0.487] 0.216 [0.204, 0.226] 0.127 [0.112, 0.142] 0.064 [0.047, 0.081] MusicalChairs 0.310 [0.279, 0.344] 0.124 [0.108, 0.141] 0.077 [0.053, 0.103] 0.028 [0.019, 0.040] ActiveCFconv 0.439 [0.412, 0.464] 0.230 [0.206, 0.252] 0.125 [0.109, 0.145] 0.046 [0.022, 0.071] RewardCFconv 0.439 [0.418, 0.463] 0.199 [0.186, 0.214] 0.121 [0.102, 0.138] 0.059 [0.044, 0.079] PTF 0.243 [0.224, 0.260] 0.151 [0.136, 0.168] 0.096 [0.083, 0.110] 0.057 [0.040, 0.075] UCBIndep 0.005 [-0.004, 0.014] -0.002 [-0.009, 0.005] -0.002 [-0.010, 0.006] 0.004 [-0.004, 0.013] Random -0.010 [-0.023, -0.000] 0.002 [-0.010, 0.014] -0.003 [-0.010, 0.005] -0.010 [-0.026, 0.007] Categorical: unseen-pair skill (learned under contention, eval contention-free)
method pool=240 pool=60 pool=30 pool=15 ContentionAdaCF 0.320 [0.282, 0.356] 0.302 [0.283, 0.323] 0.352 [0.325, 0.382] 0.295 [0.260, 0.326] ContentionCF 0.023 [0.002, 0.048] 0.290 [0.268, 0.313] 0.312 [0.270, 0.345] 0.273 [0.233, 0.311] CBBAlite 0.339 [0.311, 0.365] 0.385 [0.351, 0.421] 0.355 [0.334, 0.379] 0.288 [0.256, 0.322] MusicalChairs 0.120 [0.090, 0.150] 0.257 [0.215, 0.304] 0.298 [0.264, 0.326] 0.275 [0.251, 0.297] ActiveCFconv 0.113 [0.079, 0.145] 0.350 [0.319, 0.383] 0.310 [0.287, 0.330] 0.195 [0.165, 0.221] RewardCFconv 0.323 [0.291, 0.361] 0.383 [0.351, 0.414] 0.339 [0.314, 0.367] 0.287 [0.242, 0.324] PTF 0.343 [0.333, 0.356] 0.338 [0.302, 0.375] 0.262 [0.232, 0.293] 0.141 [0.114, 0.169] UCBIndep -0.007 [-0.014, -0.000] 0.001 [-0.006, 0.005] -0.003 [-0.009, 0.002] -0.006 [-0.015, 0.003] Random -0.007 [-0.026, 0.014] 0.014 [-0.002, 0.031] -0.009 [-0.027, 0.009] -0.005 [-0.028, 0.019] Collision rate (fraction of engagements lost to contention)
method pool=240 pool=60 pool=30 pool=15 ContentionAdaCF 0.126 0.392 0.465 0.581 ContentionCF 0.245 0.466 0.535 0.625 CBBAlite 0.091 0.356 0.507 0.658 MusicalChairs 0.391 0.555 0.625 0.697 ActiveCFconv 0.207 0.376 0.503 0.685 RewardCFconv 0.147 0.371 0.526 0.660 PTF 0.187 0.323 0.448 0.624 UCBIndep 0.054 0.209 0.363 0.561 Random 0.061 0.211 0.369 0.563 results/pilots/; figures via experiments/make_figures.py; proofs in docs/THEORY_FORMAL.md; ZK audit in docs/ZK_COMPLIANCE.md; full v1 draft in docs/PAPER_DRAFT.md.