Collaborative filtering for decentralized multi-robot task allocation under limited, communication-free observability. A self-contained, step-by-step, graduate-level tutorial of the whole project: motivation, background, model, baselines, method, theory, metrics, and results.
zero prior knowledge zero communication partial + noisy observation fully distributed low-rank latent structure online anytimeCompanion streamlined paper (v2): paper_v2.html. Full draft: docs/PAPER_DRAFT.md. Repository: ApartsinProjects/ZKDroneSwarm.
This tutorial uses a handful of symbols and abbreviations over and over. Here is every one of them in plain words; keep this table handy and refer back whenever a formula looks dense. None of the later sections assume you have memorized any of this.
| Symbol / term | Plain meaning |
|---|---|
| Counts and indices | |
| $m$ | number of drones (the agents / decision-makers). In our experiments $m=30$. |
| $n$ | number of targets (the tasks a drone can pick). In our experiments $n=240$. |
| $T$ | number of rounds (each round, every drone picks one target). We use $T=50$. |
| $i$ | index of a particular drone (a row), e.g. 'drone $i$'. |
| $j$ | index of a particular target (a column), e.g. 'target $j$'. |
| $k$ | index of a teammate (another drone) when we talk about one drone watching another. |
| The hidden structure | |
| $R$ | the reward table: $R_{ij}$ is how well drone $i$ does on target $j$ (its compatibility / payoff). This is what we are trying to learn. It is never fully observed. |
| $R_{ij}$ | one entry of that table: the reward (a number, here in $[-1,1]$) drone $i$ gets from target $j$. |
| $\hat R_{ij}$ | our estimate of $R_{ij}$ (the 'hat' always means 'estimated / guessed', not the true value). |
| $d$ | the true rank: the number of hidden factors that actually explain compatibility. Small $d$ (here $d=5$) means the table is 'low-rank', i.e. simple underneath. |
| $\hat d$ | the guessed rank the algorithms assume (here $\hat d=8$). They do NOT know the true $d$. |
| $p_i$ | drone $i$'s hidden taste vector (a list of $\hat d$ numbers). Think 'what this drone is good at'. |
| $u_j$ | target $j$'s hidden profile vector (also $\hat d$ numbers). Think 'what this target needs'. |
| $\langle p_i, u_j\rangle$ | the dot product of the two vectors: multiply them entry-by-entry and add up. It is our predicted compatibility $\hat R_{ij}$ (taste meets profile). |
| Observability (what a drone can see) | |
| $\rho$ (rho) | broadcast probability: the chance a drone sees any given teammate's outcome each round. $\rho=1$ means it sees everything; small $\rho$ means it sees little (heavy masking). |
| $\sigma_{\mathrm{obs}}$ (sigma-obs) | noise on a teammate's observed reward: how garbled the numbers a drone overhears are. Large $\sigma_{\mathrm{obs}}$ = very noisy. |
| $\sigma_{\mathrm{own}}$ | noise on a drone's OWN reward (smaller, here $0.10$): even your own sensor is imperfect. |
| $\varepsilon$ (epsilon) | exploration rate: the chance a drone tries a random target instead of its current best guess (so it keeps learning). It shrinks over time. |
| Methods and math tools | |
| CF | collaborative filtering: predicting your unknowns by borrowing patterns from others (the 'people who liked X also liked Y' idea, here for drones and targets). |
| MF | matrix factorization: approximating the big table $R$ as $p_i$ times $u_j$, i.e. discovering the hidden taste/profile vectors. The engine under CF. |
| ALS | alternating least squares: a way to fit MF by repeatedly fixing one set of vectors and solving for the other, back and forth, until they settle. |
| EM | expectation-maximization: a two-step loop that alternates 'guess the hidden thing given current beliefs' and 'update beliefs given that guess'. Used for our uncertainty-aware variants. |
| $\arg\max$ | 'the option that scores highest'. $\arg\max_j \hat R_{ij}$ = the target $j$ with the biggest estimated reward for drone $i$. |
| $\Sigma$ (Sigma) | a covariance / uncertainty bundle: how unsure we are about a vector, and how its components co-vary. Bigger $\Sigma$ = less confident. |
| $\gamma_k$ (gamma) | choice-informativeness of teammate $k$: how much drone trusts that $k$'s pick reflects real knowledge rather than a random guess (Section 5.6). |
| ZK | zero-knowledge / zero-communication: agents never message each other or share parameters; they only passively watch the public outcome stream. |
| Scoring | |
| skill | a 0-to-1 score: $0$ = no better than random, $1$ = as good as an all-knowing oracle. Formally $(\text{method}-\text{random})/(\text{oracle}-\text{random})$. Lets us compare fairly. This is a STANDARD normalization, not ours: it is the skill score of forecast verification (Murphy, (forecast-reference)/(perfect-reference)) and the random/human-normalized score ubiquitous in RL benchmarking (e.g. Atari); an affine rescaling of regret. |
| unseen-pair skill | skill measured ONLY on drone-target pairs the drone never tried, the acid test of generalizing to the unknown. |
| anytime skill | skill accumulated over ALL rounds (not just the end), rewarding methods that are good EARLY too. |
| CI | confidence interval: the bracket $[\text{lo},\text{hi}]$ after a number is the range we are $95\%$ sure the true average lies in. Non-overlapping CIs = a real difference, not luck. |
RewardCF), SwarmCF-B the Bayesian-confidence variant (EMCF), SwarmCF-D/D+ the contention variants (ContentionCF/AdaCF), SwarmCF-batch the batch hybrid (PTF), and so on. Figures and tables use the SwarmCF names; running prose and the code listings below retain the original class names so they match experiments/pilot_noise.py.Imagine a swarm of autonomous drones that, round after round, must each pick a target to engage. Three facts make this both hard and interesting:
Five classical tools combine into our method. Each is given formally (a definition or formula) and in plain words; Section 5 then specializes them to the decentralized, online, masked setting.
Predict the unknown entries of a user-item (here drone-target) matrix from a few observed ones by assuming the matrix is low-rank.
Let $R\in\mathbb{R}^{m\times n}$ have rank $d\ll\min(m,n)$, and let $\Omega\subseteq[m]\times [n]$ be the observed entries (with noise). Matrix completion seeks $\hat R$ of rank $\le d$ fitting the observations, e.g. the nuclear-norm (convex) program $\min_{X} \lVert X\rVert_* \ \text{s.t.}\ \sum_{(i,j)\in\Omega}(X_{ij}-R_{ij})^2\le\delta$, where $\lVert X\rVert_*=\sum_r\sigma_r(X)$ is the sum of singular values. Guarantee: a rank-$d$, $\mu$-incoherent matrix is recovered (exactly, noiseless) from $|\Omega| = \tilde O\big(d(m+n)\big)$ entries sampled at random (Candes-Recht 2009; Keshavan-Montanari-Oh OptSpace 2010; Recht 2011). Incoherence rules out 'spiky' factors so that uniformly random samples are informative.
We use this as the statistical backbone but in a NON-standard way: decentralized (per drone), online (one round at a time), and masked (each drone sees a different, biased subset of $\Omega$).
The non-convex but biconvex way to fit $R\approx PU^\top$: fix one factor, solve the other in closed form, alternate. It is the engine of our estimator (Section 5.1).
Minimize $\sum_{(i,j)\in\Omega}(R_{ij}-\langle p_i,u_j\rangle)^2 + \lambda(\lVert P\rVert_F^2 +\lVert U\rVert_F^2)$. With one factor fixed the objective is a sum of independent ridge regressions:
$$ p_i = \Big(\textstyle\sum_{j\in\Omega_i} u_j u_j^\top + \lambda I\Big)^{-1} \textstyle\sum_{j\in\Omega_i} R_{ij}\,u_j, \qquad u_j = \Big(\textstyle\sum_{i\in\Omega_j} p_i p_i^\top + \lambda I\Big)^{-1} \textstyle\sum_{i\in\Omega_j} R_{ij}\,p_i, $$where $\Omega_i,\Omega_j$ are the observed columns of row $i$ / rows of column $j$. Each solve is $O(d^3 + |\Omega_i| d^2)$. Weighted ALS (WALS) puts a weight $w_{ij}$ on each term; the matrix inverted is then the per-factor Gaussian posterior precision, the door to confidence (Section 5.3).
Sequential decision-making that exploits low-rank reward structure to act with few samples.
Each round, choosing an (arm-feature, context) pair $(x,z)$ yields reward $\langle x, \Theta z\rangle + \text{noise}$ for an unknown LOW-RANK $\Theta$ (rank $d$). The goal minimizes cumulative regret $\mathrm{Reg}(T)=\sum_{t}\big(\max_{x,z}\langle x,\Theta z\rangle - \langle x_t,\Theta z_t\rangle\big)$. Explore-then-commit spectral methods (e.g. ESTR, Jun et al. 2019) spend a probe phase, estimate $\Theta$ by SVD, then exploit; they are typically CENTRALIZED and phase-structured. Our setting is the special case where rows are drones and columns are targets, but ANYTIME (no probe phase) and DISTRIBUTED.
Given known item factors $U$, a brand-new user's factor is fit from a few interactions by one small least-squares solve, no retraining (the WALS 'fold-in').
A new drone with observed rewards $\{r_a\}$ on targets with known factors $\{u_a\}_{a=1}^k$ gets factor $\hat p = \big(\sum_a u_a u_a^\top + \lambda I\big)^{-1}\sum_a r_a u_a$, after which $\hat R_{\cdot j}=\langle\hat p, u_j\rangle$ is available for ALL targets $j$. Exact once $k\ge d$ and $\{u_a\}$ span $\mathbb{R}^d$ (row completion). We reuse this symmetrically for onboarding new TARGETS (fold-in a $u$ from drones' probes) and welcoming new DRONES.
Treating an observed CHOICE (which target a teammate engaged) as a preference signal, while correcting for what options were available (exposure).
Implicit-feedback CF (Hu-Koren-Volinsky 2008) fits preferences $y_{ij}\in\{0,1\}$ with per-pair CONFIDENCE $c_{ij}$: $\min \sum_{i,j} c_{ij}(y_{ij}-\langle p_i,u_j\rangle)^2 + \lambda(\dots)$, where a chosen item is a (confident) positive and unchosen offered items are (less-confident) negatives. We treat a teammate's engaged target as a positive and sample negatives from the PUBLIC active-target set (the exposure / choice set), so no private menu is needed (the ZK choice channel). This 'confidence' is exactly what Section 5.5 deepens.
There are $m$ drones and $n$ targets. Drones come in $K_1$ latent types and targets in $K_2$ types (a 'block model'); within a type, agents and targets are similar but not identical. The hidden compatibility between drone $i$ and target $j$ is the cosine similarity of their latent factor vectors $p_i, u_j \in \mathbb{R}^d$. Everything below is generated once per random seed; no learner ever sees $P$, $U$, or $R$.
Draw type prototypes $A\in\mathbb{R}^{K_1\times d}$ and $B\in\mathbb{R}^{K_2\times d}$ with i.i.d. $\mathcal{N}(0,1)$ entries and unit-normalize each row. Assign each drone a type $t_i\in\{1,\dots,K_1\}$ and each target a type $s_j\in\{1,\dots,K_2\}$. Factor vectors are the prototype plus a small within-type spread $\xi$ (default $0.15$), renormalized to the unit sphere:
$$ p_i = \frac{A_{t_i} + \xi\,\varepsilon_i}{\lVert A_{t_i} + \xi\,\varepsilon_i\rVert}, \qquad u_j = \frac{B_{s_j} + \xi\,\eta_j}{\lVert B_{s_j} + \xi\,\eta_j\rVert}, \qquad \varepsilon_i,\eta_j \sim \mathcal{N}(0, I_d). $$The true reward (compatibility) matrix is the bilinear cosine form
$$ R_{ij} = \langle p_i, u_j\rangle \in [-1,1], \qquad R = P U^\top \in \mathbb{R}^{m\times n}, \qquad \operatorname{rank}(R) = \min(d, K_1, K_2). $$Here $P\in\mathbb{R}^{m\times d}$ stacks the drone factors as rows and $U\in\mathbb{R}^{n\times d}$ the target factors. Unit norms remove any popularity/scale artifact (no target is good simply by having a large magnitude); the link is exactly bilinear (no nonlinear squashing $g(\cdot)$), so the matrix is GENUINELY rank $d$, the structure CF exploits. We stress-test violations of both assumptions in Section 8.11.
In words. A few hidden 'knobs' ($d$ of them) explain every drone's taste for every target. Two drones of the same type want similar targets; two targets of the same type are wanted by similar drones. That shared, low-dimensional structure is exactly what lets one drone learn about a target from other drones' experiences, the engine of collaborative filtering.
| Symbol | Meaning | Default |
|---|---|---|
| $m$ | number of drones (agents) | 30 |
| $n$ | number of targets (tasks/arms) | 240 |
| $d$ | true latent rank (hidden) | 5 |
| $\hat d$ | GUESSED rank used by every structured method (fair) | 8 |
| $K_1,K_2$ | drone / target latent types | 10, 10 |
| $T$ | rounds (engagements per drone) | 50 |
| $c$ | candidate-set size offered each round | 20 |
| $\sigma_{\mathrm{own}}$ | noise on a drone's OWN observed reward | 0.10 |
| $\sigma_{\mathrm{obs}}$ | noise on an observed TEAMMATE reward | 0.30 |
| $\rho$ | fraction of the broadcast each drone senses (masking) | swept 0..1 |
| $\xi$ | within-type factor spread | 0.15 |
Note $n=240 \gg cT/\ldots$: with $T=50$ rounds and $c=20$ offers, a drone can engage at most $T=50$ distinct targets out of $n=240$, so the regime is SAMPLE-STARVED (most targets are never tried). This is the regime where generalizing to the unseen matters.
Time is discrete rounds $t=1,\dots,T$. In each round EVERY drone acts ONCE and SIMULTANEOUSLY (no turn-taking, no waiting on others): drone $i$ is shown a candidate set $S_i^t\subseteq[n]$ of $c$ targets, picks one $a_i^t\in S_i^t$, and earns that target's true reward $R_{i,a_i^t}$. A public stream then carries the (action, outcome) of every engagement that round; drone $i$ senses its OWN outcome cleanly (noise $\sigma_{\mathrm{own}}$) and a per-drone-limited, noisy subset of teammates' outcomes:
Let $a_k^t$ be the target drone $k$ engages at round $t$, and let $M\in\{0,1\}^{m\times m}$ be the observation mask ($M_{ii}=1$ always). What drone $i$ records about round $t$ is the noisy value
$$ \tilde r^{(i)}_{k,t} = \begin{cases} R_{k,a_k^t} + \mathcal{N}(0,\sigma_{\mathrm{own}}^2) & k=i \ \text{(own, clean)}\\[2pt] R_{k,a_k^t} + \mathcal{N}(0,\sigma_{\mathrm{obs}}^2) & k\neq i,\ M_{ik}=1 \ \text{(teammate, sensed)}\\[2pt] \text{unobserved} & k\neq i,\ M_{ik}=0 \ \text{(masked)} \end{cases} $$The noise is independent per observer (the superscript $(i)$ is essential). Each drone $i$ draws its OWN noise on each engagement and applies its OWN mask $M_{ik}$, so for two observers $i\neq i'$ the values recorded for the SAME action $a_k^t$ differ: $\tilde r^{(i)}_{k,t}=R_{k,a_k^t}+\eta^{(i)}_{k,t}$ and $\tilde r^{(i')}_{k,t}=R_{k,a_k^t}+\eta^{(i')}_{k,t}$ with $\eta^{(i)}_{k,t}\perp\eta^{(i')}_{k,t}$, both $\sim\mathcal{N}(0,\sigma_{\mathrm{obs}}^2)$ and redrawn every round. There is NO single shared 'broadcast number': the stream is what each drone privately SENSES, not a value any drone transmits. This independent per-drone sensing (plus the per-drone mask) is exactly why no two drones hold the same internal state, and why the channel is genuinely passive/communication-free.
The recorded precision of each observation is $\beta = 1/\sigma^2$ (so $\beta_{\mathrm{own}} = 1/\sigma_{\mathrm{own}}^2$, $\beta_{\mathrm{obs}} = 1/\sigma_{\mathrm{obs}}^2$). Two masking regimes: PERSISTENT ($M$ drawn once, $M_{ik}\sim\mathrm{Bernoulli}(\rho)$, fixed for all $t$) gives durably-different per-drone views; i.i.d. ($M^t_{ik}\sim\mathrm{Bernoulli}(\rho)$ redrawn each round, packet-loss style) gives the same headline results (the masking-model result, Section 8.7).
Data format a learner sees. A stream of tuples $(k, a_k^t, \tilde r^{(i)}_{k,t})$ for the $k$ it sensed: the actor id, the discrete action (which target), and the noisy scalar outcome. No factors, no ids of TYPES, no rank, no teammate parameters, ever.
The same rules bind ours and all baselines: no method ever receives the latent factors, the true rank, or labels; structured methods all get the SAME guessed rank $\hat d=8$ (true $d=5$); every method is an INDEPENDENT per-drone learner with no parameter sharing and no coordinator; all see the same partial+noisy stream. The Oracle (centralized, complete information) is used only to normalize scores, never as a competitor.
Putting the world and the observation model together, one episode runs as below. Each drone $i$ carries a policy $\pi_i$ and a private estimator; nothing is shared between drones except what the broadcast passively reveals.
Reading the loop. Step 2 is the only place a method's intelligence enters the EARNED reward (the anytime metric). Step 4 is the only information a method gets about teammates: passive, partial, noisy. No message passing, no shared parameters, no coordinator; 'communication-free' is literal. The estimator update in step 4 is what differs across methods (next two sections).
| Method | Class | What it does |
|---|---|---|
| Random | no-structure | uniform pick; the floor. |
| UCBIndep | no-structure | a separate UCB1 bandit per (drone,target) pair: correct heterogeneity, zero cross-target generalization. |
| UCBHomo | no-structure | one shared target table (assumes all drones identical): recovers only rank-1 'popularity'. |
| Tabular | no-structure | epsilon-greedy on own-row averages. |
| MFSGD | low-rank | plain online SGD matrix factorization; underfits when starved. |
| ESTR | low-rank | explore-then-spectral-refit: random explore, one SVD of the empirical matrix, then commit. |
| PTF (ours: SwarmCF-batch) | low-rank | probe-then-fit: UCB probe, SVD warm-start, then SGD finetune. OUR strongest BATCH variant (listed here for the low-rank comparison); wins only at full broadcast, see Section 5. |
| BPMF | low-rank | Bayesian PMF with Thompson sampling; over-explores in the anytime regime. |
| SoftImpute | low-rank | nuclear-norm CONVEX completion (iterate fill, SVD, soft-threshold). Strongest at full broadcast; decays under masking. |
| kNN-CF | memory-CF | model-FREE collaborative filtering (similarity-weighted neighbor rewards); tests whether explicit factorization is needed. |
| BiasModel | additive | additive drone+target bias, NO interaction (rank≤2); isolates the value of personalization. |
A 'batch-SVD hybrid' (ESTR, and our own PTF / SwarmCF-batch) extracts factors by one SVD of an accumulated reward table whose unobserved entries are imputed 0; under masking that table is sparse and biased, which is why these methods decay (Section 8.5). In our harness every baseline runs as an independent per-drone instance (ESTR's literature default is centralized; we run it per drone for a fair, fully-distributed comparison).
Every method has the same interface: each round it scores the offered targets and picks $a=\arg\max_{j\in S}\hat R_{ij}$ (with probability $\epsilon$ it explores at random instead), then updates from the masked broadcast. Notation: $N_{ij}$ = drone $i$'s own pull count of $j$; $\bar r_{ij}$ = its empirical mean reward on $j$; $t$ = round; $c$ = UCB constant; $\hat d$ = guessed rank. All match the implementations in experiments/pilot_baselines.py and pilot_noise.py.
Random. Intuition: the floor; learns nothing. Rule: $a\sim\mathrm{Unif}(S)$; $\hat R$ is noise. Skill $\approx 0$ by construction.
Tabular ($\epsilon$-greedy). Intuition: treat each target independently; average your OWN rewards on it. Rule: $\hat R_{ij}=\bar r_{ij}$ if $N_{ij}>0$, else the global mean of your observed rewards (a prior constant). Fails because: any never-pulled $j$ stays at the constant, so it cannot rank the unseen, the tabular floor.
UCB-Indep. Intuition: one UCB1 bandit per (drone,target) pair; be optimistic about rarely-tried arms. It updates ALL arms it senses from the broadcast but selects on its own row. Rule: $\text{score}_j=\bar r_{ij}+c\sqrt{\ln(t)/N_{ij}}$, with an untried arm scored $+\infty$ (so it is pulled first). Fails because: no link between arms, so unseen stays at the floor; and with $n\gg T$ almost every offer contains an untried ($+\infty$) arm, so it explores forever and earns $\approx$ random (anytime $\approx 0$).
UCB-Homo. Intuition: assume all drones are identical, so pool everyone's rewards into ONE shared per-target table. Rule: same UCB1 but on the pooled $\bar r_{\cdot j}=\tfrac1m\sum_k \bar r_{kj}$. Fails because: pooling estimates $\langle\bar p, u_j\rangle$ (rank-1 'popularity'), one ranking for everyone, so it gets partial unseen skill ($\approx 0.17$) but is confounded on heterogeneous drones.
MFSGD. Intuition: plain SGD matrix factorization; nudge the factors after each observed reward. Rule: on $(k,j,r)$ with residual $e=\langle p_k,u_j\rangle-r$: $p_k\mathrel{-}=\eta(e\,u_j+\lambda p_k)$, $u_j\mathrel{-}=\eta(e\,p_k+\lambda u_j)$; predict $\langle p_i,u_j\rangle$. Fails because: SGD makes tiny steps and there are few observations ($n\gg T$), so it underfits in the starved regime.
ESTR (explore-then-spectral-refit). Intuition: a phase method: explore at random, then do ONE spectral fit, then commit. Rule: for the first $\sim 0.4T$ rounds pick randomly and accumulate the empirical $\hat M$ (unobserved entries imputed $0$); take its rank-$\hat d$ SVD $\hat M\approx U_{\hat d}\Sigma_{\hat d}V_{\hat d}^\top$, set $\hat P=U_{\hat d}\Sigma^{1/2}$, $\hat U=V_{\hat d}\Sigma^{1/2}$, then exploit $\arg\max\langle\hat P_i,\hat U_j\rangle$. Fails because: the explore phase earns $\approx$ random (anytime cost), it COMMITS (no online adaptation), and the $0$-imputation biases the SVD as masking grows.
PTF (probe-then-fit), the strongest competitor. Intuition: like ESTR but probe with UCB (targeted, not random), warm-start by SVD, THEN keep fine-tuning online. Rule: probe phase = own-row UCB1; at probe end, SVD warm-start of $\hat M$ (clipped to $[-1,1]$ for stable SVD); thereafter $\epsilon$-greedy on $\langle P,U\rangle$ with online SGD updates. Fails because: still pays a probe phase on anytime, and the UCB probe biases $\hat M$ coverage toward high-reward arms (a real completion trade-off).
BPMF (Bayesian PMF, Thompson sampling). Intuition: keep a per-drone Gaussian posterior over the factors and explore by SAMPLING from it (no $\epsilon$). Rule: conjugate updates in natural form, $\Lambda_{p_k}\mathrel{+}=u_ju_j^\top/\sigma^2$, $\eta_{p_k}\mathrel{+}=r\,u_j/\sigma^2$ (and symmetric for $u_j$); select by drawing $\tilde p\sim\mathcal N(\mu_{p_i},\Lambda_{p_i}^{-1})$, $\tilde u_j\sim\mathcal N(\mu_{u_j},\Lambda_{u_j}^{-1})$ and taking $\arg\max\langle\tilde u_j,\tilde p\rangle$. Fails because: Thompson sampling over-explores in the sample-starved anytime regime. (This is the closest baseline to our EMCF, but with sampling instead of a predictive-interval UCB.)
SoftImpute (nuclear-norm convex completion). Intuition: the convex way to complete a low-rank matrix: repeatedly fill the missing entries, SVD, and SOFT-shrink singular values. Rule: iterate $Z\leftarrow P_\Omega(\bar R)+P_\Omega^\perp(X)$, then $X\leftarrow\sum_r \max(\sigma_r-\lambda,0)\,u_rv_r^\top$ (threshold chosen to keep $\approx\hat d$ components). Fails because: it is a batch fill-in method (the unobserved entries are imputed), so under masking the table is sparse and biased and it decays, strongest at FULL broadcast.
kNN-CF (memory-based, model-free). Intuition: 'find similar drones': predict drone $i$'s reward on $j$ as a similarity-weighted average of OTHER drones who observed $j$. Rule: center each drone's observed rewards, compute cosine similarity $w_{ik}$ over co-observed targets, then $\hat R_{ij}=\bar r_i+\big(\sum_k w_{ik}(\bar r_{kj}-\bar r_k)\big)/\sum_k w_{ik}$ over drones $k$ that observed $j$. Fails because: it generalizes but earns little, and needs neighbors who actually observed $j$; it tests whether explicit factorization is even needed (it is).
BiasModel (additive, the popularity ceiling). Intuition: some targets are generally easy and some drones generally good, but no personalized interaction. Rule: $\hat R_{ij}=\mu+b_i+c_j$ (fit by alternating means over observed entries). Fails because: the prediction matrix has rank $\le 2$, so for a fixed drone the ranking is by $c_j$ alone, a single shared popularity order, which cannot personalize (the additive-ceiling result).
Common thread: the no-structure methods are at the unseen floor by construction; the batch low-rank methods (ESTR/PTF/SoftImpute) generalize but pay a probe phase and/or decay under masking because they impute missing entries; BPMF generalizes but over-explores. Our online weighted-ALS (Section 5) keeps the generalization while removing the probe phase and the imputation bias.
Notation [dist | comm | obs | prior | compute]: dist D=decentralized, C=centralized; comm 0=none (passive), B=broadcast params, full=message-passing; obs full / ρ=masked / ρσ=masked+noisy / self; prior –=none, d̂=guessed rank (itself removable: the ARD variant learns it), ARD=rank self-determined, d=true rank, U*=oracle factors; compute online / batch / ETC=explore-then-commit / mem. In-harness every method is decentralized and communication-free (one estimator per drone on the passive broadcast); the low-rank methods differ in the update rule and refinements, not the information regime. Our flagship sits in the hardest cell [D | 0 | ρσ | d̂ | online]. Our invented methods are ONE family, SwarmCF (shown as 'SwarmCF-variant (code name)'), differing only in the menu axes of the mechanism table; everything else is a standard estimator we adapt, the structure-free paradigm, or a reference ceiling.
| Method | Provenance | Dist | Comm | Observability | Prior | Compute | Profile |
|---|---|---|---|---|---|---|---|
| SwarmCF (RewardCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-B (EMCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-RC (BothCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-Ch (ChoiceCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-D+ (ContentionAdaCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-U (UnifiedCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-H (HybridCF) | ours | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| SwarmCF-B-ARD (ARD-EMCF) | ours | decentralized | 0 | ρσ | ARD | online | D|0|ρσ|ARD|online |
| SwarmCF-batch (PTF) | ours (hybrid) | decentralized | 0 | ρσ | d̂ | batch | D|0|ρσ|d̂|batch |
| MFSGD | standard, adapted | decentralized | 0 | ρσ | d̂ | online | D|0|ρσ|d̂|online |
| KNNCF | standard, adapted | decentralized | 0 | ρσ | – | mem | D|0|ρσ|–|mem |
| BiasModel | standard, adapted | decentralized | 0 | ρσ | – | online | D|0|ρσ|–|online |
| BPMF | standard, adapted | decentralized | 0 | ρσ | d̂ | batch | D|0|ρσ|d̂|batch |
| SoftImpute | standard, adapted | decentralized | 0 | ρσ | d̂ | batch | D|0|ρσ|d̂|batch |
| CLUB | standard, adapted | decentralized | 0 | ρσ | – | batch | D|0|ρσ|–|batch |
| ESTR | standard, adapted | decentralized | 0 | ρ | d̂ | ETC | D|0|ρ|d̂|ETC |
| UCBIndep | structure-free baseline | decentralized | 0 | ρσ | – | online | D|0|ρσ|–|online |
| UCBHomo | structure-free baseline | decentralized | 0 | ρσ | – | online | D|0|ρσ|–|online |
| Tabular | structure-free baseline | decentralized | 0 | ρσ | – | online | D|0|ρσ|–|online |
| Random | structure-free baseline | decentralized | 0 | – | – | – | D|0|–|–|– |
| CentralClean-ceiling | reference (upper bound) | centralized | full | full | d̂ | online | C|full|full|d̂|online |
| CTDE-ceiling | reference (upper bound) | centralized | full | fullσ | d̂ | online | C|full|fullσ|d̂|online |
| Oracle | reference (upper bound) | centralized | full | full | U* | – | C|full|full|U*|– |
And here is the design space of OUR methods by mechanism, so the named variants read as one core plus a few scoped switches (signal channel, exploration, confidence, contention, rank, coordination) rather than a zoo.
| Method | Signal channel | Exploration | Confidence | Contention | Rank | Coordination |
|---|---|---|---|---|---|---|
| SwarmCF (RewardCF) | reward | eps-greedy | none | none | fixed d̂ | implicit |
| SwarmCF-Ch (ChoiceCF) | choice | eps-greedy | none | none | fixed d̂ | implicit |
| SwarmCF-RC (BothCF) | reward+choice | eps-greedy | competence-weight | none | fixed d̂ | implicit |
| SwarmCF-B (EMCF) | reward | collective-UCB | Bayesian posterior | none | fixed d̂ | implicit |
| SwarmCF-X (ActiveCF) | reward | collective-UCB | Bayesian posterior | none | fixed d̂ | explicit (exploration division) |
| SwarmCF-Xc (CoordCF) | reward | neg-correlated-UCB | Bayesian posterior | none | fixed d̂ | explicit (no-comms division of labor) |
| SwarmCF-D (ContentionCF) | reward | eps-greedy | none | fixed private offset | fixed d̂ | explicit de-confliction (no comms) |
| SwarmCF-D+ (ContentionAdaCF) | reward | eps-greedy | none | scarcity-gated offset | fixed d̂ | explicit de-confliction (no comms) |
| SwarmCF-B-ARD (ARD-EMCF) | reward | collective-UCB | Bayesian posterior | none | ARD (self-tuned) | implicit |
| SwarmCF-H (HybridCF) | reward | probe-then-exploit | none | none | fixed d̂ | implicit |
| SwarmCF-batch (PTF) | reward | probe-then-exploit | none | none | fixed d̂ | implicit (batch refit) |
| SwarmCF-U (UnifiedCF) | reward | gated collective-UCB | Bayesian posterior | gated offset | fixed d̂ | both (conditionally) |
Each drone runs its own online weighted alternating least squares (weighted ALS) over the events it senses. The crucial trick for masking-robustness: an event the drone did NOT observe gets zero weight, not an imputed zero value. (Batch 'fill-in' methods instead pretend a missing entry is a real 0, which is why they decay as more of the broadcast is masked.) A second, optional refinement weights each OBSERVED event by its precision $1/\sigma^2$ (trust clean outcomes more than noisy ones); this helps when the broadcast is noisy, but at low noise plain uniform weighting actually generalizes a little better to unseen targets, which we show honestly in the ablation (section 8.12). Estimation is separated from the decision policy, so the policy can be epsilon-greedy, UCB, or uncertainty-directed.
Explored but not recommended (honest negatives): precision-gated fusion (BothCFPrec) and validation-stacked fusion (StackCF) do not strictly dominate the simple reward channel, which already wins for all realistic noise; the per-observation confidence GATE (model agreement) deadlocks at cold-start.
Every method above shares one estimator. Each drone keeps factor estimates $P\in\mathbb{R}^{m\times d}$ (its belief about every drone, including itself) and $U\in\mathbb{R}^{n\times d}$ (every target), fit to the noisy events it has sensed by minimizing a precision-weighted, ridge-regularized reconstruction error.
Let $\Omega$ be the set of sensed events $(k,j,r)$ (teammate $k$ engaged target $j$, observed reward $r$) and let $w_{kj}=1/\sigma^2$ be the precision of that observation (own events $1/\sigma_{\mathrm{own}}^2$, sensed teammates $1/\sigma_{\mathrm{obs}}^2$, masked events ABSENT, i.e. weight $0$). Minimize
$$ \mathcal{L}(P,U) = \sum_{(k,j,r)\in\Omega} w_{kj}\big(r - \langle p_k, u_j\rangle\big)^2 + \lambda\big(\lVert P\rVert_F^2 + \lVert U\rVert_F^2\big). $$Holding one factor fixed makes this a set of independent ridge regressions, solved in closed form (the alternating updates), sweeping a few times:
$$ u_j \leftarrow \Big(\textstyle\sum_{k:(k,j)\in\Omega} w_{kj}\,p_k p_k^\top + \lambda I\Big)^{-1}\Big(\textstyle\sum_{k:(k,j)\in\Omega} w_{kj}\,r_{kj}\,p_k\Big), $$$$ p_i \leftarrow \Big(\textstyle\sum_{j:(i,j)\in\Omega} w_{ij}\,u_j u_j^\top + \lambda I\Big)^{-1}\Big(\textstyle\sum_{j:(i,j)\in\Omega} w_{ij}\,r_{ij}\,u_j\Big). $$The prediction for ANY target (seen or unseen) is then $\hat R_{ij} = \langle p_i, u_j\rangle$. Why this beats batch fill-in: a masked event simply does not appear in the sums (weight $0$), so it contributes nothing; batch-SVD methods instead impute the missing entry as the VALUE $0$, biasing the factors as masking grows. Why precision weighting is subtle: the matrices being inverted are exactly the per-factor Gaussian posterior precisions (next subsection), so the choice of $w$ trades noise-filtering against coverage; see 5.3.
Estimation (above) is separated from the policy (how we pick). We use three policies, in increasing use of confidence:
Policy (b) is the key idea: because every probe is broadcast, the count $g_j$ is shared, so one drone probing an under-observed target lowers the uncertainty bonus for EVERYONE, active learning with zero communication. The count $1/\sqrt{g_j+1}$ is a cheap proxy for the true posterior uncertainty of target $j$, which we compute exactly next.
Under the linear-Gaussian model behind weighted-ALS, the matrices we already invert ARE Gaussian posterior precisions, so confidence comes essentially for free, the question is how to USE it.
Treating $u_j$ as Gaussian with the per-target normal matrix as its posterior precision,
$$ \Lambda_j = \sum_{k} w_{kj}\, p_k p_k^\top + \lambda I, \qquad \Sigma_j = \Lambda_j^{-1} \;=\; \mathrm{Cov}(u_j \mid \text{data}), $$and symmetrically $\Sigma_i$ for $p_i$. The PREDICTED reward $\hat R_{ij}=\langle p_i,u_j\rangle$ then has an explicit predictive variance (independent-Gaussian factors):$$ \mathrm{Var}(\hat R_{ij}) \approx p_i^\top \Sigma_j\, p_i + u_j^\top \Sigma_i\, u_j + \operatorname{tr}(\Sigma_i \Sigma_j), $$giving a confidence interval $\hat R_{ij} \pm z\sqrt{\mathrm{Var}(\hat R_{ij})}$ for every pair, INCLUDING targets the drone never tried. Three ways we put this to work:
The fully Bayesian version (probabilistic matrix factorization, $r_{ij}\sim \mathcal{N}(\langle p_i,u_j\rangle, \sigma_{ij}^2)$, $p_i,u_j\sim\mathcal{N}(0,\lambda^{-1}I)$) keeps a full posterior $q(p_i)=\mathcal{N}(\mu_i,\Sigma_i)$, $q(u_j)=\mathcal{N}(\mu_j,\Sigma_j)$ and updates by mean-field VI, the EM updates differ from ALS by propagating uncertainty through the SECOND MOMENT $\mathbb{E}[u u^\top]=\Sigma_j+\mu_j\mu_j^\top$:
$$ \Sigma_j = \Big(\lambda I + \textstyle\sum_k \beta_{kj}\,\mathbb{E}[p_k p_k^\top]\Big)^{-1}, \quad \mu_j = \Sigma_j \textstyle\sum_k \beta_{kj}\, r_{kj}\,\mu_k, \qquad \beta_{kj}=1/\sigma_{kj}^2 \ \ (\text{and symmetrically for } p_i). $$A poorly-covered or noisy factor automatically contributes less (its $\mathbb{E}[uu^\top]$ is inflated by $\Sigma$), WITHOUT us hand-tuning a weight, confidence enters the model, not an ad-hoc knob. Prediction uses the same predictive-variance formula above for a real interval; we drive a UCB and optional shrinkage from it. (This is our EMCF.)
'Confidence-aware factorization' is a real literature; here is the map, what kind of uncertainty each method yields, its cost, and what we adopt.
| Approach | Uncertainty it gives | Cost | Here |
|---|---|---|---|
| PMF (Mnih & Salakhutdinov, 2007) | none by itself (MAP point estimate) | low | = our MFSGD baseline |
| Laplace around the MAP / ALS | posterior covariance $\Sigma=\Lambda^{-1}$ = the inverse of the matrix ALS already inverts | ~free | used (predictive variance, shrinkage) |
| Variational Bayesian MF (Lim & Teh 2007; Raiko et al.) | closed-form factor posteriors via mean-field VI | moderate | used (our EMCF) |
| Bayesian PMF / MCMC (Salakhutdinov & Mnih, 2008) | full posterior samples (Gibbs); Thompson sampling | high | = our BPMF baseline |
| Online-MF Thompson sampling (Kawale et al., 2015) | sequential (particle/SMC) posterior | high | related; BPMF is the in-harness proxy |
| Debiased entrywise CIs for noisy completion (Chen, Fan, Ma & Yan, 2019, PNAS) | non-asymptotic confidence intervals per entry/factor | batch | gold standard for completion CIs; centralized |
| Low-rank / bilinear bandit ellipsoids (Jun et al., 2019) | frequentist confidence sets on the bilinear parameter | phase-structured | related; we are anytime + distributed |
'Confidence' is not one number. Per-observation precision $1/\sigma^2$ (Level 1) is the shallowest kind, and we saw it can even backfire. The useful confidence is SECOND-ORDER: how well-determined is the learned latent structure, and how much should one agent trust another agent's revealed CHOICES. Two orthogonal axes organize it: WHOSE confidence (my OWN vs a TEAMMATE's) and WHICH channel (the REWARD signal vs the CHOICE signal). That gives a 2x2 we use throughout.
| REWARD channel (continuous outcomes) | CHOICE channel (discrete actions) | |
|---|---|---|
| OWN (about me) | How well MY factor $p_i$ is pinned down by the rewards I have: posterior covariance $\Sigma_{p_i}=(\sum_j \beta_{ij}u_ju_j^\top+\lambda I)^{-1}$. Large $\Rightarrow$ explore (my taste is uncertain). | Whether I am EXPLOITING a confident model vs still exploring (my own exploit-state), $1-\epsilon_i$ or $1-\mathrm{tr}\,\Sigma_{p_i}$-driven. |
| TEAMMATE (about $k$) | How much to trust teammate $k$'s REWARD reports: their observation precision $\beta_{kj}=1/\sigma_{\mathrm{obs}}^2$ AND how well their data pin the target factor $\Sigma_{u_j}$. Drives the (bounded) fit weighting and shrinkage. | How INFORMATIVE / knowledgeable $k$'s CHOICES are: $\gamma_k=\Pr(k$ acts on its model rather than at random$)$, inferred from $k$'s public actions. Drives the choice-channel weight (ChoiceEM, Section 5.6). |
Levels 1-3 below populate this grid: Level 1 is per-observation precision (the REWARD cells, shallow); Level 2 is the posterior structure confidence (own $\Sigma_{p_i}$ and teammate $\Sigma_{u_j}$, the reward column); Level 3 is peer-policy confidence ($\gamma_k$, the teammate-choice cell). The own-choice cell is exploit-vs-explore self-assessment. Crucially OUR confidence and the TEAMMATE's are different quantities, we never see a teammate's posterior, we INFER its reliability from public behavior.
$\beta_{kj}=1/\sigma_{kj}^2$: how noisy is one sensed reward. Local, myopic. Leverage: noise-aware fit weighting, but only bounded/normalized (relcap), since over-trusting clean OWN data starves the broadcast that determines the shared structure (Section 8.12).
As events accumulate, the swarm's belief about the latent geometry sharpens. Two parts, both available in closed form from the (weighted) ALS / variational fit:
These combine into a predictive interval for any pair (seen or unseen): $\mathrm{Var}(\hat R_{ij}) = p_i^\top\Sigma_{u_j}p_i + u_j^\top\Sigma_{p_i}u_j + \operatorname{tr}(\Sigma_{p_i}\Sigma_{u_j})$.
How to leverage Level 2 (three concrete mechanisms).
A teammate $k$'s observed action $a_k^t$ is a useful PREFERENCE signal only to the extent $k$ chose it by EXPLOITING a confident model; if $k$ is still exploring, $a_k^t$ is noise. We cannot see $k$'s posterior (no parameter sharing), so we INFER $k$'s exploit-confidence from its public behavior. Let $\gamma_k^t = \Pr(k \text{ is exploiting at } t \mid k\text{'s observed actions})$. Two estimable signals (ZK: actions only):
Use $\gamma_k^t$ to WEIGHT teammate $k$'s choice observation in the choice channel: $w^{\text{choice}}_{k} \propto \gamma_k^t$. Early random choices are discounted; late model-based choices are trusted, automatically.
Level 3 above raises a real failure of the naive choice channel. When drone $i$ learns from a teammate $k$'s CHOICE $a_k$ (treating the chosen target as a revealed positive), it implicitly assumes $k$ chose because $k$'s model says $a_k$ is good. But early on, $k$ has LOW confidence in its own latent structure, so $k$ is largely EXPLORING, its choices are almost random and carry little preference information. Folding such choices in as positives corrupts the latent estimate. We want to estimate, JOINTLY with the factors, how INFORMATIVE each teammate's choices are, so non-informative choices automatically get a small weight.
Teammate $k$ offered set $S_k^t$ chooses target $a_k^t$ via a latent indicator $z_k^t\in\{\text{informative},\text{random}\}$ with $\Pr(z=\text{informative})=\gamma_k$:
$$ a_k^t \mid z_k^t=\text{informative} \ \sim\ \mathrm{softmax}_{j\in S_k^t}\!\Big(\tfrac{1}{\tau}\langle p_k, u_j\rangle\Big), \qquad a_k^t \mid z_k^t=\text{random}\ \sim\ \mathrm{Unif}(S_k^t). $$The informative component is the Boltzmann-rational / conditional-logit choice (the standard inverse-RL and discrete-choice model): a confident agent picks high-preference targets with high probability; the temperature $\tau$ sets how sharp. The marginal choice likelihood is the mixture $\;p(a_k^t) = \gamma_k\,\mathrm{softmax}_{a_k^t}(\cdot) + (1-\gamma_k)/|S_k^t|$.
E-step (per-choice responsibility = posterior that the choice was informative):
$$ r_k^t = \frac{\gamma_k\,\mathrm{softmax}_{a_k^t}\!\big(\langle \hat p_k, u\rangle/\tau\big)}{\gamma_k\,\mathrm{softmax}_{a_k^t}\!\big(\langle \hat p_k, u\rangle/\tau\big) + (1-\gamma_k)/|S_k^t|}. $$$r_k^t\to 1$ when $k$ picked what our current model predicts it prefers (looks model-based); $r_k^t\to 0$ when the choice looks no better than random. M-step:
$$ \gamma_k \leftarrow \tfrac{1}{T_k}\textstyle\sum_t r_k^t \quad(\text{closed form}); \qquad (P,U) \leftarrow \arg\max \ \underbrace{\textstyle\sum_{\text{rewards}} \beta\,\text{(reward fit)}}_{\text{reward channel}} + \underbrace{\textstyle\sum_{k,t} r_k^t\,\log\mathrm{softmax}_{a_k^t}(\langle p_k,u_j\rangle/\tau)}_{\text{choice channel, weighted by informativeness}}. $$The choice channel's contribution is scaled by $r_k^t$, so a random-looking choice barely moves the factors, exactly the desired effect. In practice the choice log-likelihood is optimized by a few gradient steps, or replaced by the implicit-feedback least-squares surrogate (positive at $a_k^t$, sampled negatives from the public active set) with weight $r_k^t$, so it slots into the same weighted-ALS solver (Section 5.1).
Why this is the right tool, and what it predicts. (i) It DEGRADES GRACEFULLY: a teammate that is exploring contributes $\approx 0$ to the latent update (low $r$), so the choice channel never poisons the structure, the core fix you asked for. (ii) $\gamma_k$ is INTERPRETABLE, drone $i$'s estimate of how 'expert' teammate $k$ currently is, learned purely from public actions (ZK). (iii) It is SELF-CALIBRATING over time: early, everyone explores so all $\gamma_k$ are low and the swarm relies on the reward channel; late, $\gamma_k$ rise and the (noise-immune) choice channel takes over, the explore-to-exploit regime shift falls out of the model rather than being scheduled. (iv) It unifies our heuristics: ChoiceCF's competence ramp is a crude proxy for $\gamma_k$, and the predictive-variance shrinkage (EMCF) is the structure-confidence half; ChoiceEM is the principled joint version.
The positive that fell out: the choice channel is not useless, it is extreme-noise insurance. At $\sigma_{\mathrm{obs}}=2.0$ the noise-immune ChoiceCF (an action's argmax is unaffected by reward noise) overtakes RewardCF on BOTH metrics, unseen $0.093$ vs $0.042$ and anytime $0.219$ vs $0.179$ (non-overlapping CIs), whereas at $\sigma_{\mathrm{obs}}=0.6$ RewardCF still dominates ($0.295$ vs $0.093$). The choice methods are flat in $\sigma_{\mathrm{obs}}$ by construction (they never read the noisy reward), so they cross over RewardCF as rewards degrade. Practical reading: fuse the channels and lean on choices only when your own reward signal is drowning, do not pay for learned per-teammate gating, the schedule is enough. We keep this as a documented negative-with-a-niche, the kind a tutorial should show.
The null above is for a HOMOGENEOUS swarm, where every teammate explores then exploits on the same schedule, so there is no per-teammate difference for $\gamma_k$ to exploit and it can only match the fixed ramp. The interesting case is HETEROGENEOUS teammates: some discover the structure earlier (and their choices are worth trusting sooner), some are unreliable. We built that test (pilot_choicehetero.py): real learners alongside SPECIAL teammates that are either ORACLE choosers (always engage their true best, maximally informative) or RANDOM choosers (faulty / off-objective). It doubles as a SANITY check, a working informativeness estimator MUST drive $\gamma$(oracle) high and $\gamma$(random) low; if it cannot, it is broken.
A uniform-random teammate is the WORST case for the mixture model: its choice is equally likely to land anywhere, so the 'is this informative?' posterior averages to $\gamma$ itself, no force pushes it down. Worse, the naive E-step scores each choice against a factor $\hat p_k$ FIT to that same teammate's choices, which OVERFITS and makes even random choices look informative. Result (8 seeds): in-sample ChoiceEM gives random teammates $\gamma\approx 0.70$ (should be low), barely below oracle's $0.95$, it FAILS the sanity. The fix is a HELD-OUT (predictive) responsibility: score each choice ONCE against the model BEFORE the refit incorporates it. A random teammate's choices are then independent of the scorer, so its $\gamma$ stays at its low prior; a genuine informer's choices are predicted even held-out, so its $\gamma$ rises. Measured: predictive $\gamma$(oracle) $\approx 0.49 \gg \gamma$(random) $\approx 0.10$ (about $5\times$, non-overlapping CIs), it PASSES the sanity, and beats in-sample ChoiceEM on skill everywhere.
A natural worry: every structured method uses a GUESSED rank $\hat d=8$ while the true rank is $d=5$, and every drone starts from the SAME guess (it is a shared hyperparameter, not learned or shared state, each drone's factors stay private). What if the guess is wrong, and can the method learn the rank from data?
Write $\hat d$ for the guessed rank, $d$ for the truth. Two regimes:
So the safe rule is GUESS GENEROUSLY ($\hat d \ge d$); the regularizer absorbs the slack. Measured (Section 8.12, $\hat d$ sweep, true $d=5$): unseen skill is $0.222$ at $\hat d=2$ (under-guess, hurts) and $\approx 0.33$-$0.36$ for $\hat d \in \{5,8,12,20\}$ (flat, robust). We deliberately run $\hat d=8 > 5$ to show no oracle knowledge of the rank is used.
Can the method ADAPT the rank as data grows? Yes, several ways (scouted). The key signal is already in hand: the structure-confidence spectrum (Level 2, Section 5.5). The number of latent directions the data CONFIDENTLY supports is the number of large eigenvalues of the aggregate design $G_U=\sum_{k,j}\beta_{kj}p_kp_k^\top$; early on only a few are well-determined, and more become supportable as coverage grows, so the EFFECTIVE rank should start small and rise toward $d$.
To compare methods fairly we put every score on the same 0-to-1 ruler, called skill: the fraction of the gap between random guessing and a cheating oracle that a method closes. Below each metric is given both formally (a formula) and in plain words.
Fix drone $i$ and a uniform random size-$c$ offer $S\subseteq[n]$. A policy that selects $a_i\in S$ earns expected reward $\rho_i = \mathbb{E}_S[R_{i,a_i}]$. Define two references from the SAME offer distribution:
$$ \mu_i = \mathbb{E}_S\Big[\tfrac{1}{c}\textstyle\sum_{j\in S} R_{ij}\Big] \;(\text{random, mean-in-offer}), \qquad \rho^{\mathrm{or}}_i = \mathbb{E}_S\big[\max_{j\in S} R_{ij}\big] \;(\text{oracle, best-in-offer}). $$$$ \mathrm{skill}_i(\rho_i) = \frac{\rho_i - \mu_i}{\rho^{\mathrm{or}}_i - \mu_i} \in (-\infty, 1], \qquad 0=\text{random},\quad 1=\text{oracle}. $$Skill is an affine rescaling of expected reward (equivalently of per-round regret $\rho^{\mathrm{or}}_i-\rho_i$), so it keeps reward's meaning while being comparable across worlds of different difficulty. The oracle is the per-round best-in-offer (the centralized, complete-information ceiling in the non-contention setting); under contention it becomes the Hungarian matching optimum (Section 8.13).
So skill = 0 means 'no better than guessing' and skill = 1 means 'as good as the cheating oracle'. A method at 0.4 has closed 40% of that gap. A difference counts only if the error bars (confidence intervals, defined at the end of this section) do not overlap. We use five flavors of skill, each answering a different question:
What it is. How good the final learned policy is when offered a random handful of targets.
$\mathrm{skill}$ of the frozen policy $\hat\pi_i(S)=\arg\max_{j\in S}\hat R_{ij}$ over offers $S$ drawn from ALL targets $[n]$: $\rho_i = \mathbb{E}_S[R_{i,\hat\pi_i(S)}]$, plugged into the skill formula above.
Why it matters. It is the all-round 'did you learn the task' score.
What to look for. Higher is better; but it can be flattered by methods that explored a lot 'for free' (see anytime).
What it is. The same score, but ONLY on targets the drone never tried itself, the 'unseen' pairs.
Let $\mathcal{U}_i = \{j : \text{drone } i \text{ never engaged } j\}$ be its unseen set after training. Restrict offers to $\mathcal{U}_i$: $\rho_i = \mathbb{E}_{S\subseteq\mathcal{U}_i}[R_{i,\hat\pi_i(S)}]$ and the references $\mu_i,\rho^{\mathrm{or}}_i$ to $\mathcal{U}_i$. A structure-free learner has $\hat R_{ij}=\text{const}$ on $\mathcal{U}_i$, so $\rho_i=\mu_i$ and $\mathrm{skill}=0$ exactly (the tabular floor).
Why it matters. This is the make-or-break test of GENERALIZATION. A learner with no shared structure literally cannot do better than guessing here (it has no information), so it scores ~0 by construction. Any score clearly above 0 means the method is genuinely transferring knowledge to things it never experienced.
What to look for. We want our methods well above 0 and the structure-free baselines stuck at 0. The GAP between them is the headline result.
What it is. The reward actually EARNED while learning, summed over the rounds ('how many targets did you destroy by round K'), not just the quality of the policy at the end.
Sum the REALIZED reward as the episode runs and normalize the cumulative quantity. With per-round totals $\text{real}_t=\sum_i R_{i,a_i^t}$, $\text{orac}_t=\sum_i\max_{j\in S_i^t}R_{ij}$, $\text{rand}_t=\sum_i\frac1c\sum_{j\in S_i^t}R_{ij}$, the anytime trajectory is
$$ \mathrm{skill}^{\mathrm{any}}_t = \frac{\sum_{\tau\le t}\text{real}_\tau - \sum_{\tau\le t}\text{rand}_\tau}{\sum_{\tau\le t}\text{orac}_\tau - \sum_{\tau\le t}\text{rand}_\tau}, $$and the headline anytime skill is $\mathrm{skill}^{\mathrm{any}}_T$ (the whole-episode value). A probe/explore phase earns $\approx\text{rand}$, so it is charged in full here, unlike overall skill.
Why it matters. In the real world you care about results DURING the mission, not only after. This metric CHARGES a method for any time it spends just exploring or probing. It is the most honest, practical score.
What to look for. Methods that explore politely while still exploiting (ours) win; methods that pause to probe, or that never stop exploring, lose here even if their final policy looks fine.
What it is. How DIFFERENT the drones' learned internal models are from each other.
Compare drones' full predicted matrices $\hat R^{(i)}=P^{(i)}(U^{(i)})^\top$ (frame-invariant, unlike a single row). Mean-center each, normalize, and take the average pairwise cosine $\bar c$; report $1-\bar c$:
$$ \text{uniq} = 1 - \frac{2}{m(m-1)}\sum_{i$\text{uniq}\to 0$ when all drones converge to the same model; bounded away from $0$ when each keeps a permanent blind set (persistent masking).Why it matters. It checks that decentralization is REAL: if every drone ended up with the same model, 'no communication' would be meaningless. It is also the quantity that distinguishes the two masking models in the theory.
What to look for. It should rise as observation gets more limited (drones see different things), and stay high over time under persistent masking (durable) but fade under i.i.d. masking (transient).
What it is. How good the prediction for a brand-new target (or a brand-new drone) gets, as a function of how many trial engagements ('probes') it has received.
For a new target with factor $u_\star$ seen via $k$ shared probes $\{(p_{a},r_a)\}_{a=1}^k$ and a basis $U$, the fold-in solves a $d$-dim least squares: $\hat u_\star=(\sum_a p_a p_a^\top+\lambda I)^{-1}\sum_a r_a p_a$, exact once $k\ge d$ and the $p_a$ span $\mathbb{R}^d$ (row completion); tabular needs $k\approx m$ (one probe per drone). We plot newcomer skill vs $k$. Strict ZK note: for a new DRONE the basis is the targets' factors $U$, which the newcomer does NOT receive from any peer; it RECOVERS $\hat U$ itself by running its own factorization on the public broadcast it passively observed (under its own mask $\rho$). So the only thing handed to a newcomer is the same partial, noisy broadcast everyone hears.
Why it matters. It measures sample efficiency for cold-start: how cheaply can the swarm absorb something new?
What to look for. Our methods should reach high skill after only about d probes (the number of hidden traits), while a structure-free learner needs roughly one probe per drone or per target.
So the metrics are reproducible, here is the exact protocol used after the $T$-round episode.
A claim 'A beats B' is only made when their bootstrap CIs do not overlap. All per-seed values are saved to results/pilots/*.json so any number can be recomputed without re-running (Section 10).
The empirical results are not luck: a short chain of arguments PREDICTS them. We give each result as: statement, why it matters, proof (key steps), and confirmed by (which experiment). Full formal proofs: docs/THEORY_FORMAL.md.
True reward $R = P U^\top$ of rank $d$, unit factors so $R_{ij}=\langle p_i,u_j\rangle\in[-1,1]$. Drone $i$ is offered a uniform size-$c$ set each round; $\mu_i$ is its mean-in-offer (random reference), $\rho^{\mathrm{or}}_i=\mathbb{E}_S[\max_{j\in S}R_{ij}]$ the oracle, and $\mathrm{skill}=(\rho-\mu_i)/(\rho^{\mathrm{or}}_i-\mu_i)$. A learner is structure-free (Definition) if its estimate $\hat R_{ij}$ depends only on drone $i$'s own past pulls of target $j$, and equals a fixed prior constant on any $j$ it never pulled (the per-arm tabular class: UCB-Indep, Tabular).
Statement. For a structure-free learner and any target j drone i never pulled, the estimate is the prior constant, so its squared error is at least the row variance $\mathrm{Var}_j(R_{ij}) = \Omega(1)$, and its expected unseen-pair skill is exactly 0.
Why it matters. This is the floor the categorical win is measured against: a structure-free swarm is helpless on anything it has not personally tried, no matter how long it or its teammates run.
Proof (key steps). (1) By definition the estimate on an unpulled j is a constant b chosen before seeing j; for any constant, E[(b - R[i,J])^2] = (b-mu_i)^2 + Var_J >= Var_J. (2) The broadcast cannot help: a per-arm estimate of R[i,j] is, by definition, not a function of any event (k,·) with k != i, nor of any target j' != j. (3) On an offer of never-pulled targets the selection is independent of their rewards, so by exchangeability the expected earned reward equals the unseen-set mean, giving skill 0.
Confirmed by. Tabular/UCBIndep unseen ~0 at every density (F2), every rank (F4), every world (Section 8.7). Corollary (pooling): a learner that pools all drones (UCBHomo) estimates the column mean $\langle \bar p, u_j\rangle$ = the rank-1 'popularity' term, so it gets PARTIAL unseen skill (~0.17), not zero and not full, exactly as measured.
Statement. If the target factors U are known (rank d) and drone i observes its true rewards on any set Omega with |Omega| >= d whose factors {u_j} span R^d, then p_i is the unique least-squares solution and $R_{ij} = \langle p_i, u_j\rangle$ is recovered EXACTLY for ALL j, including never-pulled ones.
Why it matters. This is the engine of generalization and of onboarding/cold-start: a few observations plus the shared structure determine the entire row. Per-drone sample complexity drops from $\Theta(n)$ (tabular) to $\Theta(d)$, a categorical, not incremental, change.
Proof (key steps). (1) Stacking $j\in\Omega$ gives the linear system $R_{i,\Omega} = U_\Omega\, p_i$. (2) Spanning means $U_\Omega$ has rank $d$, so $U_\Omega^\top U_\Omega$ is invertible and the solution $p_i = (U_\Omega^\top U_\Omega)^{-1} U_\Omega^\top R_{i,\Omega}$ is unique and equals the true $p_i$. (3) Then $\langle p_i, u_j\rangle$ reproduces $R_{ij}$ for every $j$. (With noise, the ridge estimate has error O(sigma^2 d / lambda_min), vanishing as data grows.) U itself is identified from the pooled broadcast by standard matrix completion once about d(m+n) entries are seen.
Confirmed by. Target onboarding at ~d shared probes (F3) and newcomer cold-start at ~d own probes (F10), both vs tabular's ~n; and unseen skill rising as the rank falls (F4).
Statement. With $n$ targets, offers of size $c$, horizon $T$, ANY structure-free learner has anytime (cumulative-reward) skill at most $g(cT/n)$, which $\to 0$ when $cT = o(n)$, EVEN with full broadcast (here $g(x)=\mathbb{E}[\max \text{ of } \lceil x\rceil \text{ i.i.d. draws}]-\mu$ is the expected-order-statistic gap). A CF learner reaches near-oracle after about $d$ rounds, so its anytime skill is $1 - O(d/T)$.
Why it matters. 'Final-policy quality' can flatter a method that explores a lot for free. The operationally honest metric is reward EARNED while learning (targets destroyed by round K). This theorem says structure-free methods cannot earn under starvation.
Proof (key steps). (1) A structure-free learner only earns above the mean on a round whose offer contains an already-pulled target; its pulled set has size <= t-1. (2) Because new targets are chosen blind to their reward, the pulled set is reward-blind, so the offered-and-pulled rewards are like a few i.i.d. draws; by concavity of expected order statistics (Jensen) the per-round surplus is <= g(c(t-1)/n). (3) Summing over t and using g increasing gives <= g(cT/n) -> 0 for cT = o(n). The broadcast does not change the pulled set (the tabular floor). (UCB is even lower: its infinite untried-arm bonus makes it explore forever while n > t.)
Confirmed by. UCBIndep stuck at ~0 anytime even at T=200 with n=240 (F9); online CF earns from round one and dominates at every horizon (F6).
Statement. Under i.i.d. per-round loss, every drone eventually senses every teammate (Borel-Cantelli), so all drones converge to the same model (heterogeneity is TRANSIENT). Under PERSISTENT masking each drone has a permanent blind set, so models stay distinct (DURABLE). The unseen and anytime results are INVARIANT to the choice.
Why it matters. It answers 'is persistent masking a cheat?': no. i.i.d. (packet-loss style) loss gives the same headline results; persistent masking is chosen only so that 'no communication implies durably different knowledge' is a structural property, not a vanishing transient.
Proof (key steps). (1) i.i.d.: each teammate is observed with prob rho > 0 each round; the events are independent and sum diverges, so each is seen infinitely often a.s.; hence every drone recovers the full true model, a common limit. (2) Persistent: a blind teammate's rows appear in NONE of drone i's observations, so they are non-identifiable for i and stay at the prior for all T; distinct blind sets give distinct models, bounded away in distance uniformly in T. (3) The categorical results use only own pulls (the tabular floor) and U-identification (row completion), which both hold under either mask.
Confirmed by. i.i.d. vs persistent give the same unseen/anytime curves, but state-uniqueness is flat in T under persistent and DECAYS under i.i.d. (F8), exactly as predicted.
Statement. A predictor of the form 'global + drone-bias + target-bias' (the BiasModel baseline) has a prediction matrix of rank at most 2, and for ranking targets it reduces to the shared 'popularity' order, so its unseen skill is the popularity skill, strictly below CF whenever there is more than one hidden trait.
Why it matters. It pinpoints exactly what is missing without the low-rank INTERACTION: additive effects (some targets are generally easier, some drones generally better) are not enough; the personalized 'which drone suits which target' part is what CF adds, and what the categorical win depends on.
Proof (key steps). The matrix a+b_i+c_j = (a·1+b) 1^T + 1 c^T is a sum of two rank-1 matrices, so rank ≤ 2. For a FIXED drone the term a+b_i is constant across targets, so the ranking is by c_j alone, the same popularity order for everyone, which ignores personalization and so underperforms CF for d>1.
Confirmed by. BiasModel unseen ~0.12 and UCBHomo ~0.17 (both additive/popularity) vs CF ~0.49 at d=5 (Section 8.9).
A full theory-vs-experiment alignment table (every prediction mapped to its measured value, with honest tensions: the anytime constant is loose though its mechanism is exact; the row-completion result idealizes 'U known' vs finite-data recovery) is in docs/THEORY_FORMAL.md.
The later confidence, contention, and rank results each have a matching statement (full versions + justifications in THEORY_FORMAL.md):
Confidence. For an UNSEEN pair, $\hat u_j$ comes only from the broadcast, so inverse-own-variance ('precision') fit weighting, which over-weights a drone's own rewards that carry no information about $\hat u_j$, inflates $\mathrm{Var}(\hat u_j)$ and is SUBOPTIMAL vs uniform; the Bayesian posterior (EMCF) is the consistent full-information estimator with a valid predictive interval. Caveat: the FULL predictive-variance UCB over-explores (the own-factor term is uniformly large early), so info-directed exploration should use the COLLECTIVE term $p_i^\top\Sigma_{u_j}p_i$. (Confirms 8.12, the precision sweep, and the EM win.)
Contention symmetry-breaking. Under a shared pool with no communication, deterministic argmax makes each same-type group of size $g$ lose $g-1$ engagements (the matching floor); a FIXED private offset $\hat R_{ij}+\varepsilon h_i[j]$ makes same-type picks a.s. distinct (collisions $\to 0$) while preserving any target with margin $>2\varepsilon\lVert h\rVert_\infty$, and only a fixed-AND-private offset is stable (per-round-random re-collides, shared-signal re-synchronizes). (Confirms the 8.13 contention win and the 4-design meta-finding.)
ARD rank. ARD retains a latent column iff the masked design excites it above the prior floor, so the recovered effective rank = the data-IDENTIFIABLE rank $\le d$ (strictly $
Precise scope. The boxed '3-condition iff' was loose. Correctly: the CATEGORICAL unseen gap vs a tabular learner needs only a shared channel ($\rho>0$) and some structure (it holds even at $d=1$, even when sample-rich); $d>1$ is what additionally beats a POPULARITY baseline; and sample-starvation is what makes the unseen edge OPERATIONALLY dominant. Three conditions, three distinct roles, not one AND against one baseline.
Precision (corrected). Our earlier claim ('uniform beats precision') over-reached. Under HETEROGENEOUS teammate noise, bounded ratio-capped precision WINS (it down-weights the noisy sources yet preserves coverage), while UNBOUNDED $1/\sigma^2$ over-concentrates on the clean ones and starves coverage. So noise-aware weighting helps iff sources differ in reliability, and only in a bounded form, matching the heterogeneous-noise sanity (8.12).
Why directed exploration helps. Given the recovered target factors, EMCF's predictive-variance UCB is exactly LinUCB on the latent features, so per-drone regret is $\tilde O(d\sqrt T)$, the formal reason confidence-directed probing beats a fixed schedule (and is what wins under churn). The JOINT case (learning the factors WHILE exploring) is open.
The keystone, now bounded (was the central open problem). Our headline 'a few samples complete the whole row' leans on a step, recovering the shared structure $U$ from the MASKED broadcast, that we previously only CITED from centralized completion theory. We now pin it to an explicit per-drone condition: drone $i$ recovers a target's factor $u_j$, and so predicts the unseen pair, EXACTLY when its own factor $p_i$ is spanned by the factors of the visible teammates that engaged $j$ (an anchor block to fix the frame, plus enough spanning engagers per target); it is provably stuck at the prior floor below that, the per-target version of the tabular floor; under noise the error is bounded by the fold-in bound with the coverage degree setting the noise term; and under ordinary exploration the condition becomes true on its own after about $nd/(\rho m)\cdot\log n$ rounds, sooner as the swarm grows. An oracle-reconstruction check on the swarm's real coverage confirms the threshold exactly: pairs meeting the span condition reconstruct to error 0.000, the rest sit at the floor. The single remaining open piece is the finite-time coverage rate when drones exploit (an adaptive policy can starve low-reward targets of coverage). A much narrower gap than 'is decentralized masked recovery even possible'. (Full statements, proofs, and the validation: docs/THEORY_FORMAL.md, docs/P15_VALIDATION.md.)
Across a structure-by-observability grid, CF beats tabular if and only if three conditions hold together: the reward is low-rank but PERSONALIZED (1 < d << min(m,n)), the regime is SAMPLE-STARVED, and the reward channel is SHARED. This scopes the claims and explains earlier null results in sample-rich regimes.
Statically, CF reaches unseen-pair skill 0.496 vs tabular 0.006. Under the natural masking regime it holds at every density (Figure F2): CF stays well above zero on unseen pairs for all rho > 0 while tabular sits at the floor; meanwhile drones' models diverge (state-uniqueness rises 0.54 to 0.92), so decentralization is real.
Takeaway. This is the core result in one picture: the blue CF line stays well above zero everywhere (it acts intelligently on things it never tried), while the tabular line hugs zero (it is helpless on the unseen, by mathematical necessity). The size of that gap, not a few percent, is what we mean by a CATEGORICAL win.
A brand-new TARGET is onboarded for the whole swarm from about d shared probes via fold-in (Figure F3); tabular needs about m probes. Symmetrically, a NEW DRONE with zero history acts from the broadcast alone, folding in its own factor from a few probes (Figure F10), starting at population-average competence while a tabular newcomer is at the floor. Both are $\Theta(d)$ vs $\Theta(n)$ separations.
Takeaway. New tasks and new teammates are absorbed cheaply: a few trial runs (about d, the number of hidden traits) are enough for the whole swarm to predict a new target, and a brand-new drone is useful from its very first moment by riding on what the swarm already learned. A structure-free swarm would need to re-try everything from scratch.
As the true rank rises there is more to complete from the same data, so CF unseen skill decreases (0.96 at d=1 to 0.10 at d=20) while tabular stays at the floor for every rank (Figure F4).
Every low-rank method clears the no-structure floor on unseen pairs, so that categorical win is a property of low-rank STRUCTURE, not of our particular estimator. Our method's specific edge is twofold. (a) Masking-robustness (Figure F5): our online weighted-ALS unseen skill is flat as the broadcast is masked, while batch-SVD hybrids (PTF/ESTR/BPMF) decay. (b) Anytime (Figure F6): on cumulative reward we dominate at every horizon; UCBIndep is stuck near random because, with $n \gg T$, it never stops exploring untried arms, and PTF/ESTR pay a probe phase.
Takeaway. If you count reward as it is actually earned (the honest, operational view), our online methods (top curves) pull ahead from the very first rounds and stay ahead. The explore-then-commit competitors are flat near zero until they stop probing, and the per-target bandit never recovers because, with far more targets than rounds, every menu still contains something it has not tried. This is where 'acting on the unseen' turns into real, early reward.
Crossing masking with reward noise (Figure F7): the reward channel (RewardCF) degrades as $\sigma_{\mathrm{obs}}$ rises while the clean choice channel (ChoiceZK) is flat. The crossover is at $\sigma_{\mathrm{obs}}\approx 1$ (noise = half the signal range): for realistic noise the reward channel dominates; the choice channel is severe-noise insurance. No learned fusion is needed.
Re-running everything under i.i.d. per-round loss (Figure F8) leaves unseen and anytime essentially unchanged (the categorical results do not depend on the masking model), while state-uniqueness is durable under persistent masking but transient under i.i.d., exactly as the theory predicts. Scaling sweeps over rank, horizon, target count, and guessed rank (Figure F9) confirm every trend, including robustness to misguessing the rank. Generality sweeps confirm the conclusions are not artifacts of the default world: they hold across drone count m (10 to 120), cluster count K (2 to 30, INCLUDING K=m, i.e. every drone its own type = no clustering = uniform latents), and latent spread (tight to diffuse). So the categorical advantage is not an artifact of clustered latent sampling.
Takeaway. It does not matter whether the missing observations are a fixed blind spot (persistent) or random dropouts (i.i.d., like lost radio packets): the headline results are the same (left/middle panels overlap). The only thing that changes is whether the drones STAY different over time (right panel): they do under a fixed blind spot, but slowly converge under random dropouts. So our modeling choice is principled, not a cheat.
The strongest competitor (PTF) once led on one metric (final-policy unseen at full broadcast); that was an artifact of our under-converged default estimator. With a converged configuration, HybridCFconv ties PTF there and beats it everywhere else, and active exploration (ActiveCFconv) improves on plain online CF on both metrics. The Pareto frontier (Figure F11) is entirely ours: ActiveCFconv (best balanced) and HybridCFconv (best unseen under masking) sit up-and-right; PTF and the rest are dominated.
Takeaway. 'Up and to the right' is best (good on both axes). Our methods (green/blue) sit in that corner; the strongest competitor (PTF, red) is pushed to the edge, buying a sliver of final-policy quality by giving up almost all earned reward, and it falls behind entirely once observation is limited. There is no metric left on which a competitor beats us in the regime that matters.
Summary numbers (skill ~0 = random; ours in green):
| Method | Class | UNSEEN @rho=1 | ANYTIME @rho=1 |
|---|---|---|---|
| Random | no-struct | 0.01 | 0.00 |
| UCBIndep | no-struct | 0.00 | -0.01 |
| Tabular | no-struct | 0.00 | 0.25 |
| PTF | low-rank | 0.51 | 0.27 |
| ESTR | low-rank | 0.23 | 0.18 |
| RewardCF (ours) | low-rank | 0.39 | 0.40 |
| HybridCFconv (ours) | low-rank | 0.49 | 0.35 |
| ActiveCFconv (ours) | low-rank | 0.49 | 0.44 |
For the camera-ready, the headline panels are confirmed at 20 seeds with bootstrap 95%% confidence intervals (a CI is a range we are 95%% confident the true mean lies in; non-overlapping intervals mean a real difference, not noise). Skill: 0 = random, 1 = oracle; ours in bold.
skill = (method - random)/(oracle - random); 0 = random, 1 = oracle. rho = fraction of broadcast observed. Ours in bold.
| method | rho=1.0 | rho=0.25 |
|---|---|---|
| UCBIndep | 0.000 [-0.004, 0.004] | -0.001 [-0.005, 0.004] |
| PTF | 0.490 [0.458, 0.517] | 0.272 [0.251, 0.293] |
| RewardCF | 0.377 [0.355, 0.398] | 0.326 [0.303, 0.349] |
| HybridCFconv | 0.488 [0.468, 0.509] | 0.379 [0.354, 0.401] |
| ActiveCFconv | 0.476 [0.461, 0.493] | 0.339 [0.317, 0.358] |
| method | rho=1.0 | rho=0.25 |
|---|---|---|
| UCBIndep | 0.002 [-0.006, 0.010] | -0.004 [-0.010, 0.002] |
| PTF | 0.273 [0.257, 0.290] | 0.226 [0.215, 0.236] |
| RewardCF | 0.383 [0.366, 0.399] | 0.342 [0.332, 0.352] |
| HybridCFconv | 0.340 [0.327, 0.354] | 0.300 [0.288, 0.311] |
| ActiveCFconv | 0.436 [0.419, 0.451] | 0.341 [0.323, 0.357] |
To be sure we are not beating a weak field, we added three more methods, each run under the exact same limits (per-drone, broadcast-only, guessed rank where relevant): SoftImpute (a different, convex way to fill in the table), kNN-CF (a model-free 'find similar drones' approach), and BiasModel (additive effects only, no personalization). Figure F12 shows the result.
Takeaway. The story holds and widens. SoftImpute is excellent when the broadcast is full (it even tops everyone on unseen at rho=1), but, like all the batch 'fill-in' methods, it collapses as observation is masked and it loses on earned reward. kNN-CF generalizes but earns little. BiasModel stays near the additive ceiling (it cannot personalize, exactly as the additive-ceiling result says). Under masking (the regime that matters) and on the anytime metric, our methods beat the best of these new baselines with non-overlapping error bars. No competitor wins in the regime that defines the problem.
tabula_drone; not to be confused with our Tabular baseline). We previously called it 'a realistic simulator', which over-claims, so we are precise: it is still a SIMULATOR (a PettingZoo environment), not real hardware or field data, and it is itself a LATENT-compatibility world, so it does not independently prove the world is low-rank. What it genuinely provides is out-of-harness transfer: a SEPARATELY-built codebase, which we did not design for this experiment, with DIFFERENT dynamics our generative model lacks, spatial geometry (drones/targets with positions and distances), target HEALTH that depletes as it is engaged (a soft CONTENTION / matching pressure), episodic resets, and its own reward/observation noise. Its oracle is a Hungarian OPTIMAL-ASSIGNMENT matcher. So this is a test of TRANSFER and partial-contention robustness, not of realism.We drop our estimator in as a drop-in policy (WeightedALS = RewardCF in the env's policy interface) and benchmark it (3 seeds, learning over 16 episodes) against the environment's OWN policies: its SGD matrix-factorization, a per-arm UCB-Indep, random, and the optimal-assignment oracle. Score = efficiency (mean reward per step over the converged second half of episodes), normalized to skill = (policy − random)/(oracle − random) (Figure F13).
Takeaway. The advantage TRANSFERS out of our own harness: our method reaches skill ~0.81 (low variance), clearly beating the environment's own SGD matrix-factorization (~0.25) and the per-arm UCB-Indep (~0.72), and approaching the matching oracle. So the result is not an artifact of our specific generative code: it survives a different implementation with spatial, depleting, episodic dynamics and a matching (contention) objective. What it does NOT show: real-world realism, or an independent confirmation of low-rankness (the sim is also latent-based); for the latter see the assumption-stress test next.
A fair worry: what if the world is not exactly low-rank? We deliberately break the assumption two ways and watch what happens (Figure F14). First, a nonlinear reward link (a power curve) that raises the effective rank; here our methods actually do a touch BETTER, because the curve sharpens the differences between targets. Second, and more importantly, we add random noise to every entry of the reward table, which drives the effective rank from 5 up toward 29 (nearly full-rank, i.e. almost no structure left).
Takeaway. The advantage degrades GRACEFULLY, not off a cliff: as the structure is progressively destroyed (effective rank 5 to 29), skill slides down smoothly and our methods stay the best the whole way, only meeting the floor when there is essentially no low-rank structure left to exploit. So the result does not hinge on the world being perfectly low-rank, only on there being SOME usable structure, which real problems have.
Our recommended method, ActiveCFconv, has four design choices stacked together. To avoid a confusing 'zoo' of methods, we turn each one OFF in turn and measure the cost, all under one identical setup (12 seeds, 95%% CIs). This shows what each piece actually buys.
Recommended method = ActiveCFconv (online weighted-ALS + latent-UCB active exploration; best balanced: top anytime, strong unseen). Each row below toggles ONE design choice; skill: 0 = random, 1 = oracle.
| variant (knob) | unseen rho=1.0 | unseen rho=0.25 | anytime rho=1.0 | anytime rho=0.25 |
|---|---|---|---|---|
| ActiveCFconv online wALS + precision + active-UCB (RECOMMENDED) | 0.485 [0.466, 0.506] | 0.360 [0.345, 0.376] | 0.440 [0.424, 0.455] | 0.348 [0.330, 0.367] |
| HybridCFconv - active; + probe->SVD warm-start (eps-greedy) | 0.501 [0.478, 0.526] | 0.396 [0.371, 0.421] | 0.353 [0.337, 0.367] | 0.308 [0.295, 0.321] |
| RewardCFconv - active, - warm-start (plain online wALS, eps-greedy) | 0.450 [0.427, 0.473] | 0.372 [0.347, 0.395] | 0.390 [0.373, 0.405] | 0.332 [0.321, 0.342] |
| RewardCFconv_noprec - precision weighting (uniform obs weights) | 0.584 [0.574, 0.594] | 0.392 [0.371, 0.413] | 0.438 [0.426, 0.452] | 0.339 [0.328, 0.349] |
| PTF batch SVD + SGD finetune (explore-then-commit) | 0.505 [0.482, 0.528] | 0.276 [0.248, 0.305] | 0.277 [0.260, 0.295] | 0.231 [0.217, 0.245] |
| ESTR explore-then-spectral commit | 0.224 [0.197, 0.251] | 0.047 [0.034, 0.059] | 0.218 [0.196, 0.241] | 0.185 [0.173, 0.197] |
| guessed rank d_hat | unseen skill | anytime skill |
|---|---|---|
| 2 | 0.222 [0.198, 0.245] | 0.270 [0.247, 0.293] |
| 5 | 0.327 [0.296, 0.356] | 0.351 [0.331, 0.369] |
| 8 | 0.360 [0.345, 0.376] | 0.348 [0.330, 0.368] |
| 12 | 0.359 [0.328, 0.390] | 0.347 [0.329, 0.365] |
| 20 | 0.344 [0.313, 0.369] | 0.340 [0.324, 0.354] |
Takeaways: (i) online weighted-ALS dominates batch / explore-then-commit (PTF, ESTR) under masking and on anytime (anytime rho=1.0: RewardCFconv 0.39 vs PTF 0.28; unseen rho=0.25: 0.37 vs 0.28). (ii) HONEST FINDING: precision weighting 1/sigma^2 does NOT help at the default broadcast noise (sigma_obs=0.3) -- UNIFORM weighting (noprec) is better on both unseen (0.584 vs 0.450 at rho=1.0) and anytime, because it uses the abundant broadcast fully (the broadcast carries the cross-target structure that unseen generalization needs). Precision weighting only pays off once the broadcast is high-noise; see PRECISION_SWEEP.md for the crossover. (iii) the SVD warm-start (HybridCFconv) lifts unseen, most at dense rho, at an anytime cost. (iv) active (latent-UCB) exploration improves anytime and dense-rho unseen over eps-greedy. (v) robust across guessed rank for d_hat >= true d (5..20 all ~0.35 unseen at rho=0.25), degrading only when d_hat=2 < true d=5.
Takeaway. (i) Learning ONLINE (updating every round) beats batch 'probe-then-commit' methods (PTF, ESTR) under masking and on earned reward. (ii) An HONEST surprise: weighting observations by PRECISION (1/sigma^2) does NOT help at the default noise; turning it OFF (uniform weights) actually gives the BEST unseen skill in the table (0.584), because uniform weighting uses the plentiful broadcast more fully, and the broadcast is exactly what lets a drone judge targets it never tried. Precision weighting only earns its keep once the broadcast is genuinely noisy (we chart the crossover separately). (iii) the SVD warm-start mostly helps when the broadcast is full, at a small cost to early earned reward; (iv) ACTIVE exploration (chase the most uncertain target) beats plain random exploration on earned reward and dense-broadcast unseen; and (v) the method is robust to a wrong guessed rank as long as the guess is at least the true rank (5 to 20 all behave the same; only guessing 2, below the true 5, hurts), so you do not need to know the rank exactly.
Because finding (ii) was a surprise, we chased it down: here is precision weighting ON vs OFF as we crank up the broadcast noise (full broadcast, so only the noise changes). It shows exactly when each choice wins.
RewardCF (converged online weighted-ALS) with precision weighting 1/sigma^2 ON vs OFF (uniform obs weights), sweeping the broadcast-observation noise sigma_obs at rho=1.0 (full broadcast; own noise fixed at sigma_own=0.10). Skill: 0=random, 1=oracle.
| sigma_obs | precision ON | precision OFF | winner |
|---|---|---|---|
| 0.10 | 0.609 [0.591, 0.625] | 0.674 [0.657, 0.690] | OFF |
| 0.30 | 0.450 [0.427, 0.473] | 0.584 [0.574, 0.595] | OFF |
| 0.60 | 0.281 [0.254, 0.305] | 0.396 [0.366, 0.423] | OFF |
| 1.00 | 0.163 [0.142, 0.183] | 0.196 [0.169, 0.219] | OFF |
| 2.00 | 0.047 [0.028, 0.065] | 0.050 [0.030, 0.070] | OFF |
| sigma_obs | precision ON | precision OFF | winner |
|---|---|---|---|
| 0.10 | 0.457 [0.448, 0.466] | 0.495 [0.484, 0.505] | OFF |
| 0.30 | 0.390 [0.373, 0.405] | 0.438 [0.425, 0.452] | OFF |
| 0.60 | 0.296 [0.280, 0.311] | 0.336 [0.316, 0.355] | OFF |
| 1.00 | 0.216 [0.194, 0.234] | 0.215 [0.195, 0.235] | ON |
| 2.00 | 0.144 [0.123, 0.164] | 0.120 [0.108, 0.135] | ON |
Takeaway: at LOW/default broadcast noise, UNIFORM weighting is better (it uses the abundant broadcast fully, which powers unseen generalization); precision weighting 1/sigma^2 only pays off once the broadcast is noisy enough that down-weighting it filters more error than coverage it loses. So the recommended default is uniform weights, switching to precision weighting when the broadcast is known to be high-noise.
Takeaway. The two choices cross over: when the broadcast is clean, uniform weighting wins (it trusts and uses all of it); when the broadcast is noisy, precision weighting wins (it correctly stops trusting the garbage). So 'weight by precision' is not free wisdom, it is the right call only in the high-noise regime; the honest default is uniform, switching to precision when you know the shared signal is noisy.
To pin down WHEN precision is the right call, we ran an obvious-expected sanity test: hold the MEAN broadcast noise fixed and vary only its HETEROGENEITY across teammates. homog = every teammate equally noisy ($\sigma_{\mathrm{obs}}=1.0$); hetero = half the teammates clean ($\sigma=0.1$), half garbage ($\sigma=1.9$), same mean. If precision weighting works, it MUST help in hetero (it can down-weight the garbage half) and need not help in homog.
Matched MEAN broadcast noise; only the HETEROGENEITY varies. homog = every teammate sigma_obs=1.0; hetero = half sigma_obs=0.1 (clean), half=1.9 (noisy). rho=1.0 (full broadcast). Does precision (1/sigma^2) fit weighting beat uniform when sources DIFFER in noise? 8 seeds, bootstrap 95%% CI.
| method | homog | hetero |
|---|---|---|
| uniform | 0.235 [0.213, 0.258] | 0.390 [0.367, 0.410] |
| full | 0.187 [0.165, 0.213] | 0.292 [0.266, 0.316] |
| relcap | 0.172 [0.150, 0.197] | 0.483 [0.454, 0.511] |
| (full - uniform) | -0.048 | -0.098 |
| (relcap - uniform) | -0.063 | +0.093 |
| method | homog | hetero |
|---|---|---|
| uniform | 0.219 [0.200, 0.236] | 0.267 [0.247, 0.285] |
| full | 0.231 [0.219, 0.244] | 0.266 [0.252, 0.281] |
| relcap | 0.223 [0.209, 0.236] | 0.356 [0.337, 0.375] |
| (full - uniform) | +0.012 | -0.001 |
| (relcap - uniform) | +0.004 | +0.089 |
Sanity: precision's edge over uniform should be ~0 or negative in HOMOG (the known result: coverage, not noise, binds; 'full' over-trusts own clean data) and clearly POSITIVE in HETERO (it correctly down-weights the noisy half). A positive HETERO delta confirms the precision machinery works and SCOPES the negative: precision is the right call exactly when observation sources differ in reliability.
Takeaway (a sharper statement of the negative). The sanity passes, with an instructive twist (8 seeds, CIs). (i) Under HOMOGENEOUS noise, precision does NOT help (uniform best, the known result: coverage, not noise, binds). (ii) Under HETEROGENEOUS noise, BOUNDED precision (relcap, ratio-capped) WINS clearly, unseen $0.483$ vs uniform $0.390$ and anytime $0.356$ vs $0.267$ (both non-overlapping CIs). (iii) But UNBOUNDED precision (full, $1/\sigma^2$) LOSES even in hetero ($-0.10$ unseen): correctly identifying the clean sources is not enough, raw $1/\sigma^2$ weights pile almost all mass on the few clean rows and the drone's own data ($\sigma=0.1\Rightarrow$ weight $100$) and STARVE coverage. So the precise scope is: noise-aware weighting is the right call exactly when observation sources DIFFER in reliability, and only in a ratio-bounded form that stays coverage-friendly. This is the same lesson as the confidence bake-off (below): the win comes from confidence that does not let any source dominate the broadcast.
Rather than settle for a trade-off, we asked: is there a confidence rule that matches uniform when the broadcast is clean AND beats it when it is noisy? The diagnosis above is the clue: naive precision weighting failed not because confidence is useless, but because its raw weights (11 to 100) dwarfed the model's regularizer, so it quietly over-fit and under-used the broadcast. We bake off the fixes against plain uniform and plain precision, across clean and noisy broadcasts: bounded/normalized precision (relcap) and, crucially, the BAYESIAN factorization with confidence intervals (EM = variational EMCF: a predictive-interval UCB; EMshrink additionally shrinks uncertain predictions toward popularity).
All variants are the same online weighted-ALS (ConfCF) or variational EM (EMCF); they differ only in HOW confidence enters. Baseline = uniform. Skill 0=random, 1=oracle, 6 seeds, bootstrap 95% CI. sigma_obs is the broadcast noise (own fixed at 0.10).
| mechanism | rho=1.00, s=0.3 | rho=1.00, s=1.0 | rho=0.25, s=0.3 | rho=0.25, s=1.0 |
|---|---|---|---|---|
| uniform | 0.594 [0.556, 0.628] | 0.196 [0.178, 0.212] | 0.376 [0.346, 0.411] | 0.139 [0.130, 0.149] |
| full | 0.449 [0.400, 0.484] | 0.165 [0.132, 0.202] | 0.370 [0.336, 0.400] | 0.125 [0.102, 0.150] |
| relcap4 | 0.581 [0.566, 0.596] | 0.197 [0.168, 0.226] | 0.378 [0.351, 0.415] | 0.145 [0.122, 0.169] |
| EM | 0.627 [0.599, 0.652] | 0.326 [0.276, 0.386] | 0.391 [0.355, 0.425] | 0.173 [0.146, 0.201] |
| EMshrink | 0.689 [0.670, 0.708] | 0.317 [0.281, 0.353] | 0.438 [0.395, 0.477] | 0.207 [0.173, 0.239] |
| mechanism | rho=1.00, s=0.3 | rho=1.00, s=1.0 | rho=0.25, s=0.3 | rho=0.25, s=1.0 |
|---|---|---|---|---|
| uniform | 0.427 [0.406, 0.442] | 0.218 [0.190, 0.241] | 0.337 [0.319, 0.356] | 0.242 [0.225, 0.260] |
| full | 0.373 [0.352, 0.387] | 0.230 [0.197, 0.258] | 0.322 [0.303, 0.341] | 0.256 [0.239, 0.275] |
| relcap4 | 0.435 [0.422, 0.448] | 0.218 [0.197, 0.235] | 0.334 [0.318, 0.349] | 0.242 [0.223, 0.262] |
| EM | 0.422 [0.404, 0.438] | 0.245 [0.225, 0.265] | 0.307 [0.286, 0.330] | 0.195 [0.177, 0.215] |
| EMshrink | 0.350 [0.316, 0.384] | 0.201 [0.175, 0.227] | 0.245 [0.221, 0.265] | 0.162 [0.150, 0.176] |
| mechanism | rho=1.00, s=0.3 | rho=1.00, s=1.0 | rho=0.25, s=0.3 | rho=0.25, s=1.0 | verdict |
|---|---|---|---|---|---|
| full | -0.099 | -0.010 | -0.010 | -0.000 | <= uniform |
| relcap4 | -0.002 | +0.001 | -0.001 | +0.003 | <= uniform |
| EM | +0.014 | +0.079 | -0.008 | -0.007 | DOMINATES uniform |
| EMshrink | +0.009 | +0.052 | -0.015 | -0.006 | mixed |
WINNER: EM is >= uniform in every condition and better in at least one (esp. high noise): a confidence mechanism that keeps the broadcast at full weight in the fit (so unseen generalization is preserved) while still being noise/coverage-aware. This is the right way to account for confidence.
Takeaway. Confidence, done right, WINS, vindicating the instinct not to give up on it. EM (put the uncertainty in the MODEL, via a Bayesian posterior, and use its predictive interval to explore) dominates plain uniform: as good when the broadcast is clean and clearly better when it is noisy (unseen 0.33 vs 0.20 at the noisy full-broadcast point), because a confident posterior knows which predictions to trust. EMshrink pushes the FINAL-policy unseen quality highest of all (0.69) by also shrinking uncertain predictions toward popularity, at some cost to early earned reward, a clean Pareto choice (use it when you care about the converged policy, EM when you care about anytime reward). The shallow rules (bounded precision relcap) merely match uniform; the gain comes from SECOND-ORDER confidence (the posterior), not from reweighting single observations.
Our main results assume two drones CAN engage the same target. What if they cannot, so engaging a target uses it up (a 'matching' problem)? We test this directly: all drones see the same small pool of targets, each target goes to ONE random drone that wanted it, and the losers earn nothing that round. We shrink the pool to crank up the competition, and we normalize earned reward by the best possible assignment (the Hungarian matching optimum).
Shared offer pool of size `pool` (smaller = more contention; m=30 drones, n=240). Each target is awarded to one random contender; losers earn 0. anytime = earned reward normalized by the per-round MATCHING optimum (Hungarian), 8 seeds, bootstrap 95% CI. unseen = preference quality evaluated contention-free AFTER training under contention (the categorical claim). rho=1.00 (full broadcast, to isolate contention from masking).
| method | pool=240 | pool=60 | pool=30 | pool=15 |
|---|---|---|---|---|
| ContentionAdaCF | 0.448 [0.422, 0.472] | 0.205 [0.186, 0.223] | 0.153 [0.132, 0.173] | 0.100 [0.086, 0.113] |
| ContentionCF | 0.314 [0.290, 0.337] | 0.178 [0.160, 0.200] | 0.134 [0.115, 0.155] | 0.105 [0.096, 0.114] |
| CBBAlite | 0.464 [0.434, 0.487] | 0.216 [0.204, 0.226] | 0.127 [0.112, 0.142] | 0.064 [0.047, 0.081] |
| MusicalChairs | 0.310 [0.279, 0.344] | 0.124 [0.108, 0.141] | 0.077 [0.053, 0.103] | 0.028 [0.019, 0.040] |
| ActiveCFconv | 0.439 [0.412, 0.464] | 0.230 [0.206, 0.252] | 0.125 [0.109, 0.145] | 0.046 [0.022, 0.071] |
| RewardCFconv | 0.439 [0.418, 0.463] | 0.199 [0.186, 0.214] | 0.121 [0.102, 0.138] | 0.059 [0.044, 0.079] |
| PTF | 0.243 [0.224, 0.260] | 0.151 [0.136, 0.168] | 0.096 [0.083, 0.110] | 0.057 [0.040, 0.075] |
| UCBIndep | 0.005 [-0.004, 0.014] | -0.002 [-0.009, 0.005] | -0.002 [-0.010, 0.006] | 0.004 [-0.004, 0.013] |
| Random | -0.010 [-0.023, -0.000] | 0.002 [-0.010, 0.014] | -0.003 [-0.010, 0.005] | -0.010 [-0.026, 0.007] |
| method | pool=240 | pool=60 | pool=30 | pool=15 |
|---|---|---|---|---|
| ContentionAdaCF | 0.320 [0.282, 0.356] | 0.302 [0.283, 0.323] | 0.352 [0.325, 0.382] | 0.295 [0.260, 0.326] |
| ContentionCF | 0.023 [0.002, 0.048] | 0.290 [0.268, 0.313] | 0.312 [0.270, 0.345] | 0.273 [0.233, 0.311] |
| CBBAlite | 0.339 [0.311, 0.365] | 0.385 [0.351, 0.421] | 0.355 [0.334, 0.379] | 0.288 [0.256, 0.322] |
| MusicalChairs | 0.120 [0.090, 0.150] | 0.257 [0.215, 0.304] | 0.298 [0.264, 0.326] | 0.275 [0.251, 0.297] |
| ActiveCFconv | 0.113 [0.079, 0.145] | 0.350 [0.319, 0.383] | 0.310 [0.287, 0.330] | 0.195 [0.165, 0.221] |
| RewardCFconv | 0.323 [0.291, 0.361] | 0.383 [0.351, 0.414] | 0.339 [0.314, 0.367] | 0.287 [0.242, 0.324] |
| PTF | 0.343 [0.333, 0.356] | 0.338 [0.302, 0.375] | 0.262 [0.232, 0.293] | 0.141 [0.114, 0.169] |
| UCBIndep | -0.007 [-0.014, -0.000] | 0.001 [-0.006, 0.005] | -0.003 [-0.009, 0.002] | -0.006 [-0.015, 0.003] |
| Random | -0.007 [-0.026, 0.014] | 0.014 [-0.002, 0.031] | -0.009 [-0.027, 0.009] | -0.005 [-0.028, 0.019] |
| method | pool=240 | pool=60 | pool=30 | pool=15 |
|---|---|---|---|---|
| ContentionAdaCF | 0.126 | 0.392 | 0.465 | 0.581 |
| ContentionCF | 0.245 | 0.466 | 0.535 | 0.625 |
| CBBAlite | 0.091 | 0.356 | 0.507 | 0.658 |
| MusicalChairs | 0.391 | 0.555 | 0.625 | 0.697 |
| ActiveCFconv | 0.207 | 0.376 | 0.503 | 0.685 |
| RewardCFconv | 0.147 | 0.371 | 0.526 | 0.660 |
| PTF | 0.187 | 0.323 | 0.448 | 0.624 |
| UCBIndep | 0.054 | 0.209 | 0.363 | 0.561 |
| Random | 0.061 | 0.211 | 0.369 | 0.563 |
Takeaway: the CATEGORICAL preference quality (unseen) is contention-invariant (CF stays high, structure-free at the floor) because it is a property of the learned model; the OPERATIONAL earned-reward gap narrows as collisions, not preferences, become the bottleneck, yet CF still leads by spreading drones across targets via diverse, accurate preferences (lower collision rate).
De-confliction primitive (a recognized MRTA baseline). CBBAlite is the canonical consensus-based auction (CBBA) with the comms/consensus step REMOVED: each drone bids its OWN CF-predicted utility on the public pool (identical model class to RewardCF) and de-conflicts by a reactive, public-loss BACKOFF. It isolates the de-confliction primitive against our fixed PRIVATE per-drone offset (Theorem 7). At severe contention (pool=15): ContentionAdaCF 0.100 vs CBBAlite 0.064 vs greedy RewardCFconv 0.059 earned. Reading: our proactive private-offset de-confliction BEATS the reactive auction-with-backoff -- a proactive STATIC private offset spreads drones once and for all, whereas a reactive SHARED backoff makes all colliders flee the same target together, re-synchronizing them.
Takeaway (three honest findings from the numbers).
The right lesson is not 'CF loses under contention' but 'argmax is the wrong DECISION rule under contention'. The CF ESTIMATE of who-suits-what is exactly as good as ever (the categorical result above); what fails is everyone greedily grabbing their single best target. So we keep the estimator and swap the policy for a contention-aware, decentralized (still communication-free) one. Three ideas, in increasing sophistication, all ZK:
ContentionCF earns $0.105\,[0.096,0.114]$ at the most severe contention (pool $=15$, $2{:}1$) versus $\approx 0.05$ for argmax-CF and PTF, non-overlapping CIs, roughly DOUBLE, lifting the swarm off the matching floor; it also leads at pool $=30$.Real swarms face CHURN: targets come and go over time. We test continuous turnover, a fixed-size active set where a batch of targets DEPARTS and an equal batch of FRESH ones ARRIVES every few rounds, and measure steady-state skill on the active set and on the very freshest arrivals. This is a clean illustration of the research process: a negative, diagnosed, turned into a win.
Active set of fixed size 200; every 5 rounds 8 targets DEPART and 8 FRESH ones ARRIVE. Steady-state skill (2nd half of a 80-round episode) on the current active set and on RECENT arrivals (active < 10 rounds). rho=1.0, 8 seeds, bootstrap 95% CI.
| method | skill on active set | skill on RECENT arrivals |
|---|---|---|
| RewardCF | 0.632 [0.610, 0.655] | 0.074 [0.059, 0.088] |
| ActiveCFconv | 0.697 [0.684, 0.709] | 0.363 [0.321, 0.402] |
| EMCF | 0.842 [0.830, 0.853] | 0.371 [0.291, 0.439] |
| Tabular | 0.448 [0.435, 0.459] | 0.062 [0.042, 0.082] |
| UCBIndep | 0.619 [0.584, 0.646] | 0.132 [0.114, 0.150] |
Read: under FAST continuous churn, PLAIN exploitative CF (RewardCF) does NOT win, it ties UCBIndep on the active set and TRAILS it on fresh arrivals (0.074 vs 0.132), because collective fold-in needs ~d probes to pin a newcomer and exploitation never probes them. The FIX is CF UNITED WITH DIRECTED EXPLORATION of the uncertain (fresh) targets: ActiveCFconv (broadcast count-bonus) and especially EMCF (predictive-variance UCB) PROBE newcomers AND fold them in via the shared structure, and they DOMINATE, on the active set EMCF 0.842 vs UCBIndep 0.619 / RewardCF 0.632, and on FRESH arrivals ActiveCFconv 0.363 and EMCF 0.371 vs UCBIndep 0.132 (all non-overlapping CIs). So non-stationarity IS handled: the win needs the variant that combines low-rank fold-in with confidence-directed probing of newcomers, neither structure-free optimism (UCBIndep) nor exploitative CF alone suffices. The arc: plain-CF negative -> diagnosis (must probe newcomers) -> confidence-directed CF win.
The negative (and its diagnosis). Plain exploitative CF (RewardCF) does NOT win under fast churn: it ties the structure-free UCBIndep on the active set and TRAILS it on the freshest arrivals ($0.074$ vs $0.132$). Why: a newcomer's factor needs about $d$ collective probes before fold-in can place it, a latency that rapid churn outpaces, and a purely exploitative policy never probes the new targets. UCBIndep, by contrast, is drawn to untried arms by its optimism, so it at least samples the arrivals.
It would be dishonest to claim CF always wins. It does not. Pinning down EXACTLY when it beats a plain learner is itself a useful result, and it comes down to three conditions that must ALL hold at once.
Why this matters. These three conditions are precisely the situation our drone swarm is in (hidden compatibility with a few factors, way more targets than rounds, a public outcome stream), which is why CF wins here, and they honestly explain the cases where it would NOT, pre-empting the reasonable question 'is this always better?' with a clear no-and-here-is-when.
Sections 5 and 8.12-8.15 presented a single core estimator plus a MENU of scoped refinements, each helping only under a stated condition. A fair question: must a practitioner pick the right extension per situation? No. The extensions COMPOSE into one self-tuning policy, UnifiedCF, because each one gates itself on a quantity the drone can observe, and switches itself off when its condition is absent.
On the SAME 8 seeds, with ONE configuration and no per-regime tuning, UnifiedCF is best-or-statistically-tied in every regime-metric we measure:
| Regime (metric) | UnifiedCF | per-regime specialist | verdict |
|---|---|---|---|
| Standard (anytime skill) | 0.437 | EMCF 0.433 | tie |
| Churn (active-set skill) | 0.851 | EMCF 0.842 | tie / edge |
| Churn (recent arrivals) | 0.347 | EMCF 0.371 | tie |
| Contention $|S|{=}15$ (earned) | 0.104 | AdaCF 0.100 / greedy 0.059 | tie / $\approx2\times$ greedy |
| Contention $|S|{=}240$ (earned) | 0.425 | greedy 0.439 / AdaCF 0.448 | tie (overlapping CIs) |
Why it matters. The design space does not merely reduce to 'a core plus a menu', it collapses to a SINGLE deployable policy that adapts itself to the regime it finds itself in (plentiful vs contested targets, stationary vs churning, clean vs noisy), with no knob the operator must set per situation. That is the strongest form of the consolidation argument.
To make sure our results are not an artifact of an abstract mask, we placed the robots and tasks in a 2-D arena and DERIVED observability from geometry: a robot senses a teammate's engagement only if the task is within its sensing radius $R$, with noise growing with distance ($\sigma(d)=\sigma_0(1+d/R)$). This is a persistent, structured, per-robot visibility, exactly the kind a real swarm has, now emerging from physics. Sweeping the sensing radius:
| sensing radius $R$ | coverage | RewardCF unseen | Tabular unseen |
|---|---|---|---|
| 0.20 | 0.11 | 0.030 | ~0 |
| 0.35 | 0.29 | 0.147 | ~0 |
| 0.50 | 0.50 | 0.229 | ~0 |
| 0.80 | 0.86 | 0.304 | ~0 |
Framed as an operational target-servicing / dispatch MISSION (each robot repeatedly services an offered target and earns the trait-match, under range-limited distance-noisy sensing), and measured on our SAME servicing skill, our method beats the WHOLE competing field, the structured low-rank methods (PTF, ESTR, BPMF, SoftImpute, MFSGD) AND the structure-free ones, under limited observability (rho=0.25): ours $\approx0.36$ vs the best of the entire field $\approx0.29$ (non-overlapping error bars), while structure-free learners sit at the random-dispatch floor. At full broadcast the structured methods catch up; the gap opens precisely under the limited observability that defines the operational regime. So the swarm dispatches the right asset to targets it never personally serviced. Applications: limited-resource servicing under degraded comms (firefighting, medical-evac, precision agriculture), search-and-service, and sensor-target matching.
| Method | Provenance | unseen skill (ρ=0.25, masked) | unseen skill (ρ=1, full) | cumulative regret (ρ=0.25, lower=better) | rounds to 25% of oracle | profile |
|---|---|---|---|---|---|---|
| SwarmCF (RewardCF) | ours | 0.336 | 0.376 | 41.2 | 35 | D|0|rho,sig|dhat|online |
| SwarmCF-RC (BothCF) | ours | 0.349 | 0.372 | 41.2 | 36 | D|0|rho,sig|dhat|online |
| SwarmCF-batch (PTF) | ours (hybrid) | 0.280 | 0.516 | 46.0 | never | D|0|rho,sig|dhat|batch |
| MFSGD | standard, adapted | -0.019 | 0.042 | 46.5 | never | D|0|rho,sig|dhat|online |
| BPMF | standard, adapted | 0.126 | 0.233 | 49.9 | never | D|0|rho,sig|dhat|batch |
| ESTR | standard, adapted | 0.058 | 0.232 | 46.4 | never | D|0|rho|dhat|ETC |
| UCBHomo | structure-free baseline | 0.070 | 0.167 | 50.3 | never | D|0|rho,sig|none|online |
| UCBIndep | structure-free baseline | 0.003 | 0.004 | 50.4 | never | D|0|rho,sig|none|online |
| Tabular | structure-free baseline | 0.003 | -0.001 | 43.0 | 46 | D|0|rho,sig|none|online |
| Random | structure-free baseline | 0.004 | 0.007 | 50.0 | never | D|0|-|none|- |
One canonical masked harness (m=30, n=240, 8 seeds): unseen skill from the bake-off, regret and time-to-competence from the anytime trajectories. Our online CF leads the masked column and the operational columns; the batch-refit PTF (also ours) wins only the full-broadcast column; structure-free sits at the floor. The confidence/contention/unified refinements (EMCF, ContentionAdaCF, UnifiedCF) and the SoftImpute / KNNCF baselines are evaluated in their dedicated experiments.
Novelty. (1) The decentralized, online, broadcast-only, per-drone-masked formulation of CF for MRTA, with the unseen-pair / onboarding categorical separations and a matching per-drone theory; standard matrix-completion theory is centralized and about estimation error, not online decision reward. (2) The anytime (cumulative-reward) separation under sample starvation. (3) The masking-model dichotomy (durable vs transient decentralization). (4) Collective active exploration via the shared broadcast.
Where we sit among MRTA and decentralized-learning paradigms. The table places the major families on the four axes that define our problem; each established paradigm relaxes at least one axis we hold fixed, and our cell (low-rank with only a guessed rank, no communication, decentralized, masked and noisy) is the one left open.
| Paradigm | Prior knowledge | Communication | Distribution | Observability | Note |
|---|---|---|---|---|---|
| Auction / CBBA (consensus bundle) | known task values/costs | message-passing (bids) | decentralized | full task info | needs communication AND known utilities |
| DCOP / consensus MRTA | known constraints/utilities | message-passing | decentralized | full | needs communication AND a known objective |
| Cooperative MARL (CTDE: MAPPO/QMIX/VDN) | none (learned) | centralized training | centralized-train / decentralized-exec | full (in training) | central critic; not comms-free |
| Learned-communication MARL (CommNet/TarMAC) | none (learned) | learned messages | decentralized | partial + messages | broadcast is learned message-passing, not passive |
| No-comms multiplayer bandits (SIC-MMAB, musical chairs) | none (per-arm) | none | decentralized | own pulls + collisions | comms-free but STRUCTURE-FREE (no unseen generalization) |
| Matrix completion (nuclear norm, spectral) | low-rank | centralized (one matrix) | centralized | partial (uniform) | centralized estimation, not online/decision |
| Low-rank / bilinear bandits (ESTR, etc.) | low-rank | centralized | centralized | partial | centralized and/or explore-then-commit |
| Federated / gossip CF | low-rank | broadcast of factors/gradients | decentralized | partial | shares PARAMETERS, not a passive stream |
| Multi-user RL, low-rank rewards (Nagaraj-Agarwal) | low-rank | centralized aggregation | centralized | partial | closest prior; centralizes trajectory aggregation |
| Trait-based MRTA (Prorok et al.) | KNOWN traits | varies | decentralized | full | capability/requirement traits are GIVEN, not learned |
| OURS (broadcast CF for ZK-MRTA) | low-rank, GUESSED rank only | none (passive sensing) | decentralized | masked + noisy | the hardest cell: no comms, no known utilities/traits, guessed rank |
All numbers come from complete per-seed JSON in results/pilots/ (registry docs/DATA_CATALOGUE.md; chronology and per-cycle reviews docs/PROJECT_LOG.md). Figures: python experiments/make_figures.py; this page: python experiments/make_tutorial.py. Harnesses: pilot_compare.py, pilot_crossover.py, pilot_anytime.py, pilot_e3_channels.py, pilot_iid.py, pilot_scaling.py, pilot_robust.py, pilot_e7_newcomer.py, pilot_e8.py; methods in pilot_noise.py and pilot_baselines.py; world/reward/oracle in core.py. Proofs: docs/THEORY_FORMAL.md; ZK audit: docs/ZK_COMPLIANCE.md; paper draft: docs/PAPER_DRAFT.md.
Plain one-line definitions of the recurring terms. For every mathematical symbol, see the Notation at a glance table at the top.
Generated from the project repository; every figure and number is regenerable from saved data. See the paper draft for the full write-up.