A self-evolving swarm of AI auditors that catches planted errors —
ungameable, because we plant the errors and hold the answer key.
56bill line items
20planted errors
5strategy lenses
0.87best validation F1
↓ the experiment, honestly reported
The guarantee
Why it can't be gamed
Most "AI catches errors" demos quietly grade themselves. This one is built so the model
cannot fake its way to a good score.
🌱
We plant the errors
The bill is synthetic. Twenty errors of five known types are seeded deliberately — duplicate, wrong CPT, inflated price, phantom service, impossible quantity.
🔑
We hold the answer key
Ground truth is the exact set of error line ids. The detector never sees it — it's only ever used to score, never to prompt.
⚖️
F1, not recall
Precision and recall together. "Flag everything" tanks precision, so spraying the whole bill can never win — coverage has to be earned.
🔒
Test scored once
Selection runs only on the validation split. The held-out test set is scored exactly one time, at the very end — no peeking, no tuning to it.
The research story
Four findings, each one a dead end that taught us something
We didn't get a swarm on the first try. We got there by watching simpler ideas fail, on the record.
FINDING A
One detector isn't enough
A single gpt-4o-mini auditor reliably catches the obvious errors but slides right past the subtle ones — the far-apart duplicate, the CPT code quietly mislabeled as a flu shot. A lone lens has blind spots.
FINDING B
Majority voting buried the one expert who was right
Run five specialist lenses and take the majority vote, and a subtle error caught by only one specialist gets out-voted by the four that missed it. On held-out errors, majority scored nothing.
0.00held-out F1 · majority
FINDING C
Union recovered the lone specialist
Flip the rule: flag a line if any lens flags it. Now the single correct specialist is trusted instead of out-voted. The same held-out errors that majority missed came back into view.
0.67held-out F1 · union
FINDING D
Greedy selection eroded the ensemble
When evolution kept the top detectors by their individual score, it threw away low-scoring-but-unique specialists (like the CPT lens). Union coverage collapsed — motivating ensemble-fitness selection + a merge operator.
0.67→0.50union F1 · eroding
The breakthrough
Coverage-aware selection, plus merge
Select a fixed-size ensemble that maximizes the union's validation F1 — and reward
each detector for the unique errors only it catches. Now specialists survive on what they add to the team,
and a merge operator can breed new hybrids.
union validation F1 · generations 0 → 3
◇Fixed size 4 — the ensemble can never collapse to a single detector, so merge always has partners.
◇Selection rewards marginal unique true-positives: a specialist earns its slot for errors no teammate catches.
◇Validation F1 held, then climbed — 0.82 → 0.82 → 0.87 → 0.87 — instead of eroding.
🧬 A merged hybrid was born and selected. In generation 2, two surviving prompts were
combined into one child covering both their error types. It beat the survivors' union on validation
and earned a permanent place in the ensemble.
⚠ The honest result
0.87VALIDATION F1
vs
0.62TEST F1
Validation 0.87 vs Test 0.62 — a real generalization gap, deliberately exposed by
scoring the held-out test only once. We could have hidden it by tuning against the test set, or by never
holding one out. We didn't. The honesty is the point: a benchmark you can't game
includes the part where you admit it didn't fully generalize.
Domain 2 · the same engine, harder
Supply-chain anomaly detection (synthetic)
The same engine, a harder problem: catch a hidden illicit network by its verifiable anomaly signals —
and keep catching it as the adversary adapts. Synthetic data, modeled on documented public typologies
(shared address/phone, shared officers, facade mismatch). Surfaces leads, never verdicts.
BEAT 1 · CAUGHT
The swarm catches the network
Six lenses read the whole registry and flag a hidden network by its shared infrastructure —
the same registered address, the same phone, the same officers linking otherwise-unrelated shells.
BEAT 2 · ADAPTS
The adversary moves, the swarm craters
A new cohort drops every shared signal — unique address and phone each, officers no longer shared —
and switches its facade to "home appliances". The lenses go blind.
0.76→0.27union F1 · cratered
BEAT 3 · INVARIANT
But a deeper signal survives
The disguise can't hide everything: all eight new shells still consign to the same downstream
buyer. The signal sits in the data — no current lens is looking at it.
Arm B 0.00Arm A 0.00round 0
Arm B — diversity + exploration Arm A — exploit only
union F1 across 6 rounds · adversary adapts at round 3
What this shows
In a non-stationary, adversarial setting, diversity + exploration is what lets the swarm recover.
Arm A pruned itself to a single best lens — and a lens that only mutates copies of itself can never
rediscover a signal it discarded. After the adversary moved, it stayed flat at 0.55, permanently blind
to the new cohort.
Arm B kept a coverage-diverse ensemble and spent budget probing uncovered ground. Its exploration
regenerated a consignee lens, and selection adopted it the moment it mattered — when scoring the
combined network made that lens raise union F1. The recovery is emergent, gated by scope, not a
hardcoded schedule: the same lens was rejected in the earlier rounds because its hits fell outside the
active scope.
Honest note. The exploration direction is steered toward uncovered fields; the timing
of rediscovery is emergent. Synthetic data throughout — and the system surfaces leads for analysts, not
verdicts.