Self-evolving error detection

AutoSwarm

A self-evolving swarm of AI auditors that catches planted errors — ungameable, because we plant the errors and hold the answer key.

56bill line items
20planted errors
5strategy lenses
0.87best validation F1
the experiment, honestly reported
The guarantee

Why it can't be gamed

Most "AI catches errors" demos quietly grade themselves. This one is built so the model cannot fake its way to a good score.

🌱

We plant the errors

The bill is synthetic. Twenty errors of five known types are seeded deliberately — duplicate, wrong CPT, inflated price, phantom service, impossible quantity.

🔑

We hold the answer key

Ground truth is the exact set of error line ids. The detector never sees it — it's only ever used to score, never to prompt.

⚖️

F1, not recall

Precision and recall together. "Flag everything" tanks precision, so spraying the whole bill can never win — coverage has to be earned.

🔒

Test scored once

Selection runs only on the validation split. The held-out test set is scored exactly one time, at the very end — no peeking, no tuning to it.

The research story

Four findings, each one a dead end that taught us something

We didn't get a swarm on the first try. We got there by watching simpler ideas fail, on the record.

FINDING A

One detector isn't enough

A single gpt-4o-mini auditor reliably catches the obvious errors but slides right past the subtle ones — the far-apart duplicate, the CPT code quietly mislabeled as a flu shot. A lone lens has blind spots.

FINDING B

Majority voting buried the one expert who was right

Run five specialist lenses and take the majority vote, and a subtle error caught by only one specialist gets out-voted by the four that missed it. On held-out errors, majority scored nothing.

0.00held-out F1 · majority
FINDING C

Union recovered the lone specialist

Flip the rule: flag a line if any lens flags it. Now the single correct specialist is trusted instead of out-voted. The same held-out errors that majority missed came back into view.

0.67held-out F1 · union
FINDING D

Greedy selection eroded the ensemble

When evolution kept the top detectors by their individual score, it threw away low-scoring-but-unique specialists (like the CPT lens). Union coverage collapsed — motivating ensemble-fitness selection + a merge operator.

0.67→0.50union F1 · eroding
The breakthrough

Coverage-aware selection, plus merge

Select a fixed-size ensemble that maximizes the union's validation F1 — and reward each detector for the unique errors only it catches. Now specialists survive on what they add to the team, and a merge operator can breed new hybrids.

union validation F1 · generations 0 → 3
  • Fixed size 4 — the ensemble can never collapse to a single detector, so merge always has partners.
  • Selection rewards marginal unique true-positives: a specialist earns its slot for errors no teammate catches.
  • Validation F1 held, then climbed — 0.82 → 0.82 → 0.87 → 0.87 — instead of eroding.
🧬 A merged hybrid was born and selected. In generation 2, two surviving prompts were combined into one child covering both their error types. It beat the survivors' union on validation and earned a permanent place in the ensemble.
⚠ The honest result
0.87VALIDATION F1
vs
0.62TEST F1

Validation 0.87 vs Test 0.62 — a real generalization gap, deliberately exposed by scoring the held-out test only once. We could have hidden it by tuning against the test set, or by never holding one out. We didn't. The honesty is the point: a benchmark you can't game includes the part where you admit it didn't fully generalize.

Domain 2 · the same engine, harder

Supply-chain anomaly detection (synthetic)

The same engine, a harder problem: catch a hidden illicit network by its verifiable anomaly signals — and keep catching it as the adversary adapts. Synthetic data, modeled on documented public typologies (shared address/phone, shared officers, facade mismatch). Surfaces leads, never verdicts.

BEAT 1 · CAUGHT

The swarm catches the network

Six lenses read the whole registry and flag a hidden network by its shared infrastructure — the same registered address, the same phone, the same officers linking otherwise-unrelated shells.

BEAT 2 · ADAPTS

The adversary moves, the swarm craters

A new cohort drops every shared signal — unique address and phone each, officers no longer shared — and switches its facade to "home appliances". The lenses go blind.

0.76→0.27union F1 · cratered
BEAT 3 · INVARIANT

But a deeper signal survives

The disguise can't hide everything: all eight new shells still consign to the same downstream buyer. The signal sits in the data — no current lens is looking at it.

Arm B 0.00 Arm A 0.00 round 0
round union F1
Arm B — diversity + exploration Arm A — exploit only
union F1 across 6 rounds · adversary adapts at round 3

What this shows

In a non-stationary, adversarial setting, diversity + exploration is what lets the swarm recover. Arm A pruned itself to a single best lens — and a lens that only mutates copies of itself can never rediscover a signal it discarded. After the adversary moved, it stayed flat at 0.55, permanently blind to the new cohort.

Arm B kept a coverage-diverse ensemble and spent budget probing uncovered ground. Its exploration regenerated a consignee lens, and selection adopted it the moment it mattered — when scoring the combined network made that lens raise union F1. The recovery is emergent, gated by scope, not a hardcoded schedule: the same lens was rejected in the earlier rounds because its hits fell outside the active scope.

Honest note. The exploration direction is steered toward uncovered fields; the timing of rediscovery is emergent. Synthetic data throughout — and the system surfaces leads for analysts, not verdicts.