multi-agent · bug-hunt · arxiv · code-review

Why one agent isn't enough to find your bugs

May 26, 2026 · 13 min read

The hunter read line 340 — the factory function that creates a legacy chapter-generation worker — and called a P0. “Still live on the production queue, bypassing the new pipeline.” Two screens further down, on line 1106, was the env-gate: if (process.env.INCLUDE_BOOK_WORKER !== 'true') return. The worker was unreachable in production. The P0 was a P3 cleanup at worst.

A second hunter flagged a route handler as fail-open authentication. It had missed the router.use(requireAuth) middleware three files up.

That audit produced five P0 claims across three different methods. Zero survived an inspection phase. The interesting question is not why five hunters made the same mistakes. It is whether the answer is to use fewer agents or to use different ones.

For multiple agents to beat one, two conditions must hold: each agent reads a different slice of the code, and the agents are spread across different LLM families. We had met the first condition but not the second. One Sonnet can’t read every layer. Five Sonnets can’t either, because they share the same blind spots.

The ceiling of a single agent

Run a single Opus call against a backend of any real size and the predicted true-positive rate sits in the 50–60% range, with 30–40% false positives. CodeX-Verify (arXiv 2511.16708) measured this curve directly. The reason is that one agent reads breadth-first. It cannot simultaneously hold a route handler, the middleware that wraps it, the env-gate that disables the route, and the queue-state machine the route mutates, at least not long enough to reason about all four at once. Things get missed. The misses are systematic, not random.

CodeX-Verify proves a theorem about why adding agents helps, with one strict condition: each new agent’s judgments have to be largely independent of the others’. The paper measures this as the pairwise correlation between agent outputs (Pearson’s ρ), and the condition is ρ ≤ 0.25. That is a statement about how weakly one agent’s verdicts track another’s across many bugs, not about how often the two agree: knowing what one specialist said should tell you little about what the next one will say. As long as that holds, every additional agent strictly raises the system’s accuracy. When agents’ verdicts move together more tightly than that, the guarantee no longer applies and adding agents stops helping.

The empirical curve from the paper:

---
config:
    xyChart:
        width: 640
        height: 340
        plotReservedSpacePercent: 70
---
xychart-beta
    title "CodeX-Verify: accuracy by agent count (ρ ≤ 0.25)"
    x-axis "Number of specialists" [1, 2, 3, 4]
    y-axis "Bug-detection accuracy (%)" 0 --> 100
    bar [32.8, 47.7, 61.2, 72.4]
    line [32.8, 47.7, 61.2, 72.4]

CodeX-Verify's measured accuracy curve (arXiv 2511.16708). Four specialists at ρ ≤ 0.25 reach 72.4% — a 40-point lift over one. The line only keeps climbing while agents disagree; a fifth agent from the same model family would flatten or reverse it.

The condition does the work, not the count: a 5th independent agent gains less than 10 points, and a 5th correlated agent gains nothing.
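
To make the role of the condition concrete, here is a toy model in Python. It is not the theorem from CodeX-Verify and it will not reproduce the paper's numbers; it assumes that with probability `rho` all agents share a single verdict and otherwise decide independently, each catching a real bug with probability `p`. Under that assumption the pairwise Pearson correlation between any two agents equals `rho`. The function name and the model itself are illustrative.

```python
def any_agent_recall(p: float, n: int, rho: float) -> float:
    """Probability that at least one of n agents flags a real bug.

    Toy model (an assumption, not the paper's): with probability rho all
    agents share one verdict; otherwise each decides independently and
    catches the bug with probability p.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0 and n >= 1):
        raise ValueError("need 0 <= p <= 1, 0 <= rho <= 1 and n >= 1")
    # Chance that n independent agents do not all miss the bug.
    independent = 1.0 - (1.0 - p) ** n
    # Shared-verdict branch behaves like one agent, whatever n is.
    return rho * p + (1.0 - rho) * independent
```

Evaluating `any_agent_recall(0.3, n, rho)` for `n` from 1 to 5 shows the shape of the argument: with `rho = 0.0` the value climbs with every agent, and with `rho = 1.0` it stays at 0.3 no matter how many copies are added.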

Why five copies of one agent fail

Our five-hunter swarm was five Sonnets. Same model family. Same training distribution. Same blind spots in the same places.

What we had built was not a swarm. It was a louder single agent. The Inverse-Wisdom Law paper (2604.27274, April 2026) gives this its own name, Consensus Paradox, and a mechanism: when all members share a model family, the system “prioritizes internal architectural agreement over external logical truth.” The swarm’s confidence rises. So does its false-positive rate.

That is what produced the five P0 calls. Each hunter read a different file, but all of them applied the same severity heuristics to what they found. “SQL string concatenation in a query” reads as P0 to every Sonnet, regardless of whether the payload is internal-trusted or user-supplied. Five Sonnets reinforced each other. One Sonnet at least leaves the question open.

What “different” actually means is the next question — and it has two answers, not one.

Two diversities: roles AND model families

Only about 12% of bugs were found by more than one persona in the Multi-Agent Committees paper’s (2512.21352) 9-persona swarm — each lens catches a class the others miss. Security personas catch SQL injection and XSS. Accessibility personas catch ARIA and contrast violations. UX personas surface workflow failures.

That maps onto our session’s failure directly. The hunter that misread line 340 did not make a Sonnet mistake. It made a generalist’s mistake. A specialist whose explicit job was to trace env gates, middleware, and registration sites would have read line 1106 before submitting line 340 as a finding. The bug was not missed because there were too few agents. It was missed because none of the agents had been told to read the wiring layer.
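
As a rough illustration of what reading the wiring layer could mean mechanically, the sketch below scans a source file for early-return env gates of the shape quoted earlier. The regular expression and the function name `find_env_gates` are assumptions made for this example; a real wiring specialist would also need to follow middleware and registration sites, which a single regex cannot do.

```python
import re

# Matches guards shaped like: if (process.env.NAME !== 'true') return
# This pattern is an assumption; real code may phrase the gate differently.
ENV_GATE = re.compile(
    r"if\s*\(\s*process\.env\.(\w+)\s*!==\s*['\"]true['\"]\s*\)\s*return\b"
)


def find_env_gates(source: str) -> list[tuple[int, str]]:
    """Return (1-based line number, env var name) for each early-return gate."""
    gates = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ENV_GATE.search(line)
        if match:
            gates.append((lineno, match.group(1)))
    return gates
```

A hunter could run a check like this over the file before assigning severity, and treat any hit as a prompt to confirm whether the cited code is reachable in the default deploy.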

The replacement swarm we designed after the audit — what we’re calling the inspection partition from here on, the assignment of which agent reads which slice of the codebase — has five specialists across three model families. Sonnet reads correctness: state machines, request lifecycles. Kimi K2 reads security: CWE/OWASP, secrets, auth flows. A second Sonnet, given a different prompt frame, has the explicit job of tracing env gates, middleware, and registration sites — the lens that would have caught line 1106. GLM-4.6 reads data: migrations, JSONB invariants, foreign-key shape. Kimi K2 again reads cron: retry, cancel, queue lifecycle.

Five distinct lenses, not five copies of one. And three LLM families behind those lenses, not one — each agent is set up to be wrong in a different way.

Five heterogeneous lenses change what the swarm can see — but diverse discovery still produces candidates, not findings, until a second phase whose explicit job is disconfirmation takes over.

Discovery is recall. Validation is precision.

Discovery agents optimize for recall. They are right to. The goal of a hunter is to flag everything plausible. The goal of a validator is to reject what is wrong.

Four validators ran after the swarm finished. The first re-checked each cited bug against current code, doing drift detection: is this still true today? The second generated runtime-probe recipes for the uncertain ones. The third re-derived the new findings from scratch. The fourth applied a threat-model lens to every severity claim: who controls the input, what gate fires before the handler, is the cited code reachable in the default deploy?
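
The fourth validator's three questions can be written down as a small rule. The sketch below is one hypothetical encoding, not the arbiter we ran: the `Finding` fields, the one-level-per-mitigation rule, and the choice to floor unreachable code at P3 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

SEVERITIES = ["P0", "P1", "P2", "P3"]  # most severe first


@dataclass
class Finding:
    claimed_severity: str               # what the hunter reported, e.g. "P0"
    attacker_controls_input: bool       # who controls the input?
    gate_before_handler: bool           # does auth or middleware fire first?
    reachable_in_default_deploy: bool   # is the cited code live by default?


def arbitrate(finding: Finding) -> str:
    """Hypothetical rule: unreachable code is P3; else downgrade per mitigation."""
    if not finding.reachable_in_default_deploy:
        return "P3"
    level = SEVERITIES.index(finding.claimed_severity)
    if not finding.attacker_controls_input:
        level += 1  # trusted input: one level less severe
    if finding.gate_before_handler:
        level += 1  # a gate fires before the handler: one level less severe
    return SEVERITIES[min(level, len(SEVERITIES) - 1)]
```

Under these assumed rules, a P0 on code that is not reachable in the default deploy comes back as P3, which matches how the line 340 claim ended up.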

The validators rolled back roughly 40 findings as duplicates, already-fixed, or stale. All five P0 claims were downgraded. 28 P1s from the swarm dropped to 16.

sequenceDiagram
  autonumber
  participant Y as features.yaml
  participant H as Hunters (heterogeneous, n=5)
  participant G as Citation gate
  participant D as 3-round vote
  participant V as Validators (disconfirmation, n=4)
  participant R as Report

  Note over Y,H: PHASE 1 · DISCOVERY (parallel, recall-optimized)
  Y->>H: scope + threat model
  par 5 parallel reads
    H->>H: Sonnet · Kimi · GLM · Kimi · Sonnet
  end

  Note over H,G: PHASE 2 · CITATION GATE
  H->>G: raw NDJSON
  G->>G: drop bugs whose file:line does not resolve

  Note over G,D: PHASE 3 · DELIBERATION
  G->>D: verified findings
  loop per finding
    D->>D: propose · refine · confidence-weighted vote
  end

  Note over D,V: PHASE 4 · VALIDATION (parallel, precision-optimized)
  D->>V: deliberated bug list
  par 4 parallel disconfirmers
    V->>V: stale-check · runtime-probe · new-find verify · threat-model arbiter
  end

  Note over V,R: OUTPUT
  V->>R: validated backlog, ROI-ranked

The bug-hunt pipeline. Discovery is five hunters across three model families (Sonnet · Kimi · GLM) running in parallel; validation is four disconfirmers also running in parallel. The citation gate between them drops anything the hunters hallucinated, so the deliberation step never sees fabricated file:line refs.
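
The citation gate is the simplest phase to sketch. The version below assumes each NDJSON record carries `file` and `line` keys and that paths are relative to a local checkout; both are assumptions for this example, since the actual record schema is not shown here. It only checks that the cited line exists, not that the line says what the hunter claims.

```python
import json
from pathlib import Path


def citation_resolves(repo_root: Path, file: str, line: int) -> bool:
    """True if `file` exists under repo_root and has at least `line` lines."""
    path = repo_root / file
    if line < 1 or not path.is_file():
        return False
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle) >= line


def citation_gate(ndjson_text: str, repo_root: Path) -> list[dict]:
    """Keep only findings whose cited file:line resolves in the checkout."""
    kept = []
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines between records
        finding = json.loads(raw)
        if citation_resolves(repo_root, finding["file"], int(finding["line"])):
            kept.append(finding)
    return kept
```

Running the gate before deliberation keeps the vote cheap: a finding that points at a file or line that does not exist is dropped without spending any model calls on it.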

Without a validation phase, the swarm ships a high-FP report that burns a sprint on bugs that were not there — and the validator phase costs only $2–3 per feature.

What the literature actually requires

If you are considering a multi-agent system today, run two checks before you ship it.

First: compute pairwise correlation across the hunters’ outputs. If ρ > 0.5, you do not have a swarm. You have a louder single agent, and adding more copies will worsen the false-positive rate, not improve the true-positive rate. Diversify the model families. Sonnet + Kimi + GLM are three families with measurably different training. Five Sonnets is one family with five voices.
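
A minimal sketch of that first check, using only the standard library. It assumes each hunter's output has been reduced to a 0/1 vector over a shared list of candidate locations (1 if the hunter flagged it); that encoding, the function names, and the choice to report 0.0 for a hunter whose vector is constant are assumptions of this example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; 0.0 is a convention
    return cov / sqrt(var_x * var_y)


def correlated_pairs(verdicts: dict[str, list[float]], threshold: float = 0.5):
    """Return (agent_a, agent_b, rho) for every pair with rho above threshold."""
    flagged = []
    for a, b in combinations(sorted(verdicts), 2):
        rho = pearson(verdicts[a], verdicts[b])
        if rho > threshold:
            flagged.append((a, b, rho))
    return flagged
```

Any pair this returns is a candidate for swapping one of the two hunters to a different model family or a different lens.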

Second: separate discovery from validation. A discovery-only multi-agent system — hunters only, no validator phase — is high-recall slop. The disconfirmation phase is not polish. It is the part that turns candidate bugs into findings. If your pipeline only has the discovery half, you’re shipping the recall half of a system that needs both.

Across CodeX-Verify, the Committees paper, and the Inverse-Wisdom Law, the finding is consistent: the number of agents matters less than their independence, and the number of phases matters more than agent count. iMAD (2511.11306) takes that further — only escalate to debate the ~10–15% of bugs where validators disagree, so the expensive phase fires only when it has to.
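
The escalation rule as described here (debate only where validators disagree) fits in a few lines. This is a sketch of that rule only, not of iMAD's actual gating mechanism; the function names and the representation of a verdict as a boolean (`True` meaning the validator confirmed the bug) are assumptions.

```python
def needs_debate(validator_verdicts: list[bool]) -> bool:
    """Escalate only when validators split; unanimous calls skip debate."""
    return len(set(validator_verdicts)) > 1


def split_for_debate(findings: dict[str, list[bool]]):
    """Partition findings into settled verdicts and ids that need debate.

    Findings with no verdicts at all land in neither bucket.
    """
    settled = {
        bug_id: votes[0]
        for bug_id, votes in findings.items()
        if votes and not needs_debate(votes)
    }
    contested = [bug_id for bug_id, votes in findings.items() if needs_debate(votes)]
    return settled, contested
```

Only the `contested` list would be sent to the expensive debate phase; everything in `settled` keeps the validators' unanimous call.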

The detection fingerprint, then: a multi-agent system whose outputs agree across hunters is not a confident system. It is a correlated one. Their agreement is the signal that the same code is being read five times with the same model, instead of once each through five different lenses.

Related: We ran a 3-source bug hunt. Then we realised our validators were all Claude. — the cautionary half: same-family validators reinforce each other even when the discovery phase is heterogeneous. Read together, they bracket the design space.