How to run an adversarial multi-agent idea search

When you ask for "groundbreaking, unsolved problems," a single-pass brainstorm betrays you. It returns the same dozen ideas everyone already lists — another agent framework, an AI memory layer, a RAG chatbot. The obvious ideas feel productive and are worthless, because their obviousness means they're already being built.

To find what's actually unsolved, the research process has to be adversarial. I built a multi-agent pipeline to do exactly that and ran it twice: once across the agentic landscape, and once scoped to gaming.

The pipeline

Generate (broad). One agent per domain lens — memory, coordination, verification, computer-use, security, economics, vertical agents, embodied, and a deliberate "wildcard" lens — each asked for sharp, specific ideas with the key technical insight and a concrete claim about why it's unsolved today. Breadth at generation is where non-obvious ideas come from.
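As a rough illustration of the fan-out, here is a minimal sketch that builds one prompt per lens. The lens names come from the list above; the Idea fields, the build_prompt helper, and the prompt wording are assumptions made for the example, not the actual prompts used.

```python
from dataclasses import dataclass

# The nine lenses named in the text. Everything else is illustrative.
LENSES = [
    "memory", "coordination", "verification", "computer-use", "security",
    "economics", "vertical agents", "embodied", "wildcard",
]


@dataclass
class Idea:
    """Hypothetical record for what each generator agent is asked to return."""
    lens: str
    title: str
    key_insight: str
    why_unsolved: str


def build_prompt(lens: str, scope: str = "the agentic landscape") -> str:
    """Hypothetical prompt wording; the real prompts are not given here."""
    return (
        f"You are exploring {scope} through the '{lens}' lens. "
        "Propose sharp, specific ideas. For each, state the key technical "
        "insight and a concrete claim about why it is unsolved today."
    )


# One prompt per lens, ready to hand to one agent each.
prompts = {lens: build_prompt(lens) for lens in LENSES}
```

Passing scope="gaming" to build_prompt would give the narrower second run the same treatment.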

Shortlist. Deduplicate and cluster near-identical ideas, cut the obvious and already-productized, and keep the distinct, strongest candidates.
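The shortlist step can be approximated with plain string similarity. The sketch below uses difflib from the Python standard library as a stand-in for whatever clustering the agents actually perform; the 0.8 threshold, the list of obvious ideas, and the sample titles are all assumptions.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Case-insensitive similarity check; the threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def shortlist(ideas: list[str], obvious: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy pass: drop obvious ideas, then keep the first of each near-duplicate group."""
    kept: list[str] = []
    for idea in ideas:
        if any(similar(idea, o, threshold) for o in obvious):
            continue  # obvious or already productized
        if any(similar(idea, k, threshold) for k in kept):
            continue  # near-identical to something already kept
        kept.append(idea)
    return kept


# Hypothetical titles, for illustration only.
ideas = [
    "Replayable audit log for agent actions",
    "replayable audit log for agent actions!",
    "RAG chatbot",
    "Cost-metered evasion benchmark",
]
survivors = shortlist(ideas, obvious=["rag chatbot", "agent framework"])
```

Character-level similarity only catches near-identical wording. Ideas that say the same thing in different words would need embeddings or an agent's judgment, which this sketch does not attempt.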

Score. Each survivor graded on six axes — novelty, technical hardness, moat, open-source fit, build feasibility, and impact.
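A scorecard for the six axes might look like the following. The axis names are the ones listed above; the 1-to-5 scale and the unweighted sum are assumptions, since the text specifies neither a scale nor weights.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scorecard:
    """One integer grade per axis. The 1-5 range is an assumed scale."""
    novelty: int
    technical_hardness: int
    moat: int
    open_source_fit: int
    build_feasibility: int
    impact: int

    def __post_init__(self) -> None:
        # Reject grades outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")

    def total(self) -> int:
        """Unweighted sum across all six axes (an assumption; weights could differ)."""
        return sum(getattr(self, f.name) for f in fields(self))


card = Scorecard(novelty=5, technical_hardness=4, moat=2,
                 open_source_fit=4, build_feasibility=3, impact=4)
print(card.total())  # 22
```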

Kill-test (the adversarial core). This is what separates the method from a brainstorm. Each top candidate faces independent skeptics, each told to default to "killed" unless they can prove otherwise, and each attacking on a distinct axis. Is this already solved (name the products and papers)? Is this actually hard, or a weekend wrapper? Is there a durable moat, or does it commoditize instantly? An idea that's killed on every axis is dead. Survivors are ranked by score minus a heavy per-kill penalty.
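The ranking rule is simple enough to sketch directly. The axis keys below paraphrase the three questions above, the penalty of 10 points per kill is an assumed value for "heavy", and the Candidate type is hypothetical. A missing verdict counts as a kill, mirroring the skeptics' default.

```python
from dataclasses import dataclass, field

# Hypothetical keys for the three skeptic axes described in the text.
KILL_AXES = ("already_solved", "not_hard", "no_moat")
KILL_PENALTY = 10  # assumed value; the text only says the penalty is heavy


@dataclass
class Candidate:
    title: str
    score: int
    # axis -> True if that skeptic killed the idea
    verdicts: dict = field(default_factory=dict)


def kill_count(c: Candidate) -> int:
    """Count kills; an axis with no verdict defaults to killed."""
    return sum(1 for axis in KILL_AXES if c.verdicts.get(axis, True))


def adjusted_score(c: Candidate) -> int:
    """Score minus the per-kill penalty."""
    return c.score - KILL_PENALTY * kill_count(c)


def rank_survivors(candidates: list[Candidate]) -> list[Candidate]:
    """Drop ideas killed on every axis, then sort the rest by adjusted score."""
    alive = [c for c in candidates if kill_count(c) < len(KILL_AXES)]
    return sorted(alive, key=adjusted_score, reverse=True)
```

With these assumed numbers, a candidate scoring 25 with one kill (adjusted 15) ranks below a candidate scoring 22 with none, which is the intended effect: a kill on any single axis should outweigh a small edge in raw score.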

Synthesize. The survivors become full dossiers — and, just as importantly, the rejected ideas are kept with the reasons they fell, because the failures are where the real lessons are.

What it produced

Across two runs, the funnel was deliberately ruthless: of roughly 96 generated ideas, only 15 survived the gauntlet — and the dossiers were refreshingly self-critical, with several rating their own moat as weak. That's the point. No hype survives an adversary whose job is to refute you.

What it taught

Three patterns recurred across nearly every survivor:

The moat is never the algorithm — it's the data corpus and the cost benchmark. The math is publishable; the defensibility is the labeled corpus and a measured quantity for how expensive evasion is.
Verification ages better than behavior classification. Defenses that attest what happened don't decay as models improve; defenses that classify behavior erode with every model generation.
The universal failure mode is adversarial mimicry. The sophisticated response isn't "perfect detection" — it's raising measured evasion cost, and owning the red-team harness that quantifies it.

One honest caveat the method enforces: the agents cite specific products and papers, and those should be treated as leads to verify, not facts. The technical shapes are sound; the citations need an independent check before anyone bets on them. Adversarial generation finds the frontier — it doesn't excuse you from doing the homework.