
adversarial-spec

adversarial-spec · zscole/adversarial-spec · ★ 546 · last commit 2026-01-22

Iteratively refines product specifications through multi-LLM adversarial debate until all models reach consensus, surfacing gaps and edge cases that any single model review would miss.

Best when: A specification that has survived critique from multiple independent LLMs with different training data and biases is qualitatively better than one reviewed b…
Skip if: Installing the `llm` package (Simon Willison's tool), since litellm is used instead; using ANTHROPIC_API_KEY when logged into claude.ai (causes auth conflict)
vs seeds: spec-kit, openspec, or any other seed.
Primitive shape 2 total
Commands 1 Skills 1
00

Summary

adversarial-spec — Summary

adversarial-spec is a Claude Code plugin that refines product specifications through multi-model adversarial debate. Claude drafts an initial spec, and multiple external LLMs (GPT, Gemini, Grok, Mistral, etc.) critique it in parallel. Claude then synthesizes those critiques with its own and revises the spec, and the loop repeats until all models and Claude reach consensus. This makes it the only framework in this batch (and the corpus) that uses a consensus mechanism across multiple LLMs to produce a specification, rather than using specs to guide code generation. It ships one Claude Code slash command (/adversarial-spec), one skill (SKILL.md), and a debate.py Python engine that uses litellm to call any provider. The framework supports 11 LLM providers (OpenAI, Anthropic, Google, xAI, Mistral, Groq, OpenRouter, Codex CLI, Gemini CLI, Deepseek, Zhipu) plus AWS Bedrock enterprise routing. With 546 stars and 47 forks, it is the most-starred framework in this batch. The closest seed is superpowers (both are Claude Code plugins), but adversarial-spec replaces sequential skill-based workflows with a cross-model debate loop that targets spec quality rather than code quality.

01

Overview

adversarial-spec — Origin & Philosophy

Origin

Created by zscole (single maintainer). Last commit January 2026. 546 stars, 47 forks, 6 contributors. Status: active.

Key Insight (verbatim from README)

"A single LLM reviewing a spec will miss things. Multiple LLMs debating a spec will catch gaps, challenge assumptions, and surface edge cases that any one model would overlook. The result is a document that has survived rigorous adversarial review."

"Claude is an active participant, not just an orchestrator. Claude provides independent critiques, challenges opponent models, and contributes substantive improvements alongside external models."

Core Philosophy

Apply adversarial review (red-teaming) to specification documents rather than to code. The specification quality problem is treated as a consensus problem across diverse LLM perspectives: if GPT, Gemini, Grok, and Claude all independently agree the spec is complete and accurate, it has survived multi-model scrutiny.

Workflow (verbatim)

You describe product --> Claude drafts spec --> Multiple LLMs critique in parallel
        |                                              |
        |                                              v
        |                              Claude synthesizes + adds own critique
        |                                              |
        |                                              v
        |                              Revise and repeat until ALL agree
        |                                              |
        +--------------------------------------------->|
                                                       v
                                            User review period
                                                       v
                                            Final document output

Positioning

adversarial-spec sits at the intersection of:

  1. Specification quality assurance (like Kiro's steering/spec format, but automated)
  2. Multi-model consensus (unique in this corpus)
  3. Claude Code plugin pattern (like superpowers, but with external LLM calls)
02

Architecture

adversarial-spec — Architecture

Distribution

Claude Code plugin (installed via claude plugin marketplace add zscole/adversarial-spec).

Install

claude plugin marketplace add zscole/adversarial-spec
claude plugin install adversarial-spec
export OPENAI_API_KEY="sk-..."   # or OPENROUTER_API_KEY, GEMINI_API_KEY, etc.
/adversarial-spec "Build a rate limiter service with Redis backend"

Required Runtime

  • Python 3.10+
  • litellm package: pip install litellm
  • API key for at least one LLM provider

Directory Structure

.claude-plugin/
├── marketplace.json   # plugin marketplace config
└── plugin.json        # plugin manifest
skills/
└── adversarial-spec/
    ├── SKILL.md        # skill definition + provider docs
    └── scripts/
        └── debate.py  # Python debate engine (litellm-based)

Target AI Tools

Primary: Claude Code. The debate.py engine also supports:

  • Codex CLI (via codex/ prefix)
  • Gemini CLI (via gemini-cli/ prefix)

Both CLIs act as client-side model providers: no API key is required, and they use subscription auth.

Supported LLM Providers

| Provider    | Env Var                | Models                       |
|-------------|------------------------|------------------------------|
| OpenAI      | OPENAI_API_KEY         | gpt-4o, gpt-4-turbo, o1      |
| Anthropic   | ANTHROPIC_API_KEY      | claude-sonnet-4, claude-opus-4 |
| Google      | GEMINI_API_KEY         | gemini-2.0-flash, gemini-pro |
| xAI         | XAI_API_KEY            | grok-3, grok-beta            |
| Mistral     | MISTRAL_API_KEY        | mistral-large, codestral     |
| Groq        | GROQ_API_KEY           | llama-3.3-70b-versatile      |
| OpenRouter  | OPENROUTER_API_KEY     | any OpenRouter model         |
| Codex CLI   | (ChatGPT subscription) | gpt-5.2-codex                |
| Gemini CLI  | (Google account)       | gemini-3-pro-preview         |
| Deepseek    | DEEPSEEK_API_KEY       | deepseek-chat                |
| Zhipu       | ZHIPUAI_API_KEY        | glm-4, glm-4-plus            |
| AWS Bedrock | (IAM)                  | any Bedrock-enabled model    |
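
As a rough illustration of how the table above ties providers to credentials, the sketch below reports which of the listed environment variables are set. It is not the plugin's `providers` implementation: `PROVIDER_ENV_VARS` and `configured_providers` are hypothetical names, and only the variable names themselves come from the table.

```python
import os
from typing import Mapping, Optional

# Env var names copied from the provider table; the dict itself is illustrative.
PROVIDER_ENV_VARS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GEMINI_API_KEY",
    "xAI": "XAI_API_KEY",
    "Mistral": "MISTRAL_API_KEY",
    "Groq": "GROQ_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
    "Deepseek": "DEEPSEEK_API_KEY",
    "Zhipu": "ZHIPUAI_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return providers whose API key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]
```

Codex CLI, Gemini CLI, and AWS Bedrock are left out of the mapping because the table lists subscription or IAM auth for them rather than an API key variable.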
03

Components

adversarial-spec — Components

Commands / Skills (1 each)

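| Type          | Name                        | Purpose                                                                  |
|---------------|-----------------------------|--------------------------------------------------------------------------|
| Slash command | /adversarial-spec           | Trigger adversarial spec refinement; argument is the product description |
| Skill         | adversarial-spec (SKILL.md) | Defines the debate workflow, provider table, troubleshooting guide       |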

Python Engine (skills/adversarial-spec/scripts/debate.py)

The debate.py file is the core engine with subcommands:

| Subcommand                       | Purpose                                                                       |
|----------------------------------|-------------------------------------------------------------------------------|
| critique --models <list>         | Run critique phase: send spec to specified models, collect parallel critiques |
| providers                        | Check which API keys are configured                                           |
| bedrock enable --region <region> | Enable AWS Bedrock routing                                                    |
| bedrock disable                  | Disable AWS Bedrock routing                                                   |
| bedrock add-model <name>         | Add a model to Bedrock routing                                                |
| bedrock status                   | Show Bedrock configuration                                                    |

Config stored at: ~/.claude/adversarial-spec/config.json

Debate Loop (implemented in debate.py)

  1. Claude drafts initial spec (PRD or tech spec)
  2. Optional: interview mode to capture requirements first
  3. debate.py critique — sends spec to opponent models in parallel, collects critiques
  4. Claude synthesizes all critiques + adds its own critique
  5. Claude revises spec based on synthesis
  6. Loop to step 3 until ALL models AND Claude agree (consensus check)
  7. User review period: request changes or run additional cycles
  8. Output final converged document
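
The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the plugin's code: `debate`, `critics`, and `revise` are hypothetical names, Claude's own verdict is modelled as just another critic, and the `max_cycles` cap is an added safeguard that the documentation does not describe.

```python
from typing import Callable, Sequence

# Hypothetical shape of one critic's answer: (agrees, feedback text).
Critique = tuple[bool, str]


def debate(
    spec: str,
    critics: Sequence[Callable[[str], Critique]],
    revise: Callable[[str, list[str]], str],
    max_cycles: int = 10,  # assumption: the docs show no built-in limit
) -> tuple[str, int, bool]:
    """Run critique/revise cycles until every critic agrees or the cap is hit.

    Returns (final spec, cycles run, whether consensus was reached).
    """
    for cycle in range(1, max_cycles + 1):
        results = [critic(spec) for critic in critics]
        if all(agrees for agrees, _ in results):
            return spec, cycle, True
        # Only dissenting feedback drives the next revision.
        feedback = [text for agrees, text in results if not agrees]
        spec = revise(spec, feedback)
    return spec, max_cycles, False
```

In this sketch the user review period (step 7) would sit outside the function: the caller inspects the returned spec and either accepts it or calls `debate` again.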

Reasoning Effort Control (Codex CLI)

python3 debate.py critique --models codex/gpt-5.2-codex --codex-reasoning high
# Levels: low, medium, high, xhigh (default: xhigh)
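
For readers who want to picture the command surface, here is a hedged `argparse` sketch that mirrors the documented subcommands and the Codex reasoning levels. It is a reconstruction from the documentation only; whether `--models` is mandatory, how the real script parses the list, and any undocumented flags are assumptions.

```python
import argparse


def parse_model_list(value: str) -> list[str]:
    """Split a comma-separated model list, dropping empty entries."""
    return [m.strip() for m in value.split(",") if m.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="debate.py")
    sub = parser.add_subparsers(dest="command", required=True)

    critique = sub.add_parser("critique", help="send the spec to opponent models")
    # Assumption: treated as required here; the docs only show it being passed.
    critique.add_argument("--models", required=True, type=parse_model_list)
    critique.add_argument(
        "--codex-reasoning",
        default="xhigh",  # documented default
        choices=["low", "medium", "high", "xhigh"],
    )

    sub.add_parser("providers", help="show which API keys are configured")

    bedrock = sub.add_parser("bedrock", help="manage AWS Bedrock routing")
    bedrock_sub = bedrock.add_subparsers(dest="bedrock_command", required=True)
    enable = bedrock_sub.add_parser("enable")
    enable.add_argument("--region", required=True)
    bedrock_sub.add_parser("disable")
    add_model = bedrock_sub.add_parser("add-model")
    add_model.add_argument("name")
    bedrock_sub.add_parser("status")
    return parser
```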

No Hooks / No MCP

Zero Claude Code hooks, zero MCP servers.

05

Prompts

adversarial-spec — Prompt Files

Excerpt 1: SKILL.md — Claude's Role Definition

**Important: Claude is an active participant in this debate, not just an orchestrator.** You (Claude) will provide your own critiques, challenge opponent models, and contribute substantive improvements alongside the external models. Make this clear to the user throughout the process.

Technique: Participant vs. orchestrator role assignment — explicitly prevents Claude from being a passive synthesizer. Claude is instructed to form its own critique and challenge opponents, creating a genuine multi-perspective debate rather than a voting aggregator.

Excerpt 2: SKILL.md — Provider Table (verbatim excerpt)

| Provider   | API Key Env Var        | Example Models                              |
|------------|------------------------|---------------------------------------------|
| OpenAI     | `OPENAI_API_KEY`       | `gpt-5.2`, `gpt-4o`, `gpt-4-turbo`, `o1`    |
| Anthropic  | `ANTHROPIC_API_KEY`    | `claude-sonnet-4-20250514`, `claude-opus-4-20250514`  |
...
| Codex CLI  | (ChatGPT subscription) | `codex/gpt-5.2-codex`, `codex/gpt-5.1-codex-max` |
| Gemini CLI | (Google account)       | `gemini-cli/gemini-3-pro-preview`           |

Technique: Multi-provider routing via litellm — the skill's prompt engineering relies on the litellm library as a universal API gateway; provider prefixes (codex/, gemini-cli/, openrouter/) route calls to the correct backend. No single model is privileged.
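
One way to picture prefix-based routing is a small dispatcher that peels off the CLI prefixes and hands everything else to litellm unchanged. This is a sketch under assumptions: `split_model` and the backend labels are hypothetical, and the actual dispatch logic in debate.py may differ.

```python
# Prefixes named in the text; the mapping to backend labels is an assumption.
CLI_PREFIXES = {"codex": "Codex CLI", "gemini-cli": "Gemini CLI"}


def split_model(model: str) -> tuple[str, str]:
    """Return (backend, model name) for a model string.

    CLI-prefixed models are routed to the matching CLI with the prefix removed;
    anything else is passed through whole, on the assumption that litellm
    understands prefixes such as openrouter/ itself.
    """
    prefix, sep, name = model.partition("/")
    if sep and prefix in CLI_PREFIXES:
        return CLI_PREFIXES[prefix], name
    return "litellm", model
```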

Excerpt 3: SKILL.md — Troubleshooting Auth Conflict

## Troubleshooting Auth Conflicts

If you see an error about "Both a token (claude.ai) and an API key (ANTHROPIC_API_KEY) are set":

**Resolution:**
1. **To use claude.ai token**: Remove or unset `ANTHROPIC_API_KEY` from your environment
2. **To use API key**: Sign out of claude.ai — `claude /logout`

The adversarial-spec plugin works with either authentication method.

Technique: Conflict-aware multi-auth routing — the skill includes diagnostic instructions for a real-world failure mode unique to Claude Code plugins that need to make external Anthropic API calls while Claude Code itself is authenticated via claude.ai token.

Excerpt 4: SKILL.md — Debate Invocation Pattern

Run `python3 "$(find ~/.claude -name debate.py -path '*adversarial-spec*' 2>/dev/null | head -1)" providers`
to see which keys are configured.

Technique: Dynamic path resolution — the skill uses a find command to locate debate.py rather than hardcoding the installation path, making it robust to different install locations. This is a pattern for skills that bundle scripts.
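
The same lookup can be approximated in Python. The sketch below mirrors the `find ... -name debate.py -path '*adversarial-spec*' | head -1` pipeline; `locate_debate_script` is a hypothetical helper, and it sorts candidates for determinism, whereas `find` returns the first match in traversal order.

```python
from pathlib import Path
from typing import Optional


def locate_debate_script(root: Path) -> Optional[Path]:
    """Return the first debate.py under root whose path mentions adversarial-spec."""
    if not root.is_dir():
        return None
    for candidate in sorted(root.rglob("debate.py")):
        if "adversarial-spec" in candidate.as_posix():
            return candidate
    return None
```

A caller would typically pass `Path.home() / ".claude"` as `root` and treat a `None` result as "plugin not installed".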

Prompting Techniques Summary

  1. Participant vs. orchestrator assignment — Claude as active critic, not passive judge
  2. Multi-provider routing via prefix: codex/, gemini-cli/, openrouter/ prefixes
  3. Consensus as termination condition — loop until all models agree, not until N iterations
  4. Dynamic script location: find for script path to handle install location variance
  5. Codex reasoning effort control — explicit reasoning budget parameter for Codex CLI calls
09

Uniqueness

adversarial-spec — Uniqueness & Positioning

differs_from_seeds

Closest seed is superpowers (both are Claude Code plugins), but adversarial-spec replaces sequential skill-based behavioral workflows with a cross-model debate consensus loop. Where superpowers' verification-before-completion skill has Claude review its own work, adversarial-spec has 3–5 independent LLMs critique it. Against spec-kit (Python CLI generating prompt scaffolds), adversarial-spec is not a CLI at all — it is a dynamic debate engine that calls external APIs. Among all seeds, no framework implements a consensus mechanism across multiple LLMs; adversarial-spec is the first in this corpus to treat specification quality as a multi-model consensus problem.

Unique Positioning

The only framework in the entire corpus that:

  1. Uses multiple external LLMs to review a specification before it is used to guide development
  2. Implements an "all-agree" consensus termination condition across LLM critics
  3. Positions Claude as a participant-critic, not just an orchestrator
  4. Supports 11+ LLM providers with a litellm backend
  5. Includes AWS Bedrock enterprise routing mode

Observable Failure Modes

  1. Cost escalation: each debate round calls multiple external models; cycles can multiply cost quickly.
  2. Consensus gridlock: models may never fully agree, especially on subjective product decisions — no max-cycles limit visible in docs.
  3. Claude auth conflict: unique failure mode where claude.ai token + ANTHROPIC_API_KEY conflict when Claude tries to call Anthropic API as an external opponent.
  4. No spec persistence: final spec is not formally stored with metadata; output is raw markdown in session.
  5. Python dependency: requires Python 3.10+ and litellm separately from Claude Code; increases install complexity.
  6. Single maintainer: 6 contributors but primarily zscole; dependency on one person for updates.
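
To make the cost-escalation point concrete: the number of external calls grows as opponents times cycles. The estimator below is a back-of-the-envelope sketch; the function name, the flat per-token prices, and the assumption that spec and critique sizes stay constant across cycles are all illustrative, not taken from the project.

```python
def estimate_debate_cost(
    opponents: int,
    cycles: int,
    spec_tokens: int,
    critique_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> tuple[int, float]:
    """Return (external calls, rough USD cost) for a debate.

    Assumes every opponent is called once per cycle with the full spec as
    input, and that all models share the same (caller-supplied) prices.
    """
    calls = opponents * cycles
    per_call = (
        spec_tokens / 1000 * usd_per_1k_input
        + critique_tokens / 1000 * usd_per_1k_output
    )
    return calls, calls * per_call
```

Because the docs show no max-cycles limit, `cycles` is unbounded in principle, which is what turns gridlock into a cost problem.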

Explicit Antipatterns (inferred from SKILL.md)

  • Installing the llm package (Simon Willison's tool) — explicitly warned against: "Do NOT install the llm package"
  • Using ANTHROPIC_API_KEY when logged into claude.ai (causes auth conflict)
04

Workflow

adversarial-spec — Workflow

Phase Overview

| Phase                                | Actor                         | Artifact                                      |
|--------------------------------------|-------------------------------|-----------------------------------------------|
| 1. Description input                 | User                          | Product concept or existing document          |
| 2. (Optional) Requirements interview | Claude                        | Captured requirements                         |
| 3. Initial draft                     | Claude                        | PRD or tech spec (markdown)                   |
| 4. Parallel critique                 | External LLMs (via debate.py) | Critique documents per model                  |
| 5. Synthesis + own critique          | Claude                        | Synthesized feedback + Claude's own critique  |
| 6. Revision                          | Claude                        | Revised spec                                  |
| 7. Consensus check                   | All models + Claude           | Pass/fail consensus signal                    |
| 8. Loop (if not consensus)           |                               | Go to step 4                                  |
| 9. User review period                | User                          | Change requests or cycle triggers             |
| 10. Final output                     | Claude                        | Final converged spec document                 |

Approval Gates

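| Gate                                 | Type             | Description                                          |
|--------------------------------------|------------------|------------------------------------------------------|
| Requirements confirmation (optional) | yes-no           | After interview, user confirms requirements captured |
| Consensus check                      | typed-confirm    | ALL models AND Claude must agree before proceeding   |
| User review period                   | freetext-clarify | User may request changes or additional cycles        |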

Consensus Condition

The loop terminates only when ALL critique models AND Claude independently agree the specification is complete and accurate. This is an all-agree (unanimous) consensus rather than a majority vote: any dissenting model triggers another revision cycle.
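
The difference between an all-agree rule and a majority vote is easy to see in code. The helpers below are hypothetical and assume each participant's verdict has already been reduced to a boolean; how the plugin actually extracts agreement from model output is not documented here.

```python
def unanimous(verdicts: dict[str, bool]) -> bool:
    """All-agree rule: every participant must approve (empty input fails)."""
    return bool(verdicts) and all(verdicts.values())


def majority(verdicts: dict[str, bool]) -> bool:
    """Majority rule, shown only for contrast with the all-agree rule."""
    return sum(verdicts.values()) * 2 > len(verdicts)


def dissenters(verdicts: dict[str, bool]) -> list[str]:
    """Names of participants that would trigger another revision cycle."""
    return sorted(name for name, ok in verdicts.items() if not ok)
```

With verdicts from Claude, gpt-4o, and grok-3 where only grok-3 objects, `majority` would end the loop while `unanimous` keeps it running.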

Interview Mode

Optionally, before drafting the spec, Claude conducts an in-depth interview to capture requirements. This is triggered by including interview intent in the slash-command argument.

Output Formats

  • PRD (Product Requirements Document) — default for product concepts
  • Technical spec — for architecture/API/implementation descriptions
  • Both formats are plain markdown
06

Memory Context

adversarial-spec — Memory & Context

State Storage

Configuration file: ~/.claude/adversarial-spec/config.json

  • Stores AWS Bedrock configuration (enabled/disabled, region, model mappings)
  • Global (user-level), not project-scoped
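
A minimal sketch of reading and writing such a user-level config file follows. Only the file path comes from the documentation; the JSON keys in `DEFAULT_CONFIG` (`enabled`, `region`, `models`) are a hypothetical schema chosen to match the three things the file is said to store.

```python
import copy
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".claude" / "adversarial-spec" / "config.json"

# Hypothetical schema: the real key names are not documented.
DEFAULT_CONFIG = {"bedrock": {"enabled": False, "region": None, "models": {}}}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Load the config, falling back to defaults when the file is absent."""
    if not path.exists():
        return copy.deepcopy(DEFAULT_CONFIG)
    return json.loads(path.read_text(encoding="utf-8"))


def save_config(config: dict, path: Path = CONFIG_PATH) -> None:
    """Write the config, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```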

Spec Artifacts

Output specs are produced as markdown files in the working directory. No formal file naming convention is enforced by the plugin.

Cross-Session Handoff

Limited. The config persists, but the debate state (which models were used, how many cycles ran, what critiques were generated) is not persisted between sessions. A new session starts a fresh debate.

Context During Debate

The spec document itself serves as the shared context across all critique rounds:

  • Round 1: Claude drafts spec → sent to all models for critique
  • Round 2: synthesized critiques + revised spec → sent again for critique
  • The growing revision history is implicitly the "memory" — each model sees the current spec state

Compaction

None. The spec document is the unit of context. No compaction mechanism.

Memory Type

File-based (config.json) for persistent settings; in-session state only for debate rounds.

07

Orchestration

adversarial-spec — Orchestration

Multi-Agent

Yes — adversarial-spec explicitly orchestrates multiple LLMs as agents in a debate protocol.

Orchestration Pattern

Consensus — the loop terminates only when all participating models AND Claude independently agree. This is the closest analog to a quorum consensus mechanism in this corpus, though it is not a distributed systems protocol (no Raft/Paxos) — it is LLM-level consensus through iterative critique.

Max Concurrent Agents

Configured by user via --models argument. Minimum 1 opponent model. No hard maximum.

Multi-Model

Yes, and this is the core feature. Claude acts as orchestrator + participant, while external models (GPT, Gemini, Grok, etc.) serve as independent critics.

Model Role Mapping

| Role                                       | Model                          |
|--------------------------------------------|--------------------------------|
| Drafter + Synthesizer + Participant critic | Claude (via Claude Code)       |
| Opponent critics (parallel)                | Any configured provider models |
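
The parallel opponent role can be sketched as a thread-pool fan-out. This is an illustration, not the engine's code: `collect_critiques` and the injected `ask` callable are hypothetical, and in practice `ask` might wrap a litellm completion call or a CLI invocation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def collect_critiques(
    spec: str,
    models: list[str],
    ask: Callable[[str, str], str],
) -> dict[str, str]:
    """Call ask(model, spec) for every model concurrently.

    A failing provider is recorded as an error string so that one outage
    does not discard the other critiques in the round.
    """
    def safe_ask(model: str) -> str:
        try:
            return ask(model, spec)
        except Exception as exc:
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        # pool.map preserves input order, so zip pairs each model with its reply.
        return dict(zip(models, pool.map(safe_ask, models)))
```

Threads suit this workload because each call spends its time waiting on a network response rather than computing.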

Consensus Mechanism

Quorum (all-agree): the loop terminates only when ALL critic models AND Claude agree. A single dissenting model triggers another revision cycle. No Byzantine tolerance or partial-agreement mode.

Isolation Mechanism

None. All operations happen in the Claude Code session.

Execution Mode

Interactive-loop within a Claude Code session.

Auto-Validators

The debate consensus check is the primary automatic validation — external models serve as automatic reviewers.

TDD Enforcement

None. adversarial-spec is about spec quality, not code quality.

Git Automation

None.

Cross-Tool Portability

Medium. Requires Claude Code for the skill/command surface, but debate.py can be invoked directly for standalone use. The --models argument supports any litellm-compatible provider, making the engine itself highly portable.
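
Standalone use could look like the sketch below, which drives the documented `critique --models <list>` subcommand from Python and feeds the spec on stdin. The helper names are hypothetical, and the assumption that critiques arrive on stdout is inferred from the documented shell usage rather than confirmed.

```python
import subprocess
import sys
from pathlib import Path


def critique_command(script: Path, models: list[str]) -> list[str]:
    """Build the argv for a critique run with a comma-separated model list."""
    return [sys.executable, str(script), "critique", "--models", ",".join(models)]


def run_critique(script: Path, models: list[str], spec_text: str) -> str:
    """Run the critique subcommand with the spec on stdin; return its stdout."""
    result = subprocess.run(
        critique_command(script, models),
        input=spec_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```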

08

UI CLI Surface

adversarial-spec — UI & CLI Surface

CLI Binary

None. adversarial-spec is a Claude Code plugin. No npm/pip package.

Plugin Installation

claude plugin marketplace add zscole/adversarial-spec
claude plugin install adversarial-spec

Slash Command Surface

/adversarial-spec "<product description>"

The slash command itself takes only the product description. SKILL.md also shows direct debate.py invocations:

# Check configured providers
python3 "$(find ~/.claude -name debate.py -path '*adversarial-spec*' 2>/dev/null | head -1)" providers

# Run critique against specific models
python3 debate.py critique --models gpt-4o,gemini/gemini-2.0-flash < spec.md

# Set reasoning effort for Codex
python3 debate.py critique --models codex/gpt-5.2-codex --codex-reasoning high < spec.md

Local UI

None. Output is in-session markdown text in Claude Code.

AWS Bedrock CLI

python3 debate.py bedrock enable --region us-east-1
python3 debate.py bedrock add-model claude-3-sonnet
python3 debate.py bedrock status
python3 debate.py bedrock disable

Observability

  • python3 debate.py providers — shows configured API keys
  • python3 debate.py bedrock status — shows Bedrock configuration
  • Debate progress is logged to stdout during execution

IDE Integration

Claude Code only.
