
Heavy3 Code Audit

heavy3-code-audit · heavy3-ai/code-audit · ★ 44 · last commit 2026-04-26

Primitive shape 2 total
Commands 1 Skills 1
00

Summary

Heavy3 Code Audit — Summary

Heavy3 Code Audit (/h3) is a multi-model consensus code review skill for Claude Code and other AI coding agents. It routes code diffs, plans, and pull requests through a council of three specialized LLMs (GPT 5.5 for correctness, Gemini 3.1 Pro for performance, Grok 4 for security) via OpenRouter, then synthesizes findings into a 3-column comparison table that surfaces where models agree (high confidence) and where they diverge. The tool ships as a Claude Code skill with a Python backend and is 100% free and open source under MIT with BYOK via OpenRouter.

The skill auto-detects what to review: uncommitted changes, a plan file, the last commit, or a numbered PR — no argument required for the common case. Council mode runs three parallel API calls, each with role-specific prompts and different web-search back-ends (Bing for correctness, Exa for security and performance), applying the "Lost in the Middle" positioning strategy from academic research. The trademark Synthesis Table differentiates it from other review tools by showing model-by-model verdict columns side-by-side.

Compared to seeds: closest to spec-kit (pre-implementation validation + code review gates) but differs architecturally — heavy3 outsources the critic role to external LLMs via API rather than using in-context skill prompts, making it the only tool in this corpus that implements genuine multi-model consensus as a code-quality mechanism rather than a multi-agent orchestration pattern.

01

Overview

Heavy3 Code Audit — Overview

Origin

Created by Heavy3.ai (heavy3.ai), a commercial AI research company. The skill was released as free and open source (MIT) as a community tool while the company explores commercial extensions. First public release circa early 2026.

Philosophy

The core thesis is that single-model code review is structurally unreliable: different LLMs fail differently. Research citations in the methodology doc show GPT-5.4 generates 2x more concurrency bugs, Gemini 3.1 Pro generates 4x more control flow mistakes, and even top models judge code correctness accurately only ~68% of the time. The solution is a council of specialized reviewers in which each model covers the domain it was selected for (based on published benchmarks) and is cross-validated against the others.

"You code with Claude. Our council (GPT + Gemini + Grok) catches what Claude misses."

"Every model has different blind spots. The GPT-5.4 / Gemini 3.1 Pro / Opus 4.5 numbers come from Sonar's Dec 2025 analysis of millions of lines of generated code."

Key Claims

  • 3-model council achieves "blockchain-grade properties of consistency and Byzantine fault tolerance" (citing Hashgraph-Inspired Consensus, 2025)
  • Plan review is "the missing piece" — no major code review tool effectively reviews architectural plans before implementation
  • Context positioning strategy applied from academic research (Liu et al., "Lost in the Middle," TACL 2024): intent first, diff second, supporting material in middle
  • Free tier ($0) uses rotating free models; Single tier ($0.01) uses DeepSeek V4 Pro; Council tier ($0.10) uses GPT 5.5 + Gemini 3.1 Pro + Grok 4

Explicit Antipatterns Named

  • Reviewing only implementation-level code (ignoring plans)
  • Single-model review for high-stakes code
  • Noisy 100K+ token contexts vs. well-selected 8K-32K contexts
02

Architecture

Heavy3 Code Audit — Architecture

Distribution

  • Type: skill-pack (Claude Code skill with Python scripts)
  • License: MIT
  • Install complexity: multi-step (clone + symlink + pip install requests + OPENROUTER_API_KEY)

Install Commands

git clone https://github.com/heavy3-ai/code-audit.git
cd code-audit && mkdir -p ~/.claude/skills && ln -sf "$(pwd)/skill" ~/.claude/skills/h3
pip install requests
echo 'OPENROUTER_API_KEY=your-key-here' > ~/.claude/skills/h3/.env

Directory Layout

skill/
├── SKILL.md              # Skill manifest: YAML frontmatter + Claude instructions
├── config.json           # User config (tier, model, context limits)
└── scripts/
    ├── review.py         # Single-model: OpenRouter API calls, prompt templates
    ├── council.py        # Multi-model: 3-parallel reviews with synthesis
    ├── license.py        # License activation and verification
    └── list-free-models.py  # OpenRouter free model discovery
docs/
├── METHODOLOGY.md        # Research-backed design rationale
├── CONFIGURATION.md
├── INSTALL-WINDOWS.md
└── TROUBLESHOOTING.md

Data Flow

  1. User invokes /h3 [target] [--flags] in Claude Code
  2. Claude reads SKILL.md, gathers context (git diff, files, docs) and compiles to JSON
  3. JSON passed to review.py or council.py via --context-file
  4. Script calls OpenRouter API — single model streams, council runs 3 in parallel
  5. Council synthesizes into 3-column comparison table; Claude proposes fixes for approval

Required Runtime

  • Python 3.x
  • requests pip package
  • OPENROUTER_API_KEY environment variable
  • Claude Code (primary) or compatible agent

Target AI Tools

  • Claude Code (primary)
  • Cursor
  • Codex CLI
  • Gemini CLI
  • Antigravity

Config File

skill/config.json — configures tier (single/council/free), model overrides, max context tokens (200K default), and per-model context limits.
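As a rough illustration of how a per-model context limit could be resolved from such a config, here is a minimal sketch. Only the `max_context_by_model` name and the 200K default come from this document; the `tier` and `max_context` keys, the model name, and the helper `resolve_context_limit` are hypothetical, and the real schema of skill/config.json may differ.

```python
import json

DEFAULT_MAX_CONTEXT_TOKENS = 200_000  # default stated in this document


def resolve_context_limit(config: dict, model: str) -> int:
    """Return the token limit for `model`, preferring a per-model override."""
    overrides = config.get("max_context_by_model", {})
    if model in overrides:
        return int(overrides[model])
    # "max_context" is an assumed key name for the global limit.
    return int(config.get("max_context", DEFAULT_MAX_CONTEXT_TOKENS))


# Hypothetical config contents, for illustration only.
example_config = json.loads("""
{
  "tier": "council",
  "max_context": 200000,
  "max_context_by_model": {"example/small-model": 32000}
}
""")
```

With this sketch, a model listed in `max_context_by_model` gets its own limit and every other model falls back to the global default.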

03

Components

Heavy3 Code Audit — Components

Skill

| Name | File | Purpose |
|------|------|---------|
| h3 | skill/SKILL.md | Main skill manifest; argument parser + smart detection logic + review orchestration instructions for Claude |

Scripts (Python — invoked by Claude during skill execution)

| Name | File | Purpose |
|------|------|---------|
| review.py | skill/scripts/review.py | Single-model review: builds context JSON, calls OpenRouter, streams response |
| council.py | skill/scripts/council.py | 3-model parallel council: specialized prompts per role, web search, synthesis table |
| license.py | skill/scripts/license.py | License activation and verification for commercial tiers |
| list-free-models.py | skill/scripts/list-free-models.py | Discovers currently-free models on OpenRouter |

Commands (exposed via skill invocation)

| Usage | What it does |
|-------|--------------|
| /h3 | Smart-detect mode: uncommitted changes → plan → ask |
| /h3 --council | Force 3-model council (GPT 5.5 + Gemini 3.1 Pro + Grok 4) |
| /h3 --free | Use rotating free model |
| /h3 pr <number> | Review specific GitHub PR |
| /h3 plan.md | Review a plan/markdown file |
| /h3 HEAD~3..HEAD | Review a commit range |
| /h3 --staged | Staged changes only |
| /h3 --commit | Last commit only |
| /h3 --model <name> | Model override (shortcuts: glm, gpt, kimi, deepseek, free) |

Council Roles

| Role | Model | Search Backend | Focus |
|------|-------|----------------|-------|
| Correctness Expert | GPT 5.5 | Bing | Bugs, logic errors, edge cases, race conditions |
| Performance Critic | Gemini 3.1 Pro | Exa | N+1 queries, memory leaks, scaling bottlenecks |
| Security Analyst | Grok 4 | Exa | Vulnerabilities, auth issues, data exposure |

Config

| File | Purpose |
|------|---------|
| skill/config.json | User-editable: tier selection, model overrides, context limits |
| ~/.claude/skills/h3/.env | OPENROUTER_API_KEY |
05

Prompts

Heavy3 Code Audit — Prompt Excerpts

Excerpt 1: Smart Detection Workflow (from skill/SKILL.md)

Technique: Decision tree + explicit confirmation gates

## Smart Detection

**When `/h3` is invoked without explicit targets, automatically detect intent and confirm with user.**

### Detection Priority

| Priority | Condition | Action |
|----------|-----------|--------|
| 1 | Explicit argument provided | Execute directly, no confirmation |
| 2 | Uncommitted changes exist | Confirm: review changes? |
| 3 | No changes + plan detected | Confirm: review the plan? |
| 4 | No changes + no plan | Ask: review commits or specify target? |

Analysis: Uses a prioritized decision tree to eliminate ambiguity at invocation. Each priority level maps to exactly one action, and every level except the first (explicit argument) asks the user before proceeding. Prevents the "what should I review?" dead-end by checking git state first.
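The priority table can be read as a short chain of checks. The sketch below is illustrative only: in the skill this logic is carried out by Claude following SKILL.md, not by Python, and the `RepoState` type, the `detect_action` function, and the action labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoState:
    explicit_target: Optional[str]  # e.g. "pr 12", "plan.md", "HEAD~3..HEAD"
    has_uncommitted_changes: bool
    plan_detected: bool


def detect_action(state: RepoState) -> tuple[str, bool]:
    """Return (action, needs_confirmation) following the four priorities."""
    if state.explicit_target is not None:
        # Priority 1: explicit argument, execute directly.
        return (f"review:{state.explicit_target}", False)
    if state.has_uncommitted_changes:
        # Priority 2: confirm a review of the uncommitted changes.
        return ("review:changes", True)
    if state.plan_detected:
        # Priority 3: no changes, but a plan was found.
        return ("review:plan", True)
    # Priority 4: nothing to review, ask the user what to target.
    return ("ask:commits-or-target", True)
```

Because the checks run in order, uncommitted changes take precedence over a detected plan, matching rows 2 and 3 of the table.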


Excerpt 2: Council Role Differentiation (from docs/METHODOLOGY.md)

Technique: Role-specialized system prompts with distinct model selection rationale

## The Council

Three specialized reviewers, each with web search:

| Role | Model | Focus | Search |
|------|-------|-------|--------|
| **Correctness Expert** | GPT 5.5 | Bugs, logic errors, edge cases, race conditions | Bing |
| **Performance Critic** | Gemini 3.1 Pro | N+1 queries, memory leaks, scaling bottlenecks | Exa |
| **Security Analyst** | Grok 4 | Vulnerabilities, auth issues, data exposure | Exa |

**Why Grok 4 for Security?**
Grok 4 was selected as Security Analyst based on independent security benchmarks:

| Benchmark | Score |
|-----------|-------|
| Kilo AI Exploit Test | 100% detection on advanced exploits |
| WMDP-Cyber | 79-81% accuracy (vulnerability detection, reverse engineering) |

Analysis: Evidence-based model selection — each role is assigned to a model based on published benchmark performance in that domain, not by default. This is "research-anchored role assignment," distinct from generic "use GPT for X" heuristics.


Excerpt 3: Synthesis Table Output Format (from docs/METHODOLOGY.md)

Technique: Structured comparison table as a deterministic output contract

## The Synthesis Table (Trademark Feature)

| Aspect | Correctness (GPT 5.5) | Performance (Gemini 3.1) | Security (Grok 4) |
|--------|----------------------|----------------------|---------------------|
| **Focus** | Bugs, Logic, Edge Cases | Scaling, Memory, N+1 | Vulnerabilities, Auth |
| **Findings** | ❌ Null check missing | ⚠️ Potential N+1 query | ✅ No issues found |
| **Verdict** | REQUEST CHANGES | APPROVE WITH NOTES | APPROVE |

**What you get:**
- **Consensus Issues** - Problems flagged by 2+ reviewers (high confidence)
- **Notable Findings** - Unique insights from each specialist
- **Final Recommendation** - APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES
- **Priority Actions** - Ranked list of fixes

Analysis: Forced-format output contract. The synthesis table is a mandatory output structure that makes cross-model consensus visible at a glance, with explicit classification of "consensus" (2+ models agree) vs. "unique finding" (single-model insight). The final recommendation is restricted to exactly three verdicts, which prevents ambiguous outputs.

09

Uniqueness

Heavy3 Code Audit — Uniqueness & Positioning

Differs From Seeds

Closest seed: spec-kit (pre-implementation validation + code review hooks). But heavy3 diverges architecturally: spec-kit's review runs via Claude's own in-context skills, while heavy3 outsources the critic role to 3 external LLMs via OpenRouter API calls. This makes heavy3 the only tool in the seed set that implements genuine multi-model consensus as a quality mechanism — where "consensus" means intersection of independent LLM outputs with different training sets and verified different failure modes, not just multi-agent parallelism within one model family.

Also distinct from taskmaster-ai (task decomposition, no review) and BMAD-METHOD (persona-based reviews, same model). Heavy3's council is the architectural inverse of BMAD's persona approach: instead of giving one model multiple personas, it routes different problems to different models selected by domain benchmark.

Observable Failure Modes

  1. OpenRouter dependency: council mode requires BYOK + network access. Air-gapped environments cannot use it.
  2. Cost surprise in council mode: ~$0.10 per review at current pricing — manageable for important reviews but high for routine use.
  3. No persistence: previous review findings lost between sessions unless manually referenced.
  4. Single-skill invocation: the skill cannot be chained into a larger workflow automatically; it's always manually triggered.
  5. Python version fragility: requests dependency and Python 3.x assumption; no lockfile enforced.

Distinctive Opinion

Multi-model consensus is not just useful for code review — it is the correct architecture for any high-stakes AI decision where model blind spots are known and models are chosen based on published domain benchmarks rather than default model preference.

Positioning

  • vs. ESLint/Biome/Semgrep: Heavy3 catches agent-specific patterns and semantic issues these tools miss, but these tools catch deterministic syntax/type issues more reliably.
  • vs. Manual review: Heavy3 is always available, consistent, and provides three specialized perspectives simultaneously.
  • vs. Single-model AI review: Heavy3 can catch issues the primary model (Claude) would miss due to its own blind spots.
  • vs. GitHub Copilot review: Heavy3 is model-agnostic, configurable, and runs locally without telemetry.
04

Workflow

Heavy3 Code Audit — Workflow

Review Modes

Single-Model Review

| Phase | Action | Artifact |
|-------|--------|----------|
| 1. Invocation | /h3 [target] | None |
| 2. Smart Detection | Auto-detect: changes vs plan vs nothing | Detection prompt to user |
| 3. Context Gathering | Claude collects diff, file contents, test files, docs, conversation context | JSON context object |
| 4. API Call | review.py --context-file <json> → OpenRouter → DeepSeek V4 Pro | Streaming review |
| 5. Synthesis | Claude presents findings and proposes fixes | Markdown review report |
| 6. Approval Gate | User approves or requests changes on proposed fixes | User decision |

Council Mode

| Phase | Action | Artifact |
|-------|--------|----------|
| 1. Invocation | /h3 --council or --council flag detected | None |
| 2. Context Gathering | Same as single | JSON context object |
| 3. Parallel Review | council.py fires 3 simultaneous API calls (GPT 5.5, Gemini 3.1 Pro, Grok 4) | 3 model responses |
| 4. Synthesis Table | 3-column comparison: Correctness / Performance / Security findings | Synthesis Table markdown |
| 5. Consensus Analysis | Flags issues found by 2+ models (high confidence) | Consensus issues list |
| 6. Final Recommendation | APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES + Priority Actions | Final verdict |
| 7. Approval Gate | User decides which fixes to apply | User decision |

Approval Gates

  1. Smart detection confirmation: "Review all changes? (y/n)"
  2. Plan review confirmation: "Review this plan? (y/n)"
  3. Proposed fixes: Claude presents fix proposals; user approves before applying

Context Positioning Strategy

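Per the "Lost in the Middle" research (TACL 2024), models recall information at the start and end of the context window more reliably than information in the middle, so the most important material is placed at the edges: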

  1. Conversation context (developer intent) — first
  2. PR metadata / problem description — second
  3. Code diff — third
  4. Full file contents — middle
  5. Documentation — middle
  6. Test files — middle
  7. Cross-file dependencies — middle
  8. Review instructions — end (in system prompt)
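The ordering above can be expressed as a fixed list that drives message assembly. This is a sketch under assumptions, not the skill's actual `build_user_message()`: the section keys reuse field names from the context object described elsewhere on this page, and `build_message` is a hypothetical helper. Review instructions are left out because the list places them in the system prompt.

```python
# Order mirrors steps 1-7 of the list above; step 8 lives in the system prompt.
SECTION_ORDER = [
    "conversation_context",  # developer intent, first
    "pr_metadata",           # problem description, second
    "diff",                  # code diff, third
    "file_contents",         # supporting material, middle
    "documentation",
    "test_files",
    "dependent_files",       # cross-file dependencies
]


def build_message(sections: dict[str, str]) -> str:
    """Concatenate the non-empty sections in positioning order."""
    parts = []
    for key in SECTION_ORDER:
        text = sections.get(key, "").strip()
        if text:
            parts.append(f"## {key}\n{text}")
    return "\n\n".join(parts)
```

Keeping the order in one list means a caller cannot accidentally place supporting material ahead of the intent or the diff, whatever order the input dictionary was built in.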

Large Change Handling

For >50 files or >10K lines of changes:

  1. Detect size via git stats
  2. Break into logical modules
  3. Review each module sequentially with progress tracking
  4. Final cross-module summary pass
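Step 1 could be implemented by parsing `git diff --numstat`, which prints added lines, deleted lines, and path per file, separated by tabs. The sketch below assumes that "lines of changes" means added plus deleted lines and that the diff is taken against HEAD; the function names are hypothetical and the skill may measure size differently.

```python
import subprocess

MAX_FILES = 50       # thresholds stated above
MAX_LINES = 10_000


def parse_numstat(numstat: str) -> tuple[int, int]:
    """Return (files_changed, lines_changed) from `git diff --numstat` output."""
    files = 0
    lines = 0
    for row in numstat.splitlines():
        cols = row.split("\t")
        if len(cols) < 3:
            continue  # skip blank or malformed rows
        files += 1
        added, deleted = cols[0], cols[1]
        # Binary files are reported as "-" and contribute no line counts.
        if added.isdigit():
            lines += int(added)
        if deleted.isdigit():
            lines += int(deleted)
    return files, lines


def is_large_change(numstat: str) -> bool:
    """True when the change exceeds either threshold."""
    files, lines = parse_numstat(numstat)
    return files > MAX_FILES or lines > MAX_LINES


def current_numstat() -> str:
    """Run git in the current directory; requires a git repository."""
    return subprocess.run(
        ["git", "diff", "--numstat", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
```

In a repository, `is_large_change(current_numstat())` would then decide whether to switch to the module-by-module review path.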
06

Memory Context

Heavy3 Code Audit — Memory & Context

State Storage

No persistent memory between sessions. The skill builds a fresh context JSON on each invocation.

Context Object (Ephemeral, Per-Review)

The skill compiles a structured JSON for each API call:

{
  "review_type": "code|plan|pr",
  "conversation_context": {
    "original_request": "...",
    "approach_notes": "...",
    "relevant_exchanges": [...],
    "previous_review_findings": "..."
  },
  "diff": "...",
  "file_contents": {...},
  "test_files": {...},
  "dependent_files": {...},
  "documentation": { "CLAUDE.md": "..." },
  "pr_metadata": { "number": 123, "title": "...", "body": "..." }
}
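To make the hand-off concrete, the following sketch fills in the structure above with placeholder values and writes it to a temporary file of the kind that could be passed via --context-file. The field names come from the structure above; every value, and the helper `write_context_file`, is illustrative.

```python
import json
import tempfile


def write_context_file(context: dict) -> str:
    """Serialize the context to a temporary JSON file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(context, fh, ensure_ascii=False, indent=2)
        return fh.name


# Placeholder values for illustration; a real review would carry actual content.
context = {
    "review_type": "code",
    "conversation_context": {
        "original_request": "Add retry to the upload client",
        "approach_notes": "Wrap the HTTP call in a retry helper",
        "relevant_exchanges": [],
        "previous_review_findings": "",
    },
    "diff": "--- a/upload.py\n+++ b/upload.py\n",
    "file_contents": {"upload.py": "..."},
    "test_files": {},
    "dependent_files": {},
    "documentation": {"CLAUDE.md": "..."},
    "pr_metadata": None,
}
```

Calling `write_context_file(context)` returns a path whose contents round-trip back to the same dictionary through `json.load`.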

Context Budget

  • Single mode: 200K tokens (~800K chars) total
  • Council mode: 200K tokens per reviewer (per-model overrides in max_context_by_model)
  • Graceful truncation with [... truncated due to length ...] marker
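A character-based budget check is one simple way to realize these limits. The sketch below assumes the 4-characters-per-token ratio implied by "200K tokens (~800K chars)" and truncates from the end of the text; the function name and the choice of what to cut are assumptions, not the skill's documented behavior.

```python
CHARS_PER_TOKEN = 4  # rough ratio implied by "200K tokens (~800K chars)"
TRUNCATION_MARKER = "\n[... truncated due to length ...]\n"


def truncate_to_budget(text: str, max_tokens: int = 200_000) -> str:
    """Cut `text` to fit the budget, appending the marker when anything is dropped."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Reserve room for the marker so the result still fits the budget.
    keep = max(0, max_chars - len(TRUNCATION_MARKER))
    return text[:keep] + TRUNCATION_MARKER
```

Text within the budget is returned unchanged, so the marker only appears when content was actually dropped.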

Positioning Strategy

Applies "Lost in the Middle" research: critical context (developer intent + diff) at the start and end; supporting material (file contents, docs, tests) in the middle. This is enforced programmatically in build_user_message() in review.py and council.py.

Cross-Session Handoff

None. Reviews are one-shot API calls. The conversation_context.previous_review_findings field can carry prior review findings if the user re-invokes within the same Claude session, but there is no persistence to disk.

Config File (Persistent)

skill/config.json — persists tier selection and model preferences between invocations.

07

Orchestration

Heavy3 Code Audit — Orchestration

Multi-Agent Pattern

Pattern: parallel-fan-out (council mode) / none (single mode)

In council mode, council.py fires 3 simultaneous API calls to distinct models via OpenRouter. These are not "agents" in the Claude Code subagent sense — they are external LLM API calls. Claude Code (the lead agent) runs the skill and calls the Python script; the Python script fans out to 3 external models.
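The fan-out can be sketched with a thread pool. This is not the code in council.py: `run_council`, the `call_model` callback, and `fake_call` are hypothetical, and the model strings are display labels from this page rather than verified OpenRouter model identifiers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

# Role-to-model mapping from the Council Roles table (labels only).
COUNCIL = {
    "correctness": "GPT 5.5",
    "performance": "Gemini 3.1 Pro",
    "security": "Grok 4",
}


def run_council(
    context: str, call_model: Callable[[str, str, str], str]
) -> dict[str, str]:
    """Call every council member concurrently and collect results by role."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(COUNCIL)) as pool:
        futures = {
            pool.submit(call_model, role, model, context): role
            for role, model in COUNCIL.items()
        }
        for future in as_completed(futures):
            role = futures[future]
            try:
                results[role] = future.result()
            except Exception as exc:
                # One failed reviewer should not sink the other two.
                results[role] = f"ERROR: {exc}"
    return results


def fake_call(role: str, model: str, context: str) -> str:
    """Stand-in for a real HTTP call so the sketch runs offline."""
    return f"{model} reviewed {len(context)} chars for {role}"
```

Threads suit this workload because each call spends its time waiting on the network; `as_completed` also makes it easy to report each reviewer's completion as it finishes.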

Isolation Mechanism

None. The skill runs directly in the working tree (no worktrees, no containers). The review itself is read-only on the codebase and produces only text output; code changes happen only when the user approves Claude's proposed fixes.

Multi-Model Routing

User invokes /h3 --council
  → Claude reads SKILL.md
  → Claude compiles context JSON
  → council.py
      ├── GPT 5.5 (Correctness) via OpenRouter + Bing search
      ├── Gemini 3.1 Pro (Performance) via OpenRouter + Exa search
      └── Grok 4 (Security) via OpenRouter + Exa search
  → Claude synthesizes 3-column table

Model selection is determined by role and backed by published benchmark performance per domain. It cannot be changed at runtime except through the --model override.

Execution Mode

One-shot — invoked per review; no daemon, no continuous loop.

Consensus Mechanism

Informal quorum: findings flagged by 2+ of the 3 models are classified as "Consensus Issues" (high confidence). There is no formal Raft or Byzantine consensus protocol; Claude identifies the overlapping findings during synthesis.
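The 2-of-3 rule is easy to state in code once findings are comparable. The sketch below assumes each model's findings have already been normalized to shared issue labels, which in the skill is a judgment Claude makes during synthesis; `consensus_issues` and the labels are hypothetical.

```python
from collections import defaultdict


def consensus_issues(
    findings_by_model: dict[str, set[str]], quorum: int = 2
) -> dict[str, list[str]]:
    """Map each issue flagged by at least `quorum` models to those models."""
    flagged_by: dict[str, list[str]] = defaultdict(list)
    for model, issues in findings_by_model.items():
        for issue in issues:
            flagged_by[issue].append(model)
    return {
        issue: sorted(models)
        for issue, models in flagged_by.items()
        if len(models) >= quorum
    }


# Illustrative findings keyed by normalized issue labels.
findings = {
    "GPT 5.5": {"null-check", "race"},
    "Gemini 3.1 Pro": {"n+1", "null-check"},
    "Grok 4": {"sql-injection"},
}
```

Here only "null-check" reaches the quorum; the remaining labels would be reported as single-model findings.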

Subagent Definition Format

None — the external models are not subagents in the Claude Code sense; they are API endpoints accessed by Python scripts.

Crash Recovery

None — transient failures handled by exponential backoff (2s, 4s, 8s) in review.py:38-70. No session state to recover.
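A retry loop with the 2s, 4s, 8s schedule might look like the sketch below. It is not a copy of review.py: `with_backoff` is a hypothetical helper, and treating every exception as transient is a simplification made for brevity.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting base_delay * 2**attempt seconds between failed attempts."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Passing `sleep` as a parameter lets the schedule be tested without actually waiting.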

Supports BYOK

Yes — requires OPENROUTER_API_KEY. All models accessed via OpenRouter, so any supported model can be substituted.

08

UI / CLI Surface

Heavy3 Code Audit — UI / CLI Surface

CLI Binary

No dedicated CLI binary. The entry point is the /h3 Claude Code skill invocation. The Python scripts are an internal implementation detail; users never call them directly.

UI / Dashboard

None. Output is rendered as markdown in the Claude Code conversation pane.

IDE Integration

  • Claude Code: primary integration via skill at ~/.claude/skills/h3/
  • Cursor: supported (mentioned in README)
  • Codex CLI: supported
  • Gemini CLI: supported
  • Antigravity: supported

Observability

  • Progress indicators: council mode shows completion status for each of the 3 models as they finish
  • Streaming: single mode streams tokens as they arrive from OpenRouter
  • No audit log or replay capability

Installation Surface

Manual symlink setup. No marketplace integration for the current version. Windows support documented in docs/INSTALL-WINDOWS.md.

Output Format

Markdown in the Claude Code conversation, specifically:

  • Single mode: structured review report with findings + proposed fixes
  • Council mode: 3-column Synthesis Table + Consensus Issues + Final Recommendation (APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES) + Priority Actions ranked list
