Heavy3 Code Audit

heavy3-code-audit · heavy3-ai/code-audit · ★ 44 · last commit 2026-04-26

Primitive shape 2 total

Commands 1 Skills 1

Summary

Heavy3 Code Audit — Summary

Heavy3 Code Audit (/h3) is a multi-model consensus code review skill for Claude Code and other AI coding agents. It routes code diffs, plans, and pull requests through a council of three specialized LLMs (GPT 5.5 for correctness, Gemini 3.1 Pro for performance, Grok 4 for security) via OpenRouter, then synthesizes findings into a 3-column comparison table that surfaces where models agree (high confidence) and where they diverge. The tool ships as a Claude Code skill with a Python backend and is 100% free and open source under MIT with BYOK via OpenRouter.

The skill auto-detects what to review: uncommitted changes, a plan file, the last commit, or a numbered PR — no argument required for the common case. Council mode runs three parallel API calls, each with role-specific prompts and different web-search back-ends (Bing for correctness, Exa for security and performance), applying the "Lost in the Middle" positioning strategy from academic research. The trademark Synthesis Table differentiates it from other review tools by showing model-by-model verdict columns side-by-side.

Compared to seeds: closest to spec-kit (pre-implementation validation + code review gates) but differs architecturally — heavy3 outsources the critic role to external LLMs via API rather than using in-context skill prompts, making it the only tool in this corpus that implements genuine multi-model consensus as a code-quality mechanism rather than a multi-agent orchestration pattern.

Overview

Heavy3 Code Audit — Overview

Origin

Created by Heavy3.ai (heavy3.ai), a commercial AI research company. The skill was released as free and open source (MIT) as a community tool while the company explores commercial extensions. First public release circa early 2026.

Philosophy

The core thesis is that single-model code review is structurally unreliable because different LLMs fail differently. Research cited in the methodology doc reports that GPT-5.4 generates 2x more concurrency bugs, Gemini 3.1 Pro generates 4x more control flow mistakes, and even top models classify code correctness accurately only ~68% of the time. The solution is a council of specialized reviewers in which each model focuses on the domain it was selected for (based on published benchmarks) and is cross-validated against the others.

"You code with Claude. Our council (GPT + Gemini + Grok) catches what Claude misses."

"Every model has different blind spots. The GPT-5.4 / Gemini 3.1 Pro / Opus 4.5 numbers come from Sonar's Dec 2025 analysis of millions of lines of generated code."

Key Claims

3-model council achieves "blockchain-grade properties of consistency and Byzantine fault tolerance" (citing Hashgraph-Inspired Consensus, 2025)
Plan review is "the missing piece" — no major code review tool effectively reviews architectural plans before implementation
Context positioning strategy applied from academic research (Liu et al., "Lost in the Middle," TACL 2024): intent first, diff second, supporting material in middle
Free tier ($0) uses rotating free models; Single tier (~$0.01) uses DeepSeek V4 Pro; Council tier (~$0.10) uses GPT 5.5 + Gemini 3.1 Pro + Grok 4

Explicit Antipatterns Named

Reviewing only implementation-level code (ignoring plans)
Single-model review for high-stakes code
Noisy 100K+ token contexts vs. well-selected 8K-32K contexts

Architecture

Heavy3 Code Audit — Architecture

Distribution

Type: skill-pack (Claude Code skill with Python scripts)
License: MIT
Install complexity: multi-step (clone + symlink + pip install requests + OPENROUTER_API_KEY)

Install Commands

git clone https://github.com/heavy3-ai/code-audit.git
cd code-audit && mkdir -p ~/.claude/skills && ln -sf "$(pwd)/skill" ~/.claude/skills/h3
pip install requests
echo 'OPENROUTER_API_KEY=your-key-here' > ~/.claude/skills/h3/.env

Directory Layout

skill/
├── SKILL.md              # Skill manifest: YAML frontmatter + Claude instructions
├── config.json           # User config (tier, model, context limits)
└── scripts/
    ├── review.py         # Single-model: OpenRouter API calls, prompt templates
    ├── council.py        # Multi-model: 3-parallel reviews with synthesis
    ├── license.py        # License activation and verification
    └── list-free-models.py  # OpenRouter free model discovery
docs/
├── METHODOLOGY.md        # Research-backed design rationale
├── CONFIGURATION.md
├── INSTALL-WINDOWS.md
└── TROUBLESHOOTING.md

Data Flow

User invokes /h3 [target] [--flags] in Claude Code
Claude reads SKILL.md, gathers context (git diff, files, docs) and compiles to JSON
JSON passed to review.py or council.py via --context-file
Script calls OpenRouter API — single model streams, council runs 3 in parallel
Council synthesizes into 3-column comparison table; Claude proposes fixes for approval
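
The handoff in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the skill's actual code: the `--context-file` flag and the script names come from the description above, while `write_context_file`, `build_command`, and the default `scripts_dir` are hypothetical names introduced here. Any other flags the scripts may accept are not shown.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_context_file(context: dict) -> Path:
    """Serialize the gathered context to a temporary JSON file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(context, handle)
    return Path(handle.name)


def build_command(council: bool, context_file: Path,
                  scripts_dir: Path = Path("skill/scripts")) -> list:
    """Pick the script for the chosen mode and attach the context file."""
    script = scripts_dir / ("council.py" if council else "review.py")
    return [sys.executable, str(script), "--context-file", str(context_file)]
```

The resulting list could be passed to `subprocess.run(...)`; the sketch stops short of that so it has no side effects beyond the temporary file.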

Required Runtime

Python 3.x
requests pip package
OPENROUTER_API_KEY environment variable
Claude Code (primary) or compatible agent

Target AI Tools

Claude Code (primary)
Cursor
Codex CLI
Gemini CLI
Antigravity

Config File

skill/config.json — configures tier (single/council/free), model overrides, max context tokens (200K default), and per-model context limits.
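
A hedged sketch of how such a config might be read. Only the tier values (single/council/free), the 200K default, and the `max_context_by_model` key appear in this profile; the `tier` and `max_context` key names, the default tier, and both function names are assumptions made for illustration.

```python
import json
from pathlib import Path

VALID_TIERS = {"single", "council", "free"}

# Assumed defaults: only the 200K figure is stated in the source.
DEFAULTS = {"tier": "single", "max_context": 200_000, "max_context_by_model": {}}


def load_config(path) -> dict:
    """Merge the user's config.json (if present) over the defaults."""
    config = dict(DEFAULTS)
    config_path = Path(path)
    if config_path.exists():
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    if config["tier"] not in VALID_TIERS:
        raise ValueError(f"unknown tier: {config['tier']!r}")
    return config


def context_budget(config: dict, model: str) -> int:
    """Per-model limit if one is configured, otherwise the global limit."""
    return config["max_context_by_model"].get(model, config["max_context"])
```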

Components

Heavy3 Code Audit — Components

Skill

| Name | File | Purpose |
|------|------|---------|
| `h3` | `skill/SKILL.md` | Main skill manifest; argument parser + smart detection logic + review orchestration instructions for Claude |

Scripts (Python — invoked by Claude during skill execution)

| Name | File | Purpose |
|------|------|---------|
| `review.py` | `skill/scripts/review.py` | Single-model review: builds context JSON, calls OpenRouter, streams response |
| `council.py` | `skill/scripts/council.py` | 3-model parallel council: specialized prompts per role, web search, synthesis table |
| `license.py` | `skill/scripts/license.py` | License activation and verification for commercial tiers |
| `list-free-models.py` | `skill/scripts/list-free-models.py` | Discovers currently-free models on OpenRouter |

Commands (exposed via skill invocation)

| Usage | What it does |
|-------|--------------|
| `/h3` | Smart-detect mode: uncommitted changes → plan → ask |
| `/h3 --council` | Force 3-model council (GPT 5.5 + Gemini 3.1 Pro + Grok 4) |
| `/h3 --free` | Use rotating free model |
| `/h3 pr <number>` | Review specific GitHub PR |
| `/h3 plan.md` | Review a plan/markdown file |
| `/h3 HEAD~3..HEAD` | Review a commit range |
| `/h3 --staged` | Staged changes only |
| `/h3 --commit` | Last commit only |
| `/h3 --model <name>` | Model override (shortcuts: glm, gpt, kimi, deepseek, free) |

Council Roles

| Role | Model | Search Backend | Focus |
|------|-------|----------------|-------|
| Correctness Expert | GPT 5.5 | Bing | Bugs, logic errors, edge cases, race conditions |
| Performance Critic | Gemini 3.1 Pro | Exa | N+1 queries, memory leaks, scaling bottlenecks |
| Security Analyst | Grok 4 | Exa | Vulnerabilities, auth issues, data exposure |

Config

| File | Purpose |
|------|---------|
| `skill/config.json` | User-editable: tier selection, model overrides, context limits |
| `~/.claude/skills/h3/.env` | OPENROUTER_API_KEY |

Prompts

Heavy3 Code Audit — Prompt Excerpts

Excerpt 1: Smart Detection Workflow (from skill/SKILL.md)

Technique: Decision tree + explicit confirmation gates

## Smart Detection

**When `/h3` is invoked without explicit targets, automatically detect intent and confirm with user.**

### Detection Priority

| Priority | Condition | Action |
|----------|-----------|--------|
| 1 | Explicit argument provided | Execute directly, no confirmation |
| 2 | Uncommitted changes exist | Confirm: review changes? |
| 3 | No changes + plan detected | Confirm: review the plan? |
| 4 | No changes + no plan | Ask: review commits or specify target? |

Analysis: Uses a graded decision tree to eliminate ambiguity at invocation. Each priority level has exactly one action and one confirmation template. Prevents the "what should I review?" dead-end by checking git state first.
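
The priority table can be read as a first-match-wins decision function. In the skill this logic lives as natural-language instructions in SKILL.md that Claude follows; the Python below is only an illustrative sketch, and `Action` and `detect_action` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    EXECUTE = "execute directly, no confirmation"
    CONFIRM_CHANGES = "confirm: review changes?"
    CONFIRM_PLAN = "confirm: review the plan?"
    ASK_TARGET = "ask: review commits or specify target?"


def detect_action(explicit_target: Optional[str],
                  has_uncommitted_changes: bool,
                  plan_detected: bool) -> Action:
    """Walk the four priority levels in order; the first match wins."""
    if explicit_target:            # Priority 1
        return Action.EXECUTE
    if has_uncommitted_changes:    # Priority 2
        return Action.CONFIRM_CHANGES
    if plan_detected:              # Priority 3
        return Action.CONFIRM_PLAN
    return Action.ASK_TARGET       # Priority 4
```

Because each branch returns immediately, uncommitted changes always take precedence over a detected plan, matching the ordering in the table.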

Excerpt 2: Council Role Differentiation (from docs/METHODOLOGY.md)

Technique: Role-specialized system prompts with distinct model selection rationale

## The Council

Three specialized reviewers, each with web search:

| Role | Model | Focus | Search |
|------|-------|-------|--------|
| **Correctness Expert** | GPT 5.5 | Bugs, logic errors, edge cases, race conditions | Bing |
| **Performance Critic** | Gemini 3.1 Pro | N+1 queries, memory leaks, scaling bottlenecks | Exa |
| **Security Analyst** | Grok 4 | Vulnerabilities, auth issues, data exposure | Exa |

**Why Grok 4 for Security?**
Grok 4 was selected as Security Analyst based on independent security benchmarks:
| Benchmark | Score |
|-----------|-------|
| Kilo AI Exploit Test | 100% detection on advanced exploits |
| WMDP-Cyber | 79-81% accuracy (vulnerability detection, reverse engineering) |

Analysis: Evidence-based model selection — each role is assigned to a model based on published benchmark performance in that domain, not by default. This is "research-anchored role assignment," distinct from generic "use GPT for X" heuristics.

Excerpt 3: Synthesis Table Output Format (from docs/METHODOLOGY.md)

Technique: Structured comparison table as a deterministic output contract

## The Synthesis Table (Trademark Feature)

| Aspect | Correctness (GPT 5.5) | Performance (Gemini 3.1) | Security (Grok 4) |
|--------|----------------------|----------------------|---------------------|
| **Focus** | Bugs, Logic, Edge Cases | Scaling, Memory, N+1 | Vulnerabilities, Auth |
| **Findings** | ❌ Null check missing | ⚠️ Potential N+1 query | ✅ No issues found |
| **Verdict** | REQUEST CHANGES | APPROVE WITH NOTES | APPROVE |

**What you get:**
- **Consensus Issues** - Problems flagged by 2+ reviewers (high confidence)
- **Notable Findings** - Unique insights from each specialist
- **Final Recommendation** - APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES
- **Priority Actions** - Ranked list of fixes

Analysis: Forced-format output contract. The synthesis table is a mandatory output structure that makes cross-model consensus visible at a glance, with explicit classification of "consensus" (2+ models agree) vs. "unique finding" (single-model insight). The final recommendation is restricted to exactly three values (APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES), which rules out ambiguous outputs.
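
To make the output contract concrete, here is a small sketch that renders a table of the same shape from per-reviewer results. The input structure (a list of dicts with `role`, `model`, `focus`, `findings`, `verdict`) and the function name are assumptions for illustration; the source does not say how the table is actually produced.

```python
ROW_KEYS = ("focus", "findings", "verdict")


def render_synthesis_table(reviews: list) -> str:
    """Build a markdown table with one column per reviewer."""
    header = ["Aspect"] + [f"{r['role']} ({r['model']})" for r in reviews]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for key in ROW_KEYS:
        cells = [f"**{key.capitalize()}**"] + [r[key] for r in reviews]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```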

Uniqueness

Heavy3 Code Audit — Uniqueness & Positioning

Differs From Seeds

Closest seed: spec-kit (pre-implementation validation + code review hooks). But heavy3 diverges architecturally: spec-kit's review runs via Claude's own in-context skills, while heavy3 outsources the critic role to 3 external LLMs via OpenRouter API calls. This makes heavy3 the only tool in the seed set that implements genuine multi-model consensus as a quality mechanism — where "consensus" means intersection of independent LLM outputs with different training sets and verified different failure modes, not just multi-agent parallelism within one model family.

Also distinct from taskmaster-ai (task decomposition, no review) and BMAD-METHOD (persona-based reviews, same model). Heavy3's council is the architectural inverse of BMAD's persona approach: instead of giving one model multiple personas, it routes different problems to different models selected by domain benchmark.

Observable Failure Modes

OpenRouter dependency: council mode requires BYOK + network access. Air-gapped environments cannot use it.
Cost surprise in council mode: ~$0.10 per review at current pricing — manageable for important reviews but high for routine use.
No persistence: previous review findings lost between sessions unless manually referenced.
Single-skill invocation: the skill cannot be chained into a larger workflow automatically; it's always manually triggered.
Python environment fragility: the skill assumes Python 3.x with an unpinned requests dependency; no lockfile is enforced.

Distinctive Opinion

Multi-model consensus is not just useful for code review — it is the correct architecture for any high-stakes AI decision where model blind spots are known and models are chosen based on published domain benchmarks rather than default model preference.

Positioning

vs. ESLint/Biome/Semgrep: Heavy3 catches agent-specific patterns and semantic issues these tools miss, but these tools catch deterministic syntax/type issues more reliably.
vs. Manual review: Heavy3 is always available, consistent, and provides three specialized perspectives simultaneously.
vs. Single-model AI review: Heavy3 can catch issues the primary model (Claude) would miss due to its own blind spots.
vs. GitHub Copilot review: Heavy3 is model-agnostic, configurable, and runs locally without telemetry.

Workflow

Heavy3 Code Audit — Workflow

Review Modes

Single-Model Review

| Phase | Action | Artifact |
|-------|--------|----------|
| 1. Invocation | `/h3 [target]` | None |
| 2. Smart Detection | Auto-detect: changes vs plan vs nothing | Detection prompt to user |
| 3. Context Gathering | Claude collects diff, file contents, test files, docs, conversation context | JSON context object |
| 4. API Call | `review.py --context-file <json>` → OpenRouter → DeepSeek V4 Pro | Streaming review |
| 5. Synthesis | Claude presents findings and proposes fixes | Markdown review report |
| 6. Approval Gate | User approves or requests changes on proposed fixes | User decision |

Council Mode

| Phase | Action | Artifact |
|-------|--------|----------|
| 1. Invocation | `/h3 --council` or `--council` flag detected | None |
| 2. Context Gathering | Same as single | JSON context object |
| 3. Parallel Review | `council.py` fires 3 simultaneous API calls (GPT 5.5, Gemini 3.1 Pro, Grok 4) | 3 model responses |
| 4. Synthesis Table | 3-column comparison: Correctness / Performance / Security findings | Synthesis Table markdown |
| 5. Consensus Analysis | Flags issues found by 2+ models (high confidence) | Consensus issues list |
| 6. Final Recommendation | APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES + Priority Actions | Final verdict |
| 7. Approval Gate | User decides which fixes to apply | User decision |

Approval Gates

Smart detection confirmation: "Review all changes? (y/n)"
Plan review confirmation: "Review this plan? (y/n)"
Proposed fixes: Claude presents fix proposals; user approves before applying

Context Positioning Strategy

Following the "Lost in the Middle" (TACL 2024) research, the most important information is placed at the start and end of the context window, with supporting material in the middle:

Conversation context (developer intent) — first
PR metadata / problem description — second
Code diff — third
Full file contents — middle
Documentation — middle
Test files — middle
Cross-file dependencies — middle
Review instructions — end (in system prompt)
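
The ordering above can be sketched as a simple ordered assembly of sections. The profile says the real logic lives in build_user_message() in review.py and council.py; the function below is a hypothetical stand-in, not that code. The context keys are taken from the context object this profile documents, and the section titles are illustrative. Review instructions are omitted because they are described as living in the system prompt.

```python
import json

# (context key, section title) in the order listed above:
# intent and diff first, supporting material afterwards.
SECTION_ORDER = [
    ("conversation_context", "Developer intent"),
    ("pr_metadata", "PR metadata"),
    ("diff", "Code diff"),
    ("file_contents", "Full file contents"),
    ("documentation", "Documentation"),
    ("test_files", "Test files"),
    ("dependent_files", "Cross-file dependencies"),
]


def build_user_message_sketch(context: dict) -> str:
    """Concatenate the non-empty sections in the fixed priority order."""
    parts = []
    for key, title in SECTION_ORDER:
        value = context.get(key)
        if not value:
            continue  # skip sections that were not gathered
        body = value if isinstance(value, str) else json.dumps(value, indent=2)
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```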

Large Change Handling

For >50 files or >10K lines of changes:

Detect size via git stats
Break into logical modules
Review each module sequentially with progress tracking
Final cross-module summary pass
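
A sketch of the size check and module split, under stated assumptions: the thresholds are the ones given above, `git diff --numstat` is one plausible source of "git stats" (the profile does not name the exact command), and grouping by top-level directory is only a guess at what "logical modules" means. All function names are hypothetical.

```python
from collections import defaultdict

MAX_FILES = 50        # ">50 files"
MAX_LINES = 10_000    # ">10K lines of changes"


def parse_numstat(numstat: str) -> list:
    """Return (path, changed_lines) pairs from `git diff --numstat` output."""
    entries = []
    for line in numstat.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        added, deleted, path = fields
        # Binary files are reported as "-" for both counts; treat them as 0 lines.
        changed = sum(int(n) for n in (added, deleted) if n.isdigit())
        entries.append((path, changed))
    return entries


def is_large_change(entries: list) -> bool:
    """True when either threshold is exceeded."""
    return len(entries) > MAX_FILES or sum(n for _, n in entries) > MAX_LINES


def group_by_top_level_dir(entries: list) -> dict:
    """Assumed module boundary: the first path component ('.' for root files)."""
    modules = defaultdict(list)
    for path, _ in entries:
        top = path.split("/", 1)[0] if "/" in path else "."
        modules[top].append(path)
    return dict(modules)
```

The input text could come from `subprocess.run(["git", "diff", "--numstat"], capture_output=True, text=True).stdout`; each resulting group would then be reviewed in turn before the cross-module summary.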

Memory Context

Heavy3 Code Audit — Memory & Context

State Storage

No persistent memory between sessions. The skill builds a fresh context JSON on each invocation.

Context Object (Ephemeral, Per-Review)

The skill compiles a structured JSON for each API call:

{
  "review_type": "code|plan|pr",
  "conversation_context": {
    "original_request": "...",
    "approach_notes": "...",
    "relevant_exchanges": [...],
    "previous_review_findings": "..."
  },
  "diff": "...",
  "file_contents": {...},
  "test_files": {...},
  "dependent_files": {...},
  "documentation": { "CLAUDE.md": "..." },
  "pr_metadata": { "number": 123, "title": "...", "body": "..." }
}

Context Budget

Single mode: 200K tokens (~800K chars) total
Council mode: 200K tokens per reviewer (per-model overrides in max_context_by_model)
Graceful truncation with [... truncated due to length ...] marker
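
The budget arithmetic implies roughly 4 characters per token (800K chars / 200K tokens). A minimal truncation sketch under that assumption follows; the marker string is quoted from above, while the function name and the choice to cut from the end of the text are assumptions, since the profile does not say which part is dropped.

```python
CHARS_PER_TOKEN = 4  # implied by "200K tokens (~800K chars)"
TRUNCATION_MARKER = "[... truncated due to length ...]"


def truncate_to_budget(text: str, max_tokens: int = 200_000) -> str:
    """Cut text to the character budget and append the marker if anything was dropped."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Reserve room for a newline plus the marker so the result fits the budget.
    keep = max(max_chars - len(TRUNCATION_MARKER) - 1, 0)
    return text[:keep] + "\n" + TRUNCATION_MARKER
```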

Positioning Strategy

Applies "Lost in the Middle" research: critical context (developer intent + diff) at the start and end; supporting material (file contents, docs, tests) in the middle. This is enforced programmatically in build_user_message() in review.py and council.py.

Cross-Session Handoff

None. Reviews are one-shot API calls. The conversation_context.previous_review_findings field can carry prior review findings if the user re-invokes within the same Claude session, but there is no persistence to disk.

Config File (Persistent)

skill/config.json — persists tier selection and model preferences between invocations.

Orchestration

Heavy3 Code Audit — Orchestration

Multi-Agent Pattern

Pattern: parallel-fan-out (council mode) / none (single mode)

In council mode, council.py fires 3 simultaneous API calls to distinct models via OpenRouter. These are not "agents" in the Claude Code subagent sense — they are external LLM API calls. Claude Code (the lead agent) runs the skill and calls the Python script; the Python script fans out to 3 external models.
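
The fan-out can be sketched with a thread pool. Whether council.py uses threads, asyncio, or something else is not stated here, so treat this as one plausible shape. `call_model` is a hypothetical injected function standing in for the OpenRouter request, and the model names are display names from this profile, not API identifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Role -> model display name, as listed in the council table.
COUNCIL = {
    "correctness": "GPT 5.5",
    "performance": "Gemini 3.1 Pro",
    "security": "Grok 4",
}


def run_council(context: str,
                call_model: Callable[[str, str, str], str]) -> dict:
    """Submit one review per role concurrently and collect results by role."""
    with ThreadPoolExecutor(max_workers=len(COUNCIL)) as pool:
        futures = {
            role: pool.submit(call_model, role, model, context)
            for role, model in COUNCIL.items()
        }
        # .result() blocks until that reviewer finishes and re-raises its errors.
        return {role: future.result() for role, future in futures.items()}
```

Injecting `call_model` keeps the fan-out testable without network access; a real implementation would also need per-reviewer error handling so that one failed call does not discard the other two reviews.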

Isolation Mechanism

None (no worktrees, no containers). The review itself is read-only on the codebase and produces only text output; any fixes Claude proposes are applied in place after user approval.

Multi-Model Routing

User invokes /h3 --council
  → Claude reads SKILL.md
  → Claude compiles context JSON
  → council.py
      ├── GPT 5.5 (Correctness) via OpenRouter + Bing search
      ├── Gemini 3.1 Pro (Performance) via OpenRouter + Exa search
      └── Grok 4 (Security) via OpenRouter + Exa search
  → Claude synthesizes 3-column table

Model selection is determined by role and backed by published benchmark performance per domain. It cannot be changed at runtime except through the --model override.

Execution Mode

One-shot — invoked per review; no daemon, no continuous loop.

Consensus Mechanism

Informal quorum: findings flagged by 2+ of 3 models are classified as "Consensus Issues" (high confidence). There is no formal Raft or Byzantine consensus protocol; Claude intersects the three models' outputs during synthesis.
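
The 2-of-3 rule itself is easy to state in code. In the tool the matching is done by Claude reading free-text reviews, so the sketch below assumes the hard part (normalizing findings to comparable keys) has already happened. `classify_findings` and the example keys are hypothetical.

```python
from collections import defaultdict


def classify_findings(findings_by_role: dict, quorum: int = 2):
    """Split findings into consensus (>= quorum reviewers) and single-reviewer ones."""
    flagged_by = defaultdict(set)
    for role, findings in findings_by_role.items():
        for finding in findings:
            flagged_by[finding].add(role)
    consensus = sorted(f for f, roles in flagged_by.items() if len(roles) >= quorum)
    unique = {f: next(iter(roles)) for f, roles in flagged_by.items() if len(roles) == 1}
    return consensus, unique
```

For example, with `{"correctness": {"null-check", "race"}, "performance": {"n+1", "race"}, "security": {"race"}}`, only `"race"` reaches the quorum; the other two are reported as unique findings attributed to their reviewer.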

Subagent Definition Format

None — the external models are not subagents in the Claude Code sense; they are API endpoints accessed by Python scripts.

Crash Recovery

None. Transient failures are handled by exponential backoff (2s, 4s, 8s) in review.py:38-70; there is no session state to recover.
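
A minimal sketch of a retry loop with that delay schedule. It is not the code at review.py:38-70; the function name, the injectable `sleep`, the set of retried exceptions, and the choice to make one final attempt after the last delay are all assumptions.

```python
import time

BACKOFF_DELAYS = (2, 4, 8)  # seconds, doubling each time


def call_with_backoff(fn, delays=BACKOFF_DELAYS, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call fn(); on a retryable error wait for the next delay and try again."""
    for delay in delays:
        try:
            return fn()
        except retry_on:
            sleep(delay)
    # Final attempt after the last delay; any exception now propagates to the caller.
    return fn()
```

When the call is made with the requests package, its errors derive from `requests.exceptions.RequestException` rather than the built-in `ConnectionError`, so that class would need to be passed via `retry_on`.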

Supports BYOK

Yes — requires OPENROUTER_API_KEY. All models accessed via OpenRouter, so any supported model can be substituted.

UI / CLI Surface

Heavy3 Code Audit — UI / CLI Surface

CLI Binary

No dedicated CLI binary. The entry point is the /h3 Claude Code skill invocation. The Python scripts are an internal implementation detail; users never call them directly.

UI / Dashboard

None. Output is rendered as markdown in the Claude Code conversation pane.

IDE Integration

Claude Code: primary integration via skill at ~/.claude/skills/h3/
Cursor: supported (mentioned in README)
Codex CLI: supported
Gemini CLI: supported
Antigravity: supported

Observability

Progress indicators: council mode shows completion status for each of the 3 models as they finish
Streaming: single mode streams tokens as they arrive from OpenRouter
No audit log or replay capability

Installation Surface

Manual symlink setup. No marketplace integration for the current version. Windows support documented in docs/INSTALL-WINDOWS.md.

Output Format

Markdown in the Claude Code conversation, specifically:

Single mode: structured review report with findings + proposed fixes
Council mode: 3-column Synthesis Table + Consensus Issues + Final Recommendation (APPROVE / APPROVE WITH CHANGES / REQUEST CHANGES) + Priority Actions ranked list

Distribution

Type: skill-pack
License: MIT
Install: multi-step

Surfaces

CLI binary: No
CLI subcmds: 0
Local UI: No

Components

Commands: 1
Skills: 1
Subagents: 0
Hooks: 0
MCP servers: 0
MCP tools: 0
Scripts: 4
Templates: 0

Workflow

Phases: 6
Approval gates: 3
Spec format: none
Spec storage: none
Delta or full: none

Orchestration

Multi-agent: No
Pattern: parallel-fan-out
Max concurrent: 3
Isolation: none
Consensus: quorum
Prompt chaining: No

Multi-model

Multi-model: Yes
BYOK: Yes
Modal: text

Execution

Mode: one-shot
Crash recovery: No
Compaction: No
Session handoff: No
Streaming: Yes

Memory

Type: none
Persistence: none
Search: none

Quality

TDD: No
TDD mechanism: none
Validators: 2
Self-review: adversarial-subagent

Git / Observability

Auto commit: No
Auto PR: No
Auto merge: No
Worktree/feat: No
Audit log: No
Audit format: none
Replay: No

Tools

Primary: claude-code
Targets: 5
Portability: medium

Signals

Stars: 44
Last commit: 2026-04-26
Maintainer: active
Quality score: 1.9/10