Phase D Batch 25

Batch 25 — Skills/Verification Tools & Code-Audit Utilities

Roster (10)

| slug | stars | distribution | cli_binary | local_ui | orchestration | multi_model | tier |
|---|---|---|---|---|---|---|---|
| sightglass | unknown | unknown | unknown | unknown | unknown | unknown | C (repo 404) |
| heavy3-code-audit | 44 | skill-pack | no | no | parallel-fan-out | yes (GPT/Gemini/Grok via OpenRouter) | A |
| ui-ux-pro-max | 83,052 | claude-plugin | yes (uipro) | no | parallel-fan-out | no | A |
| aurite-agent-verifier | 38 | skill-pack | no | no | sequential | no | A |
| subtask-zippoxer | 330 | cli-tool (Go) | yes (subtask) | yes (TUI) | hierarchical | yes (lead/worker split) | A |
| skill-optimizer | 57 | npm-package | yes (tsx CLI) | no | none | yes (model matrix) | A |
| setup-structure-index | 15 | skill-pack | no | no | none | yes (Haiku for YAML gen) | A |
| unslop | 44 | claude-plugin | yes (python) | no | none | no | A |
| nlpm-xiaolai | 55 | claude-plugin | yes (nlpm-check) | no | hierarchical | yes (haiku/sonnet/opus by task) | A |
| vibe-check-mcp | 486 | mcp-server | yes (npx) | no | none | yes (meta-mentor LLM configurable) | A |

Intra-Batch Patterns

This batch coheres around a single theme: verification and quality assurance for AI-generated artifacts and AI agent behavior. Seven of the nine analyzable tools operate as quality checks or validators, but they target very different layers: heavy3-code-audit and aurite-agent-verifier check code and agent output, nlpm-xiaolai checks natural-language (NL) artifact quality, unslop checks prose style, skill-optimizer checks skill reliability via behavioral evals, vibe-check-mcp checks metacognitive alignment, and setup-structure-index maintains structural index accuracy. Only subtask-zippoxer and ui-ux-pro-max are primarily workflow or design tools.

A striking sub-pattern: using more than one model as a quality mechanism appears independently in heavy3-code-audit (3 LLMs per review), vibe-check-mcp (a second LLM as meta-mentor), and skill-optimizer (model-matrix evals). All three arrived at the same architectural insight: a single model cannot reliably validate itself.
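
One simple form of the multi-model pattern described above is a quorum vote across independent reviewers. The sketch below is an illustration only and does not reproduce how any tool in this batch is implemented. The `Reviewer` type, the `consensus_verdict` function, the `"escalate"` fallback, and the stub reviewers are all hypothetical; in practice each reviewer would wrap a call to a different model.

```python
from collections import Counter
from typing import Callable

# Hypothetical interface: a reviewer takes an artifact and returns a verdict label.
Reviewer = Callable[[str], str]


def consensus_verdict(artifact: str, reviewers: list[Reviewer], quorum: int) -> str:
    """Return the most common verdict if at least `quorum` reviewers agree, else "escalate"."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    votes = Counter(review(artifact) for review in reviewers)
    verdict, count = votes.most_common(1)[0]
    return verdict if count >= quorum else "escalate"


# Stub reviewers standing in for three different models (assumption for the demo).
def strict_reviewer(artifact: str) -> str:
    return "fail" if "eval(" in artifact else "pass"


def lenient_reviewer(artifact: str) -> str:
    return "pass"


def keyword_reviewer(artifact: str) -> str:
    return "fail" if "eval" in artifact else "pass"


if __name__ == "__main__":
    panel = [strict_reviewer, lenient_reviewer, keyword_reviewer]
    print(consensus_verdict("result = eval(user_input)", panel, quorum=2))  # fail
    print(consensus_verdict("result = int(user_input)", panel, quorum=3))   # pass
```

Requiring a quorum rather than a plain majority lets a split panel fall through to a human or a further check instead of forcing a verdict.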

The batch also reveals a new pattern: self-referential tools that eat their own dog food. The unslop README is written following unslop's own rules, nlpm-xiaolai carries an nlpm-badge.json, and subtask-zippoxer claims to have been built using subtask. This self-referential proof is emerging as a credibility signal in the ecosystem.

Most Interesting Finds

  1. nlpm-xiaolai: The only tool in the entire corpus that operates as a meta-linter for the artifact types other frameworks produce. The manifest-vs-disk consistency check (SKILL.md on disk but missing from plugin.json → invisible after install) is a real bug class no other validator covers, verified against Anthropic's own plugin-validator. The self-evolving GitHub Actions pipeline that audits real repos, harvests exemplars, and PRs back improvements to its own rule catalog is architecturally novel.

  2. subtask-zippoxer: The most technically sophisticated tool in the batch: a Go binary with event-sourced state (history.jsonl as append-only truth), git worktree pool management, bidirectional lead-worker communication via subtask ask/send, a TUI, and self-claimed proof it was built using its own workflow. The "workspace opacity" principle (lead never picks worktrees) and the "history wins over SQLite" invariant show careful system design.
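
Two mechanisms from the list above lend themselves to a short sketch: the manifest-vs-disk check from item 1 and the append-only history log from item 2. The Python below is an illustration under stated assumptions, not code from either project. The `skills` key in plugin.json, the `skills/<name>/SKILL.md` layout, the `{"task", "status"}` event shape, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_unregistered_skills(plugin_root: Path) -> list[str]:
    """Return skill directories that hold a SKILL.md but are absent from the manifest.

    Assumed (hypothetical) manifest shape: {"skills": ["./skills/<name>", ...]}.
    """
    manifest = json.loads((plugin_root / "plugin.json").read_text(encoding="utf-8"))
    declared = {Path(p).as_posix().strip("/") for p in manifest.get("skills", [])}
    on_disk = {
        skill_md.parent.relative_to(plugin_root).as_posix()
        for skill_md in plugin_root.rglob("SKILL.md")
    }
    return sorted(on_disk - declared)


def replay_history(history_path: Path) -> dict[str, str]:
    """Rebuild task status by replaying an append-only JSONL log from the start.

    Assumed (hypothetical) event shape: {"task": "<id>", "status": "<state>"}.
    A cache such as a SQLite table would be rebuilt from this result, so the
    log stays the source of truth when the two disagree.
    """
    state: dict[str, str] = {}
    with history_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the log
            event = json.loads(line)
            state[event["task"]] = event["status"]  # later events override earlier ones
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("lint", "audit"):
            (root / "skills" / name).mkdir(parents=True)
            (root / "skills" / name / "SKILL.md").write_text("# skill\n", encoding="utf-8")
        (root / "plugin.json").write_text(
            json.dumps({"skills": ["./skills/lint"]}), encoding="utf-8"
        )
        print(find_unregistered_skills(root))  # ['skills/audit']

        log = root / "history.jsonl"
        log.write_text(
            '{"task": "t1", "status": "open"}\n{"task": "t1", "status": "done"}\n',
            encoding="utf-8",
        )
        print(replay_history(log))  # {'t1': 'done'}
```

In the first function the interesting output is the set difference: anything on disk but undeclared is exactly the "invisible after install" case. In the second, state is never edited in place; it is always derived by replaying the log.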

Items Written as Tier C

| Slug | Reason |
|---|---|
| sightglass | Repository at https://github.com/sightglass-ai/sightglass returns HTTP 404 (not found, private, or URL incorrect) |

Cross-References Discovered

  • vibe-check-mcp cites a companion CPI (Chain-Pattern Interrupt) repo at https://github.com/PV-Bhat/cpi — the two are separate but designed to work together
  • unslop explicitly positions itself as eating its own dog food (README written following unslop rules)
  • nlpm-xiaolai is available via both the xiaolai marketplace AND Anthropic's official community marketplace (with ~24h lag) — first tool in the corpus I've seen on both
  • heavy3-code-audit selects Grok 4 for security analysis based on published benchmark scores (Kilo AI exploit test, WMDP-Cyber, CyBench) — model selection rationale is more rigorous than any other tool in the corpus
  • ui-ux-pro-max at 83,052 stars stands out sharply from the batch (next highest: 486 for vibe-check-mcp). This star count warrants scrutiny — it may reflect broader nextlevelbuilder community sharing rather than organic tool adoption.
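
To put a number on how far the ui-ux-pro-max star count sits from the rest of the batch, the roster figures can be compared directly. This is a quick arithmetic check using only the star counts listed in the roster (sightglass is omitted because its count is unknown); it says nothing about why the count is high.

```python
from statistics import median

# Star counts copied from the roster; sightglass is omitted (repo 404, count unknown).
stars = {
    "heavy3-code-audit": 44,
    "ui-ux-pro-max": 83_052,
    "aurite-agent-verifier": 38,
    "subtask-zippoxer": 330,
    "skill-optimizer": 57,
    "setup-structure-index": 15,
    "unslop": 44,
    "nlpm-xiaolai": 55,
    "vibe-check-mcp": 486,
}

top_slug = max(stars, key=stars.get)
others = sorted(count for slug, count in stars.items() if slug != top_slug)

ratio_to_next = stars[top_slug] / others[-1]          # versus the next-highest tool
ratio_to_median = stars[top_slug] / median(others)    # versus the median of the rest

print(f"{top_slug}: {ratio_to_next:.0f}x next highest, {ratio_to_median:.0f}x median of the rest")
# ui-ux-pro-max: 171x next highest, 1678x median of the rest
```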