Batch 25 — Skills/Verification Tools & Code-Audit Utilities

Roster (10)

slug	stars	distribution	cli_binary	local_ui	orchestration	multi_model	tier
sightglass	unknown	unknown	unknown	unknown	unknown	unknown	C (repo 404)
heavy3-code-audit	44	skill-pack	no	no	parallel-fan-out	yes (GPT/Gemini/Grok via OpenRouter)	A
ui-ux-pro-max	83,052	claude-plugin	yes (uipro)	no	parallel-fan-out	no	A
aurite-agent-verifier	38	skill-pack	no	no	sequential	no	A
subtask-zippoxer	330	cli-tool (Go)	yes (subtask)	yes (TUI)	hierarchical	yes (lead/worker split)	A
skill-optimizer	57	npm-package	yes (tsx CLI)	no	none	yes (model matrix)	A
setup-structure-index	15	skill-pack	no	no	none	yes (Haiku for YAML gen)	A
unslop	44	claude-plugin	yes (python)	no	none	no	A
nlpm-xiaolai	55	claude-plugin	yes (nlpm-check)	no	hierarchical	yes (haiku/sonnet/opus by task)	A
vibe-check-mcp	486	mcp-server	yes (npx)	no	none	yes (meta-mentor LLM configurable)	A

Intra-Batch Patterns

This batch coheres around a single theme: verification and quality assurance for AI-generated artifacts and AI agent behavior. Seven of the nine analyzable tools operate as quality checks or validators, but they target very different layers: heavy3-code-audit and aurite-agent-verifier check code and agent output, nlpm-xiaolai checks NL artifact quality, unslop checks prose style, skill-optimizer checks skill reliability via behavioral evals, vibe-check-mcp checks metacognitive alignment, and setup-structure-index maintains structural index accuracy. Only subtask-zippoxer and ui-ux-pro-max are primarily workflow/design tools.

A striking sub-pattern: multi-model cross-checking as a quality mechanism appears independently in heavy3-code-audit (three LLMs per review), vibe-check-mcp (a second LLM as meta-mentor), and skill-optimizer (model-matrix evals). All three arrived at the same architectural insight: a single model cannot reliably validate itself.
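
The multi-model idea can be sketched as a simple quorum vote. This is an illustrative sketch only: none of the three tools is confirmed to aggregate verdicts this way, and the names `Reviewer`, `consensus_verdict`, and the stub reviewers are hypothetical stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical reviewer type: takes a code snippet, returns "pass" or "fail".
Reviewer = Callable[[str], str]


def consensus_verdict(snippet: str, reviewers: Sequence[Reviewer], quorum: int = 2) -> str:
    """Return the verdict at least `quorum` reviewers agree on, else "disputed"."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    votes = Counter(review(snippet) for review in reviewers)
    verdict, count = votes.most_common(1)[0]
    return verdict if count >= quorum else "disputed"


# Stub reviewers standing in for three independent models.
# Assumption: a real tool would call a model API inside each of these.
def flags_eval(snippet: str) -> str:
    return "fail" if "eval(" in snippet else "pass"


def flags_exec(snippet: str) -> str:
    return "fail" if "exec(" in snippet else "pass"


def always_pass(snippet: str) -> str:
    return "pass"


REVIEWERS = [flags_eval, flags_exec, always_pass]
```

With these stubs, a snippet that only one reviewer flags still passes under a quorum of two, while raising the quorum to three turns the same split vote into "disputed". The point of the design is that no single reviewer's blind spot decides the outcome.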

The batch also reveals a recurring credibility pattern: self-referential tools that eat their own dog food. The unslop README is written following unslop's own rules. nlpm-xiaolai carries an nlpm-badge.json. subtask-zippoxer claims to have been built using subtask. This self-referential proof is emerging as a credibility signal in the ecosystem.

Most Interesting Finds

nlpm-xiaolai: The only tool in the entire corpus that operates as a meta-linter for the artifact types other frameworks produce. The manifest-vs-disk consistency check (SKILL.md on disk but missing from plugin.json → invisible after install) is a real bug class no other validator covers, verified against Anthropic's own plugin-validator. The self-evolving GitHub Actions pipeline that audits real repos, harvests exemplars, and PRs back improvements to its own rule catalog is architecturally novel.
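
A manifest-vs-disk check of the kind described above might look like the following. This is a minimal sketch, not nlpm's implementation: it assumes a hypothetical manifest layout in which plugin.json lists skills as relative paths under a "skills" key, and the function name `find_unlisted_skills` is invented for illustration.

```python
import json
from pathlib import Path


def find_unlisted_skills(plugin_root: Path) -> list[str]:
    """Return SKILL.md paths present on disk but absent from the manifest."""
    manifest = json.loads((plugin_root / "plugin.json").read_text(encoding="utf-8"))
    # Assumption: skills are declared as relative paths under a "skills" key.
    declared = {Path(p).as_posix() for p in manifest.get("skills", [])}
    # Every SKILL.md found anywhere under the plugin root, as a relative path.
    on_disk = {
        p.relative_to(plugin_root).as_posix()
        for p in plugin_root.rglob("SKILL.md")
    }
    # Files on disk that the manifest never mentions are the bug class.
    return sorted(on_disk - declared)
```

Any path this returns is a skill that exists in the repository but would be invisible after install under the assumed layout.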
subtask-zippoxer: The most technically sophisticated tool in the batch: a Go binary with event-sourced state (history.jsonl as append-only truth), git worktree pool management, bidirectional lead-worker communication via subtask ask/send, a TUI, and a self-reported claim that it was built using its own workflow. The "workspace opacity" principle (the lead never picks worktrees) and the "history wins over SQLite" invariant show careful system design.
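
The "append-only log as truth" idea can be illustrated with a short replay sketch. subtask itself is a Go binary, so this Python code is not its implementation; the event shape ({"task": ..., "status": ...}) and the function names are assumptions made for the example. Only the file name history.jsonl comes from the tool's description.

```python
import json
from pathlib import Path


def append_event(history: Path, event: dict) -> None:
    """Append one event as a JSON line; existing lines are never rewritten."""
    with history.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def replay(history: Path) -> dict[str, str]:
    """Rebuild the task -> status map by folding over the log from the start."""
    state: dict[str, str] = {}
    if not history.exists():
        return state
    with history.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            # Hypothetical event shape: later events overwrite earlier ones.
            state[event["task"]] = event["status"]
    return state
```

Because the current state is always derivable by replaying the log, any cache built from it (such as a SQLite index) can be discarded and rebuilt when the two disagree, which is one plausible reading of the "history wins" invariant.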

Items Written as Tier C

Slug	Reason
sightglass	Repository at https://github.com/sightglass-ai/sightglass returns HTTP 404 (not found, private, or URL incorrect)

Cross-References Discovered

vibe-check-mcp cites a companion CPI (Chain-Pattern Interrupt) repo at https://github.com/PV-Bhat/cpi; the two are separate but designed to work together
unslop explicitly positions itself as eating its own dog food (README written following unslop rules)
nlpm-xiaolai is available via both the xiaolai marketplace and Anthropic's official community marketplace (with ~24h lag); it is the first tool in the corpus I have seen on both
heavy3-code-audit selects Grok 4 for security analysis based on published benchmark scores (Kilo AI exploit test, WMDP-Cyber, CyBench); its model selection rationale is more rigorous than that of any other tool in the corpus
ui-ux-pro-max at 83,052 stars stands out sharply from the batch (next highest: 486 for vibe-check-mcp). This star count warrants scrutiny: it may reflect broader nextlevelbuilder community sharing rather than organic tool adoption.