xiaolai/nlpm-for-claude

nlpm-xiaolai · xiaolai/nlpm-for-claude · ★ 55 · last commit 2026-05-26

Primitive shape 34 total

Commands 8 Skills 17 Subagents 7 Hooks 1 MCP tools 1

Summary

xiaolai/nlpm-for-claude — Summary

NLPM (Natural-Language Programming Manager) is a comprehensive linting, scoring, and testing system for AI agent "NL artifacts" — the markdown files that drive AI behavior: skills, agents, commands, rules, hooks, prompts, CLAUDE.md, and memory files. It treats natural language artifacts as programs that can be scored (100-point penalty-based scale), linted (50 rules with named penalties), auto-fixed, tested with NL-TDD specs, and security-scanned.

The system ships 8 slash commands, 7 agent definitions, a standalone Python validator (bin/nlpm-check) for pre-commit hooks and CI, and a self-evolving GitHub Actions pipeline that audits real plugin repos, harvests exemplars, and feeds learnings back into the rule catalog. Its distinguishing check is manifest-vs-disk consistency: it flags a SKILL.md that exists on disk but is missing from plugin.json.

NLPM is multi-tool and tier-aware: it applies a universal set of rules (Tier 1) plus tool-specific overlays for Claude Code (Tier 2-Claude), Codex CLI (Tier 2-Codex), and Antigravity (Tier 2-Antigravity). It also supports an NL-TDD workflow: write .nlpm-test/my-agent.spec.md before the artifact exists (red), write the artifact, then re-run the tests until they pass (green).

Compared to seeds: no direct equivalent. The closest analogy is spec-kit (linting + validation) but for NL artifacts rather than code. Uniquely designed to police the quality of other frameworks' skill/agent/command files.

Overview

xiaolai/nlpm-for-claude — Overview

Origin

Built by xiaolai (Li Xiaolai, prolific Chinese author and developer). Released under ISC. Part of the xiaolai plugin marketplace. Available via Anthropic's official community marketplace (with ~24h lag) and via the xiaolai marketplace directly.

Philosophy

Natural language artifacts (skills, agents, commands, CLAUDE.md) are programs. Just as ESLint scores JavaScript and ruff scores Python, NLPM scores the markdown files that drive AI behavior. The quality of these artifacts directly determines agent reliability and predictability.

"Just as ESLint scores JavaScript and ruff scores Python, NLPM scores the markdown files that drive AI behavior."

"NLPM is the only multi-tool NL artifact validator that systematically checks manifest-vs-disk consistency."

Key Research Finding

The most novel claim: a class of bugs exists where a SKILL.md exists on disk but is silently missing from plugin.json — invisible after claude plugin install. No other validator (including Anthropic's official plugin-validator and Linux Foundation's skills-ref) catches this. NLPM does.
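
A minimal Python sketch of what a manifest-vs-disk check could look like. It is not NLPM's implementation. It assumes the manifest is a `plugin.json` at the plugin root that lists skills as relative paths under a `skills` key; that schema, and the names `find_unlisted_skills` and `demo`, are hypothetical and used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def find_unlisted_skills(plugin_root: Path) -> list[str]:
    """Return SKILL.md paths present on disk but absent from the manifest.

    ASSUMPTION: plugin.json lists skills as relative paths under a
    "skills" key. The real manifest schema may differ.
    """
    manifest = json.loads((plugin_root / "plugin.json").read_text(encoding="utf-8"))
    listed = {Path(p).as_posix() for p in manifest.get("skills", [])}
    on_disk = {
        p.relative_to(plugin_root).as_posix()
        for p in plugin_root.rglob("SKILL.md")
    }
    # Anything on disk that the manifest never mentions is the bug class.
    return sorted(on_disk - listed)


def demo() -> list[str]:
    """Build a throwaway plugin with one listed and one unlisted skill."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("alpha", "beta"):
            skill_dir = root / "skills" / name
            skill_dir.mkdir(parents=True)
            (skill_dir / "SKILL.md").write_text("# skill\n", encoding="utf-8")
        manifest = {"skills": ["skills/alpha/SKILL.md"]}
        (root / "plugin.json").write_text(json.dumps(manifest), encoding="utf-8")
        return find_unlisted_skills(root)
```

In the demo, `skills/beta/SKILL.md` exists on disk but is not in the manifest, so it is the only path reported.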

The 50 Rules of Natural Language Programming

Formalized as skills/nlpm/rules/SKILL.md. Examples:

R01: No vague quantifiers without criteria ("appropriate", "relevant", "as needed" → penalty -2 each, cap -20)
R02: Every line must earn its tokens
R03: Positive framing over prohibitions
R04: Description is a trigger, not a summary (3+ specific action phrases)
R05: Under 500 lines (over 500 = context bloat)
R09: <example> blocks are mandatory in agents (minimum 2)
R10: Model must match task complexity (haiku = mechanical, sonnet = reasoning, opus = complex judgment)
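
Two of the rules above (R05 and R09) are purely structural, so they can be checked without a model. The following Python sketch shows one way to do that; it is an illustration under stated assumptions, not the logic in `bin/nlpm-check`. The function name `structural_findings` is hypothetical, and treating exactly 500 lines as acceptable is an assumption, since the rule text only says "over 500".

```python
MAX_LINES = 500          # R05: flag artifacts over 500 lines
MIN_EXAMPLE_BLOCKS = 2   # R09: agents need at least 2 <example> blocks


def structural_findings(text: str, is_agent: bool) -> list[str]:
    """Return the IDs of structural rules that the artifact text violates."""
    findings = []
    if len(text.splitlines()) > MAX_LINES:
        findings.append("R05")
    # R09 applies to agent definitions only.
    if is_agent and text.count("<example>") < MIN_EXAMPLE_BLOCKS:
        findings.append("R09")
    return findings
```

An agent file of 501 lines with a single `<example>` block would yield both `R05` and `R09`; a short skill file yields no findings.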

Self-Evolving Pipeline

The auditor pipeline (GitHub Actions):

Audits real plugin repos weekly
Repos scoring ≥ 90 produce exemplars under auditor/exemplars/
auditor-cite-exemplars.yml opens human-gated PRs adding real-world examples to rule catalog
Drift detector validates rule IDs against rubric (found 990 mislabeled rule IDs in a 2026-05-13 sweep)

Architecture

xiaolai/nlpm-for-claude — Architecture

Distribution

Type: claude-plugin + standalone-repo (with CLI binary)
License: ISC
Install complexity: one-liner

Install Commands

# Via xiaolai marketplace (latest)
claude plugin marketplace add xiaolai/claude-plugin-marketplace
claude plugin install nlpm@xiaolai --scope project

# Via Anthropic community marketplace
claude plugin marketplace add anthropics/claude-plugins-community
claude plugin install nlpm@claude-community --scope project

# Standalone CLI (no Claude Code required)
curl -fsSL -o /usr/local/bin/nlpm-check \
  https://raw.githubusercontent.com/xiaolai/nlpm/main/bin/nlpm-check
chmod +x /usr/local/bin/nlpm-check

Directory Layout

skills/nlpm/
├── rules/SKILL.md          # 50 rules of NL programming
├── scoring/SKILL.md        # Scoring rubric with penalty tables
├── testing/SKILL.md        # NL-TDD spec format
├── security/SKILL.md       # Security scan skill
├── conventions/SKILL.md    # Universal conventions
├── conventions-claude/     # Claude Code tier-specific conventions
├── conventions-codex/      # Codex tier-specific conventions
├── conventions-antigravity/ # Antigravity tier-specific conventions
├── patterns/               # Pattern library
├── vocabulary/             # Vocabulary drift tracking
├── writing-agents/         # Agent authoring guidance
├── writing-hooks/          # Hook authoring guidance
├── writing-plugins/        # Plugin authoring guidance
├── writing-skills/         # Skill authoring guidance
├── ...

agents/
├── scorer.md               # Scoring agent (Sonnet)
├── checker.md              # Cross-component consistency checker
├── scanner.md              # Artifact scanner
├── security-scanner.md     # Security scan agent
├── tester.md               # NL-TDD test runner
├── vague-scanner.md        # Vague quantifier detector
└── vocab-drift-scanner.md  # Vocabulary drift detector

commands/
├── ls.md, score.md, check.md, fix.md, trend.md, test.md, init.md, security-scan.md, ...

bin/
└── nlpm-check              # Standalone Python 3.11+ validator

hooks/
└── hooks.json              # PostToolUse hook (Write|Edit|MultiEdit)

auditor/
├── exemplars/              # 62 published teaching artifacts
├── scripts/                # Drift detection, rule health, CI scripts
├── audits/                 # Historical audit records
└── findings.jsonl          # Audit findings log

Required Runtime

Claude Code (primary, for slash commands)
Python 3.11+ (for bin/nlpm-check standalone)
No external dependencies for standalone binary

Components

xiaolai/nlpm-for-claude — Components

Commands (8 slash commands)

| Command | Purpose |
| --- | --- |
| `/nlpm:ls` | Discover and inventory all NL artifacts in a repo |
| `/nlpm:score` | Score artifact quality (100-point scale with named penalties) |
| `/nlpm:check` | Cross-component consistency checks (including manifest-vs-disk) |
| `/nlpm:fix` | Auto-fix fixable issues |
| `/nlpm:trend` | Track quality score trends over time |
| `/nlpm:test` | Run NL artifact tests against spec files (NL-TDD) |
| `/nlpm:init` | Initialize NLPM for a project |
| `/nlpm:security-scan` | Scan plugins for security risks in executable artifacts |

Agents (7 AI agents)

| Agent | Model | Purpose |
| --- | --- | --- |
| `scorer` | sonnet | Score NL artifacts on 100-point scale, apply penalty tables |
| `checker` | sonnet | Cross-component consistency (manifest-vs-disk, etc.) |
| `scanner` | unknown | Artifact inventory and classification |
| `security-scanner` | unknown | Security risk detection in executable artifacts |
| `tester` | unknown | NL-TDD test runner |
| `vague-scanner` | unknown | Detect vague quantifiers (R01 violations) |
| `vocab-drift-scanner` | unknown | Detect vocabulary drift (R51) |

Hooks

| Event | Matcher | Purpose |
| --- | --- | --- |
| `PostToolUse` | `Write\|Edit\|MultiEdit` | Advise when an NL artifact is written/edited; remind to run `/nlpm:score` |

Standalone CLI

| Name | Language | Purpose |
| --- | --- | --- |
| `bin/nlpm-check` | Python 3.11+ (single file, no deps) | Pre-commit hook or CI validator; runs deterministic subset including manifest-vs-disk check |

Templates

| File | Purpose |
| --- | --- |
| `templates/pre-commit-nlpm.sh` | Drop-in git pre-commit hook |
| `templates/workflows/nlpm-check.yml` | Drop-in GitHub Actions workflow |

Scoring System

Starts at 100, penalties subtracted
Score 90-100: Excellent (production-ready)
Score 80-89: Good (minor gaps)
Score 70-79: Adequate (meets threshold)
Score 60-69: Weak (below threshold)
Score <60: Rewrite (fundamental problems)
Default pass threshold: 70 (configurable in .claude/nlpm.local.md)
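
The scoring arithmetic and bands listed above can be written down as a short Python sketch. The band boundaries and the default threshold of 70 come from the list; the function names, the convention of passing penalties as negative integers, and the clamp at 0 are assumptions made for illustration.

```python
# (minimum score, label) pairs, highest band first.
BANDS = [(90, "Excellent"), (80, "Good"), (70, "Adequate"), (60, "Weak")]
DEFAULT_THRESHOLD = 70


def final_score(penalties: list[int]) -> int:
    """Start at 100 and add the (negative) penalties.

    ASSUMPTION: the score is clamped at 0; the source does not say.
    """
    return max(0, 100 + sum(penalties))


def band(score: int) -> str:
    """Map a score to its band label; anything under 60 is "Rewrite"."""
    for minimum, label in BANDS:
        if score >= minimum:
            return label
    return "Rewrite"


def passes(score: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """A score passes when it meets or exceeds the threshold."""
    return score >= threshold
```

For example, penalties of -2, -2, and -10 give a score of 86, which falls in the "Good" band and passes the default threshold.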

Prompts

xiaolai/nlpm-for-claude — Prompt Excerpts

Excerpt 1: The 50 Rules — Universal Examples (from skills/nlpm/rules/SKILL.md)

Technique: Bad/Good paired examples for every rule

**R01. No vague quantifiers without criteria.** "appropriate", "relevant", "as needed", 
"sufficient", "adequate", "reasonable", "properly", "correctly", "some", "several", "various" 
are meaningless without specifics. Replace with measurable criteria. Penalty: -2 each, cap -20.

Bad: "Use appropriate error handling."
Good: "Return `Result<T, AppError>` from all API handlers. Map errors to HTTP status codes 
via the `From<AppError> for StatusCode` impl."

**R04. Description is a trigger, not a summary.** 3+ specific action phrases matching real 
user queries. "Use when debugging React re-renders, fixing hook dependency arrays, optimizing 
with useMemo" — not "Helpful React skill."

**R10. Model must match task complexity.** haiku = mechanical (parsing, counting). sonnet = 
reasoning (analysis, review). opus = complex judgment (orchestration). Wrong tier wastes 
money or produces weak results.

Analysis: Each rule has a named penalty, a concrete bad example, and a concrete good example. R10 is particularly notable — it encodes a cost-correctness tradeoff as a rule, preventing both over-spending (using opus for parsing) and under-spending (using haiku for complex orchestration).
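
The R01 penalty arithmetic (-2 per vague term, capped at -20) can be sketched in Python as follows. The term list is copied from the excerpt; the function name and the plain keyword matching are assumptions. The rule itself only penalizes quantifiers used "without criteria", which takes judgment that a regular expression cannot supply, so this sketch will over-count.

```python
import re

# Terms quoted in the R01 excerpt.
VAGUE_TERMS = (
    "appropriate", "relevant", "as needed", "sufficient", "adequate",
    "reasonable", "properly", "correctly", "some", "several", "various",
)
PENALTY_EACH = 2
PENALTY_CAP = 20

# Whole-word, case-insensitive match for any listed term.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in VAGUE_TERMS) + r")\b",
    re.IGNORECASE,
)


def r01_penalty(text: str) -> int:
    """Return the R01 penalty as a non-positive integer."""
    hits = len(_PATTERN.findall(text))
    return -min(hits * PENALTY_EACH, PENALTY_CAP)
```

Applied to the excerpt's own examples, the "Bad" sentence contains one listed term and receives -2, while the "Good" sentence contains none and receives 0.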

Excerpt 2: Scorer Agent Instructions (from agents/scorer.md)

Technique: 5-step verification gate before reporting findings

## Do Not Invent Findings

Apply ONLY penalties enumerated in `nlpm:scoring`. Do not invent penalty categories. Before 
reporting any finding, run this 5-step check:

1. **Rubric check** — Does the penalty appear in the `nlpm:scoring` penalty tables for this 
   artifact type? If no, do not report (unless marked...

Analysis: The "5-step check before reporting" is an anti-hallucination gate specifically for the scoring agent. It prevents the scorer from inventing rules or over-applying penalties. This is a meta-level quality gate on the quality checker itself.

Excerpt 3: Manifest-vs-Disk Check Rationale (from README)

Technique: Gap analysis embedded as motivation for the tool's existence

NLPM is the only multi-tool NL artifact validator that systematically checks 
**manifest-vs-disk consistency** — the bug class where a SKILL.md exists on disk but is 
silently missing from `plugin.json` (and therefore invisible after `claude plugin install`). 
Verified across 8+ tools including Anthropic's official `plugin-validator` and the Linux 
Foundation's `skills-ref`. See `analysis/ecosystem-gap.md` for the research.

Analysis: The tool's primary value proposition is stated as a named bug class (manifest-vs-disk inconsistency) with a specific mechanism (silently invisible after install), and the README says it was verified across 8+ tools. The README points to analysis/ecosystem-gap.md for the evidence, so the positioning claim is presented as research-backed rather than as bare marketing.

Uniqueness

xiaolai/nlpm-for-claude — Uniqueness & Positioning

Differs From Seeds

No direct equivalent in the 11 seeds. NLPM occupies a meta-layer: it validates the quality of the artifact types that other frameworks in the corpus produce (skills, agents, commands, hooks, plugin manifests). Conceptually closest to spec-kit (linting + validation) but spec-kit validates code against specs; NLPM validates NL artifacts against a formal rubric of 50 rules.

The manifest-vs-disk consistency check is positioned as an ecosystem gap: according to the README, no other tool in the corpus or in the broader ecosystem (including Anthropic's official plugin-validator) catches this class of bug. If that holds, NLPM is the only "meta-linter" for AI agent plugins.

Observable Failure Modes

Score determinism depends on rule adherence: The scorer agent is instructed not to invent findings, but heuristic violations (like "vague quantifiers") involve judgment, so the same artifact may not score identically across runs.
R51 vocabulary drift is opt-in: The most advanced check (vocabulary registry) requires explicit config — may be missed by casual users.
Self-evolving pipeline complexity: The GitHub Actions auditor pipeline is sophisticated — reproducing it in a fork requires understanding the full pipeline.
Marketplace lag: Anthropic community marketplace may be ~24h behind xiaolai marketplace — stale version risk.
ISC license: Minor licensing difference from MIT; may matter for some enterprise contexts.

Distinctive Opinion

Natural language artifacts are programs. They can be linted, scored, tested, and have quality metrics. The quality of skills/agents/commands directly determines agent reliability — and poor NL artifact quality is a systematic, measurable problem that should be treated like code quality.

Self-Referential Feature

NLPM carries an nlpm-badge.json in its own repo ([![Validated by NLPM]...]) — the tool runs on itself and displays its own quality score. This is eating its own dog food at the meta level.

Workflow

xiaolai/nlpm-for-claude — Workflow

NL-TDD Workflow

| Step | Action | State |
| --- | --- | --- |
| 1 | Write spec: `.nlpm-test/my-agent.spec.md` | RED (artifact doesn't exist) |
| 2 | `/nlpm:test` | Fails (artifact missing) |
| 3 | Write artifact: `agents/my-agent.md` | Artifact created |
| 4 | `/nlpm:test` | Check trigger accuracy, output format, score |
| 5 | `/nlpm:score` | Verify quality score ≥ threshold |
| 6 | Iterate | Fix until GREEN |

Standard Validation Workflow

/nlpm:ls        → see all NL artifacts
/nlpm:score     → score them all (or specific path)
/nlpm:check     → cross-component consistency
/nlpm:fix       → auto-fix fixable issues
/nlpm:trend     → track score history

CI/Pre-Commit Workflow (Standalone)

nlpm-check .    # exit 1 on high-confidence findings

No Claude Code dependency. Runs in pre-commit hooks or GitHub Actions.
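
Because the gate is just an exit code, a CI or pre-commit wrapper can stay very small. The Python sketch below shows the idea; the helper name `gate` is hypothetical, and the only fact taken from the source is that `nlpm-check .` exits 1 on high-confidence findings.

```python
import subprocess


def gate(command: list[str]) -> bool:
    """Run a check command and return True when it exits with code 0."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


# In a pre-commit script, with nlpm-check installed on PATH, the call
# would look like this (left commented so the sketch runs anywhere):
#
#     import sys
#     sys.exit(0 if gate(["nlpm-check", "."]) else 1)
```

The repo's own templates/pre-commit-nlpm.sh and templates/workflows/nlpm-check.yml are the supported drop-ins; this wrapper only illustrates the exit-code contract.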

Self-Evolving Auditor Pipeline

Audit (weekly GitHub Actions): audits real plugin repos in the ecosystem
Score (≥90): repos passing at 90+ produce exemplar teaching artifacts
Cite (auditor-cite-exemplars.yml): opens human-gated PRs adding real-world examples to rules
Drift detection (validate-rule-ids.py): re-validates rule_id in historical audits against rubric and semantic keyword match
Rule health (rule-health.py): reports validated_hits and exemplars_count per rule
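
The drift-detection and rule-health steps can be pictured with a short Python sketch. It is not the code in validate-rule-ids.py or rule-health.py. It assumes each line of the findings log is a JSON object with a `rule_id` field, which is a guess at the schema, and it checks only membership in the rubric; the semantic keyword match mentioned above is left out.

```python
import json
from collections import Counter


def rule_health(jsonl_text: str, known_rule_ids: set[str]) -> tuple[Counter, list[str]]:
    """Count validated hits per rule and collect rule IDs not in the rubric.

    ASSUMPTION: each non-empty line is a JSON object with a "rule_id" key.
    """
    validated_hits: Counter = Counter()
    drifted: list[str] = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rule_id = json.loads(line).get("rule_id")
        if rule_id in known_rule_ids:
            validated_hits[rule_id] += 1
        else:
            # Unknown or missing ID: a candidate for relabeling.
            drifted.append(str(rule_id))
    return validated_hits, drifted
```

Given a log with two `R01` findings and one `R99` finding, and a rubric containing only `R01` and `R04`, the sketch reports two validated hits for `R01` and flags `R99` as drifted.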

Approval Gates

/nlpm:fix produces auto-fixes; human reviews before applying
Auditor citation PRs require human approval before merging
NL-TDD spec files require passing tests before artifact is considered complete

Memory Context

xiaolai/nlpm-for-claude — Memory & Context

State Storage

File-based, project-scoped. Score trends and audit history written to disk.

Score Trend Storage

/nlpm:trend tracks score history over time. Stored in project-level files (exact path unknown from public sources — likely .claude/nlpm/ or similar).

Audit History

auditor/audits/ — historical audit records per repo. auditor/findings.jsonl — per-finding log.

Exemplar Library

auditor/exemplars/ — 62 published teaching artifacts as of v0.8.17+. Used to add real-world positive references to rule documentation.

Config File

.claude/nlpm.local.md — project-level NLPM configuration:

Default pass threshold (default 70)
Rule overrides (suppress, max_penalty, threshold adjustments)
R51 vocabulary drift opt-in (rule_overrides.R51.enabled: true)

Vocabulary Registry

skills/nlpm/vocabulary/registry.yaml — vocabulary drift rules registry (optional, only loaded if R51 is enabled).

Cross-Session Handoff

Score trends persist in project files. Rules and scoring rubric are loaded from installed skill files.

Orchestration

xiaolai/nlpm-for-claude — Orchestration

Multi-Agent Pattern

Pattern: hierarchical — slash commands dispatch to named agents (scorer, checker, tester, etc.) which load specific sub-skills. The scorer agent loads nlpm:scoring, nlpm:conventions, nlpm:conventions-claude, etc.

Agent Coordination

Each command invokes a specific agent:

/nlpm:score → scorer agent (loads scoring + conventions skills)
/nlpm:check → checker agent
/nlpm:test → tester agent
/nlpm:security-scan → security-scanner agent

Hook-Driven Advisory

PostToolUse hook on Write|Edit|MultiEdit — advises when NL artifacts are modified. Fail-open (exit 0 on error). Advisory only, not blocking.
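
A fail-open advisory hook can be sketched in Python as below. This is not the script behind hooks/hooks.json. It assumes the hook receives a JSON payload on stdin with the edited path at `tool_input.file_path`, and it uses a hypothetical, deliberately small set of file names to decide what counts as an NL artifact.

```python
import json
from typing import Optional

# Hypothetical subset of file names treated as NL artifacts in this sketch.
NL_ARTIFACT_NAMES = {"SKILL.md", "CLAUDE.md", "AGENTS.md"}


def advise(stdin_payload: str) -> Optional[str]:
    """Return an advisory message for NL artifact edits, else None.

    Fail-open: any parsing or lookup error yields None instead of raising,
    so a malformed payload never blocks the edit.
    """
    try:
        event = json.loads(stdin_payload)
        file_path = str(event["tool_input"]["file_path"])
        file_name = file_path.replace("\\", "/").rsplit("/", 1)[-1]
        if file_name in NL_ARTIFACT_NAMES:
            return f"NL artifact edited: {file_path}. Consider running /nlpm:score."
    except Exception:
        return None
    return None
```

A hook entry point built on this would read stdin, print the message if there is one, and exit 0 in every case, which matches the advisory-only behavior described above.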

Self-Evolving Pipeline (GitHub Actions)

The pipeline combines a scheduled GitHub Actions workflow with supporting scripts:

auditor-cite-exemplars.yml — weekly exemplar citation runs
validate-rule-ids.py — drift detection on each new audit
rule-health.py — rule coverage reporting

Multi-Model

Agents specify model in frontmatter: scorer and checker use sonnet. The models of the remaining agents are not known from public sources; R10 calls for matching the model to task complexity (haiku for mechanical, sonnet for reasoning, opus for complex judgment).

Execution Mode

One-shot (slash commands) + event-driven (PostToolUse hook) + scheduled (GitHub Actions auditor pipeline).

Cross-Tool Support

Two-tier artifact classification, with three tool-specific Tier 2 overlays:

Tier 1 Universal: SKILL.md, AGENTS.md
Tier 2-Claude: commands, agents, skills, hooks, manifests, CLAUDE.md
Tier 2-Codex: Codex-specific paths
Tier 2-Antigravity: Gemini/Antigravity paths

UI / CLI Surface

xiaolai/nlpm-for-claude — UI / CLI Surface

CLI Binary (Standalone)

Yes — nlpm-check (Python 3.11+ single file, no dependencies).

Not a thin wrapper: own deterministic validation engine
Usage: nlpm-check . (exits 1 on high-confidence findings)
Drop-in CI: templates/workflows/nlpm-check.yml for GitHub Actions
Drop-in pre-commit: templates/pre-commit-nlpm.sh

Slash Commands (Claude Code)

8 commands: /nlpm:ls, /nlpm:score, /nlpm:check, /nlpm:fix, /nlpm:trend, /nlpm:test, /nlpm:init, /nlpm:security-scan

Score arguments:

/nlpm:score agents/ — score just agents directory
/nlpm:score --changed — score only git-changed files

IDE Integration

Claude Code: plugin marketplace (primary)
Also available in Codex and Antigravity, with tier-aware rule overlays

Observability

Score trends tracked in project files
auditor/exemplars/ — 62 public teaching artifacts
auditor/findings.jsonl — persistent audit log
GitHub Actions CI pipeline visible on the repo

Dependency

Standalone bin/nlpm-check has zero dependencies (Python 3.11+ stdlib only)
Full Claude Code plugin requires Claude Code installed

Distribution

Type: claude-plugin
License: ISC
Install: one-liner
Version: 0.8.18+

Surfaces

CLI binary: nlpm-check
CLI subcmds: 1
Local UI: No

Components

Commands: 8
Skills: 17
Subagents: 7
Hooks: 1
MCP servers: 1
Scripts: 3
Templates: 2

Workflow

Phases: 7
Approval gates: 2
Spec format: markdown
Spec storage: per-feature-folder
Delta or full: none

Orchestration

Multi-agent: Yes
Pattern: hierarchical
Max concurrent: 1
Isolation: none
Consensus: none
Prompt chaining: Yes

Multi-model

Multi-model: Yes
BYOK: No
Modal: text

Execution

Mode: one-shot
Crash recovery: No
Compaction: No
Session handoff: Yes
Streaming: No

Memory

Type: file-based
Persistence: project
Search: none
State files: 3 files

Quality

TDD: Optional
TDD mechanism: pre-impl-test-write
Validators: 2
Self-review: adversarial-subagent

Git / Observability

Auto commit: No
Auto PR: Yes
Auto merge: No
Worktree/feat: No
Audit log: Yes
Audit format: jsonl
Replay: No

Tools

Primary: claude-code
Targets: 3
Portability: medium

Signals

Stars: 55
Last commit: 2026-05-26
Maintainer: active
Quality score: 4.1/10