skill-optimizer

skill-optimizer · fastxyz/skill-optimizer · ★ 57 · last commit 2026-05-26

Primitive shape 1 total

Skills 1

Summary

skill-optimizer — Summary

skill-optimizer is a Docker-based eval workbench and agent skill for running deterministic evaluations against agent skills across multiple LLMs via OpenRouter. It ships a TypeScript CLI and a canonical skill (SKILL.md) that can be installed into Claude Code, Codex, Cursor, OpenCode, and Gemini CLI. The workbench runs each eval in an isolated Docker container with a deterministic grader — the agent cannot see the grader, hidden answers, or eval metadata during the task phase.

A case is one user-like task plus one or more deterministic graders; a suite is a matrix of cases × OpenRouter models. Results include trace.jsonl (what the agent saw, said, and did), result.json, and suite-result.json. The skill teaches the agent how to author and debug eval suites for its own skills.

Compared to seeds: no direct equivalent in the 11 seeds. Closest to spec-kit (structured validation) but inverted — spec-kit validates code against specs; skill-optimizer validates skills against eval cases. The Docker isolation makes this unique in the corpus: it's the only tool that runs agent skills in hermetic containers to produce deterministic pass/fail results across a model matrix.

Overview

skill-optimizer — Overview

Origin

Built by fastxyz (fastxyz GitHub org) and released under the MIT license. Supports Claude Code, OpenCode, Codex, Cursor, and Gemini CLI plugin formats. The default branch is development.

Philosophy

Skills for AI agents are like software: they need tests to verify they work reliably. The workbench provides a framework for writing "eval suites" — automated test cases that verify whether an agent skill produces correct outputs across different models. The goal is deterministic, reproducible evaluation rather than subjective judgment.

"The workbench gives an agent a skill/reference folder, an isolated /work directory, and deterministic graders."

Key insight: the agent phase is hermetically isolated — it cannot see the grader, hidden answers, or eval metadata. This prevents the eval from measuring whether the model knows the answer format rather than whether the skill works.

Design Principles

Prefer real CLI/API/service over mocks — mock only when you're sure the mock matches the real command surface
Deterministic graders — graders run after the agent with full context; agent runs blind
Local secrets stay local — trace.jsonl, result.json, and preserved workspace dirs are potentially sensitive if the agent prints secret values
MCP support — cases can define MCP servers; local server source can be hidden from the agent via separate service containers

Explicit Antipatterns

Mocking a CLI whose real behavior you don't know well enough (the eval then measures the mock, not the skill)
One broad grader per case (prefer multiple small deterministic graders)
Exposing grader logic, hidden answers, or /case to the agent during eval phase
Model refs that bypass OpenRouter (only openrouter/... model refs are supported)

Architecture

skill-optimizer — Architecture

Distribution

Type: npm-package (TypeScript CLI) + skill-pack
License: MIT
Install complexity: multi-step (npm install + Docker + OPENROUTER_API_KEY)

Install Commands

# Claude Code plugin
/plugin marketplace add fastxyz/skill-optimizer
/plugin install skill-optimizer@skill-optimizer

# Cursor
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a cursor -y

# Skill-only install (any agent)
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a claude-code -a opencode -a codex -a cursor -y

# Local CLI
npm install && npm run build

Directory Layout

skills/skill-optimizer/
└── SKILL.md              # Canonical skill: authoring + debugging eval suites

src/
└── cli.ts                # TypeScript CLI (run-case, run-suite)

examples/
└── workbench/pdf/        # Example suite

tests/

docker/                   # Docker workbench image

.claude-plugin/           # Claude Code marketplace plugin
.codex-plugin/            # Codex plugin
.cursor-plugin/           # Cursor plugin
.opencode/                # OpenCode plugin
gemini-extension.json     # Gemini extension
AGENTS.md, CLAUDE.md, GEMINI.md  # Agent guidance files

Container Architecture

Docker container: /work    ← agent-visible workspace
Docker container: /case    ← case definition (agent cannot see)
Docker container: /results ← grader output (agent cannot see)
Docker mounts: references → /work (skill files visible to agent)
Optional: MCP service containers (agent sees only HTTP URL)

Required Runtime

Node.js 20+
Docker
OPENROUTER_API_KEY

Target AI Tools

Claude Code
OpenCode
Codex
Cursor
Gemini CLI

Components

skill-optimizer — Components

Skills

Name	File	Purpose
`skill-optimizer`	`skills/skill-optimizer/SKILL.md`	Canonical skill: source of truth for authoring eval suites, running CLI, schema, patterns

CLI Commands

Command	Purpose
`npx tsx src/cli.ts run-case <case.yml>`	Run one eval case against a model
`npx tsx src/cli.ts run-case <case.yml> --models m1,m2`	Run one case across multiple models
`npx tsx src/cli.ts run-suite <suite.yml>`	Run a suite (cases × models matrix)
`npx tsx src/cli.ts run-suite <suite.yml> --trials N`	Run suite with N trials per combination
`npx tsx src/cli.ts --help`	CLI help
`npx tsx src/cli.ts run-case --help`	run-case help
`npx tsx src/cli.ts run-suite --help`	run-suite help
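
The invocations in the table can be assembled programmatically when a script needs to launch several runs. The sketch below is a hypothetical helper, not part of the CLI: the function names and option types are assumptions, and it emits only the flags listed above (--models for run-case, --trials for run-suite).

```typescript
// Hypothetical helper: builds argument lists for the documented CLI forms.
// Only flags that appear in the command table (--models, --trials) are used.
type RunCaseOptions = { models?: string[] };
type RunSuiteOptions = { trials?: number };

// The CLI is run through `npx`, so these are the arguments that follow it.
const CLI_PREFIX = ["tsx", "src/cli.ts"];

function runCaseArgs(casePath: string, options: RunCaseOptions = {}): string[] {
  const args = [...CLI_PREFIX, "run-case", casePath];
  if (options.models && options.models.length > 0) {
    // Multiple models are passed as one comma-separated value.
    args.push("--models", options.models.join(","));
  }
  return args;
}

function runSuiteArgs(suitePath: string, options: RunSuiteOptions = {}): string[] {
  const args = [...CLI_PREFIX, "run-suite", suitePath];
  if (options.trials !== undefined) {
    if (!Number.isInteger(options.trials) || options.trials < 1) {
      throw new Error("trials must be a positive integer");
    }
    args.push("--trials", String(options.trials));
  }
  return args;
}
```

The returned array could then be handed to a process launcher such as Node's child_process.spawn with "npx" as the command.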

Case Definition Format (YAML)

name: extract-pdf-facts
task: |
  Read statement.pdf and write answer.json with the account, quarter,
  approval code, and risk flags.
graders:
  - name: answer-json
    command: node $CASE/checks/extract-pdf-facts.mjs
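
To make "deterministic grader" concrete, here is a minimal sketch of the comparison logic a grader such as extract-pdf-facts.mjs might contain. The field names (account, quarter, approvalCode, riskFlags) and the Answer type are assumptions for illustration only; the actual script and its expected values are not shown in the source.

```typescript
// Hypothetical grader logic for the extract-pdf-facts case.
// Field names and the Answer shape are illustrative assumptions.
type Answer = {
  account: string;
  quarter: string;
  approvalCode: string;
  riskFlags: string[];
};

// Returns a list of human-readable failures; an empty list means "pass".
function gradeAnswer(actual: unknown, expected: Answer): string[] {
  if (typeof actual !== "object" || actual === null) {
    return ["answer.json is not a JSON object"];
  }
  const got = actual as Record<string, unknown>;
  const failures: string[] = [];

  // Scalar fields must match exactly.
  for (const key of ["account", "quarter", "approvalCode"] as const) {
    if (got[key] !== expected[key]) {
      failures.push(
        `${key}: expected ${JSON.stringify(expected[key])}, got ${JSON.stringify(got[key])}`,
      );
    }
  }

  // Compare risk flags order-insensitively by sorting both sides first.
  const rawFlags = got["riskFlags"];
  const flags = Array.isArray(rawFlags) ? rawFlags.map(String).sort() : [];
  const wanted = [...expected.riskFlags].sort();
  if (JSON.stringify(flags) !== JSON.stringify(wanted)) {
    failures.push(
      `riskFlags: expected ${JSON.stringify(wanted)}, got ${JSON.stringify(flags)}`,
    );
  }
  return failures;
}
```

In an actual grader the object would come from reading answer.json in /work and the expected values from a file under /case. How the workbench turns grader output into a pass or fail verdict is not specified here. Each field is checked by plain equality, so the same workspace always yields the same verdict.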

Suite Definition Format (YAML)

name: pdf-skill-eval
references: ./references   # Files copied into /work for agent
models:
  - openrouter/google/gemini-2.5-flash
env:
  - OPENROUTER_API_KEY
timeoutSeconds: 600
setup:
  - node $CASE/checks/create-inputs.mjs
appendSystemPrompt: |
  Keep task outputs at the top level of /work unless the user asks otherwise.
cases:
  - name: ...

Output Artifacts

Artifact	Purpose
`result.json`	Per-case pass/fail result with grader evidence
`suite-result.json`	Matrix result across all cases × models
`trace.jsonl`	Full agent trace (what it saw, said, and did)
`summary.json`	Suite summary
`workspace/`	Preserved agent workspace (optional, potentially sensitive)

Prompts

skill-optimizer — Prompt Excerpts

Excerpt 1: Core Model Definition (from skills/skill-optimizer/SKILL.md)

Technique: Hermetic isolation contract expressed as an invariant

## Core Model

- A case is one user-like task plus one or more deterministic graders.
- A suite is a set of cases and OpenRouter models to run as a matrix.
- `references` are copied into `/work` before the agent starts; this is where eval skills live.
- The agent phase sees `/work` only. It cannot see `/case`, `/results`, graders, hidden answers, or hidden metadata.
- Cases can define `mcpServers`; these are exposed through a workbench `mcp` command during the agent phase.
- Graders run after the agent with `/case`, `/work`, and `/results` mounted.
- `trace.jsonl` is the debugging source for what the agent saw, said, and did.

Analysis: The isolation contract is stated as a list of invariants, not guidelines. "It cannot see" is absolute. This prevents the most common eval contamination: the model seeing grader logic or expected outputs before performing the task.

Excerpt 2: Mock vs. Real Service Decision Rule (from skills/skill-optimizer/SKILL.md)

Technique: Conditional rule with explicit when-to-mock criteria

Prefer the real CLI/API/service when you do not know its internal behavior well enough to mock it faithfully. Mock only when you are sure the mock matches the real command surface, validation, outputs, and failure modes; otherwise the eval will measure the mock, not the skill.

Analysis: This is a precise epistemological rule: mock only when you have sufficient knowledge of the real system to replicate its observable behavior completely. The anti-pattern ("measuring the mock, not the skill") is named explicitly, making the failure mode visible.

Excerpt 3: Case Authoring Rules (from skills/skill-optimizer/SKILL.md)

Technique: Exclusion rules for task text to prevent eval leakage

Write natural user tasks. Do not mention graders, hidden answers, `/case`, or eval internals.

And the recommended case coverage pattern:

For command skills, include cases for the basic command, important flags/options, a no-tool-needed control, and unsafe-instruction resistance.

Analysis: The "no-tool-needed control" case is a calibration check — it verifies the model doesn't unnecessarily invoke tools when none are needed. "Unsafe-instruction resistance" tests whether the skill correctly refuses adversarial prompts. Both are systematic coverage requirements that prevent superficial evals that only test the happy path.

Uniqueness

skill-optimizer — Uniqueness & Positioning

Differs From Seeds

No direct equivalent in the 11 seeds. The closest conceptual analog is spec-kit's validation phase, but skill-optimizer is inverted: spec-kit validates code against specs during development; skill-optimizer validates agent skills against deterministic eval cases in isolated Docker containers. The key differentiator is Docker-based hermetic isolation — the agent runs blind, graders run with full visibility, and traces capture everything for debugging.

Also distinct from heavy3-code-audit (which runs LLM-based review) and aurite-agent-verifier (which applies static rules to code). skill-optimizer runs behavioral evals: does this skill, given this task, produce the correct output? That is a functional test, not a static analysis.

Observable Failure Modes

Docker dependency: Requires Docker running locally — higher friction than pure-npm tools.
OPENROUTER_API_KEY required: every model run goes through OpenRouter, so evals generally incur API costs.
No GUI for results: Users must read suite-result.json manually.
Suite setup complexity: Getting references/, checks/, bin/ directories right requires understanding the isolation model.
Default branch is development: install commands in the README may not match the latest code if the branches diverge.

Distinctive Opinion

Skills for AI agents are programs and should be tested like programs: with deterministic test cases, isolated environments, and reproducible results across models. The "eval workbench" pattern — separate graders from agent, run blind, grade after — is borrowed from competitive programming judges and should be the standard for AI skill quality assurance.

Target User

Primarily skill authors who want to verify their skills work reliably across models and across skill updates — not end users of those skills.

Workflow

skill-optimizer — Workflow

Authoring Workflow (From SKILL.md)

Step	Action	Artifact
1	Create `suite.yml` with models, defaults, and case paths	`suite.yml`
2	Put skill/reference material under `references/`	`references/`
3	Write natural user tasks (no mention of graders/answers)	Case task text
4	Put setup helpers and graders under `checks/`; fake CLIs under `bin/`	`checks/` + `bin/`
5	Add one or more deterministic graders per case	Grader scripts
6	Run `run-suite --trials N` and inspect results	`suite-result.json`, `trace.jsonl`

Execution Phases (Per Case)

Phase	What Happens	Artifact
Setup	`setup` commands run; inputs created	Populated `/work`
References copy	`references/` files copied into `/work`	`/work` with skill files
Agent phase	Model receives task; runs in `/work` only	Agent trace
Grade phase	Grader runs with `/case`, `/work`, `/results` mounted	`result.json`
Cleanup	Optional workspace preservation	`workspace/`

Isolation Guarantee

During the agent phase:

Agent sees only /work
Agent cannot see /case (case definition, hidden answers)
Agent cannot see /results (grader outputs)
Agent cannot see grader source
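
The guarantee above amounts to a per-phase visibility table. The sketch below models it as data plus a lookup; the Phase type and canSee function are hypothetical names used for illustration, and the real enforcement is done by Docker mounts rather than by code like this.

```typescript
type Phase = "agent" | "grade";

// Paths visible in each phase, as described in the isolation guarantee:
// the agent sees /work only; graders get /case, /work, and /results.
const VISIBLE_PATHS: Record<Phase, readonly string[]> = {
  agent: ["/work"],
  grade: ["/case", "/work", "/results"],
};

// True when `path` is one of the visible roots or lies beneath one.
function canSee(phase: Phase, path: string): boolean {
  return VISIBLE_PATHS[phase].some(
    (root) => path === root || path.startsWith(root + "/"),
  );
}
```

Matching on the root plus a trailing slash keeps a sibling such as /workshop from being treated as part of /work.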

MCP in Evals

Cases can define mcpServers. For local servers whose source should stay hidden, put the server files under the case's mcp/ support directory. Docker starts them as separate service containers, and the agent sees only the HTTP MCP URL.

Approval Gates

None: the pipeline is fully automated. Results are written to disk, and a human reviews suite-result.json afterwards.

Memory Context

skill-optimizer — Memory & Context

State Storage

File-based, per-eval-run. No persistent cross-session memory.

Artifacts Per Eval Run

trace.jsonl — full agent trace (what it saw, said, and did). Primary debugging tool.
result.json — per-case pass/fail with grader evidence
suite-result.json — matrix result
summary.json — aggregate statistics
workspace/ — optional preserved agent workspace

Context Per Agent Run

The agent sees:

/work directory (references + setup outputs)
The task text
System prompt (plus appendSystemPrompt from suite.yml)
MCP server URLs (if configured)

The agent cannot see:

Grader source
/case directory
Expected outputs
OPENROUTER_API_KEY or other secrets (unless explicitly passed via env:)

Trace Format

trace.jsonl is a structured newline-delimited JSON log of the agent's entire execution — tool calls, model outputs, reasoning traces. Used for debugging why a grader failed.
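
Because the trace is newline-delimited JSON, it can be loaded one line at a time when debugging a failed grader. The sketch below is a minimal reader. The record schema is not documented here, so the `type` field used by recordsOfType is purely an assumption.

```typescript
// Parses newline-delimited JSON: one JSON value per non-empty line.
function parseJsonl(text: string): unknown[] {
  const records: unknown[] = [];
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trim();
    if (line.length === 0) continue; // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch {
      throw new Error(`trace.jsonl line ${i + 1} is not valid JSON`);
    }
  }
  return records;
}

// Hypothetical filter: assumes each record is an object with a string `type` field.
function recordsOfType(records: unknown[], type: string): Record<string, unknown>[] {
  return records.filter(
    (r): r is Record<string, unknown> =>
      typeof r === "object" &&
      r !== null &&
      (r as Record<string, unknown>)["type"] === type,
  );
}
```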

Sensitive Data Warning

trace.jsonl, result.json, and preserved workspace/ dirs are potentially sensitive if the agent or a grader prints or writes secret values. Users are warned to treat these artifacts as sensitive.
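
One way to act on this warning is to scan artifacts for known secret values before sharing them. The helper below is a hypothetical sketch, not a feature of the workbench. It does a literal substring search, so it will miss secrets that were encoded or truncated, and the minimum length of 8 is an arbitrary assumption.

```typescript
// Hypothetical check: reports which named secrets appear verbatim in `text`.
function findLeakedSecrets(
  text: string,
  secrets: Record<string, string | undefined>,
): string[] {
  const leaked: string[] = [];
  for (const [name, value] of Object.entries(secrets)) {
    // Skip unset or very short values to avoid noisy matches.
    if (value !== undefined && value.length >= 8 && text.includes(value)) {
      leaked.push(name);
    }
  }
  return leaked;
}
```

A caller might pass the contents of trace.jsonl together with the environment variables named in the suite's env list.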

Orchestration

skill-optimizer — Orchestration

Multi-Agent Pattern

Pattern: none (single agent per eval case). The workbench spawns one agent instance per case and records its trace. Multiple cases in a suite run sequentially or in parallel depending on CLI configuration.

Isolation Mechanism

Container (Docker): each eval case runs in an isolated Docker container. The agent's workspace is /work only; the grader runs afterwards with full context, using paths the agent cannot access.

Multi-Model Routing

Yes — the suite defines a models: list. The CLI runs each case against each model in the matrix. Each model run is an independent Docker container.

Execution Mode

One-shot per case: the CLI is invoked to run a suite, which fans out to N cases × M models = N×M container runs (multiplied by the trial count when --trials is set).
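
The fan-out can be written as a nested loop over cases, models, and trials. This is an illustrative sketch only: the Run type and expandMatrix function are hypothetical, and the order in which the CLI schedules runs is not specified here.

```typescript
interface Run {
  caseName: string;
  model: string;
  trial: number; // 1-based trial index
}

// Expands a suite into one run per (case, model, trial) combination.
function expandMatrix(caseNames: string[], models: string[], trials = 1): Run[] {
  const runs: Run[] = [];
  for (const caseName of caseNames) {
    for (const model of models) {
      for (let trial = 1; trial <= trials; trial++) {
        runs.push({ caseName, model, trial });
      }
    }
  }
  return runs;
}
```

For example, 2 cases, 3 models, and 2 trials expand to 12 runs.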

Cross-Model Comparison

suite-result.json contains per-model pass rates across all cases, enabling model capability comparison for specific skill tasks.
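
A per-model pass rate is passed runs divided by total runs for that model. The sketch below computes it from a flat list of results. The RunResult shape is an assumption, since the actual layout of suite-result.json is not documented here.

```typescript
// Assumed shape; the real suite-result.json layout may differ.
interface RunResult {
  caseName: string;
  model: string;
  passed: boolean;
}

// Returns, for each model, the fraction of its runs that passed (0 to 1).
function passRateByModel(results: RunResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const result of results) {
    const tally = totals.get(result.model) ?? { passed: 0, total: 0 };
    tally.total += 1;
    if (result.passed) tally.passed += 1;
    totals.set(result.model, tally);
  }
  const rates = new Map<string, number>();
  totals.forEach((tally, model) => {
    rates.set(model, tally.passed / tally.total);
  });
  return rates;
}
```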

MCP Service Containers

Cases can start local MCP servers as separate Docker service containers. Agent sees only the HTTP URL — server source is hidden.

Subagent Definition Format

None — there are no subagents. The agent under test is a single LLM run in a container.

UI / CLI Surface

skill-optimizer — UI / CLI Surface

CLI Binary

Yes: npx tsx src/cli.ts (a TypeScript CLI; it is not installed globally by default, though it can be).

Subcommands: run-case, run-suite (each accepts --help, as does the top-level CLI)
Not a thin wrapper: own eval runtime with Docker orchestration
Model override: run-case accepts --model or --models flags

UI / Dashboard

None. Results are written to JSON files; no web dashboard.

IDE Integration

Multi-platform plugin support:

Claude Code: .claude-plugin/plugin.json + marketplace
OpenCode: .opencode/plugins/skill-optimizer.js
Codex: .codex-plugin/plugin.json
Cursor: .cursor-plugin/plugin.json
Gemini: gemini-extension.json + GEMINI.md

Observability

trace.jsonl — full agent trace per eval run
result.json — pass/fail with grader evidence
suite-result.json — model × case matrix result
No streaming output during eval (batch results written after completion)

Docker Requirement

Users must have Docker running. The default workbench image is skill-optimizer-workbench:local, which must be built locally.

Distribution

Type: npm-package
License: MIT
Install: multi-step

Surfaces

CLI binary: npx tsx src/cli.ts
CLI subcmds: 2
Local UI: No

Components

Commands: 0
Skills: 1
Subagents: 0
Hooks: 0
MCP servers: 0
MCP tools: 0
Scripts: 1
Templates: 2

Workflow

Phases: 5
Approval gates: 0
Spec format: yaml
Spec storage: per-feature-folder
Delta or full: none

Orchestration

Multi-agent: No
Pattern: none
Max concurrent: 1
Isolation: container
Consensus: none
Prompt chaining: No

Multi-model

Multi-model: Yes
BYOK: Yes
Modal: text

Execution

Mode: one-shot
Crash recovery: No
Compaction: No
Session handoff: No
Streaming: No

Memory

Type: file-based
Persistence: none
Search: none
State files: 4 files

Quality

TDD: Optional
TDD mechanism: pre-impl-test-write
Validators: 1
Self-review: none

Git / Observability

Auto commit: No
Auto PR: No
Auto merge: No
Worktree/feat: No
Audit log: Yes
Audit format: jsonl
Replay: Yes

Tools

Primary: claude-code
Targets: 5
Portability: high

Signals

Stars: 57
Last commit: 2026-05-26
Maintainer: active
Quality score: 5.1/10