skill-optimizer

skill-optimizer · fastxyz/skill-optimizer · ★ 57 · last commit 2026-05-26

Primitive shape 1 total
Skills 1
00

Summary

skill-optimizer — Summary

skill-optimizer is a Docker-based eval workbench and agent skill for running deterministic evaluations against agent skills across multiple LLMs via OpenRouter. It ships a TypeScript CLI and a canonical skill (SKILL.md) that can be installed into Claude Code, Codex, Cursor, OpenCode, and Gemini CLI. The workbench runs each eval in an isolated Docker container with a deterministic grader — the agent cannot see the grader, hidden answers, or eval metadata during the task phase.

A case is one user-like task plus one or more deterministic graders; a suite is a matrix of cases × OpenRouter models. Results include trace.jsonl (what the agent saw, said, and did), result.json, and suite-result.json. The skill teaches the agent how to author and debug eval suites for its own skills.

Compared to seeds: no direct equivalent in the 11 seeds. Closest to spec-kit (structured validation) but inverted — spec-kit validates code against specs; skill-optimizer validates skills against eval cases. The Docker isolation makes this unique in the corpus: it's the only tool that runs agent skills in hermetic containers to produce deterministic pass/fail results across a model matrix.

01

Overview

skill-optimizer — Overview

Origin

Built by the fastxyz GitHub org and released under the MIT license. It supports the Claude Code, OpenCode, Codex, Cursor, and Gemini CLI plugin formats. The default branch is development.

Philosophy

Skills for AI agents are like software: they need tests to verify they work reliably. The workbench provides a framework for writing "eval suites" — automated test cases that verify whether an agent skill produces correct outputs across different models. The goal is deterministic, reproducible evaluation rather than subjective judgment.

"The workbench gives an agent a skill/reference folder, an isolated /work directory, and deterministic graders."

Key insight: the agent phase is hermetically isolated — it cannot see the grader, hidden answers, or eval metadata. This prevents the eval from measuring whether the model knows the answer format rather than whether the skill works.

Design Principles

  1. Prefer real CLI/API/service over mocks — mock only when you're sure the mock matches the real command surface
  2. Deterministic graders — graders run after the agent with full context; agent runs blind
  3. Local secrets stay local: trace.jsonl, result.json, and preserved workspace dirs are potentially sensitive if the agent prints secret values
  4. MCP support — cases can define MCP servers; local server source can be hidden from the agent via separate service containers

Explicit Antipatterns

  • Mocking when you don't know the real CLI's behavior well enough (the eval then measures the mock, not the skill)
  • One broad grader per case (prefer multiple small deterministic graders)
  • Exposing grader logic, hidden answers, or /case to the agent during eval phase
  • Model refs that bypass OpenRouter (only openrouter/... model refs are supported)
02

Architecture

skill-optimizer — Architecture

Distribution

  • Type: npm-package (TypeScript CLI) + skill-pack
  • License: MIT
  • Install complexity: multi-step (npm install + Docker + OPENROUTER_API_KEY)

Install Commands

# Claude Code plugin
/plugin marketplace add fastxyz/skill-optimizer
/plugin install skill-optimizer@skill-optimizer

# Cursor
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a cursor -y

# Skill-only install (any agent)
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a claude-code -a opencode -a codex -a cursor -y

# Local CLI
npm install && npm run build

Directory Layout

skills/skill-optimizer/
└── SKILL.md              # Canonical skill: authoring + debugging eval suites

src/
└── cli.ts                # TypeScript CLI (run-case, run-suite)

examples/
└── workbench/pdf/        # Example suite

tests/

docker/                   # Docker workbench image

.claude-plugin/           # Claude Code marketplace plugin
.codex-plugin/            # Codex plugin
.cursor-plugin/           # Cursor plugin
.opencode/                # OpenCode plugin
gemini-extension.json     # Gemini extension
AGENTS.md, CLAUDE.md, GEMINI.md  # Agent guidance files

Container Architecture

Docker container: /work    ← agent-visible workspace
Docker container: /case    ← case definition (agent cannot see)
Docker container: /results ← grader output (agent cannot see)
References: copied into /work before the agent starts (skill files visible to agent)
Optional: MCP service containers (agent sees only HTTP URL)

Required Runtime

  • Node.js 20+
  • Docker
  • OPENROUTER_API_KEY

Target AI Tools

  • Claude Code
  • OpenCode
  • Codex
  • Cursor
  • Gemini CLI
03

Components

skill-optimizer — Components

Skills

Name | File | Purpose
skill-optimizer | skills/skill-optimizer/SKILL.md | Canonical skill: source of truth for authoring eval suites, running CLI, schema, patterns

CLI Commands

Command | Purpose
npx tsx src/cli.ts run-case <case.yml> | Run one eval case against a model
npx tsx src/cli.ts run-case <case.yml> --models m1,m2 | Run one case across multiple models
npx tsx src/cli.ts run-suite <suite.yml> | Run a suite (cases × models matrix)
npx tsx src/cli.ts run-suite <suite.yml> --trials N | Run suite with N trials per combination
npx tsx src/cli.ts --help | CLI help
npx tsx src/cli.ts run-case --help | run-case help
npx tsx src/cli.ts run-suite --help | run-suite help

Case Definition Format (YAML)

name: extract-pdf-facts
task: |
  Read statement.pdf and write answer.json with the account, quarter,
  approval code, and risk flags.
graders:
  - name: answer-json
    command: node $CASE/checks/extract-pdf-facts.mjs
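
A grader script such as checks/extract-pdf-facts.mjs could be organized as several small checks instead of one broad comparison. The following is a minimal sketch of that idea, not the project's grader: the field names (account, quarter, approvalCode, riskFlags), the CheckResult shape, and the function names are all hypothetical, and the source does not show how the workbench consumes grader output.

```typescript
// Hypothetical answer shape; the real answer.json schema is not given in the source.
interface Answer {
  account?: unknown;
  quarter?: unknown;
  approvalCode?: unknown;
  riskFlags?: unknown;
}

// Hypothetical expected values a grader might load from the hidden /case directory.
interface Expected {
  account: string;
  quarter: string;
  approvalCode: string;
  riskFlags: string[];
}

interface CheckResult {
  name: string;
  pass: boolean;
  evidence: string;
}

// One small check per scalar field, so a failure points at exactly one thing.
function checkEquals(name: string, actual: unknown, expected: string): CheckResult {
  const pass = actual === expected;
  const evidence = pass
    ? "ok"
    : `expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`;
  return { name, pass, evidence };
}

// Order-insensitive comparison for a list of string flags.
function checkSameSet(name: string, actual: unknown, expected: string[]): CheckResult {
  const isStringArray = Array.isArray(actual) && actual.every((v) => typeof v === "string");
  const got = isStringArray ? [...(actual as string[])].sort() : [];
  const want = [...expected].sort();
  const pass = isStringArray && got.length === want.length && got.every((v, i) => v === want[i]);
  const evidence = pass
    ? "ok"
    : `expected ${JSON.stringify(want)}, got ${JSON.stringify(actual)}`;
  return { name, pass, evidence };
}

// Runs every check and returns all results; no model or randomness is involved.
function gradeAnswer(answer: Answer, expected: Expected): CheckResult[] {
  return [
    checkEquals("account", answer.account, expected.account),
    checkEquals("quarter", answer.quarter, expected.quarter),
    checkEquals("approvalCode", answer.approvalCode, expected.approvalCode),
    checkSameSet("riskFlags", answer.riskFlags, expected.riskFlags),
  ];
}
```

Returning one result per field keeps the evidence specific: a run that gets the quarter wrong fails only the quarter check, which is easier to debug than a single pass/fail over the whole file.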

Suite Definition Format (YAML)

name: pdf-skill-eval
references: ./references   # Files copied into /work for agent
models:
  - openrouter/google/gemini-2.5-flash
env:
  - OPENROUTER_API_KEY
timeoutSeconds: 600
setup:
  - node $CASE/checks/create-inputs.mjs
appendSystemPrompt: |
  Keep task outputs at the top level of /work unless the user asks otherwise.
cases:
  - name: ...
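
To make the suite fields above easier to reason about, here is a hedged TypeScript view of them with a small validator. The SuiteConfig interface and validateSuite function are hypothetical and are not the project's real types; the only rule taken from the source is that model refs must be openrouter/... refs. The non-empty and positive-timeout checks are assumptions added for illustration.

```typescript
// Hypothetical TypeScript view of the suite fields shown above.
interface SuiteConfig {
  name: string;
  references?: string;
  models: string[];
  env?: string[];
  timeoutSeconds?: number;
  setup?: string[];
  appendSystemPrompt?: string;
  cases: { name: string }[];
}

// Returns a list of human-readable problems; an empty list means no problem was found.
function validateSuite(suite: SuiteConfig): string[] {
  const problems: string[] = [];

  // Assumption: a suite with no models or no cases is not useful to run.
  if (suite.models.length === 0) problems.push("models: list is empty");
  if (suite.cases.length === 0) problems.push("cases: list is empty");

  // From the source: only openrouter/... model refs are supported.
  for (const model of suite.models) {
    if (!model.startsWith("openrouter/")) {
      problems.push(`models: "${model}" is not an openrouter/... ref`);
    }
  }

  // Assumption: a timeout, when present, should be a positive number of seconds.
  if (suite.timeoutSeconds !== undefined && !(suite.timeoutSeconds > 0)) {
    problems.push("timeoutSeconds: must be positive");
  }
  return problems;
}
```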

Output Artifacts

Artifact | Purpose
result.json | Per-case pass/fail result with grader evidence
suite-result.json | Matrix result across all cases × models
trace.jsonl | Full agent trace (what it saw, said, and did)
summary.json | Suite summary
workspace/ | Preserved agent workspace (optional, potentially sensitive)
05

Prompts

skill-optimizer — Prompt Excerpts

Excerpt 1: Core Model Definition (from skills/skill-optimizer/SKILL.md)

Technique: Hermetic isolation contract expressed as an invariant

## Core Model

- A case is one user-like task plus one or more deterministic graders.
- A suite is a set of cases and OpenRouter models to run as a matrix.
- `references` are copied into `/work` before the agent starts; this is where eval skills live.
- The agent phase sees `/work` only. It cannot see `/case`, `/results`, graders, hidden answers, or hidden metadata.
- Cases can define `mcpServers`; these are exposed through a workbench `mcp` command during the agent phase.
- Graders run after the agent with `/case`, `/work`, and `/results` mounted.
- `trace.jsonl` is the debugging source for what the agent saw, said, and did.

Analysis: The isolation contract is stated as a list of invariants, not guidelines. "It cannot see" is absolute. This prevents the most common eval contamination: the model seeing grader logic or expected outputs before performing the task.


Excerpt 2: Mock vs. Real Service Decision Rule (from skills/skill-optimizer/SKILL.md)

Technique: Conditional rule with explicit when-to-mock criteria

Prefer the real CLI/API/service when you do not know its internal behavior well enough to mock it faithfully. Mock only when you are sure the mock matches the real command surface, validation, outputs, and failure modes; otherwise the eval will measure the mock, not the skill.

Analysis: This is a precise epistemological rule: mock only when you have sufficient knowledge of the real system to replicate its observable behavior completely. The anti-pattern ("measuring the mock, not the skill") is named explicitly, making the failure mode visible.


Excerpt 3: Case Authoring Rules (from skills/skill-optimizer/SKILL.md)

Technique: Exclusion rules for task text to prevent eval leakage

Write natural user tasks. Do not mention graders, hidden answers, `/case`, or eval internals.

And the recommended case coverage pattern:

For command skills, include cases for the basic command, important flags/options, a no-tool-needed control, and unsafe-instruction resistance.

Analysis: The "no-tool-needed control" case is a calibration check — it verifies the model doesn't unnecessarily invoke tools when none are needed. "Unsafe-instruction resistance" tests whether the skill correctly refuses adversarial prompts. Both are systematic coverage requirements that prevent superficial evals that only test the happy path.
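
The coverage pattern can be treated as a checklist. The sketch below assumes each case carries a hypothetical kind tag; the skill-optimizer case format shown in this document has no such field, so the tag and the function name are illustrative only.

```typescript
// The four coverage categories named in the excerpt, as hypothetical tags.
type CaseKind = "basic" | "flags" | "no-tool-control" | "unsafe-resistance";

const REQUIRED_KINDS: readonly CaseKind[] = [
  "basic",
  "flags",
  "no-tool-control",
  "unsafe-resistance",
];

// Hypothetical: a case annotated with the coverage category it exercises.
interface TaggedCase {
  name: string;
  kind: CaseKind;
}

// Returns the categories that no case in the suite covers yet.
function missingCoverage(cases: readonly TaggedCase[]): CaseKind[] {
  const present = new Set(cases.map((c) => c.kind));
  return REQUIRED_KINDS.filter((kind) => !present.has(kind));
}
```

A suite that only contains basic and flags cases would report the control and resistance categories as missing, which is the happy-path-only gap the excerpt warns about.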

09

Uniqueness

skill-optimizer — Uniqueness & Positioning

Differs From Seeds

No direct equivalent in the 11 seeds. The closest conceptual analog is spec-kit's validation phase, but skill-optimizer is inverted: spec-kit validates code against specs during development; skill-optimizer validates agent skills against deterministic eval cases in isolated Docker containers. The key differentiator is Docker-based hermetic isolation — the agent runs blind, graders run with full visibility, and traces capture everything for debugging.

Also distinct from heavy3-code-audit (which runs LLM-based review) and aurite-agent-verifier (which applies static rules to code). skill-optimizer runs behavioral evals: does this skill, given this task, produce the correct output? That is a functional test, not a static analysis.

Observable Failure Modes

  1. Docker dependency: Requires Docker running locally — higher friction than pure-npm tools.
  2. OPENROUTER_API_KEY required: No free tier; all model runs cost money.
  3. No GUI for results: Users must read suite-result.json manually.
  4. Suite setup complexity: Getting references/, checks/, bin/ directories right requires understanding the isolation model.
  5. Default branch is development: the install commands in the README may not match the latest code if branches diverge.

Distinctive Opinion

Skills for AI agents are programs and should be tested like programs: with deterministic test cases, isolated environments, and reproducible results across models. The "eval workbench" pattern — separate graders from agent, run blind, grade after — is borrowed from competitive programming judges and should be the standard for AI skill quality assurance.

Target User

Primarily skill authors who want to verify their skills work reliably across models and across skill updates — not end users of those skills.

04

Workflow

skill-optimizer — Workflow

Authoring Workflow (From SKILL.md)

Step | Action | Artifact
1 | Create suite.yml with models, defaults, and case paths | suite.yml
2 | Put skill/reference material under references/ | references/
3 | Write natural user tasks (no mention of graders/answers) | Case task text
4 | Put setup helpers and graders under checks/; fake CLIs under bin/ | checks/ + bin/
5 | Add one or more deterministic graders per case | Grader scripts
6 | Run run-suite --trials N and inspect results | suite-result.json, trace.jsonl

Execution Phases (Per Case)

Phase | What Happens | Artifact
Setup | setup commands run; inputs created | Populated /work
References copy | references/ files copied into /work | /work with skill files
Agent phase | Model receives task; runs in /work only | Agent trace
Grade phase | Grader runs with /case, /work, /results mounted | result.json
Cleanup | Optional workspace preservation | workspace/

Isolation Guarantee

During the agent phase:

  • Agent sees only /work
  • Agent cannot see /case (case definition, hidden answers)
  • Agent cannot see /results (grader outputs)
  • Agent cannot see grader source
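
The guarantee above amounts to a per-phase table of visible paths. The sketch below models that table only; the names MOUNTS and canSee are hypothetical, and the real enforcement is done by Docker mounts, not by a path check like this one.

```typescript
type Phase = "agent" | "grade";

// Paths mounted in each phase, as described in the source.
const MOUNTS: Record<Phase, readonly string[]> = {
  agent: ["/work"],
  grade: ["/case", "/work", "/results"],
};

// True when `path` is a mount root or lies underneath one for the given phase.
// The "/" suffix check prevents "/workspace" from matching the "/work" root.
function canSee(phase: Phase, path: string): boolean {
  return MOUNTS[phase].some((root) => path === root || path.startsWith(root + "/"));
}
```

Under this model, canSee("agent", "/case/checks/extract-pdf-facts.mjs") is false while the same path is visible in the grade phase, which matches the blind-agent, full-context-grader split.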

MCP in Evals

Cases can define mcpServers. For local servers whose source should stay hidden, put server files under the case mcp/ support directory — Docker starts them as separate service containers; agent only sees the HTTP MCP URL.

Approval Gates

None: the pipeline is fully automated. Results are written to disk, and a human reviews suite-result.json afterwards.

06

Memory Context

skill-optimizer — Memory & Context

State Storage

File-based, per-eval-run. No persistent cross-session memory.

Artifacts Per Eval Run

  • trace.jsonl — full agent trace (what it saw, said, and did). Primary debugging tool.
  • result.json — per-case pass/fail with grader evidence
  • suite-result.json — matrix result
  • summary.json — aggregate statistics
  • workspace/ — optional preserved agent workspace

Context Per Agent Run

The agent sees:

  • /work directory (references + setup outputs)
  • The task text
  • System prompt (plus appendSystemPrompt from suite.yml)
  • MCP server URLs (if configured)

The agent cannot see:

  • Grader source
  • /case directory
  • Expected outputs
  • OPENROUTER_API_KEY or other secrets (unless explicitly passed via env:)
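
The env: list in suite.yml behaves like an allowlist of variable names. A minimal sketch of that idea follows; pickEnv is a hypothetical helper, and the source does not show how the workbench actually forwards variables into containers or which phases receive them.

```typescript
// Copies only the allowlisted variables that are actually set on the host.
// Anything not named in `allowlist` is dropped, so it never reaches the container.
function pickEnv(
  allowlist: readonly string[],
  hostEnv: Record<string, string | undefined>,
): Record<string, string> {
  const forwarded: Record<string, string> = {};
  for (const name of allowlist) {
    const value = hostEnv[name];
    if (value !== undefined) forwarded[name] = value;
  }
  return forwarded;
}
```

An explicit allowlist means that adding a secret to the host environment does not silently expose it to a run; it has to be named in the suite first.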

Trace Format

trace.jsonl is a structured newline-delimited JSON log of the agent's entire execution — tool calls, model outputs, reasoning traces. Used for debugging why a grader failed.
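
Because the trace is newline-delimited JSON, it can be read one line at a time. The sketch below parses such text without assuming anything about the record schema, which the source does not document; parseJsonl is a hypothetical helper name.

```typescript
// Parses newline-delimited JSON. Blank lines are skipped; lines that fail to
// parse are reported by 1-based line number instead of aborting the whole read.
function parseJsonl(text: string): { records: unknown[]; badLines: number[] } {
  const records: unknown[] = [];
  const badLines: number[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      records.push(JSON.parse(line));
    } catch {
      badLines.push(index + 1);
    }
  });
  return { records, badLines };
}
```

Collecting bad line numbers is useful when a run was interrupted mid-write and the last line of the trace is truncated.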

Sensitive Data Warning

trace.jsonl, result.json, and preserved workspace/ dirs are potentially sensitive if the agent or grader prints or writes secret values. Users are warned to treat these artifacts as sensitive.
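
One possible precaution before sharing an artifact is to replace known secret values with a placeholder. This is a sketch of that precaution and not a feature of skill-optimizer; redactSecrets is a hypothetical helper, and it only catches exact occurrences of the values it is given.

```typescript
// Replaces every exact occurrence of each secret with "[REDACTED]".
// Longer secrets are handled first so a secret that contains another
// is removed whole rather than partially.
function redactSecrets(text: string, secrets: readonly string[]): string {
  const ordered = [...secrets].sort((a, b) => b.length - a.length);
  let out = text;
  for (const secret of ordered) {
    if (secret.length === 0) continue; // splitting on "" would split every character
    out = out.split(secret).join("[REDACTED]");
  }
  return out;
}
```

Encoded or partially printed secrets (for example base64 or a truncated key) would pass through unchanged, so redacted artifacts should still be treated as sensitive.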

07

Orchestration

skill-optimizer — Orchestration

Multi-Agent Pattern

Pattern: none (single agent per eval case). The workbench spawns one agent instance per case and records its trace. Multiple cases in a suite run sequentially or in parallel depending on CLI configuration.

Isolation Mechanism

Container (Docker): each eval case runs in an isolated Docker container. The agent's workspace is /work only; the grader runs afterwards with full context, on paths the agent cannot access.

Multi-Model Routing

Yes — the suite defines a models: list. The CLI runs each case against each model in the matrix. Each model run is an independent Docker container.

Execution Mode

One-shot per case — the CLI is invoked to run a suite, which fans out to N cases × M models = N×M container runs.
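
The fan-out can be written as a nested expansion. The sketch below also includes the --trials count from the CLI table, giving N × M × T runs; the Run shape and expandMatrix name are hypothetical, and whether each trial gets its own container is an assumption.

```typescript
// One planned run: a case, a model, and a 1-based trial number.
interface Run {
  caseName: string;
  model: string;
  trial: number;
}

// Expands cases × models × trials into a flat list of runs.
function expandMatrix(
  caseNames: readonly string[],
  models: readonly string[],
  trials: number = 1,
): Run[] {
  const runs: Run[] = [];
  for (const caseName of caseNames) {
    for (const model of models) {
      for (let trial = 1; trial <= trials; trial++) {
        runs.push({ caseName, model, trial });
      }
    }
  }
  return runs;
}
```

For example, 2 cases, 3 models, and 2 trials expand to 12 runs, which is a quick way to estimate cost before launching a suite.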

Cross-Model Comparison

suite-result.json contains per-model pass rates across all cases, enabling model capability comparison for specific skill tasks.
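
Per-model pass rates are a simple aggregation over run outcomes. The sketch below shows one way to compute them; the RunOutcome shape is hypothetical and is not the documented schema of suite-result.json.

```typescript
// Hypothetical flattened outcome of one run.
interface RunOutcome {
  model: string;
  caseName: string;
  pass: boolean;
}

// Returns the fraction of passing runs for each model (0 to 1).
function passRateByModel(outcomes: readonly RunOutcome[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const outcome of outcomes) {
    const entry = totals.get(outcome.model) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (outcome.pass) entry.passed += 1;
    totals.set(outcome.model, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((entry, model) => {
    rates.set(model, entry.passed / entry.total);
  });
  return rates;
}
```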

MCP Service Containers

Cases can start local MCP servers as separate Docker service containers. Agent sees only the HTTP URL — server source is hidden.

Subagent Definition Format

None — there are no subagents. The agent under test is a single LLM run in a container.

08

UI / CLI Surface

skill-optimizer — UI / CLI Surface

CLI Binary

Yes: npx tsx src/cli.ts (a TypeScript CLI; it is not installed globally by default, but can be).

  • Subcommands: run-case, run-suite, --help
  • Not a thin wrapper: own eval runtime with Docker orchestration
  • Model override: run-case accepts --model or --models flags

UI / Dashboard

None. Results are written to JSON files; no web dashboard.

IDE Integration

Multi-platform plugin support:

  • Claude Code: .claude-plugin/plugin.json + marketplace
  • OpenCode: .opencode/plugins/skill-optimizer.js
  • Codex: .codex-plugin/plugin.json
  • Cursor: .cursor-plugin/plugin.json
  • Gemini: gemini-extension.json + GEMINI.md

Observability

  • trace.jsonl — full agent trace per eval run
  • result.json — pass/fail with grader evidence
  • suite-result.json — model × case matrix result
  • No streaming output during eval (batch results written after completion)

Docker Requirement

Users must have Docker running. The workbench image is skill-optimizer-workbench:local (default) — must be built locally.
