walkinglabs/learn-harness-engineering

learn-harness-engineering-walkinglabs · walkinglabs/learn-harness-engineering · ★ 6.6k · last commit 2026-05-23

Primitive shape 1 total

Skills 1

Summary

walkinglabs/learn-harness-engineering — Summary

Learn Harness Engineering is a project-based course on building reliable environments for AI coding agents, structured as 12 lectures + 6 projects + a companion skill (harness-creator) that can scaffold production-grade harnesses. The course's central argument (backed by Anthropic's cited experiment) is that the same model produces radically different results with vs. without a harness — not because the model improved, but because the environment did. The five harness subsystems taught are: Instructions, State, Verification, Scope, and Session Lifecycle. The course ships in 13 languages, has a VitePress documentation site with PDF export, and is actively developed (6,558 stars, last commit 2026-05-23). The harness-creator skill is the only executable component: a set of Node.js scripts (create-harness.mjs, validate-harness.mjs, run-benchmark.mjs) that scaffold and score harnesses for any project. Unlike the awesome-harness-engineering reference list, this is curriculum-first: theory → practical project → repeatable artifact.

differs_from_seeds: No direct seed analog — this is a course/curriculum, not a framework. The harness-creator skill is mechanically closest to agent-os (scaffolds instruction files for a project) but adds a five-subsystem validation scorer and HTML benchmark report. The lecture series synthesizes the field (Anthropic, OpenAI, walkinglabs' own awesome list) into a coherent 12-lecture learning path, which no seed framework attempts.

Overview

walkinglabs/learn-harness-engineering — Overview

Origin

Maintained by walkinglabs. MIT license (based on README badge). 6,558 stars, 656 forks. TypeScript codebase (VitePress site + scripts). Last commit 2026-05-23. Available in 13 languages: English, 简体中文, 繁體中文, 日本語, 한국어, Español, Français, Русский, Deutsch, العربية, Tiếng Việt, Oʻzbekcha, Türkçe.

Core Thesis

Verbatim from README:

"There's a hard truth most people learn the hard way: the strongest model in the world will still fail on real engineering tasks if you don't build a proper environment around it."

"Anthropic ran a controlled experiment: same model (Opus 4.5), same prompt ('build a 2D retro game editor'). Without a harness, it spent $9 in 20 minutes and produced something that didn't work. With a full harness... it spent $200 in 6 hours and built a game you could actually play. The model didn't change. The harness did."

The Five Harness Subsystems

Instructions  → AGENTS.md or CLAUDE.md
State         → feature_list.json, progress.md
Verification  → init.sh or documented commands
Scope         → feature dependencies and done criteria
Session Lifecycle → session-handoff.md, end-of-session routine
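
The subsystem-to-artifact mapping above can be written down as data. This is a minimal sketch, not the logic of validate-harness.mjs: `SUBSYSTEM_ARTIFACTS` and `missingSubsystems` are hypothetical names, and the Scope row assumes that dependencies and done criteria are stored in feature_list.json. It checks file presence only, so "documented commands" for Verification cannot be detected.

```javascript
// Hypothetical mapping of the five subsystems to the files named above.
// Each subsystem lists groups; a group is satisfied when ANY file in it exists.
const SUBSYSTEM_ARTIFACTS = {
  Instructions: [["AGENTS.md", "CLAUDE.md"]],
  State: [["feature_list.json"], ["progress.md"]],
  Verification: [["init.sh"]], // "documented commands" cannot be found by file name
  Scope: [["feature_list.json"]], // assumption: scope data lives in the feature list
  "Session Lifecycle": [["session-handoff.md"]],
};

// Return the subsystems that have at least one unsatisfied group.
function missingSubsystems(presentFiles) {
  const present = new Set(presentFiles);
  return Object.entries(SUBSYSTEM_ARTIFACTS)
    .filter(([, groups]) => !groups.every((group) => group.some((f) => present.has(f))))
    .map(([name]) => name);
}

console.log(missingSubsystems(["CLAUDE.md", "feature_list.json", "init.sh"]));
// → ["State", "Session Lifecycle"]
```

In this example State is reported because progress.md is absent, and Session Lifecycle because session-handoff.md is absent.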

"The MODEL decides what code to write. The HARNESS governs when, where, and how it writes it."

12 Lectures

Strong Models Don't Mean Reliable Execution
What a Harness Actually Is
Why the Repository Must Become the System of Record
Why One Giant Instruction File Fails
Why Long-Running Tasks Lose Continuity
Why Initialization Needs Its Own Phase
Why Agents Overreach and Under-Finish
Why Feature Lists Are Harness Primitives
Why Agents Declare Victory Too Early
Why End-to-End Testing Changes Results
Why Observability Belongs Inside the Harness
Why Every Session Must Leave a Clean State

6 Projects

Baseline vs Minimal Harness: How Much Difference Does a Harness Make
Agent-Readable Workspace
Multi-Session Continuity
Incremental Indexing
Grounded QA Verification
Runtime Observability and Debugging

Architecture

walkinglabs/learn-harness-engineering — Architecture

Distribution

Type: methodology-doc (course + companion skill)
Language: TypeScript (VitePress + scripts), Markdown (lectures/projects)
License: MIT
Version: 0.1.0 (package.json)

Install (Course)

No install — read at the VitePress site or browse the repo.

Install (harness-creator skill)

npx skills add walkinglabs/learn-harness-engineering --skill harness-creator
# or
cp -r skills/harness-creator ~/your-project/skills/

Directory Structure

docs/
  en/
    lectures/
      lecture-01-why-capable-agents-still-fail/
      lecture-02-what-a-harness-actually-is/
      ...  (12 lectures, each with index.md + code/ dir)
    projects/
      project-01-baseline-vs-minimal-harness/
      ...  (6 projects)
    resources/
    skills/
  zh/, ja/, ko/, es/, fr/, ru/, de/, ar/, vi/, uz/, tr/, zh-TW/  (12 translations)
skills/
  harness-creator/
    SKILL.md              # AI-facing skill instructions
    SKILL.md.en           # English canonical
    metadata.json
    agents/
      openai.yaml         # Agent definition (OpenAI format)
    scripts/
      create-harness.mjs  # Scaffold a harness from scratch
      validate-harness.mjs # Score five subsystems
      render-assessment-html.mjs
      run-benchmark.mjs   # HTML benchmark report
      lib/
        harness-utils.mjs
    templates/
      agents.md           # AGENTS.md template
      feature-list.json   # feature_list.json template
      feature-list.schema.json
      init.sh             # Verification bootstrap template
      progress.md         # Progress tracking template
      session-handoff.md  # End-of-session handoff template
    references/           # Pattern reference docs
    evals/
      evals.json          # 10 eval cases
scripts/
  capture-readme-screenshots.ts
  build-course-pdfs.ts

Required Runtime

Node.js (for harness-creator scripts — built-in modules only, no deps)
VitePress (for local site development)
Playwright + tsx (for screenshots and PDF export)

Target AI Tools

Claude Code, Codex, any agent that reads AGENTS.md/CLAUDE.md

Components

walkinglabs/learn-harness-engineering — Components

harness-creator Skill

The only executable component in the repository.

Scripts

Script	Purpose
`create-harness.mjs`	Scaffold AGENTS.md/CLAUDE.md, feature_list.json, progress.md, init.sh, session-handoff.md
`validate-harness.mjs`	Score five subsystems (Instructions, State, Verification, Scope, Lifecycle)
`render-assessment-html.mjs`	Generate HTML assessment report
`run-benchmark.mjs`	Generate HTML structural benchmark report
`lib/harness-utils.mjs`	Shared utilities

Supported Project Types (detection in create-harness.mjs)

Node/npm/pnpm/yarn/bun, Python, Go, Rust, Maven, Gradle, .NET
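
Project-type detection of this kind is usually driven by marker files. The sketch below shows one way it could work; the marker list, the `detectProjectType` name, and the precedence order are assumptions, since the actual rules in create-harness.mjs are not reproduced here.

```javascript
// Hypothetical marker files, checked in order. Lockfiles come before
// package.json so that a pnpm/yarn/bun project is not reported as plain npm.
const PROJECT_MARKERS = [
  { type: "bun", files: ["bun.lockb", "bun.lock"] },
  { type: "pnpm", files: ["pnpm-lock.yaml"] },
  { type: "yarn", files: ["yarn.lock"] },
  { type: "npm", files: ["package-lock.json", "package.json"] },
  { type: "python", files: ["pyproject.toml", "requirements.txt"] },
  { type: "go", files: ["go.mod"] },
  { type: "rust", files: ["Cargo.toml"] },
  { type: "maven", files: ["pom.xml"] },
  { type: "gradle", files: ["build.gradle", "build.gradle.kts"] },
];

// Takes the file names found in the project root and returns a type label.
function detectProjectType(fileNames) {
  const present = new Set(fileNames);
  for (const { type, files } of PROJECT_MARKERS) {
    if (files.some((f) => present.has(f))) return type;
  }
  // .NET projects are identified by extension rather than a fixed file name.
  if (fileNames.some((f) => /\.(csproj|fsproj|sln)$/.test(f))) return "dotnet";
  return "unknown";
}

console.log(detectProjectType(["package.json", "pnpm-lock.yaml"])); // → "pnpm"
console.log(detectProjectType(["App.csproj", "Program.cs"])); // → "dotnet"
```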

Templates (6)

Template	Content
`templates/agents.md`	AGENTS.md template
`templates/feature-list.json`	Feature list JSON template
`templates/feature-list.schema.json`	JSON schema for feature list
`templates/init.sh`	Verification bootstrap template
`templates/progress.md`	Progress tracking template
`templates/session-handoff.md`	Session handoff template

Evals

evals/evals.json — 10 eval cases for the harness-creator skill

Agent Definition

agents/openai.yaml — OpenAI-format agent definition for the harness-creator

Course Content

Lectures (12)

Plain Markdown files, each covering one "why X fails without a harness" lesson. Each lecture has a companion code/ directory with runnable examples.

Projects (6)

Hands-on projects building and measuring real harnesses.

Commands / Hooks / MCP

None — course documentation does not ship Claude Code commands, hooks, or MCP servers.

Prompts

walkinglabs/learn-harness-engineering — Prompt Excerpts

Excerpt 1: harness-creator SKILL.md — Core Model

Technique: Five-subsystem table as the skill's organizing principle; reference-based progressive disclosure

---
name: harness-creator
description: >-
  Build, audit, and improve lightweight harnesses for AI coding agents: AGENTS.md/CLAUDE.md,
  feature state, verification workflows, scope boundaries, lifecycle handoff,
  memory persistence, context control, tool safety, and multi-agent coordination.
---

## Core Model

Every useful coding-agent harness has five subsystems:

| Subsystem | Minimal artifact | Purpose |
|---|---|---|
| Instructions | AGENTS.md or CLAUDE.md | Startup path, working rules, definition of done |
| State | feature_list.json, progress.md | Current feature, status, evidence, next step |
| Verification | init.sh or documented commands | Tests/checks the agent must run before claiming done |
| Scope | Feature dependencies and done criteria | Prevents overreach and half-finished work |
| Lifecycle | session-handoff.md, end-of-session routine | Makes the next session restartable |

## First Move

1. Inspect what already exists: instruction files, feature/state files, verification commands, docs, package manifests.
2. Ask only for missing context that cannot be inferred safely.
3. Prefer a minimal harness first. Add memory, tool safety, multi-agent, or benchmark details only when the user's problem calls for them.

Analysis: Minimal-first design principle ("Prefer a minimal harness"); structured as a table (not prose) for the core model; explicit "first move" pattern (inspect before creating) echoes ozzeron-prompt-pack's reuse-before-create philosophy but applied to harness scaffolding.

Excerpt 2: Lecture 01 — The Anthropic experiment

Technique: Primary-source evidence used as course hook; specific cost + outcome data

Anthropic ran a controlled experiment that illustrates the point perfectly. Same prompt ("build
a 2D retro game editor"), same model (Opus 4.5), two runs. First run: bare, no support — 20
minutes, $9, the game's core features didn't work. Second run: full harness — a planner,
generator, evaluator three-agent architecture — 6 hours, $200, the game was fully playable.

They didn't change the model. Opus 4.5 was still Opus 4.5. What changed was the tack.

Analysis: Pedagogical structure — cites a real experiment with specific dollar amounts and outcomes to make the abstract concept concrete. The "tack" metaphor (harness = equestrian tack) is maintained throughout the course.

Excerpt 3: Lecture 01 — Definition of Done as harness primitive

Write an explicit Definition of Done for every task. Don't say "add a search feature." Spell it out:

Completion criteria:

New endpoint GET /api/search?q=xxx
Supports pagination, default 20 items
Results include highlighted snippets
All new code passes pytest
Type checking passes (mypy --strict)


Analysis: Actionable prompt-engineering instruction embedded in course content; the course teaches through examples of bad vs. good instructions.
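
A Definition of Done like the one in this excerpt can also be kept as a machine-checkable list. The following is a rough sketch under stated assumptions: the `id` values, the evidence object, and the `unmetCriteria` helper are hypothetical and do not come from the course; only the criterion texts are taken from the excerpt.

```javascript
// The completion criteria from the excerpt, encoded as a checklist.
// The `id` keys are invented for illustration.
const searchFeatureDone = [
  { id: "endpoint", text: "New endpoint GET /api/search?q=xxx" },
  { id: "pagination", text: "Supports pagination, default 20 items" },
  { id: "snippets", text: "Results include highlighted snippets" },
  { id: "pytest", text: "All new code passes pytest" },
  { id: "mypy", text: "Type checking passes (mypy --strict)" },
];

// A task counts as done only when every criterion has recorded evidence.
// Returns the text of each criterion that still lacks evidence.
function unmetCriteria(criteria, evidenceById) {
  return criteria.filter((c) => !evidenceById[c.id]).map((c) => c.text);
}

// Illustrative evidence notes (made up for this example).
const recorded = {
  endpoint: "manual request succeeded",
  pytest: "test run passed",
  mypy: "type check passed",
};

console.log(unmetCriteria(searchFeatureDone, recorded));
// → ["Supports pagination, default 20 items", "Results include highlighted snippets"]
```

An agent following this pattern would report the two remaining items instead of declaring the feature finished.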

Uniqueness

walkinglabs/learn-harness-engineering — Uniqueness & Positioning

differs_from_seeds

No direct seed analog — this is a course/curriculum with a companion scaffolding tool, not a drop-in framework. The harness-creator skill is mechanically closest to agent-os (scaffolds instruction files for a project) but adds five-subsystem structural validation scoring and HTML benchmark reports — things agent-os does not provide. The lecture series synthesizes the entire harness engineering field (drawing on Anthropic, OpenAI, the awesome-harness-engineering list) into a 12-lecture pedagogical arc, which no seed framework attempts. Among batch-27 peers, it forms a curriculum layer above awesome-harness-engineering (reference list) and is a practical complement to nexu-harness-guide (technical reference with code examples).

Positioning

Curriculum, not framework: Teaches the why before the how
Evidence-based pedagogy: Primary source citations (Anthropic experiment, OpenAI field report) as course anchors
13 languages: Broadest localization of any framework in the batch
Measurable outcomes: Structural validation + benchmark report give a score, not just advice
Minimal-first discipline: "Prefer a minimal harness first" as a design principle

Observable Limitations

Structural validation only: validate-harness.mjs scores presence/coherence — does not verify the agent actually follows the harness
No runtime enforcement: The course teaches patterns; nothing enforces them in agent execution
Score vs. behavior gap: "Real effectiveness still needs before/after agent sessions on representative tasks" — acknowledged in the SKILL.md itself
No hosting for the skill: Installing the skill requires manually copying directories or using npx skills add (if that tool exists)

Workflow

walkinglabs/learn-harness-engineering — Workflow

Learning Path

1. Read lecture-01 (Why capable agents still fail)
2. Read lecture-02 (What a harness actually is)
3. Complete Project 01 (baseline vs minimal harness)
4. Progress through lectures 03-12 in order
5. Complete projects 02-06
6. Use harness-creator to scaffold your own project

harness-creator Usage Flow

1. Run create-harness.mjs --target /path/to/project
   → Scaffolds AGENTS.md, feature_list.json, progress.md, init.sh, session-handoff.md
   
2. Edit generated files to add project-specific features and verification commands

3. Run validate-harness.mjs --target /path/to/project
   → Scores five subsystems; identifies lowest-scoring area

4. Run run-benchmark.mjs --target /path/to/project --html /path/to/report.html
   → Generates structural benchmark HTML report

Five Subsystem Phases + Artifacts

Subsystem	Artifact
Instructions	AGENTS.md or CLAUDE.md
State	feature_list.json + progress.md
Verification	init.sh (runnable commands)
Scope	Feature dependencies and done criteria in feature_list.json
Lifecycle	session-handoff.md + end-of-session routine

Approval Gates

None. Course content is self-paced. The harness-creator scripts run non-interactively; the one safeguard mentioned is the --force flag, which must be passed explicitly before a destructive operation proceeds.
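
A --force guard for a scaffolding script typically separates files that are safe to create from files that would be overwritten. The sketch below illustrates that idea only; `planWrites` is a hypothetical helper, and whether the real create-harness.mjs skips existing files or aborts when --force is absent is not stated in the source.

```javascript
// Decide which scaffold files to write.
// Without `force`, files that already exist are skipped, not overwritten.
function planWrites(targetFiles, existingFiles, force = false) {
  const existing = new Set(existingFiles);
  const write = [];
  const skip = [];
  for (const file of targetFiles) {
    if (existing.has(file) && !force) skip.push(file);
    else write.push(file);
  }
  return { write, skip };
}

const scaffold = ["AGENTS.md", "feature_list.json", "progress.md", "init.sh", "session-handoff.md"];

console.log(planWrites(scaffold, ["AGENTS.md", "init.sh"]).skip);
// → ["AGENTS.md", "init.sh"]
console.log(planWrites(scaffold, ["AGENTS.md", "init.sh"], true).skip);
// → []
```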

Disclaimer

The harness-creator validation score is structural ("is it present and coherent?") — not runtime evidence. Real effectiveness requires before/after agent sessions on representative tasks.

Memory Context

walkinglabs/learn-harness-engineering — Memory & Context

The harness-creator artifacts ARE the memory system

The skill scaffolds exactly the memory/state files that agents need:

File	Purpose
`AGENTS.md` or `CLAUDE.md`	Startup context: project conventions, verification commands, definition of done
`feature_list.json`	Current feature backlog with status, dependencies, done criteria
`progress.md`	Session-by-session progress log: what was done, what's next, evidence
`session-handoff.md`	End-of-session artifact: clean restart path for next session
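
To make the role of feature_list.json concrete, here is a small sketch of how status and dependencies could determine what an agent may work on next. The field names (`id`, `status`, `dependsOn`) and status values are assumptions; the real shape is defined by templates/feature-list.schema.json, which is not reproduced in this document.

```javascript
// Hypothetical feature entries, invented for illustration.
const features = [
  { id: "auth", status: "done", dependsOn: [] },
  { id: "search", status: "in_progress", dependsOn: ["auth"] },
  { id: "export", status: "todo", dependsOn: ["search"] },
  { id: "settings", status: "todo", dependsOn: ["auth"] },
];

// Return ids of features that are not done and whose dependencies are all done.
function readyFeatures(list) {
  const done = new Set(list.filter((f) => f.status === "done").map((f) => f.id));
  return list
    .filter((f) => f.status !== "done" && f.dependsOn.every((d) => done.has(d)))
    .map((f) => f.id);
}

console.log(readyFeatures(features));
// → ["search", "settings"]
```

Here "export" is excluded because it depends on "search", which is still in progress. Keeping this rule in data is one way to stop an agent from starting work out of order.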

Cross-Session Memory Pattern

The course teaches that cross-session memory is not about the agent "remembering" — it is about the repository being the system of record. All state lives in files the next session can read on startup.
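
The "repository as system of record" idea can be illustrated with a startup routine that rebuilds context purely from files. This is a sketch under assumptions: `loadSessionContext` and the read order are hypothetical, it looks for AGENTS.md only (a CLAUDE.md variant would need the same treatment), and the file reader is injected so the example does not depend on any particular file-system API.

```javascript
// Files a new session reads on startup, in an assumed order.
const STARTUP_FILES = ["AGENTS.md", "feature_list.json", "progress.md", "session-handoff.md"];

// `readFile(name)` must return the file's text, or null when the file is missing.
function loadSessionContext(readFile) {
  const context = { loaded: {}, missing: [] };
  for (const name of STARTUP_FILES) {
    const text = readFile(name);
    if (text === null) context.missing.push(name);
    else context.loaded[name] = text;
  }
  return context;
}

// Usage with an in-memory stand-in for the repository.
const repoFiles = new Map([
  ["AGENTS.md", "Run init.sh before claiming done."],
  ["progress.md", "Session 3: search endpoint implemented."],
]);
const session = loadSessionContext((name) => (repoFiles.has(name) ? repoFiles.get(name) : null));

console.log(session.missing);
// → ["feature_list.json", "session-handoff.md"]
```

Nothing in the routine relies on what a previous session "remembered"; a missing file shows up as an explicit gap.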

Compaction

Not addressed as a mechanism. The course emphasizes keeping instruction files short ("Keep the root instruction file short: routing and invariants, not a full manual") to avoid context bloat.

Validation

validate-harness.mjs scores the State subsystem based on presence and coherence of feature_list.json and progress.md. Not a runtime evaluation — structural only.

Orchestration

walkinglabs/learn-harness-engineering — Orchestration

harness-creator Skill: No Multi-Agent

The harness-creator skill is single-agent by design. It creates harness infrastructure but does not itself run multiple agents.

Course: Multi-Agent Coverage

The course covers multi-agent patterns in the advanced sections (referenced in harness-creator SKILL.md's reference list: references/multi-agent-pattern.md). The course teaches multi-agent coordination as a harness design concern — what ownership boundaries to establish, how to isolate agents — but does not implement it.

Execution Mode

One-shot (harness-creator scripts run to completion and exit).

Cross-Tool Portability

High — the harness artifacts (AGENTS.md, CLAUDE.md, feature_list.json, etc.) are readable by any AI coding agent. The course teaches tool-agnostic patterns; the skill supports Claude Code and OpenAI agents (via agents/openai.yaml).

UI & CLI Surface

walkinglabs/learn-harness-engineering — UI & CLI Surface

CLI Binary

None for the course. The harness-creator scripts are invoked directly via Node.js:

node skills/harness-creator/scripts/create-harness.mjs --target /path/to/project
node skills/harness-creator/scripts/validate-harness.mjs --target /path/to/project

Local UI / Dashboard

VitePress documentation site (course content, not agent runtime):

npm run dev → local VitePress site
npm run build → static site build
Screenshots captured by Playwright for README previews

PDF Export

npm run pdf:build → generates PDF coursebooks in artifacts/pdfs/. GitHub Actions workflow publishes PDFs to GitHub Releases.

IDE Integration

None — the skill is invoked from the terminal. The agents/openai.yaml agent definition enables use via the OpenAI Agents SDK.

Observability

The run-benchmark.mjs script generates an HTML report with:

Five-subsystem structural scores
Identified bottleneck (lowest-scoring subsystem)
Recommended improvements

This is the only "dashboard" surface — a static HTML report, not a live dashboard.
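
Identifying the bottleneck amounts to picking the lowest of the five scores. The sketch below shows that step in isolation; `findBottleneck`, the 0 to 100 scale, and the sample numbers are assumptions for illustration and are not taken from run-benchmark.mjs.

```javascript
// Return the name of the lowest-scoring subsystem, or null for empty input.
// On a tie, the subsystem listed first wins.
function findBottleneck(scores) {
  const entries = Object.entries(scores);
  if (entries.length === 0) return null;
  return entries.reduce((lowest, current) => (current[1] < lowest[1] ? current : lowest))[0];
}

// Made-up scores on an assumed 0-100 scale.
const sampleScores = {
  Instructions: 80,
  State: 60,
  Verification: 40,
  Scope: 70,
  Lifecycle: 90,
};

console.log(findBottleneck(sampleScores));
// → "Verification"
```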

Related frameworks

same archetype · same primary tool · same memory type

Context-Engineering Handbook ★ 9.0k

A13 Methodology

Provides a first-principles, research-grounded vocabulary and learning path for context engineering — the discipline of designing…

Awesome Harness Engineering (walkinglabs) ★ 2.7k

A13 Methodology

Curate the authoritative reference list of articles, benchmarks, and tools for harness engineering — the practice of shaping the…

cline-memory-bank (nickbaumann98) ★ 581

A13 Methodology

Custom instructions + 6-file hierarchical Markdown memory bank so Cline maintains full project context across sessions, with a…

FPF (First Principles Framework) ★ 372

A13 Methodology

Provides a formal pattern language for making reasoning explicit, traceable, and publishable in mixed human/AI engineering work —…

nexu-io/harness-engineering-guide ★ 134

A13 Methodology

Provide a practical, code-first reference guide to harness engineering — from first principles to production patterns —…

knowhub ★ 40

A13 Methodology

Synchronize AI coding-agent knowledge files (rules, guidelines, templates) from a central source to multiple AI-tool-specific…

Distribution

Type: methodology-doc
License: MIT
Install: clone-and-configure
Version: 0.1.0

Surfaces

CLI binary: No
CLI subcmds: 0
Local UI: web-dashboard
Tech stack: VitePress + Mermaid + Playwright (screenshots) + pdf-lib (PDF export)

Components

Commands: 0
Skills: 1
Subagents: 0
Hooks: 0
MCP servers: 0
MCP tools: 0
Scripts: 5
Templates: 6

Workflow

Phases: 4
Approval gates: 0
Spec format: none
Spec storage: none
Delta or full: none

Orchestration

Multi-agent: No
Pattern: none
Max concurrent: 1
Isolation: none
Consensus: none
Prompt chaining: No

Multi-model

Multi-model: No
BYOK: Yes
Modal: text

Execution

Mode: one-shot
Crash recovery: No
Compaction: No
Session handoff: Yes
Streaming: No

Memory

Type: file-based
Persistence: project
Search: none
State files: 4 files

Quality

TDD: No
TDD mechanism: none
Validators: 1
Self-review: none

Git / Observability

Auto commit: No
Auto PR: No
Auto merge: No
Worktree/feat: No
Audit log: No
Audit format: none
Replay: No

Tools

Primary: generic
Targets: 3
Portability: high

Signals

Stars: 6.6k
Last commit: 2026-05-23
Maintainer: active
Quality score: 0.9/10