
revfactory Claude Code Harness (A/B Research)

revfactory-cc-harness · revfactory/claude-code-harness · ★ 98 · last commit 2026-03-06

Reports a controlled A/B experiment in which Claude Code pre-configuration raised average output quality by 60% (49.5 to 79.3 out of 100) across 15 software engineering tasks.

Best when: harness value should be measured, not asserted; quantitative A/B evidence is the only credible claim.
Skip if: you intend to claim harness effectiveness without a controlled experiment.
vs seeds: superpowers-style multi-agent work, but instrumentalizes it for controlled A/B comparison rather than production delivery. Unlike all…
Primitive shape: 10 total (Commands 4, Skills 3, Subagents 3)
00

Summary

revfactory Claude Code Harness — Summary

This is a research repository, not a user-facing framework. It contains a controlled A/B experiment comparing Claude Code output quality with and without a pre-configured .claude/ harness across 15 software engineering tasks at three difficulty levels. The central finding, reported in a bundled paper, is that harness pre-configuration raises the average quality score from 49.5 to 79.3 (+60%), with the effect growing with task complexity (Expert: +36.2 points).

The harness itself is small: 4 slash commands (/experiment, /evaluate, /report, /run-advanced-experiment), 3 skills (experiment-runner, output-evaluator, report-generator), and an experiments/ directory with YAML test cases, worktree-isolated baseline and harness agents, and JSON result tracking. There is no installable plugin or reusable end-user tool; the value is the research evidence.

Differs from seeds: this is the only "methodology proof" in the catalog rather than a tool. It is analogous to agent-os (markdown scaffold plus proof of concept) but focused on empirical validation rather than opinionated guidelines. The experiment design resembles the parallel fan-out orchestration of superpowers and revfactory-harness, but is instrumented for measurement.
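The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted from the README (the per-tier gains appear in the README excerpt in the Overview); the function names are illustrative, and the unweighted tier mean assumes five cases per tier, as the case numbering 001-015 suggests.

```python
# Figures quoted from the README; everything else is derived arithmetic.
BASELINE_MEAN = 49.5
HARNESS_MEAN = 79.3
TIER_GAINS = {"Basic": 23.8, "Advanced": 29.6, "Expert": 36.2}


def absolute_gain(baseline: float, harness: float) -> float:
    """Point difference between the two arms."""
    return harness - baseline


def relative_gain_pct(baseline: float, harness: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return (harness - baseline) / baseline * 100


def mean_tier_gain(tier_gains: dict) -> float:
    """Unweighted mean of the per-tier gains (assumes equal-sized tiers)."""
    return sum(tier_gains.values()) / len(tier_gains)


if __name__ == "__main__":
    print(round(absolute_gain(BASELINE_MEAN, HARNESS_MEAN), 1))
    print(round(relative_gain_pct(BASELINE_MEAN, HARNESS_MEAN), 1))
    print(round(mean_tier_gain(TIER_GAINS), 1))
```

The relative gain works out to roughly 60.2%, which matches the reported "+60%", and the mean of the three tier gains agrees with the overall 29.8-point gain to within rounding.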

01

Overview

revfactory Claude Code Harness — Overview

Origin

GitHub: https://github.com/revfactory/claude-code-harness
Stars: 98
Language: HTML (paper visualizations)
Last commit: 2026-03-06
No license declared.

Philosophy

This repository treats "harness" not as an installable tool but as an empirical claim: that providing structured pre-configuration to Claude Code demonstrably improves output quality. The README presents itself as a research paper with methodology, results tables, and dimension-by-dimension analysis.

From the README:

"In a controlled A/B experiment across 15 software engineering tasks, Harness improved average quality from 49.5 to 79.3 points (out of 100) — a 60% improvement. The effect scales with complexity: Basic +23.8, Advanced +29.6, Expert +36.2."

"Harness lives in the .claude/ directory and provides four types of guidance: architectural blueprint (CLAUDE.md), skills (domain-specific knowledge), agents (role decomposition), and commands (workflow orchestration)."

Key Design Decisions

  1. Research-first: the repository exists to prove a claim, not to ship a product.
  2. Parallel agent teams: the experiment uses BaselineAgent and HarnessAgent running in parallel on the same task, then EvaluatorAgent comparing outputs — a three-agent experiment harness.
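  3. Multi-dimension scoring: the evaluator prompt scores five dimensions (completeness, code quality, efficiency, accuracy, and structure) at 0-10 each, while the README reports totals out of 100 and the output-evaluator skill is described as using 10 dimensions. How the five prompt dimensions map onto the 100-point scale is not stated.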
  4. Worktree isolation: baseline and harness experiments run in separate worktrees to prevent contamination.
02

Architecture

revfactory Claude Code Harness — Architecture

Distribution

Not installable as a user-facing plugin. This is a research repository. The .claude/ configuration functions as a self-contained experiment environment.

Directory Tree

claude-code-harness/
├── .claude/
│   ├── CLAUDE.md              # Project memory / experiment context
│   ├── commands/
│   │   ├── experiment.md      # /experiment — run A/B test with agent teams
│   │   ├── evaluate.md        # /evaluate — compare outputs
│   │   ├── report.md          # /report — generate summary report
│   │   └── run-advanced-experiment.md
│   └── skills/
│       ├── experiment-runner.md
│       ├── output-evaluator.md
│       └── report-generator.md
├── experiments/
│   ├── cases/                 # YAML test case definitions
│   ├── results/               # Per-case baseline/ and harness/ directories
│   └── reports/               # Aggregated reports
├── paper/
│   └── figures/               # Charts used in the paper
└── README.md                  # Acts as the research paper

Required Runtime

  • Claude Code (no version specified)
  • No additional dependencies

Target AI Tools

Claude Code only (experiment design is Claude Code-specific).

03

Components

revfactory Claude Code Harness — Components

Commands (4)

Name                               Purpose
/experiment [category|all]         Runs the A/B experiment: spawns BaselineAgent and HarnessAgent in parallel, then EvaluatorAgent
/evaluate [case-id]                Evaluates a specific case's baseline vs harness outputs
/report [full|summary|comparison]  Generates an aggregated report from all results
/run-advanced-experiment           Advanced variant with additional controls

Skills (3)

Name               Purpose
experiment-runner  Orchestrates parallel agent team execution
output-evaluator   Scores outputs on 10 quality dimensions
report-generator   Aggregates results into a summary report

Agents (3, experiment-specific)

These are NOT defined as .claude/agents/*.md files — they are instantiated inline within the experiment command:

Name            Role
BaselineAgent   Runs the task with no harness, saves to experiments/results/{case-id}/baseline/
HarnessAgent    Runs the task with the full harness, saves to experiments/results/{case-id}/harness/
EvaluatorAgent  Compares outputs on 5 dimensions, scores 0-10 each

Test Cases (15 documented)

Difficulty          Cases
Basic (001-005)     REST API, Bug Fix, Refactoring, README, CLI Tool
Advanced (006-010)  Interpreter, Microservice, SQL Engine, CRDT Editor
Expert (011-015)    Raft Consensus, Bytecode VM, Event Sourcing, LSP Server
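The numbering in the table above maps directly to a difficulty tier. A minimal sketch of that lookup follows; it assumes case IDs are the bare zero-padded numbers shown in the table (the repository may use a different ID format), and `difficulty_for` is a hypothetical helper, not something the repository defines.

```python
def difficulty_for(case_id: str) -> str:
    """Map a case ID such as "007" to its difficulty tier.

    Assumes IDs are zero-padded integers in the range 001-015,
    following the ranges listed in the test case table.
    """
    number = int(case_id)
    if 1 <= number <= 5:
        return "Basic"
    if 6 <= number <= 10:
        return "Advanced"
    if 11 <= number <= 15:
        return "Expert"
    raise ValueError(f"case id out of range: {case_id!r}")


if __name__ == "__main__":
    for cid in ("001", "007", "015"):
        print(cid, difficulty_for(cid))
```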
05

Prompts

revfactory Claude Code Harness — Prompts

Excerpt 1: BaselineAgent inline prompt (from /experiment command)

Technique: Explicit constraint-by-exclusion (tell agent what NOT to use)

(Translated from the Korean original.)

You are BaselineAgent. Perform the task below using only the plain built-in tools.
- Do not use Skills, Commands, or Agent team patterns
- Work naturally, without any separate structuring guide
- Save the output to experiments/results/{case-id}/baseline/

Task: {case description}

When finished, save the following metadata as result.json:
- files_created: list of files created
- total_lines: total lines of code
- approach: description of the approach (2-3 sentences)

Excerpt 2: HarnessAgent inline prompt (from /experiment command)

Technique: Explicit maximization mandate for harness features

(Translated from the Korean original.)

You are HarnessAgent. Perform the task below making maximum use of the Harness system.
- Split the work across subagents and process it in parallel
- Design a systematic project structure before implementing
- Actively apply modularization, design patterns, and error handling
- Save the output to experiments/results/{case-id}/harness/

Task: {case description}

When finished, save the following metadata as result.json:
- files_created: list of files created
- total_lines: total lines of code
- agents_used: number of subagents used
- approach: description of the approach (2-3 sentences)
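Both execution prompts above carry two placeholders: the case ID inside the output path and the case description. The sketch below shows one way such a template could be filled in. The shortened template text, the `{description}` placeholder name, and `render_prompt` are assumptions for illustration; in the repository the substitution is performed by Claude Code when it follows the /experiment command, not by a script.

```python
# Abbreviated, hypothetical stand-in for the BaselineAgent prompt.
BASELINE_TEMPLATE = (
    "You are BaselineAgent. Perform the task below using only the plain built-in tools.\n"
    "- Save the output to experiments/results/{case-id}/baseline/\n"
    "\n"
    "Task: {description}\n"
)


def render_prompt(template: str, case_id: str, description: str) -> str:
    """Fill the two placeholders with plain string replacement.

    Plain replacement is used so that any other braces in the prompt
    text are left untouched.
    """
    return template.replace("{case-id}", case_id).replace("{description}", description)


if __name__ == "__main__":
    print(render_prompt(BASELINE_TEMPLATE, "001", "Build a REST API"))
```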

Excerpt 3: EvaluatorAgent prompt (from /experiment command)

Technique: Structured multi-dimension scoring rubric

(Translated from the Korean original.)

You are EvaluatorAgent. Compare and evaluate the two outputs below.

Evaluation criteria (0-10 points each):
1. Completeness
2. Code Quality
3. Efficiency
4. Accuracy
5. Structure

Save the evaluation result as JSON.
09

Uniqueness

revfactory Claude Code Harness — Uniqueness

differs_from_seeds

No seed is purely a research harness: this is the only framework in the catalog whose primary deliverable is empirical evidence rather than an installable tool. The closest analogy is agent-os (Archetype 4: markdown scaffold, zero primitives), but agent-os ships reusable markdown templates, while revfactory-cc-harness ships a measurement apparatus. The experiment design borrows the parallel fan-out pattern that superpowers and revfactory-harness implement, but instruments it for controlled comparison rather than production use. It is most notable as the only framework in the catalog that offers quantitative evidence that .claude/ pre-configuration improves LLM coding quality.

Positioning

This is a proof-of-concept / research paper repository. The "harness" is both the experimental apparatus and the subject of the experiment. Not suitable for production use as a standalone tool — the value is the validation data.

Observable Failure Modes

  1. No install path: there is no way to use this as a standalone harness outside the experimental context.
  2. Korean-only documentation: README and skill files are in Korean; most users cannot evaluate the quality without translation.
  3. Inline agent prompts: agents are defined as inline strings within command files, not as reusable .claude/agents/*.md definitions — they cannot be reused outside the experiment context.
  4. No license: no license declared, making derivative use legally uncertain.
  5. Dormant: last commit 2026-03-06; likely not actively maintained.
04

Workflow

revfactory Claude Code Harness — Workflow

Experiment Phases

Phase               Action                                           Artifact                                 Gate
Case load           Read YAML from experiments/cases/                Test case list                           none
Parallel execution  BaselineAgent + HarnessAgent run the same task   result.json in baseline/ and harness/    none
Evaluation          EvaluatorAgent scores both outputs               Evaluation JSON with 5-dimension scores  none
Reporting           report-generator skill (via /report) aggregates  Summary report                           none

Approval Gates

None. The experiment is fully automated once triggered.

Result Schema (per case)

{
  "files_created": ["list of files"],
  "total_lines": 0,
  "approach": "description",
  "agents_used": 0
}
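A minimal sketch of loading this schema into a typed record follows. It assumes `agents_used` is optional, since only the HarnessAgent prompt asks for it; `RunResult` and `parse_result` are hypothetical names, not part of the repository.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    """Typed view of one result.json file."""
    files_created: List[str]
    total_lines: int
    approach: str
    agents_used: Optional[int] = None  # absent for baseline runs (assumption)


def parse_result(text: str) -> RunResult:
    """Parse the JSON text of a result.json file, failing on missing required keys."""
    data = json.loads(text)
    return RunResult(
        files_created=list(data["files_created"]),
        total_lines=int(data["total_lines"]),
        approach=str(data["approach"]),
        agents_used=data.get("agents_used"),
    )


if __name__ == "__main__":
    sample = '{"files_created": ["app.py"], "total_lines": 120, "approach": "single file"}'
    print(parse_result(sample))
```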

Evaluation Dimensions

Each scored 0-10:

  1. Completeness
  2. Code Quality
  3. Efficiency
  4. Accuracy
  5. Structure/Architecture
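The five dimensions above lend themselves to a simple validation and comparison step. The sketch below is an assumption about how evaluation scores might be represented: the dictionary keys and both helper functions are hypothetical, since the evaluator's JSON layout is not documented here. With five dimensions at 0-10 each, the raw total tops out at 50.

```python
DIMENSIONS = ("completeness", "code_quality", "efficiency", "accuracy", "structure")


def total_score(scores: dict) -> float:
    """Validate one arm's scores and return their sum (maximum 50)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} out of range 0-10: {scores[name]}")
    return sum(scores[name] for name in DIMENSIONS)


def per_dimension_delta(baseline: dict, harness: dict) -> dict:
    """Harness score minus baseline score for each dimension."""
    total_score(baseline)  # validation only
    total_score(harness)
    return {name: harness[name] - baseline[name] for name in DIMENSIONS}


if __name__ == "__main__":
    # Illustrative values only, not results from the experiment.
    base = dict(zip(DIMENSIONS, (5, 4, 6, 5, 3)))
    harn = dict(zip(DIMENSIONS, (8, 8, 7, 8, 9)))
    print(total_score(base), total_score(harn), per_dimension_delta(base, harn))
```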
06

Memory Context

revfactory Claude Code Harness — Memory & Context

State Storage

File-based, project-scoped:

  • experiments/cases/ — YAML test case definitions (input)
  • experiments/results/{case-id}/baseline/ — baseline agent output
  • experiments/results/{case-id}/harness/ — harness agent output
  • experiments/results/{case-id}/result.json — per-run metadata
  • experiments/reports/ — aggregated reports
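The layout above can be expressed as a small path helper. This is only a sketch: `output_dir` and the `arm` parameter are hypothetical names, and the paths are built with `PurePosixPath` so nothing on disk is touched.

```python
from pathlib import PurePosixPath

RESULTS_ROOT = PurePosixPath("experiments/results")
ARMS = ("baseline", "harness")


def output_dir(case_id: str, arm: str) -> PurePosixPath:
    """Directory where one agent's output for one case is stored."""
    if arm not in ARMS:
        raise ValueError(f"unknown arm: {arm!r}")
    return RESULTS_ROOT / case_id / arm


if __name__ == "__main__":
    for arm in ARMS:
        print(output_dir("001", arm))
```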

Memory Type

File-based, per-experiment.

Cross-Session Persistence

Results persist in the filesystem. New experiment runs can build on existing results.

Context Compaction

Not addressed — experiments are bounded tasks.

07

Orchestration

revfactory Claude Code Harness — Orchestration

Multi-Agent Architecture

Yes. The experiment design uses three agents, two of which run concurrently:

  • BaselineAgent (general-purpose, worktree-isolated)
  • HarnessAgent (general-purpose, worktree-isolated)
  • EvaluatorAgent (sequential, after both complete)

Orchestration Pattern

parallel-fan-out with sequential evaluation: BaselineAgent and HarnessAgent run in parallel on the same task; EvaluatorAgent runs after both complete.
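The shape of this pattern can be sketched with ordinary Python threads. This is an analogy only: the repository spawns its agents through Claude Code, not through Python, and `run_case` plus the three callables are hypothetical stand-ins for the agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_case(
    task: str,
    run_baseline: Callable[[str], Any],
    run_harness: Callable[[str], Any],
    evaluate: Callable[[Any, Any], Any],
) -> Any:
    """Fan out the two arms in parallel, then evaluate sequentially."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        baseline_future = pool.submit(run_baseline, task)
        harness_future = pool.submit(run_harness, task)
        # result() blocks, so evaluation starts only after both arms finish.
        baseline_output = baseline_future.result()
        harness_output = harness_future.result()
    return evaluate(baseline_output, harness_output)


if __name__ == "__main__":
    verdict = run_case(
        "build a CLI tool",
        lambda t: f"baseline output for {t}",
        lambda t: f"harness output for {t}",
        lambda b, h: {"baseline": b, "harness": h},
    )
    print(verdict)
```

Submitting both arms before waiting on either is what makes this a fan-out; calling `result()` on the first future before submitting the second would serialize the runs.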

Isolation Mechanism

git-worktree: the command explicitly specifies worktree isolation for each experiment agent to prevent cross-contamination.
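For readers unfamiliar with the mechanism, the sketch below shows how a separate worktree per experiment arm could be created with the standard `git worktree add` subcommand. The path and branch naming scheme are assumptions for illustration; the repository delegates this to Claude Code rather than running a script.

```python
import subprocess
from typing import List


def worktree_add_command(path: str, branch: str) -> List[str]:
    """Build the git command that creates a new worktree on a new branch."""
    return ["git", "worktree", "add", "-b", branch, path]


def create_worktree(repo_dir: str, path: str, branch: str) -> None:
    """Run the command inside repo_dir; raises CalledProcessError on failure."""
    subprocess.run(worktree_add_command(path, branch), cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Hypothetical naming: one worktree per case and arm.
    for arm in ("baseline", "harness"):
        print(" ".join(worktree_add_command(f"../wt/001-{arm}", f"exp/001-{arm}")))
```

Because each worktree has its own working directory and branch, files written by one agent are not visible in the other agent's checkout.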

Multi-Model

No. All agents use the default session model.

Execution Mode

One-shot per /experiment invocation.

Multi-Agent Spawn

Via Claude Code's Task tool (inline prompts in command files, not .claude/agents/ definitions).

Consensus

None. EvaluatorAgent produces a single verdict.

08

UI & CLI Surface

revfactory Claude Code Harness — UI & CLI Surface

Dedicated CLI Binary

No.

Local UI

None.

Slash Commands (Claude Code)

  • /experiment [category|all] — run A/B comparison experiment
  • /evaluate [case-id] — evaluate single case
  • /report [full|summary|comparison] — generate report
  • /run-advanced-experiment — advanced experiment variant

Observability

Results are written as JSON to experiments/results/ and aggregated to experiments/reports/. The paper figures in paper/figures/ are the primary observability output.
