revfactory Claude Code Harness (A/B Research)

revfactory-cc-harness · revfactory/claude-code-harness · ★ 98 · last commit 2026-03-06

Empirically proves that Claude Code pre-configuration improves output quality by 60% across 15 software engineering tasks.

Best when: Harness value should be measured, not asserted; quantitative A/B evidence is the only credible claim.

Skip if: Claiming harness effectiveness without a controlled experiment

vs seeds

superpowers-style multi-agent work, but instruments it for controlled A/B comparison rather than production delivery. Unlike all…

Primitive shape 10 total

Commands 4 Skills 3 Subagents 3

Summary

revfactory Claude Code Harness — Summary

This is a research repository, not a user-facing framework. It contains a controlled A/B experiment that compares Claude Code output quality with and without a pre-configured .claude/ harness across 15 software engineering tasks at three difficulty levels. The central finding, reported in a bundled paper, is that harness pre-configuration raises the average quality score from 49.5 to 79.3 out of 100 (+29.8 points, about +60%), and that the effect grows with task complexity (Basic +23.8, Advanced +29.6, Expert +36.2 points).

The harness itself is small: 4 slash commands (/experiment, /evaluate, /report, /run-advanced-experiment), 3 skills (experiment-runner, output-evaluator, report-generator), 3 inline experiment agents (BaselineAgent, HarnessAgent, EvaluatorAgent), and an experiments/ directory with YAML test cases, worktree-isolated baseline and harness runs, and JSON result tracking. There is no installable plugin or reusable end-user tool; the value is the research evidence.

Differs from seeds: this is the only "methodology proof" in the catalog rather than a tool. It is analogous to agent-os (markdown scaffold plus proof of concept) but focused on empirical validation rather than opinionated guidelines. The experiment design resembles the parallel-fan-out orchestration of superpowers and revfactory-harness, but instrumented for measurement.

Overview

revfactory Claude Code Harness — Overview

Origin

GitHub: https://github.com/revfactory/claude-code-harness
Stars: 98
Language: HTML (paper visualizations)
Last commit: 2026-03-06
No license declared.

Philosophy

This repository treats "harness" not as an installable tool but as an empirical claim: that providing structured pre-configuration to Claude Code demonstrably improves output quality. The README presents itself as a research paper with methodology, results tables, and dimension-by-dimension analysis.

From the README:

"In a controlled A/B experiment across 15 software engineering tasks, Harness improved average quality from 49.5 to 79.3 points (out of 100) — a 60% improvement. The effect scales with complexity: Basic +23.8, Advanced +29.6, Expert +36.2."

"Harness lives in the .claude/ directory and provides four types of guidance: architectural blueprint (CLAUDE.md), skills (domain-specific knowledge), agents (role decomposition), and commands (workflow orchestration)."
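The headline figure in the first README quote is simple arithmetic on the two averages. A minimal sketch that rechecks it, using only the numbers quoted above (the function name is illustrative, not something from the repository):

```python
# Averages quoted in the README excerpt (points out of 100).
BASELINE_AVG = 49.5
HARNESS_AVG = 79.3


def relative_improvement(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


point_gain = HARNESS_AVG - BASELINE_AVG
pct_gain = relative_improvement(BASELINE_AVG, HARNESS_AVG) * 100

print(f"point gain: {point_gain:.1f}")    # point gain: 29.8
print(f"relative gain: {pct_gain:.1f}%")  # relative gain: 60.2%
```

The "60%" claim is therefore a relative gain over the baseline score; the absolute gain is 29.8 points.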

Key Design Decisions

Research-first: the repository exists to prove a claim, not to ship a product.
Parallel agent teams: the experiment uses BaselineAgent and HarnessAgent running in parallel on the same task, then EvaluatorAgent comparing outputs — a three-agent experiment harness.
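Dimension scoring: quality is scored per dimension on a 0-10 scale and reported as a total out of 100. The EvaluatorAgent prompt names five dimensions (completeness, code quality, efficiency, accuracy, structure), while the output-evaluator skill is described as scoring 10.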
Worktree isolation: baseline and harness experiments run in separate worktrees to prevent contamination.

Architecture

revfactory Claude Code Harness — Architecture

Distribution

Not installable as a user-facing plugin. This is a research repository. The .claude/ configuration functions as a self-contained experiment environment.

Directory Tree

claude-code-harness/
├── .claude/
│   ├── CLAUDE.md              # Project memory / experiment context
│   ├── commands/
│   │   ├── experiment.md      # /experiment — run A/B test with agent teams
│   │   ├── evaluate.md        # /evaluate — compare outputs
│   │   ├── report.md          # /report — generate summary report
│   │   └── run-advanced-experiment.md
│   └── skills/
│       ├── experiment-runner.md
│       ├── output-evaluator.md
│       └── report-generator.md
├── experiments/
│   ├── cases/                 # YAML test case definitions
│   ├── results/               # Per-case baseline/ and harness/ directories
│   └── reports/               # Aggregated reports
├── paper/
│   └── figures/               # Charts used in the paper
└── README.md                  # Acts as the research paper

Required Runtime

Claude Code (no version specified)
No additional dependencies

Target AI Tools

Claude Code only (experiment design is Claude Code-specific).

Components

revfactory Claude Code Harness — Components

Commands (4)

Name	Purpose
`/experiment [category|all]`	Runs A/B experiment: spawns BaselineAgent + HarnessAgent in parallel, then EvaluatorAgent
`/evaluate [case-id]`	Evaluates a specific case's baseline vs harness outputs
`/report [full|summary|comparison]`	Generates aggregated report from all results
`/run-advanced-experiment`	Advanced variant with additional controls

Skills (3)

Name	Purpose
experiment-runner	Orchestrates parallel agent team execution
output-evaluator	Scores outputs on 10 quality dimensions
report-generator	Aggregates results into summary report

Agents (3, experiment-specific)

These are NOT defined as .claude/agents/*.md files — they are instantiated inline within the experiment command:

Name	Role
BaselineAgent	Runs task with no harness, saves to `experiments/results/{case-id}/baseline/`
HarnessAgent	Runs task with full harness, saves to `experiments/results/{case-id}/harness/`
EvaluatorAgent	Compares outputs on 5 dimensions, scores 0-10 each

Test Cases (15 documented)

Difficulty	Cases
Basic (001-005)	REST API, Bug Fix, Refactoring, README, CLI Tool
Advanced (006-010)	Interpreter, Microservice, SQL Engine, CRDT Editor
Expert (011-015)	Raft Consensus, Bytecode VM, Event Sourcing, LSP Server

Prompts

revfactory Claude Code Harness — Prompts

Excerpt 1: BaselineAgent inline prompt (from /experiment command)

Technique: Explicit constraint-by-exclusion (tell agent what NOT to use)

You are BaselineAgent. Perform the task below using only the plain built-in tools.
- Do not use Skills, Commands, or Agent team patterns
- Work naturally, without any separate structuring guide
- Save the output to experiments/results/{case-id}/baseline/

Task: {case description}

When finished, save the following metadata as result.json:
- files_created: list of files created
- total_lines: total lines of code
- approach: description of the approach (2-3 sentences)

(Translated from the Korean original.)

Excerpt 2: HarnessAgent inline prompt (from /experiment command)

Technique: Explicit maximization mandate for harness features

You are HarnessAgent. Perform the task below making maximum use of the Harness system.
- Split the work across sub-agents and process it in parallel
- Design a systematic project structure first, then implement it
- Actively apply modularization, design patterns, and error handling
- Save the output to experiments/results/{case-id}/harness/

Task: {case description}

When finished, save the following metadata as result.json:
- files_created: list of files created
- total_lines: total lines of code
- agents_used: number of sub-agents used
- approach: description of the approach (2-3 sentences)

(Translated from the Korean original.)
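Both agent prompts carry a `{case-id}` placeholder and a task-description slot. The excerpts do not show how these are filled (the model running the command may simply interpret them), so the following is only a sketch assuming plain string substitution. `render_prompt`, the shortened English template, and the `{description}` placeholder name are all hypothetical.

```python
# Hypothetical, shortened English rendering of the BaselineAgent prompt.
# "{description}" is an assumed placeholder name for the task text.
BASELINE_TEMPLATE = (
    "You are BaselineAgent. Perform the task below using only basic tools.\n"
    "Save the output to experiments/results/{case-id}/baseline/\n"
    "Task: {description}\n"
)


def render_prompt(template: str, case_id: str, description: str) -> str:
    """Fill the two placeholders by literal replacement.

    Literal replacement leaves any other braces in the prompt untouched,
    which matters if the task text itself contains JSON or code.
    """
    return (
        template
        .replace("{case-id}", case_id)
        .replace("{description}", description)
    )


# "001" follows the Basic (001-005) numbering; the task text is made up.
prompt = render_prompt(BASELINE_TEMPLATE, "001", "Build a small REST API")
print(prompt)
```

The same helper would serve the HarnessAgent prompt, since it differs only in its instructions and in the `harness/` output directory.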

Excerpt 3: EvaluatorAgent prompt (from /experiment command)

Technique: Structured multi-dimension scoring rubric

You are EvaluatorAgent. Compare and evaluate the two outputs below.

Evaluation criteria (0-10 points each):
1. Completeness
2. Code Quality
3. Efficiency
4. Accuracy
5. Structure

Save the evaluation result as JSON.

(Translated from the Korean original.)

Uniqueness

revfactory Claude Code Harness — Uniqueness

differs_from_seeds

No seed is purely a research harness: this is the only framework in the catalog whose primary deliverable is empirical evidence rather than an installable tool. The closest analogy is agent-os (Archetype 4: markdown scaffold, zero primitives), but agent-os ships reusable markdown templates, while revfactory-cc-harness ships a measurement apparatus. The experiment design borrows the parallel-fan-out pattern that superpowers and revfactory-harness implement, but instruments it for controlled comparison rather than production use. It is most notable as the only framework in the catalog that offers quantitative evidence that .claude/ pre-configuration improves LLM coding quality.

Positioning

This is a proof-of-concept / research paper repository. The "harness" is both the experimental apparatus and the subject of the experiment. Not suitable for production use as a standalone tool — the value is the validation data.

Observable Failure Modes

No install path: there is no way to use this as a standalone harness outside the experimental context.
Korean-only documentation: README and skill files are in Korean; readers who do not read Korean cannot evaluate them without translation.
Inline agent prompts: agents are defined as inline strings within command files, not as reusable .claude/agents/*.md definitions — they cannot be reused outside the experiment context.
No license: no license declared, making derivative use legally uncertain.
Dormant: last commit 2026-03-06; likely not actively maintained.

Workflow

revfactory Claude Code Harness — Workflow

Experiment Phases

Phase	Action	Artifact	Gate
Case load	Read YAML from experiments/cases/	test case list	none
Parallel execution	BaselineAgent + HarnessAgent run same task	result.json in baseline/ and harness/	none
Evaluation	EvaluatorAgent scores both outputs	evaluation JSON with 5-dimension scores	none
Reporting	report-generator skill (via /report) aggregates results	summary report	none

Approval Gates

None. The experiment is fully automated once triggered.

Result Schema (per case)

{
  "files_created": ["list of files"],
  "total_lines": 0,
  "approach": "description",
  "agents_used": 0
}
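A minimal sketch of reading a record with the schema above into a typed structure. The class and function names are hypothetical. The default for `agents_used` is an assumption made because the BaselineAgent prompt does not ask for that field.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """One result.json record, mirroring the four fields in the schema."""
    files_created: list[str] = field(default_factory=list)
    total_lines: int = 0
    approach: str = ""
    agents_used: int = 0  # assumed default: only HarnessAgent is asked for this


def load_result(text: str) -> RunResult:
    """Parse a result.json payload, tolerating missing optional fields."""
    data = json.loads(text)
    return RunResult(
        files_created=list(data.get("files_created", [])),
        total_lines=int(data.get("total_lines", 0)),
        approach=str(data.get("approach", "")),
        agents_used=int(data.get("agents_used", 0)),
    )


# Illustrative payload, not taken from the repository's results.
sample = '{"files_created": ["app.py"], "total_lines": 120, "approach": "single file"}'
result = load_result(sample)
print(result.total_lines, result.agents_used)  # 120 0
```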

Evaluation Dimensions

Each scored 0-10:

Completeness
Code Quality
Efficiency
Accuracy
Structure/Architecture
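The five dimensions above lend themselves to a per-dimension comparison. The JSON layout that EvaluatorAgent actually writes is not shown in the excerpts, so this sketch assumes a hypothetical record that maps each dimension name to a 0-10 score. The totals here are out of 50; how the repository arrives at its 100-point scale is not documented in the excerpts.

```python
# Dimension keys are assumed spellings of the five listed dimensions.
DIMENSIONS = ("completeness", "code_quality", "efficiency", "accuracy", "structure")


def validate(scores: dict) -> None:
    """Raise ValueError unless every dimension has a score in 0..10."""
    for name in DIMENSIONS:
        if name not in scores:
            raise ValueError(f"missing dimension: {name}")
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} out of range: {scores[name]}")


def compare(baseline: dict, harness: dict) -> dict:
    """Return totals for both variants and the harness-minus-baseline deltas."""
    validate(baseline)
    validate(harness)
    return {
        "baseline_total": sum(baseline[n] for n in DIMENSIONS),
        "harness_total": sum(harness[n] for n in DIMENSIONS),
        "delta": {n: harness[n] - baseline[n] for n in DIMENSIONS},
    }


# Illustrative scores only; these are not results from the experiment.
example_baseline = {"completeness": 5, "code_quality": 4, "efficiency": 6,
                    "accuracy": 5, "structure": 3}
example_harness = {"completeness": 8, "code_quality": 8, "efficiency": 7,
                   "accuracy": 8, "structure": 9}

summary = compare(example_baseline, example_harness)
print(summary["baseline_total"], summary["harness_total"])  # 23 40
print(summary["delta"]["structure"])                        # 6
```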

Memory Context

revfactory Claude Code Harness — Memory & Context

State Storage

File-based, project-scoped:

experiments/cases/ — YAML test case definitions (input)
experiments/results/{case-id}/baseline/ — baseline agent output
experiments/results/{case-id}/harness/ — harness agent output
experiments/results/{case-id}/result.json — per-run metadata
experiments/reports/ — aggregated reports
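The layout above is regular enough to derive paths from a case ID. A small sketch, with hypothetical helper names, that builds the per-variant output directories listed above:

```python
from pathlib import PurePosixPath

RESULTS_ROOT = PurePosixPath("experiments/results")
VARIANTS = ("baseline", "harness")


def output_dir(case_id: str, variant: str) -> PurePosixPath:
    """Return experiments/results/{case-id}/{variant} for a known variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    return RESULTS_ROOT / case_id / variant


# "001" follows the Basic (001-005) numbering used in the test case table.
for variant in VARIANTS:
    print(output_dir("001", variant))
# experiments/results/001/baseline
# experiments/results/001/harness
```

`PurePosixPath` is used so the example only manipulates path strings and never touches the filesystem.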

Memory Type

File-based, per-experiment.

Cross-Session Persistence

Results persist in the filesystem. New experiment runs can build on existing results.

Context Compaction

Not addressed — experiments are bounded tasks.

Orchestration

revfactory Claude Code Harness — Orchestration

Multi-Agent Architecture

Yes. The experiment design uses three agents, two of which run concurrently:

BaselineAgent (general-purpose, worktree-isolated)
HarnessAgent (general-purpose, worktree-isolated)
EvaluatorAgent (sequential, after both complete)

Orchestration Pattern

parallel-fan-out with sequential evaluation: BaselineAgent and HarnessAgent run in parallel on the same task; EvaluatorAgent runs after both complete.
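The shape of this pattern can be sketched with ordinary Python concurrency. This is an analogy only: the repository spawns Claude Code agents from command files, not Python threads, and the three functions below are stand-ins invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_baseline(task: str) -> str:
    """Stand-in for BaselineAgent: returns a placeholder output."""
    return f"baseline output for {task}"


def run_harness(task: str) -> str:
    """Stand-in for HarnessAgent: returns a placeholder output."""
    return f"harness output for {task}"


def evaluate(baseline_out: str, harness_out: str) -> dict:
    """Stand-in for EvaluatorAgent: just pairs the two outputs."""
    return {"baseline": baseline_out, "harness": harness_out}


def run_case(task: str) -> dict:
    # Fan out: both variants work on the same task at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        baseline_future = pool.submit(run_baseline, task)
        harness_future = pool.submit(run_harness, task)
        # Join: block until both variants have finished.
        baseline_out = baseline_future.result()
        harness_out = harness_future.result()
    # Sequential step: evaluation starts only after both results exist.
    return evaluate(baseline_out, harness_out)


print(run_case("REST API"))
```

The evaluator sits outside the pool on purpose: it depends on both outputs, so it cannot join the fan-out.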

Isolation Mechanism

git-worktree: the command explicitly specifies worktree isolation for each experiment agent to prevent cross-contamination.
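How the command requests worktree isolation from Claude Code is not shown in the excerpts. As a rough illustration of what the isolation amounts to at the git level, the sketch below builds plain `git worktree add` commands; the paths, branch names, and helper names are hypothetical.

```python
import subprocess


def worktree_add_cmd(path: str, branch: str) -> list:
    """Build `git worktree add -b <branch> <path>`.

    This creates a new branch and checks it out in a separate working
    directory, so files written by one agent are invisible to the other.
    """
    return ["git", "worktree", "add", "-b", branch, path]


def create_worktree(path: str, branch: str) -> None:
    """Run the command; must be called from inside an existing git repository."""
    subprocess.run(worktree_add_cmd(path, branch), check=True)


# Hypothetical naming scheme: one worktree per variant of case 001.
for variant in ("baseline", "harness"):
    cmd = worktree_add_cmd(f"../wt-001-{variant}", f"exp/001-{variant}")
    print(" ".join(cmd))
```

Only the command construction is exercised here; `create_worktree` is left uncalled because it modifies the repository it runs in.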

Multi-Model

No. All agents use the default session model.

Execution Mode

One-shot per /experiment invocation.

Multi-Agent Spawn

Via Claude Code's Task tool (inline prompts in command files, not .claude/agents/ definitions).

Consensus

None. EvaluatorAgent produces a single verdict.

UI CLI Surface

revfactory Claude Code Harness — UI & CLI Surface

Dedicated CLI Binary

No.

Local UI

None.

Slash Commands (Claude Code)

/experiment [category|all] — run A/B comparison experiment
/evaluate [case-id] — evaluate single case
/report [full|summary|comparison] — generate report
/run-advanced-experiment — advanced experiment variant

Observability

Results are written as JSON to experiments/results/ and aggregated to experiments/reports/. The paper figures in paper/figures/ are the primary observability output.

Distribution

Type: standalone-repo
Install: clone-and-configure

Surfaces

CLI binary: No
CLI subcmds: 0
Local UI: No
Tech stack: none

Components

Commands: 4
Skills: 3
Subagents: 3
Hooks: 0
MCP servers: 0
MCP tools: 0
Scripts: 0
Templates: 0

Workflow

Phases: 4
Approval gates: 0
Spec format: yaml
Spec storage: per-feature-folder
Delta or full: whole-file

Orchestration

Multi-agent: Yes
Pattern: parallel-fan-out
Max concurrent: 2
Isolation: git-worktree
Consensus: none
Prompt chaining: Yes

Multi-model

Multi-model: No
BYOK: No
Modal: text

Execution

Mode: one-shot
Crash recovery: No
Compaction: No
Session handoff: No
Streaming: No

Memory

Type: file-based
Persistence: project
Search: none
State files: 3 files

Quality

TDD: No
TDD mechanism: none
Self-review: adversarial-subagent

Git / Observability

Auto commit: No
Auto PR: No
Auto merge: No
Worktree/feat: Yes
Audit log: Yes
Audit format: structured-md
Replay: Yes

Tools

Primary: claude-code
Targets: 1
Portability: single-tool

Signals

Stars: 98
Last commit: 2026-03-06
Contributors: 1
Maintainer: dormant
Quality score: 3.7/10