evo

evo · evo-hq/evo · ★ 770 · last commit 2026-05-26

Autonomously optimize code through parallel tree-search experiments with shared state, gate-validated results, and configurable frontier strategies.

Best when: Code optimization should be a tree search with gates, not a greedy hill climb; multiple parallel directions prevent premature convergence to degenerate solu…
Skip if: Committing evo artifacts to main branch; running without gates (degenerate optimization)
vs seeds: superpowers in its skills-only architecture, but evo uses those skills for autonomous code optimization via tree search while superp…
Primitive shape: 13 total (Skills 5, Subagents 5, Hooks 3)
00

Summary

evo — Summary

evo is a Python CLI plus multi-host plugin that turns any codebase into an autonomous optimization loop: it discovers what to measure, instruments the benchmark, then runs a parallel tree search with semi-autonomous subagents. Each subagent works in an isolated git worktree (or remote sandbox) and reads shared failure traces before deciding what to try. The orchestrator uses configurable frontier strategies (argmax, top_k, epsilon_greedy, softmax, pareto_per_task) to select which branch to extend next, while cross-cutting scan subagents run RLM-inspired analysis to surface compound failure patterns. Gates (pass/fail checks) prevent the search from finding degenerate solutions. A web dashboard at http://127.0.0.1:8080 shows real-time experiment status.

evo runs on Claude Code, Codex, Cursor, Pi, Hermes, Opencode, and OpenClaw. Remote backends include Modal, E2B, Daytona, AWS, and Azure.

evo is the only framework in this batch (and arguably in the entire catalog) designed for code optimization via tree search rather than feature development: its use case is "make this faster/better" rather than "build this feature." Compared to seeds, evo is closest to superpowers in its skills-only architecture, but with a radically different purpose: superpowers is behavioral scaffolding for developers, while evo is an autonomous research loop for optimizing code performance.

01

Overview

evo — Overview

Origin

evo was built by evo-hq (2 contributors). Apache-2.0 license. 770 stars. Published to PyPI as evo-hq-cli. Last pushed 2026-05-26 (active).

Philosophy

From the README:

"A plugin for your agentic framework that optimizes code through experiments. You give it a codebase. It discovers metrics to optimize, sets up the evaluation, and starts running experiments in a loop — trying things, keeping what improves the score, throwing away what doesn't."

evo is inspired by Karpathy's autoresearch (a pure hill climb) and adds:

  • Tree search over greedy hill climb. Multiple directions can fork from any committed node
  • Parallel semi-autonomous agents. Spawn multiple subagents simultaneously, each in its own git worktree
  • Shared state. Failure traces, annotations, and discarded hypotheses accessible to every agent
  • Gating. Regression tests or safety checks wired as gates; experiments failing a gate are discarded
  • Observability. A dashboard to monitor experiments

Key Design Opinions

  • Main stays clean: No evo-specific artifacts committed to main
  • Baseline is a worktree: First experiment (exp_0000) is where benchmark and instrumentation live; main is untouched
  • Ask the user as little as possible: Minimize friction; one question for benchmark selection
  • Directive injection: The user can send [EVO DIRECTIVE] messages mid-run with evo direct; they can arrive via any hook channel and are honored as authoritative
  • Gates are mandatory for reliable search: Without gates, the optimizer finds degenerate solutions (return a constant, skip work, trade correctness for speed)

Supported Hosts

Claude Code, Codex, Cursor, Pi, Hermes, Opencode, OpenClaw

02

Architecture

evo — Architecture

Distribution & Install

  • Distribution type: cli-tool (Python, PyPI) + claude-plugin (skill-pack)
  • CLI install: uv tool install evo-hq-cli
  • Plugin install: evo install <host> (claude-code | codex | cursor | hermes | opencode | openclaw | pi)
  • Version: 0.4.x (analyzed from skills and README)
  • License: Apache-2.0
  • Required runtime: Python (via uv), host CLI (claude-code, codex, etc.)

Directory Tree (repo)

evo/
├── sdk/                    # Python SDK for evo experiment management
├── plugins/
│   └── evo/
│       ├── .claude-plugin/ # Claude Code marketplace manifest
│       ├── .codex-plugin/  # Codex plugin manifest
│       ├── hooks/
│       │   └── hooks.json  # PreToolUse, UserPromptSubmit, SessionStart hooks
│       ├── skills/
│       │   ├── discover/   # SKILL.md — codebase exploration + benchmark setup
│       │   ├── optimize/   # SKILL.md — parallel subagent optimization loop
│       │   ├── subagent/   # SKILL.md — subagent execution protocol
│       │   ├── infra-setup/ # SKILL.md — remote backend setup
│       │   └── references/ # CLI quick reference, provider matrix
│       ├── bin/
│       │   └── evo-hook-drain  # Hook drain binary (receives hook events)
│       └── src/            # Plugin Python source
├── scripts/                # Developer utilities
└── tests/                  # Test suite

Target AI Tools

Claude Code (/evo:), Codex ($evo), Cursor (/), Pi (extension via pi-subagents), Hermes, Opencode, OpenClaw

Experiment Workspace Backends

| Backend | Location | Install |
| --- | --- | --- |
| worktree (default) | Local git worktree | included |
| pool | Reuse fixed set of workspaces | included |
| ssh | Your own SSH host | included |
| modal | Modal serverless | evo-hq-cli[modal] |
| e2b | E2B cloud sandboxes | evo-hq-cli[e2b] |
| daytona | Daytona workspaces | evo-hq-cli[daytona] |
| aws | AWS EC2 | evo-hq-cli[aws] |
| azure | Azure VMs | evo-hq-cli[azure] |

Hook Architecture

The evo-hook-drain binary runs on all three hook events (PreToolUse, UserPromptSubmit, SessionStart). It drains the evo message queue into the agent's context, enabling the orchestrator to send directives to in-flight subagents via the hook channel.
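
The drain behavior described above can be illustrated with a minimal sketch. It assumes an in-memory queue; how evo actually stores and transports queued messages is not described in the source, so `MessageQueue` and `on_hook` are hypothetical names used only for illustration.

```python
from collections import deque
from typing import Deque, List

# The three hook events named in hooks.json.
HOOK_EVENTS = ("PreToolUse", "UserPromptSubmit", "SessionStart")


class MessageQueue:
    """Hypothetical stand-in for the evo message queue (in-memory only)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def push(self, text: str) -> None:
        self._pending.append(text)

    def drain(self) -> List[str]:
        """Return every pending message in arrival order and empty the queue."""
        drained = list(self._pending)
        self._pending.clear()
        return drained


def on_hook(event: str, queue: MessageQueue) -> str:
    """One handler for all three events: emit queued messages as context text."""
    if event not in HOOK_EVENTS:
        raise ValueError(f"unexpected hook event: {event}")
    return "\n".join(queue.drain())
```

Because the same handler runs on every event, a queued message is delivered at the next hook that fires, whichever it is, and is delivered only once.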

03

Components

evo — Components

CLI Commands (evo-hq-cli)

| Command | Purpose |
| --- | --- |
| evo install <host> | Install plugin into host's marketplace + stage hooks |
| evo doctor <host> | Verify installation |
| evo update <host> | Update plugin |
| evo init | Initialize evo workspace for a project |
| evo status | Show workspace status |
| evo new --parent <exp> | Create new experiment branch |
| evo run <exp_id> | Run benchmark on experiment |
| evo discard <exp_id> | Discard failed experiment |
| evo dashboard | Start web dashboard |
| evo config runtime show | Show benchmark runtime configuration |
| evo env show | Show environment configuration |
| evo workspace status | Show workspace/pool occupancy |
| evo bash/read/write/edit/glob/grep --exp-id <id> | Remote backend file operations |
| evo-version-check | Verify CLI/plugin version sync |
| evo direct | Send user directive to in-flight subagents |

Skills

| Skill | Invocation | Purpose |
| --- | --- | --- |
| discover | /evo:discover | One-time: explore repo, identify optimization target, set up benchmark, run baseline |
| optimize | /evo:optimize [subagents=N] [budget=N] [stall=N] | Run parallel optimization loop |
| subagent | Internal only | Subagent execution protocol (hypothesis → edit → benchmark → report) |
| infra-setup | Internal only | Remote backend setup and authentication |
| references | Reference only | CLI quick reference, provider matrix |

Hooks

| Hook Event | Handler |
| --- | --- |
| PreToolUse (matcher: .*) | evo-hook-drain |
| UserPromptSubmit | evo-hook-drain |
| SessionStart | evo-hook-drain |

All three hooks run the same drain binary, which delivers queued directives from the orchestrator to the current agent session.

Frontier Strategies

| Strategy | Behavior |
| --- | --- |
| argmax | Extend highest-scoring branch |
| top_k | Round-robin among K best |
| epsilon_greedy | Best usually, random sometimes |
| softmax | Sample weighted by score |
| pareto_per_task | Keep specialists the aggregate hides |
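
The five behaviors in the table can be sketched as small selection functions. This is an illustration of the general techniques, not evo's implementation: it assumes higher scores are better, and the parameter names (`k`, `round_index`, `epsilon`, `temperature`) and the per-task score layout are hypothetical.

```python
import math
import random
from typing import Dict, List

Scores = Dict[str, float]  # experiment id -> aggregate score (assumed: higher is better)


def argmax(scores: Scores) -> str:
    """Extend the single highest-scoring branch."""
    return max(scores, key=scores.get)


def top_k(scores: Scores, k: int, round_index: int) -> str:
    """Round-robin among the k best branches, one pick per round."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return best[round_index % len(best)]


def epsilon_greedy(scores: Scores, epsilon: float, rng: random.Random) -> str:
    """Pick the best branch, or a uniformly random one with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return argmax(scores)


def softmax(scores: Scores, temperature: float, rng: random.Random) -> str:
    """Sample a branch with probability proportional to exp(score / temperature)."""
    ids = sorted(scores)
    peak = max(scores.values())  # subtract the max for numerical stability
    weights = [math.exp((scores[i] - peak) / temperature) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


def pareto_per_task(per_task: Dict[str, Dict[str, float]]) -> List[str]:
    """Keep every branch that is not dominated across the per-task scores."""

    def dominates(a: str, b: str) -> bool:
        tasks = per_task[a].keys()
        no_worse = all(per_task[a][t] >= per_task[b][t] for t in tasks)
        better = any(per_task[a][t] > per_task[b][t] for t in tasks)
        return no_worse and better

    return [b for b in per_task if not any(dominates(a, b) for a in per_task if a != b)]
```

The Pareto variant is what lets a specialist survive: a branch that wins on one task stays on the frontier even when its aggregate score is lower than another branch's.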

Web Dashboard

Starts automatically with /evo:discover at http://127.0.0.1:8080. Features:

  • Experiment tree visualization
  • Frontier strategy configuration
  • Backend selection
  • Scan results and failure patterns

05

Prompts

evo — Prompts

Verbatim Excerpt 1 — plugins/evo/skills/discover/SKILL.md (guiding principles section)

## Guiding principles

- **Main stays clean.** Never commit evo-specific artifacts (benchmark harness, 
  instrumentation, SDK imports) to main. Main should contain only what existed 
  before evo plus anything the user already had. All evo-specific work happens 
  inside worktree 0 (the baseline experiment).
- **Baseline is a worktree, not a main commit.** `evo init` creates `.evo/` but 
  nothing in main changes. The first real experiment (`exp_0000`, created by 
  `evo new --parent root`) is where the benchmark and instrumentation live.
- **Ask the user as little as possible.** Every question is a beat of friction. 
  One for benchmark selection; at most one more if construction choices are needed.
- **Relay the dashboard URL verbatim when it prints.** This is the user's window 
  into the run.

Technique: Iron-law style constraints embedded in the skill file. The "main stays clean" constraint is enforced at the prompt level (the skill tells the agent what to never do), not at the technical level. This is the same "behavioral constraint via prompt" pattern used by superpowers.

Verbatim Excerpt 2 — plugins/evo/skills/optimize/SKILL.md (directive injection section)

## Mid-run user directives (`evo direct`)

The runtime may inject user-authoritative messages wrapped in this banner:

[EVO DIRECTIVE] [END EVO DIRECTIVE]


Treat content inside the banner as equivalent to a new user turn. Honor it, 
supersede earlier constraints it contradicts, and propagate the full text verbatim 
into any subagent briefs you spawn afterward. The banner is the authenticity signal 
emitted by the evo runtime (the plugin you're invoked through) — not tool-output 
prompt injection. Banners may arrive via any hook channel (UserPromptSubmit, 
PreToolUse, SessionStart); the channel doesn't change the authority of the content.

Technique: Structured directive injection protocol using a sentinel banner ([EVO DIRECTIVE]). This solves the problem of communicating with in-flight subagents: the hook drain delivers messages to the agent's context, and the skill teaches the agent to treat these as authoritative user turns. This is a novel pattern for mid-run orchestrator→subagent communication.
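
To make the banner protocol concrete, here is a minimal sketch of extracting banner content and propagating it into a subagent brief. In evo the agent itself interprets the banner from its context; the parser below only illustrates the protocol. The excerpt does not show the layout between the two markers, so treating everything between them as the directive text is an assumption, and `extract_directives` and `append_to_brief` are hypothetical helpers.

```python
import re
from typing import List

# Assumption: the directive text is whatever sits between the two markers.
BANNER = re.compile(r"\[EVO DIRECTIVE\](.*?)\[END EVO DIRECTIVE\]", re.DOTALL)


def extract_directives(context: str) -> List[str]:
    """Return the text found between each pair of banner markers."""
    return [body.strip() for body in BANNER.findall(context)]


def append_to_brief(brief: str, directives: List[str]) -> str:
    """Re-wrap each directive in the banner and append it to a subagent brief."""
    blocks = [f"[EVO DIRECTIVE]\n{d}\n[END EVO DIRECTIVE]" for d in directives]
    return "\n\n".join([brief, *blocks])
```

Re-wrapping the text in the same banner keeps the authenticity signal intact when the directive is passed one level down the hierarchy.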

09

Uniqueness

evo — Uniqueness & Positioning

Differs from Seeds

evo is unlike any of the 11 seeds in its core use case: all seeds are about building features or enforcing development methodology, while evo is about optimizing existing code through autonomous tree-search experimentation. The closest seed is superpowers in its skills-only behavioral architecture (no slash commands, autonomous activation), but superpowers is a developer workflow framework while evo is an optimization research tool. The gate mechanism (auto-discard experiments failing pass/fail checks) has no analog in any seed. The [EVO DIRECTIVE] banner injection via hooks for mid-run orchestrator→subagent communication is a novel architectural pattern not seen in any seed. The multiple remote execution backends (Modal, E2B, Daytona, AWS, Azure) for running experiments in cloud sandboxes make evo the most infrastructure-connected framework in the catalog. claude-flow is the only seed approaching evo's operational complexity, but uses SQLite/vector memory while evo uses git-worktree isolation with shared state.

Positioning

evo targets ML engineers and performance engineers who want to automate the "try something, measure, keep or revert" research cycle. It is not a general-purpose development framework. The discover skill's ability to build a benchmark from scratch when none exists makes it accessible to projects without existing evaluation infrastructure.

Observable Failure Modes

  • Gate degeneration: Without good gates, the optimizer finds shortcut solutions (constant return, skip computation). Optimization quality depends directly on gate quality.
  • Pool exhaustion: Failed experiments retain pool slots until discarded; evo discard <exp_id> is required to free capacity.
  • CLI/plugin version drift: evo install updates the host plugin but doesn't update the global CLI; evo-version-check failures go unnoticed unless its output is monitored.
  • Remote backend cost: Modal/E2B/AWS backends incur cloud compute costs per experiment. High subagent counts on expensive benchmarks can run up bills quickly.
  • Benchmark quality dependency: The entire search is only as good as the benchmark. A poorly designed benchmark optimizes the wrong thing.

Explicit Antipatterns (from SKILL.md)

  • Committing evo artifacts to main (use worktrees)
  • Skipping gates (degenerate solutions found without gates)
  • Auto-installing CLI from agent context (use user-executed install commands)
  • Running evo new when an experiment is already active in remote mode (use evo run <exp_id> for recovery)

04

Workflow

evo — Workflow

Phases

| Phase | Description | Artifact |
| --- | --- | --- |
| Discover | /evo:discover (explore repo, identify metric, build benchmark, run baseline) | .evo/ init, exp_0000 worktree with benchmark |
| Optimize | /evo:optimize (run parallel rounds of subagent experiments) | Experiment branches, scores |
| Per-round: Brief generation | Orchestrator writes one brief per subagent with objective, parent, boundaries, pointer traces | Subagent briefs |
| Per-round: Parallel subagents | N subagents (default 5) each read traces, form hypothesis, edit, run benchmark | Experiment results per subagent |
| Per-round: Cross-cutting scan | RLM-inspired scan subagents read trace batches; surface compound failure patterns | Shared state annotations |
| Per-round: Frontier selection | Orchestrator selects which committed branch to extend next per strategy | Next parent experiment |
| Gate check | Every experiment run triggers gate checks (exit 0 = pass, non-zero = discard) | Gate pass/fail per experiment |
| Dashboard monitoring | User monitors experiment tree, scores, frontier strategy | Dashboard view |
| Stop | Stall limit reached (N consecutive rounds with no improvement) or user interrupts | Final best experiment |
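
The stop condition in the last row can be sketched as a round loop with a stall counter. This is a simplified model under stated assumptions: higher scores are better, `run_round` is a hypothetical callback returning the scores of gate-passing experiments from one round, and `max_rounds` is a safety cap added for the sketch (it is not claimed to match the `budget=N` argument).

```python
from typing import Callable, List, Tuple


def run_rounds(
    baseline: float,
    run_round: Callable[[int], List[float]],
    stall_limit: int,
    max_rounds: int,
) -> Tuple[float, int]:
    """Run rounds until `stall_limit` consecutive rounds bring no improvement.

    Returns (best score seen, number of rounds executed).
    """
    best, stalled, rounds = baseline, 0, 0
    while rounds < max_rounds and stalled < stall_limit:
        scores = run_round(rounds)
        rounds += 1
        round_best = max(scores, default=best)  # an empty round counts as a stall
        if round_best > best:
            best, stalled = round_best, 0  # any improvement resets the counter
        else:
            stalled += 1
    return best, rounds
```

Resetting the counter on every improvement means the loop only stops after N rounds in a row fail to beat the current best, not after N failures in total.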

Phase-to-Artifact Map

| Phase | Artifact |
| --- | --- |
| Discover | .evo/ configuration, exp_0000 baseline worktree |
| Per-round subagent | New experiment branch (exp_NNNN), benchmark score, trace files |
| Cross-cutting scan | Shared state failure patterns |
| Frontier selection | Updated next-parent pointer |

Approval Gates

| Gate | Type |
| --- | --- |
| Benchmark gate (held-out slice) | Automatic (command exit code) |
| User-defined gates | Automatic (any command that exits 0/non-0) |
| Version check (evo-version-check) | Automatic at session start |

Gates are mandatory. Without gates, the optimizer finds degenerate solutions. Gate failure discards the experiment even if score improves.
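
The exit-code convention and the "discard even if the score improves" rule can be sketched as follows. This is a minimal illustration, assuming gates are plain command lines and that higher scores are better; `gates_pass` and `keep_experiment` are hypothetical helpers, not evo functions.

```python
import subprocess
import sys
from typing import List, Sequence


def gates_pass(gate_commands: Sequence[List[str]], cwd: str = ".") -> bool:
    """Run each gate command; exit code 0 means pass, anything else fails."""
    for argv in gate_commands:
        if subprocess.run(argv, cwd=cwd).returncode != 0:
            return False  # stop at the first failing gate
    return True


def keep_experiment(score: float, parent_score: float, gates_ok: bool) -> bool:
    """A gate failure discards the experiment even when the score improved."""
    return gates_ok and score > parent_score


# Usage: two throwaway gates built from the current Python interpreter.
passing_gate = [sys.executable, "-c", "raise SystemExit(0)"]
failing_gate = [sys.executable, "-c", "raise SystemExit(3)"]
```

Checking gates before comparing scores is what blocks the degenerate cases: returning a constant may score well, but it fails a regression gate and is thrown away.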

Discover Phase Details

  1. Verify evo-version-check
  2. Explore repo (READMEs, entry points, config files, tests, existing eval scripts)
  3. Check for existing benchmarks; ask user at most once if ambiguous
  4. Create .evo/ init (never commits to main)
  5. Create exp_0000 worktree with benchmark and instrumentation
  6. Run baseline; relay dashboard URL to user

06

Memory & Context

evo — Memory & Context

State Storage

evo uses a multi-layer state system:

| Layer | Location | Content |
| --- | --- | --- |
| Workspace state | .evo/ (not committed to main) | Experiment tree, scores, backend config |
| Experiment worktrees | Git worktrees per experiment | Isolated code changes per hypothesis |
| Shared state | evo runtime store | Failure traces, annotations, discarded hypotheses |
| Dashboard | HTTP at port 8080 | Live experiment status |
| Attempt state | attempts/NNN/ per experiment | Checkpoint files, attempt_state.json for recovery |

Shared State (Cross-Agent Memory)

Shared state is the primary memory mechanism for coordination: before any subagent begins an experiment, it reads the shared state to learn what has already been tried, what failed, and why. This prevents redundant hypotheses and builds on prior work. Shared state includes:

  • Failure traces from all prior experiments
  • Annotations added by subagents
  • Discarded hypothesis summaries
  • Scan results from cross-cutting scan subagents
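
The four kinds of shared content listed above can be modeled as one record that every subagent consults before forming a hypothesis. The field names and the exact-match duplicate check are assumptions made for this sketch; the source does not describe the store's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedState:
    """Hypothetical in-memory model of the cross-agent shared state."""

    failure_traces: Dict[str, str] = field(default_factory=dict)  # exp id -> trace text
    annotations: List[str] = field(default_factory=list)
    discarded_hypotheses: List[str] = field(default_factory=list)
    scan_results: List[str] = field(default_factory=list)

    def already_tried(self, hypothesis: str) -> bool:
        """Case-insensitive exact match against discarded hypothesis summaries."""
        wanted = hypothesis.strip().lower()
        return any(wanted == h.strip().lower() for h in self.discarded_hypotheses)

    def record_failure(self, exp_id: str, hypothesis: str, trace: str) -> None:
        """Store the trace and remember the hypothesis so others skip it."""
        self.failure_traces[exp_id] = trace
        self.discarded_hypotheses.append(hypothesis)
```

A real agent would judge overlap semantically rather than by string equality; the exact match here only shows where the check sits in the flow.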

Context for Subagents

Each subagent brief (written by the orchestrator) includes:

  • Objective (what to optimize)
  • Parent experiment ID
  • Boundaries (what not to change)
  • Pointer traces (specific failure patterns to address)

The subagent does NOT inherit the orchestrator's full conversation history — it gets a focused brief. This is the same context-isolation pattern as cestDone's Director+Worker split.
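
A brief with the four listed parts might be represented like this. The class, field names and rendered layout are hypothetical; the source lists only what a brief contains, not its format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Brief:
    """Hypothetical shape of a subagent brief: the only context a subagent receives."""

    objective: str
    parent: str
    boundaries: List[str]
    pointer_traces: List[str]

    def render(self) -> str:
        """Flatten the brief into the text handed to the subagent."""
        lines = [f"Objective: {self.objective}", f"Parent: {self.parent}", "Boundaries:"]
        lines += [f"  - {b}" for b in self.boundaries]
        lines.append("Pointer traces:")
        lines += [f"  - {t}" for t in self.pointer_traces]
        return "\n".join(lines)
```

Nothing from the orchestrator's conversation appears in `render()`, which is the point of the isolation: the subagent's context starts from these few lines.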

Crash Recovery

For remote backends: evo run <exp_id> is also the recovery command. evo reattaches to the existing remote process if still active. Checkpoint files in attempts/NNN/checkpoints/ enable phase-level recovery for expensive benchmarks.
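
Phase-level recovery of this kind can be sketched with a state file that records completed phases. The paths mirror the names mentioned above, but the JSON layout (`{"completed": [...]}`) and the `.done` marker files are assumptions for illustration only; evo's actual file formats are not documented in the source.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoints(
    attempt_dir: Path,
    phases: Dict[str, Callable[[], None]],
) -> List[str]:
    """Run phases in order, skipping any already recorded as completed.

    Returns the names of the phases actually executed in this call.
    """
    state_file = attempt_dir / "attempt_state.json"
    checkpoint_dir = attempt_dir / "checkpoints"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"completed": []}  # hypothetical schema

    executed = []
    for name, phase in phases.items():
        if name in state["completed"]:
            continue  # finished before the crash, do not repeat
        phase()
        (checkpoint_dir / f"{name}.done").write_text("ok")
        state["completed"].append(name)
        state_file.write_text(json.dumps(state))  # persist after every phase
        executed.append(name)
    return executed
```

Writing the state file after each phase, rather than once at the end, is what makes a rerun resume from the failed phase instead of from the start.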

Cross-Session Handoff

Yes — .evo/ persists across sessions. evo status picks up where the last session left off. The experiment tree and all scores persist.

07

Orchestration

evo — Orchestration

Multi-Agent Support

Yes — core design. The optimize skill spawns N subagents (default 5) in parallel, each in an isolated workspace.

Orchestration Pattern

task-decomposition-tree — the orchestrator maintains a tree of experiment branches, selects which to extend next based on frontier strategy, writes focused briefs for each subagent, and collects results. Cross-cutting scan subagents analyze failure patterns between rounds.

This is a hierarchical pattern (orchestrator → subagents) combined with a tree search rather than a linear queue.
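
The tree the orchestrator maintains can be sketched as a small data structure in which any committed node may be forked. The class and method names are hypothetical, and `parent=None` stands in for the root used when creating exp_0000.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Experiment:
    exp_id: str
    parent: Optional[str]
    score: Optional[float] = None
    committed: bool = False
    children: List[str] = field(default_factory=list)


class ExperimentTree:
    """Hypothetical model of the experiment tree (not evo's SDK)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Experiment] = {}

    def new(self, parent: Optional[str]) -> Experiment:
        """Fork a new experiment; only committed nodes (or the root) can be parents."""
        if parent is not None and not self.nodes[parent].committed:
            raise ValueError(f"{parent} is not committed")
        exp = Experiment(exp_id=f"exp_{len(self.nodes):04d}", parent=parent)
        self.nodes[exp.exp_id] = exp
        if parent is not None:
            self.nodes[parent].children.append(exp.exp_id)
        return exp

    def commit(self, exp_id: str, score: float) -> None:
        """Record the score and make the node available as a future parent."""
        self.nodes[exp_id].score = score
        self.nodes[exp_id].committed = True

    def frontier(self) -> Dict[str, float]:
        """Committed nodes and their scores: the candidates for the next fork."""
        return {i: n.score for i, n in self.nodes.items() if n.committed}
```

Unlike a linear queue, a node keeps accepting new children after it has been extended, so a later round can return to an older branch when the newest one stalls.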

Isolation Mechanism

git-worktree (default) — each experiment gets its own isolated git worktree. Remote backends (Modal, E2B, Daytona, AWS, Azure) provide container-level isolation for expensive experiments.

Multi-Model Support

No. evo invokes whichever AI CLI is configured on the host. No role-based model routing within evo itself (though the user could configure their host to use different models).

Execution Mode

Continuous optimization loop — rounds continue until the stall limit is reached or the user interrupts. The loop is not a Ralph-style sequential story executor; it is an active search process.

Directive Injection (unique mechanism)

The evo-hook-drain binary runs on PreToolUse, UserPromptSubmit, and SessionStart events. This allows the orchestrator to inject [EVO DIRECTIVE] messages into in-flight subagent sessions via the hook channel. This is the only framework in the catalog that uses hook events as an inter-agent communication channel.

Consensus Mechanism

None formal. Frontier strategy selection (argmax, top_k, etc.) is unilateral orchestrator decision. Cross-cutting scan subagents surface patterns for the orchestrator to consider.

Max Concurrent Agents

Configurable: subagents=N (default 5). Pool mode caps at pool size.
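
The pool cap can be illustrated with a fixed set of slots that stay leased until an experiment is discarded. This is a sketch under assumptions: `WorkspacePool` and `effective_concurrency` are hypothetical, and capping by free slots (rather than by total pool size) is one plausible reading of the behavior.

```python
from typing import Dict, Optional


class WorkspacePool:
    """Fixed set of workspace slots; a slot stays leased until its experiment is discarded."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.leases: Dict[str, int] = {}  # exp id -> slot index

    def acquire(self, exp_id: str) -> Optional[int]:
        """Lease the lowest free slot, or return None when the pool is exhausted."""
        used = set(self.leases.values())
        for slot in range(self.size):
            if slot not in used:
                self.leases[exp_id] = slot
                return slot
        return None

    def discard(self, exp_id: str) -> None:
        """Release the slot held by exp_id (no-op if it holds none)."""
        self.leases.pop(exp_id, None)


def effective_concurrency(subagents: int, pool: Optional[WorkspacePool] = None) -> int:
    """Requested subagent count, capped by free pool slots when a pool is used."""
    if pool is None:
        return subagents
    return min(subagents, pool.size - len(pool.leases))
```

With a pool of two slots and both leased, a request for five subagents yields zero until one experiment is discarded.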

Quality Gates

Gates are first-class: any command exiting 0 = pass, non-0 = fail. Gate failure discards the experiment. Gates are inherited down the experiment tree from where they are registered. The discover skill automatically adds a held-out-slice gate.
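
Gate inheritance down the tree can be sketched by walking from an experiment up to the root and collecting every gate registered along the way. The data layout (`parent_of`, `registered`) and the gate names in the usage example are hypothetical.

```python
from typing import Dict, List, Optional


def inherited_gates(
    exp_id: str,
    parent_of: Dict[str, Optional[str]],
    registered: Dict[str, List[str]],
) -> List[str]:
    """Collect gates registered on exp_id and on every ancestor, root first."""
    chain: List[str] = []
    node: Optional[str] = exp_id
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    gates: List[str] = []
    for ancestor in reversed(chain):
        gates.extend(registered.get(ancestor, []))
    return gates


# Usage: a gate on the baseline applies everywhere; one on exp_0001 only below it.
parent_of = {
    "exp_0000": None,
    "exp_0001": "exp_0000",
    "exp_0002": "exp_0001",
    "exp_0003": "exp_0000",
}
registered = {"exp_0000": ["held_out_slice"], "exp_0001": ["unit_tests"]}
```

Under this model a gate registered on the baseline constrains every experiment, while a gate added on a branch constrains only that branch's descendants.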

08

UI & CLI Surface

evo — UI & CLI Surface

Dedicated CLI Binary

Yes — evo (Python, distributed via PyPI as evo-hq-cli). A self-contained orchestration runtime, not a thin wrapper.

Key Subcommands

install, doctor, update, init, status, new, run, discard, dashboard, config, env, workspace, direct, evo-version-check

Local Web Dashboard

Yes — the most prominent local dashboard in the batch.

| Feature | Detail |
| --- | --- |
| URL | http://127.0.0.1:8080 (auto-increments if port in use) |
| Starts automatically | With /evo:discover (or evo init) |
| Features | Experiment tree, scores, frontier strategy config, backend config, scan results |
| Persistence | Port is remembered across runs |
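
The "auto-increments if port in use" behavior is a common pattern that can be sketched by probing ports upward from 8080. This shows the general technique only; how evo selects and remembers its port is not described in the source, and `first_free_port` is a hypothetical helper.

```python
import socket


def first_free_port(start: int = 8080, host: str = "127.0.0.1", attempts: int = 50) -> int:
    """Return the first port at or above `start` that can be bound on `host`."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

The probe socket is closed before the function returns, so another process could still claim the port before the server binds it; a production server would bind once and keep that socket.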

Dashboard tabs:

  • Frontier — select and configure search strategy (argmax, top_k, epsilon_greedy, softmax, pareto_per_task) with per-strategy parameters
  • Backend — select workspace backend (local worktree, pool, ssh, Modal, E2B, Daytona, AWS, Azure)

Skills Surface (within host AI tool)

| Invocation | Host |
| --- | --- |
| /evo:discover, /evo:optimize | Claude Code |
| $evo discover, $evo optimize | Codex |
| /discover, /optimize (skill menu) | Cursor |
| Natural language | Hermes, Opencode, OpenClaw, Pi |

IDE Integration

No dedicated IDE plugin. Integrates with any AI coding tool that supports the Agent Skills spec.

Observability

  • Web dashboard (experiment tree, scores, failure patterns)
  • evo status — workspace status
  • evo workspace status — pool occupancy, commit strategy
  • Attempt state files: attempts/NNN/attempt_state.json
  • Checkpoint files in attempts/NNN/checkpoints/
