SWE-agent

swe-agent · SWE-agent/SWE-agent · ★ 19k · last commit 2026-05-25

Primitive shape
No installable primitives
00

Summary

SWE-agent — Summary

SWE-agent is a research-grade agent for fixing GitHub issues, developed at Princeton University and Stanford and published at NeurIPS 2024. A single YAML configuration file defines its system prompts, tool bundles, and history processors. The agent runs in a REPL-style loop: it writes ONE bash command, observes the output, then writes the next command. The default config follows Anthropic's computer-use-demo style with SEARCH/REPLACE edits; a bash_only config uses pure shell. SWE-agent is designed primarily for benchmark evaluation (SWE-bench, cybersecurity CTFs) rather than daily developer use, and most active development has moved to mini-SWE-agent (about 100 lines of Python).

SWE-agent is architecturally distinct from all seed frameworks: it is the academic benchmark baseline, built to solve SWE-bench, the benchmark that other coding agents measure themselves against. Seed frameworks like superpowers/BMAD/spec-kit are developer workflow tools; SWE-agent is a research instrument. Its single-YAML-file configuration approach (no plugin system, no skills directory) is the most minimal agent architecture in this batch.

01

Overview

SWE-agent — Overview

Origin

Academic project from Princeton University (John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press). Published at NeurIPS 2024. It was originally the reference system for SWE-bench, setting the baseline that coding agents are compared against on that benchmark.

Current Status

From README:

"Most of our current development effort is on mini-swe-agent, which has superseded SWE-agent. It matches the performance of SWE-agent, while being much simpler. Our general recommendation is to use mini-SWE-agent instead of SWE-agent going forward."

SWE-agent still works, but it is in maintenance mode while the team focuses on mini-SWE-agent (65% on SWE-bench in about 100 lines of Python).

Philosophy

SWE-agent's design philosophy is research-first:

  1. Configurable via YAML: The entire agent behavior is defined in a YAML config file. No plugin system, no complexity.
  2. Benchmarkable: Designed to run thousands of trials reliably against SWE-bench and other benchmarks
  3. Model-agnostic: Works with GPT-4o, Claude Sonnet, and custom open-weights models
  4. Tool bundles: Tools are organized into bundles (shell, file editing, review) that can be mixed and matched

Research Contributions

  1. SWE-bench (the benchmark for coding agents)
  2. EnIGMA mode for offensive cybersecurity / CTF challenges
  3. The Agent-Computer Interface (ACI) concept — how to design tools for LLM agents
  4. SWE-smith: training data pipeline for coding agents

REPL Architecture

The agent operates strictly as:

  1. Write ONE command
  2. System executes command
  3. Observe output
  4. Repeat

This is deliberately minimal: no parallel execution, no multi-step commands (except commands chained with && or ||), and no concurrent tools. The simplicity enables reliable benchmarking.
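
The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not SWE-agent's implementation: `run_repl`, `Policy`, and `scripted_policy` are hypothetical names, a scripted list of commands stands in for the model, and commands run in the local shell rather than in a sandbox.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical policy type: given the history so far, return the next
# shell command, or None to stop (standing in for "submit").
Policy = Callable[[List[Tuple[str, str]]], Optional[str]]


def run_repl(policy: Policy, max_steps: int = 10) -> List[Tuple[str, str]]:
    """Run one command per turn and feed each observation back to the policy."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        command = policy(history)                     # 1. write ONE command
        if command is None:
            break
        result = subprocess.run(                      # 2. system executes it
            command, shell=True, capture_output=True, text=True, timeout=30
        )
        observation = result.stdout + result.stderr   # 3. observe output
        history.append((command, observation))        # 4. repeat
    return history


def scripted_policy(commands: List[str]) -> Policy:
    """Toy stand-in for a model: replays a fixed list of commands."""
    def policy(history: List[Tuple[str, str]]) -> Optional[str]:
        step = len(history)
        return commands[step] if step < len(commands) else None
    return policy


if __name__ == "__main__":
    for cmd, obs in run_repl(scripted_policy(["echo hello", "echo a && echo b"])):
        print(f"$ {cmd}\n{obs}", end="")
```

Because each turn is a single command followed by a single observation, every step in the history can be inspected in isolation, which is the property the benchmarking argument relies on.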

02

Architecture

SWE-agent — Architecture

Distribution

pip install swe-agent
# or
git clone https://github.com/SWE-agent/SWE-agent
cd SWE-agent && pip install -e .

CLI Binary

From pyproject.toml:

  • sweagent — main CLI

Config-Driven Design

The entire agent behavior is defined in YAML config files in config/:

config/
  default.yaml            # Main config (Anthropic computer-use style)
  bash_only.yaml          # Simple bash-only config
  coding_challenge.yaml   # For competitive coding
  default_mm_with_images.yaml  # Multimodal with images
  default_mm_no_images.yaml    # Multimodal without images
  benchmarks/             # Benchmark-specific configs
  demo/                   # Demo configs
  exotic/                 # Experimental configs
  human/                  # Human baseline configs
  sweagent_0_7/           # v0.7 compatibility configs

Source Layout

sweagent/
  agent/
    agents.py          # Agent implementation
    history_processors.py  # Context management
    reviewer.py        # Self-review capability
    models.py          # LLM provider abstraction
    problem_statement.py  # Problem input parsing
  environment/         # Execution environment (Docker/local)
  tools/
    bundle.py          # Tool bundle loading
    commands.py        # Tool command definitions
    tools.py           # Tool execution
    parsing.py         # Tool output parsing
  run/                 # CLI entrypoints
  inspector/           # Trajectory inspection tools
tools/                 # Tool bundles (YAML/bash scripts)
  registry/            # Tool registry
  edit_anthropic/      # Anthropic-style edit tools
  review_on_submit_m/  # Review on submit tools

Required Runtime

  • Python >= 3.11
  • Docker (for sandboxed execution)
  • Or: local execution (for simple tasks)

Target AI Tools

  • Anthropic Claude (primary benchmark model)
  • OpenAI GPT-4o
  • Custom open-weights models (SWE-agent-LM-32b)
  • Any model via the models.py abstraction
03

Components

SWE-agent — Components

Config Files (Primary Customization Point)

| Config | Purpose |
| --- | --- |
| default.yaml | Primary: Anthropic computer-use style with SEARCH/REPLACE + review on submit |
| bash_only.yaml | Minimal: REPL with bash only, any instruction-following model |
| coding_challenge.yaml | Competitive coding / APPS/MBPP benchmarks |
| default_mm_with_images.yaml | Multimodal with screenshot context |
| default_mm_no_images.yaml | Multimodal without image injection |

Tool Bundles

Tools organized into composable bundles:

| Bundle | Purpose |
| --- | --- |
| tools/registry | Core file navigation and search tools |
| tools/edit_anthropic | Anthropic-style SEARCH/REPLACE edit tools |
| tools/review_on_submit_m | Self-review triggered before submitting |

CLI Subcommands

| Command | Purpose |
| --- | --- |
| sweagent run | Run agent on a single problem |
| sweagent run-batch | Run agent on multiple problems (batch mode) |
| sweagent inspect | Inspect a trajectory file |

History Processors

history_processors.py — manages what history is fed back to the model:

  • cache_control — Anthropic prompt caching on last N messages
  • Truncation strategies for long trajectories

Reviewer

reviewer.py — implements self-review capability. Before submitting a fix, the agent can be configured to review its own work (the review_on_submit_m tool bundle triggers this).

Trajectory Files

SWE-agent generates trajectory files (.json) recording:

  • Every action taken
  • Every observation received
  • Model thinking/reasoning
  • Final patch

These are machine-readable audit logs used for research analysis.
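
To make the shape of such a log concrete, here is a minimal sketch of a trajectory record and its JSON round trip. The class and field names (`Step`, `Trajectory`, `thought`, `action`, `observation`, `patch`) are assumptions chosen to mirror the four items listed above; the actual file schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    thought: str       # model reasoning for this turn
    action: str        # command the agent issued
    observation: str   # output returned by the environment


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    patch: str = ""    # final submitted diff

    def to_json(self) -> str:
        """Serialize the whole run as machine-readable JSON."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Trajectory":
        """Rebuild a trajectory from JSON produced by to_json()."""
        raw = json.loads(text)
        return cls(steps=[Step(**s) for s in raw["steps"]], patch=raw["patch"])
```

A run would append one `Step` per turn and set `patch` at submission, so analysis scripts can reload the file and iterate over `steps` without parsing terminal output.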

EnIGMA Mode

Cybersecurity-focused configuration for CTF (Capture the Flag) challenges. Uses different tool bundles and system prompts specialized for offensive security tasks.

05

Prompts

SWE-agent — Prompts

Prompt 1: Default Config System Template

Source: config/default.yaml, key agent.templates.system_template

Technique: Minimal system prompt (one sentence) designed for the Anthropic computer-use style. The simplicity is intentional: the agent relies on its tool definitions rather than on long instructions about how to use them.

You are a helpful assistant that can interact with a computer to solve tasks.

Prompt 2: Default Config Instance Template

Source: config/default.yaml, key agent.templates.instance_template

Technique: Structured problem statement injection with explicit workflow instructions and boundaries.

instance_template: |-
  <uploaded_files>
  {{working_dir}}
  </uploaded_files>
  I've uploaded a python code repository in the directory {{working_dir}}. Consider the following PR description:

  <pr_description>
  {{problem_statement}}
  </pr_description>

  Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
  I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
  Your task is to make the minimal changes to non-tests files in the {{working_dir}} directory to ensure the <pr_description> is satisfied.
  Follow these steps to resolve the issue:
  1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
  2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
  3. Edit the sourcecode of the repo to resolve the issue
  4. Rerun your reproduce script and confirm that the error is fixed!
  5. Think about edgecases and make sure your fix handles them as well
  Your thinking should be thorough and so it's fine if it's very long.
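
The `{{working_dir}}` and `{{problem_statement}}` placeholders in the template above are filled in per problem. The sketch below shows one way such substitution can work using only a regular expression; `render` is a hypothetical helper, and this document does not state which templating engine SWE-agent actually uses.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, **values: str) -> str:
    """Replace each {{name}} with values[name]; unknown names raise KeyError."""
    def substitute(match: re.Match) -> str:
        return values[match.group(1)]
    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    snippet = "<uploaded_files>\n{{working_dir}}\n</uploaded_files>"
    print(render(snippet, working_dir="/testbed"))
```

Raising on an unknown placeholder, rather than leaving it in place, keeps a misspelled variable from silently reaching the model as literal braces.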

Prompt 3: bash_only Config System Template

Source: config/bash_only.yaml, key agent.templates.system_template

Technique: REPL enforcement — agent must output EXACTLY ONE command per turn. Format violation = rejection.

You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks.
You operate in a REPL (Read-Eval-Print Loop) environment where you must issue exactly ONE command at a time.
Your response must contain exactly ONE bash code block with ONE command (or commands connected with && or ||).

Include a THOUGHT section before your command where you explain your reasoning process.
Format your response as:

THOUGHT: Your reasoning and analysis here

```bash
your_command_here
```

Failure to follow these rules will cause your response to be rejected.

Prompt 4: Submit Review Message

Source: config/default.yaml, key registry_variables.SUBMIT_REVIEW_MESSAGES

Technique: Post-implementation self-review checklist. This is the review gate before submission.

Thank you for your work on this issue. Please carefully follow the steps below to help review your changes.

  1. If you made any changes to your code after running the reproduction script, please run the reproduction script again. If the reproduction script is failing, please revisit your changes and make sure they are correct. If you have already removed your reproduction script, please ignore this step.
  2. Remove your reproduction script (if you haven't done so already).
  3. If you have modified any TEST files, please revert them to the state they had before you started fixing the issue. You can do this with `git checkout -- /path/to/test/file.py`. Use below <diff> to find the files you need to revert.
  4. Run the submit command again to confirm.

Here is a list of all of your changes:

<diff>
{{diff}}
</diff>

Prompting Techniques Used

  1. REPL enforcement: Exactly ONE command per turn — format violation = rejection. Forces structured output.
  2. Workflow steps: Numbered steps (find → reproduce → edit → verify → edge cases)
  3. Boundary declaration: "DON'T modify tests" — explicit prohibition
  4. Self-review checklist: Post-implementation review with diff injection before submit
  5. Minimal system prompts: Short system prompts rely on tool definitions and history, not long instructions
  6. Template variables: {{working_dir}}, {{problem_statement}}, {{diff}} — dynamic injection
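
The REPL enforcement in item 1 implies a validation step between the model and the shell. The following is a hedged sketch of such a check, not SWE-agent's parser: `parse_response` and `FormatError` are hypothetical names, and treating a multi-line block as a violation is an assumption made here for simplicity.

```python
import re
from typing import Tuple

FENCE = "`" * 3  # built programmatically so this example nests cleanly
_BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)\n" + FENCE, re.DOTALL)


class FormatError(ValueError):
    """Raised when a response does not contain exactly one bash command."""


def parse_response(response: str) -> Tuple[str, str]:
    """Return (thought, command), or raise FormatError on a format violation."""
    blocks = _BASH_BLOCK.findall(response)
    if len(blocks) != 1:
        raise FormatError(f"expected exactly 1 bash block, found {len(blocks)}")
    command = blocks[0].strip()
    if not command or "\n" in command:
        # Assumption: one command means one line; && and || chains still pass.
        raise FormatError("the bash block must hold a single command line")
    thought = response.split(FENCE + "bash", 1)[0]
    thought = thought.replace("THOUGHT:", "", 1).strip()
    return thought, command
```

In a loop, a `FormatError` would typically be turned into an observation telling the model its response was rejected, so the next turn can correct the format.
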
09

Uniqueness

SWE-agent — Uniqueness and Positioning

Differs from Seeds

SWE-agent is the academic benchmark baseline for the AI coding agent space. The frameworks in this batch (aider, cline, openhands, goose, etc.) and the seed frameworks measure themselves against SWE-bench, which SWE-agent was built to solve. Unlike the seed frameworks, which are developer workflow tools, SWE-agent is a research instrument: it generates trajectory files for analysis, supports batch evaluation against thousands of problems, and is optimized for reproducible benchmarking rather than daily developer use. The single-YAML-config approach (no plugin system, no skills directory, no methodology) is the most minimal architecture in this corpus. The closest comparison is aider (both are execution-layer agents), but aider is optimized for developer ergonomics while SWE-agent is optimized for benchmark performance.

Key Differentiators

  1. Academic pedigree: NeurIPS 2024 publication, Princeton/Stanford, cited by virtually every other agent's paper/README. The SWE-bench benchmark originated here.

  2. YAML-configurable agent loop: The entire system prompt, tool bundles, history processors, and instance templates are specified in a single YAML file. No plugin system, no code changes needed to experiment.

  3. Trajectory files: Every run produces a JSON trajectory with every action, observation, and reasoning step. This enables research-grade analysis of agent behavior.

  4. Batch evaluation mode: sweagent run-batch is designed for running thousands of problems in parallel. No other framework in this corpus is designed for this scale.

  5. EnIGMA mode: Cybersecurity CTF capabilities (offensive security) that no other framework in this corpus ships.

  6. One-command-at-a-time REPL: The strict ONE-command-per-turn enforcement (with rejection on violation) enables more reliable benchmarking — no multi-step actions confound results.

  7. Self-review before submit: The review_on_submit_m tool bundle makes the agent review its own work before submitting. This is a mini-adversarial self-check.

Observable Failure Modes

  1. Development effort moved to mini-SWE-agent: The team recommends using mini-SWE-agent instead, which suggests SWE-agent itself may not evolve much
  2. Not designed for daily developer use: The benchmark-first design means poor ergonomics for iterative development
  3. Requires reproduction script: The workflow requires writing a reproduction script, which is extra work vs aider's direct editing
  4. Strict REPL may slow exploration: ONE command at a time is efficient for benchmarks but tedious for complex codebases
  5. No memory: Each run starts from scratch — no learning across runs (by design for benchmark reproducibility)
04

Workflow

SWE-agent — Workflow

Single Issue Workflow

  1. Provide problem — GitHub issue URL, local repo path, or problem statement
  2. Config selection — specify YAML config file
  3. REPL loop:
    • Agent receives system prompt + instance context (repo contents summary, problem statement)
    • Agent outputs THOUGHT + one bash command
    • System executes command in environment
    • Observation fed back to agent
    • Repeat until agent submits
  4. Review gate (if configured) — agent reviews its own patch before submission
  5. Submit — patch applied; trajectory file saved

Batch Mode (Benchmarking)

sweagent run-batch --config config/default.yaml --instances.type swe_bench --instances.subset lite --instances.split dev

Runs the agent against hundreds/thousands of problems in parallel. Used for benchmark evaluation.

Interactive Mode

Also supports interactive use where a human is the "agent" (human baseline configs in config/human/).

Phases + Artifacts Table

| Phase | Artifact |
| --- | --- |
| Problem intake | Parsed problem statement |
| Exploration | Bash command history |
| Implementation | Code edits |
| Self-review | Review output (if configured) |
| Submission | Patch (git diff) + trajectory JSON |

Approval Gates

| Gate | Type | Notes |
| --- | --- | --- |
| Submit confirmation | Built into the tool | Agent must call the submit tool explicitly |
| Self-review | Automatic | review_on_submit_m bundle triggers review before submit |

SWE-agent is designed for autonomous operation — no human approval gates in the standard benchmark workflow. The submit confirmation is agent-to-system, not agent-to-human.
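
The two gates can be pictured as a small state machine: the first submit returns a review message containing the diff, and a second submit is accepted. This is a sketch under assumptions, not the review_on_submit_m implementation; `SubmitGate` and the shortened `REVIEW_MESSAGE` are hypothetical stand-ins.

```python
from dataclasses import dataclass

# Shortened stand-in for the configured review message (hypothetical text).
REVIEW_MESSAGE = "Please review your changes, then run submit again.\n\n{diff}"


@dataclass
class SubmitGate:
    """First submit returns a review prompt; the second one is accepted."""
    review_shown: bool = False

    def submit(self, diff: str) -> dict:
        if not self.review_shown:
            self.review_shown = True
            # Agent-to-system gate: the diff is fed back as an observation.
            return {"accepted": False, "observation": REVIEW_MESSAGE.format(diff=diff)}
        return {"accepted": True, "patch": diff}
```

No human appears anywhere in this exchange, which matches the point above that the confirmation is agent-to-system.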

06

Memory Context

SWE-agent — Memory and Context

State Storage

| State | Storage | Scope |
| --- | --- | --- |
| Trajectory | JSON file (trajectories/) | Per-run |
| Conversation history | In-memory | Session |
| Config | YAML files | Shared |

Trajectory Files

The primary persistent artifact is the trajectory JSON file:

  • Records every action and observation
  • Records model thinking/reasoning
  • Records the final submitted patch
  • Machine-readable for research analysis
  • Stored in trajectories/ directory

Context Management (History Processors)

history_processors.py manages what history is fed to the model:

  • cache_control: Applies Anthropic prompt caching to the last N messages (configurable via last_n_messages: 2 in default.yaml)
  • Truncation strategies to handle long trajectories

The cache_control processor is specifically for cost optimization — reusing cached prompt prefixes across turns.
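
As an illustration of the idea, the sketch below tags only the last N messages of a history with a cache marker. It is an assumption-laden simplification: `mark_cache_control` is a hypothetical function, messages are plain dicts, and the marker is attached at message level, whereas Anthropic's API places `cache_control` on content blocks.

```python
import copy
from typing import Dict, List

Message = Dict[str, object]


def mark_cache_control(history: List[Message], last_n_messages: int = 2) -> List[Message]:
    """Return a copy of history where only the last N messages carry a cache marker."""
    marked = copy.deepcopy(history)  # leave the caller's history untouched
    cutoff = len(marked) - last_n_messages
    for index, message in enumerate(marked):
        if index >= cutoff:
            message["cache_control"] = {"type": "ephemeral"}
        else:
            message.pop("cache_control", None)  # clear markers from earlier turns
    return marked
```

Re-running the processor each turn moves the markers forward, so the cached prefix grows with the conversation while the number of marked messages stays fixed.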

No Cross-Session State

Each SWE-agent run is completely independent. There is no cross-run memory, no database, no accumulated knowledge. The trajectory files are for analysis, not for feeding back into future runs.

Context Window

The entire history of actions and observations is maintained in the conversation. History processors trim/compress it when it grows too large. The default config uses cache_control which doesn't trim but applies caching.
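
For contrast with caching, here is a minimal sketch of one possible trimming strategy: keep the most recent observations verbatim and replace older ones with a short stub. The function name, the `role`/`content` message layout, and the stub wording are all assumptions; the source does not specify which truncation strategies SWE-agent ships.

```python
from typing import Dict, List

Message = Dict[str, str]


def elide_old_observations(history: List[Message], keep_last: int = 5) -> List[Message]:
    """Replace all but the last `keep_last` observation bodies with a short stub."""
    observation_indexes = [i for i, m in enumerate(history) if m["role"] == "observation"]
    if keep_last > 0:
        to_elide = set(observation_indexes[:-keep_last])
    else:
        to_elide = set(observation_indexes)
    trimmed = []
    for index, message in enumerate(history):
        if index in to_elide:
            line_count = len(message["content"].splitlines())
            # Build a new dict so the original history is not mutated.
            message = {**message, "content": f"Old output omitted ({line_count} lines)"}
        trimmed.append(message)
    return trimmed
```

Actions and thoughts are kept in full here, on the assumption that command outputs dominate the token count while the record of what was tried stays cheap.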

07

Orchestration

SWE-agent — Orchestration

Multi-Agent

No. Single agent per problem. Batch mode runs multiple independent agents in parallel (one per problem), but they don't communicate.

Orchestration Pattern

Sequential REPL loop. No branching, no subagents.

Isolation Mechanism

Docker container per run (for benchmarking). Local execution also supported.

Multi-Model

Configurable per config file. No built-in multi-model routing.

Execution Mode

One-shot: the agent runs until it submits a patch or hits a limit, then exits.

Crash Recovery

Trajectory files provide crash recovery — runs can be resumed from a checkpoint.

Context Compaction

Via history processors. The cache_control processor applies caching rather than compaction. Other processors may truncate.

Consensus Mechanism

None. The reviewer in reviewer.py is a self-review, not a consensus mechanism.

Prompt Chaining

Yes: execution output is fed back as observation, which becomes part of the next prompt. This is the core REPL loop.

Batch Mode

sweagent run-batch runs many independent one-shot agents in parallel against a dataset. Each agent is isolated and runs in its own environment. Results are aggregated for benchmark analysis.
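
The pattern described here, many isolated one-shot runs followed by aggregation, can be sketched with a thread pool. This is an illustration only: `run_batch` and the `run_one` callback are hypothetical, and a real worker would start its own environment per problem rather than return a boolean directly.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    problem_ids: List[str],
    run_one: Callable[[str], bool],
    num_workers: int = 4,
) -> Dict[str, object]:
    """Run one independent agent per problem and aggregate the outcomes."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so outcomes line up with problem_ids.
        outcomes = list(pool.map(run_one, problem_ids))
    resolved = [pid for pid, ok in zip(problem_ids, outcomes) if ok]
    return {
        "total": len(problem_ids),
        "resolved": resolved,
        "resolve_rate": len(resolved) / len(problem_ids) if problem_ids else 0.0,
    }
```

Workers share nothing but the list of problem ids, mirroring the statement that the agents do not communicate. Note that in this sketch an exception in any worker propagates when the results are collected; a production runner would likely catch and record failures per problem.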

08

UI CLI Surface

SWE-agent — UI and CLI Surface

CLI Binary

Name: sweagent
Install: pip install swe-agent
Is thin wrapper: No — own agent runtime

Subcommands

| Subcommand | Purpose |
| --- | --- |
| sweagent run | Run agent on single problem |
| sweagent run-batch | Run agent on multiple problems |
| sweagent inspect | Inspect trajectory file |

No UI

SWE-agent is a pure CLI tool with no web dashboard or IDE extension. The terminal output is the UI.

GitHub Codespaces

The repo supports "Open in GitHub Codespaces" for browser-based usage.

Trajectory Inspection

sweagent inspect <trajectory.json> — view a recorded agent trajectory. This is the primary observability tool.

Observability

  • Trajectory JSON files (trajectories/) — complete machine-readable record of every action
  • Stdout logging during run
  • No web dashboard

Research Tools

The inspector/ module provides tools for analyzing trajectory files:

  • View model reasoning
  • Replay agent actions
  • Compare trajectories across models
  • Compute benchmark statistics
