
terminal-bench-env (TermiGen)

terminal-bench-env · ucsb-mlsec/terminal-bench-env · ★ 82 · last commit 2026-03-24

Primitive shape
No installable primitives
00

Summary

terminal-bench-env — Summary

terminal-bench-env is a research repository from UC Santa Barbara's ML Security lab that provides 3,500+ verified Docker environments and two minimal BashAgent implementations for evaluating terminal-based AI agents. It accompanies the TermiGen paper (arXiv 2602.07274), which introduces TermiGen-32B, a 32B-parameter model fine-tuned from Qwen2.5-Coder via error-correction trajectory synthesis. The repository is not an agent harness in the traditional sense: it is a benchmark environment corpus spanning 11 task categories (infrastructure, DevOps, security, data processing, ML/MLOps, algorithms, software development, scientific computing, interactive environments, distributed computing, formal verification). Tasks are available in TerminalBench 1.0 format (Docker Compose) and Harbor 2.0 format. The BashAgent implementation is a minimal ReAct-style agent with tmux-based shell interaction. This is a Tier B/C entry: it has no workflow methodology, no skill system, and no persistent memory. It is pure evaluation infrastructure.

Differs from seeds: No seed is a benchmark environment. terminal-bench-env is closer to evaluation infrastructure than an agent harness. It has no overlap with any seed framework philosophically or architecturally.

01

Overview

terminal-bench-env — Overview

Origin

Developed by the UC Santa Barbara ML Security Lab (ucsb-mlsec) as the research artifact accompanying the TermiGen paper. The paper was ranked #2 on Hugging Face Daily Papers on 2026-02-10.

Purpose

Provide high-fidelity Docker environments for training and evaluating terminal-based AI agents. The core research contributions are:

  1. 3,500+ verified executable Docker tasks
  2. BashAgent (minimal ReAct implementation)
  3. TermiGen-32B (fine-tuned model, separate on HuggingFace)

Key Metrics (from README)

  • 420 unique command-line tools in the corpus
  • 16 functional domains
  • Average task complexity: 25.5 turns, 8,722 tokens
  • TermiGen-32B performance: 31.3% TerminalBench 1.0, 19.3% TerminalBench 2.0, 21.4% SWE-Bench Verified
  • +26.8% absolute improvement over base Qwen2.5-Coder-32B


02

Architecture

terminal-bench-env — Architecture

Repository Layout

terminal-bench-env/
├── tasks/                    # 3,500+ task definitions
│   ├── infrastructure/
│   ├── devops/
│   ├── security/
│   ├── data_processing/
│   ├── ml_mlops/
│   ├── algorithms/
│   ├── software_development/
│   ├── scientific_computing/
│   ├── interactive_environments/
│   ├── distributed_computing/
│   └── formal_verification/
├── bash_agent.py             # BashAgent for TerminalBench 1.0 (tb framework)
├── bash_agent_harbor.py      # BashAgent for Harbor 2.0 framework
├── docker-compose.yml        # Task environment composition
└── README.md

Task Structure

Each task is a Docker-based environment:

  • docker-compose.yml or Harbor manifest
  • task.json / task.yaml describing the goal, success criteria, and setup
  • Pre-seeded filesystem state
  • Automated verification (bash scripts or test suites)

Tasks are self-contained: start the container → give the agent a terminal → run the verifier.

BashAgent Architecture

Two minimal implementations, both ReAct-style:

bash_agent.py (TerminalBench tb framework)

LLM → think → act (bash command) → observe (stdout/stderr) → repeat
  • tmux-based shell: sends commands to a tmux pane, reads output
  • No tool abstraction layer — raw bash in/out
  • Model: any OpenAI-compatible endpoint
  • Context window: sliding window over turn history
  • No memory beyond the current session

bash_agent_harbor.py (Harbor 2.0 framework)

  • Same ReAct loop, adapted to Harbor task format
  • Harbor provides structured task metadata and scoring

Isolation Model

Each task runs in a Docker container with:

  • Fresh filesystem state per task
  • No network egress in most tasks
  • Resource limits via Docker
  • Verifier script runs inside or alongside the container

Data Pipeline (TermiGen paper)

The repo is the benchmark environment component of a larger data pipeline:

  1. Human-curated task definitions → Docker environments
  2. Expert solutions captured as trajectories
  3. Error-correction: agent makes mistakes → corrections captured
  4. Synthetic trajectory augmentation → fine-tuning dataset
  5. TermiGen-32B trained on augmented dataset

terminal-bench-env provides only steps 1-2 (the environment corpus). The trajectory synthesis pipeline is separate.

TerminalBench 1.0 vs Harbor 2.0

| Aspect | TerminalBench 1.0 | Harbor 2.0 |
| --- | --- | --- |
| Format | Docker Compose | Harbor manifest |
| Evaluation | Bash verifier | Harbor scoring |
| Agent file | bash_agent.py | bash_agent_harbor.py |
| Scope | Original 3,500+ tasks | Extended/updated tasks |

Non-Architecture (What's Absent)

  • No workflow engine
  • No skill system
  • No hook infrastructure
  • No memory store beyond current session
  • No orchestration layer
  • No multi-agent coordination

This is intentional: the repo is an evaluation substrate, not an agent harness.

03

Skills And Commands

terminal-bench-env — Skills & Commands

Skills

None. terminal-bench-env contains no skill files, no .claude/skills/ directory, and no slash-command definitions. It is not a Claude Code configuration — it is a benchmark corpus.

Commands

None. No commands/ directory exists. No / prefixed commands.

BashAgent "Tools"

The only "tools" in the repo are the two agent scripts. Neither defines a tool registry. The agent's tool is a single implicit action: run a bash command in the tmux session and read the output.

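# Pseudocode from bash_agent.py: one think/act/observe turn
action = llm.complete(system_prompt, history)   # model reply (reasoning plus a command)
command = extract_bash_command(action)          # pull the bash command out of the reply
output = tmux_run(command)                      # stdout/stderr read back from the tmux pane
history.append({"role": "assistant", "content": action})
history.append({"role": "user", "content": output})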
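The pseudocode leaves extract_bash_command undefined. A minimal sketch of one way it could work is shown here, assuming (hypothetically) that the model follows a ReAct convention and puts its command on a line starting with "Action:". The actual parsing rule used by bash_agent.py is not described in this profile, so the regex and the "Action:" prefix are illustrative assumptions.

```python
import re
from typing import Optional

# Hypothetical convention: the command sits on a line that starts with "Action:".
_ACTION_LINE = re.compile(r"^Action:\s*(.+?)\s*$", re.MULTILINE)


def extract_bash_command(action: str) -> Optional[str]:
    """Return the first "Action:" command in the model reply, or None if absent."""
    match = _ACTION_LINE.search(action)
    if match is None:
        return None
    command = match.group(1).strip()
    return command or None


if __name__ == "__main__":
    reply = "Thought: I should list the files first.\nAction: ls -la /tmp"
    print(extract_bash_command(reply))
```

Returning None rather than raising lets the caller decide whether a reply without a command should end the run or trigger a retry.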

System Prompt

The BashAgent provides a minimal system prompt describing:

  • The task goal (from task.json)
  • That the agent has a bash terminal
  • Instructions to complete the task and then say DONE

Task Categories as "Domains"

The 11 task categories represent functional domains, not skills:

  1. infrastructure (server setup, networking, filesystems)
  2. devops (CI/CD, containers, monitoring)
  3. security (pen testing, hardening, crypto)
  4. data_processing (ETL, CSV/JSON manipulation, pipelines)
  5. ml_mlops (model training/serving, MLflow, datasets)
  6. algorithms (sorting, graph problems, optimization)
  7. software_development (debugging, refactoring, build systems)
  8. scientific_computing (numpy, scipy, simulations)
  9. interactive_environments (Jupyter, curses, terminal UIs)
  10. distributed_computing (MPI, Spark, message queues)
  11. formal_verification (Coq, TLA+, property-based testing)

Metrics

  • 420 unique command-line tools referenced across all tasks
  • Average task length: 25.5 turns, 8,722 tokens
  • 3,500+ verified tasks total

No Agent Extensibility

There is no plugin system, no way to add skills, and no mechanism to inject custom tool behavior into the BashAgent. The reference agent is deliberately minimal to avoid confounding benchmark results with harness sophistication.

05

Memory And Context

terminal-bench-env — Memory & Context

Memory Model

Session-only, no persistence.

The BashAgent has no memory beyond the current task session. Context is a sliding window over the conversation history within one task run. When the task ends (success or max_turns), all context is discarded.

Context Window Management

  • History is a list of {"role": ..., "content": ...} dicts
  • No summarization
  • No compaction
  • When context exceeds the model's limit: older turns are dropped from the front (sliding window)
  • Average task: 25.5 turns, 8,722 tokens — fits in most modern context windows without truncation
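The sliding-window behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function name trim_history, the max_tokens parameter, and the rough four-characters-per-token estimate are all hypothetical stand-ins for whatever tokenizer and limit the real agent uses.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(message: Message) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(message["content"]) // 4)


def trim_history(
    history: List[Message],
    max_tokens: int,
    count: Callable[[Message], int] = approx_tokens,
) -> List[Message]:
    """Drop the oldest turns until the history fits in max_tokens.

    The most recent message is always kept, even if it alone exceeds the budget.
    """
    trimmed = list(history)  # copy so the caller's list is left untouched
    total = sum(count(m) for m in trimmed)
    while len(trimmed) > 1 and total > max_tokens:
        total -= count(trimmed.pop(0))  # drop from the front (oldest first)
    return trimmed


if __name__ == "__main__":
    turns = [{"role": "user", "content": "x" * 40} for _ in range(5)]
    print(len(trim_history(turns, max_tokens=25)))
```

Dropping whole turns from the front keeps the most recent observations intact, at the cost of losing early context such as the first exploration commands.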

State Storage

| Level | Storage | Persistence |
| --- | --- | --- |
| Agent reasoning | In-memory list | Session only |
| Shell state | tmux pane (Docker) | Task lifetime |
| Filesystem state | Docker container | Task lifetime |
| Results/scores | Operator-managed | External |

No Cross-Task Memory

  • Each task is an independent Docker container
  • No shared filesystem between tasks
  • No vector store, SQLite, or external DB
  • No semantic search over prior runs

Implicit Memory via Docker

The Docker container provides a form of "working memory" for the duration of the task:

  • Files written to the container persist for the task lifetime
  • Commands like cat > notes.txt work as scratch memory
  • This is explicitly used by agents: writing plans, intermediate results, etc.

Context for Benchmark vs Fine-tuning

For fine-tuning (TermiGen use case), trajectories ARE persisted — but by the data collection infrastructure, not by the agent itself. The benchmark consumer (researcher) stores the full conversation log externally.

Comparison to Seeds

Unlike seeds with persistent memory:

  • ccmemory: persistent vector/SQLite memory across sessions
  • agent-os: tiered memory (working/episodic/semantic)
  • claude-flow: SQLite hive-mind shared state

terminal-bench-env intentionally has NO persistent memory — evaluation requires a fresh state for reproducibility.

09

Uniqueness

terminal-bench-env — Uniqueness & Positioning

differs_from_seeds

terminal-bench-env has no architectural overlap with any seed framework. All seeds are agent harnesses (providing workflow, skills, commands, memory to humans building software). terminal-bench-env is evaluation infrastructure (providing Docker environments and a minimal reference agent to researchers measuring agent performance). It operates one layer below agent harnesses — it is what you test an agent against, not what you use to build software.

Distinctive Positioning

  1. Verified executable environments at scale: 3,500+ Docker tasks with automated verifiers. No other entry in this batch provides a benchmark corpus — all others are agent harnesses or orchestrators.

  2. Domain breadth: 11 task categories including unusual domains (formal verification with Coq/TLA+, distributed computing with MPI/Spark, interactive environments with curses/Jupyter). Most agent benchmarks focus on software development (SWE-Bench). terminal-bench-env covers the full terminal-capable surface area.

  3. Error-correction trajectory synthesis: The TermiGen paper's core contribution is the data pipeline: collect failure trajectories, add corrections, synthesize training data. The benchmark provides the ground truth for this pipeline. +26.8% absolute improvement over the base model is the claimed result.

  4. Companion to a published model: TermiGen-32B (separate HuggingFace artifact) is a direct product of this dataset. The benchmark and the model are co-released — unusual for research artifacts.

  5. Harbor 2.0 format support: Dual-format support (TerminalBench 1.0 Docker Compose + Harbor 2.0) means the corpus can be used with different evaluation frameworks. Most benchmark repos are locked to one format.

Observable Limitations

  • 82 stars, no license — limits adoption
  • Research artifact, likely static after paper publication (last commit 2026-03-24)
  • BashAgent is minimal (no tool abstraction, no memory) — for baseline comparison only
  • No integration with commercial agent harnesses (Claude Code, Gemini CLI) — would require wrapping
  • Tasks are terminal-specific; no GUI, no browser, no document-editing tasks
  • Verifier script quality varies by domain (formal verification is harder to automate)

Not a Fit For

  • Building software with AI assistance
  • Persistent multi-session workflows
  • Team collaboration on code
  • Any use case requiring memory, skills, or tool abstraction

Fit For

  • Researchers measuring terminal agent capability
  • Teams building training datasets for terminal-focused models
  • Ablation studies on agent loop designs
  • Curriculum learning: identify which task categories an agent struggles with

04

Workflow

terminal-bench-env — Workflow

Evaluation Workflow (Benchmark Consumer)

1. Select a task from tasks/<category>/
2. docker-compose up (or Harbor start)
3. Run bash_agent.py <task_id>
4. Agent loop runs until DONE or max_turns
5. Verifier script executes
6. Pass/Fail recorded

Agent Loop

system_prompt = load_task_description(task_id)
history = []
turn = 0

while turn < max_turns:
    response = llm.complete(system_prompt, history)
    history.append({"role": "assistant", "content": response})
    if "DONE" in response:
        break
    command = extract_command(response)
    output = tmux_exec(command)
    history.append({"role": "user", "content": output})
    turn += 1

result = run_verifier()
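For readers who want to experiment with the loop without a model or a container, here is a self-contained sketch in which the model and the shell are injected as plain callables. The names run_agent_loop, complete, and execute, and the default of 30 turns, are assumptions for illustration; the real script's signatures and limits are not given in this profile.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def run_agent_loop(
    system_prompt: str,
    complete: Callable[[str, List[Message]], str],
    execute: Callable[[str], str],
    max_turns: int = 30,  # hypothetical default
) -> Tuple[List[Message], bool]:
    """Run think/act/observe turns; return (history, finished_with_done)."""
    history: List[Message] = []
    for _ in range(max_turns):
        response = complete(system_prompt, history)
        history.append({"role": "assistant", "content": response})
        if "DONE" in response:
            return history, True
        # Simplification: the whole reply is treated as the command.
        output = execute(response)
        history.append({"role": "user", "content": output})
    return history, False


if __name__ == "__main__":
    scripted = iter(["ls", "cat notes.txt", "DONE"])
    history, finished = run_agent_loop(
        "Complete the task, then say DONE.",
        complete=lambda prompt, past: next(scripted),
        execute=lambda command: f"(output of {command})",
    )
    print(finished, len(history))
```

Injecting the callables makes the loop testable with scripted replies, which is useful for the ablation studies on agent loop designs mentioned elsewhere in this profile.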

No Development Workflow

terminal-bench-env has no workflow for software development tasks from the agent's perspective. The workflow is entirely:

  • Benchmark operator: set up environment → run agent → collect scores
  • Not: developer → write spec → implement → review

Data Collection Workflow (Research Pipeline)

For TermiGen paper trajectory collection:

  1. Expert human solves a task (trajectory captured)
  2. BashAgent attempts same task (some fail)
  3. Error-correction: corrections to failed steps added
  4. Dataset assembled: (prompt, correct_trajectory) pairs
  5. Fine-tuning run on Qwen2.5-Coder-32B base

Phases

| Phase | Actor | Action |
| --- | --- | --- |
| Environment setup | Operator | docker-compose up |
| Task briefing | System | Load task.json into system prompt |
| Execution | BashAgent | ReAct loop, tmux commands |
| Verification | System | Run verifier script |
| Scoring | Operator | Aggregate pass rates |
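The scoring phase is left to the operator. A minimal sketch of aggregating pass rates per task category is shown below, assuming results arrive as (category, passed) pairs. That record shape and the function name pass_rates are hypothetical; the repository does not prescribe a results format in this profile.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of passing tasks for each category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / total[category] for category in total}


if __name__ == "__main__":
    sample = [("security", True), ("security", False), ("devops", True)]
    for category, rate in sorted(pass_rates(sample).items()):
        print(f"{category}: {rate:.0%}")
```

Grouping by category is what makes the curriculum use case practical: low-scoring categories point at the skills an agent lacks.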

No Approval Gates

No human-in-the-loop during task execution. The loop runs to completion or max turns without interruption.

No Spec Format

There is no spec file format. Tasks are described in task.json/task.yaml with fields like description, success_criteria, setup_commands — these are task definitions, not feature specs.
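As an illustration of how such a task definition might be consumed, the sketch below parses a JSON document with the three fields named above. The field types (a string description, a string success criterion, a list of setup commands) and the TaskDefinition class are assumptions; the actual schema may differ and may contain further fields.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskDefinition:
    """Hypothetical in-memory view of a task.json file."""
    description: str
    success_criteria: str
    setup_commands: List[str] = field(default_factory=list)


def load_task(raw: str) -> TaskDefinition:
    """Parse a task definition from JSON text; setup_commands is optional."""
    data = json.loads(raw)
    return TaskDefinition(
        description=data["description"],
        success_criteria=data["success_criteria"],
        setup_commands=list(data.get("setup_commands", [])),
    )


if __name__ == "__main__":
    example = json.dumps({
        "description": "Compress /var/log/app into logs.tar.gz",
        "success_criteria": "logs.tar.gz exists and lists app.log",
        "setup_commands": ["mkdir -p /var/log/app"],
    })
    print(load_task(example))
```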

Applicable Use Cases

  1. Benchmark: Evaluate any terminal agent against the 3,500+ tasks
  2. Fine-tuning data: Use verified trajectories to train new models
  3. Curriculum: Use task categories to measure agent skill gaps
  4. Research baseline: Compare new agent architectures against TermiGen-32B baselines

06

Multi Agent

terminal-bench-env — Multi-Agent

Multi-Agent Support

None. terminal-bench-env is a single-agent evaluation framework. There is one BashAgent per task run. No coordination, no spawning, no swarm.

Parallelism

The only form of parallelism is at the benchmark operator level: running multiple task evaluations in parallel (each in its own Docker container). This is external parallelism managed by the operator's test runner, not an internal multi-agent architecture.
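Operator-level parallelism of this kind can be sketched with the standard library alone. In the example below, evaluate stands in for whatever function starts a container, runs the agent, and returns the verifier verdict. That function, the name evaluate_all, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def evaluate_all(
    task_ids: Iterable[str],
    evaluate: Callable[[str], bool],
    max_workers: int = 4,  # hypothetical default
) -> Dict[str, bool]:
    """Run independent task evaluations concurrently and collect verdicts."""
    ids = list(task_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so verdicts line up with ids.
        verdicts = list(pool.map(evaluate, ids))
    return dict(zip(ids, verdicts))


if __name__ == "__main__":
    fake_evaluate = lambda task_id: task_id.endswith("-easy")
    print(evaluate_all(["sort-easy", "mpi-hard"], fake_evaluate))
```

Threads are sufficient here because each evaluation mostly waits on external processes (containers and model endpoints) rather than doing CPU work in Python.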

Why Single-Agent

The benchmark is designed to measure single-agent terminal capability:

  • Tasks are defined for one agent
  • Verification expects outputs from one agent's actions
  • Multi-agent setups would confound the measurement

Comparison to Multi-Agent Seeds

| Framework | Multi-agent | Mechanism |
| --- | --- | --- |
| claude-flow | Yes | Hive-mind, SQLite bus |
| scion-gcp | Yes | Container swarm, tmux sessions |
| clawmanager | Yes | K8s pods, Redis team bus |
| terminal-bench-env | No | N/A |

No Orchestration

  • No coordinator role
  • No subagent spawning
  • No message passing between agents
  • No consensus mechanism
  • No shared state between concurrent runs

TermiGen-32B Multi-Instance Evaluation

During benchmarking at scale, multiple TermiGen-32B instances run in parallel across the 3,500+ task corpus, but each instance operates independently. This is embarrassingly parallel evaluation infrastructure, not a multi-agent system.

07

Isolation And Security

terminal-bench-env — Isolation & Security

Isolation Mechanism: Docker Container

Each task runs inside a Docker container. This provides:

  • Filesystem isolation: Fresh state per task; agent cannot access host filesystem
  • Process isolation: Container processes run in their own namespaces, separated from host processes
  • Network isolation: Most tasks configure no-egress or limited network access
  • Resource limits: CPU/memory limits via Docker

Container Lifecycle

docker-compose up → task starts
  agent runs commands in container
docker-compose down → container destroyed, all state lost

Every task evaluation gets a clean container from the image. No state leaks between evaluations.
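The up/run/down lifecycle maps naturally onto a context manager, sketched below. The helper name task_environment and the injectable run parameter are assumptions made so the example can be exercised without Docker. The docker-compose flags used (-f, up -d, down) are standard.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def _run(argv: List[str]) -> None:
    """Default runner: execute the command and raise on a non-zero exit."""
    subprocess.run(argv, check=True)


@contextmanager
def task_environment(compose_file: str, run: Runner = _run) -> Iterator[None]:
    """Start the task's containers, and always tear them down afterwards."""
    base = ["docker-compose", "-f", compose_file]
    run(base + ["up", "-d"])  # start in detached mode
    try:
        yield
    finally:
        run(base + ["down"])  # runs even if the agent crashes


if __name__ == "__main__":
    recorded: List[List[str]] = []
    with task_environment("docker-compose.yml", run=recorded.append):
        pass  # the agent would run here
    print(recorded)
```

One caveat worth checking per task: docker-compose down removes containers and networks, but named volumes survive unless -v is passed, so a task that declares named volumes would need that flag to guarantee a clean start.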

Security Threat Model

The threat model is benchmark integrity, not production security:

  • Prevent agent from cheating (e.g., reading the verifier script)
  • Prevent environment contamination between tasks
  • Ensure reproducibility (same starting state every time)

There is no threat model for protecting credentials, user data, or network access beyond task isolation. This is a research artifact, not a production system.

Harbor Framework

Harbor 2.0 tasks use a similar container-based isolation but with the Harbor manifest format specifying additional environment constraints.

Verifier Access Control

The verification script typically runs outside the agent's working directory or in a privileged context. The agent cannot modify the verifier or pre-seed the verification results (within normal Docker isolation).

No Credential Management

  • No API keys managed by the framework
  • The bash_agent.py accepts a model endpoint — the researcher provides their own API key via environment variable
  • No credential vault, no secret injection beyond env vars

Comparison to Batch Peers

| Framework | Isolation | Security Focus |
| --- | --- | --- |
| ironclaw | WASM capability sandbox | Production tool safety |
| stakpak-agent | Docker + Warden network | Network egress control |
| agentbox-mattolson | mitmproxy + iptables | Two-layer network enforcement |
| osaurus | Apple Container Linux VM | Privacy + OS-level isolation |
| terminal-bench-env | Docker container | Benchmark reproducibility |

terminal-bench-env's isolation goal is reproducibility, not security. This is appropriate for a research artifact.

08

UI CLI Surface

terminal-bench-env — UI & CLI Surface

CLI Surface

Minimal. The "CLI" is running the Python scripts directly:

python bash_agent.py --task <task_id> --model <endpoint>
python bash_agent_harbor.py --task <task_id> --model <endpoint>

No dedicated CLI binary. No subcommand structure. No --help beyond Python's argparse.
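A command line of this shape could be declared with argparse roughly as follows. Only the --task and --model flag names come from the invocation shown above. Whether they are required, their help text, and the build_parser helper are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags shown in the invocation above."""
    parser = argparse.ArgumentParser(description="Run BashAgent on one task.")
    parser.add_argument("--task", required=True, help="task identifier")
    parser.add_argument("--model", required=True,
                        help="OpenAI-compatible model endpoint")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"task={args.task} model={args.model}")
```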

No Web UI

No dashboard, no web frontend, no status monitor.

tmux Interface

The BashAgent uses tmux as an execution substrate:

  • Creates a tmux session for each task run
  • Sends commands via tmux send-keys
  • Reads output via tmux capture-pane
  • This is internal infrastructure — the operator does not interact with the tmux session

The operator may observe the tmux session for debugging by attaching to it, but this is not a designed interaction surface.
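The send-keys and capture-pane interaction can be sketched with subprocess. The helper names below are hypothetical, and the session must already exist (for example via tmux new-session -d -s NAME). The tmux subcommands and flags themselves (send-keys -t, capture-pane -p -t) are standard.

```python
import subprocess
from typing import List


def send_keys_argv(session: str, command: str) -> List[str]:
    """Argv that types a command into the session and presses Enter."""
    return ["tmux", "send-keys", "-t", session, command, "Enter"]


def capture_pane_argv(session: str) -> List[str]:
    """Argv that prints the visible pane contents to stdout (-p)."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def run_in_tmux(session: str, command: str) -> str:
    """Send a command and return the pane text captured right afterwards.

    Caveat: this does not wait for the command to finish, so a real agent
    would need some completion check (polling or a sentinel) before capturing.
    """
    subprocess.run(send_keys_argv(session, command), check=True)
    result = subprocess.run(
        capture_pane_argv(session), check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(send_keys_argv("task-session", "ls -la")))
```

Because capture-pane returns the rendered screen rather than a byte stream, output that scrolls off the pane is lost unless the capture range is widened. This is one practical difference between a tmux substrate and a plain subprocess pipe.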

Observability

  • stdout/stderr logging from the agent script
  • JSON log files of trajectories (for data collection)
  • No structured metrics dashboard
  • No OTEL or telemetry

Docker Compose Interface

docker-compose up     # Start task environment
docker-compose down   # Destroy environment
docker-compose logs   # View container logs

Standard Docker tooling — no custom wrappers.

Output Format

Task results are typically written to a file or stdout:

  • PASS / FAIL from verifier
  • Full conversation transcript (for trajectory collection)
  • Token count, turn count per task

Target Audience

The primary users of this CLI surface are ML researchers running benchmark evaluations, not end users of an agent harness. The interface is intentionally minimal — researchers are expected to write their own evaluation scripts around the primitives.
