terminal-bench-env (TermiGen)

terminal-bench-env · ucsb-mlsec/terminal-bench-env · ★ 82 · last commit 2026-03-24

Primitive shape

No installable primitives

Summary

terminal-bench-env — Summary

terminal-bench-env is a research repository from UC Santa Barbara's ML Security lab that provides 3,500+ verified Docker environments and two minimal BashAgent implementations for evaluating terminal-based AI agents. It accompanies the TermiGen paper (arXiv 2602.07274), which introduces TermiGen-32B, a 32B-parameter model fine-tuned from Qwen2.5-Coder via error-correction trajectory synthesis. The repository is not an agent harness in the traditional sense: it is a benchmark environment corpus spanning 11 task categories (infrastructure, DevOps, security, data processing, ML/MLOps, algorithms, software development, scientific computing, interactive environments, distributed computing, formal verification). Tasks are available in TerminalBench 1.0 format (Docker Compose) and Harbor 2.0 format. BashAgent is a minimal ReAct-style agent with tmux-based shell interaction. This is a Tier B/C entry: it has no workflow methodology, no skill system, and no persistent memory, and serves purely as evaluation infrastructure.

Differs from seeds: No seed is a benchmark environment. terminal-bench-env is closer to evaluation infrastructure than an agent harness. It has no overlap with any seed framework philosophically or architecturally.

Overview

terminal-bench-env — Overview

Origin

Developed by the UC Santa Barbara ML Security Lab (ucsb-mlsec) as a research artifact accompanying the TermiGen paper. The paper ranked #2 on Hugging Face Daily Papers on 2026-02-10.

Purpose

Provide high-fidelity Docker environments for training and evaluating terminal-based AI agents. The core research contributions are:

3,500+ verified executable Docker tasks
BashAgent (minimal ReAct implementation)
TermiGen-32B (fine-tuned model, hosted separately on Hugging Face)

Key Metrics (from README)

420 unique command-line tools in the corpus
16 functional domains
Average task complexity: 25.5 turns, 8,722 tokens
TermiGen-32B performance: 31.3% TerminalBench 1.0, 19.3% TerminalBench 2.0, 21.4% SWE-Bench Verified
+26.8% absolute improvement over base Qwen2.5-Coder-32B

Repo Facts

GitHub: https://github.com/ucsb-mlsec/terminal-bench-env
Stars: 82 (2026-05-26)
Language: Python
License: unknown
Last commit: 2026-03-24
Status: Research artifact (likely static after paper publication)

Architecture

terminal-bench-env — Architecture

Repository Layout

terminal-bench-env/
├── tasks/                    # 3,500+ task definitions
│   ├── infrastructure/
│   ├── devops/
│   ├── security/
│   ├── data_processing/
│   ├── ml_mlops/
│   ├── algorithms/
│   ├── software_development/
│   ├── scientific_computing/
│   ├── interactive_environments/
│   ├── distributed_computing/
│   └── formal_verification/
├── bash_agent.py             # BashAgent for TerminalBench 1.0 (tb framework)
├── bash_agent_harbor.py      # BashAgent for Harbor 2.0 framework
├── docker-compose.yml        # Task environment composition
└── README.md

Task Structure

Each task is a Docker-based environment:

docker-compose.yml or Harbor manifest
task.json / task.yaml describing the goal, success criteria, and setup
Pre-seeded filesystem state
Automated verification (bash scripts or test suites)

Tasks are self-contained: start the container → give the agent a terminal → run the verifier.

BashAgent Architecture

Two minimal implementations, both ReAct-style:

bash_agent.py (TerminalBench tb framework)

LLM → think → act (bash command) → observe (stdout/stderr) → repeat

tmux-based shell: sends commands to a tmux pane, reads output
No tool abstraction layer — raw bash in/out
Model: any OpenAI-compatible endpoint
Context window: sliding window over turn history
No memory beyond the current session

bash_agent_harbor.py (Harbor 2.0 framework)

Same ReAct loop, adapted to Harbor task format
Harbor provides structured task metadata and scoring

Isolation Model

Each task runs in a Docker container with:

Fresh filesystem state per task
No network egress in most tasks
Resource limits via Docker
Verifier script runs inside or alongside the container

Data Pipeline (TermiGen paper)

The repo is the benchmark environment component of a larger data pipeline:

1. Human-curated task definitions → Docker environments
2. Expert solutions captured as trajectories
3. Error-correction: agent makes mistakes → corrections captured
4. Synthetic trajectory augmentation → fine-tuning dataset
5. TermiGen-32B trained on augmented dataset

terminal-bench-env provides only steps 1-2 (the environment corpus). The trajectory synthesis pipeline is separate.

TerminalBench 1.0 vs Harbor 2.0

Aspect	TerminalBench 1.0	Harbor 2.0
Format	Docker Compose	Harbor manifest
Evaluation	Bash verifier	Harbor scoring
Agent file	bash_agent.py	bash_agent_harbor.py
Scope	Original 3,500+ tasks	Extended/updated tasks

Non-Architecture (What's Absent)

No workflow engine
No skill system
No hook infrastructure
No memory store beyond current session
No orchestration layer
No multi-agent coordination

This is intentional: the repo is an evaluation substrate, not an agent harness.

Skills And Commands

terminal-bench-env — Skills & Commands

Skills

None. terminal-bench-env contains no skill files, no .claude/skills/ directory, and no slash-command definitions. It is not a Claude Code configuration — it is a benchmark corpus.

Commands

None. No commands/ directory exists. No / prefixed commands.

BashAgent "Tools"

The only "tools" in the repo are the two agent scripts. Neither defines a tool registry. The agent's tool is a single implicit action: run a bash command in the tmux session and read the output.

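# Pseudocode from bash_agent.py: one turn of the loop
action = llm.complete(system_prompt, history)   # model proposes the next step
command = extract_bash_command(action)          # pull the bash command out of the reply
output = tmux_run(command)                      # run it in the tmux session and read the output
history.append({"role": "assistant", "content": action})
history.append({"role": "user", "content": output})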
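The pseudocode leaves extract_bash_command undefined. The sketch below shows one possible shape for it, assuming (hypothetically) that the model is asked to wrap its command in a fenced bash block. The reply format that bash_agent.py actually expects is not documented here, so treat the pattern as an illustration only.

```python
import re
from typing import Optional

# Assumption: the model wraps its command in a fenced "bash" block.
# The real bash_agent.py may use a different convention.
FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
_BASH_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:bash|sh)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_bash_command(response: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if there is none."""
    match = _BASH_BLOCK.search(response)
    if match is None:
        return None
    command = match.group(1).strip()
    return command or None


reply = "I will list the files.\n" + FENCE + "bash\nls -la /app\n" + FENCE
print(extract_bash_command(reply))
```

Returning None rather than raising lets the caller decide how to handle a reply with no command, for example by asking the model to try again.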

System Prompt

The BashAgent provides a minimal system prompt describing:

The task goal (from task.json)
That the agent has a bash terminal
Instructions to complete the task and then say DONE
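The three parts listed above could be assembled roughly as follows. This is a sketch, not the prompt shipped in the repository: the wording, the function name build_system_prompt, and the one-command-per-turn instruction are all assumptions made for illustration.

```python
def build_system_prompt(goal: str, done_token: str = "DONE") -> str:
    """Assemble a minimal system prompt from a goal, a capability note, and a stop rule."""
    return "\n".join([
        f"Task: {goal.strip()}",
        # Assumed wording; the real prompt text is not reproduced here.
        "You have access to a bash terminal. Reply with one bash command per turn.",
        f"When the task is complete, reply with {done_token}.",
    ])


prompt = build_system_prompt("Create /tmp/hello.txt containing the word hi")
print(prompt)
```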

Task Categories as "Domains"

The 11 task categories group tasks by subject area; they are not skills:

infrastructure (server setup, networking, filesystems)
devops (CI/CD, containers, monitoring)
security (pen testing, hardening, crypto)
data_processing (ETL, CSV/JSON manipulation, pipelines)
ml_mlops (model training/serving, MLflow, datasets)
algorithms (sorting, graph problems, optimization)
software_development (debugging, refactoring, build systems)
scientific_computing (numpy, scipy, simulations)
interactive_environments (Jupyter, curses, terminal UIs)
distributed_computing (MPI, Spark, message queues)
formal_verification (Coq, TLA+, property-based testing)

Metrics

420 unique command-line tools referenced across all tasks
Average task length: 25.5 turns, 8,722 tokens
3,500+ verified tasks total

No Agent Extensibility

There is no plugin system, no way to add skills, and no mechanism to inject custom tool behavior into the BashAgent. The evaluation harness is deliberately minimal to avoid confounding benchmark results with harness sophistication.

Memory And Context

terminal-bench-env — Memory & Context

Memory Model

Session-only, no persistence.

The BashAgent has no memory beyond the current task session. Context is a sliding window over the conversation history within one task run. When the task ends (success or max_turns), all context is discarded.

Context Window Management

History is a list of {"role": ..., "content": ...} dicts
No summarization
No compaction
When context exceeds the model's limit: older turns are dropped from the front (sliding window)
Average task: 25.5 turns, 8,722 tokens — fits in most modern context windows without truncation
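A sliding window of this kind can be sketched in a few lines. The token estimate below (about four characters per token) and the names estimate_tokens and apply_sliding_window are assumptions for illustration; the repository may count tokens differently.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(message: Message) -> int:
    """Crude token estimate: roughly 4 characters per token (an assumption)."""
    return max(1, len(message["content"]) // 4)


def apply_sliding_window(history: List[Message], budget: int) -> List[Message]:
    """Keep the newest turns whose estimated total fits within the budget."""
    kept: List[Message] = []
    total = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for message in reversed(history):
        cost = estimate_tokens(message)
        if total + cost > budget:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "assistant", "content": "a" * 40},  # ~10 tokens
    {"role": "user", "content": "b" * 40},       # ~10 tokens
    {"role": "assistant", "content": "c" * 40},  # ~10 tokens
]
trimmed = apply_sliding_window(history, budget=20)
print([m["content"][0] for m in trimmed])
```

With a budget of 20 estimated tokens, the oldest turn is dropped and the two most recent turns remain.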

State Storage

Level	Storage	Persistence
Agent reasoning	In-memory list	Session only
Shell state	tmux pane (Docker)	Task lifetime
Filesystem state	Docker container	Task lifetime
Results/scores	Operator-managed	External

No Cross-Task Memory

Each task is an independent Docker container
No shared filesystem between tasks
No vector store, SQLite, or external DB
No semantic search over prior runs

Implicit Memory via Docker

The Docker container provides a form of "working memory" for the duration of the task:

Files written to the container persist for the task lifetime
Commands like cat > notes.txt work as scratch memory
Agents use this explicitly, for example to write plans and intermediate results

Context for Benchmark vs Fine-tuning

For fine-tuning (TermiGen use case), trajectories ARE persisted — but by the data collection infrastructure, not by the agent itself. The benchmark consumer (researcher) stores the full conversation log externally.

Comparison to Seeds

Unlike seeds with persistent memory:

ccmemory: persistent vector/SQLite memory across sessions
agent-os: tiered memory (working/episodic/semantic)
claude-flow: SQLite hive-mind shared state

terminal-bench-env intentionally has NO persistent memory — evaluation requires a fresh state for reproducibility.

Uniqueness

terminal-bench-env — Uniqueness & Positioning

differs_from_seeds

terminal-bench-env has no architectural overlap with any seed framework. All seeds are agent harnesses (providing workflow, skills, commands, memory to humans building software). terminal-bench-env is evaluation infrastructure (providing Docker environments and a minimal reference agent to researchers measuring agent performance). It operates one layer below agent harnesses — it is what you test an agent against, not what you use to build software.

Distinctive Positioning

Verified executable environments at scale: 3,500+ Docker tasks with automated verifiers. No other entry in this batch provides a benchmark corpus — all others are agent harnesses or orchestrators.
Domain breadth: 11 task categories including unusual domains (formal verification with Coq/TLA+, distributed computing with MPI/Spark, interactive environments with curses/Jupyter). Most agent benchmarks focus on software development (SWE-Bench). terminal-bench-env covers the full terminal-capable surface area.
Error-correction trajectory synthesis: The TermiGen paper's core contribution is the data pipeline: collect failure trajectories, add corrections, synthesize training data. The benchmark provides the ground truth for this pipeline. +26.8% absolute improvement over the base model is the claimed result.
Companion to a published model: TermiGen-32B (separate HuggingFace artifact) is a direct product of this dataset. The benchmark and the model are co-released — unusual for research artifacts.
Harbor 2.0 format support: Dual-format support (TerminalBench 1.0 Docker Compose + Harbor 2.0) means the corpus can be used with different evaluation frameworks. Most benchmark repos are locked to one format.

Observable Limitations

No declared license, which limits reuse; modest visibility (82 stars)
Research artifact, likely static after paper publication (last commit 2026-03-24)
BashAgent is minimal (no tool abstraction, no memory) — for baseline comparison only
No integration with commercial agent harnesses (Claude Code, Gemini CLI) — would require wrapping
Tasks are terminal-specific; no GUI, no browser, no document-editing tasks
Verifier script quality varies by domain (formal verification is harder to automate)

Not a Fit For

Building software with AI assistance
Persistent multi-session workflows
Team collaboration on code
Any use case requiring memory, skills, or tool abstraction

Fit For

Researchers measuring terminal agent capability
Teams building training datasets for terminal-focused models
Ablation studies on agent loop designs
Curriculum learning: identify which task categories an agent struggles with

Workflow

terminal-bench-env — Workflow

Evaluation Workflow (Benchmark Consumer)

1. Select a task from tasks/<category>/
2. docker-compose up (or Harbor start)
3. Run bash_agent.py <task_id>
4. Agent loop runs until DONE or max_turns
5. Verifier script executes
6. Pass/Fail recorded

Agent Loop

system_prompt = load_task_description(task_id)
history = []

for turn in range(max_turns):
    response = llm.complete(system_prompt, history)
    if "DONE" in response:
        break
    command = extract_command(response)
    output = tmux_exec(command)
    history.append({"role": "assistant", "content": response})
    history.append({"role": "user", "content": output})

result = run_verifier()
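The loop above can be made runnable without a model or a container by injecting the two external pieces as callables. The following is a minimal sketch under stated assumptions: run_agent_loop, its parameters, and the default of 30 turns are hypothetical, and the whole model reply is treated as the command to keep the example short.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def run_agent_loop(
    system_prompt: str,
    complete: Callable[[str, List[Message]], str],
    execute: Callable[[str], str],
    max_turns: int = 30,  # assumed default, not taken from the repository
) -> Tuple[List[Message], bool]:
    """Run a ReAct-style loop; return (history, finished_before_max_turns)."""
    history: List[Message] = []
    for _ in range(max_turns):
        response = complete(system_prompt, history)
        history.append({"role": "assistant", "content": response})
        if "DONE" in response:
            return history, True
        # Simplification: the whole response is treated as the command.
        output = execute(response.strip())
        history.append({"role": "user", "content": output})
    return history, False


# Scripted stand-ins so the loop can run without a model or a container.
scripted = iter(["echo hello", "DONE"])
history, finished = run_agent_loop(
    "Print hello, then say DONE.",
    complete=lambda prompt, past: next(scripted),
    execute=lambda command: f"(pretend output of: {command})",
)
print(finished, len(history))
```

Passing the model call and the shell call as parameters keeps the loop itself free of network and Docker dependencies, which makes ablations on the loop design easier to test in isolation.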

No Development Workflow

terminal-bench-env has no workflow for software development tasks from the agent's perspective. The workflow is entirely:

Benchmark operator: set up environment → run agent → collect scores
Not: developer → write spec → implement → review

Data Collection Workflow (Research Pipeline)

For TermiGen paper trajectory collection:

Expert human solves a task (trajectory captured)
BashAgent attempts same task (some fail)
Error-correction: corrections to failed steps added
Dataset assembled: (prompt, correct_trajectory) pairs
Fine-tuning run on Qwen2.5-Coder-32B base

Phases

Phase	Actor	Action
Environment setup	Operator	`docker-compose up`
Task briefing	System	Load task.json into system prompt
Execution	BashAgent	ReAct loop, tmux commands
Verification	System	Run verifier script
Scoring	Operator	Aggregate pass rates
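The scoring phase amounts to grouping pass/fail outcomes and dividing. A small sketch of per-category aggregation follows; the function name pass_rates and the sample results are made up for illustration and are not measurements from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to the fraction of its tasks that passed."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / total[category] for category in total}


# Made-up results for illustration only.
results = [
    ("security", True),
    ("security", False),
    ("devops", True),
    ("devops", True),
]
rates = pass_rates(results)
print(rates)
```

Breaking the rate down by category is what makes the curriculum use case possible: a low rate in one category points to a skill gap there.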

No Approval Gates

No human-in-the-loop during task execution. The loop runs to completion or max turns without interruption.

No Spec Format

There is no spec file format. Tasks are described in task.json/task.yaml with fields like description, success_criteria, setup_commands — these are task definitions, not feature specs.
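A task definition with these fields could be loaded and checked as sketched below. The field names come from the description above, but their types (for example, success_criteria as a plain string), the TaskDefinition class, and the sample task are assumptions; the actual schema in the repository may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskDefinition:
    description: str
    success_criteria: str  # assumed to be free text
    setup_commands: List[str] = field(default_factory=list)


def parse_task(raw: str) -> TaskDefinition:
    """Parse a task.json-style document, failing loudly on missing fields."""
    data = json.loads(raw)
    missing = [k for k in ("description", "success_criteria") if k not in data]
    if missing:
        raise ValueError(f"task definition is missing: {', '.join(missing)}")
    return TaskDefinition(
        description=data["description"],
        success_criteria=data["success_criteria"],
        setup_commands=list(data.get("setup_commands", [])),
    )


# Hypothetical task content, not copied from the repository.
example = json.dumps({
    "description": "Compress /data/logs into /data/logs.tar.gz",
    "success_criteria": "/data/logs.tar.gz exists and lists every log file",
    "setup_commands": ["mkdir -p /data/logs"],
})
task = parse_task(example)
print(task.description)
```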

Applicable Use Cases

Benchmark: Evaluate any terminal agent against the 3,500+ tasks
Fine-tuning data: Use verified trajectories to train new models
Curriculum: Use task categories to measure agent skill gaps
Research baseline: Compare new agent architectures against TermiGen-32B baselines

Multi Agent

terminal-bench-env — Multi-Agent

Multi-Agent Support

None. terminal-bench-env is a single-agent evaluation framework. There is one BashAgent per task run. No coordination, no spawning, no swarm.

Parallelism

The only form of parallelism is at the benchmark operator level: running multiple task evaluations in parallel (each in its own Docker container). This is external parallelism managed by the operator's test runner, not an internal multi-agent architecture.
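Operator-level parallelism of this kind can be sketched with the standard library. The names evaluate_all and fake_evaluate are hypothetical, and the worker count is an arbitrary choice; a real evaluator would start a container, run the agent, and run the verifier for each task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def evaluate_all(
    task_ids: Iterable[str],
    evaluate_one: Callable[[str], bool],
    max_workers: int = 4,
) -> Dict[str, bool]:
    """Run evaluate_one for every task id concurrently and collect pass/fail."""
    ids = list(task_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so outcomes line up with ids.
        outcomes = list(pool.map(evaluate_one, ids))
    return dict(zip(ids, outcomes))


# Stand-in evaluator so the example runs without Docker.
def fake_evaluate(task_id: str) -> bool:
    return task_id.endswith("-ok")


print(evaluate_all(["t1-ok", "t2-fail", "t3-ok"], fake_evaluate))
```

Threads are a reasonable fit here because each evaluation mostly waits on subprocesses and network calls rather than on Python computation.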

Why Single-Agent

The benchmark is designed to measure single-agent terminal capability:

Tasks are defined for one agent
Verification expects outputs from one agent's actions
Multi-agent setups would confound the measurement

Comparison to Multi-Agent Seeds

Framework	Multi-agent	Mechanism
claude-flow	Yes	Hive-mind, SQLite bus
scion-gcp	Yes	Container swarm, tmux sessions
clawmanager	Yes	K8s pods, Redis team bus
terminal-bench-env	No	N/A

No Orchestration

No coordinator role
No subagent spawning
No message passing between agents
No consensus mechanism
No shared state between concurrent runs

TermiGen-32B Multi-Instance Evaluation

During benchmarking at scale, multiple TermiGen-32B instances run in parallel across the 3,500+ task corpus, but each instance operates independently. This is embarrassingly parallel evaluation infrastructure, not a multi-agent system.

Isolation And Security

terminal-bench-env — Isolation & Security

Isolation Mechanism: Docker Container

Each task runs inside a Docker container. This provides:

Filesystem isolation: Fresh state per task; agent cannot access host filesystem
Process isolation: Container processes run in their own namespaces, separate from host processes
Network isolation: Most tasks configure no-egress or limited network access
Resource limits: CPU/memory limits via Docker

Container Lifecycle

docker-compose up → task starts
  agent runs commands in container
docker-compose down → container destroyed, all state lost

Every task evaluation gets a clean container from the image. No state leaks between evaluations.
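The up/down lifecycle maps naturally onto a context manager that guarantees teardown. The sketch below assumes a per-task compose file and uses a hypothetical path; the command runner is injectable so the example can be exercised without Docker.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], object]


def _run(argv: List[str]) -> object:
    return subprocess.run(argv, check=True)


@contextmanager
def task_environment(compose_file: str, run: Runner = _run) -> Iterator[None]:
    """Bring a task's containers up, and always tear them down afterwards."""
    base = ["docker-compose", "-f", compose_file]
    run(base + ["up", "-d"])
    try:
        yield
    finally:
        # Runs even if the agent or verifier raises, so no state leaks.
        run(base + ["down"])


# Dry run with a recording runner, so nothing is actually started here.
calls: List[List[str]] = []
with task_environment("tasks/example/docker-compose.yml", run=calls.append):
    pass  # the agent and verifier would run here
print(calls)
```

Putting the teardown in a finally clause means a crashed agent run still destroys its container, which matches the clean-state guarantee described above.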

Security Threat Model

The threat model is benchmark integrity, not production security:

Prevent agent from cheating (e.g., reading the verifier script)
Prevent environment contamination between tasks
Ensure reproducibility (same starting state every time)

There is no threat model for protecting credentials, user data, or network access beyond task isolation. This is a research artifact, not a production system.

Harbor Framework

Harbor 2.0 tasks use a similar container-based isolation but with the Harbor manifest format specifying additional environment constraints.

Verifier Access Control

The verification script typically runs outside the agent's working directory or in a privileged context. The agent cannot modify the verifier or pre-seed the verification results (within normal Docker isolation).

No Credential Management

No API keys managed by the framework
bash_agent.py accepts a model endpoint; the researcher supplies their own API key via an environment variable
No credential vault, no secret injection beyond env vars

Comparison to Batch Peers

Framework	Isolation	Security Focus
ironclaw	WASM capability sandbox	Production tool safety
stakpak-agent	Docker + Warden network	Network egress control
agentbox-mattolson	mitmproxy + iptables	Two-layer network enforcement
osaurus	Apple Container Linux VM	Privacy + OS-level isolation
terminal-bench-env	Docker container	Benchmark reproducibility

terminal-bench-env's isolation goal is reproducibility, not security. This is appropriate for a research artifact.

Ui Cli Surface

terminal-bench-env — UI & CLI Surface

CLI Surface

Minimal. The "CLI" is running the Python scripts directly:

python bash_agent.py --task <task_id> --model <endpoint>
python bash_agent_harbor.py --task <task_id> --model <endpoint>

No dedicated CLI binary. No subcommand structure. No --help beyond Python's argparse.

No Web UI

No dashboard, no web frontend, no status monitor.

tmux Interface

The BashAgent uses tmux as an execution substrate:

Creates a tmux session for each task run
Sends commands via tmux send-keys
Reads output via tmux capture-pane
This is internal infrastructure — the operator does not interact with the tmux session

The operator may observe the tmux session for debugging by attaching to it, but this is not a designed interaction surface.
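The tmux interaction described above comes down to three commands. The sketch below only builds and prints the argument lists, so it runs without tmux installed; the helper names and the session name task-0001 are hypothetical, and how bash_agent.py actually invokes tmux is not shown in this document.

```python
import shlex
from typing import List


def new_session_argv(session: str) -> List[str]:
    """argv that starts a detached tmux session."""
    return ["tmux", "new-session", "-d", "-s", session]


def send_keys_argv(session: str, command: str) -> List[str]:
    """argv that types a command into the session and presses Enter."""
    return ["tmux", "send-keys", "-t", session, command, "Enter"]


def capture_pane_argv(session: str) -> List[str]:
    """argv that prints the visible pane contents to stdout (-p)."""
    return ["tmux", "capture-pane", "-p", "-t", session]


# Printed rather than executed, so the example runs without tmux installed.
for argv in (
    new_session_argv("task-0001"),
    send_keys_argv("task-0001", "ls -la /app"),
    capture_pane_argv("task-0001"),
):
    print(shlex.join(argv))
```

Each list can be passed to subprocess.run as-is. Note that tmux does not signal when a typed command has finished, so an agent built this way has to wait or poll before capturing the pane.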

Observability

stdout/stderr logging from the agent script
JSON log files of trajectories (for data collection)
No structured metrics dashboard
No OTEL or telemetry

Docker Compose Interface

docker-compose up     # Start task environment
docker-compose down   # Destroy environment
docker-compose logs   # View container logs

Standard Docker tooling — no custom wrappers.

Output Format

Task results are typically written to a file or stdout:

PASS / FAIL from verifier
Full conversation transcript (for trajectory collection)
Token count, turn count per task
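A per-task result carrying these fields could be serialized as one JSON line per task. The record layout below is an assumption for illustration: the TaskResult class, its field names, and the sample values are hypothetical, not the repository's log format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    turns: int
    tokens: int
    transcript: List[Dict[str, str]]


def to_jsonl_line(result: TaskResult) -> str:
    """Serialize one result as a single JSON line, suitable for appending."""
    return json.dumps(asdict(result), sort_keys=True)


result = TaskResult(
    task_id="example-task",  # hypothetical identifier
    passed=True,
    turns=1,
    tokens=12,
    transcript=[{"role": "assistant", "content": "DONE"}],
)
line = to_jsonl_line(result)
print(line)
```

One line per task keeps the file appendable during long runs and easy to stream when assembling trajectories for fine-tuning.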

Target Audience

The primary users of this CLI surface are ML researchers running benchmark evaluations, not end users of an agent harness. The interface is intentionally minimal — researchers are expected to write their own evaluation scripts around the primitives.

Distribution

Type: script
License: unknown (none declared)
Install: simple
Version: main branch (2026-03-24)

Surfaces

CLI binary: No
CLI subcmds: 0
Local UI: No
Tech stack: none

Components

Commands: 0
Skills: 0
Subagents: 0
Hooks: 0
MCP servers: 0
MCP tools: 0
Scripts: 2
Templates: 0

Workflow

Phases: 4
Approval gates: 0
Spec format: none
Spec storage: none
Delta or full: none

Orchestration

Multi-agent: No
Pattern: none
Max concurrent: 1
Isolation: container
Consensus: none
Prompt chaining: No

Multi-model

Multi-model: No
BYOK: Yes
Modal: no

Execution

Mode: script
Crash recovery: No
Compaction: No
Session handoff: No
Streaming: Yes

Memory

Type: none
Persistence: session
Search: none
State files: 2 files

Quality

TDD: No
TDD mechanism: none
Validators: 1
Self-review: none

Git / Observability

Auto commit: No
Auto PR: No
Auto merge: No
Worktree/feat: No
Audit log: No
Audit format: none
Replay: Yes

Tools

Primary: none (model-agnostic)
Targets: 1
Portability: high

Signals

Stars: 82
Last commit: 2026-03-24
Maintainer: static (research artifact)
Quality score: 2.4/10