terminal-bench-env — Summary
terminal-bench-env is a research repository from UC Santa Barbara's ML Security lab providing 3,500+ verified Docker environments and two minimal BashAgent implementations for evaluating terminal-based AI agents. It accompanies the TermiGen paper (arXiv 2602.07274), which introduces a 32B-parameter model (TermiGen-32B) fine-tuned from Qwen2.5-Coder via error-correction trajectory synthesis. The repository is not an agent harness in the traditional sense; it is a benchmark environment corpus spanning 11 task categories (infrastructure, DevOps, security, data processing, ML/MLOps, algorithms, software development, scientific computing, interactive environments, distributed computing, formal verification). Tasks are available in two formats: TerminalBench 1.0 (Docker Compose) and Harbor 2.0. The BashAgent implementation is a minimal ReAct-style agent that interacts with the shell through tmux. This is a Tier B/C entry: it has no workflow methodology, no skill system, and no persistent memory, and serves purely as evaluation infrastructure.
Differs from seeds: No seed is a benchmark environment. terminal-bench-env is closer to evaluation infrastructure than to an agent harness, and it has no philosophical or architectural overlap with any seed framework.
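To make "minimal ReAct-style agent with tmux-based shell interaction" concrete, here is a rough sketch of what such a loop can look like. It is not the repository's BashAgent code: the names `Shell`, `TmuxShell`, `react_loop`, and the `policy` callable (a stand-in for the model call) are all hypothetical, and the fixed sleep before reading the pane is a simplifying assumption. The only external commands used are the standard tmux subcommands `new-session`, `send-keys`, `capture-pane`, and `kill-session`.

```python
import subprocess
import time
from typing import Callable, List, Optional, Protocol, Tuple

# One recorded step: (thought, command, observation).
Step = Tuple[str, str, str]
# Hypothetical policy signature: given the task and the history so far,
# return a thought and the next command, or None as the command to stop.
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str]]]


class Shell(Protocol):
    """Anything that can run a command and return what the terminal shows."""

    def run(self, command: str) -> str: ...


class TmuxShell:
    """Drives a detached tmux session. Assumes tmux is installed and on PATH."""

    def __init__(self, session: str = "agent", settle_seconds: float = 0.5):
        self.session = session
        self.settle_seconds = settle_seconds
        subprocess.run(["tmux", "new-session", "-d", "-s", session], check=True)

    def run(self, command: str) -> str:
        # Send the command text literally (-l), then press Enter separately,
        # so the text is never misread as a tmux key name.
        subprocess.run(
            ["tmux", "send-keys", "-t", self.session, "-l", command], check=True
        )
        subprocess.run(["tmux", "send-keys", "-t", self.session, "Enter"], check=True)
        # Assumption: a fixed pause is enough for the command to finish.
        time.sleep(self.settle_seconds)
        result = subprocess.run(
            ["tmux", "capture-pane", "-p", "-t", self.session],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout

    def close(self) -> None:
        subprocess.run(["tmux", "kill-session", "-t", self.session], check=False)


def react_loop(
    policy: Policy, shell: Shell, task: str, max_steps: int = 10
) -> List[Step]:
    """Alternate reasoning (policy) and acting (shell) until the policy stops."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, command = policy(task, history)
        if command is None:  # the policy signals that the task is finished
            break
        observation = shell.run(command)
        history.append((thought, command, observation))
    return history
```

In this sketch the agent keeps no state beyond the per-episode `history` list, which matches the "no persistent memory" characterization above. Passing the shell in as a parameter lets the same loop run against a real tmux session or against a stub, which is convenient when the Docker environment is not available.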