SWE-Pruner

swe-pruner · ByteDance-Seed/SWE-Pruner

Primitive shape
No installable primitives
00

Summary

swe-pruner — Summary

One-line: Neural context pruner that removes irrelevant code tokens before they reach the LLM, cutting 23–54% of tokens on SWE-Bench Verified with a 0.6B fine-tuned model.

Identity

Field | Value
GitHub | https://github.com/ByteDance-Seed/SWE-Pruner
Stars | 282
License | None declared
Language | Python
Version | (no tag; branch: public)
Package type | Standalone research repo (FastAPI server)
Maintainer org | ByteDance Seed (research)

What It Does

swe-pruner serves a FastAPI endpoint (port 8000) that accepts a context payload and returns a pruned version, stripping code chunks the model predicts are irrelevant to the query. The pruner model (code-pruner, 0.6B parameters) is fine-tuned and hosted on HuggingFace (ayanami-kitasan/code-pruner). Agents call the /prune endpoint directly; there is no MCP server, no Claude hook, and no vault.

Claimed Results

  • "Make Claude Tokens 40% Saving!" (badge, verbatim)
  • "23–54% token reduction on SWE-Bench Verified"
  • "up to 14.84x compression on LongCodeQA"
  • Paper: arXiv:2601.16746 (ByteDance Seed)

Archetype

Research paper implementation — FastAPI inference server, not a Claude plugin or MCP server. Closest to a preprocessing layer that any agent framework can call via HTTP.

01

Overview

swe-pruner — Overview

Problem Statement

LLM-based coding agents (Claude, OpenHands, SWE-agent) retrieve large repository contexts before attempting a fix, and most of that context is irrelevant to the specific issue. swe-pruner interposes a neural pruning step: a 0.6B model scores each code chunk's relevance to the query and discards chunks that fall below a relevance threshold before the context reaches the frontier model.

Positioning

  • Research artifact: Accompanying implementation for arXiv:2601.16746 (ByteDance Seed team)
  • Not a SaaS or plugin: No hosted API, no MCP server, no Claude Code plugin
  • Universal preprocessor: Works with any agent that can call an HTTP endpoint (Claude Agent SDK demo, OpenHands demo, direct HTTP)
  • Complementary to retrieval: Sits after BM25/vector retrieval and before the LLM call; pruner trims what retrieval keeps

Key Claims (verbatim from README)

"Make Claude Tokens 40% Saving!"

"23-54% token reduction on SWE-Bench Verified"

"up to 14.84x compression on LongCodeQA"

Target Users

  • Researchers running SWE-Bench evaluations who need reproducibility
  • Agent framework developers wanting to add a pruning preprocessing step
  • Cost-sensitive production users willing to run a local 0.6B GPU inference server

Prerequisites

  • Python >= 3.12
  • CUDA GPU (flash-attn requires CUDA; CPU fallback not documented)
  • HuggingFace model download (~1.2GB): ayanami-kitasan/code-pruner
  • pip install swe-pruner (installs swe-pruner CLI entry point)

02

Architecture

swe-pruner — Architecture

Deployment Model

Agent (Claude / OpenHands / any HTTP client)
       │
       ▼  POST /prune  (JSON: query + context chunks)
┌─────────────────────────────────────────────┐
│  swe-pruner FastAPI server  (port 8000)      │
│                                              │
│  online_serving.py  ←  pyproject entry point │
│  ┌─────────────────────────────────────────┐ │
│  │  code-pruner model (0.6B, HuggingFace)  │ │
│  │  ayanami-kitasan/code-pruner             │ │
│  │  flash-attn inference (CUDA)            │ │
│  └─────────────────────────────────────────┘ │
│  Returns: pruned context chunks              │
└─────────────────────────────────────────────┘
       │
       ▼  Pruned context passed to frontier LLM
  (Claude / GPT-4 / etc.)

Components

File | Role
swe_pruner/online_serving.py | FastAPI server, /prune endpoint, main() entry point
swe_pruner/pruner.py | Model loading (transformers), inference, relevance scoring
examples/claude_agent_sdk_demo.py | Claude Agent SDK integration example
examples/openhands_demo.py | OpenHands integration example
pyproject.toml | scripts: swe-pruner = "swe_pruner.online_serving:main"

Model

  • Name: code-pruner
  • HuggingFace: ayanami-kitasan/code-pruner
  • Parameters: ~0.6B
  • Fine-tuned for: Code chunk relevance classification
  • Inference: flash-attn (requires CUDA)
  • Download: git lfs pull or huggingface-cli download

Required Dependencies

python >= 3.12
flash-attn        # CUDA-only
transformers
fastapi
uvicorn

No Persistent State

swe-pruner is stateless: no vault, no SQLite, no session memory. Each /prune call is independent.

03

Components

swe-pruner — Components

CLI Entry Point

Binary | Source | Purpose
swe-pruner | swe_pruner.online_serving:main | Start FastAPI server

swe-pruner --model-path ./model --port 8000

FastAPI Endpoints

Endpoint | Method | Description
/prune | POST | Accept query + context chunks, return pruned subset
/health | GET | Server health check

Integration Examples

File | Integration Target | Notes
examples/claude_agent_sdk_demo.py | Claude Agent SDK | Shows calling /prune before issuing Claude request
examples/openhands_demo.py | OpenHands | Same pattern for OpenHands agent loop

Claude Hooks

None. swe-pruner has no .claude/settings.json, no MCP server, no plugin manifest.

MCP Tools

None. swe-pruner is not an MCP server.

Skills

None.

HuggingFace Model

  • Repo: ayanami-kitasan/code-pruner
  • Used via transformers AutoModel/AutoTokenizer
  • Not bundled; must be downloaded separately

Paper

arXiv:2601.16746 — "SWE-Pruner: Pruning Irrelevant Context for SWE Agents" (ByteDance Seed)

05

Prompts

swe-pruner — Prompts

Prompt Files

swe-pruner has no CLAUDE.md, no .claude/ directory, no skills, and no prompt templates for the agent runtime.

Model Input (Inference Prompt Pattern)

The code-pruner model itself is a fine-tuned transformer. Its prompt template for relevance scoring is not documented in the repository. Inference goes through the transformers AutoModel and AutoTokenizer classes, and the input is a concatenation of the query and the candidate code chunk.

Integration Notes in README

The README contains a usage snippet showing the JSON payload structure for /prune. No agent-facing CLAUDE.md guidance or system prompt fragments are present.

Examples

examples/claude_agent_sdk_demo.py and examples/openhands_demo.py contain Python integration code showing how to call the pruner within an agent loop, but these are code templates, not prompt files.

09

Uniqueness

swe-pruner — Uniqueness

Differentiator

swe-pruner is the only framework in the batch that uses a fine-tuned neural model for context pruning. Every other framework uses heuristic methods (BM25, entropy scoring, SimHash, knapsack DP, or graph traversal). swe-pruner trains a dedicated 0.6B model to learn relevance from SWE-Bench data, then runs inference on each candidate chunk.

vs. ccmemory (Seed)

Dimension | ccmemory | swe-pruner
Memory store | Neo4j graph (typed nodes) | None
Insertion | 4 lifecycle hooks, LLM detection | None
Retrieval | Cypher graph traversal | N/A
Compression | Not a compression layer | Core feature (neural pruning)
Claude integration | 4 hooks + MCP server | None (HTTP endpoint)
Self-improvement | None | None (fixed model weights)
Persistence | Global vault | Stateless

vs. Batch Peers

  • entroly: Both claim 40-70%+ token reduction. entroly uses knapsack DP + BM25 + entropy (heuristic, no model download). swe-pruner uses a 0.6B model (requires CUDA, ~1GB download, GPU inference latency). entroly adds PRISM RL, WITNESS, proxy mode; swe-pruner adds nothing beyond pruning.
  • lean-ctx: Both claim aggressive compression. lean-ctx uses BM25+vector hybrid at MCP layer. swe-pruner uses neural scoring at HTTP preprocessing layer. lean-ctx has no model download requirement.
  • symdex: symdex retrieves code by structure (imports, AST); swe-pruner filters retrieved code by relevance. Complementary: symdex could feed swe-pruner.
  • claude-self-reflect: CSR improves response quality over time via RL; swe-pruner reduces input size with a fixed model. CSR is session-aware; swe-pruner is stateless.

Unique Capabilities Not Found in Seeds or Batch Peers

  1. Neural relevance scoring: Fine-tuned 0.6B model trained on SWE-Bench data — the only learned (not heuristic) pruner in the batch.
  2. SWE-Bench reproducibility: Published paper (arXiv:2601.16746) with reproducible benchmark numbers; most other frameworks have no formal evaluation.
  3. 14.84x compression on LongCodeQA: Highest single-benchmark compression ratio claim in the batch (lean-ctx claims 99.6% on one test, but different task type).
  4. Universal HTTP interface: Agents call /prune via HTTP — no SDK lock-in, no Claude-specific hooks, works with any agent runtime.

Caveats

  • Requires CUDA GPU: unusable on CPU-only machines without modification
  • No active maintenance signal: no version tags, no license file, and the code lives on a branch named public
  • Model must be downloaded separately (~1GB); not bundled
  • Research prototype: not production-hardened (no auth, no rate limiting, no error recovery)
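
Because the server has no documented error recovery, a caller may want to fail open, so that a pruner outage degrades to "no pruning" instead of blocking the agent. The following is a minimal sketch of that idea, not part of swe-pruner; prune_or_passthrough and the prune_fn callable are hypothetical names, and prune_fn is assumed to return the kept chunks as a list.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger("pruner_client")

# Hypothetical callable type: takes (query, chunks) and returns the kept chunks.
PruneFn = Callable[[str, List[str]], Sequence[str]]


def prune_or_passthrough(query: str, chunks: Sequence[str],
                         prune_fn: PruneFn, retries: int = 1) -> List[str]:
    """Try to prune; if every attempt raises, return the chunks unpruned."""
    original = list(chunks)
    for attempt in range(retries + 1):
        try:
            result = prune_fn(query, original)
        except Exception as exc:  # fail open on any client or server error
            logger.warning("prune attempt %d failed: %s", attempt + 1, exc)
            continue
        return list(result)
    # All attempts failed: the agent still gets its full, unpruned context.
    return original
```

The trade-off is cost, not correctness: when the pruner is down the agent pays for the full context but keeps working.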

04

Workflow

swe-pruner — Workflow

Setup

pip install swe-pruner
# Download the pruner model
huggingface-cli download ayanami-kitasan/code-pruner --local-dir ./model
# Start the server
swe-pruner --model-path ./model --port 8000
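
Model loading can take a while after the server process starts, so an agent script may want to wait for /health before sending its first /prune call. The sketch below polls the endpoint using only the standard library. It assumes that a healthy server answers GET /health with HTTP 200 (the response body is not documented here, so it is ignored); wait_until_healthy and http_probe are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (assumed 'healthy')."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(url: str = "http://localhost:8000/health",
                       timeout_s: float = 120.0,
                       interval_s: float = 2.0,
                       probe: Callable[[str], bool] = http_probe,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The probe and sleep parameters are injectable only so the loop can be exercised without a running server.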

Agent Integration Pattern

1. Agent retrieves candidate context (BM25 / vector search / file read)
2. Agent POSTs to http://localhost:8000/prune:
   {
     "query": "<issue description>",
     "chunks": ["<file1 content>", "<file2 content>", ...]
   }
3. swe-pruner runs code-pruner inference on each chunk
4. Server returns pruned subset (irrelevant chunks removed)
5. Agent uses pruned context for the actual LLM call
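
Step 2 of the pattern above can be sketched with the Python standard library. The "query" and "chunks" field names follow the payload shown above; the response schema is not documented here, so the sketch returns the decoded JSON as-is rather than assuming a shape. build_prune_request and prune are hypothetical helper names, not part of the swe-pruner package.

```python
import json
import urllib.request
from typing import Any, Sequence

PRUNE_URL = "http://localhost:8000/prune"  # default port from the setup steps


def build_prune_request(query: str, chunks: Sequence[str],
                        url: str = PRUNE_URL) -> urllib.request.Request:
    """Build the POST request carrying the query and candidate chunks."""
    payload = json.dumps({"query": query, "chunks": list(chunks)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prune(query: str, chunks: Sequence[str],
          url: str = PRUNE_URL, timeout_s: float = 60.0) -> Any:
    """Send the request and return the decoded JSON response unchanged."""
    req = build_prune_request(query, chunks, url)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as prune("<issue description>", ["<file1 content>", "<file2 content>"]) needs the server from the setup steps to be running; the request builder can be inspected without it.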

Phases

Phase | What Happens | Artifact
Install | pip install swe-pruner | CLI available
Model download | huggingface-cli download | ./model/ directory
Server start | swe-pruner --model-path ./model --port 8000 | FastAPI server on :8000
Prune call | Agent POSTs query + chunks | Pruned context JSON
LLM call | Agent uses pruned context | Normal LLM response

Approval Gates

None.

Feedback Loop

None. swe-pruner has no outcome recording, no RL loop, and no session memory. Pruning quality depends entirely on the fine-tuned model weights.

Spec Format

None. This is not a spec-driven workflow.

06

Memory Context

swe-pruner — Memory & Context

Memory Model

None. swe-pruner has no persistent memory, no vault, no SQLite database, and no cross-session state. Each pruning call is independent.

Context Compression Mechanism

Neural Relevance Pruning

Unlike the heuristic scoring used by the other frameworks in this batch, the core mechanism is a learned relevance classifier:

  • Input: A query (issue description, task) + a list of code chunks (retrieved by BM25, vector search, or any means)
  • Model: code-pruner (0.6B fine-tuned transformer, ayanami-kitasan/code-pruner)
  • Output: A subset of the input chunks judged relevant to the query
  • Method: Each chunk is scored independently for relevance to the query; chunks below threshold are discarded
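
The per-chunk thresholding described above can be illustrated in a few lines. This is a sketch of the selection logic only: the real scorer is the code-pruner model, whose scoring interface and threshold value are not documented here. The keyword_overlap_score function is a toy stand-in so the example runs without a GPU, and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: (query, chunk) -> relevance in [0, 1].
ScoreFn = Callable[[str, str], float]


def prune_by_threshold(query: str, chunks: Sequence[str],
                       score_fn: ScoreFn, threshold: float = 0.5) -> List[str]:
    """Score each chunk independently; keep those at or above the threshold."""
    return [chunk for chunk in chunks if score_fn(query, chunk) >= threshold]


def keyword_overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for the neural scorer: share of query words found in chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)


kept = prune_by_threshold(
    "parse config",
    ["def parse config file", "unrelated logging helper"],
    keyword_overlap_score,
)
print(kept)  # ['def parse config file']
```

Because each chunk is scored on its own, the output size depends on how many chunks clear the threshold, not on any target length.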

No Token Budget Solver

Unlike entroly (knapsack DP) or lean-ctx (budget-aware selection), swe-pruner has no explicit token budget constraint. It prunes to relevance, not to a target count.
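
A caller that needs a hard cap can apply one after pruning. The sketch below is a hypothetical post-processing step, not a swe-pruner feature: it assumes the pruned chunks arrive in priority order and approximates token counts by whitespace splitting, which a real integration would replace with the target model's tokenizer.

```python
from typing import Callable, List, Sequence


def cap_to_budget(chunks: Sequence[str], budget_tokens: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())
                  ) -> List[str]:
    """Greedily keep chunks, in order, while the running total fits the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # does not fit; a later, smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return kept


print(cap_to_budget(["a b c", "d e f g", "h"], budget_tokens=4))  # ['a b c', 'h']
```

This greedy pass is simpler than a knapsack solver and makes no optimality claim; it only guarantees the budget is not exceeded.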

Claimed Reduction Numbers

  • "23-54% token reduction on SWE-Bench Verified" (verbatim from README)
  • "up to 14.84x compression on LongCodeQA" (verbatim from README)
  • "Make Claude Tokens 40% Saving!" (badge, verbatim)
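
The claims above mix two units, percent reduction and an "x" compression ratio. Assuming the ratio means original tokens divided by pruned tokens (the usual definition, though this summary does not confirm it), the two convert as reduction = 1 - 1/ratio. The helper names below are hypothetical, and the figures come from different benchmarks, so the conversion only puts them on a common scale.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of tokens removed for a given compression ratio (orig / pruned)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio (orig / pruned) for a given fraction of tokens removed."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(f"{reduction_from_ratio(14.84) * 100:.1f}%")  # 93.3%
print(f"{ratio_from_reduction(0.23):.2f}x")         # 1.30x
print(f"{ratio_from_reduction(0.54):.2f}x")         # 2.17x
print(f"{ratio_from_reduction(0.40):.2f}x")         # 1.67x
```

Under that assumption, the LongCodeQA maximum corresponds to removing about 93% of tokens, while the SWE-Bench Verified range corresponds to roughly 1.3x to 2.2x compression.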

Context Compaction

swe-pruner does not implement PreCompact hooks or session compaction. It is a preprocessing filter, not a context management layer.

Cross-Session Handoff

None. No CCP, no vault beliefs, no state files.

Comparison: Neural vs. Heuristic Pruning

Approach | Method | Example Framework
Neural (swe-pruner) | Fine-tuned 0.6B model per-chunk relevance | swe-pruner
Knapsack DP | 0/1 DP on entropy scores + budget | entroly
BM25+vector hybrid | Term frequency + embeddings | lean-ctx, symdex
Entropy scoring | Shannon entropy → keep high-entropy | entroly (within knapsack)
No compression | Just store/retrieve | ccmemory, basic-memory

07

Orchestration

swe-pruner — Orchestration

Orchestration Pattern

None. swe-pruner is not an orchestrator. It is a stateless inference service that performs one task: prune a context payload.

Multi-Agent Support

Not applicable. swe-pruner has no concept of agent roles, sub-agents, or task delegation. Multiple agents could independently call the /prune endpoint, but swe-pruner itself does not coordinate them.

Isolation Mechanism

None. The FastAPI server runs in a single process. There is no sandbox, container enforcement, or permission model.

Execution Mode

Server: The pruner runs as a persistent FastAPI process that agents call via HTTP. Agents are not modified; they add a pre-processing HTTP call before their LLM call.

Approval Gates

None.

Self-Improvement

None. The code-pruner model weights are fixed at inference time. There is no feedback loop or online learning.

Integration Pattern

swe-pruner acts as a middleware filter in the agent's retrieval-to-LLM pipeline:

Retrieval (BM25/vector) → swe-pruner /prune → LLM call

This is orthogonal to orchestration frameworks. It can be dropped into any agent architecture that can make an HTTP call before the LLM step.
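
The three-stage pipeline can be expressed as a plain function composition. The sketch below is illustrative only: answer_with_pruning and its three callables are hypothetical names, the prune stage is assumed to return the kept chunks as a list, and the prompt layout is an arbitrary choice.

```python
from typing import Callable, List, Sequence


def answer_with_pruning(query: str,
                        retrieve: Callable[[str], Sequence[str]],
                        prune: Callable[[str, Sequence[str]], Sequence[str]],
                        call_llm: Callable[[str], str]) -> str:
    """Retrieval -> pruning -> LLM call, with each stage passed in as a callable."""
    candidates = retrieve(query)            # BM25, vector search, or file reads
    kept: List[str] = list(prune(query, candidates))  # e.g. a POST to /prune
    prompt = "\n\n".join(kept) + "\n\nTask: " + query
    return call_llm(prompt)


# Stub stages so the flow can be followed without a server or an API key.
result = answer_with_pruning(
    "fix bug",
    retrieve=lambda q: ["chunk a", "chunk b", "chunk c"],
    prune=lambda q, chunks: chunks[:2],
    call_llm=lambda prompt: prompt,
)
print(result)
```

Keeping the stages as separate callables means the pruner can be swapped out or bypassed without touching the retrieval or LLM code.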

08

UI CLI Surface

swe-pruner — UI & CLI Surface

CLI Binary

  • Name: swe-pruner
  • Source: swe_pruner.online_serving:main (pyproject.toml scripts)
  • Subcommands: None. A single command starts the server:

swe-pruner --model-path ./model --port 8000

Arguments

Flag | Default | Description
--model-path | required | Path to downloaded code-pruner model
--port | 8000 | FastAPI server port

Local UI

None. No browser dashboard, no TUI.

API Surface

Endpoint | Method | Description
/prune | POST | Prune context chunks for a query
/health | GET | Server health check

Transport

HTTP only (FastAPI/uvicorn). No MCP, no stdio, no WebSocket.

HuggingFace Demo

Model available at: https://huggingface.co/ayanami-kitasan/code-pruner

No interactive HuggingFace Space demo documented.

Observability

None beyond standard uvicorn access logs.
