SWE-Pruner

Primitive shape

No installable primitives

Summary

swe-pruner — Summary

One-line: Neural context pruner that removes irrelevant code tokens before they reach the LLM, cutting 23–54% of tokens on SWE-Bench Verified with a 0.6B fine-tuned model.

Identity

Field	Value
GitHub	https://github.com/ByteDance-Seed/SWE-Pruner
Stars	282
License	None declared
Language	Python
Version	(no tag; branch: public)
Package type	Standalone research repo (FastAPI server)
Maintainer org	ByteDance Seed (research)

What It Does

swe-pruner serves a FastAPI endpoint (port 8000) that accepts a context payload and returns a pruned version, stripping code chunks the model predicts are irrelevant to the query. The pruner model (code-pruner, 0.6B parameters) is fine-tuned and hosted on HuggingFace (ayanami-kitasan/code-pruner). Agents call the /prune endpoint directly; there is no MCP server, no Claude hook, and no vault.

Claimed Results

"Make Claude Tokens 40% Saving!" (badge, verbatim)
"23–54% token reduction on SWE-Bench Verified"
"up to 14.84x compression on LongCodeQA"
Paper: arXiv:2601.16746 (ByteDance Seed)

Archetype

Research paper implementation — FastAPI inference server, not a Claude plugin or MCP server. Closest to a preprocessing layer that any agent framework can call via HTTP.

Overview

swe-pruner — Overview

Problem Statement

LLM-based coding agents (Claude, OpenHands, SWE-agent) retrieve large repository contexts before attempting a fix, and most of that context is irrelevant to the specific issue. swe-pruner addresses this by interposing a neural pruning step: a small 0.6B model judges each code chunk's relevance and discards chunks that score below a threshold before the context reaches the frontier model.

Positioning

Research artifact: Accompanying implementation for arXiv:2601.16746 (ByteDance Seed team)
Not a SaaS or plugin: No hosted API, no MCP server, no Claude Code plugin
Universal preprocessor: Works with any agent that can call an HTTP endpoint (Claude Agent SDK demo, OpenHands demo, direct HTTP)
Complementary to retrieval: Sits after BM25/vector retrieval and before the LLM call; pruner trims what retrieval keeps

Key Claims (verbatim from README)

"Make Claude Tokens 40% Saving!"

"23-54% token reduction on SWE-Bench Verified"

"up to 14.84x compression on LongCodeQA"

Target Users

Researchers running SWE-Bench evaluations who need reproducibility
Agent framework developers wanting to add a pruning preprocessing step
Cost-sensitive production users willing to run a local 0.6B GPU inference server

Prerequisites

Python >= 3.12
CUDA GPU (flash-attn requires CUDA; CPU fallback not documented)
HuggingFace model download (~1.2GB): ayanami-kitasan/code-pruner
pip install swe-pruner (installs swe-pruner CLI entry point)

Architecture

swe-pruner — Architecture

Deployment Model

Agent (Claude / OpenHands / any HTTP client)
       │
       ▼  POST /prune  (JSON: query + context chunks)
┌─────────────────────────────────────────────┐
│  swe-pruner FastAPI server  (port 8000)      │
│                                              │
│  online_serving.py  ←  pyproject entry point │
│  ┌─────────────────────────────────────────┐ │
│  │  code-pruner model (0.6B, HuggingFace)  │ │
│  │  ayanami-kitasan/code-pruner             │ │
│  │  flash-attn inference (CUDA)            │ │
│  └─────────────────────────────────────────┘ │
│  Returns: pruned context chunks              │
└─────────────────────────────────────────────┘
       │
       ▼  Pruned context passed to frontier LLM
  (Claude / GPT-4 / etc.)

Components

File	Role
`swe_pruner/online_serving.py`	FastAPI server, `/prune` endpoint, `main()` entry point
`swe_pruner/pruner.py`	Model loading (transformers), inference, relevance scoring
`examples/claude_agent_sdk_demo.py`	Claude Agent SDK integration example
`examples/openhands_demo.py`	OpenHands integration example
`pyproject.toml`	`scripts: swe-pruner = "swe_pruner.online_serving:main"`

Model

Name: code-pruner
HuggingFace: ayanami-kitasan/code-pruner
Parameters: ~0.6B
Fine-tuned for: Code chunk relevance classification
Inference: flash-attn (requires CUDA)
Download: git lfs pull or huggingface-cli download

Required Dependencies

python >= 3.12
flash-attn        # CUDA-only
transformers
fastapi
uvicorn

No Persistent State

swe-pruner is stateless: no vault, no SQLite, no session memory. Each /prune call is independent.

Components

swe-pruner — Components

CLI Entry Point

Binary	Source	Purpose
`swe-pruner`	`swe_pruner.online_serving:main`	Start FastAPI server

swe-pruner --model-path ./model --port 8000

FastAPI Endpoints

Endpoint	Method	Description
`/prune`	POST	Accept query + context chunks, return pruned subset
`/health`	GET	Server health check

Integration Examples

File	Integration Target	Notes
`examples/claude_agent_sdk_demo.py`	Claude Agent SDK	Shows calling `/prune` before issuing Claude request
`examples/openhands_demo.py`	OpenHands	Same pattern for OpenHands agent loop

Claude Hooks

None. swe-pruner has no .claude/settings.json, no MCP server, no plugin manifest.

MCP Tools

None. swe-pruner is not an MCP server.

Skills

None.

HuggingFace Model

Repo: ayanami-kitasan/code-pruner
Used via transformers AutoModel/AutoTokenizer
Not bundled; must be downloaded separately

Paper

arXiv:2601.16746 — "SWE-Pruner: Pruning Irrelevant Context for SWE Agents" (ByteDance Seed)

Prompts

swe-pruner — Prompts

Prompt Files

swe-pruner has no CLAUDE.md, no .claude/ directory, no skills, and no prompt templates for the agent runtime.

Model Input (Inference Prompt Pattern)

The code-pruner model is a fine-tuned transformer. The repository does not publicly document its prompt template for relevance scoring. Inference goes through the transformers AutoModel and AutoTokenizer classes, and the input is a concatenation of the query and the candidate code chunk.

Integration Notes in README

The README contains a usage snippet showing the JSON payload structure for /prune. No agent-facing CLAUDE.md guidance or system prompt fragments are present.

Examples

examples/claude_agent_sdk_demo.py and examples/openhands_demo.py contain Python integration code showing how to call the pruner within an agent loop, but these are code templates, not prompt files.

Uniqueness

swe-pruner — Uniqueness

Differentiator

swe-pruner is the only framework in the batch that uses a fine-tuned neural model for context pruning. Every other framework uses heuristic methods (BM25, entropy scoring, SimHash, knapsack DP, or graph traversal). swe-pruner instead relies on a dedicated 0.6B model, fine-tuned on SWE-Bench data to predict relevance, and runs inference on each candidate chunk.

vs. ccmemory (Seed)

Dimension	ccmemory	swe-pruner
Memory store	Neo4j graph (typed nodes)	None
Insertion	4 lifecycle hooks, LLM detection	None
Retrieval	Cypher graph traversal	N/A
Compression	Not a compression layer	Core feature (neural pruning)
Claude integration	4 hooks + MCP server	None (HTTP endpoint)
Self-improvement	None	None (fixed model weights)
Persistence	Global vault	Stateless

vs. Batch Peers

entroly: Both claim large token reductions (entroly 40-70%+, swe-pruner 23-54% on SWE-Bench Verified). entroly uses knapsack DP + BM25 + entropy (heuristic, no model download). swe-pruner uses a 0.6B model (requires CUDA, ~1.2GB download, GPU inference latency). entroly adds PRISM RL, WITNESS, and proxy mode; swe-pruner adds nothing beyond pruning.
lean-ctx: Both claim aggressive compression. lean-ctx uses BM25+vector hybrid at MCP layer. swe-pruner uses neural scoring at HTTP preprocessing layer. lean-ctx has no model download requirement.
symdex: symdex retrieves code by structure (imports, AST); swe-pruner filters retrieved code by relevance. Complementary: symdex could feed swe-pruner.
claude-self-reflect: CSR improves response quality over time via RL; swe-pruner reduces input size with a fixed model. CSR is session-aware; swe-pruner is stateless.

Unique Capabilities Not Found in Seeds or Batch Peers

Neural relevance scoring: Fine-tuned 0.6B model trained on SWE-Bench data — the only learned (not heuristic) pruner in the batch.
SWE-Bench reproducibility: Published paper (arXiv:2601.16746) with reproducible benchmark numbers; most other frameworks have no formal evaluation.
14.84x compression on LongCodeQA: Highest single-benchmark compression ratio claim in the batch (lean-ctx claims 99.6% on one test, but different task type).
Universal HTTP interface: Agents call /prune via HTTP — no SDK lock-in, no Claude-specific hooks, works with any agent runtime.
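The compression claims listed above are stated in two different units (percent token reduction and an "x" compression ratio), which makes them hard to compare at a glance. The short sketch below converts between the two; the function names are illustrative, and the conversion says nothing about whether the underlying benchmarks are comparable.

```python
def reduction_to_ratio(reduction: float) -> float:
    """Convert a fractional token reduction (0.54 means 54%) to a compression ratio."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def ratio_to_reduction(ratio: float) -> float:
    """Convert a compression ratio (14.84 means 14.84x) to a fractional reduction."""
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    return 1.0 - 1.0 / ratio


# Figures quoted in this document, expressed in the other unit.
print(f"23% reduction  = {reduction_to_ratio(0.23):.2f}x")   # 1.30x
print(f"54% reduction  = {reduction_to_ratio(0.54):.2f}x")   # 2.17x
print(f"14.84x ratio   = {ratio_to_reduction(14.84):.1%}")   # 93.3%
```

Read this way, the SWE-Bench Verified range corresponds to roughly 1.3x to 2.2x compression, while the LongCodeQA figure corresponds to roughly a 93% reduction.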

Caveats

Requires CUDA GPU: unusable on CPU-only machines without modification
No active maintenance signal: no version tags, no license file, and the code lives on a branch named public
Model must be downloaded separately (~1.2GB); not bundled
Research prototype: not production-hardened (no auth, no rate limiting, no error recovery)

Workflow

swe-pruner — Workflow

Setup

pip install swe-pruner
# Download the pruner model
huggingface-cli download ayanami-kitasan/code-pruner --local-dir ./model
# Start the server
swe-pruner --model-path ./model --port 8000
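
Loading the model may take some time after the server process starts, so a client may want to wait before sending prune requests. A minimal sketch, assuming the /health endpoint answers with HTTP 200 once the server is ready; the response body is not documented here, so only the status code is checked. The function name, retry count, and delay are illustrative and not part of swe-pruner.

```python
import time
import urllib.request


def wait_until_healthy(
    url="http://localhost:8000/health",
    attempts=30,
    delay=2.0,
    opener=urllib.request.urlopen,  # injectable so the function can be tested offline
    sleep=time.sleep,
):
    """Poll the health endpoint until it returns HTTP 200 or attempts run out."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError, HTTPError, refused connections, and timeouts all
            # derive from OSError; treat them as "not ready yet".
            pass
        sleep(delay)
    return False
```

An agent script could call `wait_until_healthy()` once at startup and abort if it returns False.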

Agent Integration Pattern

1. Agent retrieves candidate context (BM25 / vector search / file read)
2. Agent POSTs to http://localhost:8000/prune:
   {
     "query": "<issue description>",
     "chunks": ["<file1 content>", "<file2 content>", ...]
   }
3. swe-pruner runs code-pruner inference on each chunk
4. Server returns pruned subset (irrelevant chunks removed)
5. Agent uses pruned context for the actual LLM call
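
Step 2 above can be sketched with the Python standard library alone. The payload keys ("query", "chunks") follow the shape shown in the steps; the response schema is not specified in this document, so the sketch returns the decoded JSON unchanged rather than assuming any field names. The helper names are hypothetical.

```python
import json
import urllib.request

PRUNER_URL = "http://localhost:8000/prune"  # default port used in the setup commands


def build_prune_request(query, chunks, url=PRUNER_URL):
    """Build the POST request carrying the query and candidate chunks."""
    body = json.dumps({"query": query, "chunks": list(chunks)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prune(query, chunks, url=PRUNER_URL, timeout=30.0):
    """Send the request and return the decoded JSON response as-is.

    No response keys are assumed; inspect the result against the README.
    """
    request = build_prune_request(query, chunks, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.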

Phases

Phase	What Happens	Artifact
Install	`pip install swe-pruner`	CLI available
Model download	`huggingface-cli download`	`./model/` directory
Server start	`swe-pruner --model-path ./model --port 8000`	FastAPI server on :8000
Prune call	Agent POSTs query + chunks	Pruned context JSON
LLM call	Agent uses pruned context	Normal LLM response

Approval Gates

None.

Feedback Loop

None. swe-pruner has no outcome recording, no RL loop, and no session memory. Pruning quality depends entirely on the fine-tuned model weights.

Spec Format

None. This is not a spec-driven workflow.

Memory Context

swe-pruner — Memory & Context

Memory Model

None. swe-pruner has no persistent memory, no vault, no SQLite database, and no cross-session state. Each pruning call is independent.

Context Compression Mechanism

Neural Relevance Pruning

The core mechanism is a learned relevance classifier rather than a heuristic, which sets it apart from every other framework in this batch:

Input: A query (issue description, task) + a list of code chunks (retrieved by BM25, vector search, or any means)
Model: code-pruner (0.6B fine-tuned transformer, ayanami-kitasan/code-pruner)
Output: A subset of the input chunks judged relevant to the query
Method: Each chunk is scored independently for relevance to the query; chunks below threshold are discarded
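
The input/output contract above can be sketched as a per-chunk threshold filter. This is only an illustration of the control flow: the scorer below is a word-overlap placeholder standing in for the neural model so the sketch runs without a GPU, and the threshold value is an arbitrary assumption, not a documented default.

```python
import re
from typing import Callable, List, Sequence


def prune_by_threshold(
    query: str,
    chunks: Sequence[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Score each chunk independently and keep those at or above the threshold."""
    return [chunk for chunk in chunks if score(query, chunk) >= threshold]


def word_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer (NOT the code-pruner model).

    Returns the fraction of distinct query words that also occur in the chunk.
    """
    query_words = set(re.findall(r"[a-z0-9_]+", query.lower()))
    if not query_words:
        return 0.0
    chunk_words = set(re.findall(r"[a-z0-9_]+", chunk.lower()))
    return len(query_words & chunk_words) / len(query_words)


chunks = [
    "def parse_date(text): return datetime.strptime(text, FMT)",
    "class Logger: pass",
]
kept = prune_by_threshold("parse_date fails on text", chunks, word_overlap_score, 0.25)
print(kept)  # only the parse_date chunk survives
```

Because each chunk is scored on its own, the number of chunks kept depends on the query and the threshold, not on a fixed output size.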

No Token Budget Solver

Unlike entroly (knapsack DP) or lean-ctx (budget-aware selection), swe-pruner has no explicit token budget constraint. It prunes to relevance, not to a target count.
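
The difference between the two selection styles can be illustrated on pre-scored chunks. This sketch is not how entroly, lean-ctx, or swe-pruner is implemented: the budget variant is a simple greedy stand-in (not knapsack DP), and the scores and whitespace token counter are made up for the example.

```python
def select_by_threshold(scored, threshold):
    """Relevance-only selection: output size is whatever passes the threshold."""
    return [chunk for chunk, score in scored if score >= threshold]


def select_by_budget(scored, budget_tokens, count_tokens):
    """Budget-aware selection: greedily take the highest-scored chunks that fit."""
    kept, used = [], 0
    for chunk, _score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        cost = count_tokens(chunk)
        if used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
    return kept


def whitespace_tokens(text):
    """Crude token counter used only for this illustration."""
    return len(text.split())


scored = [("a b c", 0.9), ("d e", 0.8), ("f g h i", 0.2)]
print(select_by_threshold(scored, 0.5))                  # ['a b c', 'd e']
print(select_by_budget(scored, 4, whitespace_tokens))    # ['a b c']
```

With a threshold, two relevant chunks totalling five tokens are kept; with a four-token budget, the second relevant chunk is dropped because it no longer fits.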

Claimed Reduction Numbers

"23-54% token reduction on SWE-Bench Verified" (verbatim from README)
"up to 14.84x compression on LongCodeQA" (verbatim from README)
"Make Claude Tokens 40% Saving!" (badge, verbatim)

Context Compaction

swe-pruner does not implement PreCompact hooks or session compaction. It is a preprocessing filter, not a context management layer.

Cross-Session Handoff

None. No CCP, no vault beliefs, no state files.

Comparison: Neural vs. Heuristic Pruning

Approach	Method	Example Framework
Neural (swe-pruner)	Fine-tuned 0.6B model per-chunk relevance	swe-pruner
Knapsack DP	0/1 DP on entropy scores + budget	entroly
BM25+vector hybrid	Term frequency + embeddings	lean-ctx, symdex
Entropy scoring	Shannon entropy → keep high-entropy	entroly (within knapsack)
No compression	Just store/retrieve	ccmemory, basic-memory

Orchestration

swe-pruner — Orchestration

Orchestration Pattern

None. swe-pruner is not an orchestrator. It is a stateless inference service that performs one task: prune a context payload.

Multi-Agent Support

Not applicable. swe-pruner has no concept of agent roles, sub-agents, or task delegation. Multiple agents could independently call the /prune endpoint, but swe-pruner itself does not coordinate them.

Isolation Mechanism

None. The FastAPI server runs in a single process. There is no sandbox, container enforcement, or permission model.

Execution Mode

Server: The pruner runs as a persistent FastAPI process that agents call via HTTP. Agents are not modified; they add a pre-processing HTTP call before their LLM call.

Approval Gates

None.

Self-Improvement

None. The code-pruner model weights are fixed at inference time. There is no feedback loop or online learning.

Integration Pattern

swe-pruner acts as a middleware filter in the agent's retrieval-to-LLM pipeline:

Retrieval (BM25/vector) → swe-pruner /prune → LLM call

This is orthogonal to orchestration frameworks. It can be dropped into any agent architecture that has an HTTP call before the LLM step.
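
The middleware position described above can be sketched as a three-stage function whose stages are passed in as callables. All names here are hypothetical; in a real agent, `prune` would wrap an HTTP call to /prune and `call_llm` would wrap the model client. The prompt layout is an assumption made for the example.

```python
from typing import Callable, List


def answer_with_pruning(
    query: str,
    retrieve: Callable[[str], List[str]],
    prune: Callable[[str, List[str]], List[str]],
    call_llm: Callable[[str], str],
) -> str:
    """Run retrieval, then pruning, then the LLM call, in that order."""
    candidates = retrieve(query)      # e.g. BM25 or vector search
    kept = prune(query, candidates)   # e.g. a POST to the /prune endpoint
    prompt = "\n\n".join(kept) + "\n\nTask: " + query
    return call_llm(prompt)


# Stub stages so the sketch runs without a server or an LLM.
result = answer_with_pruning(
    "fix date parsing",
    retrieve=lambda q: ["def parse_date(): ...", "class Logger: ..."],
    prune=lambda q, chunks: [c for c in chunks if "date" in c],
    call_llm=lambda prompt: f"LLM saw {len(prompt)} characters",
)
print(result)
```

Passing the stages as arguments keeps the agent code unchanged when the pruner is swapped out or disabled (for example, by passing a function that returns the chunks untouched).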

UI CLI Surface

swe-pruner — UI & CLI Surface

CLI Binary

Name: swe-pruner
Source: swe_pruner.online_serving:main (pyproject.toml scripts)
Subcommands: None — single command starts the server

swe-pruner --model-path ./model --port 8000

Arguments

Flag	Default	Description
`--model-path`	required	Path to downloaded code-pruner model
`--port`	8000	FastAPI server port

Local UI

None. No browser dashboard, no TUI.

API Surface

Endpoint	Method	Description
`/prune`	POST	Prune context chunks for a query
`/health`	GET	Server health check

Transport

HTTP only (FastAPI/uvicorn). No MCP, no stdio, no WebSocket.

HuggingFace Demo

Model available at: https://huggingface.co/ayanami-kitasan/code-pruner

No interactive HuggingFace Space demo documented.

Observability

None beyond standard uvicorn access logs.
