Context Forge - Token Compression Benchmark

Headline findings

The best strategy on the frontier for each task family, picked for the strongest quality at ≥40% tokens saved.

The Pareto frontier

Each point is a (strategy × ratio) configuration. Up and to the right is better: more tokens saved, more quality retained. Bubble size encodes compression latency. Ringed points are Pareto-optimal - nothing beats them on all three axes.

Task

Tokenizer

◯ ringed = on Pareto frontier

Quality is a task-specific retention score (answer-span recall for QA, reference-keyword coverage for summarization, tail-trace coverage for agents) - a compression-preservation proxy, not a model-graded answer score. See Method.

Strategy scorecard

Averaged across every task, tokenizer, and ratio in the run. Latency is the median per-call compression time.

All configurations

Mean metrics for the current task/tokenizer filter. Highlighted rows are on the Pareto frontier.

Task	Tokenizer	Strategy	Ratio	Tokens saved	Quality	Latency	N

Learned token-droppability classifier

A small logistic-regression model that predicts, per token, whether it can be dropped - the seed of a learned compressor. Trained on weak labels from target overlap, with character n-gram + positional features.

Method & reproducibility

No synthetic data, no baked-in numbers. The static site renders whatever the last run wrote to public/results/.

Data (real, public)

RAG QA - SQuAD validation: question + context + gold answer span.
Agent traces - hermes-agent-reasoning-traces: multi-turn tool-use trajectories.
Long-context summarization - GovReport: long US gov. reports + expert summaries.

Strategies

hard_prompt_pruning - head/tail token-budget truncation.
embedding_chunk_drop - TF-IDF/cosine chunk relevance dropping vs. the query.
kv_cache_eviction - prefix + recent-suffix + salient-middle proxy.
llmlingua - optional wrap of Microsoft LLMLingua-2 (extra dep).

Metrics

Tokens saved - measured per tokenizer (tiktoken / HF).
Quality retained - task-specific retention proxy in [0,1].
Latency - median wall-clock per compression call.
Frontier - non-dominated points per (task, tokenizer).

Reproduce

pip install -e .
compress-bench run --limit-per-task 12
compress-bench train-classifier
compress-bench plots