Headline findings
The best strategy on the frontier for each task family, picked for the strongest quality at ≥40% tokens saved.
The Pareto frontier
Each point is a (strategy × ratio) configuration. Up and to the right is better: more tokens saved, more quality retained. Bubble size encodes compression latency. Ringed points are Pareto-optimal - nothing beats them on all three axes.
Quality is a task-specific retention score (answer-span recall for QA, reference-keyword coverage for summarization, tail-trace coverage for agents) - a compression-preservation proxy, not a model-graded answer score. See Method.
Strategy scorecard
Averaged across every task, tokenizer, and ratio in the run. Latency is the median per-call compression time.
All configurations
Mean metrics for the current task/tokenizer filter. Highlighted rows are on the Pareto frontier.
| Task | Tokenizer | Strategy | Ratio | Tokens saved | Quality | Latency | N |
|---|
Learned token-droppability classifier
A small logistic-regression model that predicts, per token, whether it can be dropped - the seed of a learned compressor. Trained on weak labels from target overlap, with character n-gram + positional features.
Method & reproducibility
No synthetic data, no baked-in numbers. The static site renders whatever the last run wrote to public/results/.
Data (real, public)
- RAG QA - SQuAD validation: question + context + gold answer span.
- Agent traces -
hermes-agent-reasoning-traces: multi-turn tool-use trajectories. - Long-context summarization - GovReport: long US gov. reports + expert summaries.
Strategies
- hard_prompt_pruning - head/tail token-budget truncation.
- embedding_chunk_drop - TF-IDF/cosine chunk relevance dropping vs. the query.
- kv_cache_eviction - prefix + recent-suffix + salient-middle proxy.
- llmlingua - optional wrap of Microsoft LLMLingua-2 (extra dep).
Metrics
- Tokens saved - measured per tokenizer (tiktoken / HF).
- Quality retained - task-specific retention proxy in [0,1].
- Latency - median wall-clock per compression call.
- Frontier - non-dominated points per (task, tokenizer).
Reproduce
pip install -e .compress-bench run --limit-per-task 12compress-bench train-classifiercompress-bench plots