◆ Open compression benchmark

Spend fewer tokens. Keep the answer.

Context Forge quantifies the cost/quality tradeoff of context-compression strategies on real public data - RAG QA, agent traces, and long-context summarization - and reports the Pareto frontier of tokens saved vs. quality retained vs. latency for the tokenizers production apps actually pay for: GPT-4o (o200k), GPT-4/3.5 (cl100k), and GPT-2. Every number is generated from a reproducible run; nothing is hard-coded.

Explore the frontier ↓ Read the code
-
Tokenizers
-
Strategies
-
Task families
-
Examples
-
Measurements

Headline findings

The best strategy on the frontier for each task family, picked for the strongest quality at ≥40% tokens saved.

The Pareto frontier

Each point is a (strategy × ratio) configuration. Up and to the right is better: more tokens saved, more quality retained. Bubble size encodes compression latency. Ringed points are Pareto-optimal - nothing beats them on all three axes.

◯ ringed = on Pareto frontier

Quality is a task-specific retention score (answer-span recall for QA, reference-keyword coverage for summarization, tail-trace coverage for agents) - a compression-preservation proxy, not a model-graded answer score. See Method.

Strategy scorecard

Averaged across every task, tokenizer, and ratio in the run. Latency is the median per-call compression time.

All configurations

Mean metrics for the current task/tokenizer filter. Highlighted rows are on the Pareto frontier.

Task Tokenizer Strategy Ratio Tokens saved Quality Latency N

Learned token-droppability classifier

A small logistic-regression model that predicts, per token, whether it can be dropped - the seed of a learned compressor. Trained on weak labels from target overlap, with character n-gram + positional features.

Method & reproducibility

No synthetic data, no baked-in numbers. The static site renders whatever the last run wrote to public/results/.

Data (real, public)

  • RAG QA - SQuAD validation: question + context + gold answer span.
  • Agent traces - hermes-agent-reasoning-traces: multi-turn tool-use trajectories.
  • Long-context summarization - GovReport: long US gov. reports + expert summaries.

Strategies

  • hard_prompt_pruning - head/tail token-budget truncation.
  • embedding_chunk_drop - TF-IDF/cosine chunk relevance dropping vs. the query.
  • kv_cache_eviction - prefix + recent-suffix + salient-middle proxy.
  • llmlingua - optional wrap of Microsoft LLMLingua-2 (extra dep).

Metrics

  • Tokens saved - measured per tokenizer (tiktoken / HF).
  • Quality retained - task-specific retention proxy in [0,1].
  • Latency - median wall-clock per compression call.
  • Frontier - non-dominated points per (task, tokenizer).

Reproduce

  • pip install -e .
  • compress-bench run --limit-per-task 12
  • compress-bench train-classifier
  • compress-bench plots