Benchmarks

Every default in RedHop traces to a measured finding. This page collects the headline numbers. Each row links to a finding with the full hypothesis, setup, metrics (CIs where we have them), and a command to reproduce it.

How we measured

Every benchmark below states exactly what was run (dataset, retrieval, budget, metric) so you can compare like-for-like and draw your own conclusions.

What the metric means. Gold evidence is the dataset’s annotated answer span (CUAD clause spans, HotpotQA supporting sentences). Word-recall is the fraction of that span’s words present in the assembled context, a lexical retention proxy, robust to a clause being split across chunks. “≥0.8 on 88%” means 88% of queries had a context whose word-recall was at least 0.8. This is not answer quality. That’s the separate gpt-4o-mini F1/EM tier.
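To make the metric concrete, here is a minimal sketch of gold-span word-recall and the "≥0.8 on N%" aggregation. The tokenizer (lowercased alphanumeric runs) and the helper names `words`, `word_recall`, and `share_at_least` are assumptions for illustration, not the benchmark's actual code.

```python
import re

def words(text: str) -> list[str]:
    """Lowercased alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_recall(gold_span: str, context: str) -> float:
    """Fraction of the gold span's words that appear anywhere in the context."""
    gold = words(gold_span)
    if not gold:
        return 0.0
    present = set(words(context))
    return sum(1 for w in gold if w in present) / len(gold)

def share_at_least(recalls: list[float], threshold: float = 0.8) -> float:
    """Share of queries whose word-recall meets the threshold."""
    return sum(1 for r in recalls if r >= threshold) / len(recalls)

gold = "Either party may terminate this Agreement upon thirty days notice"

# The clause is split across two chunks; the context still contains every word.
split_context = "Either party may terminate this Agreement. ... upon thirty days written notice."
# Only the first half of the clause survived pruning.
partial_context = "Either party may terminate this Agreement."

full = word_recall(gold, split_context)
partial = word_recall(gold, partial_context)
```

Because the check is a bag-of-words lookup over the whole assembled context, a clause split across chunks still scores full recall, which is the robustness property described above.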

First-stage vs end-to-end. Retrieval ceiling = top-k candidates, no pruning (the most retrieval could keep). End-to-end = candidates + Auto prune to budget (what the product actually serves). The headline −80% / 88% are end-to-end, not the ceiling.
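The ceiling/end-to-end distinction can be sketched as a budgeted cut over ranked candidates. This is only an illustration under stated assumptions: the greedy policy, the `prune_to_budget` name, and the `(score, token_count, text)` tuple shape are hypothetical, and the real `Auto` strategy is not described on this page.

```python
def prune_to_budget(candidates, budget):
    """Keep all candidates if they fit; otherwise keep the best-scored ones that fit.

    candidates: list of (score, token_count, text) tuples in any order.
    Returns (kept, pruned_flag).
    """
    total = sum(tokens for _, tokens, _ in candidates)
    if total <= budget:
        # Size gate: nothing to prune, the ceiling and end-to-end contexts coincide.
        return list(candidates), False
    kept, used = [], 0
    for cand in sorted(candidates, key=lambda c: -c[0]):
        if used + cand[1] <= budget:
            kept.append(cand)
            used += cand[1]
    return kept, True

candidates = [(0.9, 120, "a"), (0.5, 100, "b"), (0.7, 128, "c")]
kept_small, pruned_small = prune_to_budget(candidates, budget=256)
kept_large, pruned_large = prune_to_budget(candidates, budget=400)
```

Scoring the unpruned candidate list gives the retrieval ceiling; scoring `kept` gives the end-to-end number.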

setting	CUAD document runtime	Framework comparison
dataset	CUAD v1 sample: 50 contracts, 644 clause queries, ~9.3k tok each	CUAD + HotpotQA
retrieval	BM25, lexical only (no embeddings)	BM25 for all three (isolates assembly, not the engine)
chunking	default	RedHop at 128-tok default
token budget	2,048 tok (`candidate_k`=20, `strategy=auto`)	CUAD 2,000 tok · HotpotQA 400 tok
retention metric	gold-span word-recall, no LLM, deterministic	same, n=300
answer metric	—	gpt-4o-mini, SQuAD-style F1/EM, n=150
latency / RAM	warm in-memory BM25 index (lazy-built on first query), single local machine, CPU-only	not measured

Scope of the numbers. (1) Latency is single-machine, CPU-only, warm-index. The exact CPU isn’t published yet, so read the shape (sub-2ms, flat in doc size) rather than the absolute milliseconds. (2) Retention is a lexical proxy. Downstream answer quality is the separate gpt-4o-mini F1/EM tier below.

On a real contract (the `Document` runtime)

RedHop’s product path, `Document.from_text(text).context(query)`, run on 50 real CUAD contracts / 644 clause queries (BM25, budget 2,048 tok, `Auto`). The run is free, local, and deterministic. Numbers are end-to-end (after `Auto` pruning):

metric	value
token reduction (end-to-end)	−80% (9,322 → 1,909 avg)
gold evidence retained, ≥0.8 word-recall (end-to-end)	88% of queries (≥0.5: 96%)
retrieval ceiling, ≥0.8 (top-20, no prune)	98% (0.99 mean recall)
per-query latency (warm in-memory, single CPU)	p50 1.7ms / p95 3.3ms
chunk + index a whole contract	p50 1.0ms / p95 4.5ms
`Auto` decision	pruned 608 / 644 (94%)

That 80% token cut keeps gold evidence in 88% of queries against a 98% retrieval ceiling, a deliberate trade of ~10 points of retention for ~5× fewer tokens. Latency is flat in document size: going from 6.7k to 467k tokens barely moves per-query time (~1.8ms), because BM25 lookup cost is largely independent of corpus size. A 467k-token document indexes in <40ms at ~90MB peak RSS.
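As a quick check, the headline figures can be recomputed from the table values. The variable names are illustrative, and the inputs are the rounded numbers published in the table above.

```python
avg_tokens_before = 9_322
avg_tokens_after = 1_909

reduction = 1 - avg_tokens_after / avg_tokens_before   # share of tokens removed
compression = avg_tokens_before / avg_tokens_after     # "N times fewer tokens"

ceiling, end_to_end = 0.98, 0.88                       # share of queries at >=0.8 word-recall
retention_gap_points = (ceiling - end_to_end) * 100    # cost of pruning, in points

pruned_share = 608 / 644                               # queries where Auto pruned

print(f"token reduction: {reduction:.1%}")
print(f"compression:     {compression:.1f}x")
print(f"retention gap:   {retention_gap_points:.0f} points")
print(f"Auto pruned:     {pruned_share:.0%}")
```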

Reproduce: `cargo run -p redhop-examples --example eval_cuad_documents --release`

Head-to-head vs LangChain & LlamaIndex

Same documents, BM25 for all three (comparing context assembly, not retrieval engines), per-dataset budget below.

Figure: evidence retention for RedHop vs LangChain vs LlamaIndex on HotpotQA multi-hop and CUAD contracts. RedHop leads multi-hop at 80%. On CUAD contracts a fourth, striped bar shows RedHop reaching 90.7% with the Stripper + workload-curated Vocabulary chain (the same preprocessing is not applied to LlamaIndex), beating LlamaIndex's 86% by 4.7 points.

Evidence retention: share of queries with gold-span word-recall ≥0.8, no LLM, n=300 (CUAD budget 2,000 tok · HotpotQA 400 tok, latest rerun 2026-06-06). The RedHop column is its best strategy variant (`raw_topk`). `reasoning_preserving` is within a few points (CUAD 77%, HotpotQA 77%), so strategy barely moves retention:

dataset	RedHop	LangChain	LlamaIndex
HotpotQA (multi-hop)	80%	71%	72%
CUAD (raw 24-word template)	81.3%	73%	86%
CUAD (template-stripped query)†	87.7%	—	—
CUAD (Stripper + Vocabulary)‡	90.7%	—	—

† Same RedHop runtime, with Stripper(boilerplate) (a compiled, token-level rewrite that drops CUAD’s fixed 24-word template before retrieval), keeping the quoted clause name + Details: elaboration that carry the signal. We did not apply the same preprocessor to LangChain/LlamaIndex, so the striped bar in the chart is not apples-to-apples with theirs. The mechanism (BM25 boilerplate dilution) and recipe live in CUAD_RECALL_GAP.

‡ Same RedHop runtime, with Stripper(boilerplate) plus a hand-curated Vocabulary({...}) dictionary (34 keys, 121 syns) compiled once and run through doc.context_with_rewrites(query, [stripper, vocab]). High-IDF discriminators (e.g., Change of Control → merger, successor, acquisition) selectively raise the BM25 score of the gold-bearing chunk. Mechanism is the opposite of unweighted PRF (which adds low-IDF corpus boilerplate and was falsified, see CUAD_PRF_NULL). Full worked example, dict, and 4-arm probe in CUAD_CLAUSE_EXPANSION. This row reflects the detect → compile → run-through-rewrites → A/B workflow at its Pareto-optimal point on CUAD.
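The IDF contrast behind this can be illustrated with a toy corpus. The sketch below uses the common Okapi/Lucene-style BM25 IDF formula and made-up chunks; neither is taken from RedHop's implementation, and `idf` is a hypothetical helper.

```python
import math

def idf(term, chunks):
    """BM25-style inverse document frequency (Okapi/Lucene variant, assumed here)."""
    n = len(chunks)
    df = sum(1 for chunk in chunks if term in chunk)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

# Toy chunks as word sets: "contract" is everywhere, "merger" appears once.
chunks = [
    {"this", "contract", "shall", "terminate", "upon", "merger"},
    {"this", "contract", "shall", "renew", "annually"},
    {"this", "contract", "governs", "payment", "terms"},
    {"this", "contract", "shall", "be", "governed", "by", "law"},
]

boilerplate_weight = idf("contract", chunks)   # low: the term is in every chunk
discriminator_weight = idf("merger", chunks)   # high: the term is in one chunk
```

A term that appears in every chunk contributes almost nothing to ranking, while a rare discriminator moves the chunk that contains it, which is why curated high-IDF synonyms and corpus-wide boilerplate behave so differently.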

Does this mean you have to write a stripper? Only if your queries follow a fixed template (legal QA, support-ticket triage, form-filled queries from a UI). For variable natural-language queries it’s a no-op. RedHop ships the primitives that compose the full workflow at the public API surface: `analyze_query_set` detects the pattern on your data, `Stripper(...)` is the compiled token-level boilerplate removal, `Vocabulary({...})` is the compiled workload-curated synonyms, and `doc.context_with_rewrites(...)` runs them as a chain with an audit trail on `ctx.report.query_rewrites`. `evaluate` then scores the lift deterministically against your gold sample, no LLM judge required (see EVALUATE_API). A cross-workload probe ruled out false positives on HotpotQA + MuSiQue (see QUERY_SET_ANALYZER). The decision rule and a runnable recipe are on the Choosing a configuration page → “Templated queries with heavy boilerplate”.
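The idea of a rewrite chain with an audit trail can be sketched in plain Python. This is a conceptual stand-in, not the RedHop API: `tokenize`, `strip_phrases`, `expand`, and `run_chain` are hypothetical helpers, and the template phrases are illustrative CUAD-style wording rather than the exact 24-word template. Only the Change of Control synonyms come from the text above.

```python
import re

def tokenize(text):
    """Lowercased alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def strip_phrases(tokens, phrases):
    """Remove every occurrence of each template phrase (a token sequence)."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if n and tokens[i:i + n] == phrase:
                i += n  # skip the whole template phrase
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

def expand(tokens, vocabulary):
    """Append curated synonyms for every vocabulary key found in the query."""
    padded = " " + " ".join(tokens) + " "
    extra = [syn for key, syns in vocabulary.items()
             if f" {key} " in padded for syn in syns]
    return tokens + extra

def run_chain(query, rewrites):
    """Apply rewrites in order, recording (name, before, after) for each step."""
    tokens, trail = tokenize(query), []
    for name, fn in rewrites:
        after = fn(tokens)
        trail.append((name, " ".join(tokens), " ".join(after)))
        tokens = after
    return tokens, trail

TEMPLATE = [
    tokenize("Highlight the parts (if any) of this contract related to"),
    tokenize("that should be reviewed by a lawyer"),
]
VOCAB = {"change of control": ["merger", "successor", "acquisition"]}

query = ('Highlight the parts (if any) of this contract related to '
         '"Change of Control" that should be reviewed by a lawyer.')
tokens, trail = run_chain(query, [
    ("stripper", lambda t: strip_phrases(t, TEMPLATE)),
    ("vocabulary", lambda t: expand(t, VOCAB)),
])
```

Each `trail` entry shows what one step changed, so a surprising retrieval result can be traced back to the rewrite that caused it.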

The one-knob alternative. If you’d rather not write a stripper or maintain a dict, retrieval="hybrid" recovers most of the lift automatically: +5.3 points on the raw CUAD template query at ~10ms/query. But, measured on this same workload, Stripper + Vocabulary on default BM25 still beats hybrid + cross-encoder (90.7% / 2.5ms vs 89.0% / 683ms): the two paths are substitutes, not complements. See CUAD_HYBRID_RERANK for the 6-arm probe and the “pick one, don’t combine” rule.

Answer quality: gpt-4o-mini, SQuAD-style F1 / EM, n=150:

dataset	RedHop	LangChain	LlamaIndex
HotpotQA	0.51 / 0.41	0.50 / 0.39	0.50 / 0.42
CUAD	0.34 / 0.17	0.25 / 0.11	0.35 / 0.16
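For readers unfamiliar with the answer metric, the sketch below follows the normalization convention of the SQuAD v1.1 evaluation script (lowercase, strip punctuation, drop articles, collapse whitespace). Whether `bench/tier3.py` normalizes in exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

same = (f1("the Eiffel Tower", "Eiffel Tower"), exact_match("the Eiffel Tower", "Eiffel Tower"))
partial = (f1("Paris, France", "Paris"), exact_match("Paris, France", "Paris"))
```

F1 gives partial credit for overlapping tokens, while EM is all-or-nothing, which is why the EM column in the table sits below F1.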

Results. On multi-hop (HotpotQA), RedHop leads on both retention (80%) and answer F1 (0.51). On contract extraction (CUAD), LlamaIndex edges ahead on retention with the raw 24-word template query (LlamaIndex 86% vs RedHop 81.3% at ≥0.8). The gap is mechanism-known (BM25 boilerplate dilution). `Stripper(boilerplate)` on RedHop lifts ≥0.8 retention to 87.7%, and adding a hand-authored 34-key clause-name `Vocabulary` reaches 90.7%.

Fair-preprocessing footnote (n=300, 2026-06-08). Applying the same Stripper to every system’s query lifts every system: LlamaIndex 86% → 94%, RedHop 82% → 88%, LangChain 73% → 79%. LlamaIndex actually benefits more from the same Stripper than RedHop does. The 90.7% RedHop number adds Vocabulary on top of Stripper, but that Vocabulary recipe was not applied to LlamaIndex, and given LlamaIndex’s bigger lift from the Stripper step, an unmeasured-but-likely outcome is that LlamaIndex with the same Vocabulary would match or beat 90.7%. The CUAD recipe’s value to a RedHop user is the reproducible in-process workflow with audit trail and Decision Report, not an architectural retrieval advantage. Reproduce both arms with bench/.venv/bin/python bench/compare.py. See CUAD_RECALL_GAP and CUAD_CLAUSE_EXPANSION for the three-arm RedHop run.

On CUAD answer F1, RedHop and LlamaIndex are level (0.34 vs 0.35), and both lead LangChain (0.25). Across the board the three are close on answer quality, so weigh the scenario that matches your workload. Full breakdown on the comparison page.

Reproduce: `bench/.venv/bin/python bench/compare.py` (retention), `bench/.venv/bin/python bench/tier3.py --n 150` (answers).

The semantic tier: global dense (`retrieval="semantic"`)

The benchmarks above isolate assembly on a BM25 engine. This measures what the dense tier buys on queries BM25 misses. Dense embeds every chunk once (cached) and cosines the query against all of them: exact, no ANN, no vector index.
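A minimal sketch of exact (brute-force) cosine retrieval follows. The vectors are hand-written stand-ins for embeddings that would be computed once and cached, and `cosine` and `top_k` are illustrative names rather than RedHop functions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Score the query against every cached chunk vector and return the best k indices."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, stable on ties
    return [idx for _, idx in scored[:k]]

# Stand-in embeddings: computed once at setup, reused for every query.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.9, 0.1, 0.0]
best = top_k(query_vec, chunk_vecs, k=2)
```

Every chunk is scored on every query, so results are exact; the trade-off is a linear scan per query rather than an approximate index.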

Recall on natural data, on the global HotpotQA pool (3,957 paragraphs, n=400):

recall@3	BM25	dense
semantic-heavy	0.49	0.80
all queries	0.59	0.80

Answers (gpt-4o-mini F1): semantic-heavy 0.27 → 0.50, all 0.37 → 0.54.
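One way to compute recall@k is sketched below. It assumes recall@k means the fraction of a query's gold paragraphs found in the top k, averaged over queries; the benchmark script may instead use a hit-rate definition, and the paragraph ids are made up.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold ids that appear among the top-k ranked ids."""
    if not gold_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return len(top & set(gold_ids)) / len(gold_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, gold_ids) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

runs = [
    (["p7", "p2", "p9", "p4"], {"p2", "p4"}),  # one of two gold paragraphs in the top 3
    (["p1", "p3", "p5", "p8"], {"p1"}),        # gold at rank 1
    (["p6", "p0", "p2", "p5"], {"p5"}),        # gold at rank 4: a miss at k=3
]
at_3 = mean_recall_at_k(runs, 3)
at_4 = mean_recall_at_k(runs, 4)
```

With a single gold item per query this reduces to a plain hit rate.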

Where it really separates from BM25 is a controlled semantic-mismatch probe (engineered low-overlap answers + lexical traps, n=25), where the answer shares no terms with the query:

recall@1	BM25	dense
overall	20%	88% (96% @3)

BM25 can’t find what shares no words, whereas `semantic` (global dense) scores every chunk by meaning. On the lexical-overlap control slice the two tie, so dense doesn’t hurt the easy case. Note that the `hybrid` tier (BM25-prune → rerank) lands at 32% on this adversarial probe (capped by BM25’s pool), so when you want every paraphrase caught, prefer `semantic`. The value of `hybrid` lies in the opposite regime: a whole folder of files, where it gives semantic ranking that scales, with no vector DB.

Latency (CUAD contracts, setup = embed-all once, warm = per-query):

corpus	dense setup	dense warm/query
~13k tokens (1 contract)	~2s	~6ms
~38k tokens (5 contracts)	~7s	~6ms
~189k tokens (15 contracts)	~17s	~6ms

Per query, dense is ~6ms: the query embedding dominates and exact cosine over the cached vectors is fast. The cost is the one-time embed-everything step at setup. `lexical` stays the default because most queries don’t need a model at all.
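The setup/warm split can be measured with a harness along these lines. It is a sketch, not `bench/speed_compare.py`: the nearest-rank percentile is one common convention (the benchmark's own method isn't stated here), and the dictionary workload is a stand-in for embedding and querying.

```python
import math
import time

def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def measure(setup, query, queries, warmup=1):
    """Time one-off setup separately from warm per-query calls."""
    start = time.perf_counter()
    state = setup()
    setup_seconds = time.perf_counter() - start

    for q in queries[:warmup]:  # discard cold calls so later timings are warm
        query(state, q)

    per_query_ms = []
    for q in queries:
        start = time.perf_counter()
        query(state, q)
        per_query_ms.append((time.perf_counter() - start) * 1000)
    return setup_seconds, per_query_ms

# Stand-in workload: "setup" builds a lookup table, "query" reads from it.
setup_s, latencies = measure(
    setup=lambda: {i: i * i for i in range(10_000)},
    query=lambda table, q: table[q],
    queries=list(range(200)),
)
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
```

Reporting setup and warm latency separately keeps a large one-time cost from being averaged into, or hidden behind, the per-query number.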

Reproduce: `bench/.venv/bin/python bench/semantic_modes.py` (recall), `bench/.venv/bin/python bench/speed_compare.py` (latency).

Code retrieval (type-aware indexing)

Indexing is type-aware: code files are kept verbatim and, under hybrid, routed to lexical retrieval (exact identifiers matter, and general embedders are weak on code), while prose gets the dense rerank. To check it, we index RedHop’s own Rust source (2,469 chunks) and ask natural-language questions whose answer is a specific function. Recall@3:

mode	recall@3
`lexical` (BM25)	91%
`hybrid` (type-aware)	83%
`semantic` (dense over everything)	75%

Two things stand out. First, lexical leads and dense-over-everything trails, so for code, keyword retrieval is the right default. Second, the type-aware routing helps measurably: in an A/B, sending code to BM25 (rather than embedding it and cosine-reranking) lifted the hybrid tier from 66% → 83% recall@3. The prior behavior was reordering correct lexical hits with noisy code embeddings.
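The routing rule can be sketched as a small dispatch function. Detecting code by file extension, the extension list, and the `route` name are all assumptions for illustration; how RedHop actually classifies files is not described here.

```python
from pathlib import Path

# Assumed, non-exhaustive list of extensions treated as code.
CODE_EXTENSIONS = {".rs", ".py", ".js", ".ts", ".go", ".java", ".c", ".cpp"}

def route(path: str, retrieval: str) -> str:
    """Return the ranking path a file's chunks would take under a retrieval mode."""
    if retrieval != "hybrid":
        # "lexical" and "semantic" treat every file the same way.
        return retrieval
    is_code = Path(path).suffix.lower() in CODE_EXTENSIONS
    return "lexical" if is_code else "dense_rerank"

routes = {
    "src/lib.rs": route("src/lib.rs", "hybrid"),
    "docs/guide.md": route("docs/guide.md", "hybrid"),
}
```

Under this sketch only `hybrid` is type-aware, which matches the A/B above: code chunks keep their BM25 order instead of being reordered by embeddings.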

Scope: a hand-built probe on our own source (n=12), so read the direction (and the A/B delta), not the absolute percentage. It isn’t a standardized code-search benchmark. Reproduce: `bench/.venv/bin/python bench/code_retrieval.py`.

Speed

Speed and latency now have their own page. See the Speed page for setup time, warm per-query latency, and how it scales to thousands of pages.

The core findings

Finding	Status	Headline
Second-hop tax	Confirmed (n=1327, CIs)	Every relevance-based selection taxes the multi-hop second hop. A 0.30 filter keeps only 44% of second hops.
Reasoning preservation	Confirmed (4 models, n=300)	Aggressive filtering is net-harmful on all 4 models (−0.06 to −0.15). The rescued subset gains +0.15 to +0.23.
Context dilution	Confirmed (conditional) (3 models, n=200)	At ~30k-token contexts, stuffing-it-all-in collapses accuracy. Pruning recovers it where dilution bites (gpt-4o-mini +0.21), null on dilution-robust models.
CUAD contracts	Confirmed (50 contracts, 644 q)	−80% tokens with gold evidence retained (≥0.8 on 88%) at ~1.7ms/query.
Chunk granularity	Confirmed (sweep vs LC/LI)	Granularity, not strategy, is the lever: 256→128 lifts multi-hop ≥0.8 retention 54%→77%.
Framework comparison	Measured (n=150 answers · n=300 retention, latest 2026-06-07)	Leads on multi-hop (80% retention, +8 over LlamaIndex, F1 0.51). The CUAD raw-template gap to LlamaIndex (81.3% vs 86%, 4.7 points) is mechanism-known (BM25 boilerplate dilution). The Stripper + Vocabulary workflow puts RedHop at 90.7% (+4.7 over LlamaIndex’s raw-query 86%). See CUAD_RECALL_GAP + CUAD_CLAUSE_EXPANSION.
Templated-workload detection (QUERY_SET_ANALYZER)	Confirmed (3 workloads × n=300)	`analyze_query_set` correctly flags templated CUAD-shape queries (share 0.66, fires) without false positives on diverse natural language (HotpotQA 0.00, MuSiQue 0.12). Ships in the public API across Rust / Python / Node.
Clause-name vocabulary (CUAD_CLAUSE_EXPANSION)	Confirmed (n=300, +3.0 over the template-stripped baseline)	`Vocabulary({...})` lifts CUAD ≥0.8 retention 87.7% → 90.7% with a 34-key clause-name → synonyms dict, run through `doc.context_with_rewrites(...)` with the audit trail on `ctx.report.query_rewrites`. Mechanism is the opposite of unweighted PRF (falsified, low-IDF re-injection). Same workload-specific discipline as `Stripper`: ship the mechanism, caller supplies the dict.
Chunk-side enrich falsification on CUAD (CUAD_ENRICH_DEFINITIONS_NULL)	Falsified, with measured regression (n=300, −2.0 pts vs the 90.7% workflow)	Tested whether `Vocabulary.enrich(...)` on auto-extracted per-contract Definitions sections lifts retention past the shipped workflow on a prose corpus. Regressed to 88.7%. On the 17 of 50 contracts with extractable Definitions, the affected subset dropped ~90.7% → ~67%. Confirms the VOCABULARY_ENRICH regime rule’s negative side: CUAD chunks are long prose, neither short nor opaque, outside the regime. The chunk-side parallel to CUAD_PRF_NULL: workload-pervasive vocabulary dilutes the term-IDF distribution.
Chunk-side enrich confirmed on schemas (SPIDER_ENRICH)	Confirmed (n=30, candidate_k=10)	The positive-side validation for `Vocabulary.enrich(...)`. On a Spider-shape schema-retrieval sample, curated chunk-side enrichment lifts mean column recall 0.77 → 0.97 (+0.19) and ≥0.8 retention from 63% → 93% (+30 pts). Auto-derived enrichment (cleaned name + type + table) lifts to 0.90 (+0.13). Workload-curated synonyms (`Age` → `"old young years"`, `Population` → `"people residents"`) add another +0.07. Same workload-curated discipline as CUAD_CLAUSE_EXPANSION on the query side, mirrored to the chunk side. Closes the VOCABULARY_ENRICH rule’s previously-unmeasured positive side.
In-process evaluation (EVALUATE_API)	Shipped (Rust + Python + Node, 10 / 11 / 9 tests)	`redhop.evaluate(query, ctx, gold)` returns recall / precision / answer-token recall + self-eval (mean_grounding, low_confidence, evidence_density, …), composed into a single `overall`. Zero LLM calls: uses the same primitives the runtime uses to make its Decision Report, so eval and runtime never disagree by construction. Closes the A/B step in the rewrites workflow.
Multilingual analyzer (MULTILINGUAL_ANALYZER)	Confirmed (5 languages)	`analyze_query_set` + `Stripper` work end-to-end on French / German / Spanish / Chinese / Japanese. CJK queries get phrase-segmented via punctuation 「」（）、。 instead of word-segmented. The token-level matcher in `Stripper` preserves Latin word-boundary safety across all scripts.
Hybrid + cross-encoder on CUAD (CUAD_HYBRID_RERANK)	Confirmed substitute, not stack (n=300, 6 arms)	`retrieval="hybrid"` and the Stripper + Vocabulary workflow are substitutes for boilerplate-induced lexical mismatch: they fix the same problem by different mechanisms and don’t compose. Hybrid+CE on CUAD maxes at 89.0% / 683ms. BM25 + Stripper + Vocabulary is Pareto-optimal at 90.7% / 2.5ms. Pick one.
Sub-IDF auto-drop (SUB_IDF_AUTO_DROP_NULL)	Falsified (3 workloads × 4 thresholds)	Auto-dropping low-IDF query tokens using corpus statistics doesn’t lift CUAD (+0.7 vs +6.4 from user-side stripping) and regresses HotpotQA −1.4 to −5.0 / MuSiQue −0.7 to −2.7. With CUAD_PRF_NULL, CUAD_ENRICH_DEFINITIONS_NULL, and SPIDER_ENRICH, contributes to the four-corner observation (not rule, n≤2 datasets per corner with author-curator overlap on the positive arms): workload-pervasive signal manipulation predictably fails on either side of the pipeline.
Semantic mismatch	Confirmed (conditional)	BM25 fails completely under paraphrase/synonymy (R@1 0%). Dense (BGE, exact cosine) recovers (R@1 68%). Naive RRF hybrid is worse than dense. Motivates lexical-first + a dense tier.
Global dense	Confirmed (semantic-mismatch probe, n=25)	Where the answer shares no terms with the query, BM25 (and any BM25-pruned approach) reaches only ≈ 20–32% recall@1. Global dense (`retrieval="semantic"`, exact cosine over all chunks, no ANN) hits 88% recall@1 / 96% recall@3 at ~6ms/query.
Embedding bake-off	Confirmed	A real embedder (BGE) lifts recall +99% vs hashing, when it’s in the action path.
Type-aware code retrieval	Measured (own source, n=12)	For code, lexical leads (91%) and dense-everything trails (75%). Routing code→BM25 under `hybrid` lifts recall@3 66%→83% vs embedding+reranking it.

How the evidence is structured

Every finding follows the same shape (Hypothesis → Status → Setup → Metrics → Failure cases → Interpretation → Caveats → What changed afterward) and ships with a reproduce command and a captured raw-output report. The defaults grounded in this evidence:

strategy="reasoning_preserving" ← the second-hop tax + reasoning-preservation findings
strategy="auto" as a size-gated dilution pruner ← the context-dilution finding
selective (not uniform) reranker escalation ← the reranking-limits finding

Scope & caveats

So you can weigh the numbers precisely:

Answer-quality tiers use gpt-4o-mini, one budget per dataset, two datasets.
The framework comparison isolates context assembly (BM25 for all three). The frameworks’ default vector retrievers are a separate question.
The CUAD contract numbers are evidence retention (word-recall), a proxy for, not the same as, downstream answer quality.
CUAD extraction F1 is low in absolute terms (it’s a hard task). The relative ranking across tools is the signal.

The full evidence, every finding including the falsified hypotheses, lives in the project’s evidence layer on GitHub.