Benchmarks
Every default in RedHop traces to a measured finding. This page collects the headline numbers. Each row links to a finding with the full hypothesis, setup, metrics (CIs where we have them), and a command to reproduce it.
How we measured
Every benchmark below states exactly what was run (dataset, retrieval, budget, metric) so you can compare like-for-like and draw your own conclusions.
What the metric means. Gold evidence is the dataset’s annotated answer span (CUAD clause spans; HotpotQA supporting sentences). Word-recall is the fraction of that span’s words present in the assembled context — a lexical retention proxy, robust to a clause being split across chunks. “≥0.8 on 88%” means 88% of queries had a context whose word-recall was at least 0.8. This is not answer quality — that’s the separate gpt-4o-mini F1/EM tier.
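As a rough illustration of the metric described above, here is a minimal sketch of word-recall and the "≥0.8 on N%" aggregation. The tokenization (lowercased alphanumeric runs) and the function names are assumptions made for this example; the benchmark's own tokenizer is not specified on this page.

```python
import re


def words(text: str) -> list[str]:
    # Assumed tokenization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(gold_span: str, context: str) -> float:
    """Fraction of the gold span's words that also appear in the context."""
    gold = words(gold_span)
    if not gold:
        return 0.0
    present = set(words(context))
    return sum(1 for w in gold if w in present) / len(gold)


def share_at_least(recalls: list[float], threshold: float = 0.8) -> float:
    """Share of queries whose word-recall meets the threshold."""
    return sum(r >= threshold for r in recalls) / len(recalls)


# Toy example: the clause is split by other text, and one word is missing.
gold = "The Licensee shall pay a royalty of five percent"
context = "Licensee shall pay ... a royalty ... of five percent of net sales"
recall = word_recall(gold, context)  # 8 of the 9 gold words are present
```

Because the check is word-level rather than an exact substring match, a clause split across two chunks still scores highly as long as both chunks reach the context.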
First-stage vs end-to-end. Retrieval ceiling = top-k candidates, no pruning
(the most retrieval could keep). End-to-end = candidates + Auto prune to budget
(what the product actually serves). The headline −80% / 88% are end-to-end, not
the ceiling.
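To make the distinction concrete, the sketch below contrasts an unpruned candidate set (the retrieval ceiling) with a budget-pruned one (end-to-end). It is a generic greedy pruner written for illustration, not RedHop's Auto strategy; the whitespace token count and the function names are assumptions.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for the real tokenizer.
    return len(text.split())


def prune_to_budget(ranked_chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in rank order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
    return kept


# Ranked candidates from the first stage (toy data).
candidates = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]

ceiling_context = " ".join(candidates)             # retrieval ceiling: no pruning
served = prune_to_budget(candidates, budget=6)     # end-to-end: pruned to budget
served_context = " ".join(served)
```

Scoring word-recall on `ceiling_context` gives the ceiling; scoring it on `served_context` gives the end-to-end number, which can only be equal or lower.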
| | CUAD document runtime | Framework comparison |
|---|---|---|
| dataset | CUAD v1 sample — 50 contracts, 644 clause queries, ~9.3k tok each | CUAD + HotpotQA |
| retrieval | BM25, lexical only (no embeddings) | BM25 for all three (isolates assembly, not the engine) |
| chunking | default | RedHop at 128-tok default |
| token budget | 2,048 tok (candidate_k=20, strategy=auto) | CUAD 2,000 tok · HotpotQA 400 tok |
| retention metric | gold-span word-recall, no LLM, deterministic | same, n=300 |
| answer metric | — | gpt-4o-mini, SQuAD-style F1/EM, n=150 |
| latency / RAM | warm in-memory BM25 index (lazy-built on first query), single local machine, CPU-only | not measured |
Scope of the numbers. (1) Latency is single-machine, CPU-only, warm-index; the exact CPU isn’t published yet, so read the shape (sub-2ms, flat in doc size) rather than the absolute milliseconds. (2) Retention is a lexical proxy; downstream answer quality is the separate gpt-4o-mini F1/EM tier below.
On a real contract (the Document runtime)
RedHop’s product path, Document.from_text(text).context(query), run on 50
real CUAD contracts / 644 clause queries (BM25, budget 2,048 tok, Auto); free,
local, deterministic. Numbers are end-to-end (after Auto pruning):
| metric | value |
|---|---|
| token reduction (end-to-end) | −80% (9,322 → 1,909 avg) |
| gold evidence retained, ≥0.8 word-recall (end-to-end) | 88% of queries (≥0.5: 96%) |
| retrieval ceiling, ≥0.8 (top-20, no prune) | 98% (0.99 mean recall) |
| per-query latency (warm in-memory, single CPU) | p50 1.7ms / p95 3.3ms |
| chunk + index a whole contract | p50 1.0ms / p95 4.5ms |
| Auto decision | pruned 608 / 644 (94%) |
That 80% token cut keeps gold evidence in 88% of queries against a 98% retrieval ceiling: a deliberate trade of ~10 points of retention for ~5× fewer tokens. Latency is flat in document size: going from 6.7k to 467k tokens barely moves per-query time (~1.8ms), and a 467k-token document indexes in <40ms at ~90MB peak RSS, because BM25 lookup is independent of corpus size.
Reproduce: cargo run -p redhop-examples --example eval_cuad_documents --release
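For readers unfamiliar with the p50 / p95 notation in the table above, here is a minimal sketch of how warm per-query latency percentiles could be collected. It uses a nearest-rank percentile and a placeholder query function; the harness, the warm-up count, and all names are assumptions for illustration, not the benchmark's actual code.

```python
import math
import time


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def time_queries_ms(run_query, queries: list[str], warmup: int = 1) -> list[float]:
    """Time each query in milliseconds after a few untimed warm-up calls."""
    for q in queries[:warmup]:
        run_query(q)  # warm-up: lets a lazily built index exist before timing
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Placeholder workload standing in for a real query call.
latencies = time_queries_ms(lambda q: q.upper(), ["a", "b", "c"])
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
```

The warm-up calls matter here because the index is lazy-built on the first query, so timing that first call would fold index construction into the latency figure.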
Head-to-head vs LangChain & LlamaIndex
Same documents, BM25 for all three (comparing context assembly, not retrieval engines), per-dataset budget below.
Evidence retention: share of queries with gold-span word-recall ≥0.8, no LLM,
n=300 (CUAD budget 2,000 tok · HotpotQA 400 tok). The RedHop column is its best
strategy variant (raw_topk); reasoning_preserving trails by 1 point on HotpotQA (76%)
and 4 points on CUAD (78%), so strategy moves retention by only a few points:
| dataset | RedHop (best) | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA (multi-hop) | 77% | 71% | 72% |
| CUAD (contracts) | 82% | 73% | 86% |
Answer quality — gpt-4o-mini, SQuAD-style F1 / EM, n=150:
| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA | 0.51 / 0.41 | 0.50 / 0.39 | 0.50 / 0.42 |
| CUAD | 0.34 / 0.17 | 0.25 / 0.11 | 0.35 / 0.16 |
Results. On multi-hop (HotpotQA), RedHop leads on both retention (77%) and answer F1 (0.51). On contract extraction (CUAD), LlamaIndex edges ahead on retention and the two are level on F1; both lead LangChain. Across the board the three are close on answer quality — weigh the scenario that matches your workload. Full breakdown on the comparison page.
Reproduce: bench/.venv/bin/python bench/compare.py (retention),
bench/.venv/bin/python bench/tier3.py --n 150 (answers).
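For reference, "SQuAD-style F1 / EM" usually means the scoring below: both strings are normalized (lowercased, punctuation and English articles removed, whitespace collapsed), EM checks equality, and F1 is the harmonic mean of token precision and recall. This is a sketch of that common convention, assuming bench/tier3.py follows it; it is not copied from the benchmark code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit for answers that overlap the gold span, which helps explain why EM sits well below F1 on long contract clauses.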
The semantic tier: global dense (retrieval="semantic")
The benchmarks above isolate assembly on a BM25 engine. This measures what the dense tier buys on queries BM25 misses. Dense embeds every chunk once (cached) and scores the query against all of them by cosine similarity: exact, no ANN, no vector index.
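A minimal sketch of what exact (brute-force) cosine search over cached chunk vectors looks like. Toy 2-d vectors stand in for real embeddings, and the function names are illustrative assumptions rather than RedHop's API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_exact(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Score the query against every cached chunk vector and return the top-k indices."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, stable on ties
    return [idx for _, idx in scored[:k]]


# Chunk vectors are computed once and cached; only the query is embedded per call.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]
best = top_k_exact(query_vec, chunk_vecs, k=2)
```

Every chunk is scored on every query, so no candidate can be missed by an approximate index; the trade-off is that per-query work grows linearly with the number of chunks.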
Recall on natural data — global HotpotQA pool (3,957 paragraphs, n=400):
| recall@3 | BM25 | dense |
|---|---|---|
| semantic-heavy | 0.49 | 0.80 |
| all queries | 0.59 | 0.80 |
Answers (gpt-4o-mini F1): semantic-heavy 0.27 → 0.50; all 0.37 → 0.54.
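The recall@k figures can be read with the following sketch in mind. It assumes recall@k is the fraction of a query's gold paragraphs that appear in the top k results, averaged over queries; the exact definition used by bench/semantic_modes.py is not stated on this page, and the IDs are made up for the example.

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold items found among the top-k ranked items."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (ranking, gold set) pairs, one pair per query."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


runs = [
    (["p4", "p1", "p9", "p2"], {"p1", "p2"}),  # one of two gold paragraphs in the top 3
    (["p7", "p3", "p5"], {"p7"}),              # the single gold paragraph is ranked first
]
score = mean_recall_at_k(runs, k=3)
```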
Where it really separates from BM25 — a controlled semantic-mismatch probe (engineered low-overlap answers + lexical traps, n=25), where the answer shares no terms with the query:
| recall@1 | BM25 | dense |
|---|---|---|
| overall | 20% | 88% (96% @3) |
BM25 can’t find what shares no words; semantic (global dense) scores every chunk by
meaning. On the lexical-overlap control slice both tie — it doesn’t hurt the easy case.
Note the hybrid tier (BM25-prune → rerank) lands at 32% on this adversarial
probe — capped by BM25’s pool — so when you want every paraphrase caught, prefer
semantic. hybrid’s value is the opposite regime: a whole folder of files, where
it gives semantic ranking that scales, no vector DB.
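The "capped by BM25's pool" effect can be shown with a toy two-stage pipeline. In this sketch a simple shared-term count stands in for BM25 and the dense scores are hand-picked hypothetical values, so it illustrates the mechanism only, not RedHop's implementation.

```python
def lexical_score(query: str, chunk: str) -> int:
    # Stand-in for BM25: number of distinct terms shared with the query.
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def hybrid(query: str, chunks: list[str], dense_scores: list[float], pool_size: int) -> list[int]:
    """Stage 1: lexical prune to a pool. Stage 2: rerank only that pool by dense score."""
    order = sorted(range(len(chunks)), key=lambda i: (-lexical_score(query, chunks[i]), i))
    pool = order[:pool_size]
    return sorted(pool, key=lambda i: -dense_scores[i])


def semantic(dense_scores: list[float]) -> list[int]:
    """Global dense: rank every chunk by dense score, with no lexical gate."""
    return sorted(range(len(dense_scores)), key=lambda i: -dense_scores[i])


query = "termination notice period"
chunks = [
    "termination notice must be given in writing",
    "notice of renewal is sent annually",
    "either party may end the agreement with thirty days warning",  # the paraphrased answer
]
dense_scores = [0.55, 0.20, 0.90]  # hypothetical similarity scores

hybrid_ranking = hybrid(query, chunks, dense_scores, pool_size=2)
semantic_ranking = semantic(dense_scores)
```

The paraphrased answer shares no terms with the query, so it never enters the lexical pool and the reranker has no chance to promote it, while the global dense ranking puts it first.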
Latency (CUAD contracts; setup = embed-all once, warm = per-query):
| corpus | dense setup | dense warm/query |
|---|---|---|
| ~13k tokens (1 contract) | ~2s | ~6ms |
| ~38k tokens (5 contracts) | ~7s | ~6ms |
| ~189k tokens (15 contracts) | ~17s | ~6ms |
Per query, dense is ~6ms — the query embedding dominates and exact cosine over the
cached vectors is fast. The cost is the one-time embed-everything at setup. lexical
stays the default because most queries don’t need a model at all.
Reproduce: bench/.venv/bin/python bench/semantic_modes.py (recall),
bench/.venv/bin/python bench/speed_compare.py (latency).
Code retrieval (type-aware indexing)
Indexing is type-aware: code files are kept verbatim and, under hybrid, routed to
lexical retrieval (exact identifiers matter, and general embedders are weak on
code), while prose gets the dense rerank. To check it, we index RedHop’s own Rust
source (2,469 chunks) and ask natural-language questions whose answer is a specific
function — recall@3:
| mode | recall@3 |
|---|---|
| lexical (BM25) | 91% |
| hybrid (type-aware) | 83% |
| semantic (dense over everything) | 75% |
Two things stand out: lexical leads and dense-over-everything trails — so for code,
keyword retrieval is the right default. And the type-aware routing helps measurably:
in an A/B, sending code to BM25 (rather than embedding it and cosine-reranking) lifted
the hybrid tier from 66% → 83% recall@3 — the prior behavior was reordering
correct lexical hits with noisy code embeddings.
Scope: a hand-built probe on our own source (n=12), so read the direction (and
the A/B delta), not the absolute percentage — it isn’t a standardized code-search
benchmark. Reproduce: bench/.venv/bin/python bench/code_retrieval.py.
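The routing rule described in this section can be sketched as a small function that decides, per file, which scorer handles it under hybrid. The extension list, the return labels, and the function name are assumptions for illustration; how RedHop actually detects code files is not described on this page.

```python
from pathlib import PurePath

# Assumed set of code extensions, chosen for the example only.
CODE_SUFFIXES = {".rs", ".py", ".ts", ".go", ".java", ".c", ".cpp"}


def route_under_hybrid(path: str) -> str:
    """Return 'lexical' for code files and 'dense_rerank' for prose."""
    is_code = PurePath(path).suffix.lower() in CODE_SUFFIXES
    return "lexical" if is_code else "dense_rerank"


routes = {p: route_under_hybrid(p) for p in ["src/index.rs", "docs/guide.md", "README"]}
```

Keeping code on the lexical path means exact identifiers are matched as written instead of being reordered by embeddings that were not trained on code.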
Speed and latency now have their own page: see Speed for setup time, warm per-query latency, and how it scales to thousands of pages.
The core findings
| Finding | Status | Headline |
|---|---|---|
| Second-hop tax | Confirmed (n=1327, CIs) | Every relevance-based selection taxes the multi-hop second hop; a 0.30 filter keeps only 44% of second hops. |
| Reasoning preservation | Confirmed (4 models, n=300) | Aggressive filtering is net-harmful on all 4 models (−0.06 to −0.15); the rescued subset gains +0.15 to +0.23. |
| Context dilution | Confirmed (conditional) (3 models, n=200) | At ~30k-token contexts, stuffing-it-all-in collapses accuracy; pruning recovers it where dilution bites (gpt-4o-mini +0.21), null on dilution-robust models. |
| CUAD contracts | Confirmed (50 contracts, 644 q) | −80% tokens with gold evidence retained (≥0.8 on 88%) at ~1.7ms/query. |
| Chunk granularity | Confirmed (sweep vs LC/LI) | Granularity, not strategy, is the lever: 256→128 lifts multi-hop ≥0.8 retention 54%→77%. |
| Framework comparison | Measured (n=150) | Leads on multi-hop (retention + F1); level with LlamaIndex on contracts; both ahead of LangChain on contract answers. |
| Semantic mismatch | Confirmed (conditional) | BM25 fails completely under paraphrase/synonymy (R@1 0%); dense (BGE, exact cosine) recovers (R@1 68%); naive RRF hybrid is worse than dense. Motivates lexical-first + a dense tier. |
| Global dense | Confirmed (semantic-mismatch probe, n=25) | Where the answer shares no terms with the query, BM25 (and any BM25-pruned approach) ≈ 20–32% recall@1; global dense (retrieval="semantic", exact cosine over all chunks, no ANN) hits 88% / 96% @1/@3. ~7ms/query. |
| Embedding bake-off | Confirmed | A real embedder (BGE) lifts recall +99% vs hashing — when it’s in the action path. |
| Type-aware code retrieval | Measured (own source, n=12) | For code, lexical leads (91%) and dense-everything trails (75%); routing code→BM25 under hybrid lifts recall@3 66%→83% vs embedding+reranking it. |
How the evidence is structured
Every finding follows the same shape (Hypothesis → Status → Setup → Metrics → Failure cases → Interpretation → Caveats → What changed afterward) and ships with a reproduce command and a captured raw-output report. The defaults grounded in this evidence:
- strategy="reasoning_preserving" ← the second-hop tax + reasoning-preservation findings
- strategy="auto" as a size-gated dilution pruner ← the context-dilution finding
- selective (not uniform) reranker escalation ← the reranking-limits finding
- the conservative, zero-harm adaptive controller ← the retriever-coupling findings
Scope & caveats
So you can weigh the numbers precisely:
- Answer-quality tiers use gpt-4o-mini; one budget per dataset; two datasets.
- The framework comparison isolates context assembly (BM25 for all three); the frameworks’ default vector retrievers are a separate question.
- The CUAD contract numbers are evidence retention (word-recall) — a proxy for, not the same as, downstream answer quality.
- CUAD extraction F1 is low in absolute terms (it’s a hard task); the relative ranking across tools is the signal.
The full evidence — every finding, including the falsified hypotheses — lives in the project’s evidence layer on GitHub.