Retrieval & context tips

These are operational laws RedHop’s experiments converged on, measured across four model families, reported with bootstrap 95% confidence intervals, and reproducible from the evidence layer. They’re useful whatever tool you use. RedHop just applies them for you.

Should I optimize this context at all?

Small & focused (fits comfortably, few distractors) → pass it through. Pruning here is neutral-to-harmful: you risk dropping reasoning evidence for no gain.
Large or junk-heavy (diluted) → prune to budget. Pruning recovers accuracy lost to attention dilution (lost-in-the-middle).

redhop.Document.from_text(text).context(query) uses strategy="auto" by default. It decides this from input size and reports which it chose and why.
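The pass-through-or-prune decision above can be sketched in a few lines. This is a toy illustration, not RedHop's routing logic: the `decide` function, the `junk_fraction` input, and the 0.5 threshold are all assumptions made for the example, and how you estimate the junk fraction is left open.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "pass_through" or "prune"
    reason: str


def decide(total_tokens, budget_tokens, junk_fraction, junk_threshold=0.5):
    """Toy router: pass small, focused contexts through; prune diluted ones.

    junk_threshold is a hypothetical cut-off, not a measured value.
    """
    if total_tokens <= budget_tokens and junk_fraction < junk_threshold:
        return Decision("pass_through", "fits the budget with few distractors")
    if total_tokens > budget_tokens:
        return Decision("prune", "exceeds the budget")
    return Decision("prune", "fits, but is junk-heavy (diluted)")


print(decide(2_000, 8_000, 0.1))    # small and focused
print(decide(50_000, 8_000, 0.1))   # too large
print(decide(2_000, 8_000, 0.8))    # fits, but mostly junk
```

The point of returning a reason alongside the action is the same as in the text: the decision should be observable, not silent.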

Know your workload shape

Before any knob, ask one question: how does a correct answer get assembled from your corpus? Almost every workload falls into one of three shapes, and the shape, not the file format, tells you which lever to pull. You can usually tell which shape you have just by reading a handful of your own queries.

Your workload looks like…	The answer is…	The failure mode to expect	Reach for
Single-hop extraction: contracts, runbooks, API refs, financial filings, “find the clause / value / section that says X”	in one place, and the query usually shares words with it	if every query has a fixed wrapper, the boilerplate dilutes the real terms	lexical default. `strategy="raw_topk"` for single-doc extraction. Strip the template (below)
Multi-hop reasoning: “what’s the nationality of the director of film X?” or anything that chains two+ facts	spread across several chunks, and the later “hop” is often low-relevance to the question	a relevance filter prunes the bridge chunk, so the model can’t complete the chain	`strategy="reasoning_preserving"` (where the default `strategy="auto"` routes). Don’t hard-filter. Don’t apply a cross-encoder uniformly
Paraphrase / vocabulary mismatch: support & HR KBs where users phrase things nothing like the docs	in one place, but worded with different words than the query	BM25 keyword matching misses it entirely	`retrieval="semantic"` (dense), or add `rerank="cross-encoder"` (verify on your corpus first)

Two notes that save pain:

The shapes can combine. A templated and single-hop workload (legal review) wants both raw_topk and a template stripper. A multi-hop workload with paraphrased questions wants reasoning_preserving and semantic. Read the table as “which rows apply,” not “pick one.”
When unsure, you’re probably multi-hop-ish, so under-filter. The expensive mistake is treating a multi-hop workload as single-hop and hard-filtering away the second hop. strategy="auto" (the default) is built for exactly this uncertainty.

The laws below are the evidence behind each of these calls.

The laws

These come in two kinds: a few that reframe how to think about the problem, and a few that are just what the measurements forced us to accept. Each links to the finding it rests on, including the ones we falsified.

How to think

Under-filter: the cost is asymmetric. This is the spine of everything else, in three beats:
- Relevance ≠ reasoning usefulness. A chunk can be low-relevance to the query yet essential: the multi-hop “second hop” (second-hop tax).
- So removing the wrong chunk costs more than keeping junk. Across models, aggressive filtering was net-harmful: the lost reasoning evidence cost more than the distractors removed (distractor robustness).
- So make “do nothing” the default and intervention the exception. Avoiding damage beats chasing average lift (reasoning preservation).
The practical rule: don’t hard-filter by query relevance, and keep low-relevance chunks that are linked to relevant ones.
The decision is the value, not a magic optimizer. Naive top-k captures most of the gain, and no pruning algorithm dominates. Getting the when right matters more than the how (context economics).
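The practical rule above (don't hard-filter, and keep low-relevance chunks linked to relevant ones) can be sketched as follows. This is a minimal illustration under stated assumptions: `select_chunks`, the `scores` dict, and the `links` map are hypothetical names, and how links are detected (shared entities, cross-references) is not specified here.

```python
def select_chunks(scores, links, k=2):
    """Top-k by relevance, plus any chunk linked to a selected one.

    scores: {chunk_id: relevance score}, in document order
    links:  {chunk_id: set of chunk_ids it shares an entity or reference with}
    """
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    keep = set(top)
    for cid in top:
        keep |= links.get(cid, set())   # one hop out: protect bridge evidence
    # Return the kept chunks in their original document order.
    return [cid for cid in scores if cid in keep]


scores = {"film": 0.9, "cast": 0.4, "director_bio": 0.05, "weather": 0.03}
links = {"film": {"director_bio"}}   # the film chunk names the director
print(select_chunks(scores, links, k=1))   # ['film', 'director_bio']
```

A pure relevance cut at k=1 would have kept only `film` and dropped the second hop. The linked chunk survives despite its low score.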

What we measured

Optimize under dilution, not by raw length. 20k focused tokens beat 5k noisy ones. The driver is junk fraction / evidence density, not size alone (context dilution).
Stronger rerankers aren’t universally safer. A cross-encoder applied uniformly on multi-hop can lower recall by demoting the bridge evidence (reranking limits).
Optimization is model-aware. Frontier models tolerate distractors. Smaller/open models are more sensitive. The same policy isn’t optimal for all (distractor robustness).
Defaults are priors, not guarantees: measure on your own corpus. Every default here is the average call across our benchmarks, and your workload is n=1 to us. Before trusting a non-default knob (a cross-encoder, semantic, a template stripper), confirm it lifts your numbers, not ours.
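One way to "confirm it lifts your numbers" is a paired percentile bootstrap over per-query scores. The sketch below is generic statistics, not RedHop's evaluation code. `bootstrap_diff_ci` is a hypothetical helper, and the per-query scores are assumed to be something like 0/1 hits from your own gold set.

```python
import random


def bootstrap_diff_ci(baseline, candidate, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(candidate - baseline), paired per query."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample queries with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Made-up per-query hits (1 = gold retrieved) for a default and a non-default knob.
default_hits = [0, 1] * 20
knob_hits = [1, 1] * 20
mean, lo, hi = bootstrap_diff_ci(default_hits, knob_hits)
print(f"lift={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

A reasonable reading: keep the non-default knob only if the interval sits clearly above zero on your own queries.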

Retrieval: BM25 first, dense for vocabulary gaps

This one is about finding candidates, not optimizing them, which is a different pipeline stage from the laws above. It corresponds to the “paraphrase / vocabulary mismatch” row of the workload table.

BM25 (zero-dependency) is the default and the best choice for lexical/keyword queries, but it misses paraphrased or low-overlap ones. To recover those, opt into dense retrieval (retrieval="semantic"). It embeds every chunk and scores the query against all of them by cosine similarity. The search is exact and ANN-free: just name a model, still no vector DB. Measured on HotpotQA, recall@3 rose from ≈ 0.49 (BM25) to ≈ 0.80 (dense). On a synonym-mismatch probe, recall@1 rose from 20% (BM25) to 88% (dense) (semantic mismatch).
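To see why BM25 misses paraphrases entirely, here is a minimal Okapi BM25 scorer. It is a textbook sketch with whitespace tokenization and common default parameters (k1=1.2, b=0.75), not RedHop's implementation, and the two example chunks are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc against the query with plain Okapi BM25 (whitespace tokens)."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue   # no shared token means no contribution at all
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


docs = [
    "employees accrue paid time off each month",
    "the office is closed on public holidays",
]
print(bm25_scores("vacation days", docs))   # [0.0, 0.0]
print(bm25_scores("paid time off", docs))   # only the first chunk scores
```

The "vacation days" query means the same thing as the first chunk but shares no tokens with it, so every term is skipped and the score is exactly zero. That is the gap dense retrieval is meant to close.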

Short, noisy queries: char-ngram, not dense

A different failure from the one above. When queries are short and noisy (2-5 token product references, dealer orders, voice/OCR transcripts where “lays” arrives as “1ays”), the gap is lexical, not semantic, and dense doesn’t help: it can’t rescue a 1-2 token query, and the zero-dep ceiling is ~0.56. The lever is subword lexical matching with no model: language="char_ngram". On brand-typo’d queries it held early precision at ~0.98 where word-BM25 fell to ~0.10 (CATALOG_REGIME). The caveat: it’s a recall booster, not a drop-in replacement. Its clean set-coverage erodes at scale, so pair it with word-BM25 rather than replacing it.
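The idea behind subword matching can be shown with character trigrams. This sketch uses a plain Jaccard overlap to make the effect visible. It is not how language="char_ngram" is implemented, and the helper names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, padded so word edges count too."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the two character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def word_overlap(a, b):
    """Jaccard overlap of whole-word sets, for comparison."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


print(word_overlap("1ays", "lays"))             # 0.0
print(round(ngram_overlap("1ays", "lays"), 2))  # 0.33
```

Whole-word matching sees two unrelated tokens. The trigram sets still share "ays" and "ys ", so the OCR-damaged query keeps a lexical foothold without any model.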

Catalog disambiguation: measure the whole set, sweep field weights

When a query maps to a set (all variants of a product), recall@k against a single gold hides a half-retrieved family — a recall@20 of 1.000 can sit on top of 0.25 strict set-coverage. Measure the real thing with gold_families → set_coverage, and make that the number you optimize. Per-field BM25 weights (bm25_field_weights=[text, source, heading]) are a domain lever, but an honest one: boosting a field the near-duplicates share is inert (a measured null, CATALOG_REGIME Panel D). A boost helps only when the boosted field separates the answer from its near-duplicates, so sweep on a held-out set and keep it only if set-coverage rises. Full walkthrough: Retrieval for catalogs & noisy queries.
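The gap between the two metrics is easy to reproduce by hand. In the sketch below, `hit_at_k` and `set_coverage_at_k` are hypothetical helpers, and "strict set-coverage" is assumed to mean the fraction of a gold family's members found in the top k. RedHop's own gold_families → set_coverage computation may differ in detail. The product IDs are made up.

```python
def hit_at_k(ranked, gold_family, k):
    """1.0 if any member of the gold family appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(gold_family) else 0.0


def set_coverage_at_k(ranked, gold_family, k):
    """Fraction of the gold family's members that appear in the top k."""
    family = set(gold_family)
    return len(set(ranked[:k]) & family) / len(family) if family else 0.0


ranked = ["cola-330ml", "lemonade-1l", "water-500ml"]
family = ["cola-330ml", "cola-500ml", "cola-1l", "cola-2l"]

print(hit_at_k(ranked, family, k=3))            # 1.0
print(set_coverage_at_k(ranked, family, k=3))   # 0.25
```

One retrieved variant is enough to score a perfect hit, while three quarters of the family is still missing. Optimizing the first number hides exactly the failure the second one exposes.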

Templated queries: strip the boilerplate first

This one’s worth a plain-English walkthrough, because the words “boilerplate” and “stripping” hide a very simple idea.

When does it apply? When every query in your workload is the same shape and only a word or two changes. Legal review, support triage, anything fed from a structured form looks like this:

"Highlight the parts of this contract a lawyer should review regarding Termination."
"Highlight the parts of this contract a lawyer should review regarding Governing Law."
"Highlight the parts of this contract a lawyer should review regarding Indemnification."

Boilerplate is the copy-paste wrapper that repeats in every query: Highlight the parts of this contract a lawyer should review regarding …. It tells the retriever nothing, because it matches every query and every document equally. The only words that actually distinguish one query from another (the discriminators) are Termination, Governing Law, Indemnification.

Stripping just means deleting that wrapper before you search, so the retriever sees only the words that matter:

Before:  Highlight the parts of this contract a lawyer should review regarding Termination
After:    Termination

Why it helps: BM25 weights every query word, so the 19 boilerplate words dilute the signal from the 1–2 real ones. Strip them and the match sharpens onto the real target. On CUAD this is a measured 81.3% → 87.7% retention lift.

There are two ways to write a stripper:

Delete the bad: list the wrapper and erase it (re.sub(boilerplate, "", q)). Simple, but it breaks when the wrapper wording drifts.
Keep the good: describe the structure of what to extract and pull only that out. This is more robust when the boilerplate varies but the skeleton (a quoted name, a Details: section) stays constant:

import re

def strip_cuad_template(q: str) -> str:
    # Keep the quoted clause name and the "Details:" elaboration; drop the rest.
    clause  = re.search(r'"([^"]+)"', q)
    details = re.search(r'Details?:\s*(.+?)\s*$', q, re.S)
    parts = [clause.group(1) if clause else "", details.group(1) if details else ""]
    return " ".join(p for p in parts if p).strip() or q   # `or q` = safe fallback
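For comparison, the "delete the bad" approach looks roughly like this. The pattern below is written for the three example queries earlier in this section and is an assumption, not a shipped helper. `strip_by_deletion` is a hypothetical name.

```python
import re

# Hypothetical wrapper pattern for the example queries above; yours will differ.
BOILERPLATE = re.compile(
    r"^\s*highlight the parts of this contract a lawyer should review regarding\s*",
    re.IGNORECASE,
)


def strip_by_deletion(q: str) -> str:
    """Erase the known wrapper; fall back to the raw query if nothing is left."""
    stripped = BOILERPLATE.sub("", q).strip().rstrip(".")
    return stripped or q


exact = "Highlight the parts of this contract a lawyer should review regarding Termination."
drifted = "Please highlight the sections of this contract a lawyer should review regarding Termination."

print(strip_by_deletion(exact))     # Termination
print(strip_by_deletion(drifted))   # wrapper survives: the pattern no longer matches
```

The drifted query shows the weakness named above: a small wording change leaves the whole wrapper in place, silently. That is the case where extracting by structure tends to hold up better.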

Two things to keep honest:

Your boilerplate isn’t CUAD’s, so your stripper isn’t this one. The strip_cuad_template name has the workload baked in on purpose: you’d write strip_support_template, strip_invoice_template, etc., one per workload. RedHop deliberately ships no built-in strip_template(): templates are workload-specific, and a built-in would make the wrong call on the next one.
For single-doc extraction also set strategy="raw_topk". On contract-shape tasks the auto-routed reasoning_preserving strategy solves a multi-hop problem you don’t have, and raw_topk beats it by ~4 points.

Full mechanism, numbers, and the runnable recipe: Choosing a config → Templated queries with heavy boilerplate.

The one-knob alternative: just turn on `retrieval="hybrid"`

If writing a stripper sounds like more work than you want to do, here’s the honest alternative: you can get most of the same lift by flipping a single flag. Hybrid retrieval reads each chunk as semantic content (via a small embedding model) instead of just counting tokens, so the boilerplate ratio stops mattering: the model knows the wrapper words are uninformative without anyone having to tell it.

doc = redhop.Document.from_file("contract.pdf", options=redhop.DocumentOptions(retrieval="hybrid", model="bge-small"))
ctx = doc.context(user_query)   # raw query, no preprocessing

Measured on the same CUAD setup: hybrid on the raw template query gives +5.3 points (81.3% → 86.7%), close to the +6.4 that template stripping gives on its own. You pay ~10ms per query (vs ~2.5ms BM25) and an 80MB model download on first use.

So when do you reach for which?

Lowest-effort, near-best: turn on retrieval="hybrid". One config flag, roughly +5 points with no extra work, and no dict to maintain.
Best-quality and fastest: stay on the BM25 default, compile a Stripper(...) (or use analyze_query_set to surface the boilerplate for you), and pair it with a workload Vocabulary({...}) (the next section). Run both through doc.context_with_rewrites(query, [stripper, vocab]). On CUAD this gets to 90.7% at ~2.5ms: higher retention and lower latency than hybrid plus a cross-encoder (hybrid+CE).

The trade is straightforward: hybrid saves you the dict-and-stripper work but caps your headroom at what the embedder can do unsupervised (~86–88%). Stripping + expansion takes more setup but stacks productively and runs at native BM25 speed.

Expand the discriminators (when stripping isn’t enough)

Once the boilerplate is gone, your query is small but might still miss the gold passage, because the clause uses different words than the query asks for. Ask about Change of Control in a contract and the relevant clause probably talks about a merger, a successor, or an acquisition. The query and the gold span are semantically identical but lexically disjoint, so BM25 can’t connect them.

The fix is the mirror image of stripping. Stripping subtracts low-IDF noise (the wrapper that fires on everything). Expansion adds high-IDF discriminators (the rare terms that appear in the gold but not in the query).

import redhop

# YOUR workload's taxonomy. Build the dict by reading a handful of your gold
# spans and noting the recurring high-IDF terms — they're surprisingly stable
# per topic in any "fixed-taxonomy" workload (legal clauses, support-ticket
# categories, HR-policy buckets, etc.).
vocab = redhop.Vocabulary({
    "change of control": ["merger", "successor", "acquisition"],
    "non-compete":       ["restraint", "non-competition"],
    "indemnification":   ["hold harmless", "defend", "liability"],
})

stripper = redhop.Stripper(my_boilerplate)
ctx = doc.context_with_rewrites(user_query, [stripper, vocab])

# The audit trail makes every transformation observable.
for rec in ctx.report.query_rewrites:
    print(rec.stage, rec.matched, rec.added)

Vocabulary is token-level (an "ip" vocabulary key does NOT fire on "recipient"), doesn’t recursively chain, and dedupes synonyms across overlapping matches. The original query is preserved verbatim: the synonyms are appended, never substituted. Vocabulary.bidirectional({...}) gives symmetric matching (PTO ↔ “paid time off” ↔ “vacation”).
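The three behaviours described above (token-level matching, no recursive chaining, deduplication) can be illustrated in plain Python. This is a sketch of the semantics as described, not RedHop's Vocabulary implementation. `expand` is a hypothetical function, and the "ip" and "control" entries are made up for the example.

```python
import re


def expand(query: str, vocab: dict) -> str:
    """Append synonyms for every vocabulary key that matches on token boundaries."""
    tokens = re.findall(r"\w+", query.lower())
    added = []
    for key, synonyms in vocab.items():
        key_tokens = re.findall(r"\w+", key.lower())
        n = len(key_tokens)
        # Token-level match: the key must line up with whole query tokens.
        hit = n > 0 and any(
            tokens[i:i + n] == key_tokens for i in range(len(tokens) - n + 1)
        )
        if hit:
            # Dedupe across overlapping matches; synonyms are never re-expanded.
            added.extend(s for s in synonyms if s not in added)
    # The original query is kept verbatim; synonyms are only appended.
    return query if not added else f"{query} {' '.join(added)}"


vocab = {
    "change of control": ["merger", "successor", "acquisition"],
    "control": ["successor"],          # overlaps with the key above
    "ip": ["intellectual property"],
}
print(expand("Change of Control", vocab))
print(expand("notify the recipient", vocab))   # unchanged: "ip" is not a token here
```

Matching on token sequences instead of substrings is what keeps "ip" from firing inside "recipient".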

Why this isn’t PRF. Pseudo-relevance feedback (“read the top chunks, add their most common words to the query”) looks superficially similar but fails on boilerplate-heavy corpora. See CUAD_PRF_NULL. PRF re-injects the same low-IDF terms you just spent the strip step removing, because the top chunks share a lot of corpus boilerplate. Your hand-curated dict is the opposite: the synonyms are chosen to be high IDF on your corpus (rare across non-matching documents), so they sharpen the ranking instead of washing it out.

Measured on CUAD. With only a Stripper, ≥0.8 retention on the framework comparison is 87.7% (already past LlamaIndex’s 86%). Add a 34-key clause-name Vocabulary and it’s 90.7%, 4 points past LlamaIndex.

arm	≥0.8 retention
raw 24-word template	81.3%
Stripper	87.7%
Stripper + Vocabulary	90.7%
raw + Vocabulary (control)	86.3%

Honest scope, two things worth knowing:

Hand-curated synonyms ≠ a recipe for synonyms. The dict was built by reading CUAD gold spans and noting recurring terms. An unfamiliar workload needs the same domain inspection. RedHop deliberately ships no automated synonym miner here: that’s the falsified PRF arc.
The mechanism direction matters. Adding the right terms (high-IDF, workload-curated) gives the lift. Adding the wrong terms (low-IDF, corpus-frequency-derived) takes it away. If you’re not sure your dict is high-IDF on your corpus, A/B it with redhop.evaluate against a small gold sample before committing.
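A quick sanity check before the A/B is to look at the IDF of each candidate synonym on your own chunks. The helper below uses one common smoothed IDF formula and a word-boundary match. It is an illustrative sketch, not a RedHop API, and the four chunks are made up.

```python
import math
import re


def idf(term, chunks):
    """Smoothed inverse document frequency of a term over a list of chunks."""
    pattern = re.compile(rf"\b{re.escape(term.lower())}\b")
    df = sum(1 for c in chunks if pattern.search(c.lower()))
    return math.log((len(chunks) + 1) / (df + 1)) + 1.0


chunks = [
    "this agreement shall terminate upon merger",
    "this agreement is governed by the laws of the state",
    "this agreement may be assigned to a successor",
    "this agreement and the parties agree as follows",
]

for term in ["agreement", "merger", "successor"]:
    print(term, round(idf(term, chunks), 3))
```

Here "agreement" appears in every chunk and lands at the floor of the scale, which is the profile of a term that would wash out the ranking. "merger" and "successor" appear once each and score higher, which is the profile you want from a dictionary entry.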

Full mechanism, the CUAD dict, and the four-arm probe that justifies the numbers: CUAD_CLAUSE_EXPANSION.

Vocabulary.enrich(...): for short, opaque retrieval units

There’s also a chunk-side primitive: vocab.enrich(chunk_text) applies the same compiled vocabulary to chunks at ingest time. Use it when your retrieval units are short and opaque, like schema columns (emp_compensation), error codes (ERR_4012), API symbols (usrSvc), or defined contract terms, and you have a decoding dictionary that maps those tokens to natural language. The mechanism is the doc2query family: append the decoding tokens to each chunk so natural-language queries that paraphrase the meaning can land on chunks whose original surface text shared no words with the query.
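The chunk-side mechanism is simple enough to sketch directly. This is a plain-Python illustration of the doc2query-style append described above, not the vocab.enrich implementation. The `enrich` function and the decodings in the dictionary are hypothetical.

```python
import re


def enrich(chunk_text: str, decoder: dict) -> str:
    """Append natural-language decodings for opaque tokens found in the chunk."""
    tokens = set(re.findall(r"[A-Za-z0-9_]+", chunk_text))
    extra = []
    for token, decoding in decoder.items():
        if token in tokens and decoding not in extra:
            extra.append(decoding)
    # The original text is kept as-is; decodings are only appended.
    return chunk_text if not extra else f"{chunk_text} {' '.join(extra)}"


# Hypothetical decoding dictionary; the meanings are invented for the example.
decoder = {
    "emp_compensation": "employee salary pay",
    "ERR_4012": "payment gateway timeout",
}

print(enrich("column emp_compensation DECIMAL", decoder))
print(enrich("plain prose with no opaque tokens", decoder))
```

After enrichment, a query such as "employee salary" shares tokens with the schema chunk even though the original column name shared none.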

Where it has been measured to lift: schema-style retrieval. Curated chunk-side enrichment on a Spider-shape sample lifted mean column recall 0.77 → 0.97 (+0.19) and ≥0.8 retention from 63% to 93% (SPIDER_ENRICH).

Where it has been measured to hurt: long prose chunks. CUAD prose regressed −2.0 pts when enriched with auto-extracted definition vocabulary (CUAD_ENRICH_DEFINITIONS_NULL) because the appended workload-pervasive vocabulary diluted the term-IDF distribution. So don’t reach for enrich on prose corpora.

Always A/B with redhop.evaluate(...) against your gold set before production adoption. Full regime rule, use cases, and failure modes: VOCABULARY_ENRICH.

Knobs (and sane defaults)

Knob	Where	Default	When to change
`chunk_size`	`from_text` (index-time)	128	smaller for very tight budgets
`strategy`	`from_text`	`"auto"`	rarely
`budget`	`context()` (query-time)	doc default	per-query, freely
`language`	`from_text` (index-time)	raw pipeline (no stemming)	`"english"` for code search / inflection-heavy content (CamelCase split + Snowball stem). Language code (`"german"`, `"french"`, …) for non-English
`code_neighbors_default`	`from_text`/`from_file` (index-time)	`1`	`0` for memory-tight code search where BM25 already surfaces body chunks. `2`/`3` at loose budgets to recover function bodies further from the seed (CODE_NEIGHBORS_DEFAULT)
`prose_heading_default`	`from_text`/`from_file` (index-time)	`true`	`false` to skip auto-attaching section headings. Attaching them is measurably helpful at typical budgets (+7pt ≥0.8) but a wash on categorical “## Setup”-style headings (PROSE_HEADING_DEFAULT)

chunk_size is fixed at construction (it’s how the index is built). budget is per-query and free to vary without re-indexing.

Why the default is a minimal analyzer (and when to opt back in)

Up through 0.3.1 the default analyzer applied English Snowball stemming (so "highlighted" matched "highlight"), plus CamelCase splitting and stopword filtering. In 0.3.2 the default flipped to a minimal pipeline (Unicode tokenization, lowercase, ASCII fold, nothing else) because measurement said the heavier pipeline was hurting more than helping:

Workload	english ≥0.8	raw (new default) ≥0.8	english p50	raw p50
CUAD	86%	91% (+5)	6.4 ms	3.8 ms
HotpotQA	100%	100% (tied)	2.9 ms	2.3 ms
MuSiQue	90%	97% (+7)	3.4 ms	2.3 ms

Stemming was hurting via false-positive stem collisions ("settles" / "settling" / "settled" all → "settl"), inflating BM25 scores on chunks that shared any form and drowning out the discriminating proper nouns.

# 0.3.2 default — raw pipeline, no extra arguments needed:
doc = redhop.Document.from_text(text)

# Opt back in to English Snowball (camelCase + stopwords + stemmer):
doc = redhop.Document.from_text(text, options=redhop.DocumentOptions(language="english"))

When to opt back in to language="english":

Code search. The CamelCase splitter is what makes "compressVideo" matchable via "compress".
Heavy paraphrase between query and doc: queries about “acquisitions” against doc text mentioning “acquired”, “acquiring”. Test with redhop.evaluate(..., gold_chunks=...) on a sample.

Non-English content: use the language code ("german", "french", …). The language-specific stemmer handles morphology your content needs.
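To make the code-search case concrete, here is what a CamelCase splitter does to an identifier. The regex below is a common generic pattern and only an approximation of what an analyzer like this might do. It is not taken from RedHop's source, and `split_camel` is a hypothetical name.

```python
import re


def split_camel(token: str) -> list:
    """Split an identifier at lowercase-to-uppercase and acronym boundaries."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", token)
    return [p.lower() for p in parts]


print(split_camel("compressVideo"))   # ['compress', 'video']
print(split_camel("HTTPServer"))      # ['http', 'server']
print(split_camel("usrSvc"))          # ['usr', 'svc']
```

Without the split, "compressVideo" is a single token and a query for "compress" has nothing to match against.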

Full evidence + workload-specific recommendations: RAW_ANALYZER.

Bring your own chunker, if the workload calls for it

RedHop’s chunker is well-tuned for sentence-aware prose (MULTIHOP_CHUNK_SIZE_NULL shows bigger chunk_size regresses on multi-hop, and smaller doesn’t help much). But if you’ve measured a different chunker that fits your workload (semantic chunkers, AST-aware code chunkers, schema-aware splitters for tabular data, or any third-party Markdown / LaTeX / academic-paper splitter), wire it in via Document.from_chunks(...):

# Use any chunker you want; just hand RedHop the resulting strings as Chunks.
from pathlib import Path

import redhop
from your_chunker import chunk_into_sections  # placeholder for your own chunker

# chunk_into_sections is assumed to return (title, text) pairs.
sections = chunk_into_sections(Path("paper.tex").read_text())
chunks = [
    redhop.Chunk(text, source="paper.tex", id=f"sec-{i}", metadata={"section": title})
    for i, (title, text) in enumerate(sections)
]
doc = redhop.Document.from_chunks(chunks)
ctx = doc.context("What is the main contribution?")

The constant-chunking matrix (MULTIHOP_CONSTANT_CHUNKING) showed two things worth knowing before you spend time on this:

The chunker dominates. It is the lever: RedHop’s BM25, LangChain’s, and LlamaIndex’s are essentially flat on the same chunks, and the chunker choice is where ±20pts of retention live.
There’s no universally-best chunker. RedHop’s sentence-aware chunker wins on HotpotQA’s short-paragraph shape. LangChain’s char-recursive chunker ties on MuSiQue’s compositional multi-hop. If your workload is something else (legal cross-references, scientific papers, structured data), test on your own corpus with redhop.evaluate(..., gold_chunks=...) before committing.

The full evidence behind each law, including the hypotheses that were falsified, lives in the project’s evidence layer on GitHub.

Next: vs LangChain / LlamaIndex (the same contract question, three ways) · Benchmarks (every number, reproducible).