Retrieval & context tips
These are operational laws RedHop’s experiments converged on — measured across four model families, reported with bootstrap 95% confidence intervals, and reproducible from the evidence layer. They’re useful whatever tool you use; RedHop just applies them for you.
Should I optimize this context at all?
- Small & focused (fits comfortably, few distractors) → pass it through. Pruning here is neutral-to-harmful: you risk dropping reasoning evidence for no gain.
- Large or junk-heavy (diluted) → prune to budget. Pruning recovers accuracy lost to attention dilution (lost-in-the-middle).
redhop.Document.from_text(text).context(query) uses strategy="auto" by
default — it decides this from input size and reports which it chose and why.
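
As a rough illustration of that decision, here is a minimal sketch of a size-based pass-through/prune rule. It is not RedHop's implementation: `choose_strategy`, the `Decision` type, and the `comfort` threshold are hypothetical names and values chosen for the example, and the only assumption borrowed from the text is that the choice is driven by input size and reported with a reason.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    strategy: str  # "passthrough" or "prune"
    reason: str    # human-readable explanation of the choice


def choose_strategy(total_tokens: int, budget: int, comfort: float = 0.5) -> Decision:
    """Pass small inputs through untouched; prune only when the input is large.

    `comfort` is a hypothetical threshold: the fraction of the budget below
    which the input counts as "fits comfortably".
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if total_tokens <= comfort * budget:
        return Decision(
            "passthrough",
            f"{total_tokens} tokens fit comfortably in a budget of {budget}",
        )
    return Decision(
        "prune",
        f"{total_tokens} tokens exceed {comfort:.0%} of the budget of {budget}",
    )


small = choose_strategy(total_tokens=300, budget=4000)
large = choose_strategy(total_tokens=20000, budget=4000)
print(small.strategy, "|", small.reason)
print(large.strategy, "|", large.reason)
```

Note that "do nothing" is the first branch: the sketch only intervenes once the input is clearly too large, which matches the asymmetric default described in the laws.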
The laws
- Relevance ≠ reasoning usefulness. A chunk can be low-relevance to the query yet essential, as with the “second hop” in multi-hop reasoning. Don’t hard-filter by query relevance; keep low-relevance chunks linked to relevant ones.
- Removing the wrong chunk is worse than keeping extra junk. Across models, aggressive filtering was net-harmful — the lost reasoning evidence cost more than the distractors removed. Bias toward under-filtering.
- Optimize under dilution, not by raw length. 20k focused tokens beat 5k noisy ones. The driver is junk fraction / evidence density, not size alone.
- The decision is the value, not a magic optimizer. Naive top-k captures most of the gain; no pruning algorithm dominates. Getting the when right matters more than the how.
- Stronger rerankers aren’t universally safer. A cross-encoder applied uniformly on multi-hop can lower recall by demoting the bridge evidence.
- Optimization is model-aware. Frontier models tolerate distractors; smaller/open models are more sensitive. The same policy isn’t optimal for all.
- Safe optimization is asymmetric. Make “do nothing” the default and intervention the exception. Avoiding damage beats chasing average lift.
- BM25 is the default; reach for dense on semantic queries. BM25 (zero-dependency) is best for lexical/keyword queries but misses paraphrased or low-overlap ones. To recover those, opt into dense retrieval (retrieval="semantic"): it embeds every chunk and scores the query against all of them by cosine similarity, exact rather than approximate (no ANN index and no vector DB; you only name a model). Measured on HotpotQA: recall@3 rises from ≈ 0.49 (BM25) to ≈ 0.80 (dense); on a synonym-mismatch probe, recall@1 rises from 20% (BM25) to 88% (dense).
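
To see why BM25 misses low-overlap queries, the sketch below scores three toy documents with the standard Okapi BM25 formula in plain Python. It is an illustration only, not RedHop's scorer: the whitespace tokenizer, the `k1` and `b` values, and the example documents are assumptions made for the demo.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; real systems normalize more carefully.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap, so no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "how to fix a car engine",
    "automobile maintenance guide for beginners",
    "baking bread at home",
]
lexical = bm25_scores("car engine", docs)         # shares words with docs[0]
paraphrase = bm25_scores("vehicle upkeep", docs)  # shares no words with any doc
print(lexical)
print(paraphrase)
```

The second query is about the same topic as the first two documents, but because it shares no terms with them every score is zero. That is the gap an embedding-based comparison is meant to close.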
Knobs (and sane defaults)
| Knob | Where | Default | When to change |
|---|---|---|---|
| chunk_size | from_text (index-time) | 128 | smaller for very tight budgets |
| strategy | from_text | "auto" | rarely |
| budget | context() (query-time) | doc default | per-query, freely |
chunk_size is fixed at construction (it’s how the index is built); budget is
per-query and free to vary without re-indexing.
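
The index-time versus query-time split can be sketched with a toy class. `ToyIndex` is a hypothetical stand-in written for this example, not RedHop's `Document`; it counts whitespace-separated words as tokens and ranks chunks by simple term overlap, both of which are simplifying assumptions.

```python
class ToyIndex:
    """Hypothetical stand-in for an index: chunking happens once, at construction."""

    def __init__(self, text: str, chunk_size: int = 128):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        tokens = text.split()
        self.chunk_size = chunk_size
        # Fixed here: changing chunk_size later would mean rebuilding the index.
        self.chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

    def context(self, query: str, budget: int) -> str:
        """Greedily pack the highest-overlap chunks into at most `budget` tokens."""
        terms = set(query.lower().split())
        # Rank chunk indices by query-term overlap; the sort is stable,
        # so ties keep their original document order.
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: -sum(tok.lower() in terms for tok in self.chunks[i]),
        )
        chosen, used = [], 0
        for i in ranked:
            size = len(self.chunks[i])
            if used + size <= budget:
                chosen.append(i)
                used += size
        # Emit the selected chunks in original document order.
        return " ".join(" ".join(self.chunks[i]) for i in sorted(chosen))


text = " ".join(f"w{i}" for i in range(40))
index = ToyIndex(text, chunk_size=8)            # built once: 5 chunks of 8 tokens
small = index.context("w3 w20", budget=8)       # tight budget
large = index.context("w3 w20", budget=24)      # same index, larger budget
print(len(small.split()), len(large.split()))
```

Both calls reuse the same `index.chunks`; only the amount of selected text changes with the budget.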
The full evidence behind each law — including the hypotheses that were falsified — lives in the project’s evidence layer on GitHub.
Next: vs LangChain / LlamaIndex — the same contract question, three ways · Benchmarks — every number, reproducible.