Retrieval & context tips

These are operational laws RedHop’s experiments converged on — measured across four model families, reported with bootstrap 95% confidence intervals, and reproducible from the evidence layer. They’re useful whatever tool you use; RedHop just applies them for you.

  • Small & focused (fits comfortably, few distractors) → pass it through. Pruning here is neutral-to-harmful: you risk dropping reasoning evidence for no gain.
  • Large or junk-heavy (diluted) → prune to budget. Pruning recovers accuracy lost to attention dilution (lost-in-the-middle).

redhop.Document.from_text(text).context(query) uses strategy="auto" by default: it chooses between passing the input through and pruning it based on input size, and reports which it chose and why.
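To make the pass-through versus prune decision concrete, here is a minimal sketch of a size-based rule. It is not RedHop's implementation: `estimate_tokens`, `choose_strategy`, the `headroom` factor, and the strategy labels are all hypothetical names chosen for illustration, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
def estimate_tokens(text):
    """Crude token estimate: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def choose_strategy(text, budget, headroom=0.8):
    """Return (strategy, reason) for a hypothetical size-based 'auto' rule.

    Inputs that fit comfortably (below headroom * budget) are passed through
    untouched; anything larger is flagged for pruning down to the budget.
    """
    n = estimate_tokens(text)
    limit = budget * headroom
    if n <= limit:
        return "pass_through", f"{n} tokens fit comfortably within budget {budget}"
    return "prune", f"{n} tokens exceed {headroom:.0%} of budget {budget}"


small = "a short, focused note about one topic"
large = "filler " * 500

print(choose_strategy(small, budget=100))
print(choose_strategy(large, budget=100))
```

Returning the reason alongside the choice mirrors the idea of reporting which strategy was picked and why, so the decision stays auditable.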

  1. Relevance ≠ reasoning usefulness. A chunk can be low-relevance to the query yet essential — the multi-hop “second hop.” Don’t hard-filter by query relevance; keep low-relevance chunks linked to relevant ones.
  2. Removing the wrong chunk is worse than keeping extra junk. Across models, aggressive filtering was net-harmful — the lost reasoning evidence cost more than the distractors removed. Bias toward under-filtering.
  3. Optimize under dilution, not by raw length. 20k focused tokens beat 5k noisy ones. The driver is junk fraction / evidence density, not size alone.
  4. The decision is the value, not a magic optimizer. Naive top-k captures most of the gain; no pruning algorithm dominates. Getting the when right matters more than the how.
  5. Stronger rerankers aren’t universally safer. A cross-encoder applied uniformly to multi-hop queries can lower recall by demoting the bridge evidence.
  6. Optimization is model-aware. Frontier models tolerate distractors; smaller/open models are more sensitive. The same policy isn’t optimal for all.
  7. Safe optimization is asymmetric. Make “do nothing” the default and intervention the exception. Avoiding damage beats chasing average lift.
  8. BM25 is the default; reach for dense on semantic queries. BM25 (zero-dependency) is best for lexical/keyword queries but misses paraphrased or low-overlap ones. To recover those, opt into dense retrieval (retrieval="semantic"): it embeds every chunk and scores the query against all of them by cosine similarity, exactly, with no ANN index (just name a model; still no vector DB). Measured on HotpotQA, recall@3 rose from ≈ 0.49 (BM25) to ≈ 0.80 (dense); on a synonym-mismatch probe, recall@1 rose from 20% (BM25) to 88% (dense).
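Laws 1 and 2 can be illustrated with a small sketch: rather than hard-filtering by query relevance, keep the top-scoring chunks and also any chunk that shares a term with them. Everything here is an assumption made for illustration: the function names, the word-overlap score, the tiny stopword list, and the definition of "linked" as sharing a non-query term. RedHop's actual linking logic is not described in this page.

```python
import re

STOPWORDS = {"the", "of", "is", "was", "in", "a", "an", "by", "which", "are", "and"}


def terms(text):
    """Lowercased content words, minus a tiny illustrative stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def select_with_bridges(query, chunks, top_k=1):
    """Keep the top_k chunks by query overlap, plus chunks linked to them.

    A chunk counts as 'linked' if it shares a term with a kept chunk that the
    query itself does not contain (for example, a name introduced by hop one).
    """
    q = terms(query)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: len(q & terms(chunks[i])),
                    reverse=True)
    seeds = ranked[:top_k]
    seed_terms = set().union(*(terms(chunks[i]) for i in seeds)) - q
    keep = set(seeds)
    for i, chunk in enumerate(chunks):
        if i not in keep and seed_terms & terms(chunk):
            keep.add(i)  # low query relevance, but connected to a relevant chunk
    return [chunks[i] for i in sorted(keep)]


query = "Which city is the birthplace of the director of Skyfall Road?"
chunks = [
    "Skyfall Road was directed by Jane Doe.",   # first hop: matches the query
    "Jane Doe grew up in Lisbon.",              # second hop: no query overlap
    "Tomatoes are rich in vitamin C.",          # distractor
]

kept = select_with_bridges(query, chunks)
print(kept)
```

In this toy case the second chunk has zero word overlap with the query, so a strict relevance filter would drop it, yet it carries the answer. The link through "Jane Doe" keeps it while the distractor is still removed.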
| Knob | Where | Default | When to change |
| --- | --- | --- | --- |
| chunk_size | from_text (index-time) | 128 | smaller for very tight budgets |
| strategy | from_text | "auto" | rarely |
| budget | context() (query-time) | doc default | per-query, freely |

chunk_size is fixed at construction (it’s how the index is built); budget is per-query and free to vary without re-indexing.
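The index-time versus query-time split can be sketched with a toy class. `TinyDocument` is a hypothetical stand-in, not RedHop's `Document`: it assumes word-based chunks, a word-overlap ranking, and a budget counted in words, purely to show that changing the budget never touches the chunk boundaries.

```python
import re


class TinyDocument:
    """Toy stand-in (not RedHop's class): chunk once, vary the budget per call."""

    def __init__(self, text, chunk_size=128, budget=256):
        words = text.split()
        # Index-time: chunk boundaries are fixed here and never recomputed.
        self.chunks = [words[i:i + chunk_size]
                       for i in range(0, len(words), chunk_size)]
        self.default_budget = budget

    def context(self, query, budget=None):
        # Query-time: only the selection changes; self.chunks is left untouched.
        budget = self.default_budget if budget is None else budget
        q = set(re.findall(r"\w+", query.lower()))
        ranked = sorted(
            self.chunks,
            key=lambda c: len(q & {w.lower().strip(".,") for w in c}),
            reverse=True,
        )
        picked, used = [], 0
        for chunk in ranked:
            if used + len(chunk) > budget:
                continue  # skip chunks that would overflow the budget
            picked.append(" ".join(chunk))
            used += len(chunk)
        return picked


text = "alpha beta gamma delta. epsilon zeta eta theta. iota kappa lambda mu."
doc = TinyDocument(text, chunk_size=4, budget=8)

tight = doc.context("zeta", budget=4)   # per-query override, no re-indexing
loose = doc.context("zeta")             # falls back to the document default
print(tight)
print(loose)
```

Changing `chunk_size` would require building a new `TinyDocument`, whereas `budget` is just an argument to each `context` call.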

The full evidence behind each law — including the hypotheses that were falsified — lives in the project’s evidence layer on GitHub.

Next: vs LangChain / LlamaIndex — the same contract question, three ways · Benchmarks — every number, reproducible.