Retrieval options

Retrieval is a ladder: start at the cheapest rung that works, climb only when your queries demand it.

The default is BM25 (retrieval="lexical"): zero dependencies, fully offline. The documents you reason over are often keyword-dense (contracts, API references, specs, manuals, logs, legal docs), where the words in the question are usually the words in the answer, so BM25 handles them with no model at all. Most document QA starts and ends here.
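To make "matches words" concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring in plain Python. It is not RedHop's implementation: the tokenizer, the `k1`/`b` values, the IDF variant, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # "+1 inside the log" IDF variant, which keeps scores non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Illustrative documents (made up for this sketch).
docs = [
    "A staff member was terminated after an audit.",
    "Termination for convenience requires thirty days written notice.",
    "An API call returns 404 status when a resource is missing.",
]

keyword_scores = bm25_scores("termination notice period", docs)
synonym_scores = bm25_scores("why did the employee leave?", docs)
```

The keyword query scores only the contract clause, because it shares "termination" and "notice" with it. The paraphrased query shares no term with any document here, so every score is zero. In real text, shared stop words would give small non-zero scores without making the right chunk rank first.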

But BM25 matches words. When a query shares no vocabulary with its answer (“why did the employee leave?” won’t find “the staff member was terminated”), climb to a semantic tier. There are two:

retrieval="hybrid": BM25 prunes to a candidate pool, a dense model reorders only that pool. It embeds just the pool per query, so it scales to a whole folder of files.
retrieval="semantic": embeds every chunk and scores them all by meaning, for the highest recall when the question and answer share no words.

Hybrid: rerank a BM25 pool by meaning

retrieval="hybrid" runs two stages: BM25 narrows thousands of chunks to a small candidate pool (the prune depth is candidate_pool, default 50), then the dense model embeds only that pool plus the query and reorders the pool by cosine similarity. Because per-query embedding is bounded by the pool, not the corpus, it stays fast over large local collections and there’s nothing to persist. The one limit: it can only reorder what BM25 surfaced, so an answer that shares zero terms with a pure-synonym query can be missed entirely (use semantic for those).
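The two stages can be sketched in a few lines of plain Python. This is an illustration of the prune-then-rerank idea only, not RedHop's code: `lexical_score` is a crude stand-in for BM25, and `toy_embed` with its concept groups is a hypothetical substitute for a real embedding model. Only the `candidate_pool` name comes from the text above.

```python
import math

# Hypothetical concept groups standing in for a real embedding model.
CONCEPTS = [
    {"refund", "reimbursing"},
    {"shipping", "orders", "international"},
    {"privacy", "account", "data"},
]
embed_calls = []  # records every text that gets embedded


def toy_embed(text):
    """Toy embedder: one dimension per concept group (not a real model)."""
    embed_calls.append(text)
    tokens = text.lower().split()
    return [float(sum(t in group for t in tokens)) for group in CONCEPTS]


def lexical_score(query, chunk):
    """Stand-in for BM25: number of chunk tokens that occur in the query."""
    query_terms = set(query.lower().split())
    return sum(1 for t in chunk.lower().split() if t in query_terms)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query, chunks, embed, candidate_pool=50, top_k=3):
    """Return chunk indices: lexical prune first, dense rerank of the pool second."""
    # Stage 1: cheap lexical scoring over the whole corpus, keep the best pool.
    by_lexical = sorted(range(len(chunks)),
                        key=lambda i: lexical_score(query, chunks[i]),
                        reverse=True)
    pool = by_lexical[:candidate_pool]
    # Stage 2: embed only the query and the pool, then reorder by cosine.
    q_vec = embed(query)
    reranked = sorted(pool, key=lambda i: cosine(q_vec, embed(chunks[i])),
                      reverse=True)
    return reranked[:top_k]


chunks = [
    "refund policy for damaged goods",
    "shipping policy for international orders",
    "privacy policy for account data",
    "reimbursing customers in full",  # relevant, but shares no query term
]

narrow = hybrid_rank("refund policy", chunks, toy_embed, candidate_pool=3)
calls_narrow = len(embed_calls)  # one query embedding plus one per pooled chunk
wide = hybrid_rank("refund policy", chunks, toy_embed, candidate_pool=4)
```

With a pool of 3, the fourth chunk is never embedded and cannot be returned, which is the limit described above. Widening the pool to 4 lets the dense stage lift it to second place.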

Code is retrieved lexically, automatically. When a folder mixes code and prose, hybrid is type-aware: code files (.py, .ts, .rs, …) are ranked by BM25 only (exact identifiers and API names are what matter, and general-purpose embedders are weak on code) while prose/docs get the dense rerank. The two are merged with reciprocal rank fusion. So code is never needlessly embedded (faster), and an exact symbol match isn’t buried by a fuzzy semantic score.
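Reciprocal rank fusion itself is small enough to show in full. The sketch below is the standard formulation (each list contributes `1 / (k + rank)` per item). The constant `k=60` is the conventional default and the file names are invented. How RedHop parameterises its own merge is not stated here, so treat those details as assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each item earns 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical per-type rankings for one query.
code_ranking = ["auth.py", "session.ts", "token.rs"]  # ranked by BM25 only
prose_ranking = ["login-guide.md", "faq.md"]          # BM25 pool, dense rerank

fused = reciprocal_rank_fusion([code_ranking, prose_ranking])
```

Because fusion uses ranks rather than raw scores, a top BM25 hit on a code file and a top dense hit on a prose file land side by side, even though their scores live on different scales.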

Semantic: exhaustive, highest recall

The semantic tier embeds every chunk once (cached) and, for each query, computes the cosine similarity between the query and every chunk (exact, brute force), ranking by semantic similarity. A paraphrased answer that shares no words with the query is still found, because the match is by meaning, not terms.

It’s just cached vectors scanned by exact cosine: nothing to tune or persist. Per query it’s a brute-force scan whose cost is dominated by embedding the query, so it stays fast in practice. See the GLOBAL_DENSE finding.
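A brute-force cosine scan is simple enough to sketch directly. The class below is illustrative rather than RedHop's internals: `DenseIndex` is a hypothetical name, and the vectors are hand-written stand-ins for real embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so cosine reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class DenseIndex:
    """Hypothetical cache of chunk vectors, scanned exhaustively per query."""

    def __init__(self, chunk_vectors):
        # One-time cost: normalise every chunk vector once and keep it.
        self.vectors = [normalize(v) for v in chunk_vectors]

    def search(self, query_vector, top_k=3):
        q = normalize(query_vector)
        # Exact scan: one dot product per cached chunk, no approximation.
        scored = [(sum(a * b for a, b in zip(q, v)), i)
                  for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [i for _, i in scored[:top_k]]


# Hand-written 3-d vectors standing in for real chunk embeddings.
index = DenseIndex([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
top = index.search([1.0, 0.0, 0.0], top_k=2)
```

There is no index structure to build or tune: the only stored state is the list of normalised vectors, and every query touches all of them.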

Embedding every chunk once is a real one-time cost (comparable to building any vector index), after which warm queries take ~6ms (see Speed). The BM25 default skips that cost entirely, so it is the tier to reach for when setup speed matters.

How to enable it

Nothing to install: the semantic engine is built into the package. Enabling it takes one line. Name a model and RedHop downloads it on first use (cached after):

import redhop

# default is "lexical" (BM25, no model). Opt into dense by name:
doc = redhop.Document.from_text(text, options=redhop.DocumentOptions(retrieval="semantic", model="bge-small"))
ctx = doc.context("a paraphrased / semantic query")

const { Document } = require("redhop");

// default is "lexical" (BM25, no model). Opt into dense by name:
const doc = Document.fromText(text, { retrieval: "semantic", model: "bge-small" });
const ctx = doc.context("a paraphrased / semantic query");

use redhop::{text, LoadOptions};

// default is "lexical" (BM25, no model). Opt into dense by name:
let mut doc = text(input, &LoadOptions {
    retrieval: Some("semantic".into()),
    model: Some("bge-small".into()),
    ..Default::default()
})?;
let ctx = doc.context("a paraphrased / semantic query")?;

That’s it: no file paths, no export step. model="bge-small" fetches a small, well-tested model on first use and caches it. It’s still free and local. The built-in names:

`model=`	quality	notes
`"bge-small"`	strong	recommended, small and fast to index
`"bge-base"`	higher	larger / slower to index

For a fully offline install, pre-warm the cache once (or set HF_HUB_OFFLINE=1 after the first download).

Bring your own model (advanced)

Already have an embedding model, or need one that isn’t built in? Point RedHop at a local bi-encoder yourself: give the model + tokenizer files, its output dimension, and its pooling ("cls" for BGE, "mean" for MiniLM/GTE/E5):

doc = redhop.Document.from_text(text, options=redhop.DocumentOptions(
    retrieval="semantic",
    embedder_model="model/model.onnx",       # a local bi-encoder you provide
    embedder_tokenizer="model/tokenizer.json",
    embedder_dim=384,                        # the model's output size
    embedder_pooling="cls",                  # "cls" or "mean"
))

const doc = Document.fromText(text, {
  retrieval: "semantic",
  embedderModel: "model/model.onnx",        // a local bi-encoder you provide
  embedderTokenizer: "model/tokenizer.json",
  embedderDim: 384,                         // the model's output size
  embedderPooling: "cls",                   // "cls" or "mean"
});

let mut doc = text(input, &LoadOptions {
    retrieval: Some("semantic".into()),
    embedder_model: Some("model/model.onnx".into()),       // a local bi-encoder
    embedder_tokenizer: Some("model/tokenizer.json".into()),
    embedder_dim: Some(384),                               // the model's output size
    embedder_pooling: Some("cls".into()),                  // "cls" or "mean"
    ..Default::default()
})?;

For asymmetric models (the E5 family), which prefix queries and documents differently, also pass embedder_query_prefix="query: " and embedder_passage_prefix="passage: ". Symmetric models (BGE, MiniLM, GTE) need neither.
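For readers unsure what the pooling and prefix settings control, here is a minimal sketch of the two pooling modes and of asymmetric prefixing. It shows the general technique used by bi-encoders, not RedHop's internal code. The function names `pool` and `with_prefix` and the token vectors are hypothetical.

```python
def pool(token_embeddings, attention_mask, mode):
    """Collapse per-token vectors into a single sentence vector."""
    if mode == "cls":
        # CLS pooling: the first token's vector represents the whole input.
        return list(token_embeddings[0])
    if mode == "mean":
        # Mean pooling: average the vectors of real tokens, skipping padding.
        kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
        dim = len(token_embeddings[0])
        return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def with_prefix(text, prefix=""):
    """Asymmetric models expect a role marker in front of the raw text."""
    return prefix + text


# Three token vectors; the last one is padding (mask 0).
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

cls_vector = pool(tokens, mask, "cls")
mean_vector = pool(tokens, mask, "mean")
query_text = with_prefix("reset password", "query: ")
passage_text = with_prefix("Open Settings and choose Reset.", "passage: ")
```

Choosing the wrong pooling mode or omitting a required prefix still produces vectors of the right size, so the mistake shows up as quietly worse rankings rather than as an error.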

Cross-encoder reranking (optional second stage)

The tiers above rank with a bi-encoder (embed query and passages separately, compare by cosine). A cross-encoder scores each (query, passage) pair jointly. That’s more accurate, because it reads the query and passage together, but it can’t cache: it runs the model once per candidate at query time. In our method comparison the cross-encoder was the most reliable second stage, so it’s available as an opt-in rerank step on any tier:

# BM25 → bge rerank, then a cross-encoder re-scores the top pool:
doc = redhop.Document.from_text(text, options=redhop.DocumentOptions(retrieval="hybrid", rerank="cross-encoder"))
# also valid on the plain lexical tier — no embedder needed for the first stage:
doc = redhop.Document.from_text(text, options=redhop.DocumentOptions(rerank="cross-encoder"))

let doc = Document.fromText(text, { retrieval: "hybrid", rerank: "cross-encoder" });
// also valid on the plain lexical tier:
doc = Document.fromText(text, { rerank: "cross-encoder" });

use redhop::{text, LoadOptions};

let mut doc = text(input, &LoadOptions {
    retrieval: Some("hybrid".into()),
    rerank: Some("cross-encoder".into()),
    ..Default::default()
})?;

rerank="cross-encoder" fetches the MS-MARCO MiniLM reranker on first use (cached, like the embedding models) and reorders the first-stage candidate pool down to the chunks you keep. It costs a model call per candidate per query, so it’s off by default. Reach for it when you want maximum precision and the per-query latency is acceptable. One caveat from our findings: a cross-encoder applied uniformly can occasionally demote the low-relevance bridge evidence a multi-hop question needs, so measure on your own queries rather than assuming it’s always a win.
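The cost model is easy to see in a sketch: the pair scorer runs once per candidate, every query. Below, `toy_pair_score` is a hypothetical word-overlap stand-in for a real cross-encoder, and `cross_encoder_rerank` is an illustrative name, not a RedHop function.

```python
pair_calls = []  # one entry per (query, passage) pair the scorer sees


def toy_pair_score(query, passage):
    """Stand-in for a cross-encoder: counts passage words found in the query."""
    pair_calls.append((query, passage))
    query_terms = set(query.lower().split())
    return sum(1 for t in passage.lower().split() if t in query_terms)


def cross_encoder_rerank(query, candidates, score_pair, keep=3):
    """Score every (query, passage) pair jointly, keep the best passages."""
    # Nothing is cached between queries: one scorer call per candidate.
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


# A first-stage candidate pool (made up for this sketch).
candidates = [
    "billing cycles and invoices",
    "password rules: twelve characters minimum",
    "reset your password from the login page",
    "contact support by email",
]

kept = cross_encoder_rerank("how do i reset my password", candidates,
                            toy_pair_score, keep=2)
```

Four candidates mean four scorer calls for this one query. With a real model each call is a forward pass, which is why the pool size drives latency.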

When to use which

Query style	Use	Why
Keyword / exact terms	`retrieval="lexical"` (default, BM25)	fastest, zero-dep, already best here
Templated workload with heavy boilerplate	`retrieval="lexical"` + `Stripper` + `Vocabulary` via `context_with_rewrites(...)`	the cheapest path that wins (see the “two paths” rule below)
Semantic over many files / a folder	`retrieval="hybrid"` (BM25 prune → rerank)	embeds only the pool/query, scales to folders
Highest recall by meaning	`retrieval="semantic"` (global dense)	scores every chunk by meaning, finds answers BM25 never surfaces

RedHop has three tiers: lexical, hybrid, and semantic. Hybrid and semantic differ in mechanism: hybrid embeds only the BM25 candidate pool, so it scales to a whole folder, while semantic embeds every chunk for the highest recall when the question and answer share no words.

Templated workloads: the “two paths” rule. When every query in your workload follows a fixed wrapper (legal QA, support triage, form-filled queries), fixing the query at the boundary can substitute for the lift you’d get from retrieval="hybrid". On CUAD, hybrid + cross-encoder tops out at 89.0% retention at 683ms. BM25 default + analyze_query_set → Stripper(boilerplate) + Vocabulary({...}) via doc.context_with_rewrites(...) reaches 90.7% at ~2.5ms: higher retention and roughly 270× lower latency. The two paths fix the same underlying problem (boilerplate crowds out the discriminator) by different mechanisms, and running both gives diminishing returns. Pick one. See CUAD_HYBRID_RERANK for the 6-arm probe and Choosing a config → Templated queries for the full decision rule.
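To illustrate what "boilerplate crowds out the discriminator" means, here is a plain-Python sketch of detecting and stripping a shared query wrapper. It does not use or reproduce RedHop's analyze_query_set, Stripper, or Vocabulary. The helper names `find_boilerplate` and `strip_boilerplate` and the query wrapper are hypothetical.

```python
from collections import Counter


def find_boilerplate(queries, min_share=1.0):
    """Terms that occur in at least min_share of the queries in the workload."""
    counts = Counter(term for q in queries for term in set(q.lower().split()))
    return {term for term, n in counts.items() if n / len(queries) >= min_share}


def strip_boilerplate(query, boilerplate):
    """Drop wrapper terms so only the discriminating words reach retrieval."""
    return " ".join(t for t in query.lower().split() if t not in boilerplate)


# A hypothetical templated workload: same wrapper, different topic.
queries = [
    "highlight the clauses related to governing law",
    "highlight the clauses related to termination notice",
    "highlight the clauses related to audit rights",
]

boilerplate = find_boilerplate(queries)
stripped = [strip_boilerplate(q, boilerplate) for q in queries]
```

After stripping, the lexical scorer sees only "governing law", "termination notice", and "audit rights". The wrapper words no longer match every clause in the contract and dilute the ranking.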

Why dense, not BM25, for meaning. BM25 matches words. On a pure-synonym query that shares zero terms with the answer, BM25 simply can’t find it. Dense scores every chunk by semantic similarity, so a lexically disjoint answer is still reachable. On a controlled semantic-mismatch probe (engineered low-overlap answers plus lexical traps), recall@1 rises from 20% with BM25 to 88% with dense (96% at recall@3). On the lexical-overlap control slice the two tie, so dense doesn’t hurt the easy case.
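For readers who want to run the same kind of measurement on their own queries, recall@k is straightforward to compute. The sketch below uses invented rankings and gold labels. It is not the probe data behind the numbers above.

```python
def recall_at_k(ranked_results, gold, k):
    """Fraction of queries whose gold chunk appears in the top k results."""
    hits = sum(1 for ranking, answer in zip(ranked_results, gold)
               if answer in ranking[:k])
    return hits / len(gold)


# Toy data: one ranked list of chunk ids per query, plus the correct chunk id.
ranked_results = [
    ["c1", "c4", "c2"],
    ["c9", "c3", "c7"],
    ["c5", "c6", "c8"],
    ["c2", "c1", "c0"],
]
gold = ["c1", "c3", "c0", "c2"]

r_at_1 = recall_at_k(ranked_results, gold, k=1)
r_at_3 = recall_at_k(ranked_results, gold, k=3)
```

Running this once per tier over the same query set gives a like-for-like comparison between lexical, hybrid, and semantic on your own documents.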

Performance: dense costs about the same per query as plain BM25-pool reranking would (~7ms), because the query embedding dominates and exact cosine over the cached vectors is fast. lexical stays the default because most queries don’t need a model at all. Full evidence (the trade-offs for each tier) is in the evidence layer.
