Retrieval options

Retrieval is a ladder: start at the cheapest rung that works, climb only when your queries demand it.

The default is BM25 (retrieval="lexical") — zero dependencies, fully offline. The documents you reason over are often keyword-dense — contracts, API references, specs, manuals, logs, legal docs — where the words in the question are usually the words in the answer, so BM25 handles them with no model at all. Most document QA starts and ends here.
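To make "BM25 matches words" concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not RedHop's implementation: the tokenizer, the tiny stopword list, the k1 and b values, and the sample chunks are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not RedHop's).
STOPWORDS = {"the", "a", "an", "is", "was", "are", "did", "why"}

def tokenize(text):
    # Lowercase word tokens with stopwords removed.
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no shared word, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

chunks = [
    "The termination clause requires thirty days written notice.",
    "The staff member was terminated.",
    "Invoices are payable within sixty days.",
]

# A keyword query shares terms with the first chunk, so that chunk scores above zero.
keyword = bm25_scores("termination notice", chunks)
# A paraphrased query shares no content words with any chunk, so every score is zero.
paraphrase = bm25_scores("why did the employee leave?", chunks)
```

The keyword query lands on the clause that contains its words. The paraphrased query scores zero everywhere, which is the case the semantic tiers exist for.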

But BM25 matches words. When a query shares no vocabulary with its answer — “why did the employee leave?” won’t find “the staff member was terminated” — climb to a semantic tier. There are two:

  • retrieval="hybrid" — BM25 prunes to a candidate pool, a dense model reorders only that pool. It embeds just the pool per query, so it scales to a whole folder of files.
  • retrieval="semantic" — embeds every chunk and scores them all by meaning, for the highest recall when the question and answer share no words.

retrieval="hybrid" runs two stages. First, BM25 narrows thousands of chunks to a small candidate pool (the prune depth is candidate_pool, default 50). Then the dense model embeds only that pool plus the query and reorders the pool by cosine similarity. Because per-query embedding is bounded by the pool, not the corpus, it stays fast over large local collections, and there’s nothing to persist. The one limit: it can only reorder what BM25 surfaced, so an answer that shares zero terms with a pure-synonym query can be missed entirely (use semantic for those).
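The two stages can be sketched in a few lines. This is an illustration under stated assumptions, not RedHop's code: the word-overlap scorer stands in for BM25, the hand-written two-dimensional vectors stand in for a dense model, and hybrid_rank is a hypothetical helper name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_rank(query, chunks, lexical_score, embed, candidate_pool=50):
    # Stage 1: lexical prune. Keep the top `candidate_pool` chunks that the
    # lexical scorer surfaced at all (score > 0).
    scored = [(lexical_score(query, c), i) for i, c in enumerate(chunks)]
    pool = [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
    pool = pool[:candidate_pool]
    # Stage 2: embed only the pool plus the query, then reorder by cosine.
    q_vec = embed(query)
    return sorted(pool, key=lambda i: -cosine(q_vec, embed(chunks[i])))

def overlap(query, chunk):
    # Stand-in for BM25: number of shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

# Stand-in for a dense model: hand-written toy vectors.
TOY_VECTORS = {
    "refund policy for late delivery": [1.0, 0.0],
    "refund policy for gift cards": [0.6, 0.8],
    "a late parcel earns a refund": [0.9, 0.1],
    "money back when the parcel arrives behind schedule": [1.0, 0.05],
}

chunks = [
    "refund policy for gift cards",
    "a late parcel earns a refund",
    "money back when the parcel arrives behind schedule",
]
ranked = hybrid_rank("refund policy for late delivery", chunks, overlap, TOY_VECTORS.__getitem__)
```

Here the dense stage promotes chunk 1 over chunk 0, reversing the lexical order. Chunk 2 is closest in meaning but shares no word with the query, so it never enters the pool and is never embedded.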

Code is retrieved lexically, automatically. When a folder mixes code and prose, hybrid is type-aware: code files (.py, .ts, .rs, …) are ranked by BM25 only — exact identifiers and API names are what matter, and general-purpose embedders are weak on code — while prose/docs get the dense rerank. The two are merged with reciprocal rank fusion. So code is never needlessly embedded (faster), and an exact symbol match isn’t buried by a fuzzy semantic score.
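Reciprocal rank fusion itself is short enough to show. The sketch below is a generic version, assuming the commonly used constant k=60 and an alphabetical tie-break; the constant RedHop uses, how it breaks ties, and the chunk ids shown are not stated in this page and are illustrative only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical ids: code chunks ranked by BM25 only, prose chunks by the dense rerank.
code_ranked = ["auth.py#login", "session.ts#refresh"]
prose_ranked = ["README#auth", "guide#tokens"]
fused = reciprocal_rank_fusion([code_ranked, prose_ranked])
```

Because fusion uses only rank positions, a top BM25 hit in code and a top dense hit in prose receive the same score even though their raw scores are on different scales.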

The semantic tier embeds every chunk once (cached) and, for each query, computes the exact cosine similarity between the query and every chunk (a brute-force scan), ranking by semantic similarity. A paraphrased answer that shares no words with the query is still found, because the match is by meaning, not terms.

It’s just cached vectors scanned by exact cosine — nothing to tune or persist. Per query it’s a brute-force scan whose cost is dominated by embedding the query, so it stays fast in practice. See the GLOBAL_DENSE finding.
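A rough sketch of the idea: embed every chunk once, keep the vectors, and scan them exactly for each query. The class name, the toy lookup-table embedder, and its vectors are assumptions for illustration, not RedHop internals.

```python
import math

class BruteForceIndex:
    def __init__(self, chunks, embed):
        self.chunks = chunks
        self.embed = embed
        # One-time cost: embed and L2-normalize every chunk, then keep the vectors.
        self.vectors = [self._unit(embed(c)) for c in chunks]

    @staticmethod
    def _unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else list(v)

    def search(self, query, top_k=3):
        q = self._unit(self.embed(query))  # the only model call per query
        # Exact scan: on unit vectors, cosine similarity is the dot product.
        scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [i for _, i in scored[:top_k]]

# Toy embedder: a lookup table that also records how often it is called.
calls = []
TOY = {
    "why did the employee leave?": [0.9, 0.1, 0.0],
    "the staff member was terminated": [0.8, 0.2, 0.1],
    "the office moved to a new floor": [0.1, 0.9, 0.2],
    "quarterly revenue grew": [0.0, 0.1, 0.9],
}

def toy_embed(text):
    calls.append(text)
    return TOY[text]

chunks = [
    "the staff member was terminated",
    "the office moved to a new floor",
    "quarterly revenue grew",
]
index = BruteForceIndex(chunks, toy_embed)
top = index.search("why did the employee leave?", top_k=1)
```

In this toy run the embedder is called three times at build time and once for the query, and the paraphrased chunk ranks first despite sharing no content words with the question.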

Embedding every chunk once is a real one-time cost, comparable to building any vector index, after which warm queries take about 6ms (see Speed). The BM25 default skips it entirely; that’s the tier to reach for when setup speed matters.

Nothing to install — the semantic engine is built into the package. It’s just one line: name a model and RedHop downloads it on first use (cached after):

import redhop
# default is "lexical" (BM25, no model). Opt into dense by name:
doc = redhop.Document.from_text(text, retrieval="semantic", model="bge-small")
ctx = doc.context("a paraphrased / semantic query")

That’s it — no file paths, no export step. model="bge-small" fetches a small, well-tested model on first use and caches it; it’s still free and local. The built-in names:

model= | quality | notes
--- | --- | ---
"bge-small" | strong | recommended; small and fast to index
"bge-base" | higher | larger / slower to index

For a fully offline install, pre-warm the cache once (or set HF_HUB_OFFLINE=1 after the first download).

Already have an embedding model, or need one that isn’t built in? Point RedHop at a local bi-encoder yourself — give the model + tokenizer files, its output dimension, and its pooling ("cls" for BGE, "mean" for MiniLM/GTE/E5):

doc = redhop.Document.from_text(
    text,
    retrieval="semantic",
    embedder_model="model/model.onnx",  # a local bi-encoder you provide
    embedder_tokenizer="model/tokenizer.json",
    embedder_dim=384,  # the model's output size
    embedder_pooling="cls",  # "cls" or "mean"
)
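For readers unsure which pooling their model expects, the sketch below shows what the two modes conventionally compute from a model's per-token output. It describes the general technique, not RedHop's internals; the pool function and the toy token vectors are illustrative assumptions.

```python
def pool(token_vectors, attention_mask, mode):
    """Collapse per-token vectors into one sentence vector."""
    if mode == "cls":
        # Take the first token's vector (the [CLS] position).
        return list(token_vectors[0])
    if mode == "mean":
        # Average the vectors of real tokens, skipping padding (mask == 0).
        kept = [v for v, m in zip(token_vectors, attention_mask) if m]
        return [sum(col) / len(kept) for col in zip(*kept)]
    raise ValueError(f"unknown pooling mode: {mode!r}")

# Three real tokens plus one padding row that mean pooling must ignore.
tokens = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
cls_vec = pool(tokens, mask, "cls")
mean_vec = pool(tokens, mask, "mean")
```

Choosing the wrong mode usually still produces vectors of the right size, so the mistake shows up as poor ranking rather than an error. Check the model card for the pooling it was trained with.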

For asymmetric models (the E5 family), which prefix queries and documents differently, also pass embedder_query_prefix="query: " and embedder_passage_prefix="passage: ". Symmetric models (BGE, MiniLM, GTE) need neither.

Cross-encoder reranking (optional second stage)

The tiers above rank with a bi-encoder (embed query and passages separately, compare by cosine). A cross-encoder scores each (query, passage) pair jointly — more accurate, because it reads the query and passage together, but it can’t cache: it runs the model once per candidate at query time. In our method comparison the cross-encoder was the most reliable second stage, so it’s available as an opt-in rerank step on any tier:

# BM25 → bge rerank, then a cross-encoder re-scores the top pool:
doc = redhop.Document.from_text(text, retrieval="hybrid", rerank="cross-encoder")
# also valid on the plain lexical tier — no embedder needed for the first stage:
doc = redhop.Document.from_text(text, rerank="cross-encoder")

rerank="cross-encoder" fetches the MS-MARCO MiniLM reranker on first use (cached, like the embedding models) and reorders the first-stage candidate pool down to the chunks you keep. It costs a model call per candidate per query, so it’s off by default — reach for it when you want maximum precision and the per-query latency is acceptable. One caveat from our findings: a cross-encoder applied uniformly can occasionally demote the low-relevance bridge evidence a multi-hop question needs, so measure on your own queries rather than assuming it’s always a win.
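The cost profile is easy to see in a sketch. Below, a toy pair scorer stands in for the cross-encoder model and counts its own calls; cross_encoder_rerank and toy_pair_score are hypothetical names, and the scoring rule is an assumption made only so the example runs without a model.

```python
def cross_encoder_rerank(query, candidates, score_pair, keep=3):
    # One model call per (query, passage) pair. Nothing can be precomputed,
    # because the score depends on both texts read together.
    scored = [(score_pair(query, passage), i) for i, passage in enumerate(candidates)]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [candidates[i] for _, i in scored[:keep]]

pair_calls = []

def toy_pair_score(query, passage):
    # Stand-in for a cross-encoder: fraction of query words found in the passage.
    pair_calls.append((query, passage))
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

candidates = [
    "thirty days after invoice",
    "notice period is thirty days",
    "the notice must be written",
]
kept = cross_encoder_rerank("notice period days", candidates, toy_pair_score, keep=2)
```

Three candidates cost three scorer calls for this single query. With a real model and a pool of 50, that is 50 forward passes per query, which is why the step is opt-in.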

Query style | Use | Why
--- | --- | ---
Keyword / exact terms | retrieval="lexical" (default, BM25) | fastest, zero-dep, already best here
Semantic over many files / a folder | retrieval="hybrid" (BM25 prune → rerank) | embeds only the pool and query, so it scales to folders
Highest recall by meaning | retrieval="semantic" (global dense) | scores every chunk by meaning; finds answers BM25 never surfaces

RedHop has three tiers: lexical, hybrid, and semantic. The split between hybrid and semantic is one of mechanism: hybrid embeds only the BM25 candidate pool, so it scales to a whole folder; semantic embeds every chunk, for the highest recall when the question and answer share no words.

Why dense, not BM25, for meaning. BM25 matches words, so on a pure-synonym query that shares zero terms with the answer, it simply can’t find it. Dense scores every chunk by semantic similarity, so a lexically disjoint answer is still reachable. On a controlled semantic-mismatch probe (engineered low-overlap answers plus lexical traps), recall@1 rises from 20% with BM25 to 88% with dense (96% at recall@3); on the lexical-overlap control slice the two tie, so dense doesn’t hurt the easy case.
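If you want to run the same kind of check on your own queries, recall@k is simple to compute. The sketch below assumes one known gold chunk per query; the rankings and gold ids are made-up toy data, not the probe results quoted above.

```python
def recall_at_k(ranked_ids_per_query, gold_id_per_query, k):
    """Fraction of queries whose gold chunk appears in the top k results."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_id_per_query)

# Toy data: four queries, each with a ranked list of chunk ids and one gold id.
rankings = [[3, 1, 2], [0, 4, 2], [5, 6, 7], [2, 9, 8]]
gold = [3, 2, 7, 1]
r1 = recall_at_k(rankings, gold, k=1)
r3 = recall_at_k(rankings, gold, k=3)
```

Running this once per tier over the same query set gives a like-for-like comparison of lexical, hybrid, and semantic retrieval on your own documents.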

Performance: dense costs about the same per query as plain BM25-pool reranking would (~7ms), because the query embedding dominates and exact cosine over the cached vectors is fast. lexical stays the default because most queries don’t need a model at all. Full evidence, including the trade-offs for each tier, is in the evidence layer.

Next: retrieval & context tips — operational laws for getting more from your context · how RedHop compares to LangChain and LlamaIndex.