
Overview

RedHop is a reasoning-preserving context runtime. It sits between your documents and an LLM: you hand it text and a question, and it returns the context the model should actually see — chunking, retrieving, and allocating internally, and explaining what it did.

Retrieval quality is not the same thing as reasoning quality. Transformers tolerate irrelevant context better than they tolerate missing reasoning links.

So RedHop optimizes for keeping the evidence a question actually needs, and it only intervenes when intervention is measured to help — large, diluted contexts get pruned; small ones are left alone.

  • Loading + parsing — text, PDF/DOCX/PPTX/XLSX, and whole folders (with citations)
  • Chunking
  • Internal retrieval (lexical by default; optional dense retrieval — see below)
  • Context allocation under a token budget
  • Reasoning-safe, conditional optimization
  • Observability and token economics
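To make "context allocation under a token budget" and "conditional optimization" concrete, here is a minimal sketch. It is not RedHop's allocator: the greedy-by-score policy, the `ScoredChunk` type, and the whitespace `count_tokens` stand-in are all assumptions made for illustration. It leaves a context that already fits untouched and prunes only when the budget is exceeded.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # relevance from any retriever; higher is better


def count_tokens(text: str) -> int:
    # Hypothetical stand-in: whitespace tokens, not a real model tokenizer.
    return len(text.split())


def allocate(chunks: list[ScoredChunk], budget: int) -> list[ScoredChunk]:
    """Keep the highest-scoring chunks that fit in `budget` tokens."""
    total = sum(count_tokens(c.text) for c in chunks)
    if total <= budget:
        # Small context: leave it alone.
        return list(chunks)

    kept_ids = set()
    used = 0
    # Visit chunks from most to least relevant and keep what still fits.
    for i, chunk in sorted(enumerate(chunks), key=lambda p: -p[1].score):
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            kept_ids.add(i)
            used += cost

    # Restore document order so the surviving evidence reads in sequence.
    return [c for i, c in enumerate(chunks) if i in kept_ids]


chunks = [
    ScoredChunk("the warranty lasts two years", 0.9),
    ScoredChunk("our office has a nice view of the bay", 0.1),
    ScoredChunk("claims must be filed within thirty days", 0.7),
]
kept = allocate(chunks, budget=12)
print([c.text for c in kept])
```

With a 12-token budget the low-scoring middle chunk is dropped; with a budget that covers all 21 tokens, the input comes back unchanged.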

Retrieval is a ladder — start cheap, climb only when you must

Begin at the cheapest rung that works, and step up only when your queries demand it.

Rung (retrieval=) | Dependency | Reach for it when
"lexical": BM25 (default) | none (zero model, fully offline) | the answer shares words with the query (most document QA)
"hybrid": BM25 prune → dense rerank | a model name (model="bge-small" auto-downloads) | semantic search over many files / a folder
"semantic": global dense | same | highest recall; scores every chunk by meaning

1 · Lexical handles most document QA. The documents you reason over are often keyword-dense — contracts, API references, specs, manuals, logs, legal docs — where the words in the question are the words in the answer. BM25 handles those with zero model, zero infra, indexing in milliseconds. Most document QA starts and ends here.
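To show why shared words are enough for this kind of query, here is a small, self-contained BM25 scorer. It is a textbook sketch, not RedHop's engine: the whitespace tokenizer, the `k1`/`b` defaults, and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplified analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms weigh more; this idf variant is never negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "the termination clause requires thirty days notice",
    "payment is due within thirty days",
    "the office is closed on public holidays",
]
scores = bm25_scores("termination notice", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. No model or index service is involved.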

2 · Semantic, when words stop matching meaning — two tiers:

  • "hybrid" — re-ranks a BM25 candidate pool by meaning; it only embeds that pool per query, so it scales to a whole folder.
  • "semantic" — ranks every chunk by meaning, for the highest recall when the question and the answer share no words.
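The two-stage shape of "hybrid" can be sketched in a few lines. Everything here is a stand-in chosen for illustration: the term-count `lexical_score` replaces BM25, and the hand-written `CONCEPTS` table replaces a real embedding model. Only the structure (lexical prune, then cosine rerank of the pool) mirrors the description above.

```python
import math

# Hypothetical toy "embedding": count words per hand-written concept.
# Index 0 = vehicle, index 1 = money. A real model would replace this.
CONCEPTS = {"car": 0, "automobile": 0, "cost": 1, "price": 1}


def embed(text: str) -> list[float]:
    vec = [0.0, 0.0]
    for word in text.lower().split():
        if word in CONCEPTS:
            vec[CONCEPTS[word]] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lexical_score(query: str, chunk: str) -> int:
    # Stand-in for BM25: how often the query terms occur in the chunk.
    words = chunk.lower().split()
    return sum(words.count(term) for term in set(query.lower().split()))


def hybrid(query: str, chunks: list[str], pool_size: int = 50, top_k: int = 3) -> list[int]:
    # Stage 1: cheap lexical prune down to a candidate pool.
    by_lexical = sorted(range(len(chunks)), key=lambda i: -lexical_score(query, chunks[i]))
    pool = by_lexical[:pool_size]
    # Stage 2: embed only the pool and the query, then reorder by cosine.
    query_vec = embed(query)
    return sorted(pool, key=lambda i: -cosine(query_vec, embed(chunks[i])))[:top_k]


chunks = [
    "car car car dealership hours",
    "automobile price and cost breakdown",
    "bananas are yellow",
    "shipping cost estimate",
]
ranked = hybrid("car cost", chunks, pool_size=3)
print(ranked)
```

Chunk 0 wins the lexical stage on raw term count, but chunk 1 moves to the top after the rerank because it covers both concepts in the query. Chunk 2 never reaches the pool, so it is never embedded.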

You just name an embedding model; how each tier actually ranks — cosine, fusion, and optional reranking — is in How the search works below.

Retrieval options — the tiers, how to enable them, and what each asks of you.

  • Lexical (lexical) ranks by BM25 — classic term-frequency scoring over an in-process inverted index. No model, no embeddings.
  • Hybrid (hybrid) runs two stages: BM25 narrows the corpus to a candidate pool (default 50 chunks), then a local embedding model encodes that pool and the query and reorders by cosine similarity. Only the pool is embedded per query.
  • Semantic (semantic) encodes every chunk once, caches the vectors, and ranks them all by exact cosine against the query. This is a brute-force scan with no approximate-nearest-neighbour index, so scan cost grows linearly with the number of chunks; because the chunk vectors are cached, the only model call per query is embedding the query itself.
  • Mixed corpora (code + prose) under hybrid are merged with reciprocal rank fusion: code is ranked lexically (exact identifiers matter; general embedders are weak on code), prose by cosine, and the two ranked lists are fused.
  • Optional cross-encoder rerank (rerank="cross-encoder") adds a precise second stage on any tier — it jointly encodes each (query, passage) pair and reorders the pool, more accurate than cosine, at a model call per candidate.
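Reciprocal rank fusion, as used for mixed corpora above, only looks at positions in each ranked list, never at raw scores. A minimal sketch follows; the constant `k=60` is the value commonly used in the literature and is an assumption here, not a documented RedHop default, and the chunk labels are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Python's sort is stable, so ties keep first-seen order.
    return sorted(scores, key=lambda item: -scores[item])


# Hypothetical inputs: code chunks ranked lexically, prose chunks by cosine.
code_ranked = ["parse_config()", "load_file()"]
prose_ranked = ["configuration overview", "install guide"]

fused = reciprocal_rank_fusion([code_ranked, prose_ranked])
print(fused)
```

Because the two lists hold different chunks, fusion interleaves them by rank: both rank-1 items come first, then both rank-2 items. BM25 scores and cosine similarities are never compared directly, which is the point of a rank-based method.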

You supply an embedding model only for the dense tiers — named once and auto-downloaded on first use; the lexical default needs none.

import redhop

doc = redhop.Document.from_text(text)  # documents
ctx = doc.context(query)               # + queries → context

You think in documents and queries. Retrieval is an implementation detail.


Under the Python surface, RedHop is a Rust library for retrieval infrastructure: chunking, retrieval, reranking, and diagnostics. It does not generate text or bundle an embedding model — embedding plugs in through a trait boundary. RedHop’s contribution is the orchestration between these stages and the diagnostics engine that makes retrieval quality observable from text alone.

RedHop ships as a single redhop crate (one Python wheel, one npm package, one Cargo crate). Internally it is organized as modules; each layer above the core depends only on the trait surface below it, not on sibling implementations.

redhop            single published crate
├── document      high-level façade (Document, read_file, …)
├── context       budget-aware assembly + Decision Report
├── chunking
├── retrieval
├── reranking     (under feature "semantic")
├── embeddings    (under feature "semantic")
├── files         (under feature "files")
└── core          traits + types

redhop::core (re-exported as redhop::traits) defines the pluggable abstractions — the entire contract a caller has to understand:

Trait | Owns
TokenizerBackend | Token counting, sentence segmentation, truncation.
Chunker | Document → Vec<Chunk>.
EmbeddingProvider | &[String] → Vec<Embedding>.
Retriever | Query → Vec<RetrievalResult> + ingest.
Reranker | Reorder candidate results.
DiagnosticsEngine | (Query, &[RetrievalResult]) → DiagnosticsReport.

Document(s)
  → chunker.chunk_batch          → Vec<Chunk> (optionally + Embedding)
  → retriever.index              → [state]
  → retriever.retrieve(q, k)     → Vec<RetrievalResult> (score + ScoreBreakdown)
  → reranker.rerank (optional)   → reordered top_k
  → diagnostics.diagnose         → DiagnosticsReport
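The traits themselves are Rust, but the idea that each stage depends only on an interface can be illustrated with Python protocols. This is a loose analogy under stated assumptions: the `Chunker` and `Retriever` protocols below are simplified, and `SentenceChunker`, `OverlapRetriever`, and `run_pipeline` are hypothetical names that do not exist in RedHop.

```python
from typing import Protocol


class Chunker(Protocol):
    def chunk(self, document: str) -> list[str]: ...


class Retriever(Protocol):
    def index(self, chunks: list[str]) -> None: ...
    def retrieve(self, query: str, k: int) -> list[tuple[str, float]]: ...


class SentenceChunker:
    # Toy implementation: split on full stops.
    def chunk(self, document: str) -> list[str]:
        return [s.strip() for s in document.split(".") if s.strip()]


class OverlapRetriever:
    # Toy implementation: score by number of shared lowercase words.
    def __init__(self) -> None:
        self._chunks: list[str] = []

    def index(self, chunks: list[str]) -> None:
        self._chunks = list(chunks)

    def retrieve(self, query: str, k: int) -> list[tuple[str, float]]:
        q = set(query.lower().split())
        scored = [(c, float(len(q & set(c.lower().split())))) for c in self._chunks]
        scored.sort(key=lambda pair: -pair[1])
        return scored[:k]


def run_pipeline(document: str, query: str, chunker: Chunker, retriever: Retriever, k: int = 2):
    # The pipeline sees only the protocols, never a concrete class.
    chunks = chunker.chunk(document)
    retriever.index(chunks)
    return retriever.retrieve(query, k)


document = "Refunds are issued within ten days. The office is in Oslo. Refunds require a receipt."
results = run_pipeline(document, "refunds receipt", SentenceChunker(), OverlapRetriever())
print(results)
```

Swapping in a different chunker or retriever requires no change to `run_pipeline`, which is the property the layering above is meant to guarantee.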

Hybrid retrieval fans the query out to several sub-retrievers in parallel and fuses them with Reciprocal Rank Fusion by default — rank-based and scale-free, the right pick for heterogeneous score distributions. Weighted-sum fusion with min-max normalization is available when scores are commensurable.
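For the alternative mentioned above, weighted-sum fusion with min-max normalization, a minimal sketch looks like this. The function names, the equal weights, and the sample scores are illustrative assumptions, not RedHop's API or defaults.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: this retriever expresses no preference.
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def weighted_sum(score_sets: list[dict[str, float]], weights: list[float]) -> list[tuple[str, float]]:
    fused: dict[str, float] = {}
    for scores, weight in zip(score_sets, weights):
        for item, s in min_max(scores).items():
            fused[item] = fused.get(item, 0.0) + weight * s
    return sorted(fused.items(), key=lambda pair: -pair[1])


# Hypothetical raw scores on very different scales.
bm25 = {"a": 12.0, "b": 6.0, "c": 0.0}
dense = {"a": 0.25, "b": 0.75, "c": 0.5}

fused = weighted_sum([bm25, dense], weights=[0.5, 0.5])
print(fused)
```

Unlike rank fusion, this keeps score magnitudes, so it only makes sense when the normalized scores mean comparable things across retrievers. A single outlier score can compress everything else toward zero, which is one reason a rank-based default is the safer choice.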

Embeddings aren’t bundled. Forcing one model into the library ties users to a single quality/latency/cost point and pulls in heavy runtime dependencies. The EmbeddingProvider trait is async and batch-friendly, so any backend plugs in cleanly — and the default Document path needs none.

Diagnostics are first-class. Retrieval failure modes are observable from text alone — you don’t need the LLM to know you served it a context full of distractors. The engine computes its metrics on every query with no model dependence, and emits machine-readable warning codes (low_lexical_grounding, high_distractor_ratio, retrieval_saturated) for monitoring and adaptive routing.
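As an illustration of diagnostics computed from text alone, the sketch below derives two metrics and emits two of the warning codes named above. The metric definitions and the thresholds are assumptions invented for this example; the source does not specify how RedHop computes them.

```python
def diagnose(query: str, chunks: list[str],
             min_grounding: float = 0.5, max_distractors: float = 0.5) -> dict:
    q_terms = set(query.lower().split())
    chunk_terms = [set(c.lower().split()) for c in chunks]

    # Assumed definition: share of query terms found in at least one chunk.
    covered = {t for t in q_terms if any(t in terms for terms in chunk_terms)}
    grounding = len(covered) / len(q_terms) if q_terms else 0.0

    # Assumed definition: share of chunks with no query term at all.
    distractors = sum(1 for terms in chunk_terms if not (terms & q_terms))
    ratio = distractors / len(chunks) if chunks else 0.0

    warnings = []
    if grounding < min_grounding:
        warnings.append("low_lexical_grounding")
    if ratio > max_distractors:
        warnings.append("high_distractor_ratio")
    return {"lexical_grounding": grounding, "distractor_ratio": ratio, "warnings": warnings}


report = diagnose(
    "refund policy deadline",
    ["the refund is processed weekly", "our mascot is a fox", "lunch is served at noon"],
)
print(report)
```

No model is called at any point: both numbers come from set operations on the query and the retrieved text, so a check of this kind can run on every query and feed monitoring or routing logic.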

Chunking is core. Chunk boundaries determine evidence density and topical purity, which dominate the metrics that matter — and chunk granularity is the measured lever (see benchmarks). AdaptiveChunker is the long-term home for evidence-aware chunking; today it pairs sentence segmentation with a Jaccard cohesion gate.
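A Jaccard cohesion gate over sentences can be sketched as follows. This is not AdaptiveChunker: the regex sentence splitter, the comparison against the previous sentence only, and the `0.2` threshold are assumptions chosen to keep the example short.

```python
import re


def sentences(text: str) -> list[str]:
    # Naive segmentation (assumption): split after ., ! or ? plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def chunk_by_cohesion(text: str, threshold: float = 0.2) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences(text):
        # Gate: start a new chunk when word overlap with the previous
        # sentence drops below the threshold.
        if current and jaccard(words(current[-1]), words(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The invoice lists the total amount. "
    "The total amount on the invoice includes tax. "
    "Penguins live in Antarctica."
)
result = chunk_by_cohesion(text)
print(result)
```

The first two sentences share four of eight distinct words (Jaccard 0.5) and stay together; the third shares none and opens a new chunk. The gate is conservative in the sense that it uses only surface vocabulary, with no model deciding where topics change.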

A proven BM25 engine. Lexical retrieval is solved: production analyzers, fast scoring, an in-memory index for embeddable use. We build on a mature engine behind the Retriever trait rather than reinventing it.

Exact cosine, in process. Dense retrieval scores the query against locally computed embeddings with exact cosine — no ANN, no external index. Correct by construction, and it keeps the whole pipeline embeddable with nothing to operate.
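Exact cosine search needs no index structure beyond a list of vectors. The sketch below normalizes chunk vectors once and then scores every one of them per query; the `ExactCosineIndex` class and the two-dimensional vectors are hypothetical, made up for illustration.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class ExactCosineIndex:
    def __init__(self, chunk_vectors: list[list[float]]) -> None:
        # Normalize once and cache; cosine then reduces to a dot product.
        self._unit = [normalize(v) for v in chunk_vectors]

    def search(self, query_vector: list[float], k: int) -> list[tuple[int, float]]:
        q = normalize(query_vector)
        # Brute force: one dot product per chunk, nothing skipped.
        scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(self._unit)]
        scored.sort(key=lambda pair: -pair[0])
        return [(i, score) for score, i in scored[:k]]


index = ExactCosineIndex([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
hits = index.search([1.0, 0.0], k=2)
print(hits)
```

Every chunk is compared, so the ranking is exact rather than approximate. The cost is a scan that grows linearly with the number of chunks, which is the trade made for having nothing external to operate.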

Deliberately out of scope:

  • Fake-AI boundary detection in chunking: a conservative lexical-cohesion gate ships today; the rest is roadmapped, not faked.
  • Speculative topology, knowledge-graph retrieval, and semantic-continuity heuristics: research, not infrastructure.
  • LLM integrations: once retrieval returns, RedHop is done; what comes after is the caller’s problem.

To recap: a from_text → context surface, a lexical default that needs no model, and a diagnostics engine on every call. One tier asks for a dependency — dense retrieval, for the semantic and paraphrase queries BM25 misses. It’s a real trade-off (a one-time embed cost and a model download), so it gets its own page:

Retrieval options — when to reach for dense retrieval, how to enable it, and exactly what it asks of you.

Next: Retrieval options · retrieval & context tips · how RedHop compares to LangChain and LlamaIndex.