Overview

RedHop is a reasoning-preserving context runtime. It sits between your documents and an LLM: you hand it text and a question, and it returns the context the model should actually see. It chunks, retrieves, and allocates internally, and explains what it did.

The one idea

Retrieval quality is not the same thing as reasoning quality. Transformers tolerate irrelevant context better than they tolerate missing reasoning links.

So RedHop optimizes for keeping the evidence a question actually needs, and it only intervenes when intervention is measured to help: large, diluted contexts get pruned, small ones are left alone.

What it owns

Loading + parsing: text, PDF/DOCX/PPTX/XLSX, and whole folders (with citations)
Chunking
Internal retrieval (lexical by default, with optional dense retrieval covered below)
Context allocation under a token budget
Reasoning-safe, conditional optimization
Observability and token economics
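The "prune only when it helps" rule and the token budget can be pictured with a small sketch. This is an illustration under stated assumptions, not RedHop's allocator: `Chunk`, `count_tokens`, and `allocate` are hypothetical names, tokens are counted by whitespace, and the policy is a simple greedy pick by score.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance from any retriever (higher is better)


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens. A real tokenizer would count differently.
    return len(text.split())


def allocate(chunks: list[Chunk], budget: int) -> tuple[list[Chunk], str]:
    """Return the chunks to send plus a short explanation of the decision."""
    total = sum(count_tokens(c.text) for c in chunks)
    if total <= budget:
        # Small context: leave it alone.
        return list(chunks), "under budget: passed through unchanged"

    # Large context: greedily keep the best-scoring chunks that still fit.
    kept_ids: set[int] = set()
    used = 0
    ranked = sorted(enumerate(chunks), key=lambda p: p[1].score, reverse=True)
    for i, chunk in ranked:
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            kept_ids.add(i)
            used += cost

    # Restore document order so the surviving evidence reads coherently.
    kept = [c for i, c in enumerate(chunks) if i in kept_ids]
    reason = f"over budget: kept {len(kept)} of {len(chunks)} chunks ({used} tokens)"
    return kept, reason
```

The returned reason string stands in for the explanation RedHop attaches to each decision. The real report is richer than a single sentence.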

Retrieval is a ladder: start cheap, climb only when you must

Begin at the cheapest rung that works, and step up only when your queries demand it.

Rung (`retrieval=`)	Dependency	Reach for it when
`"lexical"`, BM25 (default)	none (zero model, fully offline)	the answer shares words with the query (most document QA)
`"hybrid"`, BM25 prune → dense rerank	a model name (`model="bge-small"` auto-downloads)	semantic search over many files / a folder
`"semantic"`, global dense	same	highest recall, scores every chunk by meaning

1 · Lexical handles most document QA. The documents you reason over are often keyword-dense (contracts, API references, specs, manuals, logs, legal docs), where the words in the question are the words in the answer. BM25 handles those with zero model, zero infra, indexing in milliseconds. Most document QA starts and ends here.
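To make "the words in the question are the words in the answer" concrete, here is a minimal BM25 sketch in pure Python. It assumes the common Okapi formulation with a Lucene-style IDF and conventional defaults (`k1=1.5`, `b=0.75`). The `BM25Index` class and its tokenizer are illustrative and say nothing about the analyzers or parameters RedHop's engine uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(docs) if docs else 0.0
        # Document frequency: how many documents contain each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log(1 + (len(self.tfs) - n + 0.5) / (n + 0.5))

    def score(self, query: str, i: int) -> float:
        tf, length = self.tfs[i], self.lengths[i]
        avg = self.avg_len or 1.0  # guard against an all-empty corpus
        norm = self.k1 * (1 - self.b + self.b * length / avg)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in set(tokenize(query))
            if t in tf
        )

    def search(self, query: str, k: int = 3) -> list[tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.tfs))]
        hits = [(i, s) for i, s in scored if s > 0]
        return sorted(hits, key=lambda p: p[1], reverse=True)[:k]
```

Rare terms get a higher IDF than common ones, which is why a keyword-dense question tends to land on the right clause without any model. It is also why a query with no shared vocabulary returns nothing, and that is the case the dense tiers exist for.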

1.5 · Before climbing, sharpen the query. When lexical seems to plateau, the remaining gap is often query-side dilution rather than true semantic mismatch. On templated workloads (legal QA, support triage, form-filled queries) the same wrapper repeats across every query, drowning out the discriminating terms in BM25’s score. That’s fixable at the query boundary with three primitives that ship in the API: analyze_query_set (detects the template), Stripper(boilerplate) (compiled token-level boilerplate removal), and Vocabulary({key: [syns]}) (workload-curated high-IDF synonyms). They run as a chain through doc.context_with_rewrites(query, [stripper, vocab]), with the per-stage audit trail surfaced on ctx.report.query_rewrites so every rewrite is observable.

On CUAD, BM25 + Stripper + a workload-curated Vocabulary (34 keys, 121 synonyms compiled against CUAD’s clause names) lifts RedHop’s retention to 90.7% at ~2.5ms/query, higher than hybrid + cross-encoder (89%) and 270× faster. Stripper alone reaches 87.7% with ~12 stopword-like terms; the Vocabulary lift is the part that takes workload-specific authoring. Note (n=300, fair preprocessing): the same Stripper applied to other systems’ queries also helps them (LlamaIndex 86% → 94%), so the recipe’s value is the reproducible in-process workflow with audit trail, not an architectural retrieval lead. See CUAD_HYBRID_RERANK for the measurement and Choosing a configuration for the workflow.

The deciding question isn’t which tier, it’s what shape your queries and corpus are. If the gap is template dilution, climb the query, not the tier.
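The strip-then-expand chain can be pictured with a small stand-alone sketch. It is a plain-Python illustration of the idea only. `strip_boilerplate`, `expand_vocabulary`, and `rewrite` are hypothetical helpers, not the Stripper or Vocabulary classes themselves, and the audit record shape is an assumption.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_boilerplate(query: str, boilerplate: set[str]) -> str:
    """Drop template tokens so the discriminating terms dominate the score."""
    return " ".join(t for t in tokenize(query) if t not in boilerplate)


def expand_vocabulary(query: str, vocabulary: dict[str, list[str]]) -> str:
    """Append curated synonyms for every key that appears in the query."""
    tokens = tokenize(query)
    extra: list[str] = []
    for key, synonyms in vocabulary.items():
        if key in tokens:
            extra.extend(s for s in synonyms if s not in tokens and s not in extra)
    return " ".join(tokens + extra)


def rewrite(query: str, stages) -> tuple[str, list[dict]]:
    """Run (name, function) stages in order and record what each one changed."""
    audit = []
    for name, stage in stages:
        after = stage(query)
        audit.append({"stage": name, "before": query, "after": after})
        query = after
    return query, audit


# Hypothetical template words and synonyms, chosen for the demo only.
BOILERPLATE = {"highlight", "the", "parts", "if", "any", "of", "this",
               "contract", "related", "to"}
VOCABULARY = {"termination": ["terminate", "cancel"]}

STAGES = [
    ("strip", lambda q: strip_boilerplate(q, BOILERPLATE)),
    ("vocab", lambda q: expand_vocabulary(q, VOCABULARY)),
]
```

Calling `rewrite("Highlight the parts (if any) of this contract related to termination", STAGES)` yields the query `termination terminate cancel` plus a two-entry audit list. Keeping the before and after of every stage is what makes a rewrite debuggable when retrieval changes.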

2 · Semantic, when words stop matching meaning. Two tiers cover the case where sharpening the query isn’t enough and the answer really does share no words with the question (paraphrase, synonym mismatch, vocabulary drift):

"hybrid" re-ranks a BM25 candidate pool by meaning. It only embeds that pool per query, so it scales to a whole folder. It is also a one-knob alternative to template stripping on boilerplate-heavy workloads. See the “two paths” guidance for the trade.
"semantic" ranks every chunk by meaning, for the highest recall when the question and the answer share no words.

You just name an embedding model. How each tier actually ranks (cosine, fusion, and optional reranking) is in How the search works below.

→ Retrieval options: the tiers, how to enable them, and what each asks of you.

How the search works

Lexical (lexical) ranks by BM25, classic term-frequency scoring over an in-process inverted index. No model, no embeddings.
Hybrid (hybrid) runs two stages: BM25 narrows the corpus to a candidate pool (default 50 chunks), then a local embedding model encodes that pool and the query and reorders by cosine similarity. Only the pool is embedded per query.
Semantic (semantic) encodes every chunk once, caches the vectors, and ranks them all by exact cosine against the query. This is a brute-force scan (no approximate-nearest-neighbor index): the scan is linear in the number of chunks, but because chunk vectors are cached, the only model call per query is embedding the query itself.
Mixed corpora (code + prose) under hybrid are merged with reciprocal rank fusion: code is ranked lexically (exact identifiers matter, and general embedders are weak on code), prose by cosine, and the two ranked lists are fused.
Optional cross-encoder rerank (rerank="cross-encoder") adds a precise second stage on any tier. It jointly encodes each (query, passage) pair and reorders the pool, which is more accurate than cosine but costs one model call per candidate.

You supply an embedding model only for the dense tiers, named once and auto-downloaded on first use. The lexical default needs none.
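The two-stage hybrid flow described above can be sketched in a few lines. Several things here are assumptions made for illustration: the lexical stage uses a simple term-overlap count as a stand-in for BM25, `embed` is any caller-supplied function that maps text to a vector, and `toy_embed` is a hand-built fake that exists only so the example runs without a model.

```python
import math
import re
from typing import Callable, Sequence


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_score(query: str, chunk: str) -> int:
    # Stand-in for BM25: number of distinct query terms found in the chunk.
    return len(set(tokenize(query)) & set(tokenize(chunk)))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    query: str,
    chunks: list[str],
    embed: Callable[[str], Sequence[float]],
    pool_size: int = 50,
    k: int = 5,
) -> list[int]:
    """Return chunk indices: lexical prune first, then reorder by cosine."""
    # Stage 1: narrow the corpus to a candidate pool with the cheap scorer.
    order = sorted(
        range(len(chunks)),
        key=lambda i: lexical_score(query, chunks[i]),
        reverse=True,
    )
    pool = order[:pool_size]
    # Stage 2: embed only the pool and the query, then rank by cosine.
    query_vec = embed(query)
    return sorted(
        pool, key=lambda i: cosine(query_vec, embed(chunks[i])), reverse=True
    )[:k]


# Toy embedder for the demo only: one dimension per hand-picked concept.
CONCEPTS = [{"cancel", "terminate", "end"}, {"fee", "payment", "invoice"}]


def toy_embed(text: str) -> list[float]:
    tokens = tokenize(text)
    return [float(sum(t in group for t in tokens)) for group in CONCEPTS]
```

The sketch also shows the trade the hybrid tier makes: a chunk that shares no words with the query never enters the pool, so the dense stage cannot rescue it. Ranking every chunk by cosine, as the semantic tier does, removes that limit at the cost of embedding the whole corpus once.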

The mental model

import redhop
doc = redhop.Document.from_text(text)   # documents
ctx = doc.context(query)                # + queries → context

const doc = Document.fromText(text);    // documents
const ctx = doc.context(query);         // + queries → context

let mut doc = redhop::Document::from_text("doc", text)?;  // documents
let ctx = doc.context(query)?;                            // + queries → context

You think in documents and queries. Retrieval is an implementation detail.

Under the hood

Under the Python surface, RedHop is a Rust library for retrieval infrastructure: chunking, retrieval, reranking, and diagnostics. It does not generate text or bundle an embedding model. Embedding plugs in through a trait boundary. RedHop’s contribution is the orchestration between these stages and the diagnostics engine that makes retrieval quality observable from text alone.

Layering

RedHop ships as a single redhop crate (one Python wheel, one npm package, one Cargo crate). Internally it is organized as modules. Each layer above the core depends only on the trait surface below it, not on sibling implementations.

redhop                        single published crate
   ├── document               high-level façade (Document, read_file, …)
   ├── context                budget-aware assembly + Decision Report
   ├── chunking
   ├── retrieval
   ├── reranking              (under feature "semantic")
   ├── embeddings             (under feature "semantic")
   ├── files                  (under feature "files")
   └── core                   traits + types

The trait surface

redhop::core (re-exported as redhop::traits) defines the pluggable abstractions, the entire contract a caller has to understand:

Trait	Owns
`TokenizerBackend`	Token counting, sentence segmentation, truncation.
`Chunker`	`Document → Vec<Chunk>`.
`EmbeddingProvider`	`&[String] → Vec<Embedding>`.
`Retriever`	`Query → Vec<RetrievalResult>` + ingest.
`Reranker`	Reorder candidate results.
`VectorIndex`	Add + nearest-neighbor search over embeddings.

Data flow

Document(s)
  → chunker.chunk_batch       → Vec<Chunk>   (optionally + Embedding)
  → retriever.index           → [state]
  → retriever.retrieve(q, k)  → Vec<RetrievalResult>  (score + ScoreBreakdown)
  → reranker.rerank (optional)→ reordered top_k
  → build_context             → BuiltContext + ContextReport

Hybrid retrieval fans the query out to several sub-retrievers in parallel and fuses their ranked lists with Reciprocal Rank Fusion by default. RRF is rank-based and scale-free, which makes it the right pick when sub-retrievers produce heterogeneous score distributions. Weighted-sum fusion with min-max normalization is available when scores are commensurable.
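Both fusion strategies fit in a short sketch. The functions below are illustrative rather than RedHop's code. The RRF constant `k=60` is the conventional value from the original RRF formulation and is an assumption here, and ties are broken by id purely to keep the output deterministic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists using ranks only; raw scores are never compared."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_sum_fusion(
    score_maps: list[dict[str, float]], weights: list[float]
) -> list[str]:
    """Fuse per-retriever score maps after min-max normalization."""
    fused: dict[str, float] = {}
    for scores, weight in zip(score_maps, weights):
        for doc_id, s in min_max(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return sorted(fused, key=lambda d: (-fused[d], d))
```

RRF needs only the order each sub-retriever produced, so a BM25 score of 12.0 and a cosine of 0.9 never have to be put on the same scale. Weighted-sum fusion does compare magnitudes, which is why it wants normalization first and suits retrievers whose scores mean roughly the same thing.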

Why these choices

Embeddings aren’t bundled. Forcing one model into the library ties you to a single quality/latency/cost point and pulls in heavy runtime dependencies. The EmbeddingProvider trait is async and batch-friendly, so any backend plugs in cleanly, and the default Document path needs none.

Diagnostics are first-class. Retrieval failure modes are observable from text alone. You don’t need the LLM to know you served it a context full of distractors. The engine computes its metrics on every query with no model dependence, and emits machine-readable warning codes (low_lexical_grounding, high_distractor_ratio, retrieval_saturated) for monitoring and adaptive routing.
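A rough sketch of what "observable from text alone" can mean in practice follows. The warning codes come from the list above, but everything else is an assumption made for illustration: the metric definitions, the thresholds, the stopword list, and the `diagnose` function are hypothetical, and `retrieval_saturated` is left out because its definition is not given here.

```python
import re

# Assumption: a tiny stopword list, just enough for the demo.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def diagnose(
    query: str,
    chunks: list[str],
    grounding_min: float = 0.5,   # hypothetical threshold
    distractor_max: float = 0.5,  # hypothetical threshold
) -> dict:
    """Compute text-only signals about a retrieved context. No model involved."""
    q_terms = set(tokenize(query)) - STOPWORDS
    chunk_terms = [set(tokenize(c)) for c in chunks]
    context_terms = set().union(*chunk_terms)

    # Grounding: share of query terms that appear anywhere in the context.
    grounding = len(q_terms & context_terms) / len(q_terms) if q_terms else 0.0
    # Distractor ratio: share of chunks with no query term at all.
    distractors = sum(1 for terms in chunk_terms if not (q_terms & terms))
    ratio = distractors / len(chunks) if chunks else 0.0

    warnings = []
    if grounding < grounding_min:
        warnings.append("low_lexical_grounding")
    if ratio > distractor_max:
        warnings.append("high_distractor_ratio")
    return {"grounding": grounding, "distractor_ratio": ratio, "warnings": warnings}
```

Because the result is a plain dictionary with stable codes, a caller could log it, alert on it, or route a low-grounding query to a dense tier without ever consulting the LLM.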

Chunking is core. Chunk boundaries determine evidence density and topical purity, which dominate the metrics that matter, and chunk granularity is the measured lever (see benchmarks). AdaptiveChunker is the long-term home for evidence-aware chunking. Today it pairs sentence segmentation with a Jaccard cohesion gate.
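One way to picture sentence segmentation paired with a Jaccard cohesion gate is sketched below. It is not the AdaptiveChunker implementation. The regex sentence splitter, the stopword list, the `threshold` and `max_sentences` defaults, and the choice to compare each sentence against the running chunk's vocabulary are all assumptions.

```python
import re

# Assumption: a tiny stopword list so function words don't fake cohesion.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cohesive_chunks(text: str, threshold: float = 0.1, max_sentences: int = 5) -> list[str]:
    """Group adjacent sentences; cut when vocabulary overlap drops below the gate."""
    chunks: list[str] = []
    current: list[str] = []
    current_terms: set[str] = set()
    for sentence in split_sentences(text):
        terms = content_terms(sentence)
        too_long = len(current) >= max_sentences
        if current and (too_long or jaccard(current_terms, terms) < threshold):
            # Topic shift or size cap: close the chunk and start a new one.
            chunks.append(" ".join(current))
            current, current_terms = [], set()
        current.append(sentence)
        current_terms |= terms
    if current:
        chunks.append(" ".join(current))
    return chunks
```

On a passage that moves from lease terms to parking rules, the gate cuts at the topic change and keeps each topic's sentences together. That is the topical purity the paragraph above is after. A lexical gate like this stays conservative: it cannot see a topic shift that reuses the same vocabulary.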

A proven BM25 engine. Lexical retrieval is solved: production analyzers, fast scoring, an in-memory index for embeddable use. We build on a mature engine behind the Retriever trait rather than reinventing it.

Exact cosine, in process. Dense retrieval scores the query against locally computed embeddings with exact cosine: no ANN, no external index. Correct by construction, and it keeps the whole pipeline embeddable with nothing to operate.

What we explicitly avoided

Fake-AI boundary detection in chunking: a conservative lexical-cohesion gate ships today. The rest is roadmapped, not faked.
Speculative topology / knowledge-graph retrieval / semantic-continuity heuristics: research, not infrastructure.
LLM integrations: once retrieval returns, RedHop is done. What comes after is the caller’s problem.

Where to go next

To recap: a from_text → context surface, a lexical default that needs no model, and a diagnostics engine on every call. One tier asks for a dependency: dense retrieval, for the semantic and paraphrase queries BM25 misses. It’s a real trade-off (a one-time embed cost and a model download), so it gets its own page:

→ Retrieval options: when to reach for dense retrieval, how to enable it, and exactly what it asks of you.

Next: Retrieval options · retrieval & context tips · how RedHop compares to LangChain and LlamaIndex.