Speed
RedHop runs in your process over a Rust core — no network round-trip, no service to call — so the numbers below are dominated by real work (parsing, indexing, scoring), not overhead. All measurements are CPU-only on a single machine with a warm index; absolute milliseconds drift ~10–15% run-to-run, so read the shape, not the last digit.
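The two numbers reported throughout this page, time to first answer and warm per-query latency, can be measured separately with a small harness. The sketch below is a generic illustration rather than the project's benchmark script: `measure`, `build`, and the toy in-memory index are hypothetical names, and the median is used only to damp run-to-run drift.

```python
import statistics
import time
from typing import Callable, Tuple


def measure(build: Callable[[], Callable[[str], object]],
            query: str,
            warm_runs: int = 20) -> Tuple[float, float]:
    """Return (time_to_first_answer_s, median_warm_query_s).

    `build` is a hypothetical callable that loads and indexes a document
    and returns a query function; it is a stand-in, not a RedHop API.
    """
    start = time.perf_counter()
    ask = build()   # cold path: parse + index
    ask(query)      # first answer
    first = time.perf_counter() - start

    samples = []
    for _ in range(warm_runs):
        t0 = time.perf_counter()
        ask(query)  # warm path: index already in memory
        samples.append(time.perf_counter() - t0)
    return first, statistics.median(samples)


if __name__ == "__main__":
    # Toy stand-in index: substring match over an in-memory list.
    def build_toy():
        docs = [f"clause {i} governs termination" for i in range(10_000)]
        return lambda q: [d for d in docs if q in d][:5]

    first, warm = measure(build_toy, "clause 42")
    print(f"time to first answer: {first:.4f}s, warm per-query: {warm * 1000:.3f}ms")
```

Swapping `build_toy` for a function that loads a real document would give figures comparable in shape, though not necessarily in magnitude, to the tables below.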
Lexical (the default) — instant
The default lexical tier (BM25) needs no model and no embedding step, so a document
is queryable almost immediately:
| contract.pdf path, ~189k tokens | RedHop (BM25) |
|---|---|
| time to first answer | 0.02s |
| warm per-query | ~1ms |
Each query also prunes to budget and emits a Decision Report, so it does more than a bare retriever and still answers in about a millisecond.
Reproduce: `cargo run -p redhop-examples --example eval_cuad_documents --release`
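To see why a lexical tier needs no model, here is a minimal sketch of textbook Okapi BM25 scoring. It is not RedHop's Rust implementation: the whitespace tokenizer, the `bm25_scores` name, and the `k1`/`b` values (common defaults) are assumptions made for illustration. Everything is computed from term counts, so there is nothing to download or embed before the first query.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long chunks need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "the licensee shall pay royalties quarterly",
    "either party may terminate this agreement with notice",
    "governing law is the state of delaware",
]
scores = bm25_scores("terminate agreement", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])
```

Only the second chunk shares terms with the query, so it is the only one with a non-zero score. A production index would precompute the term statistics once instead of recounting them per query.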
Semantic — a one-time cost, then fast forever
The opt-in semantic / hybrid tiers embed your chunks once (cached), then score every
query by exact cosine similarity over those cached vectors. The cost is paid once at
setup; every query after that is fast:
| corpus | embed-all (one-time setup) | warm per-query |
|---|---|---|
| ~13k tokens (1 contract) | ~2s | ~6ms |
| ~38k tokens (5 contracts) | ~7s | ~6ms |
| ~189k tokens (15 contracts) | ~17s | ~6ms |
Warm queries land at ~6ms — the query embedding dominates, and exact cosine over the
cached vectors is cheap. The only real cost is embedding everything up front, and you
pay it only if you opt into a dense tier — the lexical default skips it entirely.
With `from_folder(persist=True)` the embeddings are written to disk, so the embed-all
cost is paid once and later runs reload the cached vectors.
Reproduce: `bench/.venv/bin/python bench/speed_compare.py`
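The embed-once, score-by-exact-cosine pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not RedHop's code: `CachedCosineIndex` is a hypothetical class, and `_embed` is a toy stand-in (normalised word counts over a small vocabulary) for a real embedding model.

```python
import math
from typing import Dict, List, Tuple


class CachedCosineIndex:
    """Embed chunks once, then answer queries by exact cosine over the cache."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = chunks
        # Toy vocabulary so the stand-in embedder is deterministic.
        self.vocab: Dict[str, int] = {}
        for chunk in chunks:
            for word in chunk.lower().split():
                self.vocab.setdefault(word, len(self.vocab))
        # One-time embed-all: the expensive step when a real model is used.
        self.vectors = [self._embed(c) for c in chunks]

    def _embed(self, text: str) -> List[float]:
        """Hypothetical stand-in for an embedding model: unit-length word counts."""
        vec = [0.0] * len(self.vocab)
        for word in text.lower().split():
            idx = self.vocab.get(word)
            if idx is not None:
                vec[idx] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def query(self, text: str, top_k: int = 1) -> List[Tuple[float, str]]:
        q = self._embed(text)  # per-query embedding
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, v)), chunk)
            for v, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = CachedCosineIndex([
    "the licensee shall pay royalties quarterly",
    "either party may terminate this agreement with notice",
    "governing law is the state of delaware",
])
score, chunk = index.query("terminate agreement")[0]
print(f"{score:.2f} {chunk}")
```

The constructor does all the embedding work; `query` embeds only the query string and then runs a linear scan of dot products, which is why the per-query cost stays small relative to setup.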
Latency stays flat as documents grow
The most important property for interactive use: per-query time barely moves as the document gets bigger. In these measurements BM25 lookup time does not grow with corpus size, so a 4,000-page PDF answers as fast as a 1,000-page one once it’s loaded. Time-to-first-answer is dominated by parsing the PDF (~2.5ms/page, roughly linear), with chunking, indexing, and the query negligible on top:
| Pages | Chunks | Time to first answer | Warm query |
|---|---|---|---|
| 1,000 | 1,000 | 2.3s | ~2ms |
| 2,000 | 2,000 | 5.0s | ~2ms |
| 4,000 | 4,000 | 11.5s | ~2ms |
A thousands-of-page document is fully interactive after its one-time load. (Adding the
semantic tier adds the embed-all, ~11s per 1,000 chunks, which `persist=True` makes a
one-time cost.) Measured on synthetic PDFs via `from_file` on the lexical default; this is a
latency measurement (parse + index + query), not an answer-quality one.
Reproduce: `bench/.venv/bin/python bench/large_pdf.py` ·
`bench/.venv/bin/python bench/large_pdf.py --semantic`
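The per-page and per-chunk rates quoted in this section are enough for a back-of-envelope load-time estimate. The sketch below is plain arithmetic on those two figures; `estimate_first_answer_s` is a hypothetical helper, and it deliberately ignores chunking, indexing, and the query itself, which the text describes as negligible.

```python
PARSE_MS_PER_PAGE = 2.5       # from the text: ~2.5ms/page, roughly linear
EMBED_S_PER_1K_CHUNKS = 11.0  # from the text: ~11s per 1,000 chunks


def estimate_first_answer_s(pages: int, chunks: int = 0, semantic: bool = False) -> float:
    """Rough time to first answer in seconds (parse, plus embed-all if semantic)."""
    total = pages * PARSE_MS_PER_PAGE / 1000.0
    if semantic:
        total += chunks / 1000.0 * EMBED_S_PER_1K_CHUNKS
    return total


measured = {1000: 2.3, 2000: 5.0, 4000: 11.5}  # lexical rows from the table above
for pages, actual in measured.items():
    est = estimate_first_answer_s(pages)
    print(f"{pages} pages: estimated {est:.1f}s, measured {actual}s")
```

The linear estimate gives 2.5s, 5.0s, and 10.0s against the measured 2.3s, 5.0s, and 11.5s, so it sits within the 10–15% run-to-run drift noted at the top of the page; the 4,000-page row runs slightly above the straight line.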
Speed is one axis; answer quality and evidence retention are the others. Those, along with the head-to-head against LangChain and LlamaIndex, live on the Benchmarks page.