RedHop vs LangChain vs LlamaIndex

We’d rather you trust the numbers than the marketing. Below is the same contract question done three ways, then a full, reproducible benchmark across scenarios — so you can judge it against your own workload.

You have a contract.pdf and one question: “What is the governing law?” Here’s the code path in each library to get the LLM the right context.

import redhop
from openai import OpenAI

query = "What is the governing law?"
ctx = redhop.Document.from_file("contract.pdf").context(query)
# parsed, chunked, retrieved, and token-budgeted internally
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside — and every call returns a Decision Report explaining what it kept and why.

|  | RedHop | LangChain | LlamaIndex |
| --- | --- | --- | --- |
| Document parsing | built-in (from_file) | a loader | a reader |
| Chunking strategy | internal default | you tune it | you tune it |
| Embedding model | optional (off by default) | required | required |
| Vector store / ANN | none, at any tier | FAISS / etc. | built-in index |
| Retriever wiring | none | manual | query engine |
| Cost to index | $0, ~1ms (BM25) | 1 embed call/chunk | 1 embed call/chunk |
| Why it kept a passage | Decision Report | opaque | opaque |

That’s the categorical difference: RedHop is one bounded step (from_file → context) with no vector database, at any tier — the frameworks are pipelines you assemble, embed into, and operate. RedHop’s default needs no model at all, so out of the box it’s queryable instantly with no embedding step.
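
To make the "no embedding step" point concrete, here is a minimal sketch of lexical BM25 scoring in plain Python. It is not RedHop's implementation; the `TinyBM25` class, the tokenizer, and the sample clauses are illustrative assumptions, and `k1=1.5`, `b=0.75` are just commonly used defaults. Indexing amounts to counting words, which is why it needs no model call.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Illustrative Okapi BM25 index over a list of text chunks."""

    def __init__(self, chunks, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.n = len(chunks)
        # "Indexing" is only term counting: no embeddings, no vector store.
        self.term_counts = [Counter(tokenize(c)) for c in chunks]
        self.lengths = [sum(tc.values()) for tc in self.term_counts]
        self.avg_len = sum(self.lengths) / self.n
        # Document frequency: how many chunks contain each term.
        self.doc_freq = Counter(t for tc in self.term_counts for t in tc)

    def idf(self, term):
        df = self.doc_freq.get(term, 0)
        return math.log((self.n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, i):
        tc, length = self.term_counts[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            tf = tc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        order = sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Hypothetical contract clauses, for illustration only.
chunks = [
    "Governing Law. This Agreement is governed by the laws of the State of Delaware.",
    "Termination. Either party may terminate this Agreement with thirty days written notice.",
    "Payment. Fees are payable within forty-five days of invoice.",
]
index = TinyBM25(chunks)
best = index.top_k("What is the governing law?", k=1)[0]
print(chunks[best])
```

The trade-off is the usual one for lexical retrieval: it matches words, not meaning, so a query phrased very differently from the clause text can miss.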

On speed, RedHop is queryable instantly on its lexical default (no embedding step) and answers warm queries in ~1–6ms in-process — the full numbers are on the Speed page. But speed isn’t the pitch: RedHop’s real draw is the runtime — the bounded API, conditional pruning, the Decision Report, no infrastructure. The fair question is whether that simplicity costs answer quality — so we measured it, head to head, below.

Same documents, BM25 for all three (so we compare context assembly, not retrieval engines), same token budget. Two datasets — CUAD (real contracts) and HotpotQA (multi-hop) — across two tiers: evidence retention (no LLM) and downstream answer quality (gpt-4o-mini).

Evidence retention (gold-evidence recall ≥0.8, n=300):

| Dataset | RedHop | LangChain | LlamaIndex |
| --- | --- | --- | --- |
| HotpotQA (multi-hop) | 77% | 71% | 72% |
| CUAD (contracts) | 82% | 73% | 86% |
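
For readers who want to see what a retention figure like these measures, the following is a rough sketch of gold-evidence word-recall with a 0.8 threshold. The function names, the tokenizer, and the choice to count distinct words are assumptions for illustration; the benchmark's exact definition is on the benchmarks page.

```python
import re


def words(text):
    """Lowercased alphanumeric tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(gold_span, context):
    """Fraction of distinct gold-span words that also appear in the context."""
    gold = set(words(gold_span))
    if not gold:
        return 0.0
    return len(gold & set(words(context))) / len(gold)


def retention_rate(pairs, threshold=0.8):
    """Share of (gold_span, context) pairs whose recall meets the threshold."""
    hits = sum(1 for gold, ctx in pairs if word_recall(gold, ctx) >= threshold)
    return hits / len(pairs)


# Two made-up examples: one where the evidence survived, one where it did not.
pairs = [
    ("governed by the laws of Delaware",
     "This Agreement is governed by the laws of the State of Delaware."),
    ("thirty days written notice",
     "Fees are payable within forty-five days of invoice."),
]
print(retention_rate(pairs))
```

Because this only checks word overlap, a context can score well without containing the gold span contiguously, which is one reason retention is treated as a proxy and not as answer quality.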

Answer quality (gpt-4o-mini, F1 / EM, n=150):

| Dataset | RedHop | LangChain | LlamaIndex |
| --- | --- | --- | --- |
| HotpotQA | 0.51 / 0.41 | 0.50 / 0.39 | 0.50 / 0.42 |
| CUAD | 0.34 / 0.17 | 0.25 / 0.11 | 0.35 / 0.16 |
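
F1 and EM here are the usual extractive-QA answer metrics. As a hedged illustration, the sketch below follows the widely used SQuAD-style convention (lowercase, strip punctuation and English articles, then compare tokens); whether the benchmark scripts normalize in exactly this way is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose but correct answer: partial F1 credit, no exact match.
f1 = token_f1("The State of Delaware", "Delaware")
em = exact_match("The State of Delaware", "Delaware")
print(f1, em)
```

This is why F1 and EM can diverge: a model that answers in a full phrase loses exact match but keeps partial F1 credit.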

On a real contract (the contract.pdf path itself)

We ran RedHop’s Document.from_text → context() path on 50 real CUAD contracts (644 clause questions) — BM25, budget 2,048 tok, the exact path the code above uses. Numbers are end-to-end (after Auto pruning); “retained” means gold-span word-recall, a lexical retention proxy — not downstream answer quality:

  • −80% tokens — a ~9.3k-token contract becomes a ~1.9k-token context.
  • Gold evidence retained at ≥0.8 word-recall on 88% of queries (≥0.5 on 96%); the no-prune retrieval ceiling is 98%, so pruning costs ~6 points.
  • ~1.7ms/query p50 (warm in-memory index, single local CPU), ~1ms to chunk+index a whole contract — the default BM25 path.
  • Auto chose to prune on 94% of queries — real contracts are large, so the regime where pruning is measured to help is the common case.
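
The token-reduction figure above is simple arithmetic over a budgeted context. A minimal sketch of the general idea, assuming a greedy packer over already-ranked chunks and a crude word-count token estimate (both hypothetical stand-ins, not RedHop's actual pruning logic):

```python
def approx_tokens(text):
    """Crude token estimate: whitespace-separated words (stand-in for a real tokenizer)."""
    return len(text.split())


def pack_to_budget(ranked_chunks, budget):
    """Walk chunks in rank order, skipping any chunk that would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            continue
        kept.append(chunk)
        used += cost
    return kept, used


def reduction(original_tokens, context_tokens):
    """Fraction of tokens removed relative to the original document."""
    return 1 - context_tokens / original_tokens


ranked = [
    "governed by laws of Delaware",       # 5 words
    "terminate with thirty days notice",  # 5 words, would overflow
    "fees due promptly",                  # 3 words, still fits
]
kept, used = pack_to_budget(ranked, budget=8)
print(kept, used)

# The figures quoted above: ~9.3k tokens in, ~1.9k tokens out.
print(f"{reduction(9300, 1900):.0%}")
```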

Full conditions and the skeptic’s checklist are on the benchmarks page.

  • RedHop leads multi-hop retention; on answer quality it is roughly level with LlamaIndex and ahead of LangChain. LlamaIndex edges RedHop on contract extraction (its node parsing seems to suit legalese). No system dominates, and we won’t pretend otherwise.
  • Retention is a loose proxy for answers — RedHop’s bigger retention lead shrinks to a near-tie on answer quality, because at a sensible budget every system gives the model enough to roughly tie. We show both numbers.
  • LangChain’s deficit is mostly refusals (on CUAD, 59% vs ~47% for the other two): its chunking surfaced the answer span less often, so the model bailed more.
  • These are BM25-vs-BM25 results; the frameworks’ default vector retrievers aren’t covered here.

Answer quality is in the same band across all three (the numbers above) — so the deciding factors are what the frameworks don’t offer:

  1. A Decision Report for every call — what it did, why, and why it chose not to intervene. No black box.
  2. Conditional optimization — prunes only when large/diluted (measured to help); passes small contexts through untouched.
  3. An evidence layer — every default traces to a measured finding, including the experiments that failed.
  4. A tiny, bounded surface: Document.from_text(...).context(query), with no vector infrastructure to run.
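
To illustrate how conditional optimization and a per-call decision record can fit together, here is a small sketch. Everything in it is hypothetical: the `Decision` dataclass, the `assemble` function, and the "prune only when over budget" rule are simplified stand-ins, not RedHop's Decision Report format or its actual pruning criteria (which also consider dilution).

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Hypothetical record of what the assembler did and why."""
    action: str      # "prune" or "pass_through"
    reason: str
    tokens_in: int
    tokens_out: int
    kept: list = field(default_factory=list)


def size(text):
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


def assemble(scored_chunks, budget):
    """scored_chunks is a list of (score, text) pairs; prune only when over budget."""
    total = sum(size(text) for _, text in scored_chunks)
    if total <= budget:
        # Small context: leave it untouched and say so.
        texts = [text for _, text in scored_chunks]
        return Decision("pass_through", f"{total} tokens fit budget {budget}",
                        total, total, texts)
    kept, used = [], 0
    # Large context: keep the highest-scoring chunks that fit.
    for _, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if used + size(text) <= budget:
            kept.append(text)
            used += size(text)
    return Decision("prune", f"{total} tokens exceed budget {budget}",
                    total, used, kept)


chunks = [(2.0, "governing law Delaware"), (0.1, "payment terms")]
small = assemble(chunks, budget=10)
large = assemble(chunks, budget=4)
print(small.action, small.reason)
print(large.action, large.reason, large.kept)
```

The design point is that the "do nothing" branch is recorded as a decision too, so a caller can always see why a context was or was not altered.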

The big frameworks give you the full pipeline kit — many loaders, retrievers, vector stores, agents — when you want to assemble and tune that machinery yourself. RedHop is the opposite bet: document-centric retrieval as one bounded, in-process step, where simplicity and explainability matter more than wiring.

python3 -m venv bench/.venv
bench/.venv/bin/pip install redhop rank-bm25 langchain-community llama-index-core llama-index-retrievers-bm25
bench/.venv/bin/python bench/compare.py # retention (free)
bench/.venv/bin/python bench/tier3.py --n 150 # answer quality (needs OPENROUTER_API_KEY)

  • gpt-4o-mini only; one budget per dataset; two datasets. CUAD extraction F1 is low in absolute terms (hard task) — the relative ranking is the signal.
  • LlamaIndex’s contract edge is real and not yet fully explained (likely its node parsing / tokenization on legalese).
  • RedHop’s reasoning_preserving strategy does not beat plain top-k downstream — its value is the runtime decisions and transparency, not a better ranking algorithm.
  • The CUAD contract numbers above are evidence retention (word-recall), not downstream answer quality; the token reduction and latency are end-to-end.

Next: Benchmarks — every number, reproducible, with full methodology.