RedHop vs LangChain vs LlamaIndex
We’d rather you trust the numbers than the marketing. Below is the same contract question done three ways, then a full, reproducible benchmark across scenarios — so you can judge it against your own workload.
The same question, three ways
You have a contract.pdf and one question: “What is the governing law?” Here’s
the code path in each library to get the LLM the right context.
RedHop:

    import redhop
    from openai import OpenAI

    query = "What is the governing law?"

    ctx = redhop.Document.from_file("contract.pdf").context(query)  # parsed, chunked, retrieved, and token-budgeted internally

    response = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
    )
    print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside, and every call returns a Decision Report explaining what it kept and why.
LangChain:

    from langchain_community.document_loaders import PyMuPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import FAISS
    from langchain_core.prompts import ChatPromptTemplate
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain

    query = "What is the governing law?"

    pages = PyMuPDFLoader("contract.pdf").load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    ).split_documents(pages)

    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_template(
        "Answer using only the context.\n\n{context}\n\nQuestion: {input}"
    )
    combine = create_stuff_documents_chain(ChatOpenAI(model="gpt-4o-mini"), prompt)
    chain = create_retrieval_chain(retriever, combine)

    print(chain.invoke({"input": query})["answer"])

What you stand up: a splitter (you choose chunk_size/overlap), an embedding model, a FAISS vector store, a retriever, a prompt template, and a retrieval chain. That is six wired pieces, and embeddings cost a call per chunk.
LlamaIndex:

    from llama_index.core import VectorStoreIndex, Settings
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.readers.file import PyMuPDFReader
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI

    query = "What is the governing law?"

    Settings.embed_model = OpenAIEmbedding()
    Settings.llm = OpenAI(model="gpt-4o-mini")

    docs = PyMuPDFReader().load(file_path="contract.pdf")

    index = VectorStoreIndex.from_documents(
        docs,
        transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
    )

    engine = index.as_query_engine(similarity_top_k=4)
    print(engine.query(query))

What you stand up: a node parser, an embedding model, a vector index, and a query engine. Cleaner than LangChain, but still an embed-and-index pipeline you own and pay for.
What each approach makes you own
| | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| Document parsing | built-in (from_file) | a loader | a reader |
| Chunking strategy | internal default | you tune it | you tune it |
| Embedding model | optional (off by default) | required | required |
| Vector store / ANN | none, at any tier | FAISS / etc. | built-in index |
| Retriever wiring | none | manual | query engine |
| Cost to index | $0, ~1ms (BM25) | 1 embed call/chunk | 1 embed call/chunk |
| Why it kept a passage | Decision Report | opaque | opaque |
That’s the categorical difference: RedHop is one bounded step (from_file → context)
with no vector database, at any tier — the frameworks are pipelines you assemble,
embed into, and operate. RedHop’s default needs no model at all, so out of the box
it’s queryable instantly with no embedding step.
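To make "lexical default, no embedding step" concrete, here is a minimal sketch of Okapi BM25 scoring over a handful of chunks in pure Python. It is not RedHop's implementation: the `BM25Index` class, the regex tokenizer, the `k1`/`b` values, and the sample clauses are all assumptions chosen for illustration. The point is only that indexing amounts to counting terms, with no model call.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Minimal Okapi BM25 over text chunks (hypothetical, not RedHop's code)."""

    def __init__(self, chunks, k1=1.5, b=0.75):
        self.chunks = chunks
        self.k1, self.b = k1, b
        # Indexing is just term counting: no embedding model is involved.
        self.term_freqs = [Counter(tokenize(c)) for c in chunks]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.lengths) / len(chunks)
        # Document frequency: how many chunks contain each term.
        self.doc_freq = Counter(t for tf in self.term_freqs for t in tf)

    def idf(self, term):
        n = len(self.chunks)
        df = self.doc_freq.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, i):
        tf, length = self.term_freqs[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query, k=4):
        order = sorted(
            range(len(self.chunks)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return [(self.chunks[i], self.score(query, i)) for i in order[:k]]


# Made-up sample clauses, for illustration only.
chunks = [
    "Governing Law. This Agreement is governed by the law of the State of Delaware.",
    "Either party may terminate this Agreement with thirty days written notice.",
    "Licensee shall pay all fees within forty-five days of invoice.",
]
index = BM25Index(chunks)
best_chunk, best_score = index.top_k("What is the governing law?", k=1)[0]
print(best_chunk)
```

Only the first clause shares any terms with the query, so it ranks first; the other two score zero. A production index would add stemming and precomputed postings, which this sketch leaves out.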
On speed, RedHop is queryable instantly on its lexical default (no embedding step) and answers warm queries in ~1–6ms in-process — the full numbers are on the Speed page. But speed isn’t the pitch: RedHop’s real draw is the runtime — the bounded API, conditional pruning, the Decision Report, no infrastructure. The fair question is whether that simplicity costs answer quality — so we measured it, head to head, below.
The benchmark
Same documents, BM25 for all three (so we compare context assembly, not retrieval engines), same token budget. Two datasets, CUAD (real contracts) and HotpotQA (multi-hop), across two tiers: evidence retention (no LLM) and downstream answer quality (gpt-4o-mini).
Evidence retention (gold-evidence recall ≥0.8, n=300):
| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA (multi-hop) | 77% | 71% | 72% |
| CUAD (contracts) | 82% | 73% | 86% |
Answer quality (gpt-4o-mini, F1 / EM, n=150):
| dataset | RedHop | LangChain | LlamaIndex |
|---|---|---|---|
| HotpotQA | 0.51 / 0.41 | 0.50 / 0.39 | 0.50 / 0.42 |
| CUAD | 0.34 / 0.17 | 0.25 / 0.11 | 0.35 / 0.16 |
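For readers unfamiliar with the F1 / EM columns, the sketch below shows the common SQuAD-style way of computing them: normalize both strings, then compare exactly (EM) or by token overlap (F1). Whether the benchmark scripts use exactly this normalization is an assumption; the function names and the sample strings are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose but correct answer: no exact match, partial F1 credit.
print(exact_match("the State of Delaware", "Delaware"))
print(token_f1("the State of Delaware", "Delaware"))
```

A wordy answer that contains the gold span earns partial F1 but no EM, which is one reason the two columns can rank systems differently.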
On a real contract (the contract.pdf path itself)
We ran RedHop’s Document.from_text → context() path on 50 real CUAD contracts
(644 clause questions): BM25, budget 2,048 tok, the exact path the code above
uses. Numbers are end-to-end (after Auto pruning); “retained” means gold-span
word-recall, a lexical retention proxy, not downstream answer quality:
- −80% tokens — a ~9.3k-token contract becomes a ~1.9k-token context.
- Gold evidence retained at ≥0.8 word-recall on 88% of queries (≥0.5 on 96%); the no-prune retrieval ceiling is 98%, so pruning costs ~6 points.
- ~1.7ms/query p50 (warm in-memory index, single local CPU), ~1ms to chunk+index a whole contract — the default BM25 path.
- Auto chose to prune on 94% of queries: real contracts are large, so the regime where pruning is measured to help is the common case.
Full conditions and the skeptic’s checklist are on the benchmarks page.
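The retention figures above depend on how "gold-span word-recall" is computed, so here is one plausible reading as a short sketch: the fraction of distinct gold-span words that survive into the assembled context, thresholded at 0.8. The set-based counting, the tokenizer, and the sample pairs are assumptions; the benchmark scripts may count repeated words or tokenize differently.

```python
import re


def words(text):
    """Lowercased alphanumeric words (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(gold_span, context):
    """Fraction of distinct gold-span words that also appear in the context."""
    gold = set(words(gold_span))
    if not gold:
        return 0.0
    return len(gold & set(words(context))) / len(gold)


def retention_rate(examples, threshold=0.8):
    """Share of (gold_span, context) pairs whose word-recall meets the threshold."""
    hits = sum(word_recall(gold, ctx) >= threshold for gold, ctx in examples)
    return hits / len(examples)


# Made-up (gold_span, assembled_context) pairs, for illustration only.
examples = [
    ("governed by the laws of Delaware",
     "This Agreement is governed by the laws of the State of Delaware."),
    ("terminate with thirty days notice",
     "Either party may terminate this Agreement."),
]
print(retention_rate(examples, threshold=0.8))

# Token reduction quoted above: a ~9.3k-token contract cut to ~1.9k tokens.
reduction = 1 - 1_900 / 9_300
print(f"{reduction:.0%}")
```

Because this metric only checks word overlap, a context can score high while splitting the clause across chunks, which is why it is treated as a proxy rather than as answer quality.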
How to read this
- RedHop leads multi-hop retention, and on answers it is roughly level with LlamaIndex and ahead of LangChain. LlamaIndex edges RedHop on contract extraction (its node parsing seems to suit legalese). No system dominates, and we won’t pretend otherwise.
- Retention is a loose proxy for answers — RedHop’s bigger retention lead shrinks to a near-tie on answer quality, because at a sensible budget every system gives the model enough to roughly tie. We show both numbers.
- LangChain’s deficit is mostly refusals (CUAD 59% vs ~47%): its chunking surfaced the answer span less often, so the model bailed more.
- These are BM25-vs-BM25 results; the frameworks’ default vector retrievers aren’t covered here.
So why pick RedHop?
Answer quality is in the same band across all three (the numbers above), so the deciding factors are what the frameworks don’t offer:
- A Decision Report for every call — what it did, why, and why it chose not to intervene. No black box.
- Conditional optimization — prunes only when large/diluted (measured to help); passes small contexts through untouched.
- An evidence layer — every default traces to a measured finding, including the experiments that failed.
- A tiny, bounded surface: Document.from_text(...).context(query), no vector infrastructure to run.
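The "conditional optimization" and "Decision Report" ideas in the list above can be sketched as a pass-through-or-prune step that records what it did. This is a hypothetical illustration, not RedHop's pruning policy or report format: `assemble_context`, the whitespace token count, and the report keys are all invented here, and the sketch only models the "too large" trigger, not the "diluted" one.

```python
def count_tokens(text):
    """Crude whitespace token count; a real tokenizer would give other numbers."""
    return len(text.split())


def assemble_context(ranked_chunks, budget):
    """Keep chunks under a token budget and report the decision.

    ranked_chunks: list of (position, text), sorted best-first by retrieval score.
    Returns (kept texts in document order, report dict).
    """
    total = sum(count_tokens(text) for _, text in ranked_chunks)
    if total <= budget:
        # Small context: pass everything through untouched.
        kept = list(ranked_chunks)
        action = "pass_through"
    else:
        # Large context: greedily keep the best-ranked chunks that still fit.
        kept, used = [], 0
        for pos, text in ranked_chunks:
            cost = count_tokens(text)
            if used + cost <= budget:
                kept.append((pos, text))
                used += cost
        action = "pruned"
    kept.sort(key=lambda item: item[0])  # restore original document order
    report = {
        "action": action,
        "input_tokens": total,
        "output_tokens": sum(count_tokens(text) for _, text in kept),
        "kept_positions": [pos for pos, _ in kept],
    }
    return [text for _, text in kept], report


# Made-up ranked chunks: (position in document, text), best match first.
ranked = [
    (2, "governed by Delaware law"),
    (0, "this agreement is made between the parties"),
    (1, "fees are due in thirty days"),
]
_, small_report = assemble_context(ranked, budget=20)
texts, pruned_report = assemble_context(ranked, budget=10)
print(small_report["action"], pruned_report["action"], pruned_report["kept_positions"])
```

Returning the report alongside the context is the design choice worth noting: the caller can see whether anything was dropped, and which positions survived, without re-deriving it.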
The big frameworks give you the full pipeline kit — many loaders, retrievers, vector stores, agents — when you want to assemble and tune that machinery yourself. RedHop is the opposite bet: document-centric retrieval as one bounded, in-process step, where simplicity and explainability matter more than wiring.
Reproduce it yourself

    python3 -m venv bench/.venv
    bench/.venv/bin/pip install redhop rank-bm25 langchain-community llama-index-core llama-index-retrievers-bm25
    bench/.venv/bin/python bench/compare.py        # retention (free)
    bench/.venv/bin/python bench/tier3.py --n 150  # answer quality (needs OPENROUTER_API_KEY)

Scope & caveats
- gpt-4o-mini only; one budget per dataset; two datasets. CUAD extraction F1 is low in absolute terms (hard task); the relative ranking is the signal.
- LlamaIndex’s contract edge is real and not yet fully explained (likely its node parsing / tokenization on legalese).
- RedHop’s reasoning_preserving strategy does not beat plain top-k downstream; its value is the runtime decisions and transparency, not a better ranking algorithm.
- The CUAD contract numbers above are evidence retention (word-recall), not downstream answer quality; the token reduction and latency are end-to-end.
Next: Benchmarks — every number, reproducible, with full methodology.