Intro to RAG: how retrieval actually works

This is a from-scratch tour of RAG and the part that does the real work: retrieval. No prior background needed. We’ll start with the problem RAG solves, then build up the retrieval toolkit one idea at a time, with diagrams and examples.

The problem: an LLM can’t read everything

LLMs are powerful, but they read through a fixed-size window: the context window. You can only fit so many tokens (roughly, words or pieces of words) into a single prompt.

Now suppose you want answers grounded in your data: a 500-page contract, a folder of manuals, your whole codebase. You can’t just paste it all in:

  your documents                      the LLM's context window
  ┌────────────────────────┐          ┌───────────────┐
  │ 500 pages of contracts │   ──▶    │  fits ~a few   │   ✗ won't fit
  │ (hundreds of 1000s of  │          │   pages worth  │
  │  tokens)               │          └───────────────┘
  └────────────────────────┘

And even when a lot does fit, stuffing everything in is a bad idea: it’s slow, it’s expensive (you pay per token), and models get distracted by mostly-irrelevant text and miss the one sentence that mattered.

The fix: send only the relevant pieces (that’s RAG)

So instead of sending the whole haystack, you find the few pieces relevant to the question and put only those in the prompt. That’s the whole idea of Retrieval-Augmented Generation:

  question ──┐
             ├──▶  RETRIEVE the relevant pieces  ──▶  put them in the prompt  ──▶  LLM  ──▶  answer
  your docs ─┘

Retrieval: pick the relevant pieces out of your data.
Augmented Generation: hand those pieces to the LLM so its answer is grounded in them, not in vague training memory.

The “generation” half is just calling an LLM. The half that decides whether the answer is right or a hallucination is retrieval, so that’s what the rest of this guide is about.
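As a minimal sketch of the "augmented" step, here is one way to place retrieved chunks in a prompt. The function name build_prompt, the instruction wording, and the sample chunks are all illustrative assumptions, and the retrieval itself is taken as already done.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Put retrieved chunks ahead of the question so the answer can rely on them."""
    # Number each chunk so the model (and the reader) can point back to a source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical output of a retriever for this question.
retrieved = [
    "Refunds are accepted within 30 days of purchase.",
    "Refund requests must include the original receipt.",
]
prompt = build_prompt("What is the refund window?", retrieved)
```

The string in prompt is what gets sent to the LLM. Everything upstream of it is retrieval.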

Retrieval comes in three families

“Find the relevant pieces” sounds simple, but relevant can mean different things, and each meaning needs different machinery. There are three families, and almost everything in RAG is one of them or a combination:

RETRIEVAL
│
├─ 1. Database lookup   → match exact values         (the B-tree index in SQL)
│
├─ 2. Keyword search    → match the words, ranked     (inverted index + BM25)
│
└─ 3. Semantic search   → match the meaning           (embeddings + cosine)

We’ll take them in order. Each one fixes a limitation of the one before it.

Family 1: Database lookup (and why it’s not enough)

Start with something familiar: a database. Ask it SELECT * FROM users WHERE id = 42 over ten million rows and it answers instantly. It doesn’t scan every row. It walks an index, usually a B-tree: a sorted tree that’s great at two things.

  Exact lookup:   "id = 42"            → jump straight to it
  Range lookup:   "age between 30–40"  → find the start, walk in order
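A rough way to feel both operations without a database: a sorted list plus binary search. This is only a stand-in for a B-tree (a real one is a disk-friendly tree of sorted pages), and the sample rows and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# A sorted list of (key, row_id) pairs stands in for the B-tree's sorted keys.
age_index = sorted([(34, "u1"), (27, "u2"), (41, "u3"), (30, "u4"), (38, "u5"), (30, "u6")])
keys = [key for key, _ in age_index]


def exact_lookup(value: int) -> list[str]:
    """Binary-search to the first match, then read the run of equal keys."""
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return [row_id for _, row_id in age_index[lo:hi]]


def range_lookup(low: int, high: int) -> list[str]:
    """Find the start of the range, then walk forward in sorted order."""
    lo, hi = bisect_left(keys, low), bisect_right(keys, high)
    return [row_id for _, row_id in age_index[lo:hi]]


print(exact_lookup(30))      # rows whose age is exactly 30
print(range_lookup(30, 40))  # rows whose age is between 30 and 40, in key order
```

Neither function ever scans the full list: each does two binary searches and then reads one contiguous slice.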

That’s perfect for structured, exact questions. But RAG questions aren’t like that. You don’t ask “find the row whose text equals cheap car.” You ask “find documents about cheap cars,” ranked by how relevant they are, across millions of words in any order. A B-tree can’t do “about.” It matches values, not relevance over text.

The gap: databases find exact, ordered values. Search needs to rank messy text by relevance. That’s why search engines exist: Family 2.

Family 2: Keyword search (lexical / BM25)

The trick that makes text search fast is the inverted index. Instead of mapping document → its words, you map word → the documents that contain it:

  "refund"  → [doc3, doc7, doc41, doc88]
  "window"  → [doc7, doc12, doc41]
  "30days"  → [doc41, doc55]

To answer “refund window,” you look up each word, intersect the lists (here: doc7 and doc41), and instantly have the documents that contain both words, without scanning the rest of the corpus. This is what Lucene, Elasticsearch, and Tantivy (the engine under RedHop’s keyword tier) are built on.
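A minimal sketch of that structure in plain Python, assuming whitespace-split lowercase words as tokens and a tiny made-up corpus. Real engines store compressed, sorted posting lists on disk, but the lookup has the same shape.

```python
from collections import defaultdict

# Tiny illustrative corpus; the ids and texts are invented for this example.
docs = {
    "doc3": "refund policy overview",
    "doc7": "refund window for online orders",
    "doc12": "window installation guide",
    "doc41": "the refund window is 30 days",
}

# word -> set of ids of the documents that contain it
inverted: dict[str, set[str]] = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)


def candidates(query: str) -> set[str]:
    """Documents containing every query word (AND semantics)."""
    postings = [inverted.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


print(sorted(candidates("refund window")))
```

Only the posting lists for "refund" and "window" are read. Documents that mention neither word are never touched.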

But to build that index, two things have to happen first, and they matter for every kind of search, so it’s worth slowing down here.

Tokenization: turning text into “words”

Computers don’t see words. They see characters. Tokenization splits raw text into the terms you actually index:

  "Cheap, fast cars!"   ──tokenize──▶   ["cheap", "fast", "cars"]
                                          │        │       │
                                   lowercase, drop punctuation,
                                   maybe stem:  cars → car

Typical steps: split on spaces/punctuation, lowercase, optionally drop stop words (the, a, of), optionally stem (running → run, cars → car). The key consequence: two texts only match if they share tokens after this step. Hold onto that: it’s exactly what Family 3 is invented to escape.
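Those steps can be sketched in a few lines. The stop-word list and the suffix-stripping stemmer below are deliberately crude assumptions made for illustration; production systems use a proper stemmer such as Porter or Snowball.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to"}  # illustrative, not a standard list


def naive_stem(token: str) -> str:
    """Crude suffix stripping, only good enough to show the idea."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # "running" -> "runn" -> "run": undo the doubled consonant
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]
        return token
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stop words, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(tokenize("Cheap, fast cars!"))
print(set(tokenize("cheap car")) & set(tokenize("affordable automobile")))
```

The second print is the point to remember: after tokenization the two phrases share no tokens at all, so a token-matching index has nothing to connect them with.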

(Tokenization is also where code is special: you do not want getUserById lowercased and split into get user id. Identifiers need a different tokenizer than prose.)
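One possible shape for a code-aware tokenizer, shown next to a prose-style one for contrast. Both regexes are assumptions chosen for this sketch, not a description of how any particular engine tokenizes code.

```python
import re


def tokenize_prose(text: str) -> list[str]:
    """Prose-style: lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_code(source: str) -> list[str]:
    """Code-style: keep identifiers whole and case-sensitive."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", source)


line = "user = getUserById(user_id)"
print(tokenize_prose(line))
print(tokenize_code(line))
print(tokenize_code("ERR_CONN_RESET"))
```

With the code-style tokenizer, a query for getUserById or ERR_CONN_RESET matches that exact token and nothing else.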

Chunking: the unit you retrieve

You also rarely index a whole document as one blob: a hit that says “somewhere in these 500 pages” is useless. So you split documents into chunks, and a chunk becomes the thing you retrieve and hand to the LLM.

  one big document ──chunk──▶  [chunk 1][chunk 2][chunk 3] …   each ~a paragraph

Chunk size is a real trade-off:

too big → the chunk mixes several topics, and you waste context budget on filler.
too small → you cut the answer off from the context that explains it.
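A minimal sketch of one common approach: pack whole paragraphs into chunks up to a word budget, so a chunk never ends mid-paragraph. The function name and the max_words default are hypothetical; real chunkers often count model tokens instead of words and may add overlap between neighboring chunks.

```python
def chunk_paragraphs(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e f\n\ng h i"
print(chunk_paragraphs(sample, max_words=6))
```

Raising max_words moves toward the "too big" failure, lowering it moves toward "too small", which is why the value is worth tuning per content type.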

Scoring: BM25

Finding candidates is step one. Ranking them is step two. BM25 is the decades-old workhorse, and its intuition is simple:

a word you searched for that appears often in a document → more relevant.
but a word that’s rare across all documents (indemnification) is a stronger signal than a common one (the).
with two sensible corrections: extra repeats count for less (diminishing returns), and long documents don’t get to win just for being long.

No model, no training, fully offline, and stronger than people expect.
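The three intuitions map directly onto the formula. Below is a compact sketch of BM25 using the non-negative IDF variant popularized by Lucene and the commonly used defaults k1 = 1.5 and b = 0.75. The parameter values and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score the same.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # Saturating term frequency: repeats help, with diminishing returns.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


corpus = [
    ["the", "refund", "window", "is", "30", "days"],
    ["the", "window", "installation", "guide"],
    ["the", "shipping", "policy"],
]
print(bm25_scores(["refund", "window"], corpus))
```

The first document wins because it matches both terms, and most of its score comes from "refund", the rarer of the two.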

Where keyword search shines, and where it breaks

It’s excellent when the exact words are the signal:

  query:  ERR_CONN_RESET
  → BM25 nails it; an "AI" approach would drift to "connection errors" in general

Code, logs, error codes, API names, filenames, legal clause numbers: keyword search wins, because exact token identity is the whole point.

But it matches words, not meaning. When the question and the answer use different words for the same idea, it goes blind:

  query:     "cheap car"
  document:  "affordable automobile"
                ↑              ↑
            cheap≠affordable  car≠automobile     → zero overlap, BM25 misses it

  query:     "how do I leave the agreement early?"
  document:  "this contract may be terminated prior to expiration"
                                          → a human sees the match; BM25 doesn't

This is the vocabulary-mismatch problem, and it’s what Family 3 solves.

Family 3: Semantic search (embeddings)

The idea behind semantic search is bold: turn meaning into coordinates.

An embedding model reads a piece of text and outputs a vector: a fixed-length list of numbers (say 384 of them). It’s trained so that texts with similar meaning land at nearby points in space, regardless of the exact words:

  "car"        → ●┐
  "automobile" → ●┘  close together   (similar meaning)

  "doctor"     → ●┐
  "physician"  → ●┘  close together

  "banana"     → ●     far from all of the above

Now “search” becomes “geometry”: embed the query, and the best documents are the ones whose vectors are closest. cheap car and affordable automobile land near each other even with no shared words, exactly the case BM25 missed.

Measuring “close”: cosine similarity

To compare two vectors you measure the angle between them: cosine similarity. Small angle → same direction → similar meaning:

        ▲
        │      ↗ automobile
        │    ↗
        │  ↗  car          small angle  → cosine ≈ 1   (similar)
        │ ↗
        │╱______________▶
        │
        └──────▶ banana    wide angle   → cosine ≈ 0   (unrelated)

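Cosine similarity ranges from -1 to 1: 1 means the vectors point the same way (same meaning), 0 means they are unrelated, and negative values mean they point in opposing directions. For text embeddings, scores mostly fall between 0 and 1 in practice. We use the angle, not the distance, because we care about direction (meaning), not how long the vector is.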
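The computation itself is short. The three-dimensional vectors below are hand-made toys, not the output of any real embedding model; they only exist to show the arithmetic.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.0]
banana = [0.0, 0.1, 0.9]

print(cosine_similarity(car, automobile))  # close to 1
print(cosine_similarity(car, banana))      # close to 0
print(cosine_similarity(car, [2 * x for x in car]))  # length does not matter
```

Doubling every component of a vector changes its length but not its direction, so the score stays the same. That is the property that makes cosine a measure of meaning and not of magnitude.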

Tokenization & chunking still apply, but differently

Embeddings tokenize too (the model has its own tokenizer). The big difference is chunking. In keyword search every token is indexed on its own, so an oversized chunk just adds noise. In semantic search the whole chunk is squeezed into one vector, a single point standing for everything it says:

  keyword:   chunk → [ every token indexed separately ]
  semantic:  chunk → [ one vector for the whole thing ]

So chunk size bites harder: too big and the vector becomes a blurry average of several topics that matches nothing sharply. Too small and a fragment’s vector loses the context that gave it meaning. Semantic search wants one idea per chunk.
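A toy illustration of the "blurry average" effect. It assumes, as a simplification, that a chunk’s vector behaves like the mean of the vectors of the topics it covers. Real embedding models are not literally averaging, and the one-axis-per-topic vectors are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise average, standing in for 'one vector for the whole chunk'."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy topic vectors: one axis per topic.
refunds = [1.0, 0.0, 0.0]
shipping = [0.0, 1.0, 0.0]
warranty = [0.0, 0.0, 1.0]

query = [1.0, 0.0, 0.0]  # a question purely about refunds

focused_chunk = refunds                                    # one idea per chunk
mixed_chunk = mean_vector([refunds, shipping, warranty])   # three ideas in one chunk

print(cosine(query, focused_chunk))
print(cosine(query, mixed_chunk))
```

The mixed chunk still contains the refund text, but its score against the refund query drops well below the focused chunk’s, so it can lose to other chunks in the ranking.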

Where semantic search breaks

It’s the mirror image of keyword search. It struggles when exact tokens matter, and it tends to over-generalize:

  query:  "password reset"
  semantic may return:  security policy · auth docs · account settings
  instead of:           the actual reset instructions

It smears a precise query into a neighborhood of vibes: great for paraphrase, risky for exactness.

The scale problem → ANN

Scoring by cosine means comparing the query against every chunk. For a document, a repo, or a folder (thousands of chunks) that’s a few milliseconds, exact and simple. But at millions of vectors, comparing against all of them per query is too slow.
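Exact search really is this simple. A minimal sketch in plain Python, with made-up two-dimensional vectors and chunk ids; a real implementation would normalize the stored vectors once and use a vectorized library for the dot products.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)


def top_k(query: list[float], chunk_vectors: dict[str, list[float]], k: int = 3):
    """Compare the query against every chunk and keep the k best."""
    q = normalize(query)
    scored = (
        (sum(x * y for x, y in zip(q, normalize(vec))), chunk_id)
        for chunk_id, vec in chunk_vectors.items()
    )
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]


# Toy vectors, invented for illustration.
chunk_vectors = {
    "refund-steps": [0.9, 0.1],
    "shipping": [0.1, 0.9],
    "refund-policy": [0.7, 0.3],
}
print(top_k([1.0, 0.0], chunk_vectors, k=2))
```

The cost is one dot product per chunk per query, which is the O(N) that becomes a problem at millions of vectors.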

So large systems switch to Approximate Nearest Neighbor (ANN) search: find the almost-closest vectors, far faster, accepting an occasional miss:

  exact:  compare query to ALL N vectors         → correct, O(N), fine for 1000s
  ANN:    cluster / graph-walk to likely matches  → ~correct, much faster, for millions

ANN (via HNSW graphs, IVF clustering, PQ compression) is what vector search libraries and databases (FAISS, Qdrant, pgvector, LanceDB, …) are fundamentally for. The trade-off is honest: exact cosine is correct and needs no server, but doesn’t scale to millions. ANN scales, but adds an index to build, tune, and operate.
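To make the "occasional miss" concrete, here is a toy version of the IVF idea: bucket vectors by their nearest centroid, then search only the buckets nearest the query. The centroids are hand-picked and the parameter name nprobe is borrowed from common usage; real IVF indexes learn centroids with k-means, so treat this as a sketch of the principle only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_ivf(vectors: dict[str, list[float]], centroids: list[list[float]]) -> list[list[str]]:
    """Assign each vector id to the bucket of its most similar centroid."""
    buckets: list[list[str]] = [[] for _ in centroids]
    for vec_id, vec in vectors.items():
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        buckets[best].append(vec_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k=3, nprobe=1) -> list[str]:
    """Score only the vectors in the nprobe buckets closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: cosine(query, centroids[i]), reverse=True)
    candidate_ids = [vec_id for i in order[:nprobe] for vec_id in buckets[i]]
    ranked = sorted(candidate_ids, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]


# Toy data, invented for illustration: two obvious clusters.
vectors = {
    "a1": [1.0, 0.1], "a2": [0.9, 0.2], "a3": [0.8, 0.3],
    "b1": [0.1, 1.0], "b2": [0.2, 0.9], "b3": [0.3, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]
buckets = build_ivf(vectors, centroids)

boundary_query = [0.55, 0.45]  # sits between the two clusters
print(ivf_search(boundary_query, vectors, centroids, buckets, k=3, nprobe=1))
print(ivf_search(boundary_query, vectors, centroids, buckets, k=3, nprobe=2))
```

With nprobe=1 only half the vectors are scored, and the query near the cluster boundary loses its true third-best match (b3) to a1. With nprobe=2 every bucket is searched and the result is exact again. That speed-versus-recall dial is the tuning cost that ANN brings with it.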

Putting it together: hybrid + rerankers

Two families with opposite strengths → use both. Hybrid retrieval runs keyword and semantic search and merges the results (a common, robust merge is Reciprocal Rank Fusion: combine by rank position, so you don’t have to reconcile incompatible scores).
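Reciprocal Rank Fusion fits in a few lines: each list contributes 1 / (k + rank) for every document it contains, and the sums decide the final order. The constant k = 60 is the value commonly used in practice; the document ids and the alphabetical tie-break are assumptions for this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists using only rank positions, never the raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep the output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["doc41", "doc7", "doc3"]    # e.g. from BM25
semantic_hits = ["doc7", "doc55", "doc41"]  # e.g. from cosine similarity
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents that both retrievers rank well (doc7, doc41) rise to the top, and neither BM25 scores nor cosine values ever need to be put on a common scale.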

Then a reranker can sharpen the top results. First-stage retrieval is built for speed over the whole corpus. A cross-encoder reranker reads the (query, passage) pair together and judges relevance far more precisely. That makes it too slow for the whole corpus, but well suited to a final pass over a few dozen candidates:

  query ─▶ keyword + semantic ─▶ ~50 candidates ─▶ cross-encoder rerank ─▶ top 5 ─▶ LLM
            (fast, wide)                            (slow, precise)
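The two-stage shape can be sketched with pluggable scoring functions. The two scorers below (word overlap and exact phrase match) are simple stand-ins so the example runs; in a real system the first stage would be BM25 and/or embeddings, and the second a cross-encoder model. All names here are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, passage) -> relevance score


def retrieve_then_rerank(query: str, passages: list[str],
                         cheap_score: Scorer, precise_score: Scorer,
                         n_candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1: cheap scorer over the whole corpus, keep a wide candidate set.
    candidates = sorted(passages, key=lambda p: cheap_score(query, p),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive scorer, but only on the survivors.
    return sorted(candidates, key=lambda p: precise_score(query, p),
                  reverse=True)[:top_k]


def word_overlap(query: str, passage: str) -> float:
    """Stand-in first stage: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def phrase_match(query: str, passage: str) -> float:
    """Stand-in second stage: does the passage contain the exact phrase?"""
    return 1.0 if query.lower() in passage.lower() else 0.0


passages = [
    "reset your password from the login page",
    "password reset instructions: open settings",
    "our security policy covers password rules",
    "shipping takes five days",
]
print(retrieve_then_rerank("password reset", passages,
                           word_overlap, phrase_match, n_candidates=3, top_k=2))
```

The expensive scorer runs on three passages here, not four; at corpus scale that gap is what makes a slow, precise model affordable.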

The real lesson: they fail differently

It’s tempting to ask “which is best?”, but the durable insight is that keyword and semantic search fail in opposite places:

  keyword  fails on:  paraphrase, synonyms           (cheap car vs affordable automobile)
  semantic fails on:  exact tokens, over-generalizes  (password reset → security policy)

That’s why hybrid and rerankers exist: their strengths are complementary. And it’s why the interesting question isn’t “which embedding model is best?” but the deeper one: where does matching words help, and where does matching meaning help? Choosing the right tool per query and per content type matters more than squeezing points out of one model.

Where RedHop fits

RedHop is built around that question. It starts at the cheapest rung and climbs only when a query needs it:

lexical: BM25 keyword search, the default (no model, fully offline).
hybrid: BM25 narrows the field, embeddings rerank by meaning. Code stays keyword and the lists are fused.
semantic: exact cosine over every chunk, for bounded corpora where you want top recall and no ANN to run.
rerank="cross-encoder": the optional precise final pass, on any tier.

Everything in this guide (tokenization, chunking, BM25, cosine, fusion, reranking, and the exact-vs-ANN trade-off) is a knob or a default in that design.

→ See it applied: Retrieval options · How the search works · Retrieval & context tips.