RAG Retrieval Is Filtering, Not Search

I have been building RAG pipelines for two years. The mental model I started with was wrong, and reading Angela Shi’s article “Retrieval Is Filtering, Not Search” on Towards Data Science this week made the fix click.

The standard framing of RAG retrieval is “find the passages most similar to the query.” That framing is misleading because it treats retrieval as a Google-style search across unstructured text. The article argues that retrieval is a filtering problem on structured tables, so the closer mental model is a SQL query.

This is the article that should have existed when I started. Here is what I learned, and what I am changing in my own RAG pipelines because of it.

What "filtering on structured tables" means

The argument starts at the parsing stage. Before retrieval runs, the document has already been parsed into DataFrames. Two tables carry most of the load.

line_df, the dense table. One row per line of the document, which means tens of thousands of rows for a long contract. Columns include the text, the page number, the line number, the bounding box, and a section_id that links each line to its section in the table of contents. Every line is a candidate, so filtering needs to be cheap per row.

toc_df, the sparse table. One row per section in the table of contents, typically 20 to 100 rows for an enterprise document. Columns include the section title, the level (1, 2, 3), the page range, the parent section, and a stable section_id. Each row covers a large block of the document, and there are few rows, so filtering can afford to be expensive per row.

The size difference reshapes what is even possible. A method that is feasible on toc_df (passing the whole table to an LLM, embedding every entry, running multi-hop reasoning) may be entirely infeasible on line_df. A method that is natural on line_df (regex over thousands of lines, fast keyword scoring) is wasteful on toc_df because there is not enough data to discriminate.
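
To make the size asymmetry concrete, here is a minimal sketch of the two tables. It assumes plain Python lists of dicts as stand-ins for the DataFrames so that it runs without dependencies; the column names follow the description above (the bounding box is left out for brevity), while the rows and the helper names filter_lines and toc_as_prompt are invented for illustration and do not come from the article.

```python
import re

# Toy stand-ins for the two tables. Column names follow the post; all values are invented.
toc_df = [
    {"section_id": "s1", "title": "Definitions", "level": 1,
     "page_start": 1, "page_end": 2, "parent_id": None},
    {"section_id": "s2", "title": "Premium and Payment", "level": 1,
     "page_start": 3, "page_end": 4, "parent_id": None},
    {"section_id": "s3", "title": "Specific Exclusions", "level": 1,
     "page_start": 5, "page_end": 7, "parent_id": None},
]

line_df = [
    {"line_no": 1, "page": 1, "section_id": "s1", "text": "Insured means the policyholder."},
    {"line_no": 2, "page": 3, "section_id": "s2", "text": "The annual premium is EUR 1,200."},
    {"line_no": 3, "page": 3, "section_id": "s2", "text": "Payment is due on 1 January."},
    {"line_no": 4, "page": 5, "section_id": "s3", "text": "Flood damage is excluded."},
]


def filter_lines(lines, pattern):
    """Cheap per-row filter for the dense table: one regex test per line."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [row for row in lines if rx.search(row["text"])]


def toc_as_prompt(toc):
    """Expensive-per-row option for the sparse table: serialize every row
    so the whole table could be handed to an LLM in a single prompt."""
    return "\n".join(
        f'{row["section_id"]} | level {row["level"]} | '
        f'pages {row["page_start"]}-{row["page_end"]} | {row["title"]}'
        for row in toc
    )


premium_hits = filter_lines(line_df, r"\bpremium\b")
toc_prompt = toc_as_prompt(toc_df)
```

The regex pass stays cheap even at tens of thousands of rows. Serializing the whole table is only reasonable because toc_df has a few dozen rows.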

When I read this, my first reaction was “yes, that is what I am actually doing.” My second reaction was “I have been describing it wrong in every meeting for two years.” I have been calling it “vector search” when the right words are “structured-table filtering.” The words matter because they change what methods you reach for.

The anchor and context distinction

The part of the article that changed how I think about my own pipelines is the anchor/context split. These are two different design decisions.

The anchor is the row where the matching signal lands. It is small and precise. A single line of line_df. A title in toc_df. A sentence. The anchor is what you score and rank. Smaller anchors give cleaner signals.

The context is the chunk you pass downstream to generation. It is large and sufficient. A paragraph. A section. A window of N lines. The context is what the LLM actually reads when it generates an answer.

These are independent. You might anchor on a single line of line_df that mentions “premium,” but pass the whole surrounding section to generation so the LLM sees the value in context. You might anchor on a toc_df title like “Section 5: Specific Exclusions,” but the context is the entire section’s body lines.

I have been collapsing these into a single “chunk size” parameter for the last two years. The article made me realize that I was conflating two decisions that need to be made separately. The right setup is two parameters: anchor scope (line, title, sentence) and context scope (paragraph, section, window). My old pipelines had one parameter. That is half the degrees of freedom they should have.
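
A minimal sketch of the split, assuming a line-level anchor found by plain substring match and two context scopes. The Line class, find_anchors, expand_context, and the sample lines are hypothetical names and data for illustration, not the article's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    line_no: int
    section_id: str
    text: str


def find_anchors(lines, keyword):
    """Anchor step: match at line granularity (here, a plain substring test)."""
    return [ln for ln in lines if keyword.lower() in ln.text.lower()]


def expand_context(anchor, lines, context_scope, window=2):
    """Context step: grow the anchor into the block passed to generation."""
    if context_scope == "section":
        return [ln for ln in lines if ln.section_id == anchor.section_id]
    if context_scope == "window":
        return [ln for ln in lines if abs(ln.line_no - anchor.line_no) <= window]
    raise ValueError(f"unknown context_scope: {context_scope!r}")


lines = [
    Line(1, "s1", "Definitions apply throughout."),
    Line(2, "s2", "Article 4: Premium"),
    Line(3, "s2", "The annual premium is EUR 1,200."),
    Line(4, "s2", "It is payable in two instalments."),
    Line(5, "s2", "Late payment incurs interest."),
    Line(6, "s3", "Flood damage is excluded."),
]

anchor = find_anchors(lines, "annual premium")[0]
section_context = expand_context(anchor, lines, "section")          # lines 2 to 5
window_context = expand_context(anchor, lines, "window", window=1)  # lines 2 to 4
```

The same anchor yields two different contexts, which is the reason the two scopes should not share one parameter.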

The four question types

The article gives a concrete way to think about which context size to use. Four question types show up in real enterprise document work, and each one wants a different context scope.

Type 1: needle in a haystack. “What is the policy number?” The answer is a single value, probably in the header. The context should be page 1, maybe 5 lines. This is the question class that the Needle-in-a-Haystack benchmark tests, and frontier models score near-perfectly on it.

Type 2: point lookup. “What is the annual premium?” The answer is a clause that mentions “premium” (or, in a French contract, “prime” or “cotisation”) with an amount. The context should be the section that contains that clause, 50 to 200 lines.

Type 3: listing. “What are all the obligations of the seller?” The answer is multiple passages scattered through the document. The context should be the entire “Obligations” section, plus any other section that contains an obligation. 500 to 2000 lines.

Type 4: scoped synthesis. “Summarize the warranty section.” The answer draws on the entire warranty section body, so the context should be that whole section, 200 to 1500 lines.

A pipeline that uses cosine similarity with top-k=5 for all four types will be wrong on at least three. The right context size depends on which question class is being asked. The article made me realize I have been hardcoding one context size (usually the type 2 size, around 100 lines) and hoping it would cover the others. It does not. Type 3 and type 4 questions need their own context sizing.
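
A quick check of why one hardcoded size fails, using the line ranges quoted for the four types. The (1, 5) range for the needle type is my assumption based on “maybe 5 lines”, and the names CONTEXT_RANGE, fits, and mismatched_types are illustrative.

```python
# Context-size ranges in lines, taken from the four question types above.
# The (1, 5) range for the needle type is an assumption based on "maybe 5 lines".
CONTEXT_RANGE = {
    "needle": (1, 5),
    "point_lookup": (50, 200),
    "listing": (500, 2000),
    "scoped_synthesis": (200, 1500),
}


def fits(question_type, context_lines):
    """True if a context of this many lines falls inside the type's range."""
    low, high = CONTEXT_RANGE[question_type]
    return low <= context_lines <= high


def mismatched_types(context_lines):
    """Question types for which a fixed context size falls outside the range."""
    return [qt for qt in CONTEXT_RANGE if not fits(qt, context_lines)]


fixed_100 = mismatched_types(100)
# -> ['needle', 'listing', 'scoped_synthesis']
```

A fixed 100-line context sits inside the point-lookup range and outside the other three.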

What I am changing in my own RAG

Three concrete things changed in my pipelines after I read this.

First, I now parse documents into structured tables before retrieval runs. My previous setup was a vector store with embeddings and chunked text. That setup treated the document as free text, which is exactly the framing the article warns against. The new setup parses the document at ingestion into line_df and toc_df (and page_df and image_df if I want them), stores them as actual tables, and runs retrieval as filter operations on those tables. This is more work upfront. It pays off when the same document gets queried at different anchor/context scopes because the tables are already there.

Second, I split the anchor/context decision into two parameters. My old code had one parameter called chunk_size. My new code has anchor_scope (line, title, sentence) and context_scope (paragraph, section, window). The retrieval pipeline picks the anchor first, then sizes the context around it. Most of the time, anchor and context are not the same unit. They should not be the same parameter.

Third, I classify the question type before picking context scope. I added a step in the question-parsing stage that classifies the question as needle, point lookup, listing, or scoped synthesis. The context size is then set by the question type, not by a default. Type 1 questions get 5 lines of context. Type 4 questions get 1000 lines. The classification is a single LLM call on a 30-token input. It costs almost nothing. It fixes the failure mode where Type 3 and Type 4 questions get a Type 2-sized context and the answer comes back truncated.
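
The step looks roughly like this. It is a sketch, not my production code: the LLM call is replaced by a crude keyword heuristic (keyword_classifier, a hypothetical stand-in) so the example runs offline, and the 1000-line figure for listings is an assumption picked from inside the 500 to 2000 range.

```python
from typing import Callable

# Lines of context per question type. 5 and 1000 come from the text above;
# 100 matches the old Type 2 default; 1000 for listings is an assumption.
CONTEXT_LINES = {
    "needle": 5,
    "point_lookup": 100,
    "listing": 1000,
    "scoped_synthesis": 1000,
}


def keyword_classifier(question: str) -> str:
    """Hypothetical stand-in for the LLM call: a crude keyword heuristic."""
    q = question.lower()
    if q.startswith(("summarize", "summarise")):
        return "scoped_synthesis"
    if "what are all" in q or q.startswith("list "):
        return "listing"
    if "number" in q or "date" in q:
        return "needle"
    return "point_lookup"


def pick_context_lines(
    question: str,
    classify: Callable[[str], str] = keyword_classifier,
) -> int:
    """Classify first, then size the context from the question type."""
    question_type = classify(question)
    return CONTEXT_LINES[question_type]
```

Because classify is a parameter, the heuristic can be swapped for a real LLM call without touching the sizing logic.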

The mental model shift

The article frames the whole thing as “amplify the expert.” Codify the expert’s workflow, then do it better than they can manually. The expert opens a PDF, hits Ctrl+F, types a keyword, jumps to the right paragraph. The expert also navigates by TOC when the keyword fails. The article’s pipeline does both of those things programmatically and adds three lifts: multi-keyword co-occurrence detection, OCR for scanned pages, and structured TOC-content joins.
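
The first of those lifts, multi-keyword co-occurrence, can be sketched as a sliding window over lines. This is my own guess at a minimal version and not the article's method: the function name cooccurrence_anchors, the window size, and the substring matching are all assumptions.

```python
def cooccurrence_anchors(lines, keywords, window=3):
    """Return (start, end) line-index spans of at most `window` lines
    in which every keyword appears at least once.

    Matching is case-insensitive substring matching, so short keywords
    can also hit inside longer words.
    """
    wanted = [k.lower() for k in keywords]
    lowered = [text.lower() for text in lines]
    spans = []
    for start in range(len(lowered)):
        end = min(start + window, len(lowered))
        block = " ".join(lowered[start:end])
        if all(k in block for k in wanted):
            spans.append((start, end - 1))
    return spans


lines = [
    "Article 4: Premium",
    "The annual premium is due on 1 January.",
    "The amount is EUR 1,200.",
    "Late payment incurs interest.",
    "Article 5: Exclusions",
]

spans = cooccurrence_anchors(lines, ["premium", "EUR"], window=2)
# -> [(1, 2)]: "premium" and "EUR" sit on adjacent lines
```

A single Ctrl+F for “premium” would also land on the heading line. Requiring the amount marker nearby narrows the match to the clause that carries the value.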

The shift for me is not technical. It is framing. When I think about retrieval as “vector search,” I reach for embedding models, top-k thresholds, and similarity scores. When I think about retrieval as “filtering on structured tables,” I reach for SQL-style operations: filter this column, join that table, scope this section. The methods are different. The failure modes are different. The right answer is different.
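
To take the SQL framing literally, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table names mirror line_df and toc_df; the reduced column set and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toc_df (section_id TEXT PRIMARY KEY, title TEXT, level INTEGER);
CREATE TABLE line_df (line_no INTEGER PRIMARY KEY, section_id TEXT, text TEXT);
INSERT INTO toc_df VALUES
  ('s1', 'Premium and Payment', 1),
  ('s2', 'Specific Exclusions', 1);
INSERT INTO line_df VALUES
  (1, 's1', 'The annual premium is EUR 1,200.'),
  (2, 's1', 'Payment is due on 1 January.'),
  (3, 's2', 'Flood damage is excluded.'),
  (4, 's2', 'War risks are excluded.');
""")

# Filter a column: anchor on lines that mention the keyword.
anchors = conn.execute(
    "SELECT line_no FROM line_df WHERE text LIKE ? ORDER BY line_no",
    ("%premium%",),
).fetchall()

# Join and scope: anchor on a TOC title, return that section's body lines.
section_lines = conn.execute(
    """
    SELECT l.line_no, l.text
    FROM toc_df AS t
    JOIN line_df AS l ON l.section_id = t.section_id
    WHERE t.title LIKE ?
    ORDER BY l.line_no
    """,
    ("%Exclusions%",),
).fetchall()
```

The first query is a line-level anchor. The second is the title-anchor, section-context pattern: the match lands on one toc_df row and the join pulls in that section's body lines.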

The right context for “what is the policy number” is 5 lines around a single anchor. The right context for “summarize the warranty section” is 1000 lines scoped to one section. Both questions go through the same RAG pipeline. The pipeline just decides the context scope per question. That is the point.

The takeaway

Retrieval in enterprise RAG is a filtering problem on structured tables, not a search problem across unstructured text. The mental model is closer to SQL than to Google. Two tables carry most of the load: a dense line_df (one row per document line) and a sparse toc_df (one row per section). The anchor (small precise row where the match lands) and the context (larger block passed to generation) are independent design decisions, not a single chunk-size parameter. Four question types show up in real work, and each one wants a different context size.

If you are building RAG and your pipeline treats the document as free text and the retrieval as vector search, this article is worth 21 minutes of your time. The shift in mental model changes the methods you reach for. The shift in methods changes the answers you get back.

If you want to see how the rest of the framework plays out, the author published this as Part II #7A of an “Enterprise Document Intelligence” series. Article 7B is anchor detection. Article 7C is the LLM arbiter that ranks the candidates. I am reading both this week and will write about whichever one changes my pipeline the most.

References

– Angela Shi, “Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG”, Towards Data Science, Jun 23 2026.
– Angela Shi, “Enterprise Document Intelligence: A Series on Building RAG Brick by Brick, From Minimal to Corpus Scale”, Towards Data Science.
– Greg Kamradt, “Needle in a Haystack – Pressure Testing LLMs”, 2023.

If you want to see how I built the previous version of this kind of pipeline, “AI Wrote 80% in 10 Minutes. The Last 20% Took 6 Hours.” walks through how I think about context-window management for LLM agents. The RAG pattern here is the same shape: classify the question, pick the right scope, do not hardcode one size for everything.

Read more: “Why I Stopped Optimizing My AI Agent and Started Shipping It” on shipping vs tuning, and “API vs MCP: When REST Ends and Agents Begin” on how the retrieval layer connects to agent tool calls.
