May 8, 2023

Retrieval Is Data Engineering

As LLM products became more practical, retrieval felt immediately familiar. The interface was new, but many of the problems were classic data engineering problems.

What content is eligible? How fresh is it? Who owns it? How is it chunked? What metadata controls access? How do we evaluate whether the retrieved context is useful?

A retrieval system is only as good as the data product behind it. Embeddings and ranking matter, but so do document lifecycle, permissions, deduplication, and observability.

This made me optimistic about the role of data engineers in AI products. A lot of product quality depends on the unglamorous work of making knowledge reliable and retrievable.

The Retrieval Data Product

A retrieval system needs a source of truth about documents. That includes document identity, ownership, freshness, permissions, canonical URLs, deletion state, and the transformation history that produced chunks or embeddings. Without that metadata, retrieval quality becomes hard to debug.
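One way to make that metadata concrete is a pair of records, one per document and one per chunk, joined on a document ID. The sketch below is illustrative only: the class names, field names, and the trace_chunk helper are assumptions made for this example, not a schema from any particular system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class DocumentRecord:
    """Hypothetical source-of-truth row for one document."""
    doc_id: str
    owner: str
    canonical_url: str
    source_updated_at: datetime
    allowed_groups: FrozenSet[str]
    deleted: bool = False


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical row describing how one chunk was produced."""
    chunk_id: str
    doc_id: str
    chunker_version: str
    embedding_model: str
    indexed_at: datetime


def trace_chunk(chunk: ChunkRecord, docs: Dict[str, DocumentRecord]) -> Optional[dict]:
    """Join a retrieved chunk back to its document metadata for debugging."""
    doc = docs.get(chunk.doc_id)
    if doc is None:
        return None  # orphaned chunk: the source document is unknown
    return {
        "chunk_id": chunk.chunk_id,
        "doc_id": doc.doc_id,
        "owner": doc.owner,
        "canonical_url": doc.canonical_url,
        "deleted": doc.deleted,
        "chunker_version": chunk.chunker_version,
        "embedding_model": chunk.embedding_model,
        # True when the chunk was built from the latest known version of the document.
        "indexed_after_last_change": chunk.indexed_at >= doc.source_updated_at,
    }
```

With records like these, a question such as "why did this chunk show up?" becomes a lookup rather than a guess.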

Chunking is a good example. It can look like an ML or application detail, but it is also a data modeling decision. Chunk too small and context loses meaning. Chunk too large and ranking gets noisy. Ignore document structure, and tables, headings, and policy sections become harder to retrieve correctly.
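As a rough illustration of treating chunking as a modeling decision, here is a minimal sketch of a structure-aware chunker for plain text. It assumes blocks are separated by blank lines and uses a made-up heuristic for spotting headings (a short single line with no closing punctuation). The function name, the heuristic, and the 500-character default are all assumptions for the example.

```python
from typing import Dict, List


def chunk_by_section(text: str, max_chars: int = 500) -> List[Dict[str, str]]:
    """Pack paragraphs into chunks without crossing section boundaries."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs under the heading that was active for them.
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        # Assumed heuristic: a short single line without closing punctuation is a heading.
        is_heading = "\n" not in block and len(block) <= 60 and block[-1] not in ".?!:"
        if is_heading:
            flush()  # never mix two sections in one chunk
            heading = block
            continue
        buffered_chars = sum(len(b) for b in buffer)
        if buffer and buffered_chars + len(block) > max_chars:
            flush()  # size limit reached: start a new chunk in the same section
        # A single paragraph longer than max_chars is kept whole in this sketch.
        buffer.append(block)

    flush()
    return chunks
```

Each chunk carries its section heading as metadata, so a policy paragraph can still be tied to the section it came from after it has been split.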

Permissions are even more clearly a data engineering concern. The retrieval layer should not invent access rules. It should enforce rules from trusted systems and carry those rules through indexing, ranking, caching, and logging.
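A minimal sketch of that idea, assuming each indexed chunk carries a set of allowed groups copied from the trusted source system at index time: candidates are filtered by access before they are ranked. The Candidate type, its fields, and retrieve_for_user are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    score: float
    allowed_groups: FrozenSet[str]  # assumed to be copied from the source system, not invented here
    deleted: bool = False


def retrieve_for_user(candidates: List[Candidate], user_groups: Set[str], k: int = 3) -> List[Candidate]:
    """Drop chunks the user may not see, then rank what is left."""
    visible = [
        c for c in candidates
        if not c.deleted and c.allowed_groups & user_groups  # non-empty overlap means access
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]
```

Filtering before ranking means a restricted chunk never reaches the prompt, even when it is the best match. The same check would need to apply to anything served from a cache or written to a log.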

The more I work through retrieval problems, the more I see them as data platform problems with a conversational interface. The model may answer the question, but the data system decides what evidence the model is allowed to see.

Debugging Bad Answers

When an AI answer is bad, I want to separate retrieval failure from generation failure. Did the right document exist? Was it indexed? Was it eligible for the user? Was the relevant chunk retrieved? Did the model ignore the useful context?
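Those questions form an ordered checklist, and the first one that fails usually tells you which team or pipeline stage to look at. The sketch below encodes that order. The AnswerEvidence fields and the labels are assumptions for illustration, and in practice each flag would come from logs or a manual review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerEvidence:
    """Hypothetical findings collected while reviewing one bad answer."""
    doc_exists: bool
    doc_indexed: bool
    user_eligible: bool
    chunk_retrieved: bool
    answer_used_context: bool


def first_failure(e: AnswerEvidence) -> str:
    """Return the earliest stage that failed, checking in pipeline order."""
    checks = [
        (e.doc_exists, "content gap: the right document does not exist"),
        (e.doc_indexed, "indexing failure: the document was never indexed"),
        (e.user_eligible, "permissions: the document was not eligible for this user"),
        (e.chunk_retrieved, "retrieval failure: the relevant chunk was not retrieved"),
        (e.answer_used_context, "generation failure: the model ignored useful context"),
    ]
    for passed, label in checks:
        if not passed:
            return label
    return "no failure found by these checks"
```

Only the last label is a generation problem. Everything before it is a data or retrieval problem, which is the point of doing the separation.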

Those questions require logs that connect the user request to retrieved context and source metadata. Without that trail, debugging becomes subjective. People paste examples into a chat and argue about what the model “should” have known.
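A minimal sketch of such a trail, assuming the retriever returns dictionaries with chunk_id, doc_id, score, and indexed_at keys: one JSON line per request that records what was retrieved and in what order. The function name and every field name are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_retrieval_log(
    request_id: str,
    user_id: str,
    query: str,
    retrieved: List[dict],
    now: Optional[datetime] = None,
) -> str:
    """Serialize one request and its retrieved context as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    record = {
        "request_id": request_id,
        "user_id": user_id,
        "query": query,  # query text may be sensitive; access to logs needs the same care as the index
        "logged_at": now.isoformat(),
        "retrieved": [
            {
                "rank": rank,
                "chunk_id": r["chunk_id"],
                "doc_id": r["doc_id"],
                "score": r["score"],
                "indexed_at": r["indexed_at"],
            }
            for rank, r in enumerate(retrieved, start=1)
        ],
    }
    return json.dumps(record, sort_keys=True)
```

With a record like this, "what did the model see for this request?" has a factual answer that can be joined back to source metadata by doc_id.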

Retrieval also needs freshness checks. A stale index can make a model sound confidently wrong. When a source document changes, the system should be able to say when that document was last indexed and whether the index has caught up with the change.
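A freshness check can be as simple as comparing two timestamps per document. The sketch below assumes you can obtain a last-modified time from the source system and a last-indexed time from the index. The function name, the two mappings, and the grace parameter are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_documents(
    source_updated_at: Dict[str, datetime],
    indexed_at: Dict[str, datetime],
    grace: timedelta = timedelta(0),
) -> List[str]:
    """Return IDs of documents whose index entry is missing or older than the source."""
    stale = []
    for doc_id, updated in source_updated_at.items():
        indexed = indexed_at.get(doc_id)
        # Missing from the index, or changed more than `grace` after it was last indexed.
        if indexed is None or updated - indexed > grace:
            stale.append(doc_id)
    return sorted(stale)
```

The grace window is there because most pipelines index on a schedule, so a short lag is expected and only lag beyond it is worth an alert.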

This is why retrieval belongs in the data engineering conversation. It is a production data pipeline whose output is not a table but the context behind a product experience.