May 6, 2024

Designing for Data Minimization

Data minimization can sound restrictive until it is treated as a design tool. Asking “what is the smallest set of data that is still useful?” often produces cleaner architecture.

Raw events may be necessary at ingestion, but many consumers do not need raw identifiers, full payloads, or indefinite history. Derived tables can be narrower, aggregated, or purpose-built with access boundaries that match the use case.
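As a rough sketch of what a narrower derived table can look like, the Python below turns raw events into daily counts per event type. The event fields (`user_id`, `email`, `payload`, `ts`) and the function name `daily_event_counts` are hypothetical, chosen only for illustration. The point is that the derived rows carry no identifiers or payloads at all.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events as they might look at ingestion.
# All field names and values here are illustrative assumptions.
RAW_EVENTS = [
    {"user_id": "u-1", "email": "a@example.com", "event": "search",
     "ts": "2024-05-01T09:15:00", "payload": {"q": "refund policy"}},
    {"user_id": "u-2", "email": "b@example.com", "event": "search",
     "ts": "2024-05-01T17:40:00", "payload": {"q": "invoice"}},
    {"user_id": "u-1", "email": "a@example.com", "event": "purchase",
     "ts": "2024-05-01T18:02:00", "payload": {"sku": "X-100"}},
    {"user_id": "u-3", "email": "c@example.com", "event": "search",
     "ts": "2024-05-02T08:05:00", "payload": {"q": "shipping"}},
]


def daily_event_counts(events):
    """Build a narrow, aggregated view: one row per (day, event type).

    Identifiers and payloads are never copied into the result, so a
    consumer of this view cannot recover them.
    """
    counts = Counter()
    for event in events:
        day = datetime.fromisoformat(event["ts"]).date().isoformat()
        counts[(day, event["event"])] += 1
    return [
        {"day": day, "event": name, "count": n}
        for (day, name), n in sorted(counts.items())
    ]


for row in daily_event_counts(RAW_EVENTS):
    print(row)
```

An analyst who only needs trends can work from this view and never request access to the raw table. A real pipeline would likely do the same thing in SQL or a dataframe library, and might also suppress very small counts.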

This improves privacy, but it also improves comprehension. Smaller datasets with clearer purpose are easier to document, test, and govern.

I have come to see minimization as a forcing function. It asks teams to explain why data exists, who it serves, and when it should disappear.

Purpose-Built Data Surfaces

Minimization works best when teams have good alternatives to raw data. If the only useful table contains every event and identifier, people will keep asking for it. A better platform provides purpose-built surfaces: analytics marts, aggregated features, permission-safe document indexes, and narrow operational views.

That design requires understanding the consumer. An analyst may need group-level trends. A model may need stable behavioral features. A support workflow may need recent account context. Those use cases do not all need the same data shape or sensitivity level.

I also think minimization improves change management. Narrow datasets are easier to version, test, and deprecate. They have clearer owners and fewer accidental consumers. When a dataset has a specific purpose, it is easier to know when that purpose has ended.
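One way to make purpose and ownership checkable is to record them next to each dataset. The sketch below assumes a hypothetical in-memory catalog. The `DatasetRecord` fields and the `needs_review` helper are illustrative names, not part of any particular catalog tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry: why a dataset exists and for how long."""
    name: str
    owner: str
    purpose: str
    review_by: Optional[date]  # date by which the purpose should be re-confirmed


def needs_review(records, today):
    """Return names of datasets with no stated purpose, no review date,
    or a review date that has already passed."""
    return [
        r.name
        for r in records
        if not r.purpose.strip() or r.review_by is None or r.review_by <= today
    ]


# Illustrative catalog contents (all values are made up).
CATALOG = [
    DatasetRecord("support_recent_accounts", "support-platform",
                  "Recent account context for support agents", date(2024, 12, 31)),
    DatasetRecord("launch_experiment_2023", "growth",
                  "Launch experiment readout", date(2024, 1, 31)),
    DatasetRecord("events_copy", "unknown", "", None),
]

print(needs_review(CATALOG, today=date(2024, 5, 6)))
```

A check like this does not decide anything by itself. It only surfaces the datasets whose purpose may have ended, so an owner can confirm or retire them.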

The privacy benefit is obvious, but the engineering benefit is just as real. Purpose-built data surfaces reduce cognitive load. They let teams work with data that says what it is for instead of forcing every consumer to reinterpret the raw world.

Minimization and AI Context

AI products make minimization more important because context can travel quickly. A retrieved document, prompt, or tool result may be sent into a model and influence a response. The system should pass along only the context a request actually needs.

For retrieval, minimization may mean indexing only approved sources, filtering by permissions, excluding sensitive fields, or creating summaries that preserve utility without exposing raw details. These choices are product and infrastructure decisions at the same time.
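A minimal sketch of those retrieval choices, assuming a hypothetical `Document` shape with a source label, a set of allowed groups, and free-form metadata. The names `APPROVED_SOURCES`, `SENSITIVE_FIELDS`, and `minimal_context` are illustrative and not taken from any specific retrieval library.

```python
from dataclasses import dataclass, field

# Illustrative policy values; a real system would load these from configuration.
APPROVED_SOURCES = {"handbook", "public_docs"}
SENSITIVE_FIELDS = {"author_email", "customer_id"}


@dataclass
class Document:
    """Hypothetical retrieved document."""
    doc_id: str
    source: str
    allowed_groups: set
    text: str
    metadata: dict = field(default_factory=dict)


def minimal_context(docs, user_groups):
    """Keep only documents from approved sources that the user may read,
    and drop sensitive metadata fields before anything reaches the model."""
    context = []
    for doc in docs:
        if doc.source not in APPROVED_SOURCES:
            continue  # source was never approved for indexing
        if not (doc.allowed_groups & user_groups):
            continue  # user shares no group with the document
        safe_metadata = {
            key: value
            for key, value in doc.metadata.items()
            if key not in SENSITIVE_FIELDS
        }
        context.append(
            {"doc_id": doc.doc_id, "text": doc.text, "metadata": safe_metadata}
        )
    return context


# Illustrative documents (all values are made up).
DOCS = [
    Document("d1", "handbook", {"staff"}, "Leave policy text",
             {"title": "Leave policy", "author_email": "hr@example.com"}),
    Document("d2", "crm_export", {"staff"}, "Customer notes",
             {"customer_id": "c-42"}),
    Document("d3", "public_docs", {"finance"}, "Budget guide",
             {"title": "Budget guide"}),
]

print(minimal_context(DOCS, user_groups={"staff"}))
```

Filtering before the prompt is assembled keeps the decision in one place. The same function can serve as a log of which sources were eligible to influence an answer.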

Minimization also improves evaluation. When the system has clear data surfaces, it is easier to understand which source influenced an answer. If everything is available everywhere, debugging and governance become harder.

The most useful framing for me is purpose. If a piece of data has a clear purpose, it is easier to protect, measure, and retire. If it has no clear purpose, it probably should not be in the system.