Aug 8, 2022

Deleting Data Is Engineering

Data teams like to talk about ingestion. We talk less about deletion, but deletion is engineering too.

Retention rules, subject deletion, derived datasets, caches, backups, and model training data all complicate the idea of “remove this record.” If deletion is not designed into the system, it becomes a manual investigation every time.

I started looking at datasets through lifecycle questions. Why do we keep this? How long should it live? What downstream data inherits deletion requirements? Can we prove what happened?

Privacy engineering made this feel urgent, but reliability benefits too. Systems that know how to delete data usually have better lineage, ownership, and operational discipline.

Deletion as a First-Class Workflow

The hard part of deletion is that data rarely stays in one place. It gets copied into derived tables, indexes, exports, caches, training sets, and backups. If those relationships are not tracked, nobody can say with confidence where a given record ended up.

I started to think of deletion like backfill in reverse. It needs scope, lineage, execution, validation, and evidence. Which records were affected? Which derived assets were updated? Which systems are out of scope? How do we prove the workflow ran?

This pushed me toward lifecycle-aware design. Datasets should have retention expectations when they are created, not after they become risky. Sensitive fields should flow into as few downstream datasets as possible, because every copy is another place a deletion has to reach. Derived datasets should state explicitly whether they inherit deletion requirements from their sources.

The broader lesson is that responsible data platforms need endings. Ingestion gets attention because it creates visible value. Deletion and retention create trust. They show that the platform can respect constraints even when data has already become useful.

Making Retention Understandable

Retention policies are easier to follow when they are visible in the data model. A dataset should say whether it is temporary, operational, analytical, archival, or derived from sensitive sources. That label should influence storage, access, and expiration.
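
To make that concrete, here is a minimal sketch of a retention label attached to a dataset definition. Everything in it is hypothetical: the `RetentionClass` names mirror the categories above, and the day counts in `DEFAULT_RETENTION_DAYS` are placeholders, not recommendations. Real values would come from policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class RetentionClass(Enum):
    TEMPORARY = "temporary"
    OPERATIONAL = "operational"
    ANALYTICAL = "analytical"
    ARCHIVAL = "archival"
    SENSITIVE_DERIVED = "sensitive_derived"


# Placeholder defaults for illustration only (assumed, not from any policy).
DEFAULT_RETENTION_DAYS = {
    RetentionClass.TEMPORARY: 7,
    RetentionClass.OPERATIONAL: 90,
    RetentionClass.ANALYTICAL: 365,
    RetentionClass.ARCHIVAL: None,  # no automatic expiry; reviewed by a human
    RetentionClass.SENSITIVE_DERIVED: 30,
}


@dataclass(frozen=True)
class DatasetSpec:
    name: str
    retention_class: RetentionClass
    created_on: date

    def expires_on(self) -> Optional[date]:
        """Return the expiry date implied by the label, or None if it never expires."""
        days = DEFAULT_RETENTION_DAYS[self.retention_class]
        if days is None:
            return None
        return self.created_on + timedelta(days=days)


spec = DatasetSpec("tmp_session_join", RetentionClass.TEMPORARY, date(2022, 8, 8))
archive = DatasetSpec("orders_2019", RetentionClass.ARCHIVAL, date(2022, 8, 8))
print(spec.expires_on())     # 2022-08-15
print(archive.expires_on())  # None
```

The point of the sketch is that the label is required at creation time, so expiration is a property of the dataset rather than something reconstructed later.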

I also think deletion workflows should produce audit evidence automatically. Engineers should not have to manually prove which downstream assets were touched. The system should record the request, scope, execution, and validation.
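
One way to picture this is a deletion runner that builds its own evidence record as it goes. This is a sketch under assumptions: `DeletionEvidence`, `run_deletion`, and the in-memory `store` are hypothetical stand-ins, and a real system would call warehouse or service APIs where this calls `delete_rows` and `count_rows`.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class DeletionEvidence:
    request_id: str
    subject_key: str
    scope: List[str]
    executed: Dict[str, int] = field(default_factory=dict)    # asset -> rows deleted
    validated: Dict[str, bool] = field(default_factory=dict)  # asset -> no rows remain
    completed_at: str = ""


def run_deletion(
    request_id: str,
    subject_key: str,
    scope: List[str],
    delete_fn: Callable[[str, str], int],
    count_fn: Callable[[str, str], int],
) -> DeletionEvidence:
    """Delete the subject from every asset in scope and record what happened."""
    evidence = DeletionEvidence(request_id, subject_key, list(scope))
    for asset in scope:
        evidence.executed[asset] = delete_fn(asset, subject_key)
        # Validation is a separate read after the delete, not an assumption.
        evidence.validated[asset] = count_fn(asset, subject_key) == 0
    evidence.completed_at = datetime.now(timezone.utc).isoformat()
    return evidence


# Hypothetical in-memory "tables": each asset is a list of subject keys.
store = {"orders": ["u1", "u2", "u1"], "orders_daily": ["u1"]}


def delete_rows(asset: str, key: str) -> int:
    before = len(store[asset])
    store[asset] = [k for k in store[asset] if k != key]
    return before - len(store[asset])


def count_rows(asset: str, key: str) -> int:
    return store[asset].count(key)


evidence = run_deletion("req-001", "u1", ["orders", "orders_daily"], delete_rows, count_rows)
print(json.dumps(asdict(evidence), indent=2))
```

The record captures the request, the scope, the row counts from execution, and the result of validation, so nobody has to assemble that proof by hand afterwards.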

This is another place where lineage pays off. Without lineage, deletion is a scavenger hunt. With lineage, it becomes a workflow over known dependencies, even if some edges still require human review.
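
As a rough illustration, deletion scope can be computed by walking a lineage graph from the source dataset. The `LINEAGE` map and its dataset names below are invented for the example, and the `needs_review` flag is my assumption about how an edge that cannot be automated might be marked.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical lineage: source -> list of (downstream asset, needs_review).
LINEAGE: Dict[str, List[Tuple[str, bool]]] = {
    "users": [("user_features", False), ("crm_export", True)],
    "user_features": [("churn_training_set", False)],
    "crm_export": [],
    "churn_training_set": [],
}


def deletion_scope(source: str) -> Tuple[List[str], List[str]]:
    """Breadth-first walk of everything downstream of `source`.

    Returns (automated, review): assets reached over ordinary edges, and
    assets reached over an edge flagged for human review. Simplification:
    the flag applies only to the edge itself, not to anything further down.
    """
    automated: List[str] = []
    review: List[str] = []
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child, needs_review in LINEAGE.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            (review if needs_review else automated).append(child)
            queue.append(child)
    return automated, review


automated, review = deletion_scope("users")
print(automated)  # ['user_features', 'churn_training_set']
print(review)     # ['crm_export']
```

The traversal itself is simple. The value is in having the edges recorded at all, including the ones a person still has to check.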

Good deletion systems make responsibility practical. They turn a legal or policy requirement into an engineering process that teams can actually operate.