Nov 18, 2019

Documenting the Weird Parts

I used to think documentation meant describing every column. That helps, but the documentation I reached for most often explained the weird parts.

Why does this table exclude a certain customer segment? Why is this timestamp in local time? Why does this metric not tie to finance? Why are deletes represented as updates?

Those details are where trust is won or lost. A clean schema with no context can still mislead people if the business rules are surprising.

I began writing documentation around decisions: what the dataset is for, where it should not be used, what assumptions are embedded, and who can answer questions. Good docs reduce the number of traps future teammates have to rediscover.
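One way to keep those four questions in front of people is to give them a fixed shape. The sketch below is only an illustration: the DatasetNote class, its field names, and the example table and team names are all hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetNote:
    """Decision-focused note for one dataset. All field names are illustrative."""
    name: str
    purpose: str
    not_for: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    contact: str = "unknown"

    def render(self) -> str:
        """Return a short plain-text note that can sit next to the table or job."""
        lines = [f"{self.name}: {self.purpose}"]
        lines += [f"  Do not use for: {item}" for item in self.not_for]
        lines += [f"  Assumes: {item}" for item in self.assumptions]
        lines.append(f"  Questions: {self.contact}")
        return "\n".join(lines)


note = DatasetNote(
    name="orders_daily",  # hypothetical table name
    purpose="directional product analytics",
    not_for=["billing"],
    assumptions=["deletes are represented as updates"],
    contact="data-platform team",  # hypothetical owner
)
print(note.render())
```

The rendered text is deliberately short so it can be pasted into a table comment or job description rather than living on a separate page.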

Writing for Future Debugging

The documentation that aged best was written with future debugging in mind. I started adding notes like “this metric excludes refunds because finance reconciles them separately” or “this timestamp is source-system local time until 2020-03-01.” Those details look small until someone is trying to explain a discrepancy under pressure.
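Notes like these often have a date attached, and that date is what matters when someone is chasing a discrepancy. A minimal sketch of the idea, assuming caveats are kept as small records with optional validity dates; the Caveat class and caveats_for helper are hypothetical, and treating "until" as an exclusive bound is my assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Caveat:
    text: str
    valid_from: Optional[date] = None   # None means "since the beginning"
    valid_until: Optional[date] = None  # exclusive bound (assumption); None means "still true"

    def applies_to(self, start: date, end: date) -> bool:
        """True if the caveat overlaps the inclusive query range [start, end]."""
        started = self.valid_from is None or self.valid_from <= end
        not_expired = self.valid_until is None or start < self.valid_until
        return started and not_expired


CAVEATS = [
    Caveat("Metric excludes refunds; finance reconciles them separately."),
    Caveat("Timestamp is source-system local time.", valid_until=date(2020, 3, 1)),
]


def caveats_for(start: date, end: date) -> list[str]:
    """List the caveat texts that matter for data between start and end."""
    return [c.text for c in CAVEATS if c.applies_to(start, end)]
```

Asking for the caveats that cover February 2020 would return both notes, while a query starting on 2020-03-01 would return only the refunds note.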

I also learned that documentation should be close to the thing it describes. A beautiful wiki page nobody can find is less useful than a short note linked from the table, job, or dashboard. The less context-switching it takes to reach a note, the more likely people are to keep it current.

Good documentation also made reviews better. When a transformation encoded a strange business rule, writing it down forced the team to ask whether the rule was still true. Sometimes the answer was no, and the documentation exercise became a design review in disguise.

This shaped how I think about reliable data systems: they should preserve not just outputs but also the reasoning behind them. The weird parts are often where the reasoning lives.

Documentation as Risk Reduction

The most useful documentation reduces risk at the moment of change. When someone edits a transformation, backfills history, migrates a dashboard, or reuses a dataset for ML, the docs should reveal the assumptions that could bite them.

I began to prefer short, living documentation over long static pages. A few accurate paragraphs near the dataset are better than a comprehensive guide that nobody trusts. The goal is to keep the explanation close enough that maintainers update it when behavior changes.

I also like documenting “do not use this for” cases. Positive descriptions explain intent, but negative guidance prevents misuse. A dataset may be fine for directional product analytics but unsafe for billing, experimentation, or model training.
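Negative guidance can also be made checkable. The sketch below assumes a hand-maintained mapping from dataset to ruled-out uses; UNSAFE_USES, check_use, and the dataset name are hypothetical, and a use that is missing from the mapping is merely undocumented, not certified safe.

```python
UNSAFE_USES = {
    # hypothetical dataset name mapped to uses its owners have ruled out
    "product_events_daily": {"billing", "experimentation", "model_training"},
}


def check_use(dataset: str, intended_use: str) -> None:
    """Raise ValueError if the intended use is documented as unsafe for the dataset."""
    if intended_use in UNSAFE_USES.get(dataset, set()):
        raise ValueError(f"{dataset} is documented as unsafe for {intended_use}")


# Directional product analytics is not on the ruled-out list, so this passes silently.
check_use("product_events_daily", "product_analytics")
```

A call like this at the top of a pipeline or notebook turns a "do not use this for" note into something a new consumer runs into before the misuse, not after.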

That kind of writing is not glamorous, but it compounds. Every weird part that gets named is one less hidden edge case waiting for the next person.