May 9, 2018

Schemas Are Promises

Early in my data engineering work, I thought of schemas as an implementation detail. They lived in warehouses, migrations, or parser code. Eventually I realized a schema is closer to a promise.

When a producer publishes a field, downstream teams build assumptions around it. They make dashboards, joins, models, and decisions. If the field changes without warning, the cost lands on those teams, often far from where the change was made.

That changed how I approached pipeline design. I began writing down what each dataset meant, who owned it, which fields were stable, and which fields were experimental. None of that was fancy, but it made conversations easier.

The technical part of schemas is straightforward. The harder part is getting teams to respect them as shared interfaces.

The Social Side of a Schema

The schema itself is only the visible part of the agreement. The real contract includes meaning, cadence, ownership, and change expectations. A column named status is not helpful if nobody knows which statuses are valid, whether old statuses remain in history, or who is allowed to introduce a new value.

I started to appreciate that most schema problems are communication problems wearing technical clothes. A producer team may be improving its own system when it renames a field or collapses two states into one. Downstream, the same change can break a report, confuse a model, or make a metric look like it moved overnight.

The practical habit I took from this period was to write small interface notes beside important datasets. What does each field mean? What are known caveats? Which fields are stable? Which fields are convenient but not guaranteed? The notes did not need to be perfect to be useful. They gave people a shared place to point during change discussions.
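
A minimal sketch of what such a note could look like if it were kept as a small structured record beside the pipeline code. Everything in it is hypothetical: the `FieldNote` and `DatasetNote` types, the `orders` dataset, its owner, and its fields are made up for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Stability(Enum):
    STABLE = "stable"              # consumers may depend on it
    EXPERIMENTAL = "experimental"  # convenient, but not guaranteed


@dataclass(frozen=True)
class FieldNote:
    name: str
    meaning: str
    stability: Stability
    caveats: tuple = ()  # known caveats, as short sentences


@dataclass(frozen=True)
class DatasetNote:
    dataset: str
    owner: str
    fields: tuple

    def stable_fields(self):
        """Names of the fields consumers can safely build on."""
        return [f.name for f in self.fields if f.stability is Stability.STABLE]


# Hypothetical dataset; every name below is an assumption for the example.
orders_note = DatasetNote(
    dataset="orders",
    owner="checkout-team",
    fields=(
        FieldNote("order_id", "Unique identifier for one order.", Stability.STABLE),
        FieldNote(
            "status",
            "Current lifecycle state of the order.",
            Stability.STABLE,
            caveats=("Historical rows may contain retired statuses.",),
        ),
        FieldNote("promo_guess", "Heuristic promotion label.", Stability.EXPERIMENTAL),
    ),
)

print(orders_note.stable_fields())
```

A plain text file would serve the same purpose. The only advantage of a structured form is that a script can answer "which fields are stable?" without anyone rereading the note.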

This is also where I began to connect data engineering with software engineering. APIs have versioning, ownership, and compatibility expectations. Data interfaces deserve the same respect, especially when they become product or ML dependencies.

What Makes a Promise Useful

A promise is only useful if both sides understand it the same way. In data systems, that means a schema needs examples, constraints, and a process for change. It is not enough to say a field is a string if consumers need to know whether it is nullable, stable, unique, or safe to join on.
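
As a rough illustration, those extra properties can be written down next to the type and checked against sample rows. The `FieldSpec` type, the `check_rows` helper, and the `user_id` and `email` fields are all assumptions made for this sketch, and it assumes field values are hashable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    nullable: bool = True
    unique: bool = False  # unique, non-null fields are candidates for join keys


def check_rows(specs, rows):
    """Return a list of human-readable violations found in `rows`."""
    problems = []
    for spec in specs:
        seen = set()
        for i, row in enumerate(rows):
            value = row.get(spec.name)  # a missing key is treated as null
            if value is None:
                if not spec.nullable:
                    problems.append(f"row {i}: {spec.name} is null")
                continue
            if not isinstance(value, spec.type_):
                problems.append(f"row {i}: {spec.name} is not {spec.type_.__name__}")
            if spec.unique:
                if value in seen:
                    problems.append(f"row {i}: {spec.name} duplicates {value!r}")
                seen.add(value)
    return problems


specs = [
    FieldSpec("user_id", str, nullable=False, unique=True),
    FieldSpec("email", str),  # nullable and not unique: not safe to join on
]
rows = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u1", "email": None},
    {"email": "c@example.com"},
]
problems = check_rows(specs, rows)
for problem in problems:
    print(problem)
```

Both fields are "a string", yet only one of them survives the question a consumer actually cares about: can I join on this?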

I also learned that schema promises should be tested close to publication. If a producer claims a field is required, the publishing workflow should verify that before downstream jobs discover the problem. The earlier a contract breaks, the cheaper it is to fix.
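
A small sketch of that idea, assuming the publishing step is a single function that can refuse a batch. The names `publish` and `ContractViolation`, and the list standing in for a real sink, are hypothetical.

```python
class ContractViolation(Exception):
    """Raised when a batch breaks what the producer promised."""


def publish(rows, required_fields, write):
    """Validate every row first, then hand the batch to `write`.

    Nothing is written if any row is missing a required field.
    """
    for i, row in enumerate(rows):
        missing = [name for name in required_fields if row.get(name) is None]
        if missing:
            raise ContractViolation(f"row {i} is missing required fields: {missing}")
    write(rows)
    return len(rows)


published = []  # stand-in for a real sink such as a table or a topic

good = [{"order_id": 1, "status": "paid"}]
bad = [{"order_id": 2, "status": None}]

publish(good, ["order_id", "status"], published.extend)

try:
    publish(bad, ["order_id", "status"], published.extend)
except ContractViolation as exc:
    rejected = str(exc)
    print("not published:", rejected)
```

The failure happens in the producer's own workflow, where the person who can fix it is looking, rather than in a downstream job hours later.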

Versioning matters here. Some datasets can evolve with additive fields. Others need explicit versions because meaning changes over time. A version may feel heavy at first, but it gives consumers a way to migrate instead of waking up to a changed interface.
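
One way to make the additive-versus-versioned distinction concrete is a check that compares two schema descriptions. This is only a sketch under simple assumptions: schemas are plain mappings from field name to type name, and `classify_change` is a hypothetical helper, not part of any schema registry.

```python
def classify_change(old, new):
    """Compare two {field_name: type_name} mappings.

    Returns "none", "additive", or "breaking".
    """
    removed = old.keys() - new.keys()
    retyped = {name for name in old.keys() & new.keys() if old[name] != new[name]}
    if removed or retyped:
        return "breaking"  # consumers of the old shape would fail
    if new.keys() - old.keys():
        return "additive"  # existing consumers can ignore the new fields
    return "none"


v1 = {"order_id": "string", "status": "string"}
v1_plus = {**v1, "currency": "string"}          # a new field only
v2 = {"order_id": "string", "state": "string"}  # a rename is a remove plus an add

print(classify_change(v1, v1_plus))
print(classify_change(v1, v2))
```

A structural check like this cannot see a change in meaning. If `status` keeps its name and type but two states are collapsed into one, the comparison reports no change, which is exactly the case where an explicit version is worth the weight.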

The big lesson for me was that schemas are not paperwork. They are a practical way to make trust scalable. When the promise is clear, teams can build independently without treating every upstream change as a surprise.
