Feb 12, 2018

Learning From My First Broken Pipeline

The first data pipeline I remember really learning from was not impressive. It moved files, transformed a few fields, and loaded a table that people used for reporting. It also failed in a way that looked like success.

The job completed. The table existed. The dashboard refreshed. But one upstream file had changed shape, and the pipeline silently filled a column with bad defaults.

That failure taught me a basic lesson that still matters: completion is not correctness. A useful pipeline needs checks around shape, volume, freshness, and business meaning. At the time, even a simple assertion on the expected columns, alongside a row count, would have saved a lot of confusion.

I started to think of data engineering less as moving bytes and more as preserving trust across handoffs.

What I Missed

The thing I missed was that the pipeline had no opinion about the shape of the data it expected. It trusted the input because the input had usually been fine. That is a fragile kind of trust. The source system had no reason to know which downstream assumptions mattered, and the pipeline job had no way to object when those assumptions changed.

If I were building that same workflow again, I would start with a few plain checks before reaching for any framework. I would validate required columns, count rows by partition, check whether important fields suddenly became null, and record the source file name that produced each load. Those details would have made the incident shorter and less mysterious.
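Here is a minimal sketch of what those plain checks might look like, in Python with only the standard library. The function name check_load, the column names, and the null-rate limits are all assumptions made up for illustration; this is not the code from the original pipeline, and it assumes the rows have already been parsed into dictionaries.

```python
from collections import Counter


def check_load(rows, required_columns, partition_key, null_limits, source_file):
    """Run a few plain checks on parsed rows before loading them.

    rows: list of dicts, one per record.
    null_limits: hypothetical mapping of column name -> highest acceptable
        fraction of empty values (0.0 to 1.0).
    Returns a report dict; report["problems"] is empty when all checks pass.
    """
    problems = []

    # Required columns: compare expectations with the keys actually present.
    present = set()
    for row in rows:
        present.update(row.keys())
    missing = sorted(set(required_columns) - present)
    if missing:
        problems.append(f"missing columns: {missing}")

    # Row counts by partition, kept in the report so loads can be compared.
    rows_by_partition = Counter(row.get(partition_key) for row in rows)

    # Null rate: flag important fields that are suddenly mostly empty.
    for column, limit in null_limits.items():
        empty = sum(1 for row in rows if row.get(column) in (None, ""))
        rate = empty / len(rows) if rows else 1.0
        if rate > limit:
            problems.append(
                f"{column}: empty in {rate:.0%} of rows, limit is {limit:.0%}"
            )

    # Record the source file so every load can be traced back to its input.
    return {
        "source_file": source_file,
        "row_count": len(rows),
        "rows_by_partition": dict(rows_by_partition),
        "problems": problems,
    }
```

The report is returned rather than raised as an exception so that the caller can decide what to do with each problem. Storing the report next to the loaded table is one way to keep the source file name attached to each load.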

The experience also changed how I communicated about failures. Instead of saying “the job succeeded,” I learned to separate operational success from data success. A scheduler can tell me a task exited cleanly, but only domain-aware checks can tell me whether the output is useful.

That distinction still feels foundational. The more teams depend on a dataset, the more the pipeline needs to explain its own confidence. Early in my career, that lesson felt like a bug fix. Now I see it as the beginning of data reliability engineering.

How I Would Handle It Now

Today I would treat this kind of failure as a design gap, not just an incident to patch. The immediate fix would still matter, but I would also ask what made the silent failure possible. Was there no schema validation? No sample inspection? No downstream reconciliation? No owner for the source contract?

I would also separate detection from prevention. Some checks should block publication because the output is clearly unsafe. Other checks should warn because the data is unusual but possibly valid. That distinction helps avoid both extremes: fragile pipelines that stop too often and reckless pipelines that publish anything.
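One way to express the difference between blocking and warning checks is to make severity part of each check's definition. The sketch below is illustrative only: the Check class, the evaluate function, the keys in the stats dictionary, and the 50% threshold are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    passed: Callable[[dict], bool]  # receives load statistics, True when healthy
    blocking: bool                  # True: stop publication; False: warn only


def evaluate(checks, stats):
    """Return (publish, blockers, warnings) for one load."""
    blockers, warnings = [], []
    for check in checks:
        if check.passed(stats):
            continue
        # A failed check either blocks publication or is reported as a warning.
        (blockers if check.blocking else warnings).append(check.name)
    return (not blockers, blockers, warnings)


# Hypothetical checks: a missing column is clearly unsafe, while a large
# change in volume is unusual but possibly valid.
CHECKS = [
    Check(
        name="required columns present",
        passed=lambda s: not s["missing_columns"],
        blocking=True,
    ),
    Check(
        name="row count within 50% of previous load",
        passed=lambda s: abs(s["row_count"] - s["previous_row_count"])
        <= 0.5 * s["previous_row_count"],
        blocking=False,
    ),
]
```

Keeping the severity next to the check makes the decision reviewable: moving a check from warning to blocking becomes a one-line change that someone has to justify.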

The interesting part is that the technical solution is modest. A few assertions, useful logs, and a small runbook can dramatically improve confidence. The harder shift is cultural: getting everyone to agree that data correctness is part of the product, even when the product is “just” a table.

That early broken pipeline gave me a language for later work. Whenever I build something now, I still ask: if this fails quietly, how would we know?