Feb 11, 2019

Data Quality Is Product Quality

In 2019, I started to see data quality as more than an internal engineering concern. When a metric is wrong, a dashboard misleads the people reading it. When an attribute is stale, personalization gets worse. When an experiment assignment is missing, the product decisions built on that experiment get weaker.

That connection made quality feel less abstract. A broken dataset was not just a failed job; it was a user experience problem with extra steps.

I began asking a different question during pipeline reviews: who feels this if it is wrong? Sometimes the answer was an analyst. Sometimes it was a model. Sometimes it was a customer-facing feature.

That question helped prioritize checks. Not every table needs the same rigor, but every important table needs a clear reason for the rigor it has.

Finding the Product Surface

The useful exercise was tracing a dataset until it reached a human or a product behavior. A transformation that looked internal might feed a pricing view, a customer segmentation job, or a feature that changed what someone saw in the app. Once I understood that path, quality became a question about specific people and features rather than a general engineering standard.

This also made prioritization easier. A one-off exploratory table does not need the same investment as a dataset that drives executive reporting or model decisions. The difference is not that one deserves care and the other does not. The difference is the blast radius when it is wrong.
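
The tracing exercise can be sketched as a walk over a lineage graph. This is a minimal illustration, not a description of any particular tool: the LINEAGE dictionary, its dataset names, and the downstream function are all hypothetical, and a real lineage source would likely be a catalog or an orchestrator's metadata.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the things that read from it.
# Every name here is invented for illustration.
LINEAGE = {
    "raw_orders": ["orders_clean"],
    "orders_clean": ["pricing_view", "customer_segments"],
    "customer_segments": ["app_recommendations"],
    "scratch_exploration": [],
}


def downstream(dataset, lineage):
    """Return every consumer reachable from `dataset`, breadth-first."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


blast_radius = downstream("raw_orders", LINEAGE)
one_off = downstream("scratch_exploration", LINEAGE)
```

In this toy graph, raw_orders reaches a pricing view and an app feature, while scratch_exploration reaches nothing. The length of that list is one rough proxy for blast radius, although what the consumers are matters more than how many there are.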

I started to classify checks by consequence. Some checks protected schema compatibility. Some protected freshness. Some protected business meaning. Some protected trust in a metric that many teams used. That framing helped me avoid adding generic tests everywhere while still being serious about important systems.

The broader lesson was that data quality has to be connected to user impact. If a check fails, the alert should tell me more than “table bad.” It should help me understand who might be making a decision on stale, incomplete, or misleading information.
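
One way to get past "table bad" is to attach what a check protects, and who depends on it, to the check itself. The sketch below assumes that the consumers are recorded by hand. The Protects enum mirrors the four kinds of checks described above, while Check, alert_message, and the example values are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Protects(Enum):
    SCHEMA = "schema compatibility"
    FRESHNESS = "freshness"
    MEANING = "business meaning"
    METRIC_TRUST = "trust in a shared metric"


@dataclass
class Check:
    table: str
    protects: Protects
    consumers: list  # who feels it if this check fails (maintained by hand here)


def alert_message(check, detail):
    """Build an alert that names what is at risk and who may be affected."""
    who = ", ".join(check.consumers) if check.consumers else "no known consumers"
    return (
        f"{check.table}: {detail}. "
        f"Protects {check.protects.value}. "
        f"Possibly affected: {who}."
    )


# Illustrative values only.
freshness_check = Check(
    table="orders_clean",
    protects=Protects.FRESHNESS,
    consumers=["pricing_view", "weekly revenue report"],
)
message = alert_message(freshness_check, "last load is 9 hours old")
```

The useful part is not the formatting. It is that writing the check forces someone to fill in the consumers field, which is the "who feels this if it is wrong?" question in code form.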

Prioritizing Quality Work

Quality work can become endless if it is not tied to risk. Every dataset could use more tests, more docs, and more monitoring. The question is where those investments change outcomes.

I started using a simple mental model: importance, volatility, and visibility. Important datasets affect decisions or product behavior. Volatile datasets change often or depend on unstable sources. Visible datasets are the ones users or leadership see directly, so errors there cost trust. When all three are high, quality work deserves priority.
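
The mental model can be written down as a rough ranking. This sketch makes assumptions the model itself does not: a three-level scale, equal weighting, and a simple sum. The Dataset class, the priority and rank functions, and the example datasets are all hypothetical.

```python
from dataclasses import dataclass

# Assumed three-level scale; the original model is qualitative.
LEVELS = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Dataset:
    name: str
    importance: str  # affects decisions or product behavior
    volatility: str  # changes often or depends on unstable sources
    visibility: str  # seen directly by users or leadership


def priority(ds):
    """Sum the three ratings; a higher score means more reason to invest in checks."""
    return LEVELS[ds.importance] + LEVELS[ds.volatility] + LEVELS[ds.visibility]


def rank(datasets):
    """Order datasets from highest to lowest priority."""
    return sorted(datasets, key=priority, reverse=True)


# Illustrative ratings only.
candidates = [
    Dataset("scratch_exploration", "low", "high", "low"),
    Dataset("exec_revenue", "high", "medium", "high"),
    Dataset("experiment_assignments", "high", "high", "high"),
]
ranked_names = [ds.name for ds in rank(candidates)]
```

A score like this is a conversation starter rather than a verdict. Its value is in making the ratings explicit enough to argue about in a planning meeting.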

That framing helped in planning conversations. Instead of arguing that “data quality matters” in a generic way, I could point to a specific workflow and explain what failure would cost. That made it easier to justify time for checks, reconciliation, and better ownership.

It also kept me from over-engineering low-risk tables. Good engineering is not maximum rigor everywhere. It is appropriate rigor where the system needs to be trusted.