Feb 17, 2020

Reliability During Uncertainty

Early 2020 reminded me that data systems live inside changing environments. Historical assumptions can become stale quickly when user behavior, business operations, or external conditions shift.

I started paying more attention to freshness and distribution changes, not just job failures. A pipeline can be green while the world it describes has changed underneath it.

That pushed me toward monitors that asked whether data still looked plausible. Are volumes within range? Did category mix shift? Are new null patterns appearing? Is a metric moving because the product changed or because the pipeline changed?
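
As a rough illustration, the first three questions can each be reduced to a small check. This is a minimal sketch, not the code I actually ran: the function names, the 50% volume tolerance, and the choice of total variation distance for category mix are all assumptions made for the example.

```python
from collections import Counter


def volume_in_range(row_count, history, tolerance=0.5):
    """True if row_count is within +/- tolerance (as a fraction) of the historical mean.

    The 0.5 default is an illustrative assumption, not a recommended threshold.
    """
    mean = sum(history) / len(history)
    return abs(row_count - mean) <= tolerance * mean


def category_shift(baseline, current):
    """Total variation distance between two category distributions.

    0.0 means an identical mix, 1.0 means no overlap at all.
    """
    b, c = Counter(baseline), Counter(current)
    nb, nc = sum(b.values()), sum(c.values())
    # Counter returns 0 for missing keys, so new or vanished categories are handled.
    return 0.5 * sum(abs(b[k] / nb - c[k] / nc) for k in set(b) | set(c))


def null_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)
```

None of these checks can answer the last question, whether the product or the pipeline changed. They only make the movement visible so that someone can ask it.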

Reliability is not a fixed property. It is a relationship between the system and the reality it is supposed to represent.

When Green Is Not Enough

This period made me more skeptical of dashboards that only showed task status. Green jobs are comforting, but they can hide the more important question: is the data still describing the world accurately enough for the decisions being made?

I started looking for signals that captured movement, not just failure. A sudden drop in activity might be a source outage, a product change, or a real behavior shift. A category that suddenly dominates a table might be a bug, a campaign, or a new customer segment. The system cannot always know the answer, but it can make the question visible.

That led me toward monitors that were less binary. Instead of every alert being “broken” or “fine,” I wanted tiers: needs investigation, likely source issue, downstream impact confirmed, safe to ignore. That kind of language helps teams respond proportionally.
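
One way to sketch those tiers in code is an enum plus a small triage function. The four tier names come from the list above; everything else here (the Evidence fields, the 20% threshold, the order of the rules) is a hypothetical example rather than a rule set I would defend in general.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SAFE_TO_IGNORE = "safe to ignore"
    NEEDS_INVESTIGATION = "needs investigation"
    LIKELY_SOURCE_ISSUE = "likely source issue"
    DOWNSTREAM_IMPACT_CONFIRMED = "downstream impact confirmed"


@dataclass
class Evidence:
    # All fields are illustrative assumptions about what a monitor might know.
    deviation: float          # relative deviation from baseline, e.g. 0.4 means 40%
    upstream_late: bool       # did the source deliver late or not at all?
    consumers_affected: bool  # has someone confirmed a broken report or model?
    known_event: bool         # is there a recorded event that explains the change?


def triage(e, threshold=0.2):
    """Map evidence to a tier. Rules are checked from most to least severe."""
    if e.consumers_affected:
        return Tier.DOWNSTREAM_IMPACT_CONFIRMED
    if abs(e.deviation) < threshold or e.known_event:
        return Tier.SAFE_TO_IGNORE
    if e.upstream_late:
        return Tier.LIKELY_SOURCE_ISSUE
    return Tier.NEEDS_INVESTIGATION
```

The point of the ordering is that confirmed downstream impact always wins, and an unexplained deviation defaults to "needs investigation" rather than to "broken."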

Uncertainty also made ownership more important. When reality changes quickly, data teams need fast paths to domain context. The best reliability signal may be an engineer and a product person looking at the same evidence and deciding what changed.

Building Monitors That Invite Judgment

The best monitors do not pretend to know everything. They give people enough evidence to make a judgment quickly. That means showing the current value, historical range, recent changes, and links to upstream activity.
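
A sketch of what that evidence might look like as a single object follows. The MonitorReport type, its field names, and the build_report helper are assumptions for illustration, and the upstream links are just opaque strings that a real system would resolve to job or lineage pages.

```python
from dataclasses import dataclass, field


@dataclass
class MonitorReport:
    metric: str
    current: float
    historical_min: float
    historical_max: float
    change_vs_previous: float  # relative change against the last observation
    upstream_links: list = field(default_factory=list)

    @property
    def in_range(self):
        """True if the current value falls inside the observed historical range."""
        return self.historical_min <= self.current <= self.historical_max


def build_report(metric, history, current, upstream_links=()):
    """Bundle the current value with the context a person needs to judge it."""
    previous = history[-1]
    # Avoid dividing by zero when the previous observation was 0.
    change = (current - previous) / previous if previous else float("nan")
    return MonitorReport(
        metric, current, min(history), max(history), change, list(upstream_links)
    )
```

The report deliberately carries no verdict. It states the value, the range, and the recent change, and leaves the judgment to whoever reads it.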

I became more interested in anomaly detection that accounted for known events instead of treating every deviation as an error. A spike may be normal after a launch. A drop may be expected after a policy change. The monitor should make unusual movement visible, but the team still needs domain context to interpret it.

This is where annotation becomes powerful. If a product launch, migration, or known outage is recorded near the metric, future engineers can understand why the data moved. Without annotations, every unusual shape looks like a fresh mystery.
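
Recording annotations does not need much machinery. The sketch below assumes a hypothetical Annotation record and a three-day lookup window; both are choices made for the example, and the notes in any real log would be whatever the team actually wrote down.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Annotation:
    day: date
    kind: str  # e.g. "launch", "migration", "outage"
    note: str


def annotations_near(annotations, day, window_days=3):
    """Return annotations recorded within window_days of the given day."""
    window = timedelta(days=window_days)
    return [a for a in annotations if abs(a.day - day) <= window]
```

When a monitor flags unusual movement on a given day, attaching the output of a lookup like this to the alert gives the reader a candidate explanation before they start digging.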

Reliability during uncertainty is less about eliminating surprises and more about shortening the path from surprise to explanation. Data systems should help teams orient quickly when the outside world changes.