Nov 9, 2020

Observability for Humans

I have seen plenty of dashboards that technically contained the answer but were still hard to use during an incident. Observability is not just about collecting signals. It is about helping a tired human make a good next decision.

For data systems, that means alerts should include context: affected datasets, recent changes, upstream dependencies, and likely owners. Dashboards should separate symptoms from causes. Runbooks should answer the first five questions an operator will ask.
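As a minimal sketch of what "alerts should include context" could look like in practice, the snippet below bundles the four pieces of context named above into one payload and renders it as a short message. The class name, field names, and message layout are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertContext:
    """Context a data alert could carry. All field names are hypothetical."""
    dataset: str
    symptom: str
    recent_changes: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)


def render_alert(ctx: AlertContext) -> str:
    """Format the alert as a short plain-text message for an on-call engineer."""
    def join(items: List[str]) -> str:
        # Say so explicitly when a piece of context is absent.
        return ", ".join(items) if items else "none recorded"

    return "\n".join([
        f"[{ctx.symptom}] {ctx.dataset}",
        f"Recent changes: {join(ctx.recent_changes)}",
        f"Upstream: {join(ctx.upstream)}",
        f"Owners: {join(ctx.owners)}",
    ])
```

Printing "none recorded" rather than omitting the line is a deliberate choice in this sketch: a missing answer is itself useful information during an incident.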

Seeing that gap was a turning point in how I evaluated tooling. More charts did not automatically mean better operations.

The best observability feels like a teammate saying, “Here is what changed, here is what is affected, and here is where I would look next.”

Designing the First Five Minutes

I started evaluating alerts by asking what they did for the first five minutes of an incident. Did the alert tell me the affected dataset? Did it show whether this was freshness, volume, schema, or logic? Did it link to the job, recent deploys, upstream dependencies, and owners?
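The questions above can be turned into a rough checklist that runs against an alert payload. This is only a sketch: it assumes alerts are plain dictionaries, and the key names (`job_url`, `recent_deploys`, and so on) are hypothetical rather than taken from any particular alerting tool.

```python
# Hypothetical keys an alert payload would need to answer the questions above.
REQUIRED_FIELDS = ("dataset", "category", "job_url", "recent_deploys", "upstream", "owners")
CATEGORIES = {"freshness", "volume", "schema", "logic"}


def missing_context(alert: dict) -> list:
    """Return the names of context items an alert payload fails to provide.

    A field counts as present if its key exists with a non-None value, so an
    empty list (for example, no recent deploys) is still treated as an answer.
    """
    gaps = [name for name in REQUIRED_FIELDS if alert.get(name) is None]
    category = alert.get("category")
    if category is not None and category not in CATEGORIES:
        gaps.append("category (unrecognized)")
    return gaps
```

For example, `missing_context({"dataset": "orders_daily", "category": "freshness"})` lists the job link, recent deploys, upstream dependencies, and owners as gaps, which is a quick way to review alert definitions before an incident rather than during one.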

Those first five minutes matter because they shape the whole response. A vague alert creates scatter. A clear alert creates sequence. It tells the engineer whether to rerun, investigate upstream, notify consumers, or suppress a noisy signal.

For data systems, I also want observability to preserve historical context. If a freshness alert fires today, I want to know whether this is rare or normal. If volume changed, I want nearby comparisons. If a table is missing partitions, I want to know which downstream assets depend on it.
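Two of these historical questions reduce to small calculations. The sketch below assumes the dates of past alert firings and the row counts for nearby days are already available; the function names, the 30-day window, and the choice of the median as a baseline are illustrative assumptions, not a prescribed method.

```python
from datetime import date, timedelta
from statistics import median


def firings_in_window(fired_on, today, days=30):
    """Count past firings that fall within the `days` before `today`."""
    start = today - timedelta(days=days)
    return sum(1 for d in fired_on if start <= d < today)


def volume_vs_recent(today_rows, recent_rows):
    """Ratio of today's row count to the median of nearby days' counts.

    Returns None when there is no usable baseline (no history, or a zero median).
    """
    if not recent_rows:
        return None
    baseline = median(recent_rows)
    return today_rows / baseline if baseline else None
```

A firing count near zero suggests the alert is rare and worth attention, while a high count suggests a known noisy signal. A volume ratio far from 1.0 gives the nearby comparison without requiring the engineer to open a separate chart.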

The human test keeps me honest. Observability is not a trophy case for metrics. It is an interface for operations, and the user is often someone trying to make a good decision under time pressure.

Reducing Cognitive Load

A good operational interface reduces cognitive load. It groups related signals, names likely causes, and avoids forcing engineers to jump across five tools before they understand the shape of the problem.

For data systems, I like starting with a dependency view. What upstream assets feed this table? Which checks failed? Which downstream consumers are affected? Has this failure happened before? Those questions form the incident narrative.
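A dependency view like this can be sketched as a walk over a lineage graph. The example below assumes lineage is available as a simple mapping from each asset to the assets it feeds; the table names and the `FEEDS` structure are made up for illustration, and a real system would read lineage from its own catalog or orchestrator.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed in its value.
FEEDS = {
    "raw_orders": ["orders_clean"],
    "raw_customers": ["customers_clean"],
    "orders_clean": ["orders_daily"],
    "customers_clean": ["orders_daily"],
    "orders_daily": ["revenue_dashboard", "finance_export"],
}


def downstream(asset, feeds):
    """All assets reachable from `asset`: everything a failure could affect."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in feeds.get(queue.popleft(), []):
            if child not in seen:  # the seen set also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


def upstream(asset, feeds):
    """All assets that feed `asset`, found by walking the reversed graph."""
    reverse = {}
    for parent, children in feeds.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(asset, reverse)
```

With this toy graph, `upstream("orders_daily", FEEDS)` answers "what feeds this table?" and `downstream("orders_daily", FEEDS)` answers "which consumers are affected?", which covers two of the four questions in the incident narrative.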

I also care about alert quality over alert quantity. A noisy alert teaches people to ignore the system. A precise alert builds trust. The best alerts are not necessarily the most sophisticated; they are the ones that reliably point to action.

Observability becomes mature when it helps teams learn. After an incident, the dashboard, alert, or runbook should improve. Otherwise the same confusion will return wearing a slightly different costume.