Nov 16, 2018

Batch Jobs Need Owners

By the end of 2018, I had seen enough scheduled jobs to understand that cron is not a strategy. A batch job can run every night for months and still be operationally neglected.

The issue was rarely the scheduler itself. The issue was ownership. Who gets paged? Who knows what the output means? Who can decide whether a missed run is acceptable? Who talks to upstream teams when the source changes?

Without answers, failures turn into archaeology. Everyone can see the broken table, but nobody knows the story behind it.

That experience pushed me toward clearer runbooks, alerts with context, and lightweight ownership notes beside important datasets. Reliability is partly technical, but it is also social plumbing.

Making Ownership Concrete

Ownership became more useful to me when I stopped treating it as a name in a spreadsheet. A real owner understands the purpose of the job, the meaning of the output, the upstream assumptions, and the downstream impact of failure. That does not mean one person has to fix everything alone, but it does mean there is a clear first place to go.

For batch jobs, I began to care about a few operational basics. Every important job should have a reason to exist, an expected schedule, a definition of late, a recovery path, and a known consumer. If nobody can name the consumer, the job may not need to exist. If nobody can define late, every alert will be either ignored or treated as an emergency.
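One way to make those basics harder to skip is to record them as data next to the job. The sketch below is only an illustration under stated assumptions: `JobSpec`, its field names, and the rule that a run counts for "today" if it succeeded on today's date are all hypothetical, not taken from any particular scheduler.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobSpec:
    """Operational basics for one batch job (hypothetical field names)."""
    name: str
    purpose: str             # why the job exists
    expected_by: str         # expected daily finish time, "HH:MM"
    late_after: timedelta    # grace period after expected_by before a run counts as late
    recovery: str            # how to rerun or backfill
    consumers: list[str] = field(default_factory=list)


def missing_basics(spec: JobSpec) -> list[str]:
    """Return the names of the basics that were left empty."""
    gaps = []
    if not spec.purpose.strip():
        gaps.append("purpose")
    if not spec.recovery.strip():
        gaps.append("recovery")
    if not spec.consumers:
        gaps.append("consumers")
    return gaps


def is_late(spec: JobSpec, now: datetime, last_success: Optional[datetime]) -> bool:
    """True once the grace period has passed with no successful run today.

    Assumption: a run counts for today if it succeeded on today's date.
    """
    hour, minute = map(int, spec.expected_by.split(":"))
    expected = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    deadline = expected + spec.late_after
    ran_today = last_success is not None and last_success.date() == now.date()
    return now > deadline and not ran_today
```

With an explicit `late_after`, an alert can say "two hours past the agreed deadline" instead of leaving the operator to decide whether the delay matters.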

The other lesson was that runbooks should be written for the moment when the operator is least patient. During an incident, nobody wants a history essay. They want to know what changed, what is affected, what can be rerun, and who should be contacted.

Looking back, this was the year I started moving from “can I build the pipeline?” to “can someone else operate this pipeline at 2 a.m. without guessing?” That question has shaped a lot of my engineering taste since.

Ownership Is a System Design Choice

Ownership should be visible in the system, not only in team memory. A job page, dataset catalog, alert, and runbook should all point toward the same owner. If an engineer has to ask three people who owns a table, the ownership model has already failed.
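A small consistency check can catch the case where those places disagree. This is a rough sketch, assuming the owner recorded in each place can be exported as a plain string; the source names in `REQUIRED_SOURCES` and the function `ownership_problems` are hypothetical.

```python
from typing import Optional

# Places where an owner is expected to be recorded (hypothetical names).
REQUIRED_SOURCES = ("job_page", "catalog", "alert", "runbook")


def ownership_problems(asset: str, owners_by_source: dict[str, Optional[str]]) -> list[str]:
    """List missing or conflicting owner entries for one asset."""
    problems = []
    for source in REQUIRED_SOURCES:
        if not owners_by_source.get(source):
            problems.append(f"{asset}: no owner listed in {source}")
    # Collect the distinct non-empty owners; more than one means a conflict.
    distinct = sorted({owner for owner in owners_by_source.values() if owner})
    if len(distinct) > 1:
        problems.append(f"{asset}: conflicting owners {distinct}")
    return problems
```

An empty result means every source names the same owner. Anything else is the "ask three people" problem showing up before an incident rather than during one.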

I also learned that ownership has different layers. One team may own the source application. Another may own ingestion. Another may own a derived metric. Incidents get confusing when those layers are not named. Clear ownership does not remove collaboration; it makes collaboration easier to start.

Batch jobs also need lifecycle ownership. Someone should know when a job can be retired. Otherwise old jobs keep running because nobody is brave enough to delete them. That creates cost, noise, and more surfaces for failure.
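A periodic review list is one low-effort way to start that conversation. The sketch below assumes that a last-read date is available for each job's output, which many setups will not have; `JobUsage`, `retirement_candidates`, and the 90-day default are hypothetical choices. It only produces candidates for a human to review and deletes nothing.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class JobUsage:
    """Usage facts for one job's output (hypothetical record)."""
    name: str
    consumers: list[str]
    last_read: Optional[date]  # None if no read has ever been recorded


def retirement_candidates(
    jobs: list[JobUsage],
    today: date,
    max_idle: timedelta = timedelta(days=90),
) -> list[str]:
    """Return names of jobs with no named consumer or no recent reads."""
    candidates = []
    for job in jobs:
        idle = job.last_read is None or today - job.last_read > max_idle
        if not job.consumers or idle:
            candidates.append(job.name)
    return candidates
```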

The mature version is a system where ownership travels with the asset. If a dataset is important enough to power decisions, it is important enough to have an accountable human path attached to it.