May 9, 2022

Serving Features Reliably

Offline feature pipelines can hide latency and correctness problems that become obvious during serving. A batch table can be late by an hour and still look fine in analysis. A product request path usually does not have that luxury.

This made me more careful about the boundary between offline and online systems. Which features need low latency? Which can be stale? How do we handle missing values? How do we know training and serving logic match?

The hard part is not only infrastructure. It is also product expectation. Not every model needs real-time features, but every model needs an honest contract about freshness and availability.

Reliable feature serving starts with deciding what reliability actually means for the use case.

Freshness, Availability, and Fallbacks

Feature serving forced me to separate different reliability dimensions. A feature can be available but stale. It can be fresh but missing for a segment. It can be correct offline but too slow for an online path. Each failure mode needs a different response.

I learned to ask product questions before infrastructure questions. If a feature is twelve hours old, does the model become useless or only slightly worse? If a value is missing, should the system use a default, skip the model, or show a different experience? Those decisions should not be invented during an outage.
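One way to keep those decisions out of the outage is to write them down as data next to the feature. The sketch below is only an illustration: the names FallbackPolicy, Action, and resolve are hypothetical, and the twelve-hour threshold in the usage note is an example value, not a recommendation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    USE_VALUE = "use_value"      # serve the stored value as-is
    USE_DEFAULT = "use_default"  # substitute the agreed default
    SKIP_MODEL = "skip_model"    # caller should show the non-model experience


@dataclass(frozen=True)
class FallbackPolicy:
    """Hypothetical per-feature policy, agreed with product ahead of time."""
    max_age: timedelta            # older than this counts as stale
    on_stale: Action
    on_missing: Action
    default: Optional[float] = None

    def __post_init__(self):
        # A missing value cannot be "used", so reject that combination early.
        if self.on_missing is Action.USE_VALUE:
            raise ValueError("on_missing cannot be USE_VALUE")
        uses_default = Action.USE_DEFAULT in (self.on_stale, self.on_missing)
        if uses_default and self.default is None:
            raise ValueError("USE_DEFAULT requires a default value")


def resolve(
    value: Optional[float],
    updated_at: Optional[datetime],
    policy: FallbackPolicy,
    now: datetime,
) -> Tuple[Action, Optional[float]]:
    """Return the action taken and the value to pass to the model (if any)."""
    if value is None or updated_at is None:
        action = policy.on_missing
    elif now - updated_at > policy.max_age:
        action = policy.on_stale
    else:
        action = Action.USE_VALUE

    if action is Action.USE_DEFAULT:
        return action, policy.default
    if action is Action.SKIP_MODEL:
        return action, None
    return action, value
```

A feature that is "only slightly worse" at twelve hours old might set on_stale to USE_VALUE, while one that makes the model useless when stale might set it to SKIP_MODEL. Returning the action alongside the value lets the caller log how often each fallback fires.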

Training-serving consistency also became more important. The same business definition should not be reimplemented in two places with small differences. When separate paths are unavoidable, they need comparison checks and shared test cases.
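As a rough sketch of what that can look like, the example below keeps one feature definition in a single function that both paths import, and adds a comparison check for the case where two implementations cannot be avoided. The feature days_since_last_order, the helper parity_mismatches, and the tolerance values are all made up for illustration.

```python
import math
from typing import Dict, List, Optional


def days_since_last_order(
    last_order_ts: Optional[float], as_of_ts: float
) -> Optional[float]:
    """Single shared definition (hypothetical feature).

    Both the batch job and the online service would import this function
    instead of re-deriving the rule. Timestamps are epoch seconds.
    """
    if last_order_ts is None:
        return None
    # Clamp at zero so clock skew between systems never yields negative ages.
    return max(0.0, (as_of_ts - last_order_ts) / 86400.0)


def parity_mismatches(
    offline: Dict[str, Optional[float]],
    online: Dict[str, Optional[float]],
    rel_tol: float = 1e-6,
) -> List[str]:
    """Return the entity keys whose offline and online values disagree.

    A key absent from one side is treated like a None value on that side.
    """
    mismatches = []
    for key in sorted(set(offline) | set(online)):
        a, b = offline.get(key), online.get(key)
        if a is None or b is None:
            if a is not b:  # exactly one side is missing
                mismatches.append(key)
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9):
            mismatches.append(key)
    return mismatches
```

Running parity_mismatches on a sample of entities, with the offline values computed as of the serving timestamp, gives a list that can be alerted on or reviewed before a model ships.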

Reliable feature serving is ultimately about making promises explicit. The model team, product team, and platform team should know what the system guarantees, what it does not guarantee, and how it behaves when reality is messy.

Operational Contracts for Serving

Serving systems need operational contracts that are easy to understand. What is the expected latency? What is the freshness target? What percentage of requests can tolerate missing features? Which fallbacks are safe?
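The answers to those four questions fit in a small record that can be checked against observed metrics. This is a minimal sketch under assumed names: ServingContract, its fields, and the metric keys in the observed dictionary are hypothetical, not taken from any particular feature store.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServingContract:
    """Hypothetical operational contract for one feature."""
    feature: str
    p99_latency_ms: float    # expected latency
    max_staleness_s: float   # freshness target
    max_missing_rate: float  # fraction of requests that may lack the feature
    safe_fallback: str       # agreed behavior when the contract is broken


def violations(contract: ServingContract, observed: Dict[str, float]) -> List[str]:
    """Compare observed metrics to the contract and list what is out of bounds.

    `observed` is assumed to carry the keys "p99_latency_ms", "staleness_s",
    and "missing_rate". A missing metric is reported as a violation, because
    an unmeasured guarantee is not a guarantee.
    """
    limits = {
        "p99_latency_ms": contract.p99_latency_ms,
        "staleness_s": contract.max_staleness_s,
        "missing_rate": contract.max_missing_rate,
    }
    problems = []
    for metric, limit in limits.items():
        if metric not in observed:
            problems.append(f"{metric}: not measured")
        elif observed[metric] > limit:
            problems.append(f"{metric}: {observed[metric]} exceeds {limit}")
    return problems
```

Keeping the contract as plain data means it can sit in version control next to the feature definition and be reviewed by the model, product, and platform teams together.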

These questions should be answered before the model is on a critical path. Otherwise, production incidents become product design meetings under pressure. That is not fair to the engineers or the users.

I also like exposing feature health to product teams. If a feature is degraded, the product owner should understand whether the user experience is affected. A model may still return output, but confidence in that output may be lower.
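A small sketch of that idea: translate per-feature health into a summary a product owner can read without knowing the pipeline. The Health levels, the product_summary function, and the impact notes are hypothetical. The notes in particular are assumed to be written by the product and model teams in advance.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class Health(IntEnum):
    # Ordered so that max() picks the worst status.
    HEALTHY = 0
    DEGRADED = 1
    UNAVAILABLE = 2


def product_summary(
    statuses: Dict[str, Health],
    impact_notes: Dict[str, str],
) -> Tuple[Health, List[str]]:
    """Return the worst overall status and one readable line per unhealthy feature."""
    overall = max(statuses.values(), default=Health.HEALTHY)
    messages = [
        f"{name}: {status.name.lower()} "
        f"({impact_notes.get(name, 'impact not documented')})"
        for name, status in sorted(statuses.items())
        if status is not Health.HEALTHY
    ]
    return overall, messages
```

The "impact not documented" default is deliberate: a degraded feature with no agreed user-facing impact is itself something the product owner should see.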

The lesson is that ML infrastructure cannot be purely behind the scenes. Feature reliability, model reliability, and product behavior are connected. The platform should make those connections visible.