May 5, 2025

The Hard Part of Reliable AI

The hard part of reliable AI is not only that models are probabilistic. It is also that everything around them changes: prompts, tools, source content, retrieval logic, product surfaces, user behavior, and evaluation criteria.

That makes change management central. Teams need to know what changed, when it changed, who approved it, and how quality moved afterward.

This looks a lot like mature data engineering. Version important inputs. Capture outputs. Preserve context. Monitor distributions. Review incidents. Make rollback possible.

AI products can feel new, but the operational lessons are familiar. Reliability comes from making change observable and recoverable.

Reliability Around a Moving Center

The model is only one moving part. The surrounding system often changes faster: new tools, new source documents, prompt edits, product UI changes, revised safety rules, and different user behavior as people learn what the product can do.

That means reliability work has to focus on change isolation. When quality moves, teams need to know which layer changed. Did retrieval return different context? Did the prompt frame the task differently? Did the model version change? Did the user population shift?
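One way to make "which layer changed" answerable is to record a small manifest with every release and diff two manifests when quality moves. The sketch below is only an illustration: the `ReleaseManifest` fields and the `changed_layers` helper are names I made up for this post, not part of any particular tool.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Hypothetical record of the versions that were live for one release."""

    model_version: str
    prompt_version: str
    retrieval_index_version: str
    tool_versions: tuple[str, ...]


def changed_layers(before: ReleaseManifest, after: ReleaseManifest) -> dict:
    """Return {field: (old, new)} for every layer that differs."""
    old, new = asdict(before), asdict(after)
    return {name: (old[name], new[name]) for name in old if old[name] != new[name]}


last_week = ReleaseManifest("m1", "p1", "idx-7", ("search@1",))
today = ReleaseManifest("m1", "p2", "idx-7", ("search@1",))
print(changed_layers(last_week, today))
```

A diff like this does not cover shifts in the user population, which has no version number. It does narrow the search to the layers a team controls.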

I also think rollback matters more than teams admit. If a prompt or retrieval change causes a regression, can we revert cleanly? If an index update includes bad content, can we remove it? If a model version behaves differently, can we compare outputs on known cases?
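A clean revert is easier when published versions are never overwritten and "what is live" is just a pointer into the history. Here is a minimal in-memory sketch of that idea. `VersionedConfig` is a hypothetical class for illustration, and a real system would presumably persist the history somewhere durable.

```python
class VersionedConfig:
    """Append-only history of a prompt or config, plus a pointer to the live version."""

    def __init__(self):
        self._history = []    # list of (version_id, value); never mutated in place
        self._active = None   # index into _history, or None before first publish

    def publish(self, version_id, value):
        """Record a new version and make it the live one."""
        self._history.append((version_id, value))
        self._active = len(self._history) - 1

    def active(self):
        """Return the live (version_id, value) pair."""
        if self._active is None:
            raise LookupError("nothing has been published yet")
        return self._history[self._active]

    def rollback(self):
        """Move the live pointer one step back and return the restored version id."""
        if self._active is None or self._active == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self._history[self._active][0]

    def history(self):
        """Return a copy of every published version, oldest first."""
        return list(self._history)


prompts = VersionedConfig()
prompts.publish("p1", "Answer briefly.")
prompts.publish("p2", "Answer in detail.")
prompts.rollback()
print(prompts.active())
```

Because the rollback only moves the pointer, the regressed version stays in the history, where it can still be inspected and compared.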

Reliable AI is not about pretending the system is deterministic. It is about surrounding probabilistic behavior with deterministic operational controls. Versioning, logging, evaluation, and rollback are how teams keep their footing.

Operating With Evidence

Reliable AI teams need evidence habits. Every important output should be traceable to the model, prompt, retrieved context, tools, and relevant policy checks that produced it. Without that, quality work becomes anecdotal.
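In practice, traceability can start as one structured log line per output. The sketch below assumes a flat record with hashed context and output. Every field name here is my own illustrative choice, and whether to store hashes or full text depends on privacy and storage constraints that this post does not cover.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def content_hash(text: str) -> str:
    """Short, stable fingerprint of a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass
class TraceRecord:
    """Hypothetical per-output trace: what produced this answer."""

    request_id: str
    model_version: str
    prompt_version: str
    retrieved_doc_hashes: list
    tools_called: list
    policy_checks: dict
    output_hash: str


def build_trace(request_id, model_version, prompt_version,
                retrieved_docs, tools_called, policy_checks, output):
    """Assemble a trace record from the raw pieces of one request."""
    return TraceRecord(
        request_id=request_id,
        model_version=model_version,
        prompt_version=prompt_version,
        retrieved_doc_hashes=[content_hash(doc) for doc in retrieved_docs],
        tools_called=list(tools_called),
        policy_checks=dict(policy_checks),
        output_hash=content_hash(output),
    )


def to_log_line(record: TraceRecord) -> str:
    """Serialize the record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


trace = build_trace("req-1", "m1", "p2", ["doc A", "doc B"],
                    ["search"], {"pii_filter": "pass"}, "final answer")
print(to_log_line(trace))
```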

Evidence also helps teams avoid overreacting. One bad example may reveal a serious flaw, or it may be an edge case. A good data system lets teams find similar cases and understand whether the issue is isolated or systemic.
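A rough way to tell isolated from systemic is to group logged outcomes by one layer's version and compare failure rates. The sketch below assumes each logged record is a dict with a boolean `failed` flag plus version fields. That shape is an assumption for illustration, not a standard schema.

```python
from collections import defaultdict


def failure_rate_by(records, key):
    """Return {value_of_key: fraction of records that failed} for one grouping field."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["failed"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


logged = [
    {"prompt_version": "p1", "failed": False},
    {"prompt_version": "p1", "failed": False},
    {"prompt_version": "p2", "failed": True},
    {"prompt_version": "p2", "failed": False},
]
print(failure_rate_by(logged, "prompt_version"))
```

If the failures cluster under one version, the problem looks systemic to that change. If they are spread evenly and thinly, the bad example is more likely an edge case.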

I care a lot about release discipline here. Changes should be compared against stable evaluation sets, monitored after launch, and rolled back when regressions are clear. This is ordinary engineering discipline applied to a less deterministic product surface.
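A release gate along these lines can be quite small. The sketch below compares per-case scores for a baseline and a candidate on the same fixed evaluation set. The `release_decision` name and the `max_drop` threshold of 0.02 are placeholders I chose, and a real gate would likely also account for noise between runs.

```python
from statistics import mean


def release_decision(baseline_scores, candidate_scores, max_drop=0.02):
    """Compare two runs over the same eval cases; scores are {case_id: float}."""
    if baseline_scores.keys() != candidate_scores.keys():
        raise ValueError("both runs must cover the same evaluation cases")
    if not baseline_scores:
        raise ValueError("evaluation set is empty")

    baseline_mean = mean(list(baseline_scores.values()))
    candidate_mean = mean(list(candidate_scores.values()))
    # Individual cases that got worse, kept for review even if the mean holds up.
    regressed = sorted(
        case for case in baseline_scores
        if candidate_scores[case] < baseline_scores[case]
    )
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "ship": (baseline_mean - candidate_mean) <= max_drop,
        "regressed_cases": regressed,
    }


baseline = {"case-1": 1.0, "case-2": 1.0, "case-3": 0.0}
candidate = {"case-1": 1.0, "case-2": 0.0, "case-3": 0.0}
print(release_decision(baseline, candidate))
```

Returning the regressed case ids alongside the decision matters as much as the boolean: it points reviewers at specific examples.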

The teams that do this well will move faster, not slower. Evidence reduces argument. It gives product, engineering, and safety teams a shared basis for decisions.