Why most AI pilots never reach production

Almost every AI pilot we are asked to inherit looks the same when we open the repository. There is a notebook, sometimes two, that produced the demo that won the budget. There is a small Streamlit or Next.js wrapper that lets a non-technical executive run a single happy-path query against an OpenAI key. There is a backlog of tickets that all assume the next phase is 'productionise the pilot'. And there is, almost universally, no eval suite, no contract for what 'good' looks like, and no plan for who owns the system after launch. The pilot wins the demo and then it dies, slowly, because nobody designed the path between a working notebook and a service that can be operated.

This is not a model problem. The model in the notebook is, in 2026, almost always good enough. The frontier models have closed the capability gap that used to make this hard. What has not closed is the gap between a stochastic system that produces an impressive answer once and a stochastic system that produces a defensible answer ten thousand times in a row, under conditions the prompt author never anticipated. That second gap is where pilots die, and it dies for structural reasons, not for technical ones.

The structural reason pilots fail

A pilot is, by definition, a one-off proof of concept. It is owned by whoever wrote it, scoped to whatever the demo audience cared about, and evaluated by whether the room nodded. The implicit contract is that if the demo lands, the project will be re-scoped into a real engineering effort. The unspoken assumption is that re-scoping will happen.

It almost never happens. The pilot author moves on to the next pilot. The executive who funded it moves on to the next quarter. The engineering organisation that would be responsible for productionising the work was not consulted on the original scope, never agreed to inherit it, and is now being asked to take ownership of a notebook that is implicitly already a success. The political cost of saying 'this needs to be rebuilt' is high enough that nobody says it, and the project enters a slow decline where small fixes are made to keep the demo working but no real engineering investment is committed. Eighteen months later, somebody quietly turns it off.

The structural fix is to refuse the pilot framing entirely. The first artifact of an AI engagement should not be a demo. It should be a contract — a written agreement, between the people funding the work and the people who will operate it, about what the system is supposed to do, what it is allowed to do, and how anybody will know whether it is doing it.

What a real evaluation contract looks like

The eval suite is the most under-specified artifact in the modern AI stack. Engineers tend to treat it as something that will be written 'once we have the basic pipeline working', which is the same energy as 'we'll write tests once we have the basic feature working' and ends in the same place. The eval suite should be the second thing written in the engagement, after the data contract, and it should be written before the prompt that produces the demo answer.

A workable eval suite has three components. The first is a fixed set of representative inputs, drawn from real production data with representative noise — long inputs, short inputs, malformed inputs, inputs in the wrong language. The second is a set of expected behaviours for each input, expressed not as 'the right answer' but as a set of properties the answer must hold (does it cite a source, does it refuse when it should, does it stay under the token budget). The third is a way to measure those properties at scale — usually a mix of programmatic checks and a smaller, slower LLM-as-judge layer for the properties that cannot be checked programmatically.

If you cannot describe, in writing, the property your system is supposed to have, you cannot ship a system that has it. You can ship a demo.

The contract piece matters more than the implementation piece. The eval suite is not primarily an engineering artifact — it is a negotiation artifact. It forces the people funding the work to commit, in writing, to what 'success' means. That commitment is what carries the project across the gap between the demo and the production system.

Ownership before scope

The second thing that has to be agreed before the first prompt is written is who will operate the system after launch. Not 'the engineering team' — a specific named person, with a specific named on-call rotation, and a specific named budget for both the API spend and the engineering hours required to maintain the system. If that person does not exist, the engagement should not start.

We have started to refuse engagements where ownership is unclear. The cost is real — these are paid engagements that we are turning down — but the cost of accepting an engagement that will be orphaned six months after launch is higher. Orphaned systems erode trust in AI engineering as a discipline, and that erosion is the thing that will eventually shrink the budget for the work we want to do.

What we put in place before the first prompt

On every engagement, before any prompt-engineering work begins, we agree four artifacts in writing with the customer. The data contract, which describes the inputs the system is allowed to see and the transformations applied to them. The eval suite, which describes the properties the outputs must hold. The operating contract, which names the on-call team and the budget. And the scope ceiling, which describes what the system will refuse to do — the explicit out-of-scope list that prevents quarterly scope creep from turning a focused agent into a general-purpose assistant.

Those four artifacts are usually the deliverable for the first two weeks of the engagement. They are produced before the first commit to the production repository. The customer is, sometimes, irritated by the pace — they wanted a demo by Friday — but the customers who let us run this way ship systems that are still in production three years later. The customers who do not, ship pilots.

The boring conclusion

Most AI pilots fail for the same reason most software projects fail — unclear scope, unclear ownership, no agreed definition of success. The novelty of the technology obscures this for a while. The novelty is wearing off. The teams that will continue to ship AI in production are the teams that treat it as software engineering, with the same discipline around contracts and ownership that they would apply to any other production system. There is no shortcut around that, and pretending there is one is the single most expensive mistake a 2026 organisation can make.

Why most AI pilots never reach production

The structural reason pilots fail

What a real evaluation contract looks like

Ownership before scope

What we put in place before the first prompt

The boring conclusion

More from AI in production.

Choosing a vector store in 2026

One email per month. New essays only.