The demo takes about twenty minutes to build. I've watched it happen dozens of times and I've done it myself more than once. You pick a use case, connect it to a model, write a prompt, feed in some sample data, and it works. Not just works. It's impressive. You show it to stakeholders and the energy in the room is real. Someone says "this is exactly what we needed." Someone else asks how fast you can ship it.

Six months later, the team is rebuilding it from scratch. Not because the idea was wrong. Because the thing that made the demo work is not the same thing that makes a production system work, and nobody designed for the difference.

I want to be specific about what actually goes wrong, because "the demo doesn't scale" is too vague to be useful.

The first thing that breaks is evaluation. In the demo, evaluation is the person running the demo. You look at the output, it looks right, you move on. In production, nobody is watching every output. You need automated evaluation, and you need to have designed for it from the start, which means you needed to define what "good" looks like before you started building. Most teams skip this because in the demo phase it feels obvious. It stops feeling obvious around the time your model starts confidently producing wrong answers at two in the morning and nobody catches it until a user does.
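A minimal eval harness makes the point concrete. Everything here is an illustrative sketch: the `fake_model` stand-in, the case data, and the "must contain these strings" pass criterion are assumptions, not a prescription — real harnesses grade with richer checks or a judge model.

```python
# A minimal eval harness sketch. fake_model is a stand-in for a
# real model call; the cases and pass criterion are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    must_contain: list[str]  # strings a "good" answer should include

def fake_model(prompt: str) -> str:
    # Stand-in for a real model client call.
    answers = {
        "refund policy?": "Refunds are available within 30 days of purchase.",
        "shipping time?": "Orders ship within 2 business days.",
    }
    return answers.get(prompt, "I don't know.")

def run_evals(model, cases: list[EvalCase]) -> float:
    """Return the pass rate: fraction of cases whose output
    contains every required string (case-insensitive)."""
    passed = 0
    for case in cases:
        output = model(case.prompt).lower()
        if all(s.lower() in output for s in case.must_contain):
            passed += 1
    return passed / len(cases)

cases = [
    EvalCase("refund policy?", ["30 days"]),
    EvalCase("shipping time?", ["business days"]),
    EvalCase("warranty length?", ["one year"]),  # the stand-in fails this
]
print(f"pass rate: {run_evals(fake_model, cases):.2f}")
```

The useful part is not the grading logic, which is trivial, but that writing the cases forces you to state what "good" means before anyone is looking at outputs by hand.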

The second thing that breaks is the prompt. Prompts in demos are written to work on the examples you have. They have not been tested against the distribution of actual user inputs, which is always stranger and more varied than whatever you planned for. The first week of real usage surfaces things no demo could have predicted. Suddenly the prompt needs to change, which means your outputs change, which means your downstream logic that assumed a certain output structure breaks, which means everything is less stable than it looked two weeks ago. Welcome to prompt maintenance, which is a real job that most people don't have a process for yet.
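One defensive pattern for the "downstream logic assumed a certain output structure" failure is to validate model output at the boundary instead of trusting it. A sketch, where the expected fields are illustrative assumptions:

```python
# Validate a model's JSON output at the boundary, so a prompt
# change that alters the output shape fails loudly here rather
# than silently breaking downstream code. The required fields
# are illustrative assumptions.
import json

REQUIRED_FIELDS = {"category": str, "confidence": float}

def parse_output(raw: str) -> dict:
    """Parse and check a model's JSON output; raise early rather
    than letting a malformed payload reach downstream logic."""
    data = json.loads(raw)
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

print(parse_output('{"category": "billing", "confidence": 0.87}'))
```

When the prompt inevitably changes, this is the one place that needs updating, and its failures show up in your error logs instead of in user-facing behavior.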

The third thing is cost. Demo tokens are free in the sense that you're not tracking them. Production tokens cost money, and the cost math often doesn't close at scale, especially if the original architecture was calling the model in ways that made sense for a demo but are genuinely wasteful in production. I've seen teams do back-of-envelope calculations at the six-month mark and discover that, at their projected growth, their current usage pattern would cost three times their compute budget. That's when the rebuild starts.
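The back-of-envelope math is worth doing early. A sketch of the calculation, where every number (request volume, token counts, per-million-token rates) is an illustrative assumption to be replaced with your own:

```python
# Back-of-envelope token cost model. All numbers below are
# illustrative assumptions; plug in your own rates and volumes.
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    daily = requests_per_day * (
        input_tokens * usd_per_1m_input / 1_000_000
        + output_tokens * usd_per_1m_output / 1_000_000
    )
    return daily * 30  # rough month

# Demo-era usage: 500 requests/day, long prompt re-sent in full each call.
demo = monthly_cost(500, 6_000, 800, 3.00, 15.00)
# Projected production usage: 50,000 requests/day, same wasteful pattern.
prod = monthly_cost(50_000, 6_000, 800, 3.00, 15.00)
print(f"${demo:,.0f}/mo now, ${prod:,.0f}/mo at projection")
```

The point of running this in month one rather than month six is that the wasteful calling pattern is cheap to fix before the architecture hardens around it.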

The fourth thing is the model itself. You built against whatever was current when you started. A newer model is out now. It's better, and you'd like to use it, except switching models means retesting everything because the same prompt produces different outputs across model versions. If you designed with model portability in mind, this is an afternoon of work. If you didn't, it's a project.

None of these failures are inevitable. They're predictable, which is better than inevitable, because predictable means you can design around them upfront. The teams doing this well tend to share a few habits.

They treat evaluation as infrastructure, not an afterthought. They build evals before they build features, which forces them to define success concretely rather than pointing at a demo and saying "like this." Without evals you're flying on vibes. With evals you have a signal that tells you when a model update or a prompt change made things better or worse. That signal is worth more than most teams realize until they don't have it.
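That better-or-worse signal can be made operational with a promotion gate: no prompt or model change ships unless its eval pass rate clears the recorded baseline. A sketch, where the tolerance threshold is an illustrative assumption:

```python
# A regression gate sketch: compare a candidate's eval pass rate
# against the recorded baseline before promoting a prompt or
# model change. The tolerance value is an illustrative assumption.
def should_promote(baseline_rate: float, candidate_rate: float,
                   tolerance: float = 0.02) -> bool:
    """Allow promotion only if the candidate is not meaningfully
    worse than the baseline."""
    return candidate_rate >= baseline_rate - tolerance

print(should_promote(0.91, 0.93))  # improvement: True
print(should_promote(0.91, 0.84))  # regression: False
```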

They separate the model from the application logic. The model is a dependency. It has an interface. The rest of the application doesn't know or care which model is behind that interface, which means you can swap, update, and version the model without triggering a rewrite of everything around it. This sounds obvious but I've seen production systems where the model call, the business logic, and the output formatting are all tangled together in the same function, and those systems age badly.
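One way to draw that boundary, sketched with a structural interface. The `TextModel` protocol and the two stub backends are assumptions for illustration; a real backend would wrap an SDK client behind the same method.

```python
# Keep the model behind an interface so swapping backends never
# touches application logic. The protocol and stub backends are
# illustrative; real ones would wrap an SDK call.
from typing import Protocol

class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class ModelA:
    def complete(self, prompt: str) -> str:
        return f"[model-a] {prompt}"

class ModelB:
    def complete(self, prompt: str) -> str:
        return f"[model-b] {prompt}"

def summarize(model: TextModel, text: str) -> str:
    # Business logic depends only on the interface, so switching
    # from ModelA to ModelB is a one-line change at the call site.
    return model.complete(f"Summarize: {text}")

print(summarize(ModelA(), "quarterly report"))
print(summarize(ModelB(), "quarterly report"))
```

This is the difference between "switching models is an afternoon" and "switching models is a project": the retest surface is one adapter, not the whole codebase.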

They build cost monitoring in from the start, not as an audit mechanism but as a feedback loop that informs architectural decisions. Token usage is an engineering metric, not just a billing line item. The teams that treat it like a metric make better architecture decisions earlier, before the numbers become uncomfortable.
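Treating token usage as an engineering metric can be as simple as a per-route ledger updated on every call. A sketch, where the route names and token counts are illustrative, and a real implementation would read usage counts from the SDK's response and export to your metrics system:

```python
# A minimal per-call token ledger, treating usage as a metric
# broken down by route rather than a monthly billing total.
# Route names and counts here are illustrative assumptions.
from collections import defaultdict

class TokenLedger:
    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, route: str, input_tokens: int, output_tokens: int) -> None:
        # In practice these counts come from the model response's
        # usage fields; here they are passed in directly.
        self.totals[(route, "input")] += input_tokens
        self.totals[(route, "output")] += output_tokens

    def report(self) -> dict:
        return dict(self.totals)

ledger = TokenLedger()
ledger.record("summarize", 1_200, 150)
ledger.record("summarize", 900, 140)
ledger.record("classify", 300, 5)
print(ledger.report())
```

Per-route totals are what turn the number into a feedback loop: they tell you which call pattern to redesign, not just that the bill went up.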

And they assume the prompt will change. This sounds obvious until you look at how most code is written. If your prompt is a string literal embedded in a function called in ten places, changing the prompt is a dangerous operation. If your prompt is a managed artifact with versioning and rollout controls, changing the prompt is a deployment. Those are very different things to operate.
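The "managed artifact" idea can be sketched in a few lines: prompts live in a versioned registry, and an active-version table controls rollout. The registry shape and version scheme are assumptions; teams often back this with a database or a config repo instead of an in-process dict.

```python
# Prompts as managed artifacts rather than inline string literals.
# The in-process registry and version scheme are illustrative
# assumptions; a real system would persist and audit these.
PROMPTS = {
    ("summarize", "v1"): "Summarize the following text:\n{text}",
    ("summarize", "v2"): "Summarize the following text in three bullet points:\n{text}",
}

ACTIVE = {"summarize": "v2"}  # rollout control: flip back to "v1" to roll back

def get_prompt(name: str, text: str) -> str:
    """Resolve the active version of a named prompt and fill it in."""
    version = ACTIVE[name]
    return PROMPTS[(name, version)].format(text=text)

print(get_prompt("summarize", "the demo worked"))
```

With this shape, changing the prompt is a registry update plus a flip of the active version, and rolling back is the same operation in reverse, which is what makes it a deployment rather than a risky edit in ten call sites.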

The demo is proof that the product is worth building. It is not the product. That distinction sounds pedantic right up until you're six months in, the team is exhausted, and the stakeholder who loved the demo is asking why it's taking so long. The answer, almost always, is that someone built the proof of concept twice: once to show it worked, and once to actually ship it. The gap between those two things is where most AI projects stall. Design for the second build from the start and you only have to do it once.