I've watched it happen dozens of times and I've done it myself more than once. You pick a use case, connect it to a model, write a prompt, feed in some sample data, and it works. Not just works. It's impressive. You show it to stakeholders and the energy in the room is real. Someone says 'this is exactly what we needed.' Someone else asks how fast you can ship it.

Six months later, the team is rebuilding it from scratch. Not because the idea was wrong. Because the thing that made the demo work is not the same thing that makes a production system work, and nobody designed for the difference.

The first thing that breaks is evaluation. In the demo, evaluation is the person running the demo. You look at the output, it looks right, you move on. In production, nobody is watching every output. You need automated evaluation, and you need to have designed for it from the start, which means you needed to define what 'good' looks like before you started building.

The second thing that breaks is the prompt. Prompts in demos are written to work on the examples you have. They have not been tested against the distribution of actual user inputs, which is always stranger and more varied than whatever you planned for. The first week of real usage surfaces things no demo could have predicted.

The third thing is cost. Demo tokens are free in the sense that you're not tracking them. Production tokens cost money, and the cost math often doesn't close at scale, especially if the original architecture was calling the model in ways that made sense for a demo but are genuinely wasteful in production.

The fourth thing is the model itself. You built against whatever was current when you started. A newer model is out now. It's better, and you'd like to use it, except switching models means retesting everything, because the same prompt can produce different outputs from one model version to the next.

The teams doing this well tend to share a few habits. They treat evaluation as infrastructure, not an afterthought. They build evals before they build features, which forces them to define success concretely rather than pointing at a demo and saying 'like this.'
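To make "evals before features" concrete, here is a minimal sketch of what that can look like. Everything in it is hypothetical: `EvalCase`, `run_evals`, and `stub_model` are names I made up for illustration, and the stub stands in for a real model call. The point is only that each case carries its own definition of "good" as a check you can run without a human in the loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One example input plus an explicit definition of an acceptable output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through `generate` and summarize the results."""
    failures = [c.name for c in cases if not c.check(generate(c.prompt))]
    passed = len(cases) - len(failures)
    return {
        "passed": passed,
        "total": len(cases),
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: a naive intent classifier."""
    return "REFUND_REQUEST" if "refund" in prompt.lower() else "OTHER"


CASES = [
    EvalCase("plain refund", "I want a refund for my order",
             lambda out: out == "REFUND_REQUEST"),
    EvalCase("greeting", "hello there",
             lambda out: out == "OTHER"),
    # The kind of input a demo never includes but real users send.
    EvalCase("typo refund", "pls refnd my money",
             lambda out: out == "REFUND_REQUEST"),
]

report = run_evals(stub_model, CASES)
print(report)
```

With this stub, the typo case fails, which is the useful part: the failure is recorded by name instead of being discovered by a user. A real suite would need far more cases and probably fuzzier checks than string equality, but the shape is the same.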

They separate the model from the application logic. The model is a dependency. It has an interface. The rest of the application doesn't know or care which model is behind that interface, which means you can swap, update, and version the model without triggering a rewrite of everything around it.
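One way to sketch that separation, assuming nothing about any particular vendor SDK: define a small interface and have the application depend only on it. `TextModel`, `FakeModelA`, `FakeModelB`, and `Summarizer` are all hypothetical names, and the fake models just tag their output so the swap is visible. In a real system each one would wrap a provider client.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only thing the application knows about a model."""

    def complete(self, prompt: str) -> str: ...


class FakeModelA:
    """Hypothetical stand-in for one model version."""

    def complete(self, prompt: str) -> str:
        return f"[a] {prompt}"


class FakeModelB:
    """Hypothetical stand-in for a newer model version."""

    def complete(self, prompt: str) -> str:
        return f"[b] {prompt}"


class Summarizer:
    """Application logic. It never names a concrete model."""

    def __init__(self, model: TextModel) -> None:
        self._model = model

    def summarize(self, text: str) -> str:
        return self._model.complete(f"Summarize: {text}")


# Swapping the model is a one-line change at the point of construction.
old = Summarizer(FakeModelA())
new = Summarizer(FakeModelB())
print(old.summarize("quarterly report"))
print(new.summarize("quarterly report"))
```

The interface doesn't make the retesting go away. It just confines the change to one place, so the retest is a run of your evals against the new implementation and not an archaeology project through the codebase.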

They build cost monitoring in from the start, not as an audit mechanism but as a feedback loop that informs architectural decisions. Token usage is an engineering metric, not just a billing line item.
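As a rough sketch of token usage as an engineering metric, the tracker below attributes tokens to the feature that spent them and turns that into cost per call. `UsageTracker` is a hypothetical name, and the prices are made-up placeholders, not any provider's real rates. I'm also assuming you can get input and output token counts for each call, which most model APIs report in some form.

```python
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Accumulates token usage per feature under assumed per-token prices."""
    usd_per_million_input: float
    usd_per_million_output: float
    # feature name -> [call count, input tokens, output tokens]
    _totals: dict = field(default_factory=dict)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        totals = self._totals.setdefault(feature, [0, 0, 0])
        totals[0] += 1
        totals[1] += input_tokens
        totals[2] += output_tokens

    def cost_usd(self, feature: str) -> float:
        _, inp, out = self._totals.get(feature, [0, 0, 0])
        return (inp * self.usd_per_million_input
                + out * self.usd_per_million_output) / 1_000_000

    def cost_per_call_usd(self, feature: str) -> float:
        calls = self._totals.get(feature, [0, 0, 0])[0]
        return self.cost_usd(feature) / calls if calls else 0.0


# Placeholder prices, chosen only to make the arithmetic easy to follow.
tracker = UsageTracker(usd_per_million_input=1.00, usd_per_million_output=2.00)
tracker.record("search", input_tokens=1_000, output_tokens=500)
tracker.record("search", input_tokens=3_000, output_tokens=500)

print(f"search total:    ${tracker.cost_usd('search'):.4f}")
print(f"search per call: ${tracker.cost_per_call_usd('search'):.4f}")
```

Cost per call, broken out by feature, is the number that feeds back into architecture. It tells you which call pattern to rethink, where a monthly invoice only tells you that something somewhere is expensive.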

The demo is proof that the product is worth building. It is not the product. That distinction sounds pedantic right up until you're six months in, the team is exhausted, and the stakeholder who loved the demo is asking why it's taking so long.