GPT-4 launched in March 2023. By September, teams that put it into production had six months of real experience. The gap between benchmark performance and production behaviour is where the learning happened.
The context length problem
GPT-4 launched with an 8K context window (later a 32K preview). Many enterprise documents, code files, and conversation histories exceed 8K tokens, so teams built retrieval pipelines to select the relevant excerpts. Retrieval quality became the primary determinant of output quality: retrieve the wrong chunks and the model answers confidently from irrelevant context. Garbage in, garbage out, except the garbage is now confident and well-written.
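A minimal sketch of the pattern, with the two moving parts that matter: chunking the document to fit the window, and scoring chunks against the query. The keyword-overlap scorer here is a deliberate stand-in for whatever a real pipeline uses (embeddings, BM25); the names `chunk`, `score`, and `retrieve` are illustrative, not from any library.

```python
def chunk(text, max_words=500):
    """Split text into chunks of roughly max_words words (a crude token proxy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score(chunk_text, query):
    """Crude relevance score: fraction of query terms present in the chunk.
    A real pipeline would use embeddings or BM25 here."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(document, query, k=3):
    """Return the k highest-scoring chunks to pack into the context window."""
    chunks = chunk(document)
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]
```

The point of the sketch is where the quality lives: everything downstream of `score` inherits its mistakes, which is exactly the failure mode described above.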
Hallucination patterns
GPT-4 hallucinates less than GPT-3.5 but still produces confident incorrect statements in specific failure modes: citing non-existent papers and URLs, giving incorrect version numbers and API signatures, and fabricating specific details in domains where the training data was sparse. Teams that deployed GPT-4 for factual lookups without citation verification learned this the hard way. The mitigation is to require citations and verify them, not to assume the model is correct.
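A sketch of the verification step, under one loud assumption: the `known_good` set stands in for a real check (an HTTP fetch, a DOI lookup, a search of your own document store). The point is structural: the model's citations are claims to verify, not facts to pass through.

```python
import re

# Matches http(s) URLs up to whitespace or common closing punctuation.
URL_RE = re.compile(r"https?://[^\s)>\"]+")

def extract_citations(answer):
    """Pull every URL the model cited out of its answer text."""
    return URL_RE.findall(answer)

def unverified_citations(answer, known_good):
    """Return cited URLs that fail verification.

    known_good is a placeholder for a real verifier; in production this
    would be a fetch-and-check, not a set lookup.
    """
    return [u for u in extract_citations(answer) if u not in known_good]
```

An answer that cites anything in the returned list gets flagged before it reaches a user, which is the cheap version of "citations with verification".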
The system prompt as the application layer
The insight that changed how teams structure LLM applications is that the system prompt is where the application logic lives. The difference between a general-purpose LLM and a useful product is the specificity and quality of the system prompt. Teams that invested in system prompt engineering, testing it against adversarial inputs and edge cases, produced more reliable applications than teams that passed requests directly to the API.
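One way to make that investment concrete is to treat the system prompt as a built, tested artifact rather than a string literal. The sketch below is illustrative (the function names and prompt structure are assumptions, not a standard), but it shows the shape: assemble the prompt from explicit parts, then regression-test it like any other application code.

```python
def build_system_prompt(role, constraints, refusal):
    """Assemble a system prompt from explicit, individually testable parts."""
    lines = [f"You are {role}.", "Rules:"]
    lines += [f"- {c}" for c in constraints]
    lines.append(f"If a request falls outside these rules, {refusal}")
    return "\n".join(lines)

def missing_phrases(prompt, required_phrases):
    """Tiny regression check: return required phrases absent from the prompt.
    Run this in CI so a prompt edit cannot silently drop a constraint."""
    return [p for p in required_phrases if p not in prompt]
```

The adversarial-input testing described above then runs against the assembled prompt, and the regression check guards the constraints that testing showed to matter.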
Cost at scale
GPT-4 at $0.03 per 1K input tokens and $0.06 per 1K output tokens is expensive at production scale. A single complex query might cost $0.50-$2.00. Applications with thousands of users require careful prompt engineering to minimise token usage, caching of common responses, and routing of simpler queries to GPT-3.5 Turbo. The applications that optimised their token usage in the first few months scaled at a fraction of the cost of the applications that treated tokens as free.
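The arithmetic and the routing decision are both simple enough to sketch. The GPT-4 prices below are the launch-era list prices quoted above; the GPT-3.5 Turbo figures are the mid-2023 prices and, like the word-count routing heuristic, are stand-ins (a real router would use a classifier, and pricing changes over time).

```python
# (input, output) price per 1K tokens. GPT-4 figures are the launch-era
# list prices; GPT-3.5 Turbo figures are mid-2023 pricing -- check current rates.
PRICES = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo": (0.0015, 0.002),
}

def query_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the listed per-1K-token prices."""
    pin, pout = PRICES[model]
    return (input_tokens / 1000) * pin + (output_tokens / 1000) * pout

def route(query, threshold_words=30):
    """Crude router: short queries go to the cheaper model.
    A stand-in for a real complexity classifier."""
    return "gpt-3.5-turbo" if len(query.split()) < threshold_words else "gpt-4"
```

A query with 10K input tokens and 2K output tokens costs $0.42 on GPT-4 versus about $0.02 on GPT-3.5 Turbo, which is why routing even a modest fraction of traffic to the cheaper model dominates the cost curve.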