What We Learned from 6 Months of GPT-4 in Production

GPT-4 launched in March 2023, and by September, teams that had put it into production had six months of real experience. The gap between benchmark performance and production behaviour is where the learning happened.

Context length was the first practical constraint. GPT-4 launched with an 8K context window, with a 32K variant available only in limited access. Many enterprise documents, code files, and conversation histories exceed 8K tokens, so teams built retrieval pipelines to select relevant excerpts. Retrieval quality then became the primary determinant of output quality: if the wrong chunks were retrieved, the model answered confidently from irrelevant context.
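To make the retrieval step concrete, here is a minimal sketch of selecting chunks under a token budget. It is an illustration, not the pipeline any particular team ran: the function names, the four-characters-per-token estimate, and the `min_score` cut-off are all assumptions, and a real system would use a proper tokenizer and a vector index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def approx_tokens(text):
    # Crude assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def select_chunks(query_vec, chunks, token_budget, min_score=0.0):
    """Pick the most similar chunks that fit within token_budget.

    chunks is a list of (text, embedding_vector) pairs.
    Chunks scoring below min_score are never included.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    selected, used = [], 0
    for text, vec in ranked:
        if cosine(query_vec, vec) < min_score:
            break  # everything after this point is even less relevant
        cost = approx_tokens(text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget, try a smaller one
        selected.append(text)
        used += cost
    return selected
```

The `min_score` threshold is the part that addresses the failure described above: returning fewer chunks is usually safer than padding the prompt with weakly related text.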

In our biggest deployment, a legal-document assistant, we ran into the retrieval bottleneck within the first week. We indexed 2 million paragraphs in a Pinecone vector store, using OpenAI’s ada-002 embeddings at roughly $0.0004 per 1K tokens. The initial query latency hovered around 400 ms, which was unacceptable for an interactive UI. Switching to a locally hosted FAISS index reduced the median latency to 180 ms and cut the embedding cost by 70 percent, but we paid the price in operational complexity: we had to manage sharding and periodic re-indexing as new contracts arrived. The trade-off was worth it because a 0.2-second delay meant the difference between users staying on the page and abandoning the session.

GPT-4 hallucinates less than GPT-3.5 but still produces confident incorrect statements in specific failure modes. These include citing non-existent papers and URLs, providing incorrect version numbers and API signatures, and fabricating specific details in domains where the training data was sparse. Teams that deployed GPT-4 for factual lookups without citation verification learned this the hard way. The mitigation is to require citations and check each one against its source, rather than assume the model is correct.
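One way to check citations mechanically is sketched below. It assumes the application asks the model to return each citation as a source ID plus a verbatim quote, and that the application holds the text of every source it supplied; that response shape and the names `verify_citations` and `normalise` are hypothetical, chosen for illustration.

```python
import re

def normalise(text):
    """Collapse whitespace and lowercase so trivial differences don't fail a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_citations(citations, sources):
    """Return a list of (citation, reason) pairs for citations that fail.

    citations: list of dicts with "source_id" and "quote" keys (assumed shape).
    sources:   dict mapping source_id to the full source text we supplied.
    """
    failures = []
    for citation in citations:
        source_text = sources.get(citation["source_id"])
        if source_text is None:
            # The model cited something we never gave it.
            failures.append((citation, "unknown source"))
        elif normalise(citation["quote"]) not in normalise(source_text):
            # The source exists, but the quoted passage does not appear in it.
            failures.append((citation, "quote not found in source"))
    return failures
```

An exact-substring check like this only catches fabricated sources and misquotes. It does not tell you whether a genuine quote actually supports the claim attached to it, which still needs human or model-assisted review.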

A crucial insight that changed how teams structure LLM applications is that the system prompt is where the application logic lives. The difference between a general-purpose LLM and a useful product is the specificity and quality of the system prompt. Teams that invested in system prompt engineering, testing it against adversarial inputs and edge cases, produced more reliable applications than teams that passed requests directly to the API.

We also learned that treating the system prompt as static code is a recipe for brittleness. We started versioning prompts in git alongside the application code and built a tiny harness that runs a suite of 150 canned queries against each version. When we introduced a new clause about “only return JSON with keys ‘answer’ and ‘source’”, the test suite caught a regression where the model slipped back to free-form text in 12 percent of cases. Using LangChain’s PromptTemplate helped keep placeholders consistent, and we wrapped the prompt in a function call that validates token length before sending it to the API. The extra CI step added about five minutes to each deploy but saved us from nightly alerts where the bot started spitting out HTML tags.
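A harness of this kind can be very small. The sketch below is not our actual test suite; it assumes a `call_model(prompt, query)` callable that returns the raw reply as a string, and `fake_model` is a stand-in so the example runs without network access.

```python
import json

REQUIRED_KEYS = {"answer", "source"}

def is_valid_reply(raw):
    """True only if raw is a JSON object with exactly the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS

def failure_rate(call_model, prompt, queries):
    """Run every canned query and return (fraction that failed, failing queries)."""
    failures = [q for q in queries if not is_valid_reply(call_model(prompt, q))]
    return len(failures) / len(queries), failures

def fake_model(prompt, query):
    # Stand-in for the real API call: drifts to free-form text on one query.
    if query == "q3":
        return "The answer is probably yes."
    return json.dumps({"answer": "yes", "source": "doc-1"})
```

In CI, the build would fail when `failure_rate(...)` for the new prompt version exceeds a threshold the team has agreed on. Keeping the list of failing queries makes the regression easy to reproduce by hand.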

GPT-4 is expensive at production scale: the 8K model is priced at $0.03 per 1K input tokens and $0.06 per 1K output tokens. A single complex query, particularly one that chains several calls or uses the more expensive 32K model, might cost $0.50-$2.00. Applications with thousands of users require careful prompt engineering to minimise token usage, caching of common responses, and routing of simpler queries to GPT-3.5 Turbo. The applications that optimised their token usage in the first few months scaled at a fraction of the cost of those that treated tokens as free.
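The arithmetic is simple enough to keep in a helper. The sketch below uses the per-1K prices quoted above as defaults; the function names and the 30-day month are illustrative assumptions, and the traffic figures in the note that follows are made up for the example.

```python
GPT4_INPUT_PER_1K = 0.03   # USD per 1K input tokens, as quoted above
GPT4_OUTPUT_PER_1K = 0.06  # USD per 1K output tokens, as quoted above

def request_cost(input_tokens, output_tokens,
                 input_rate=GPT4_INPUT_PER_1K, output_rate=GPT4_OUTPUT_PER_1K):
    """Cost in USD of a single API call."""
    return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate

def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens, days=30):
    """Projected monthly spend, assuming every request looks like the average."""
    return requests_per_day * days * request_cost(avg_input_tokens, avg_output_tokens)
```

A call with 6,000 input tokens and 1,000 output tokens comes to about $0.24. At a hypothetical 1,000 such calls per day that is roughly $7,200 a month, which is why trimming the prompt matters more than it first appears.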

Cost control became a daily ritual. Early on we logged every token to a Redis stream and aggregated daily spend; the first month we burned through $12k on GPT-4 alone. By introducing a cheap-first routing layer that checks the query length and intent, we could answer 68 percent of requests with GPT-3.5 Turbo, dropping the average cost per user session from $0.45 to $0.12. We also cached the top 1,000 most common answers in a DynamoDB table with a TTL of 12 hours, which shaved another $1.5k per month. The lesson was clear: without a disciplined caching and routing strategy the bill grows faster than the user base.
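The shape of a cheap-first router with a TTL cache is sketched below. This is a simplified illustration, not our production code: the in-memory `TTLCache` stands in for the DynamoDB table, the keyword list is a crude placeholder for real intent detection, and the thresholds and the `call_model(model, query)` callable are assumptions.

```python
import time

class TTLCache:
    """In-memory stand-in for a key-value table with per-item expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired, treat as a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

# Hypothetical markers of a "hard" query; a real system would classify intent.
HARD_KEYWORDS = ("clause", "liability", "indemn")

def choose_model(query, max_cheap_words=40, hard_keywords=HARD_KEYWORDS):
    """Send long or domain-heavy queries to GPT-4, everything else to GPT-3.5 Turbo."""
    text = query.lower()
    if len(text.split()) > max_cheap_words:
        return "gpt-4"
    if any(keyword in text for keyword in hard_keywords):
        return "gpt-4"
    return "gpt-3.5-turbo"

def answer(query, cache, call_model):
    """Return (reply, origin) where origin is "cache" or the model that was called."""
    key = " ".join(query.lower().split())  # normalise case and whitespace
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    model = choose_model(query)
    reply = call_model(model, query)
    cache.put(key, reply)
    return reply, model
```

A 12-hour TTL would be `TTLCache(12 * 3600)`. Injecting the clock makes expiry testable without sleeping, and returning the origin alongside the reply makes it straightforward to log what share of traffic each path handles.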

Taken together, the six months showed that the gap between benchmark performance and real-world behaviour is closed by engineering around the model, not by the model itself. The applications that succeeded treated retrieval quality, citation verification, prompt testing, and token cost as first-class concerns. Those habits, more than any single trick, are what will carry over to future model deployments.