Azure OpenAI Service has been generally available since January 2023. Halfway through the year, I see clear patterns in how enterprises architect applications on top of it.

The dominant pattern for enterprise LLM applications is Retrieval-Augmented Generation. This pattern involves embedding documents into a vector store, retrieving relevant chunks when a user asks a question, injecting them into the prompt as context, and letting the model answer using that context. Azure Cognitive Search added vector search capability in 2023, making it a natural vector store for Azure-native architectures. Cosmos DB also added vector search via MongoDB API. The retrieval layer is becoming a key part of the Azure stack.

I've seen enterprises struggle with the trade-off between retrieval quality and latency. For instance, one company used a 2-second timeout for retrieving chunks from Azure Cognitive Search, but found that 20% of their queries took longer than 1 second. They ended up implementing a caching layer to store recent queries and their corresponding results, which reduced their average latency to 500ms. However, this came at the cost of increased storage usage and the need for more complex cache invalidation logic.

Enterprise applications use the system to define the model's role, restrict its behavior, and inject organizational context. A well-designed system prompt specifies what the assistant can and cannot discuss, what format its responses should take, what to do when it does not know the answer, and what fallback behavior to exhibit for out-of-scope requests. Designing a good system prompt is an engineering task, not just a configuration step.

Enterprises often start with simple system prompts, but as they mature, they realize the need for more sophisticated prompts that can handle a wider range of user inputs. For example, one company started with a basic prompt that simply instructed the model to 'answer the user's question', but later updated it to include specific guidelines for handling out-of-scope requests, such as 'if the user asks about a competitor, respond with a generic statement and do not provide any additional information'. This updated prompt required more engineering effort upfront, but resulted in better user experience and reduced hallucinations.

Azure OpenAI pricing is per token, and long context windows are expensive. To reduce costs without sacrificing quality, enterprises cache responses to identical prompts, use smaller models for simpler tasks - like GPT-3.5 Turbo for classification and GPT-4 for generation - summarize long conversation histories before injecting them as context, and batch similar requests to use Azure OpenAI's batch inference endpoint.

Standard application monitoring does not work for LLM applications. You need to track model latency and token consumption, retrieval quality - did you retrieve the right chunks? - response quality - did the model use the context correctly? - and hallucination rate - is the model stating things not in its context?. Azure Monitor with Application Insights captures infrastructure metrics. Semantic Kernel and LangChain have logging integrations. But evaluating response quality requires human review or an LLM-based evaluation pipeline.