Azure OpenAI Service has been generally available since January 2023. Halfway through the year, the patterns of how enterprises are actually architecting applications on top of it are clear enough to discuss.
The RAG architecture
Retrieval-Augmented Generation is the dominant pattern for enterprise LLM applications. The pattern: embed your documents into a vector store; when a user asks a question, retrieve the most relevant chunks, inject them into the prompt as context, and let the model answer using that context. Azure Cognitive Search added vector search capability in 2023, making it a natural vector store for Azure-native architectures, and Azure Cosmos DB for MongoDB vCore added vector search as well. The retrieval layer is becoming a first-class concern in the Azure stack.
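The retrieve-then-inject flow can be sketched in a few lines. This is a toy, not a production retriever: the two-dimensional vectors stand in for real embeddings (which in an Azure-native stack would come from an embedding model and live in Azure Cognitive Search), and the prompt template is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return [item["text"] for item in scored[:k]]

def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt as grounding context."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy index; in production the vectors come from an embedding model
# and the index lives in a vector store, not a Python list.
index = [
    {"text": "Refunds are processed within 5 business days.", "vector": [0.9, 0.1]},
    {"text": "Our office is closed on public holidays.", "vector": [0.1, 0.9]},
]
chunks = retrieve([0.85, 0.15], index, k=1)
prompt = build_prompt("How long do refunds take?", chunks)
```

The instruction to answer "using only the context below" is what distinguishes RAG from free-form generation: the model is asked to ground its answer in the retrieved chunks rather than its training data.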
The system prompt as policy
Enterprise applications use the system prompt to define the model's role, restrict its behaviour, and inject organisational context. A well-designed system prompt specifies: what the assistant can and cannot discuss, what format its responses should take, what to do when it does not know the answer, and what fallback behaviour to exhibit for out-of-scope requests. System prompt design is an engineering discipline, not just a configuration step.
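Treating the system prompt as a policy document suggests assembling it from explicit, reviewable sections rather than writing it ad hoc. A minimal sketch, assuming a hypothetical helper and a placeholder organisation; the section structure mirrors the four concerns above and is not an Azure OpenAI requirement.

```python
def build_system_prompt(role, allowed_topics, response_format, fallback):
    """Assemble a system prompt from named policy sections."""
    sections = [
        f"You are {role}.",
        "You may only discuss: " + ", ".join(allowed_topics) + ".",
        f"Format every response as {response_format}.",
        "If you do not know the answer, say you do not know; never guess.",
        f'For out-of-scope requests, reply: "{fallback}"',
    ]
    return "\n".join(sections)

# Contoso is a placeholder organisation; the values are illustrative.
system_prompt = build_system_prompt(
    role="an HR benefits assistant for Contoso",
    allowed_topics=["health insurance", "leave policy", "retirement plans"],
    response_format="a short paragraph followed by a source citation",
    fallback="I can only help with Contoso HR benefits questions.",
)
```

Keeping each policy concern as a separate parameter makes the prompt reviewable and testable the way any other piece of configuration-as-code would be.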
Prompt caching and token costs
Azure OpenAI pricing is per token. Long context windows are powerful but expensive. The patterns that reduce cost without reducing quality: caching responses to identical prompts, using smaller models for simpler subtasks (GPT-3.5 Turbo for classification, GPT-4 for generation), summarising long conversation histories before injecting them as context, and batching several similar items into a single request where the task structure allows it.
Observability for LLM applications
Standard application monitoring does not work for LLM applications. You need to track: model latency and token consumption, retrieval quality (did you retrieve the right chunks?), response quality (did the model use the context correctly?), and hallucination rate (is the model stating things not in its context?). Azure Monitor with Application Insights captures the infrastructure metrics. Semantic Kernel and LangChain have logging integrations. But the ground truth evaluation of response quality requires human review or an LLM-based evaluation pipeline.
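A sketch of capturing these per-call metrics, under stated assumptions: the field names are illustrative, the stand-in lambda replaces a real model call, and the context-overlap check is a deliberately crude proxy for "did the model use the retrieved chunks", not a real hallucination detector. The resulting record is what you would forward to Application Insights as a custom event.

```python
import time

def instrument_call(call, prompt, retrieved_chunks):
    """Wrap a model call and return (response, metrics_record)."""
    start = time.perf_counter()
    response = call(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    # Crude proxy for context usage: does the answer echo any chunk's
    # leading word? Real pipelines use human or LLM-based evaluation.
    used_context = any(
        chunk.split()[0].lower() in response.lower()
        for chunk in retrieved_chunks if chunk
    )
    record = {
        "latency_ms": round(latency_ms, 1),
        "prompt_chars": len(prompt),        # real systems log token counts
        "response_chars": len(response),
        "chunks_retrieved": len(retrieved_chunks),
        "context_overlap": used_context,
    }
    return response, record

response, record = instrument_call(
    lambda p: "Refunds are processed within 5 business days.",
    "How long do refunds take?",
    ["Refunds are processed within 5 business days."],
)
```

The point is that each call emits a structured record: infrastructure metrics (latency, token counts) can go straight to Azure Monitor, while the quality fields feed the human or LLM-based evaluation pipeline the text describes.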