Anthropic launched Claude 3 on March 4, 2024 with three models: Haiku, Sonnet, and Opus. Claude 3 Opus posts higher benchmark scores than GPT-4 on several evaluations, but the more interesting story is the architecture of the family.

The three-tier family

Haiku is the smallest and fastest, designed for tasks where latency matters more than raw capability: customer-facing chatbots, document-processing pipelines, and classification. Sonnet sits in the middle, comparable to GPT-3.5 Turbo in cost but closer to GPT-4 in capability on most tasks. Opus is the frontier model, which Anthropic benchmarks above GPT-4 on MMLU, graduate-level reasoning, and coding tasks.
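One practical consequence of the three-tier design is that applications can route requests to a tier by task profile rather than sending everything to the largest model. A minimal routing sketch follows; the model identifiers are the ones Anthropic published at launch, but the routing rules themselves are illustrative assumptions, not Anthropic guidance.

```python
# Hypothetical model-routing sketch for the Claude 3 family.
# Routing criteria here are illustrative, not official recommendations.

HAIKU = "claude-3-haiku-20240307"
SONNET = "claude-3-sonnet-20240229"
OPUS = "claude-3-opus-20240229"

def choose_model(latency_sensitive: bool, needs_frontier_reasoning: bool) -> str:
    """Pick a tier: fast (Haiku), balanced (Sonnet), or frontier (Opus)."""
    if latency_sensitive:
        return HAIKU    # chatbots, classification, high-volume pipelines
    if needs_frontier_reasoning:
        return OPUS     # hard reasoning, complex coding
    return SONNET       # balanced default

print(choose_model(latency_sensitive=True, needs_frontier_reasoning=False))
# → claude-3-haiku-20240307
```

In practice a router like this is often the difference between a cost-effective deployment and one that pays Opus prices for classification traffic.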

The 200K context window

All three Claude 3 models support a 200,000-token context window: approximately 150,000 words, or a full novel. More practically, it is large enough to load a substantial codebase, a long legal contract, or a multi-session research corpus into a single API call. The practical question is whether the model actually attends to content throughout a 200K window, or whether it suffers from the "lost in the middle" problem, where content far from the beginning or end of the prompt is poorly recalled. Anthropic's evaluation data shows improved recall across the full context length, though it degrades at the extremes.
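Before sending a large document, it is worth checking that it fits in the window at all. The sketch below uses the common rough heuristic of about four characters per token for English text; this is an assumption, not Anthropic's tokenizer, so real budgets should use an actual token count from the API.

```python
# Rough context-budget check for a 200K-token window.
# The 4-characters-per-token ratio is a coarse English-text heuristic,
# not an exact tokenizer.

CONTEXT_WINDOW = 200_000

def estimated_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    """Leave headroom for the model's response when budgeting the prompt."""
    return estimated_tokens(text) + reserved_for_output <= CONTEXT_WINDOW

doc = "word " * 150_000   # roughly 150,000 words, about a long novel
print(fits_in_context(doc))
# → True
```

Note the reserved output budget: a prompt that exactly fills the window leaves no room for the completion.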

Benchmark caveats

Claude 3 Opus posting higher MMLU scores than GPT-4 is meaningful, with caveats. MMLU tests knowledge recall and reasoning on academic topics. Real production performance depends on your specific workload. Coding benchmarks, instruction following, tool use reliability, and hallucination rates all vary between models in ways that matter differently depending on what you are building. Single benchmark headlines are a starting point, not a deployment decision.

The safety-capability tradeoff

Anthropic's Constitutional AI approach means Claude 3 models are trained with explicit values and refusal behaviours built in. They are less likely to help with genuinely harmful tasks and more likely to push back on edge cases than some competing models. For enterprise use cases, that predictability is often a feature. For developers trying to push models into novel applications, it occasionally creates friction that requires prompt engineering to work around.