Anthropic released Claude 3.5 Sonnet on June 20th and the developer reaction was immediate. It outperforms GPT-4o on several coding benchmarks while costing less per token. The Artifacts feature, where Claude can generate interactive code that runs in the chat interface, has changed how I use it for prototyping.
The benchmark results
On HumanEval, the standard coding benchmark, Claude 3.5 Sonnet scored 92%, against 90.2% for GPT-4o as reported by OpenAI. On Anthropic's internal agentic coding evaluation, 3.5 Sonnet solved 64% of real-world software engineering tasks versus 38% for its predecessor, Claude 3 Opus. That is a substantial jump within Anthropic's own model family.
Benchmark numbers are useful, but the real test is production use. For .NET and C# tasks, I have found 3.5 Sonnet more reliable at generating compilable code on the first attempt than previous versions. It also handles architectural questions with more nuance, distinguishing when a pattern is appropriate from when it is overkill for the use case.
Artifacts and the prototyping shift
The more interesting product change is Artifacts. When Claude generates a complete piece of code, an HTML page, a React component, or a document, it renders the result in a side panel where you can see it live, click around, and iterate. For prototyping UI components, data visualisations, or quick tools, the gap between an idea in your head and a running prototype you can click has collapsed significantly.
I use it for things like quickly mocking out a component structure to think through an API design before writing the actual implementation. The conversation-to-prototype loop is genuinely faster than opening an IDE for that kind of exploratory work.
The competitive landscape in mid-2024
The AI developer tools space in June 2024 is unusually competitive. GPT-4o from OpenAI, Gemini 1.5 Pro from Google, Claude 3.5 Sonnet from Anthropic, and Llama 3 from Meta all sit within meaningful range of each other on capability. The differentiators are now more about API reliability, cost, context window behaviour, and ecosystem integrations than raw benchmark performance.
For teams choosing a model for a production application, this is actually a healthy situation. No single vendor has a capability moat that locks you in. Abstraction layers like LangChain and LlamaIndex, and managed offerings like the Azure OpenAI Service, make switching feasible. Treat the model choice as a configuration decision, not an architecture constraint.
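To make "model choice as a configuration decision" concrete, here is a minimal Python sketch of the idea: the application depends on a narrow interface, and the concrete provider is selected by a config value. This is a hypothetical illustration, not any real SDK; EchoModel is a stand-in for a vendor client such as Anthropic's or OpenAI's, and the model IDs are assumptions you would set for your own deployment.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the application code is allowed to see."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Hypothetical stand-in for a real provider client."""

    model_name: str

    def complete(self, prompt: str) -> str:
        # A real implementation would call the vendor's API here.
        return f"[{self.model_name}] {prompt}"


# Registry keyed by the config value; real entries would wrap each
# vendor's SDK behind the same ChatModel interface.
PROVIDERS = {
    "anthropic": lambda: EchoModel("claude-3-5-sonnet-20240620"),
    "openai": lambda: EchoModel("gpt-4o"),
}


def get_model(config: dict) -> ChatModel:
    """Resolve the provider from configuration, not from call sites."""
    return PROVIDERS[config["provider"]]()


if __name__ == "__main__":
    model = get_model({"provider": "anthropic"})
    print(model.complete("Summarise this pull request"))
```

Because every call site depends only on ChatModel, swapping vendors means changing one config key, which is exactly the posture that keeps a competitive model market working in your favour.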