GPT-4 launched in March 2023 with benchmark results showing performance above human averages on the bar exam, AP exams, and SAT. The numbers were widely cited. The interpretation requires more care than the headlines suggested.

What the benchmarks tested

The headline bar exam result, roughly the 90th percentile of human test takers, was reported on a simulated Uniform Bar Exam, which tests legal knowledge and reasoning across multiple-choice, essay, and performance-test components. GPT-4 did not "pass the bar" in the sense of qualifying to practise law: it produced answers on a standardised test that scored above the passing threshold. This is a meaningful capability demonstration. It is not the same as legal expertise in practice.

The Goodhart's Law problem

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. AI models trained to perform well on specific benchmarks improve on those benchmarks, but that improvement does not always generalise to real-world capability. A model fine-tuned on examples similar to a benchmark's test items can score highly without having the underlying reasoning ability the benchmark is intended to measure. Interpreting a benchmark score therefore requires understanding what the benchmark actually tests and how closely its test distribution matches your actual use case.

What GPT-4 is actually better at

In practice, GPT-4's improvement over GPT-3.5 is most pronounced in complex multi-step reasoning, following nuanced instructions across long contexts, code generation for non-trivial tasks, and maintaining a consistent persona across a conversation. These improvements are real and matter to application builders. They are better captured by testing GPT-4 on your actual task than by reading the bar exam score.

The implication for enterprise evaluation

The right benchmark for your enterprise AI application is your actual task, evaluated against your actual quality criteria, on a sample of real inputs. Published benchmarks are useful for initial model selection. They are not a substitute for task-specific evaluation before production deployment.
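
A task-specific evaluation of this kind can be very small. The sketch below is illustrative only: `Example`, `pass_rate`, `contains_reference`, and `stub_model` are hypothetical names, the substring check stands in for whatever quality criteria your reviewers actually apply, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str     # a real input sampled from your application
    reference: str  # the answer your reviewers consider correct


def pass_rate(
    model: Callable[[str], str],
    examples: Sequence[Example],
    passes: Callable[[str, Example], bool],
) -> float:
    """Return the fraction of examples whose model output meets the criterion."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(1 for ex in examples if passes(model(ex.prompt), ex))
    return passed / len(examples)


def contains_reference(output: str, ex: Example) -> bool:
    # Placeholder criterion (an assumption): case-insensitive substring match.
    # Replace with your own quality check, e.g. a rubric or a human review.
    return ex.reference.lower() in output.lower()


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same answer.
    return "The notice period is 30 days."


# Made-up inputs for illustration; in practice, sample these from real traffic.
examples = [
    Example("What is the notice period in contract A?", "30 days"),
    Example("What is the notice period in contract B?", "60 days"),
]
score = pass_rate(stub_model, examples, contains_reference)
print(f"pass rate: {score:.0%}")
```

Passing the model as a plain callable keeps the harness independent of any vendor, so the same examples and criterion can be rerun against each candidate model during selection.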