Three years of GraphQL in production reveal the tradeoffs

When Facebook open‑sourced GraphQL in 2015, most of us thought it would stay a hobby project. By 2019 a handful of large firms were running it in production, and now we have three to four years of hard‑won experience to look back on.

What makes GraphQL stand out is that the client tells the server exactly which fields it wants, and the response contains only those fields, with no extra payload. In environments with web, mobile and third‑party clients that each need a different slice of the same data, that precision removes much of the need to maintain a zoo of REST endpoints.
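The field-selection idea can be sketched without any GraphQL library. In the snippet below, `FieldSelection`, `project` and the sample `product` record are hypothetical names chosen for illustration; a real server does this through its schema and resolvers rather than by filtering a finished object.

```typescript
// A selection maps each requested field to `true` (leaf) or a nested selection.
type FieldSelection = { [field: string]: true | FieldSelection };
type JsonRecord = { [key: string]: unknown };

// Return only the fields named in the selection, recursing into nested objects.
function project(source: JsonRecord, selection: FieldSelection): JsonRecord {
  const result: JsonRecord = {};
  for (const [field, sub] of Object.entries(selection)) {
    const value = source[field];
    if (sub === true || value === null || typeof value !== "object") {
      result[field] = value;
    } else {
      result[field] = project(value as JsonRecord, sub);
    }
  }
  return result;
}

// Hypothetical record as a backend might hold it.
const product = {
  id: "p1",
  title: "Kettle",
  description: "A long marketing description...",
  price: { amount: 25, currency: "EUR", taxRate: 0.2 },
};

// A mobile client asks only for the title and the price amount.
const mobileView = project(product, { title: true, price: { amount: true } });
console.log(JSON.stringify(mobileView)); // {"title":"Kettle","price":{"amount":25}}
```

A web client could pass a wider selection against the same record, which is the role several REST endpoints would otherwise play.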

The schema’s type system doubles as live documentation. Tools can read the schema through introspection and provide autocomplete, validation and even mock servers with very little extra effort from developers.

For example, GraphiQL reads the schema to offer autocomplete and inline documentation while we write queries, and libraries such as graphql-tools can build a mock server from the same schema, which cuts the setup needed for testing and development. Additionally, GraphQL Code Generator can produce type-safe client code for TypeScript, and Apollo iOS does the same for Swift, which helps catch errors at compile time rather than at runtime.

The most frequent performance surprise is the N+1 query problem. A query that asks for a list of items and then a nested field forces the resolver to fire a separate database call for every list element unless we intervene.
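To make the arithmetic concrete, here is a minimal sketch that counts queries for a naive resolution of a list with one nested field. The in-memory "database", `findPosts` and `findAuthor` are hypothetical stand-ins, not part of any library.

```typescript
// Hypothetical in-memory "database" that counts every query it receives.
let queryCount = 0;
const authorsById = new Map<number, string>([
  [1, "Ada"],
  [2, "Grace"],
  [3, "Edsger"],
]);
const posts = [
  { id: 10, authorId: 1 },
  { id: 11, authorId: 2 },
  { id: 12, authorId: 1 },
  { id: 13, authorId: 3 },
];

function findPosts(): { id: number; authorId: number }[] {
  queryCount += 1; // one query for the list
  return posts;
}

function findAuthor(id: number): string | undefined {
  queryCount += 1; // one query per call
  return authorsById.get(id);
}

// Naive resolution of a query shaped like `{ posts { id author } }`:
// resolve the list, then resolve the nested `author` field per element.
const resolved = findPosts().map((post) => ({
  id: post.id,
  author: findAuthor(post.authorId),
}));

console.log(resolved.length, queryCount); // 4 posts, 1 + 4 = 5 queries
```

With N list elements the count is N + 1, and each additional nested level multiplies it again.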

Facebook’s DataLoader library addresses that by batching the lookups issued while a query executes and caching them for the duration of the request. Every production GraphQL service I’ve seen bundles a DataLoader‑like component, and skipping it usually ends in database overload when traffic spikes. In one instance, a service handling around 500 requests per second was executing over 10,000 database queries per second without DataLoader, roughly 20 queries per request, which was not sustainable.
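The batching idea can be sketched in a few lines. This is not DataLoader's implementation or API: `TinyLoader`, its public `dispatch` method and the fake `authorLoader` are hypothetical, and the sketch leaves out the error-per-key handling and cache controls a production loader needs.

```typescript
// Batch function: receives all keys collected in one tick and must return
// values in the same order as the keys.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

class TinyLoader<K, V> {
  private batchFn: BatchFn<K, V>;
  private pending = new Map<K, Promise<V>>(); // per-request cache, also dedupes keys
  private queue: { key: K; resolve: (v: V) => void; reject: (e: unknown) => void }[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const cached = this.pending.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      // Schedule one dispatch when the first key of a batch arrives.
      if (this.queue.length === 0) Promise.resolve().then(() => this.dispatch());
      this.queue.push({ key, resolve, reject });
    });
    this.pending.set(key, promise);
    return promise;
  }

  // Send every queued key to the batch function in a single call.
  dispatch(): void {
    const batch = this.queue;
    this.queue = [];
    if (batch.length === 0) return;
    this.batchFn(batch.map((item) => item.key)).then(
      (values) => batch.forEach((item, i) => item.resolve(values[i])),
      (error) => batch.forEach((item) => item.reject(error)),
    );
  }
}

const batchCalls: number[][] = [];
const authorLoader = new TinyLoader<number, string>(async (ids) => {
  batchCalls.push(ids); // stands in for one `WHERE id IN (...)` query
  return ids.map((id) => `author-${id}`);
});

// Four nested-field resolutions in the same tick, two of them for the same key.
const loads = [1, 2, 1, 3].map((id) => authorLoader.load(id));
Promise.all(loads).then((authors) => console.log(authors, batchCalls.length));
```

The four `load` calls end up in a single batch for keys 1, 2 and 3, so the database sees one query instead of four.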

In another case, putting a caching layer such as Redis behind DataLoader cut the load on the database by around 70%, which noticeably improved the responsiveness of the service. It also introduced complexity: we had to manage cache invalidation and handle cache misses, and that extra work added around 10% to our overall latency.
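The shape of that caching layer is roughly a cache-aside read with a time-to-live and explicit invalidation. The sketch below uses an in-memory `Map` as a stand-in for Redis and synchronous calls for brevity; `TtlCache`, `getUser`, `readUserFromDb` and the fake clock are all hypothetical.

```typescript
// In-memory stand-in for a shared cache such as Redis; entries expire after a TTL.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) return undefined; // miss or expired
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Explicit invalidation, to be called when the underlying row changes.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

let dbReads = 0;
function readUserFromDb(id: string): string {
  dbReads += 1;
  return `user-${id}`;
}

let clock = 0; // fake clock so the example is deterministic
const userCache = new TtlCache<string>(1000, () => clock);

// Cache-aside read: try the cache, fall back to the database, then populate.
function getUser(id: string): string {
  const hit = userCache.get(id);
  if (hit !== undefined) return hit;
  const value = readUserFromDb(id);
  userCache.set(id, value);
  return value;
}

getUser("42"); // miss: reads the database
getUser("42"); // hit
clock = 1500;
getUser("42"); // expired: reads the database again
userCache.invalidate("42");
getUser("42"); // invalidated: reads the database again
console.log(dbReads); // 3
```

Every miss pays for both the cache lookup and the database read, which is where the added latency on misses comes from.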

Because a client can request arbitrarily deep graphs, an unchecked query can explode into deeply nested joins that hammer the database. That attack surface is unlike REST, where each endpoint has a fixed shape. To mitigate this, we can use a query complexity analysis library such as graphql-query-complexity to estimate the cost of incoming queries and reject the expensive ones before they execute, which helps prevent denial-of-service attacks.
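A depth check is the simplest of these guards. The sketch below works on a simplified query tree rather than a parsed GraphQL AST; `QueryField`, `queryDepth` and `assertDepthAllowed` are hypothetical names, and a real server would run the equivalent check as a validation rule.

```typescript
// Simplified query tree: each node is a field with optional child selections.
interface QueryField {
  field: string;
  children?: QueryField[];
}

// Depth of a selection: a leaf field counts as 1.
function queryDepth(node: QueryField): number {
  let deepest = 0;
  for (const child of node.children ?? []) {
    deepest = Math.max(deepest, queryDepth(child));
  }
  return 1 + deepest;
}

// Reject the query before execution if it nests deeper than the limit.
function assertDepthAllowed(root: QueryField, maxDepth: number): void {
  const depth = queryDepth(root);
  if (depth > maxDepth) {
    throw new Error(`Query depth ${depth} exceeds limit ${maxDepth}`);
  }
}

// user -> friends -> friends -> friends -> name
const nested: QueryField = {
  field: "user",
  children: [
    {
      field: "friends",
      children: [
        {
          field: "friends",
          children: [{ field: "friends", children: [{ field: "name" }] }],
        },
      ],
    },
  ],
};

console.log(queryDepth(nested)); // 5
```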

In practice we cap query depth at somewhere between five and ten levels, assign a cost to each field and reject anything that exceeds a threshold, and we still apply traditional rate limiting. Those guards keep the API usable under load. For instance, a library like graphql-rate-limit can limit the number of requests from a single IP address. Alternatively, a managed service like AWS AppSync can take over caching and, typically in combination with AWS WAF, rate limiting, which simplifies the overall implementation.
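Field costing can be sketched in the same style. The cost table, the `listSize` hint and the threshold below are made-up values for illustration, and the formula is one common choice rather than the behaviour of any particular library.

```typescript
interface CostedField {
  field: string;
  listSize?: number; // expected number of items if the field returns a list
  children?: CostedField[];
}

// Hypothetical per-field costs; unlisted fields cost 1.
const fieldCosts: Record<string, number> = { search: 10, reviews: 5 };

// Cost of a field = own cost + (list size x total cost of its children).
function queryCost(node: CostedField): number {
  const own = fieldCosts[node.field] ?? 1;
  const childCost = (node.children ?? []).reduce(
    (sum, child) => sum + queryCost(child),
    0,
  );
  return own + (node.listSize ?? 1) * childCost;
}

const MAX_COST = 500; // assumed threshold

function isAllowed(root: CostedField): boolean {
  return queryCost(root) <= MAX_COST;
}

// product { title price } -> 1 + 1 * (1 + 1) = 3
const cheap: CostedField = {
  field: "product",
  children: [{ field: "title" }, { field: "price" }],
};

// reviews: 5 + 20 * 1 = 25; search: 10 + 50 * (1 + 25) = 1310
const expensive: CostedField = {
  field: "search",
  listSize: 50,
  children: [
    { field: "title" },
    { field: "reviews", listSize: 20, children: [{ field: "body" }] },
  ],
};

console.log(queryCost(cheap), isAllowed(cheap)); // 3 true
console.log(queryCost(expensive), isAllowed(expensive)); // 1310 false
```

Multiplying by the list size is what makes nested lists expensive: the second query is only three levels deep, yet it costs several hundred times more than the first.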

Apollo Federation gave us a way to split a monolithic schema across several services. Each team publishes its own subgraph, and a gateway composes them into a single schema and routes each field to the owning service at runtime. This approach has been particularly useful in large organisations, where different teams can work on different parts of the schema largely independently. However, it also introduces additional complexity: we have to manage the relationships between the subgraphs and handle errors that occur when the schemas are composed or when a subgraph fails at query time.
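The routing step can be pictured as a lookup from field to owning service. The sketch below is a deliberately simplified illustration, not Apollo's query planner: the `fieldOwners` table, the service names and `planFetches` are hypothetical, and real federation also has to resolve entities across subgraphs.

```typescript
// Hypothetical ownership table: which subgraph resolves which field.
const fieldOwners: Record<string, string> = {
  "Product.title": "catalog",
  "Product.description": "catalog",
  "Product.price": "pricing",
  "Product.reviews": "reviews",
};

// Group the requested fields by owning service, so that the gateway can send
// one sub-request per service instead of one per field.
function planFetches(typeName: string, fields: string[]): Map<string, string[]> {
  const plan = new Map<string, string[]>();
  for (const fieldName of fields) {
    const owner = fieldOwners[`${typeName}.${fieldName}`];
    if (owner === undefined) {
      throw new Error(`No subgraph owns ${typeName}.${fieldName}`);
    }
    const group = plan.get(owner) ?? [];
    group.push(fieldName);
    plan.set(owner, group);
  }
  return plan;
}

const fetchPlan = planFetches("Product", ["title", "description", "price"]);
console.log(JSON.stringify(Array.from(fetchPlan.entries())));
// [["catalog",["title","description"]],["pricing",["price"]]]
```

A field with no owner is exactly the kind of error that should surface when the schemas are composed rather than when a client sends a query.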

Running a federated gateway adds operational overhead, because you have to monitor schema changes, version compatibility and network latency, but the pattern scales nicely across large organisations. For example, we can use a tool like Apollo Studio to monitor and manage our federated gateway. It provides a schema registry, schema checks and performance monitoring, which together give a unified view of the federated graph.
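Monitoring schema changes mostly means catching breaking changes before they ship. The sketch below compares two simplified schema descriptions; `SchemaShape`, `breakingChanges` and the sample schemas are hypothetical, and the check is deliberately conservative in treating every type change as breaking, which a real schema-check tool would refine.

```typescript
// A schema reduced to: type name -> field name -> field type (as a string).
type SchemaShape = Record<string, Record<string, string>>;

// Report changes that could break existing clients: removed types, removed
// fields and fields whose type changed. Added types and fields are safe.
function breakingChanges(before: SchemaShape, after: SchemaShape): string[] {
  const problems: string[] = [];
  for (const [typeName, fields] of Object.entries(before)) {
    const newFields = after[typeName];
    if (newFields === undefined) {
      problems.push(`type ${typeName} removed`);
      continue;
    }
    for (const [fieldName, fieldType] of Object.entries(fields)) {
      const newType = newFields[fieldName];
      if (newType === undefined) {
        problems.push(`field ${typeName}.${fieldName} removed`);
      } else if (newType !== fieldType) {
        problems.push(
          `field ${typeName}.${fieldName} changed from ${fieldType} to ${newType}`,
        );
      }
    }
  }
  return problems;
}

const publishedSchema: SchemaShape = {
  Product: { id: "ID!", title: "String!", sku: "String" },
};
const proposedSchema: SchemaShape = {
  Product: { id: "ID!", title: "String", description: "String" },
};

console.log(breakingChanges(publishedSchema, proposedSchema));
// [ 'field Product.title changed from String! to String',
//   'field Product.sku removed' ]
```

Running a check like this in CI for every subgraph is one way to keep the gateway from composing a schema that existing clients can no longer query.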

When implementing a federated gateway, it's also important to weigh the deployment options. We can run a single gateway that handles all requests, or take a distributed approach in which each service has its own gateway. The single gateway is simpler to implement but can become a bottleneck as traffic grows, while the distributed approach is more complex but offers better scalability and fault tolerance.