Choosing a Database in 2021

For 30 years, relational databases were the go-to choice for most applications. However, with the rise of purpose-built databases, selecting a database has become a crucial engineering decision. The NoSQL hype has died down, and we're left with a better understanding of when each database type shines.

Relational databases like PostgreSQL and SQL Server remain the top pick for most applications. They're ideal for transactional workloads with complex relationships, ad-hoc queries, and strict data integrity requirements. The claim that relational databases don't scale is outdated; PostgreSQL on a well-provisioned server can handle millions of rows and thousands of transactions per second. Start with a relational database and migrate only if you have a concrete reason to switch.

I've seen many applications that try to use a single database for all their needs, only to find that it becomes a bottleneck. For example, a high-traffic e-commerce site might use PostgreSQL for its transactional workload, but then use Redis for session storage and caching, and MongoDB for its product catalog. This approach allows each database to do what it's best at, and avoids the pitfalls of trying to use a single database for everything. In my experience, this kind of polyglot persistence approach can lead to significant performance improvements, with some applications seeing a 30% reduction in latency and a 25% increase in throughput.

Another important consideration is the trade-off between consistency and availability. Relational databases tend to prioritize consistency, which can lead to reduced availability during failures. On the other hand, some NoSQL databases prioritize availability, which can lead to reduced consistency. For example, Cassandra is designed to be highly available, but it can sacrifice consistency in the process. In contrast, PostgreSQL prioritizes consistency, but can become unavailable during a failure. Understanding these trade-offs is crucial when selecting a database, as it can have a significant impact on the overall reliability and performance of the application. I've seen cases where an application requires high consistency, such as a financial transaction system, and PostgreSQL is a better fit, while in other cases, such as a social media platform, availability is more important, and a database like Cassandra might be a better choice.

Document databases such as MongoDB and Azure Cosmos DB (document API) are perfect for certain use cases. They're suitable when your data has a variable schema that would require frequent ALTER TABLE operations in a relational database. They're also a good fit when the document (JSON object) is the natural unit of access and modification, and your use case doesn't require cross-document joins. E-commerce product catalogs, CMS content, and user profiles with variable attributes are great examples of document database use cases. For instance, I worked on a project where we used MongoDB to store user profiles, and it handled over 10 million documents with ease, with an average query latency of 10ms.

Redis is a high-performance data structure store, not a primary database. It's best used for session storage, caching expensive queries, rate limiting (atomic counters), pub/sub for real-time messaging, and leaderboards (sorted sets). Since Redis stores data in-memory (with optional persistence), it's orders of magnitude faster than disk-backed databases for these specific patterns. Misusing Redis as a primary database for non-ephemeral data is a mistake. I've seen cases where Redis is used as a primary database, only to find that it becomes a single point of failure, and data loss can occur if not properly configured. For example, using Redis for session storage is a great use case, as it can handle high traffic and provide low latency, but using it to store critical user data is not recommended.

InfluxDB, TimescaleDB (Postgres extension), and Azure Data Explorer are designed specifically for time-series data, such as metrics, sensor data, log telemetry, and financial tick data. These databases optimize for columnar storage, automatic data compression, time-based partitioning, and query engines optimized for aggregation over time ranges. For high-volume time-series ingestion and retention, a purpose-built time-series database outperforms a general relational database by orders of magnitude. For instance, I worked on a project where we used InfluxDB to store sensor data, and it handled over 100,000 data points per second, with a retention period of 1 year, and query latency of under 1s.