The art and science of scaling out a transactional relational database
Why are we talking about this?
Relational transactional database systems provide rich data functionality and strong transactional guarantees, and underpin many of the applications we count on every day.
First appearing in the 1970s, the architecture of disaggregating data from the application layer, coupled with a model for representing data and the relationships between data, has paid huge dividends. It is an architecture that has continuously evolved, but also stood the test of time. The fact that the architecture has been re-invented and re-discovered independently over decades, with new application patterns (client-server, event-driven, reactive, etc.) and use cases, points to something we will consider below.
Each time the industry “loosened” the requirements of the database layer regarding transactional Atomicity, Consistency, Isolation, and Durability (ACID) and the rich relational database model, the requirements have crept back in.
The fact that the requirements keep reasserting themselves is due to the real pressures and needs of applications and developers, not to vendor marketing and positioning.
The majority of deployed database systems remain single-node, non-distributed systems (PostgreSQL, MySQL, Oracle, and other variants). Think about that for a moment, because it’s problematic.
Why is it problematic? Demand for capacity, performance and data processing has grown far beyond the capacity, throughput, computational and memory resources that single-node database systems can provide. Even where scaling up is an option, the cost of exotic hardware becomes prohibitive. Often even the basic requirement of network client connections to a single database server becomes a bottleneck.
Almost every part of an application scales out – except for the database layer. So – what’s the scoop? If the need is apparent, why is there a disconnect on the ground?
The short answer is that building a distributed transactional relational system that scales well is exceptionally difficult. Algorithms that were designed and implemented to work on a single node system often cannot be adapted for a distributed system, which requires a whole new architecture and set of distributed algorithms designed and implemented from scratch.
In the remainder of this post we will cover some work-around solutions to the database scale-out problem and explain their drawbacks and limitations. Then we will spell out what we want from a true scale-out transactional relational database – OLxP – and what it takes to build one.
Existing workarounds are… workarounds.
Sharding
The traditional and most ubiquitous way of satisfying the need for a non-scale-out database to scale past a single node is database sharding: split the data into multiple independent sets of data (AKA shards) held by stand-alone single-node database systems. Those single-node database systems run as siloed stand-alone systems that are not aware of one another. The advantage is that each shard is a well-known, dependable transactional system, but the approach has some profound drawbacks. While public threads aren’t authoritative, they are informative. I don’t recommend starting here, because it’s confusing, messy and in some cases flat out wrong, but if you want a collection of different views, you can read this long thread on Y Combinator’s Hacker News. The first post is the question – but the long thread of answers contains many voices and many experiences with the challenges of sharding (and even some interesting comments on the challenges people have found with some of the existing alternatives).
When forced to exceed the performance and capacity of a single-node database system, sharding can be used to break those limits. Sharding lets you use the mature single-node database systems you know, and it can even be performant in specific scenarios (those which align exactly with the sharding logic and can dispense with ACID transactional properties). But sharding comes with heavy disadvantages, which I itemize below.
- Siloed nodes. Database functionality is limited to the scope of a single shard, and there is no cross-shard database functionality. Transactional properties are maintained by each shard only for the data it holds and the transactions running within that shard. Query processing is completely localized to a shard.
- Increased complexity and cost for database-wide queries. The application has to be aware of how the data is distributed across shards. When a query involves data that spans multiple shards, some queries are flat-out impossible; others are possible only with additional development effort and a significant performance penalty. In addition, transactional properties are maintained for each shard separately, so the database layer provides no cross-shard transactional guarantees.
- Inherently imbalanced and inefficient. Choosing the sharding key, and dealing with all of the downstream implications, requires careful and complex planning and configuration. Check out this question on Stack Overflow and you start to get the idea, even though it’s not the most complex example. Mistakes are very costly, and we rarely have all the right information up front. Even if one “gets it right the first time” – what if things change?
- Bottlenecked and limited performance. This can manifest as heavy query processing and/or many transactions landing on a single shard while the other shards sit idle. Another trigger is a lengthy operation that cannot be broken down and spread across the shards; until that operation completes, the other shards are idle. In essence, the overall resources of the sharded database are not utilized efficiently, creating both performance and cost issues: either you are spending more money than you should, and/or performance isn’t as good as it should be on a given infrastructure footprint.
- Rigid configuration. Redistributing a database by re-sharding is a very expensive and disruptive operation. It is a complex operational task done offline, via a variety of replication approaches, or at the application layer. In all of those cases it is complex, risky, and of course places an incremental load on the database and the platform teams on top of the work they were already doing. And as much operational difficulty as redistributing shards causes, that complexity can be dwarfed by the effort of rewriting application code to align with the new sharding logic, with all the associated bugs, testing and QA. This can be many years of development work for no other functional or business purpose – simply working around the database layer.
Let’s look at an example that illustrates these issues in a practical way:
Let’s start with a database for a given business that has a schema consisting of tables of customers, items, sales, warehouses and inventories (among others).
- Each sale represents the purchase of a certain item by a certain customer. It references the customer and the item that was sold via foreign keys.
- For each customer, the database holds details such as the city where they live (among other attributes).
- For each item, we hold the inventory of the item in each warehouse – in the inventory table that references the warehouse_id and the item_id.
Since future trends in sales data are very unpredictable, in this example one would choose a safe and simple sharding configuration: each table is sharded by hashing its primary key (the ‘id’ field of each table). That gives the most evenly balanced distribution of data and load among the shards.
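To make that configuration concrete, here is a minimal sketch of what hash-based shard routing might look like in the application layer. The shard count, the `shard_for` helper and the table names are hypothetical and purely illustrative – real sharding layers differ, but the core idea (hash the primary key, take it modulo the number of shards) is the same.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical, fixed at configuration time


def shard_for(table: str, key: int) -> int:
    """Route a row to a shard by hashing its primary key."""
    digest = hashlib.sha256(f"{table}:{key}".encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Every table is routed independently by its own primary key, so related rows
# (a sale and the customer it references) usually land on different shards.
print(shard_for("customer", 42))   # e.g. shard 1
print(shard_for("sale", 1001))     # e.g. shard 3
```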
As simple and universal as this example may seem, let’s examine how sharding this database will result in the loss of transactional properties even for the simplest of transactions. If you don’t want to walk through the example, feel free to jump to section 2.2.
Consider a simple business transaction in which a certain customer buys 2 items. This transaction needs to perform the following updates:
- Insert two rows into the sale table, one for each item.
- Update two rows in the inventory table, for the warehouses from which the items are supplied.
These 4 rows could reside in 4 different shards. Therefore, at the database layer they are performed using 4 separate transactions. If anything goes wrong and any of the shards fails, or if the application fails in the middle, the atomicity guarantee is broken and recovery could be very complicated. In a single-node database this would have been a single transaction composed of multiple SQL statements with atomicity guaranteed and error recovery made very simple.
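The sketch below shows what the application is forced into, assuming one connection per shard (the `shard_conns` mapping, the sqlite-style placeholders, and the column names such as `num_units` and `quantity` are assumptions for illustration). Each commit is local to one shard, so there is no single atomic commit covering all four rows:

```python
def record_purchase(shard_conns, shard_for, sale_rows, inventory_updates):
    """Apply one business purchase as several independent per-shard transactions.

    shard_conns: dict of shard id -> DB-API connection (hypothetical).
    Each commit() below is local to one shard; if a later step fails,
    earlier shards have already committed and atomicity is lost.
    """
    for sale in sale_rows:
        conn = shard_conns[shard_for("sale", sale["id"])]
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO sale (id, customer_id, item_id, num_units) VALUES (?, ?, ?, ?)",
            (sale["id"], sale["customer_id"], sale["item_id"], sale["num_units"]),
        )
        conn.commit()  # shard-local transaction

    for upd in inventory_updates:
        conn = shard_conns[shard_for("inventory", upd["id"])]
        cur = conn.cursor()
        cur.execute(
            "UPDATE inventory SET quantity = quantity - ? WHERE warehouse_id = ? AND item_id = ?",
            (upd["num_units"], upd["warehouse_id"], upd["item_id"]),
        )
        conn.commit()  # yet another independent shard-local transaction
```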
Moreover, queries performed over the database (e.g. scanning inventory level or sales data) are not isolated from this business transaction, in the sense that they might see some of the updates and miss others while this business transaction is in progress. That’s because we decoupled the database transactions from the business transactions. Implementing the transactional behavior that has been lost in the database layer in the application layer is complex, expensive and could result in severe performance and reliability issues.
Now consider how sharding makes even a simple analytical query expensive at the database layer, and in many cases flat-out impossible.
Let’s examine a case where we want to find the number of sales made from each city. On a non-sharded database, this is a simple query:
SELECT customer.city, COUNT(sale.id)
FROM customer JOIN sale ON customer.id = sale.customer_id
GROUP BY customer.city
This simple query cannot be run on the sharded database, because it has to match rows of the two tables that reside in different shards. It requires fetching all the sales and all the customers from all the shards into the application server, and performing the join there locally.
Now, this query would have been simpler and cheaper on the sharded database had the sale table been sharded by a hash of the customer_id. That would make each sale record collocated (in the same shard) with the customer that made the sale, and in that case we could run the above SQL query on each shard and combine the results. That “combining” also requires some programming (summing up the COUNT(sale.id) result from all the shards for each city) but it is feasible.
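As a sketch of that “combining” step, assuming sales are collocated with their customers and each shard exposes a DB-API connection (both are assumptions for illustration), the application has to scatter the query to every shard and merge the partial counts itself:

```python
from collections import Counter

PER_SHARD_QUERY = """
    SELECT customer.city, COUNT(sale.id)
    FROM customer JOIN sale ON customer.id = sale.customer_id
    GROUP BY customer.city
"""


def sales_per_city(shard_conns):
    """Run the query on every shard, then gather and sum the partial counts."""
    totals = Counter()
    for conn in shard_conns.values():
        cur = conn.cursor()
        cur.execute(PER_SHARD_QUERY)
        for city, count in cur.fetchall():
            totals[city] += count  # merge partial counts in the application
    return dict(totals)
```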
The problem, however, is that we cannot predict what queries are going to be in popular demand when we create the database and configure the sharding.
And that’s the smaller problem. The bigger problem is that more likely than not, there will be other queries in demand that can be optimized only with a different sharding configuration, so we cannot help one case without hurting the other.
Moreover, for some queries no sharding configuration helps. For example, we may want to find for each item, the revenue earned from its sales in each city. On a non-sharded database that would look like this:
SELECT customer.city, item.id, SUM(sale.num_units * item.price)
FROM customer
JOIN sale ON customer.id = sale.customer_id
JOIN item ON sale.item_id = item.id
GROUP BY customer.city, item.id
Since items may be sold to customers in any city, we cannot collocate matching items and sales in the same shard. If the sales are collocated with the customers that make them, we can join the sales with the customers in the database layer and combine the results from all the shards (as described above). However, this combined result (which could be huge) then needs to be joined with the items table in the application server.
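Under the same collocation assumption, the best the application can do is gather per-shard (city, item_id, units) partial sums and join them with the item table in its own memory. A minimal, purely illustrative sketch (column names and connections are assumptions, as above):

```python
from collections import defaultdict

PER_SHARD_UNITS = """
    SELECT customer.city, sale.item_id, SUM(sale.num_units)
    FROM customer JOIN sale ON customer.id = sale.customer_id
    GROUP BY customer.city, sale.item_id
"""


def revenue_per_city_and_item(shard_conns):
    """Join per-shard partial sums with the item table inside the application server."""
    # 1. Gather item prices (the item table is itself spread across the shards).
    price = {}
    for conn in shard_conns.values():
        cur = conn.cursor()
        cur.execute("SELECT id, price FROM item")
        price.update(dict(cur.fetchall()))

    # 2. Gather partial unit counts and multiply by price in application memory.
    #    The intermediate result can be huge: roughly |cities| x |items| rows.
    revenue = defaultdict(float)
    for conn in shard_conns.values():
        cur = conn.cursor()
        cur.execute(PER_SHARD_UNITS)
        for city, item_id, units in cur.fetchall():
            revenue[(city, item_id)] += units * price[item_id]
    return dict(revenue)
```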
These large table joins are typically impossible to perform in a single server in reasonable time due to insufficient CPU and memory. After all, if they were feasible, they could have been done in a single-node database and we wouldn’t have needed sharding to begin with.
There have been noble attempts to try to “hide” internal sharding complexity and rebalancing. These have had varying degrees of success and different branding/marketing terms. They do not change the nature of sharded systems that we’ve discussed. These are not scale-out distributed databases. The core issues can only be resolved by rearchitecting the fundamental layers of the database itself, not by layering an abstraction on top.
Public sharing of learning experiences around the topic is instructive. This analysis of the Foursquare MongoDB outage (the original post-mortem seems to have been taken down – and while it happened in 2010, the architectural points still hold true to this day) is a great independent view, and echoes many of the points in this post.
Another third-party example that helps wrap your head around the complexities is vendor documentation. There is extensive documentation from Oracle on sharding (and their mechanisms for routing, query execution, etc.) here. It is great in the sense that it’s exhaustive – but it’s also great at reinforcing all of the issues we’ve discussed in this section.
Multiple read-only replicas
Another solution used to cope with the scale-out problem, which helps with some use cases, involves replicating a single-node database and creating N-1 read-only replicas.
With this type of database system, the application can run non-modifying queries (read-only analytical data warehouse type operations) in parallel via the read-only replicas. All modifying operations must continue to go through the primary (single-node) database and be replicated to the N-1 read-only databases.
Regardless of marketing/branding/positioning, these are not “distributed” or “scale-out” systems. Consider:
- No single transactional query can be made faster by running in a distributed manner and utilizing the resources of multiple nodes. Fundamentally, when it comes to transaction processing, this is a single-node database system.
- The single primary node incurs a penalty. Every write must be replicated to each replica node, which lowers the net system efficiency, adds resource consumption (memory/CPU/network – in addition to the persistence layer) on the primary node, and, if you want complete system-wide consistency, will definitionally incur the latency of those N-1 replicas.
- If one tries to use the read-only nodes to “scale up” the system capacity for transactional throughput by “offloading” some of the workload, it is important to know that the read-only replicas may not obey strict serializability. Updates that have not yet been replicated can produce stale reads and non-repeatable reads due to replica lag if you assume you can distribute transactional reads across replicas (a minimal sketch of this hazard follows below). In this documentation and this PoC guide on AWS Aurora (a relatively modern example of a database with a single writer and multiple read replicas) they discuss this behavior, set an expectation of 10-20 ms of replica lag, and note that the replicas are really for analytics that are time-decoupled from the transactional workload.
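A minimal sketch of that hazard, assuming a hypothetical `primary` connection, a pool of `replicas`, and a `pick_replica` routing helper (all illustrative): a read issued immediately after a committed write may be served by a replica that has not yet applied that write.

```python
def decrement_and_check(primary, replicas, item_id, pick_replica):
    """Write to the primary, then read from a replica: the read can be stale."""
    cur = primary.cursor()
    cur.execute("UPDATE inventory SET quantity = quantity - 1 WHERE item_id = ?", (item_id,))
    primary.commit()  # committed on the primary

    # The replica chosen here may lag the primary by tens of milliseconds,
    # so this read can return the pre-update quantity (a stale read),
    # and two reads routed to different replicas may disagree with each other.
    replica = pick_replica(replicas)
    cur = replica.cursor()
    cur.execute("SELECT quantity FROM inventory WHERE item_id = ?", (item_id,))
    return cur.fetchone()
```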
Non-relational data model and non-transactional concurrency control models
Often when discussing database systems and the topics of performance, throughput, capacity limits and scale-out architectures, people think of the many innovative database systems that have emerged over recent years with different (non-relational) data models and relaxed transactional semantics. Since those databases are NOT transactional relational systems, a deeper discussion is out of scope here, and we’ll cover them in detail in a future post.
Why is building a true scale-out transactional, relational database hard?
In contrast to all the workarounds we’ve discussed in section 2, a true scale-out transactional relational database is a cluster of nodes working together on a single set of schemas (tables etc.), providing the rich semantics of relational database operations across the whole cluster with ACID guarantees.
In a true scale-out transactional relational database these functionalities (e.g., transactional semantics and query operations) are fully supported regardless of how the data is distributed across the cluster nodes.
In a true scale-out transactional relational database, the abstraction between the application and database layer is not broken: The application is agnostic to the internal data distribution within the database cluster.
The emergence of “NewSQL” databases in the 2010s was the first wave of attempts to tackle this problem.
The following subsections describe what it takes to do that and why it is very hard to implement.
Maintaining strong transactional properties, at scale, without compromising performance and efficiency
Relational transactional properties allow developers to run parallel transactions against the database while relying on the database to maintain ACID properties and serve as the source of truth. That lets the application developer focus on the business logic without having to develop sophisticated synchronization mechanisms or compromise performance.
Most strong transactional properties are practically impossible to achieve at the application level – as the relationships are all in the database layer itself. If you can’t solve it there, solving it higher up in the stack is going to be even harder and less efficient.
It’s non-trivial to achieve ACID guarantees within a single database node. And if doing this well in a single-node system is hard, it’s rocket science to achieve ACID guarantees in a distributed system with reasonable performance.
It’s also widely assumed that delivering these behaviors in a distributed system is not only hard, but comes with an unavoidable penalty on overall performance, scalability, and efficiency.
The reasons for this are deep and complex, and will be covered in more detail in an upcoming blog focused on Regatta’s concurrency control and consensus mechanisms.
Now, I can’t just say “this is hard” without providing some context. A warning, though – people do PhDs on this topic and invest a lifetime of work on it. If you aren’t nodding your head already and want to dive in, here are some solid resources:
- Here’s a really good Carnegie Mellon University Computer Science course on databases that they publish online (thank you Andy Pavlo and Jignesh Patel for publishing these). To give you a peek at the complexity of this topic, the relevant classes are about 3 hours of lectures themselves, and you really need to have all the other classes and labs under your belt to grok the topic.
- Starting from the beginning is best, but if you want to jump to the content focused on concurrency control protocols in non-distributed systems, the 15th class is on this topic.
- The 21st class here and the 22nd class here start to discuss the various complexities of solving the technical challenges of concurrency control and ACID implementations in distributed systems.
A good and widely used solution in some of the NewSQL systems is a consensus protocol like Raft (which is not used in Regatta for a number of reasons we will discuss later); you can read more about it here, which is useful grounding as we share more about the Regatta Concurrency Control Protocol in a subsequent blog post.
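For intuition only, here is a toy sketch of the majority-quorum idea at the heart of Raft-style replicated logs: a leader considers an entry committed once a majority of nodes have appended it. This is a drastic simplification (no terms, elections, or log repair) and is not how Regatta works; it only illustrates why every commit pays at least one round of cross-node coordination.

```python
class ToyReplicatedLog:
    """Leader-side sketch of majority-quorum commit (illustrative only)."""

    def __init__(self, follower_logs):
        self.log = []                    # leader's local log
        self.followers = follower_logs   # list of follower logs (plain lists here)

    def append(self, entry) -> bool:
        self.log.append(entry)
        acks = 1  # the leader counts as one vote
        for follower in self.followers:
            if self._replicate(follower, entry):
                acks += 1
        # Committed only once a majority of the cluster holds the entry.
        cluster_size = len(self.followers) + 1
        return acks > cluster_size // 2

    def _replicate(self, follower, entry) -> bool:
        follower.append(entry)  # in reality: an RPC that can fail or time out
        return True
```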
Performance of distributed transactional and analytical processing at scale
A linearly scalable distributed transactional relational SQL system needs to perform a high number of concurrent transactions and large computation tasks. The expectation people have of scale-out systems of all types is that the more nodes you add, the more performance you get, and the increase doesn’t diminish with each added node. This means that database operations run faster (especially apparent for lengthier operations) and more operations can be performed in a given period of time.
Despite this expectation and market need of linear scaling, it remains elusive.
So far, we’ve shared early scaling test data in demonstration videos here and here.
Processing queries on large data sets in a distributed manner at large scale efficiently introduces hard challenges:
- Large data sets, including intermediate results, need to be stored and propagated between sets of nodes in a distributed manner, efficiently. Achieving this with high performance requires sophisticated parallel/distributed algorithms.
- Sophisticated mechanisms are required to adapt to workload demands dynamically and optimize resource utilization globally, avoiding bottlenecks and deadlocks across the cluster of nodes performing distributed transactions. Often a simplification in one part of transaction processing in the distributed cluster has one or more direct trade-offs. An example to consider is finding the right nodes and right number of nodes to do a computation – do you aim to have it all on one node, and what’s the relative efficiency of getting the information to that one node versus running it on ten nodes, twenty nodes? How does one pick the right nodes to run it on, balancing proximity, parallelism, resource utilization, and system constraints?
- For all the distributed transactional relational databases that are true scale-out implementations, contention between a) fast transactions and b) long, complex analytical queries creates particular strain on their locking mechanics. This tends to materially compromise either the performance of transactional operations on the data or the strong transactional/ACID guarantees, and causes them to be suitable for only one of the two types of operation.
- Internal database operations that are commonly overlooked (for example snapshot reclamation, schema updates, index updates/rebalancing) are challenging even in non-scale-out systems. Those same operations are orders of magnitude more complex in scale-out implementations and can have a material impact on overall database performance and scalability.
Scale-Out != Elasticity
“Scale-out” as an adjective is completely distinct from “Elastic” as an adjective in describing systems.
Consider a scale-out system that is not elastic – one where the number of system nodes is predetermined at configuration time and cannot be changed without data disruption. The utility of that system would be much lower.
- With a scale-out system, being able to respond to system-level changes is more important, as the system by definition carries a larger total workload than a single-node system, and therefore having the wrong configuration has an even greater impact.
- At the same time, implementing elasticity is more challenging with larger scale-out systems, as there is more total workload, more data, and more complex trade-off considerations.
Scale-out clustered database systems should be able to dynamically and non-disruptively handle:
- Database configuration changes (e.g., schema changes)
- System/cluster configuration changes (e.g., adding nodes, storage devices)
- Changes in capacity and/or performance that trigger reconfiguration of resources within the cluster.
- For example – one of the applications that runs on the cluster temporarily needs more performance. In that case, the database can reallocate resources (e.g. disks) within that cluster to that application, at the expense of resources (and related performance) of the remaining applications on the same cluster.
- Capacity utilization changes across the cluster’s storage devices.
This is a complex challenge because:
- It requires non-disruptive re-balancing of data stored across the system’s storage devices at a large scale. This needs to be performed in an efficient manner and with minimal resource consumption and minimal performance impact. This is worth reinforcing – any system change that has a severe system-level impact during normal transactional operations (requiring maintenance window planning) is completely at odds with the goals of a scale-out database system. Implementing this well is closely related to concurrency, consensus and data distribution.
- The internal distributed algorithms have to be flexible and adapt non-disruptively to those dynamic changes in where data is stored. Think of it with an example: it is completely possible that while a transaction is executing, data that is read or written could also be in the process of being moved for rebalancing. The database system would have to move data around while transactions are querying it and updating it, without disrupting them, abandoning the transactions (triggering retries which can increase system load and harm overall performance) or compromising transactional properties (ACID).
Another non-trivial requirement is support for heterogeneous node configurations. When operating any large scale-out cluster of many nodes for a long time, it’s very likely that different hardware models become (un)available, or economically ideal or non-viable. Supporting heterogeneous clusters of non-identical nodes adds another layer of necessary sophistication to the internals of a scale-out system.
Very frequently, scale-out systems are scale-out but are not elastic in absolute or practical terms.
Conclusion
The conclusion to this post is simple at its core:
- The need for a proper scale-out relational transactional SQL system is evident. Yet the fact that most developers still start with single-node relational transactional databases, while most applications are designed to scale out, points to a huge gap that remains to this day.
- The common workarounds to scale the performance and capacity of single-node non-scale-out relational transactional databases (sharding, read-only replicas) have material downsides.
- Closing the gap requires a relational transactional scale-out database system that:
- Fully supports the relational data model with strong ACID guarantees – and ideally should support strict serializability and external consistency.
- Exhibits linear scalability and efficiency across a broad spectrum of data types, capacities, node counts/types and queries – including both transactional and analytical queries at the same time.
- Uses resources efficiently. If it costs far more to run a given workload on a scale-out system than a sharded configuration – pure economics can force non-ideal alternate choices and workarounds.
- Maintains these behaviors over a broad range of workloads, data types, and queries.
- Is able to do the above elastically – able to adapt to unforeseen changes of all types without materially impacting the overall system behavior.
Regatta is that kind of relational transactional scale-out distributed system – an OLxP system.
Subscribe to the blog as we will be publishing ongoing blog posts that will discuss more about how Regatta accomplishes this system level behavior, along with details about the novel IP involved. In addition, if you look at our Events, we will be doing a series of webcasts where you can learn more!