Regatta’s Concurrency-Control Protocols
Introduction
In this blog I’ll cover Regatta’s concurrency-control protocols and how our approach allows us to avoid many of the difficulties and pitfalls of providing transactional properties at scale.
As already described in a previous blog, Regatta’s entire functionality (whether transactional or analytical) is always fully supported across node boundaries and regardless of how the data is distributed across the cluster. For example, Regatta’s strong ACID guarantees are fully supported even when a transaction includes rows that reside on different nodes; distributed JOINs are fully supported across node boundaries; and so on. This is a fundamental design principle differentiating Regatta from the typical approach of “scale-out-by-sharding” databases.
Regatta supports multiple isolation levels for OLTP/OLAP workloads – all treated as first-class citizens in our implementation. Those isolation levels include the strongest Serializable level (with external consistency), as well as the somewhat more relaxed Read Committed and Repeatable Read levels. All these isolation levels provide their full, no-compromise guarantees across the entire cluster.
Regatta allows transactions to choose their isolation level on a transaction-by-transaction basis, and a concurrent mixture of isolation levels is fully and optimally supported.
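To make the per-transaction choice concrete, here is a minimal sketch of how an application might select an isolation level for each transaction through a DB-API-style driver. The regatta module, the connect() arguments, and the transaction(isolation_level=...) call are hypothetical names used for illustration only; they are not Regatta’s published client API.

```python
# Hypothetical sketch: choosing an isolation level per transaction from an
# application.  The "regatta" driver module and its exact API are assumptions
# for illustration only; Regatta's real client interface may differ.
import regatta  # hypothetical DB-API-style driver

conn = regatta.connect(host="cluster.example.com", database="sales")

# Transaction 1: a short OLTP update under full Serializable isolation.
with conn.transaction(isolation_level="SERIALIZABLE") as tx:
    tx.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    tx.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    # commit happens automatically when the block exits without an error

# Transaction 2: a longer analytical query under the more relaxed Repeatable Read.
with conn.transaction(isolation_level="REPEATABLE READ") as tx:
    rows = tx.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")

# Both transactions may run concurrently; the mixture of isolation levels is
# handled by the database, not by the application.
```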
Regatta’s Serializable Isolation Level
About Serializability
Regatta supports a true Serializable isolation level. Without getting into formal details, serializability guarantees that the results of concurrently executing transactions will appear as if they executed one after the other (i.e., in an arbitrary serial order), without exposure to the various anomalies that are an inherent, unavoidable part of the more relaxed isolation levels.
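As a quick illustration of what serializability rules out, here is the textbook “write skew” anomaly, sketched in Python under assumed state and rules (two on-call doctors); it is a generic example, not anything specific to Regatta:

```python
# Illustration only: the classic "write skew" anomaly that Serializable
# isolation rules out.  Two doctors are on call; each transaction checks that
# at least one other doctor remains on call before going off duty.
on_call = {"alice": True, "bob": True}

def others_on_call(me):
    return sum(1 for doc, active in on_call.items() if active and doc != me)

# Interleaved execution under a weaker isolation level: both transactions read
# the same initial state, both see another doctor on call, both go off duty.
alice_sees = others_on_call("alice")   # 1
bob_sees   = others_on_call("bob")     # 1
if alice_sees >= 1:
    on_call["alice"] = False
if bob_sees >= 1:
    on_call["bob"] = False

assert not any(on_call.values())  # nobody is on call -- an outcome that no
                                  # serial order (Alice-then-Bob or
                                  # Bob-then-Alice) could ever produce.
                                  # A Serializable CCP must prevent this.
```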
Terminology note: The term Concurrency Control Protocol (CCP) is used to describe the algorithmic approach and mechanisms used by a database to enforce a pertinent isolation level.
Performance Compromise?
Strong transactional semantics in true scale-out databases are almost always perceived as requiring a significant performance compromise:
- Distributed locking (often used by pessimistic CCPs) is meaningfully slower and more complicated than local, within-the-node, locks. As a result, its usage tends to come at the cost of increased transaction latency and resource consumption.
- Snapshots are used by some optimistic (and semi-optimistic) CCPs. Generally, in a true scale-out database, each snapshot (often created for each transaction or even for each statement – depending on the exact isolation level) should be perfectly consistent across the database nodes for the same point in time. Such high-rate, consistent, distributed snapshot creation is extremely challenging and tends not to scale well with the number of nodes and/or the number and rate of transactions – resulting in performance degradation.
Note: Some databases use clock-synchronization techniques to mitigate some of these challenges, as well as other CCP-related challenges. Clock-synchronization techniques, especially when not hardware-aided, are challenging to implement and often come with various performance-degradation effects, such as the need to introduce delays in certain operations (e.g., in a read operation) and/or increased chances of transaction aborts (in the case of optimistic techniques).
- Optimistic CCPs’ main promise is to avoid blocking coordination among transactions – i.e., by not using locks. Non-blocking operation is far from a free lunch, as optimistic CCPs tend to cause transactions to “commit suicide” (abort) and re-run from scratch on any data-access conflict with other transactions (a sketch of this validate-or-abort pattern follows the summary below). The “transaction suicide rate” in optimistic CCPs may become alarmingly high under some workloads. The abort probability may rise even further when combining both short and lengthy transactions and queries. Furthermore, unfortunately, the increased complexities and uncertainties (e.g., about timing of events) in scale-out database configurations often cause the distributed variations of optimistic CCPs to meaningfully (and sometimes even dramatically) increase the transaction abort rate, effectively hurting the overall performance of the database.
To sum up, strong transactional semantics in true scale-out databases are perceived as causing meaningful performance compromises due to (but not limited to) locks, snapshots, and suicides-on-conflict.
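To make the “suicide-on-conflict” behavior of typical optimistic CCPs concrete, here is a minimal sketch of classic backward validation, the textbook technique (and not Regatta’s protocol): any overlap between a transaction’s read-set and the write-set of a transaction that committed in the meantime forces an abort and a full re-run.

```python
# A minimal sketch of a *classic* optimistic CCP's validation phase -- the
# behavior Regatta's protocol avoids.  On any overlap between our read-set and
# the write-set of a transaction that committed during our working phase, the
# whole transaction aborts and must re-run from scratch.
class OptimisticTxn:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.read_set = set()    # keys read during the working phase
        self.write_set = {}      # key -> new value, buffered locally

    def read(self, db, key):
        self.read_set.add(key)
        return self.write_set.get(key, db.get(key))

    def write(self, key, value):
        self.write_set[key] = value

def validate_and_commit(txn, db, recently_committed_write_sets):
    """Backward validation: abort if anyone who committed while we were
    working wrote something we read."""
    for other_writes in recently_committed_write_sets:
        if txn.read_set & other_writes:
            return False                 # "commit suicide" -- caller re-runs
    db.update(txn.write_set)             # install the buffered writes
    return True

db = {"x": 1, "y": 2}
t = OptimisticTxn("TR1")
t.write("y", t.read(db, "x") + 10)
# Another transaction committed a write to "x" while TR1 was working: abort.
print(validate_and_commit(t, db, recently_committed_write_sets=[{"x"}]))  # False
```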
Regatta’s serializable isolation uses our proprietary, patented CCP, which:
- Does not use locks: In Regatta’s CCP, the main part of the transaction execution – the part when it does the “real work” – is done in a non-blocking manner, and hence without needing locks. In that respect, that part is completely optimistic.
- Is snapshot-free: Transactions in Regatta’s serializable isolation level do not use snapshots, nor do they need clock-synchronization.
- Unlike most optimistic protocols, doesn’t cause transactions to abort on detected conflicts (well, except of course for deadlocks, where both optimistic and pessimistic protocols tend to abort one transaction per deadlock cycle – more about this later).
Autonomous Operation
It is beyond the scope of this document to dive deeper into the execution model of transactions, queries, and statements in Regatta. However, it is worth mentioning that a statement execution is generally distributed and/or parallelized across the nodes containing the relevant data elements (as well as, under some circumstances, compute nodes that the statement decides to employ). In other words, it’s not that there is “one node” that “truly executes the computation of a statement”, using all the rest of the nodes just as “storage nodes” from which it fetches the data. Instead, the transaction has Transaction Agents instantiated on the nodes that are relevant for its execution. Those agents collaborate directly and, whenever required, may also exchange data (in a many-to-many fashion, thus avoiding single-node bottlenecks). For lengthier statements, such a data exchange facilitates a distributed pipelined execution. Needless to say, the collaboration among Regatta’s agents is as efficient and low-chatter as possible, letting each agent operate as autonomously within its node as possible.
Distributed algorithms are inherently harder to design than local algorithms. That fact is especially apparent with CCPs, which are all about coordinating parallel distributed activities. Our philosophy in building the distributed aspects of Regatta’s CCP algorithms and mechanisms was – similarly to the way transactions execute autonomously in Regatta (as described above) – to give each transaction agent operating in the cluster maximal autonomy.
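For readers who prefer code, here is a generic sketch of a many-to-many data exchange among per-node workers, a standard distributed-execution pattern. It is not Regatta’s implementation, and the exchange()/key_of names are invented for illustration; it only shows why repartitioning data directly between peers avoids funnelling everything through a single node.

```python
# A generic illustration of many-to-many data exchange among per-node agents
# (a standard distributed-execution pattern), as opposed to funnelling all
# rows through one "executor" node.  This is NOT Regatta's implementation --
# only a sketch of why such an exchange avoids a single-node bottleneck.
def exchange(agent_outputs, num_agents, key_of):
    """Each agent repartitions its local rows by a hash of the join/group key
    and sends each partition directly to the responsible peer agent."""
    inboxes = [[] for _ in range(num_agents)]
    for rows in agent_outputs:               # each agent works on its local data
        for row in rows:
            inboxes[hash(key_of(row)) % num_agents].append(row)
    return inboxes                            # each peer continues its own work

# Three agents hold local order rows; after the exchange, each agent owns all
# rows for the customer keys hashed to it and can aggregate them independently.
agents = [
    [("cust-1", 10), ("cust-2", 5)],
    [("cust-2", 7), ("cust-3", 1)],
    [("cust-1", 3)],
]
print(exchange(agents, num_agents=3, key_of=lambda r: r[0]))
```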
In Regatta, most of the CCP-related operations do not require tight coordination among the nodes. In fact, as discussed below, when the transaction is doing its “main work”, none of the CCP-related operations require coordination among the nodes. That general approach is great for performance but is also great for simplifying many of our internal algorithms and mechanisms…
Conflict Resolution
As mentioned, a transaction that executes in Regatta is completely non-blocking/optimistic during its “real work” part (hereafter the Working Phase). Unlike some optimistic CCPs, the transaction does not even attempt to detect conflicts with other transactions during the working phase. That results in a lightweight, almost invisible, concurrency-control overhead during the working phase of the transactions. It also means that deadlocks (among transactions) cannot happen during the working phase, and hence no deadlock-handling is required during that phase.
If anybody were to argue that transactions could run freely, completely ignoring conflicts for their entire lifetime, and that serializability would magically “just happen”, they would be very wrong. Physics just won’t let that happen.
As stated above, Regatta’s serializable CCP doesn’t bother with conflicts during the working phase of the transaction. Once Regatta’s CCP does deal with conflicts, that activity is done cheaply and mostly locally (i.e., by each of the transaction agents separately in its own relevant node), maintaining the “autonomous operation” style discussed above. I won’t get into all the details here, but I can say that there is a meaningful conflict tolerance built into the algorithms. For instance, conflict handling is done at the fine resolution of a single data-cell (rather than at the row level or at the level of a block/page of rows), thus reducing “false conflicts”. In addition, we care only about one out of the three types of conflicts (W/R, R/W, W/W). Furthermore, there is a sophisticated (optimistic flavor of) “predicate locking” in place.
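The following is a hedged sketch, not Regatta’s actual algorithm, of why cell-granularity conflict tracking reduces “false conflicts” compared to row- or page-granularity, and of the idea of caring about only a single conflict direction. The CellConflictTracker class and its methods are illustrative names only.

```python
# A hedged sketch of cell-granularity conflict detection (as opposed to
# row- or page-granularity).  This is NOT Regatta's actual algorithm -- it
# only illustrates why finer granularity reduces "false conflicts".
from collections import defaultdict

# (table, row_key, column) identifies a single data-cell.
Cell = tuple

class CellConflictTracker:
    def __init__(self):
        self.writes = defaultdict(set)   # cell -> {txn ids that wrote it}
        self.reads = defaultdict(set)    # cell -> {txn ids that read it}

    def record_read(self, txn_id, cell: Cell):
        self.reads[cell].add(txn_id)

    def record_write(self, txn_id, cell: Cell):
        self.writes[cell].add(txn_id)

    def conflicts_for(self, txn_id):
        """Report only one conflict direction (here: others' writes against
        our reads), mirroring the idea of caring about a single conflict
        type rather than all of W/R, R/W and W/W."""
        conflicting = set()
        for cell, readers in self.reads.items():
            if txn_id in readers:
                conflicting |= self.writes[cell] - {txn_id}
        return conflicting

tracker = CellConflictTracker()
tracker.record_read("TR100", ("employees", "james", "profession"))
tracker.record_write("TR1", ("employees", "james", "hair_color"))
# Different cells of the same row: no conflict is reported, whereas a
# row-granular tracker would have flagged a false conflict.
assert tracker.conflicts_for("TR100") == set()
```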
And now for something else that sets our CCP apart: Unlike the typical optimistic protocols, in Regatta the detection of conflicts does not result in aborting the transaction.
As already mentioned, there is only one (legitimate and unavoidable) exception. In case of a deadlock, everybody – pessimistic CCPs, optimistic CCPs, and Regatta’s CCP alike – will have to abort a transaction… Deadlocks are covered in more detail next.
Deadlocks Among Transactions
Deadlock Handling in General
In general, different transactions may access the data-cells of a database in various orders that may sometimes create a deadlock among a set of transactions. Traditionally (and with good reason!), it’s the responsibility of the database to deal with those deadlocks. Ideally, a deadlock should be detected reliably and then resolved by killing one of the transactions participating in the deadlock.
Deadlock detection is typically achieved by an algorithm that searches for directed cycles in a dynamic transaction dependency-graph. It is a somewhat expensive process, but it is considered reasonably doable within the boundaries of a single-node database, where the dependency-graph resides entirely in the local RAM of that single node. Enter distributed databases… The dynamic transaction dependency information is potentially spread across the entire cluster. Searching for directed cycles in a dynamic, distributed graph (with cross-transaction dependencies potentially occurring in different nodes) is not an easy task, nor is it particularly cheap…
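For reference, here is the classic single-node formulation: deadlock detection as a depth-first search for a directed cycle in a waits-for graph. The distributed version of this problem, as described above, is considerably harder.

```python
# Deadlock detection as cycle search in a waits-for graph (the classic,
# single-node formulation).  Each edge T1 -> T2 means "T1 waits for T2".
def find_deadlock_cycle(waits_for):
    """Return a list of transactions forming a cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}
    stack = []

    def dfs(t):
        color[t] = GREY
        stack.append(t)
        for nxt in waits_for.get(t, ()):
            if color.get(nxt, WHITE) == GREY:          # back edge -> cycle
                return stack[stack.index(nxt):]
            if color.get(nxt, WHITE) == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[t] = BLACK
        return None

    for t in list(waits_for):
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None

# T1 waits for T2, T2 waits for T3, T3 waits for T1: a deadlock.
print(find_deadlock_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))
```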
That challenge often leads distributed transactional databases to use Deadlock Prevention techniques instead of deadlock detection. Deadlock prevention techniques are much more “easy going” about killing transactions. Instead of detecting actual deadlocks, they terminate transactions much more “brutally”, sometimes even when the only issue is that a transaction is merely waiting for a resource that will soon become available, without any real deadlock occurring.
Those prevention techniques indeed eliminate the creation of deadlocks, but with an amplified penalty of killing transactions that could have executed without causing a true deadlock. This, obviously, effectively slows down transaction processing, sometimes just because there were some legitimate non-deadlocking conflicts, or because there was more load causing more waits (which is a particularly bad time to kill a transaction and re-run its entire work, as doing so makes the effective load even heavier…). One well-known member of that family of techniques is sketched below.
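The sketch shows the “wait-die” rule, mentioned here as a generic example of deadlock prevention (and not as something Regatta uses): transaction age alone decides who waits and who is killed, regardless of whether a real deadlock exists.

```python
# A sketch of the "wait-die" deadlock-prevention rule, one well-known example
# of the prevention techniques discussed above (and NOT what Regatta does).
# Each transaction carries a start timestamp; an older transaction may wait
# for a younger one, but a younger transaction requesting a resource held by
# an older one is killed immediately -- even if no actual deadlock exists.
def wait_die(requester_ts, holder_ts):
    """Return 'wait' or 'abort' for the requesting transaction."""
    if requester_ts < holder_ts:     # requester is older -> allowed to wait
        return "wait"
    return "abort"                   # requester is younger -> dies, re-runs later

# A young transaction bumping into an old one is aborted, although the old
# transaction might have released the resource a millisecond later.
assert wait_die(requester_ts=200, holder_ts=100) == "abort"
assert wait_die(requester_ts=100, holder_ts=200) == "wait"
```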
Regatta
Summarizing:
- Unlike the typical optimistic protocols, Regatta’s conflict detection does not result in aborting the transaction, with the legitimate and unavoidable exception of deadlocks. In case of a deadlock, everybody, the pessimistic CCPs, the optimistic CCPs, as well as Regatta’s CCP will have to abort a transaction.
- In Regatta, the transaction execution is completely non-blocking/optimistic during its working phase. Additionally, the transaction does not care about detecting any conflicts with other transactions during the working phase. That means that deadlocks cannot happen during the working phase, and hence no deadlock handling is required during that phase.
- When Regatta’s CCP deals with conflicts, that activity is executed cheaply and mostly local (i.e., done by each of the transaction agents separately inside its relevant node), maintaining the “autonomous operation” style.
In Regatta’s CCP, the kind of deadlock that can arise from conflicts is simpler, and its detection is more lightweight than it would be if deadlock handling had to be performed during the transaction’s working phase. In fact, the deadlock detection takes place when the transactions no longer hold the possibly-heavy resources required for their execution. Furthermore, and because of the above, Regatta’s CCP takes the approach of distributed deadlock detection (rather than deadlock prevention), and does not needlessly abort transactions when no real deadlock exists.
Linearly-Scalable, Efficient Distributed Commitment
Regatta was designed and developed from day one with truly large clusters in mind. Regatta’s CCP was designed and built to excel with small clusters of a handful of nodes (or even a single node), with very large clusters (e.g., 10,000 nodes), and with anything in between.
Designing a reliable and fault-tolerant atomic distributed commitment mechanism is challenging. Doing it in a linearly-scalable-performing fashion is even harder. Commitment mechanisms that work reasonably well in small clusters may behave less charmingly in a large cluster. A short, small transaction that involves very few nodes (e.g., 3 nodes in a 500-node cluster) should be able to commit without involving many other nodes in the cluster; otherwise, linear scalability would sooner or later be compromised.
When committing a transaction, Regatta’s CCP will generally involve only the nodes that specifically wrote/modified data for the committing transaction. It will not involve a larger forum (this is essential for efficiency and for linear scalability). The commitment is performed in the most lightweight manner possible. For various reasons, we preferred to create a variation of a two-phase commit rather than to use a mechanism based on a consensus algorithm.
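Here is a minimal, illustrative two-phase-commit sketch that involves only the nodes that wrote data for the transaction. Regatta’s real commitment protocol is a proprietary variation with its own fault-tolerance machinery; the WriterNode class and its prepare()/finish() methods are invented for this sketch. The point it illustrates is the “small forum”: a transaction that touched 3 nodes talks to 3 nodes, not to the whole cluster.

```python
# A minimal two-phase-commit sketch involving only the nodes that actually
# wrote data for the transaction.  Illustrative only; Regatta's protocol is a
# proprietary variation of 2PC with its own fault-tolerance machinery.
class WriterNode:
    """Illustrative participant that holds the transaction's pending writes."""
    def __init__(self, name):
        self.name = name
        self.prepared = set()

    def prepare(self, txn_id):
        # Durably stage the writes and vote yes (always yes in this sketch).
        self.prepared.add(txn_id)
        return True

    def finish(self, txn_id, decision):
        self.prepared.discard(txn_id)
        print(f"{self.name}: {decision} {txn_id}")

def two_phase_commit(writer_nodes, txn_id):
    # Phase 1: only the nodes that wrote data for this transaction are asked
    # to prepare -- e.g., 3 nodes out of a 500-node cluster.
    votes = [node.prepare(txn_id) for node in writer_nodes]
    decision = "commit" if all(votes) else "abort"
    # (A real protocol durably logs the decision before phase 2.)

    # Phase 2: the decision is sent to the same small set of nodes only.
    for node in writer_nodes:
        node.finish(txn_id, decision)
    return decision

# Commit a small transaction that touched just three nodes.
nodes = [WriterNode("node-17"), WriterNode("node-42"), WriterNode("node-301")]
print(two_phase_commit(nodes, "TR100"))
```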
Row-Versions
Talking about our (flash-optimized) persistent data layouts is beyond the scope of this document. It is worth mentioning, however, that it’s not a slotted-page structure, nor is it an LSM Tree. Instead, we developed our own log-structured data layout that operates very differently from an LSM Tree, effectively realizing an MVCC data-layout that allows writing and reading variable-size rows in a performant manner with well-bounded, low I/O amplification, thus making it optimal for a large variety of workload types. Interestingly, this layout also frees us from needing to maintain redo/undo logs.
When all of Regatta’s transactions run at the serializable isolation level, each row will tend to have a single row-version only. When a transaction TR100 modifies an existing row R10, then, for a very short period, we will have both the old (“current”) row-version and the new row-version (containing the modifications done by TR100). A short time after the commitment, the old row-version will be cleaned, very cheaply. In fact, the internal mechanics of this are very close to saying that the “row is overwritten by the new contents”, although a slightly better way of thinking about it is that the new row-version replaces the old row-version.
Row-versions are covered in more detail later in this document.
Predicate Locking
Regatta’s serializable CCP approach aims to permit as much concurrency as possible. For that purpose, among other things, Regatta’s CCP includes a sophisticated proprietary optimistic flavor of “predicate-locking” algorithms.
Predicate Locking is rarely discussed, yet it is arguably one of the big rocks in allowing the restrictive serializable isolation level to gain better concurrency. If transaction TR100 selects a (potentially small) number of rows to act upon (hereafter The Selected Row-Set) by specifying a predicate, then, generally, it is desired (and in most cases essential) that other transactions (e.g., TR1) wouldn’t modify rows in a way that would alter the selected row-set (e.g., cause another row to satisfy TR100’s predicate after TR100 has already selected its rows, leaving TR100 unable to do anything about it) and then commit before TR100 commits. Perfect enforcement, while still allowing as much concurrency as possible, is known to be a hard problem, and the typical solutions provided by relational databases are arguably very compromising.
Regatta can utilize predicate-locking even on complex predicates whose evaluation may be distributed, and that may also combine indexed and non-indexed elements. Regatta’s predicate-locking algorithms go much further than just “let’s lock an index-page and thus block rows from illegally entering or leaving the population selected by the predicate”.
This topic clearly deserves a blog of its own, but, in the meantime, here is an example that demonstrates what our technology can provide:
Consider a fictitious employees table that, among other columns, contains profession, hair_color, and shoe_size (each may or may not be indexed). Suppose we have a transaction TR100 that would like to act upon all employees that are dentists, whose hair color is red/orange/yellow, and whose shoe size is in the range 6-8.
Suppose, at the time TR100 searched for its selected row-set, James was a carpenter, with brown hair and a shoe size of 9. James is clearly outside of TR100’s selected row-set. Now let’s assume TR100 is still executing, and already chose its selected row-set. Let’s consider three short transactions, TR1, TR2 and TR3:
- TR1 modifies James’ profession to dentist.
- TR2 modifies James’ hair color to red.
- TR3 modifies James’ shoe size to 7.
Assuming TR2 and TR3 do not execute at all, could TR1 execute (and commit) concurrently to TR100? The answer is “yes”. The fact that James became a dentist wouldn’t necessitate his inclusion in TR100’s selected row-set.
Note: Whether a typical relational database will allow those to run concurrently is a different question… Regatta’s technology does allow that concurrency.
In fact, any single transaction out of those three (TR1 or TR2 or TR3) could in theory execute and commit concurrently to TR100 without violating the desired consistency expectations.
Furthermore, any combination of two out of those three transactions (e.g., TR1 and TR2) could in theory execute and commit concurrently to TR100 without violating consistency.
However, all three transactions (TR1 and TR2 and TR3) cannot execute and commit concurrently to TR100, as their combined end result would make James qualify for selection by TR100 (which, unfortunately, didn’t select him). The anomaly of TR100 not including James could easily lead to a violation of serializability.
Regatta’s technology can understand all that and allow just the right level of concurrency.
One more small example: if James’ friend Jane is a dentist with red hair and shoe size 7, a transaction TR4, modifying her hair color to orange, could execute and commit concurrently to TR100, as Jane’s inclusion in TR100’s selected row-set is agnostic to whether her hair color is red or orange. Here again, Regatta’s technology can allow that level of concurrency. Whether other databases could allow it is a different question.
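The whole example can be restated as a simple predicate check. The sketch below is only a restatement of the reasoning about TR100’s selected row-set; it says nothing about how Regatta’s predicate-locking algorithms are actually implemented.

```python
# The employees example above, expressed as a plain predicate check.  This
# only restates the reasoning about TR100's selected row-set; it is not how
# Regatta's predicate-locking algorithms work internally.
def tr100_predicate(emp):
    return (emp["profession"] == "dentist"
            and emp["hair_color"] in ("red", "orange", "yellow")
            and 6 <= emp["shoe_size"] <= 8)

james = {"profession": "carpenter", "hair_color": "brown", "shoe_size": 9}
assert not tr100_predicate(james)                      # outside the row-set

# Any one or two of TR1/TR2/TR3 still leave James outside TR100's row-set:
assert not tr100_predicate({**james, "profession": "dentist"})                       # TR1 only
assert not tr100_predicate({**james, "profession": "dentist", "hair_color": "red"})  # TR1 + TR2

# All three together would pull James into the row-set TR100 already selected:
assert tr100_predicate({**james, "profession": "dentist",
                        "hair_color": "red", "shoe_size": 7})

# Jane: red vs. orange hair makes no difference to her membership,
# so TR4 can safely run concurrently with TR100.
jane = {"profession": "dentist", "hair_color": "red", "shoe_size": 7}
assert tr100_predicate(jane) == tr100_predicate({**jane, "hair_color": "orange"})
```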
Regatta’s Read Committed and Repeatable Read Isolation Levels
General Description
In addition to the Serializable isolation level, Regatta supports (as first-class citizens) the Read Committed and Repeatable Read isolation levels (often also referred to as “snapshot isolation levels”). While those isolation levels are by definition more relaxed than the Serializable level, they still fully provide their transactional guarantees, in a non-compromised manner, across the entire cluster, including for transactions that deal with multiple rows residing on multiple nodes.
To achieve that, Regatta uses proprietary Point-in-Time (PiT) technology that supports an extremely high rate of transactions (i.e. millions per second or even more) in very large clusters, pretty cheaply, without paying the penalty typically associated with the creation of consistent distributed snapshots, and without relying on clock-synchronization.
Unlike the typical semi-optimistic approaches used by other databases, where reading is done in a non-blocking manner but writing requires the use of blocking locks, a Regatta transaction’s working phase is completely non-blocking also under the Read Committed and Repeatable Read isolation levels.
In Regatta, many of the principles discussed above for the Serializable isolation level are applied in those snapshot isolation levels as well. It’s just that reading is done using the contents of a specific point-in-time, as those isolation levels define, and the conflict detection and resolution is more relaxed, thus complying with the defined semantics of those snapshot isolation levels.
Row-Versions
In General
When those snapshot isolation levels are in use (whether for OLTP or OLAP-like workloads), multiple row-versions may be created for some of the rows (those that happened to be modified multiple times recently), as multiple transactions or statements may read the contents of those rows as of multiple different points-in-time, and those recent modifications could result in different contents (= row-versions) for those different points-in-time.
Note: While this could generally lead to non-serializable transactional execution, such anomalies are par for the course for those MVCC isolation levels.
Once a transaction that created a snapshot completes its work (e.g., the statement completes, or the transaction commits – again, depending on the exact isolation level), its snapshot is no longer needed. That means that some row-versions stored by the database may no longer be needed, and they can be Cleaned (i.e., removed). Normally, the “most up-to-date” row-version survives (well, except for a case where a transaction TR1 took a snapshot and is reading row R10, and then a transaction TR100 removes the row and commits while TR1 is still running – interestingly, R10 no longer exists, but a historical row-version of it is still maintained for the sake of TR1, until TR1 completes its execution).
In general, the cleaning of those historical row-versions is by no means trivial: a specific row-version may “serve” multiple snapshots, so discovering which row-versions can be cleaned, and when, is often a non-trivial task.
As a result of all the above, relational databases that support snapshot isolation levels need to maintain historical row-versions of the database, at least until the moment each of those could be cleaned.
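As a generic illustration of that bookkeeping problem (and not of Regatta’s mechanism), here is a sketch that decides which of a row’s historical versions can be cleaned, given the commit timestamps of its versions and the timestamps of the snapshots still in use.

```python
# A generic sketch of deciding which historical row-versions are still needed,
# given the set of active snapshot timestamps.  This is the general MVCC
# bookkeeping problem described above, not Regatta's specific mechanism.
def cleanable_versions(versions, active_snapshots, row_deleted=False):
    """versions: sorted list of commit timestamps of one row's versions.
    active_snapshots: timestamps of snapshots still in use.
    Returns the commit timestamps that no snapshot can ever read again."""
    keep = set()
    if not row_deleted:
        keep.add(versions[-1])               # newest version serves future readers
    for snap_ts in active_snapshots:
        # Each snapshot reads the newest version committed at or before it.
        visible = [v for v in versions if v <= snap_ts]
        if visible:
            keep.add(visible[-1])
    return [v for v in versions if v not in keep]

# Row R10 has versions committed at ts 10, 20 and 30; snapshots at ts 25 and 35
# are still running.  The version from ts 10 serves nobody and can be cleaned;
# ts 20 still serves the snapshot at 25, and ts 30 is the current version.
print(cleanable_versions([10, 20, 30], active_snapshots=[25, 35]))   # -> [10]
```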
Different Implementations for Row-Versioning
Different databases vary in the way they implement, store, and clean row-versions. For example, Oracle generally has one “main version” of the row stored on disk (in the main “table space”), and the rest of the row-versions are stored in the undo log (on disk). Note that the “new version” is first written to the redo log and only then moved to the main table space. PostgreSQL, as another example, stores each new row-version as a separate entry in its main data-layout.
Each of those approaches has its advantages and disadvantages. For example, in Oracle, the undo log has a maximal size. If a lengthy transaction holds a snapshot for “too long”, and it reads a row (from that snapshot) whose relevant row-version was already retired from the undo log (as it has a limited size), the transaction will simply fail… PostgreSQL, on the other hand, will generally not suffer from that problem, but it relies on the heavy VACUUM background process that crawls the database, seeking row-versions that could be cleaned. That crawling is often criticized as undesired heavy resource consumption that hurts the overall performance, sometimes in a pretty unpredictable manner.
The Penalty of Reading a Row that has Multiple Row-Versions
The fact that databases store multiple historical row-versions of the same row requires the database to “know” which row-version a certain (reading) transaction needs. Many of those MVCC databases need to search for the “right” row-version of the row by scanning all its row-versions (for example, comparing their “timestamp” with the pertinent snapshot’s “timestamp” until reaching the “right” row-version for the read). With that often-used strategy, the existence of many different row-versions for the same row could hurt (or even meaningfully hurt) row-read performance, as the above search is often done by traversing the on-disk row-versions, using multiple disk I/Os. This is clearly one of the (quite many) reasons why “OLTP and OLAP don’t work well together”: When one executes multiple lengthy OLAP queries together with a decent OLTP workload, there are greater chances for a larger number of row-versions, at least for some of the rows (well – the “hotter” rows, which are accessed more frequently – which actually worsens the performance degradation…).
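Here is a sketch of that often-used “scan the version chain” strategy, showing how the number of versions visited (and hence, potentially, the number of disk I/Os) grows when a reader holds an old snapshot while a hot row keeps accumulating versions.

```python
# A sketch of the "scan the version chain" strategy many MVCC databases use:
# walk the row's versions (potentially one disk I/O each) until finding the
# newest one whose commit timestamp is not later than the reader's snapshot.
def read_row_version(version_chain, snapshot_ts):
    """version_chain: newest-first list of (commit_ts, payload) entries.
    Returns (payload, versions_visited)."""
    visited = 0
    for commit_ts, payload in version_chain:   # each hop may cost a disk I/O
        visited += 1
        if commit_ts <= snapshot_ts:
            return payload, visited
    return None, visited                        # row did not exist at snapshot_ts

# A "hot" row updated many times while OLAP queries hold old snapshots:
chain = [(ts, f"v@{ts}") for ts in range(100, 0, -10)]   # 10 versions, newest first
payload, visits = read_row_version(chain, snapshot_ts=15)
print(payload, visits)   # 'v@10' after visiting 9 too-new versions first (10 in total)
```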
Regatta
In Regatta, and unlike the typical MVCC databases, a row read does not require the expensive on-disk traversal of all the row’s versions to determine the right row-version to read. Instead, the inherent metadata that is anyway (efficiently) accessed during a row read provides all the information necessary to instantly laser-focus on the right row-version. Consequently, while a workload containing a mixture of OLTP transactions and OLAP-like queries (which would typically use those snapshot isolation levels) may create rows with multiple row-versions, statements in Regatta that read such rows under snapshot isolation won’t pay an extra penalty. The same can be said for concurrent OLTP transactions that run in Serializable isolation.
As for row-version cleaning: In Regatta, unlike Oracle’s approach mentioned above, row-versions that are “needed” by existing transactions will not be retired. In that respect, Regatta is closer to the above-mentioned PostgreSQL approach. However, Regatta knows how to pinpoint, extremely efficiently, which row-versions can be cleaned, without performing heavy operations and without possibly-heavy activities such as PostgreSQL’s VACUUM crawling.