Regatta’s Architecture: A Bird’s Eye View
In this blog I will provide a high-level overview of Regatta’s design and architectural principles. While this introduction remains relatively high-level, the next few blogs will get into more detail on the various topics covered here.
We designed Regatta so that its core architecture and implementation are agnostic to the deployment type. That is, Regatta’s core database is essentially the same whether you install it on physical (or virtual) servers on-prem, run it in your cloud of choice, or consume it as a service.
Scale-Out, Mostly Shared-Nothing
Regatta is mainly a scale-out, shared-nothing clustered architecture in which heterogeneous nodes (servers, VMs, containers, etc.) cooperate and can perform lengthy (as well as short) SQL statements in a parallel/distributed manner, with many-to-many data propagation among the cluster nodes (i.e., intermediate data doesn’t need to pass through a central point). In a “shared-nothing” architecture, the disks are typically locally attached to the nodes. Even when they are not, the guiding principle is that each disk is accessible by only a single node in the cluster.
Unlike some scale-out architectures that embrace symmetrical, homogeneous nodes, a Regatta cluster can consist of nodes of various sizes and configurations. In the native mode there is no distinction between storage nodes and compute nodes: all nodes do both data access and statement processing. Furthermore, compute-only and storage-only nodes can be added to the cluster – seamlessly supporting a “decoupled storage and compute” two-layer topology that may include “diskless” nodes and nodes with disks in the same cluster.
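To make that concrete, here is a minimal sketch of such a mixed topology – in Python, with invented names; this is not Regatta’s API:

```python
# A toy model (names invented) of one cluster mixing hybrid,
# compute-only, and storage-only nodes.
from dataclasses import dataclass
from enum import Flag

class Role(Flag):
    COMPUTE = 1
    STORAGE = 2
    HYBRID = 3   # COMPUTE | STORAGE: the native mode

@dataclass
class Node:
    name: str
    roles: Role

cluster = [
    Node("n1", Role.HYBRID),    # native mode: data access + processing
    Node("n2", Role.COMPUTE),   # "diskless" compute-only node
    Node("n3", Role.STORAGE),   # storage-only node
]

# Any node advertising STORAGE can hold data; any node advertising
# COMPUTE can execute (parts of) SQL statements.
print([n.name for n in cluster if Role.STORAGE in n.roles])  # ['n1', 'n3']
print([n.name for n in cluster if Role.COMPUTE in n.roles])  # ['n1', 'n2']
```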
Relational Data Model
Regatta is a relational database, fully supporting the power of the relational model and everything else that should be expected from a comprehensive relational database, whether transactional or analytical. The architecture supports extension of the model for storing and processing semi-structured as well as unstructured data.
Cluster-Wide Functionality
Regatta’s entire functionality (whether transactional or analytical) is always fully supported regardless of how the data is distributed across the cluster – e.g., the strong ACID guarantees hold even when a transaction includes rows that reside on different nodes, distributed JOINs are fully supported across node boundaries, and so on. This is a fundamental design principle that differentiates Regatta from the typical approach of “scale-out-by-sharding” databases.
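From the client’s perspective this means plain SQL. In the sketch below the driver and connection API are assumptions (a generic DB-API-style client); the statements themselves are standard, and the point is that they behave the same whether or not the rows involved live on the same node:

```python
import regatta  # hypothetical client library; the name is an assumption

conn = regatta.connect("regatta://my-cluster")
cur = conn.cursor()

# One ACID transaction; accounts 17 and 42 may reside on different nodes.
cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 17")
cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 42")
conn.commit()  # atomic even across node boundaries

# A JOIN that crosses node boundaries is likewise just SQL:
cur.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
```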
Very Large Clusters from Day One
Regatta was designed and developed with truly large clusters in mind from day one. Regatta’s team has decades of in-depth experience building large-scale, mission-critical, high-performance data management and storage software systems. While Regatta can scale to thousands of nodes in a cluster, its scale-out design does not compromise the performance or performance-efficiency of small configurations – not even of single-node configurations.
Concurrency-Control Protocol (CCP)
For transactional processing, Regatta supports a fully serializable (and externally consistent) isolation level based on Regatta’s proprietary Concurrency Control Protocol (CCP). Regatta’s CCP is mainly optimistic, although unlike most optimistic protocols it doesn’t abort transactions on detected conflicts (except, of course, for deadlocks, where both optimistic and pessimistic protocols tend to abort one transaction per deadlock cycle). In general, Regatta’s CCP is snapshot-free and does not require clock synchronization. As already mentioned, the CCP provides the very same transactional guarantees regardless of whether the data is distributed across nodes. Furthermore, the CCP was built with scale-out in mind and is fully optimized for large scale-out configurations. In addition to the strong serializable mode, Regatta also supports more relaxed isolation levels such as Repeatable Read and Read Committed, which are likewise fully optimized for scale-out configurations. As described later on, read-only queries do not block writing transactions.
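For flavor, a small sketch using the standard SQL form of isolation-level selection (Regatta’s exact syntax isn’t specified in this post), reusing the hypothetical connection from the earlier example:

```python
# 'cur' and 'conn' are the hypothetical DB-API objects from above.
cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
cur.execute("SELECT SUM(balance) FROM accounts")  # serializable read
conn.commit()

# The relaxed levels use the same standard syntax:
cur.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
```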
Regatta’s CCP aims to permit as much concurrency as possible. For that purpose, among other things, it includes sophisticated “predicate-locking” algorithms (well, our own proprietary, optimistic flavor of “predicate-locking”). Regatta can apply predicate locking even to complex predicates whose evaluation may be distributed and that may combine indexed and non-indexed elements. Regatta’s predicate-locking algorithms go much further than just “let’s lock an index page and thus block rows from illegally entering or leaving the population selected by the predicate”.
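To show just the core idea (and only the idea – Regatta’s proprietary algorithms are optimistic and distributed, while this toy is neither), here is the classic predicate-conflict check: a write conflicts with a reader’s predicate if it would move a row into or out of the population the reader selected – the “phantom” hazard:

```python
from typing import Callable, Optional

Row = dict  # a row as a column-name -> value mapping

# (txn_id, predicate) pairs registered by readers.
registered: list[tuple[int, Callable[[Row], bool]]] = []

def register_predicate(txn_id: int, pred: Callable[[Row], bool]) -> None:
    registered.append((txn_id, pred))

def write_conflicts(txn_id: int, before: Optional[Row], after: Optional[Row]) -> bool:
    """A write conflicts if the row's old or new image satisfies a
    predicate held by another transaction."""
    for holder, pred in registered:
        if holder != txn_id and ((before and pred(before)) or (after and pred(after))):
            return True
    return False

# T1 ran: SELECT * FROM emp WHERE dept = 'eng' AND salary > 100000
register_predicate(1, lambda r: r["dept"] == "eng" and r["salary"] > 100000)

# T2's INSERT would join T1's selected population -> conflict:
print(write_conflicts(2, None, {"dept": "eng", "salary": 120000}))   # True
# An unrelated insert does not conflict:
print(write_conflicts(2, None, {"dept": "sales", "salary": 90000}))  # False
```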
Transaction Commitment
Transaction commitment in Regatta is a linearly scalable mechanism that involves only the nodes that actually wrote data for the committing transaction, rather than a much broader set of nodes. That property is obviously very important for linearly scalable performance of short transactions (which tend to involve a small number of nodes each) in very large clusters.
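The sketch below models only that scoping property (the actual commit protocol is not described in this post): the commit path talks to the transaction’s writer nodes, not to the whole cluster, so a short transaction that touched two nodes of a thousand-node cluster contacts exactly two nodes:

```python
def send(node: str, msg: tuple) -> bool:
    """Stand-in for an RPC to a cluster node; always succeeds here."""
    print(f"-> {node}: {msg}")
    return True

class Txn:
    def __init__(self, txn_id: str):
        self.txn_id = txn_id
        self.written_nodes: set[str] = set()

    def write(self, node: str, key: str, value: object) -> None:
        send(node, ("write", self.txn_id, key, value))
        self.written_nodes.add(node)           # remember the participants

    def commit(self) -> bool:
        participants = self.written_nodes      # NOT the whole cluster
        if all(send(n, ("prepare", self.txn_id)) for n in participants):
            for n in participants:
                send(n, ("commit", self.txn_id))
            return True
        for n in participants:
            send(n, ("abort", self.txn_id))
        return False

t = Txn("t1")
t.write("n7", "k1", "v1")
t.write("n9", "k2", "v2")
t.commit()  # talks only to n7 and n9
```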
Deadlock Detection
When deadlocks among transactions do occur (which, by the way, can never happen during the main parts of Regatta’s transaction execution), we use distributed deadlock detection techniques rather than deadlock prevention techniques, without hurting the performance-at-scale of the system. Note that deadlock prevention techniques (which are often used by distributed databases) may cause unneeded aborts due to false-positive detections.
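As a textbook illustration of detection (Regatta’s distributed variant is not detailed here), detecting a deadlock amounts to finding a cycle in the transactions’ wait-for graph – and only transactions on an actual cycle are candidates for abort, with no false positives:

```python
# An edge T -> U means "transaction T waits for transaction U".
def find_deadlock(wait_for: dict[str, set[str]]) -> list[str] | None:
    """Return one deadlock cycle as a list of txn ids, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color: dict[str, int] = {}
    stack: list[str] = []

    def dfs(t: str) -> list[str] | None:
        color[t] = GRAY
        stack.append(t)
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:            # back edge: a cycle
                return stack[stack.index(u):] + [u]
            if color.get(u, WHITE) == WHITE and (c := dfs(u)) is not None:
                return c
        stack.pop()
        color[t] = BLACK
        return None

    for t in wait_for:
        if color.get(t, WHITE) == WHITE and (c := dfs(t)) is not None:
            return c
    return None

# T1 and T2 wait for each other (deadlock); T3 is merely blocked.
print(find_deadlock({"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}}))
# -> ['T1', 'T2', 'T1']
```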
Up-to-the-Second Queries of Changing Data
Consistent/serializable read-only queries, short or lengthy, can be performed on real-time, up-to-the-second transactional data without blocking writing transactions from making progress. Regatta achieves this without the penalties, non-linearities, and global quiesce typically associated with distributed snapshot creation, and without the penalties and difficulties associated with clock-synchronization techniques. The mechanisms are optimal for both high-rate short queries and lengthier queries. All of this applies to Regatta in both small and very large clusters.
Regatta uses a persistent data layout that supports multi-versioning. As such, at any given moment, a row may have multiple versions associated with it. Unlike some databases that use MVCC, locating and reading the relevant version of a row does not require traversing the row’s versions themselves, and therefore does not create read I/O amplification when a row happens to have a large number of versions.
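A toy of that property only (the on-disk layout itself is proprietary): the reader consults a small per-row index of (commit timestamp, location) entries and then fetches exactly one version, rather than walking a chain of versions on disk:

```python
import bisect

# row_id -> version entries sorted by commit timestamp.
version_index: dict[str, list[tuple[int, int]]] = {
    "row42": [(10, 1001), (17, 2002), (25, 3003)],  # (commit_ts, location)
}
disk = {1001: "v@10", 2002: "v@17", 3003: "v@25"}   # stand-in for flash

def read_version(row_id: str, as_of_ts: int) -> str | None:
    entries = version_index[row_id]
    i = bisect.bisect_right(entries, (as_of_ts, float("inf"))) - 1
    if i < 0:
        return None            # the row didn't exist yet at as_of_ts
    _, location = entries[i]
    return disk[location]      # exactly ONE read, however many versions exist

print(read_version("row42", 20))  # 'v@17'
```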
Query Planning, Optimizers, Parallel & Distributed Computation
Regatta’s query planner, as well as its static and dynamic optimizers, were built from scratch with distributed configurations in mind. In other words, Regatta’s query planning and optimization were not first designed for a single node and then later expanded to distributed configurations as an afterthought. Whenever appropriate, lengthy computations are performed in a massively parallel and distributed manner across a potentially large number of nodes. These nodes cooperate and exchange intermediate results in a many-to-many fashion, thus avoiding the single-node bottlenecks that could occur if such data were routed via a single node. Regatta can perform the most demanding and complex distributed JOINs, as well as other challenging computations such as distributed GROUP BY, distributed SORT, and so forth.
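A minimal model of one such computation – a distributed GROUP BY with a many-to-many shuffle (node names and the exchange are simulated in-process): each producer node hash-partitions its partial aggregates directly to the node that owns each group, so no single node ever sees all the intermediate data:

```python
from collections import defaultdict

NODES = ["n0", "n1", "n2"]

def owner(group_key: str) -> str:
    """The consumer node responsible for finalizing this group."""
    return NODES[hash(group_key) % len(NODES)]

def partial_agg(rows):
    """Phase 1: each node pre-aggregates its local fragment."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return acc

fragments = {                       # each node's local table fragment
    "n0": [("red", 1), ("blue", 2)],
    "n1": [("red", 3), ("green", 4)],
    "n2": [("blue", 5), ("red", 6)],
}

# Phase 2: many-to-many exchange - every node sends each partial result
# straight to that group's owner, not through a coordinator.
inbox = defaultdict(list)
for node, rows in fragments.items():
    for key, subtotal in partial_agg(rows).items():
        inbox[owner(key)].append((key, subtotal))

# Phase 3: each owner finalizes its own groups independently.
for node in NODES:
    print(node, dict(partial_agg(inbox[node])))
```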
Modular (On-Disk) Data Layout Architecture
Regatta’s architecture is built with the understanding that different data and/or data-access styles may benefit from different (on-disk) data layouts. Therefore, while Regatta today supports a certain row-store layout, it is already designed for future additions of other row-store types, as well as column stores, blob stores, etc.
To optimize I/O performance, Regatta implements its own data layouts directly on top of raw block storage, and does not need any underlying filesystem.
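In spirit, that looks like the sketch below – positional, block-aligned I/O against the device itself. The device path, block size, and the absence of O_DIRECT handling here are illustrative simplifications, not Regatta specifics:

```python
import os

BLOCK = 4096
dev = os.open("/dev/nvme0n1", os.O_RDWR)    # a raw device, no filesystem

def write_block(block_no: int, payload: bytes) -> None:
    assert len(payload) <= BLOCK
    buf = payload.ljust(BLOCK, b"\0")        # pad to a full block
    os.pwrite(dev, buf, block_no * BLOCK)    # positional, aligned write

def read_block(block_no: int) -> bytes:
    return os.pread(dev, BLOCK, block_no * BLOCK)
```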
Flash-Optimized Data Layouts, No Slotted Pages, No Log-Structured Merge Trees
Regatta’s first row-store data layout type is specifically optimized for flash media. It allows us to optimally support both traditional small rows of more-or-less fixed size and variable-sized large rows with a large dynamic range of sizes (within the same table). Since traditional slotted pages were not an ideal fit for that purpose, and since some properties of log-structured merge-trees may not always align well with flash media or certain types of workloads, we developed our own log-structured data layout that operates very differently from an LSM tree. Regatta’s data layout is optimal for a large variety of workload types. Additionally, Regatta’s B+Trees (used, for example, for indexes) massively leverage the high read-concurrency of flash media, allowing meaningfully faster and more efficient B+Tree accesses than algorithms that assume more “generic” underlying storage (e.g., magnetic HDDs).
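To illustrate just the flash-concurrency point (this is not Regatta’s B+Tree code; read_block is the helper from the raw-device sketch above): when several index pages are needed, a flash-aware reader issues the reads concurrently rather than one at a time, as a seek-bound HDD design would:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pages(page_nos: list[int]) -> list[bytes]:
    # On flash, N concurrent 4K reads complete in roughly the time of
    # one (up to the device's queue depth); serializing them wastes that.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(read_block, page_nos))

# e.g., prefetch several B+Tree nodes discovered at the same level:
# pages = fetch_pages([1203, 5544, 9001])
```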
Extensive Read and Write-Back Caching
Regatta uses in-RAM read caching and write-back caching. Sophisticated write-back caching techniques (both volatile and non-volatile) optimize transaction execution and support advanced processing, resulting in meaningful amortized disk I/O reduction for both log-based and B+Tree persistent structures – without hurting strict durability guarantees. Needless to say, Regatta ensures that data is hardened to disk before a transaction commit is acknowledged to the client. For read caching, we took an approach that we call “logical caching”. Many databases cache disk pages in RAM, which may be wasteful. For example, if a hot 100-byte row resides in an 8KB disk page that otherwise contains cold data, caching the entire page in RAM “wastes” ~7.9KB of RAM, dramatically reducing the effectiveness of the read cache. In Regatta, the hot row is stored in RAM without the entire surrounding page content. Maintaining CPU-efficiency and RAM-allocation-efficiency with a logical cache has many fascinating challenges that I’ll cover in a separate blog.
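A toy contrast of the two approaches (the real cache’s bookkeeping is far more involved): an LRU over individual rows rather than disk pages, so a hot 100-byte row costs ~100 bytes of RAM instead of 8KB:

```python
from collections import OrderedDict

class LogicalCache:
    """LRU keyed by row, not by disk page."""
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.rows: OrderedDict[str, bytes] = OrderedDict()

    def put(self, row_id: str, row: bytes) -> None:
        if row_id in self.rows:                     # replacing an entry
            self.used -= len(self.rows[row_id])
        self.rows[row_id] = row
        self.rows.move_to_end(row_id)
        self.used += len(row)
        while self.used > self.capacity:            # evict coldest rows
            _, evicted = self.rows.popitem(last=False)
            self.used -= len(evicted)

    def get(self, row_id: str) -> bytes | None:
        if row_id in self.rows:
            self.rows.move_to_end(row_id)           # keep hot rows hot
            return self.rows[row_id]
        return None

cache = LogicalCache(capacity_bytes=1 << 20)  # 1 MiB of cache RAM
cache.put("hot_row", b"x" * 100)              # ~100 bytes, not a whole page
# Page-granular caching: 1 MiB holds 128 8KB pages (as few as 128 hot rows).
# Logical caching: the same 1 MiB holds ~10,000 hot 100-byte rows.
```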
Data Distribution and Elasticity
Data is distributed by an automatic algorithm that complies with user-defined constraints. With that approach, the user decides how much control to assume and what degrees of freedom to leave to Regatta. At one extreme, Regatta balances the data completely autonomously and transparently; at the other, the user takes full control and decides exactly how to distribute the data. And, of course, any in-between combination can apply. Regatta’s architecture also supports non-disruptive, under-the-hood migration of data among nodes – for rebalancing and the like – while transactions, short or lengthy, are executing.
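A sketch of that spectrum of control (the constraint syntax is invented for illustration): the user may pin a table to a subset of nodes, or leave placement entirely to the system, and the placement function honors whatever freedom remains:

```python
NODES = ["n0", "n1", "n2", "n3"]

# User-defined constraints: table -> allowed nodes (None = no constraint).
constraints = {
    "eu_customers": ["n2", "n3"],   # e.g., pin to EU-located nodes
    "metrics": None,                # fully automatic placement
}

def place(table: str, row_key: str) -> str:
    allowed = constraints.get(table) or NODES   # the remaining freedom
    return allowed[hash((table, row_key)) % len(allowed)]

print(place("eu_customers", "alice"))  # always 'n2' or 'n3'
print(place("metrics", "cpu_load"))    # any of the four nodes
```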
Distributed Primary and Secondary Indexes
Regatta fully supports primary as well as secondary indexes, and the indexes can be massively distributed if needed. By design, the distribution of an index across nodes does not have to be aligned with the distribution of the corresponding rows across nodes, allowing the most performant use of the indexes.
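A sketch of that independence (names invented): secondary-index entries can be distributed by the indexed value while rows are distributed by primary key, so an index lookup is one hop to the index entry’s node plus one hop to each matching row’s node:

```python
NODES = ["n0", "n1", "n2", "n3"]

def row_node(pk: str) -> str:        # rows: placed by primary key
    return NODES[hash(("row", pk)) % len(NODES)]

def index_node(value: str) -> str:   # index entries: placed by indexed value
    return NODES[hash(("idx", value)) % len(NODES)]

# A secondary-index shard on customers(email): value -> primary keys.
index_shard = {"bob@example.com": ["pk_17"]}

email = "bob@example.com"
print("index entry lives on", index_node(email))
for pk in index_shard[email]:
    print("row", pk, "lives on", row_node(pk))  # often a different node
```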