In addition to transactional capabilities, Regatta supports powerful query and analytics capabilities using standard SQL. This eliminates the need to move data, through ETL or data pipelines, to a separate data warehouse or data lake. Unlike other databases, Regatta’s entire functionality, whether transactional or analytical, is always fully supported across the whole cluster, regardless of how the data happens to be distributed across the nodes. Regatta uses state-of-the-art parallel and distributed analytics algorithms that can execute sophisticated distributed query operations, such as JOIN, SORT, GROUP BY, DISTINCT, and various aggregations, across node boundaries.
Traditionally, OLTP and OLAP workloads have run in separate databases, because running OLAP queries on an OLTP system slows down its transactional throughput. Likewise, complex OLAP queries have historically needed dedicated platforms that can handle data volumes and dimensional complexity that would overwhelm an OLTP system. To keep the two workloads apart, data is moved (via “ETL” and “data pipelines”) from the OLTP / operational database to a data warehouse. Such ETL processes are inefficient, operationally complex, and fragile; they multiply data capacity and the related need for storage, networking, and licenses, and they inflate total costs. Perhaps more importantly, ETL processes delay the arrival of data in the data warehouse, so data warehouse operations cannot include real-time data. Hence, despite the additional operations and costs, data warehouses do not facilitate the execution of queries on combined historical and real-time data.
Regatta is built to allow analytics to run alongside transactions and other workloads such as data ingress. Queries operate on fully consistent, up-to-date data without blocking transactions in any way.
Regatta’s powerful query planner and optimizer drive the query execution strategy dynamically, based on statistics and on intermediate calculation results. All of Regatta’s distributed query mechanisms have been designed to utilize the cluster’s aggregated compute resources and to eliminate single-node bottlenecks. Depending on the circumstances, Regatta can decide to execute the query calculations without requiring an “aggregator node” through which all data and intermediate results must flow. Instead, nodes that participate in a distributed query operation communicate in a many-to-many fashion, which keeps network overhead to a minimum.
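To make the many-to-many idea concrete, here is a minimal Python sketch of a distributed GROUP BY with SUM that needs no aggregator node. It is a generic hash-shuffle illustration, not Regatta's actual algorithm: the `owner_of` placement rule, the function names, and the in-process simulation of nodes are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def owner_of(key, num_nodes):
    """Hypothetical placement rule: a stable hash picks the node that owns a group key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def local_partial_sums(rows):
    """Phase 1: each node pre-aggregates its own rows, so only partial sums travel."""
    partial = defaultdict(int)
    for key, amount in rows:
        partial[key] += amount
    return partial


def shuffle_group_sum(node_rows):
    """Simulate SELECT key, SUM(amount) ... GROUP BY key with a many-to-many exchange.

    node_rows[i] holds the (key, amount) rows stored on node i.
    Returns (results_per_node, messages), where messages counts node-to-node sends.
    """
    num_nodes = len(node_rows)
    inboxes = [[] for _ in range(num_nodes)]
    messages = 0

    # Phase 2: every node sends each partial sum straight to the key's owner node.
    for sender, rows in enumerate(node_rows):
        for key, subtotal in local_partial_sums(rows).items():
            target = owner_of(key, num_nodes)
            inboxes[target].append((key, subtotal))
            if target != sender:
                messages += 1

    # Phase 3: each owner merges what it received; no single node sees all the data.
    results = []
    for inbox in inboxes:
        merged = defaultdict(int)
        for key, subtotal in inbox:
            merged[key] += subtotal
        results.append(dict(merged))
    return results, messages


if __name__ == "__main__":
    cluster = [
        [("eu", 10), ("us", 5), ("eu", 1)],
        [("us", 7), ("apac", 3)],
        [("apac", 4), ("eu", 2)],
    ]
    per_node, sent = shuffle_group_sum(cluster)
    for node_id, groups in enumerate(per_node):
        print(node_id, groups)
```

In this toy run the final groups (eu 13, us 12, apac 7) end up spread over the nodes that own them, and each group lives on exactly one node. A real system would also have to choose between strategies at run time, which this sketch does not attempt.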
Regatta leverages the cluster’s available resources to maximize the efficiency of query operations. Compute operations may run on nodes that hold the relevant data, and can expand onto nodes that hold other data or no data at all. Leveraging and aggregating such free compute resources across the cluster can significantly shorten query completion times, because intermediate query results can be held in (aggregated) RAM rather than swapped to disk. In addition, parallelizing CPU operations can significantly improve the processing time of CPU-intensive query operations.
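The benefit of aggregated RAM can be illustrated with simple arithmetic. The Python sketch below uses made-up figures (a 300 GB intermediate result and 64 GB of free RAM per node) and hypothetical helper names; it assumes identical nodes and ignores per-node overheads, so it is only a back-of-the-envelope model.

```python
import math


def fits_in_memory(intermediate_gb, free_ram_per_node_gb, participating_nodes):
    """True if the pooled free RAM of the participating nodes holds the intermediate result."""
    return intermediate_gb <= free_ram_per_node_gb * participating_nodes


def nodes_needed_in_memory(intermediate_gb, free_ram_per_node_gb):
    """Smallest number of identical nodes whose pooled free RAM avoids spilling to disk."""
    if free_ram_per_node_gb <= 0:
        raise ValueError("free_ram_per_node_gb must be positive")
    return math.ceil(intermediate_gb / free_ram_per_node_gb)


if __name__ == "__main__":
    # Assumed figures: three data-holding nodes alone offer 192 GB, which is too little.
    print(fits_in_memory(300, 64, 3))       # False
    # Recruiting two more nodes raises the pool to 320 GB, enough for 300 GB.
    print(nodes_needed_in_memory(300, 64))  # 5
```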
Regatta’s fractured mirrors can maintain multiple copies of the same table(s), possibly on different sets of nodes inside the cluster, and the same data can be represented differently in each mirror. This allows physical separation of data and compute between operational transactions and analytics, effectively eliminating any impact of analytical workloads on transactional performance. While the mirror for transactional operations is organized as a row store, the mirror for analytics can be organized as either a row store or a column store. Regatta guarantees high performance and data currency for both types of stores, at all times. When a column store is used as a fractured mirror of a row store that serves transactional operations, Regatta’s proprietary technology for fractured-mirror updating keeps both the row store and the column store completely up-to-date, at all times, without degrading OLTP performance.
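As a rough illustration of one table kept in two layouts, the Python sketch below maintains a row-oriented and a column-oriented copy and applies every insert to both. The class and method names are hypothetical, and the synchronous single-process update is only a stand-in: Regatta's mirror-updating technology is proprietary and is not described here.

```python
class FracturedMirrorTable:
    """Toy table stored twice: once row-wise, once column-wise, updated together."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.row_store = []  # list of tuples, suited to point lookups
        self.column_store = {name: [] for name in self.columns}  # name -> list of values

    def insert(self, row):
        """Apply one write to both mirrors so that neither lags behind the other."""
        if len(row) != len(self.columns):
            raise ValueError("row does not match the table's columns")
        self.row_store.append(tuple(row))
        for name, value in zip(self.columns, row):
            self.column_store[name].append(value)

    def get_row(self, index):
        """Transactional-style access: fetch a whole record from the row mirror."""
        return dict(zip(self.columns, self.row_store[index]))

    def column_sum(self, name):
        """Analytical-style access: scan a single column from the column mirror."""
        return sum(self.column_store[name])


if __name__ == "__main__":
    orders = FracturedMirrorTable(["order_id", "customer", "amount"])
    orders.insert((1, "alice", 30))
    orders.insert((2, "bob", 12))
    print(orders.get_row(1))            # {'order_id': 2, 'customer': 'bob', 'amount': 12}
    print(orders.column_sum("amount"))  # 42
```

The point of the two layouts is that a record lookup touches one tuple in the row mirror, while an aggregate touches one list in the column mirror and never reads the other columns.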
The output of some query operations may be extremely large. In such cases, no matter how massively parallel the query execution is, returning the output to the client may take far longer than executing the query itself. For example, a massively parallel distributed JOIN may execute in one second and output 60 gigabytes of data. While the query execution took only one second, delivering the 60 gigabytes to the client may take more than a minute, defeating any performance benefit. Regatta can quickly store such large query outputs in its own distributed storage in parallel, and lets the client access the results in parallel as well.
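The arithmetic behind this example can be sketched in a few lines of Python. The per-stream throughput of 0.75 GB/s and the count of 8 parallel streams are assumed figures chosen for illustration, and the model is idealized: it assumes the streams do not share a bottleneck, which in practice depends on the client's network and on how the results are consumed.

```python
def transfer_seconds(size_gb, stream_gb_per_s, streams=1):
    """Idealized time to deliver size_gb over `streams` equal, independent streams."""
    if stream_gb_per_s <= 0 or streams < 1:
        raise ValueError("throughput must be positive and streams must be at least 1")
    return size_gb / (stream_gb_per_s * streams)


if __name__ == "__main__":
    # One stream at an assumed 0.75 GB/s: delivery dwarfs the one-second query.
    print(transfer_seconds(60, 0.75))             # 80.0
    # Eight parallel streams at the same assumed rate each.
    print(transfer_seconds(60, 0.75, streams=8))  # 10.0
```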