File under “now possible”: A 20B row JOIN and 50,000 UPDATES/second in a single distributed database 

Jim Walker

Chief Product Officer

Distributed joins are among the most demanding operations in large‑scale data systems. Executing them while transactions are being committed at the same time is a challenge many databases simply cannot meet, let alone while returning results in an acceptable time.

To demonstrate how RegattaDB can meet this challenge, we implemented a scenario that removes common optimizations and exposes the system’s raw execution capabilities. It joins 20 billion rows across 50 small cloud nodes, with no indexes, while executing 50,000 ACID‑compliant updates per second against the same data.

What we set out to prove: three constraints

Our goal was to show that RegattaDB can handle OLTP and OLAP workloads simultaneously. We used three hard constraints to test this ability; most databases can handle one or two of them, but none can do all three. Note that we did not set performance as a constraint, as the three constraints alone are difficult enough. RegattaDB also delivers query results in an acceptable time, but that is not defined as a constraint per se.

  1. Large‑scale, complex distributed joins
  2. Concurrent transactional writes using strong consistency (serializable)
  3. Random data distribution with no locality advantages


The data model and data distribution across the cluster

We defined a fairly simple data model to test these constraints: a Customers table and an Orders table. The test performs updates while executing a join that looks for all “foreign orders”, that is, all the orders where the region ID of the store in which the transaction took place does not match the region ID of the customer. We also count the payment methods for these orders. The JOINs are described in more detail below.

To support this, we created two tables and populated each with 10 billion rows. For the test, CustomerID is the join key, RegionID is used for the “foreign order” comparison, and PaymentMethod is used for aggregation. There are no secondary indexes defined for either table.

CREATE TABLE Customers (
    CustomerID     BIGINT,
    Name           VARCHAR(100),
    ZipCode        BIGINT,
    PhoneNumber    VARCHAR(20),
    LoyaltyLevel   BIGINT,
    RegionID       BIGINT
);

CREATE TABLE Orders (
    OrderID        BIGINT,
    CustomerID     BIGINT,
    StoreID        BIGINT,
    OrderYear      BIGINT,
    Amount         BIGINT,
    ProductID      BIGINT,
    Quantity       BIGINT,
    OrderDate      BIGINT,
    PaymentMethod  BIGINT,
    StoreRegionID  BIGINT
);


Random distribution of data across the cluster

We populated the 10B rows for each table and ensured the data for both tables was randomly and evenly distributed across the 50-node cluster: no sharding on CustomerID, no affinity rules, and no attempt to co‑locate related rows.

As a result, the chance that an order row and its corresponding customer row land on the same node is roughly 1 in 50, so the vast majority of joined rows require actual data movement among the nodes, with no shortcuts. To force full scans, we disabled all index use, and we chose predicates that cannot be optimized by predicate pushdown, forcing all 10B row pairs to be truly joined. This ensured that the JOIN had to:

  • Scan all table fragments
  • Propagate left and right sub-rows among nodes (many to many)
  • Construct approximately 10B join pairs at runtime

Ultimately, this tests the fundamental throughput and coordination of RegattaDB, not the quality of the indexes a user defines in the data model.
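The data-movement math behind the co-location claim is simple arithmetic; here is a quick sketch of the 1-in-50 odds and the resulting shuffle volume, using the node and row counts from the test setup:

```python
nodes = 50
join_pairs = 10_000_000_000  # roughly one join pair per Orders row

# With rows placed uniformly at random, the probability that an order
# row lands on the same node as its matching customer row is 1/nodes.
p_colocated = 1 / nodes

# Pairs that must move across the network (exact integer arithmetic)
cross_node_pairs = join_pairs * (nodes - 1) // nodes

print(p_colocated)       # 0.02
print(cross_node_pairs)  # 9800000000
```

So roughly 9.8B of the ~10B join pairs involve sub-rows stored on two different nodes, which is why the many-to-many shuffle dominates the work.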


Cluster configuration

For this test, we created a 50-node RegattaDB cluster in GCP using small cloud instances (n2-highmem-8). Each instance had 8 vCPUs, 64 GB RAM, and a 375 GB SSD. We then deployed the 50 instances of RegattaDB to this environment.


Concurrency Constraint: 50,000 serializable updates/second

Before we initiated the JOIN, we kicked off a process that continuously executes 50,000 ACID‑compliant updates per second. We wanted to prove that even with the long‑running scans and network shuffles of the JOIN, the system can maintain serializable isolation, external consistency, and acceptable commit latency for all transactions.

We chose UPDATEs so that the database would have to not only find each row but also resolve conflicts with the 20B rows participating in the JOIN at the same time. Ultimately, this tests the ability to run both OLTP and OLAP at scale.

The process executes the following updates using a script that picks a random_row_id on each iteration and paces the workload at 50,000 transactions per second. The rowid used here is an internal identifier that lets us reach a record efficiently; since the purpose of this test is to demonstrate JOIN speed rather than update performance, we use it out of convenience.

UPDATE Orders SET Amount = 1 WHERE rowid = random_row_id;
UPDATE Customers SET LoyaltyLevel = 1 WHERE rowid = random_row_id;
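A driver for this kind of sustained rate can be sketched with deadline-based pacing. This is our own illustrative Python sketch, not Regatta's actual harness, and execute_update is a hypothetical stand-in for the client call that issues one UPDATE:

```python
import time

RATE = 50_000  # target transactions per second

def deadline(start: float, i: int, rate: int = RATE) -> float:
    """Wall-clock time at which the i-th update should be issued."""
    return start + i / rate

def run(n_updates: int, execute_update) -> None:
    # Sleep until each update's scheduled time; pacing against absolute
    # deadlines keeps the long-term rate at RATE even if calls jitter.
    start = time.monotonic()
    for i in range(n_updates):
        delay = deadline(start, i) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        execute_update(i)  # hypothetical: issue one UPDATE with a random rowid
```

In practice such a driver would run many connections in parallel; the pacing logic is the only point being illustrated here.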


The 20B row JOINs

For this test, we defined two JOINs, each requiring the intermediate creation of approximately 10B virtual rows, where each virtual row pairs a sub-row from the left table with a sub-row from the right table that are not originally stored on the same node.

To force that to happen, the predicate used by the JOIN only compares fields from the right table against fields from the left table, so rows cannot be “filtered out” ahead of time. In RegattaDB, when you JOIN a large number of right and left sub-rows that are not co-located on the same nodes, a many-to-many propagation of the data occurs so that the matching left and right sub-rows land on the same node, and the JOIN data ultimately spreads evenly across many nodes.

JOIN 1: Identify “foreign” orders

SELECT count(*) FROM Customers c JOIN Orders o ON c.CustomerID = o.CustomerID AND c.RegionID != o.StoreRegionID;

A foreign order is one where Customers.RegionID != Orders.StoreRegionID, and we designed the synthetic data so that roughly 100 million orders meet this condition. To answer this question, the database must:

  • Scan customers and orders in parallel
  • Propagate the right sub-rows and the left sub-rows across the network, so matching left and right sub-rows meet on the same node
  • On each node, identify matching left and right sub-rows and join them
  • Evaluate the predicate on the joined rows
  • Emit matching rows for downstream processing

This is a classic distributed join followed by a cross‑table predicate evaluation.
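The steps above amount to a partitioned hash join. As a toy model (our own sketch, not RegattaDB's actual executor), routing both sides by a hash of CustomerID guarantees that matching sub-rows meet on the same node, where the predicate is then evaluated:

```python
from collections import defaultdict

NODES = 4  # illustrative; the actual test used 50

def route(customer_id: int) -> int:
    # Both tables hash the join key the same way, so matching
    # customer and order sub-rows land on the same "node".
    return hash(customer_id) % NODES

def count_foreign_orders(customers, orders):
    # customers: iterable of (CustomerID, RegionID)
    # orders:    iterable of (CustomerID, StoreRegionID)
    build = defaultdict(dict)   # node -> {CustomerID: RegionID}
    probe = defaultdict(list)   # node -> [(CustomerID, StoreRegionID)]
    for cid, region in customers:
        build[route(cid)][cid] = region
    for cid, store_region in orders:
        probe[route(cid)].append((cid, store_region))

    # Each node joins its local partitions and evaluates the predicate
    total = 0
    for node in range(NODES):
        regions = build[node]
        for cid, store_region in probe[node]:
            if cid in regions and regions[cid] != store_region:
                total += 1  # a "foreign" order
    return total
```

For example, with customers [(1, 10), (2, 20)] and orders [(1, 10), (1, 30), (2, 20)], only the order (1, 30) is foreign, so the count is 1.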

JOIN 2: Group foreign orders by payment method

Building on the JOIN 1 output, which finds all “foreign orders”, we extend the query to calculate the number of foreign orders for each payment method.

SELECT o.PaymentMethod, COUNT(*) AS order_count FROM Customers c JOIN Orders o ON c.CustomerID = o.CustomerID AND c.RegionID != o.StoreRegionID GROUP BY o.PaymentMethod;

Even though the group‑by is small relative to the join, it validates that the pipeline operates end‑to‑end without bottlenecks.
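The aggregation stage can be modeled the same way; here is a minimal sketch of the GROUP BY over the join output, again our own illustration with hypothetical tuple shapes for the two tables:

```python
from collections import Counter

def foreign_orders_by_payment(customers, orders):
    # customers: iterable of (CustomerID, RegionID)
    # orders:    iterable of (CustomerID, StoreRegionID, PaymentMethod)
    regions = dict(customers)  # build side keyed on CustomerID
    counts = Counter()
    for cid, store_region, payment in orders:
        # Join predicate plus the "foreign order" comparison
        if cid in regions and regions[cid] != store_region:
            counts[payment] += 1  # GROUP BY o.PaymentMethod
    return dict(counts)
```

In the distributed setting, each node would produce such partial counts for its partition and a final stage would merge them, which is why the group-by adds little cost relative to the join itself.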


The Results

We executed these queries a few times and the results were clear: not only can RegattaDB handle this workload, it does so with incredible performance. Again, this is two 10B-row tables, a complex join, and 50,000 updates happening concurrently.

  • JOIN 1: 174.613 seconds
  • JOIN 2: 181.929 seconds

Below, we include an edited video to demonstrate the execution. The full, unedited version is also available.


Why This Matters

Many analytical queries involve large tables and complex joins; however, it has been nearly impossible to execute these within a distributed database that is guaranteeing transactional writes at the same time. This demonstration illustrates that this is not only possible in RegattaDB, but it can also be accomplished without the need for specialized hardware, manual tuning, or tightly controlled data layouts.

For organizations that need to manage rapidly growing datasets, these capabilities translate into tangible benefits. They reduce operational overhead, lower infrastructure costs, and enable the simultaneous support of transactional applications alongside analytical queries. Additionally, they minimize restrictions on schema design, indexing strategies, and read patterns, giving teams greater flexibility and efficiency in handling diverse workloads.

Want to see this live? Let us know!
