Sharding – some dirty little secrets
A few years ago, my Regatta co-founder Erez Webman and I wrote a whitepaper describing the dirty little secrets of scale-out sharding. Scale-out sharding is widely used as a technique to cope with the limitations of single-node, single-writer database systems, but it has many problematic issues. In this blog, I will summarize the issues we covered in that whitepaper, and I encourage you to read the full whitepaper for a more comprehensive treatment.
Who needs scale-out sharding anyway?
Large data sets and high-throughput applications challenge any single-node database. The data may exceed the server's storage capacity, and heavy workloads exhaust its CPU, I/O resources, or memory (RAM). Upgrading the server with more storage, more memory, and a faster CPU might postpone the problem, but scaling up is expensive, and there is a hard limit to the capacity and workload that a single server can handle. The contemporary solution is often scale-out sharding.
With scale-out sharding, data is divided across multiple servers (“shards”) in a cluster according to a sharding criterion. Each server operates as an independent database that can easily handle its part of the total data.
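To make the idea concrete, here is a minimal sketch of one common sharding criterion, hash-based routing. The shard names and the choose_shard helper are hypothetical illustrations, not the API of any particular database:

```python
import hashlib

# Hypothetical cluster of four independent database servers ("shards").
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def choose_shard(key: str) -> str:
    """Route a key to a shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard.
print(choose_shard("customer:42"))
```

Range-based and directory-based criteria exist as well; what they all share is that the routing decision lives in, or right next to, the application rather than inside a single database engine.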
However, scale-out sharding requires the application code to know the exact data placement across the local, independent database servers (shards) so that it can route and manage the workload across those shards. Sharding also breaks developers' expectations around serializability, externalizable behavior (the order in which committed effects become visible to outside observers), and the overall ACID attributes of the system.
If you are building an application, you do not want to be in the business of designing and implementing concurrency control and durability yourself. Implementing ACID and other data-integrity guarantees in application code soaks up a significant amount of development time and money. Attempts to maintain ACID-like guarantees at the application layer are known to be hard or even impossible, and in any case complex, painful, and usually buggy. Even experienced developers agree that it is far easier to rely on the database than to attempt to achieve ACID at the application level.
Contrary to popular belief, scale-out sharding configurations typically do not offer the notion of a single, large database. Scale-out sharding does not try to make many nodes act like one big database; rather, each node operates as an independent sub-database.
As a result, some basic functionality that you can expect from a single-node database cannot be achieved in a multi-node sharded database. For example, transactions and analytics operations that cross shard boundaries may not work, or may not work well. Sharded databases can also suffer from per-node performance bottlenecks, inefficient workload execution, and even volume skews that exceed a node's available capacity. And changing the cluster configuration, such as adding or removing servers, is generally challenging and forces developers to make significant changes in the application code, as the sketch below illustrates.
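A quick back-of-the-envelope sketch (with hypothetical keys, and assuming the naive modulo routing shown earlier) shows why adding even one server is so disruptive:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Naive modulo placement: hash the key, then mod by the cluster size."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"customer:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 shards")
# With modulo placement, roughly 80% of keys land on a different shard,
# so the data must be migrated and every piece of routing logic in the
# application must be updated in lockstep.
```

Schemes such as consistent hashing reduce the fraction of keys that move, but the data migration and the routing changes still have to be coordinated with the application.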
Diving Deeper
In the Dirty Little Secrets of Scale-out Sharding whitepaper, Erez and I dive deeper into these and various related topics.
Sharded databases cannot guarantee distributed ACID
If the accuracy of database operations matters, then the database should guarantee ACID. Some developers believe it is surely possible to build applications on modern databases that do not guarantee ACID, by writing code and logic that somehow guarantees correctness and ordering of operations at the application level. In reality, this is not always possible, and it inevitably introduces tradeoffs and complexity into the process and the application logic. As noted above, it is known to be hard or even impossible, and in any case complex, painful, and usually buggy.
Why are multi-node databases challenged to guarantee strong ACID?
- Consider what happens when a distributed transaction commits only “partially”, i.e. one local transaction (TR1) on Shard-1 commits successfully while another local transaction (TR2) on Shard-2 fails to commit. In the whitepaper, Erez and I examine how this partial commit of the distributed transaction opens a window of inconsistency that remains open until the application can resolve the scenario that caused it. We also review recovery approaches that the application logic can choose which have a good chance of succeeding and thus closing the window of inconsistency. However, as you will see, there are cases where categorically none of those strategies can succeed, leaving the inconsistent state stuck forever. (The sketch after this list simulates such a partial commit.)
- Consistency issues can occur even if both local transactions commit successfully. In the whitepaper, Erez and I walk through a scenario where both local transactions, on both shards, commit successfully. We illustrate that even then, there is no way to guarantee that both committed at exactly the same instant. Hence, data written by the first local transaction can be visible after the first local transaction commits and before the second one does. This breaks atomicity, and no atomicity means no ACID. It can in turn lead to further data inconsistencies in the database, to bugs in readers' code, and to application crashes.
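As a minimal illustration of both failure modes, here is a sketch that uses two in-memory SQLite databases as stand-ins for two shards, with a hypothetical money transfer committed as two independent local transactions and no coordinator:

```python
import sqlite3

# Two independent "shards", each a fully local database.
shard1 = sqlite3.connect(":memory:")  # holds alice's account
shard2 = sqlite3.connect(":memory:")  # holds bob's account
for shard, name, balance in [(shard1, "alice", 100), (shard2, "bob", 0)]:
    shard.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
    shard.execute("INSERT INTO accounts VALUES (?, ?)", (name, balance))
    shard.commit()

# TR1: debit alice on Shard-1 and commit locally.
shard1.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
shard1.commit()

# --- window of inconsistency ---
# A reader arriving here sees the debit but not the credit: the two local
# commits can never be made to happen at exactly the same instant.
# Worse, if the process or Shard-2 fails before the next commit, the debit
# is durable, the credit never happens, and the application is left to
# detect and repair the damage -- which, as the whitepaper shows, is not
# always possible.

# TR2: credit bob on Shard-2 and commit locally.
shard2.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
shard2.commit()
```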
Sharded databases are not well-suited for executing queries
In this section of the whitepaper, Erez and I analyze several examples of the impact of executing queries on a scale-out sharded distributed database. We review why application developers need to do a lot of hard work and write a lot of complex, error-prone application code to attempt to execute such queries. We also illustrate why, in many cases, writing code for the execution of such queries at the application level is simply not possible at all.
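To give a flavor of what that application code looks like, here is a hedged sketch of the scatter-gather logic needed to answer even a simple aggregate, such as a total balance across all shards. query_shard is a hypothetical stand-in for a per-shard SQL call; a real implementation would also need retries, partial-failure handling, and pagination:

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def query_shard(shard: str) -> int:
    """Hypothetical stand-in for running
    SELECT SUM(balance) FROM accounts on one shard."""
    return 0  # placeholder; a real client would issue the SQL remotely

def total_balance() -> int:
    # Scatter: fan the query out to every shard in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_shard, SHARDS))
    # Gather: combine the per-shard partial results.
    # Each shard answered at a slightly different moment, so the total may
    # mix states from before and after a concurrent cross-shard
    # transaction -- there is no consistent snapshot to aggregate over.
    return sum(partials)

print(total_balance())
```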
Regatta frees developers from the burden of dealing with underlying data-infrastructure challenges and allows them to focus 100% on the business logic. Simpler, better code and faster iterations = happy developers, happy users. A simple, efficient, yet powerful database platform = happy platform operators and data architects.