Datastore Sharding — an Overview of Strategies and Concerns


I did some more learning on distributed systems this year, so I'd like to share some of my learning notes and experiments with you :).

Categories:

a. Horizontal Partitioning: Different rows are split into different partitions.

b. Vertical Partitioning: Different columns are split into different partitions (some denormalisation may be needed, e.g. duplicating userId into each partition, but it is worth it: querying a single partition is better than doing a join across multiple partitions). See the sketch below.
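
A minimal sketch of the two splits, using in-memory Python dicts for a hypothetical users table (the column names and split rules are made up for illustration):

```python
# Hypothetical "users" rows; table, column names, and split rules are illustrative only.
users = [
    {"userId": 1, "name": "Ann", "country": "SG", "bio": "long profile text"},
    {"userId": 2, "name": "Bob", "country": "US", "bio": "long profile text"},
    {"userId": 3, "name": "Cleo", "country": "US", "bio": "long profile text"},
]

# Horizontal partitioning: whole rows land on different partitions (split here by userId parity).
horizontal = {
    "partition_0": [r for r in users if r["userId"] % 2 == 0],
    "partition_1": [r for r in users if r["userId"] % 2 == 1],
}

# Vertical partitioning: different columns land on different partitions.
# userId is duplicated (denormalised) into both, so each partition can be queried on its own.
vertical = {
    "profile": [{"userId": r["userId"], "name": r["name"], "country": r["country"]} for r in users],
    "bio": [{"userId": r["userId"], "bio": r["bio"]} for r in users],
}

print(horizontal["partition_1"])  # rows with odd userId
print(vertical["bio"][0])         # {'userId': 1, 'bio': 'long profile text'}
```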

When to shard:

When a single-node datastore reaches its bottlenecks:

  • Computing power: A single server has to run all concurrent queries, and each query consumes CPU, memory, and disk I/O.
  • Network bandwidth: The volume of concurrent reads and writes can exceed the throughput a single node's network link can serve.
  • Geography: It can be more efficient to store a user's data in the same region as the user to reduce access latency.
  • Data isolation and privacy: Some data comes with strict isolation or privacy requirements.

Before sharding, you could try other options:

  • Creating more read replicas: Splitting the read and write load across replicas can alleviate the problem, since most applications are read-heavy.

Strategies:

1. Hash based:

Simple: use a hash function to map the shard key to a hash value, then take it modulo the number of shards, e.g. admin@email.com --hash--> 123249 --locate-shard--> 123249 % 20 = 9.
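
A minimal sketch of that lookup, assuming 20 shards and using MD5 purely as a stable example hash (the actual hash function is whatever your datastore or client uses):

```python
import hashlib

NUM_SHARDS = 20  # assumption for this example

def shard_for(key: str) -> int:
    # Hash the shard key to a big integer, then take it modulo the number of shards.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("admin@email.com"))  # some shard number in 0..19
```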

Pros:

  • Easy to maintain: only a hash function is needed, rather than a lookup table.

Cons:

  • Resharding is expensive: changing the number of shards changes hash % N for almost every key, so most of the data has to be moved.

Use case: Memcached, Redis, MongoDB

2. Range based

Here, each shard is assigned a contiguous range of shard-key values, and when a new record comes in, it is placed on the shard whose range contains its shard key.
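
A minimal sketch of locating the right shard, assuming shards are split on userId boundaries (the boundary values below are made up):

```python
import bisect

# Exclusive upper bound of each shard's key range (values are illustrative).
shard_upper_bounds = [1_000_000, 2_000_000, 3_000_000]

def shard_for(user_id: int) -> int:
    # bisect_right finds the first shard whose upper bound is greater than the key.
    idx = bisect.bisect_right(shard_upper_bounds, user_id)
    if idx == len(shard_upper_bounds):
        raise ValueError("key falls outside all configured ranges")
    return idx

print(shard_for(1_500_000))  # -> 1
```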

Pros:

  • Simple to implement, and range queries over the shard key can usually be served by a single shard or a few adjacent ones.

Cons:

  • Rebalancing can be difficult: data has to be redistributed dynamically as the total amount of data grows, and it is hard to pre-split the table into multiple shards up front.
  • Warming problem: at the start, a single node takes all the queries, because not enough splits have happened yet for the data to be spread to the other, idle nodes.

Use case: Google Spanner and HBase

3. Consistent Hashing

This is also a hash-based strategy, but it scales much better than the basic one.

With consistent hashing, both the shards and the keys are hashed onto a "hash ring", each key is owned by the next shard clockwise from it, and data ends up evenly and randomly distributed across shards.
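
A minimal sketch of a hash ring, assuming MD5 for ring positions and a handful of virtual nodes per shard to smooth the distribution (the shard names are illustrative):

```python
import bisect
import hashlib

def position(value: str) -> int:
    # Map a string to a point on the ring (a large integer space).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard is placed at `vnodes` positions on the ring to even out the distribution.
        self.ring = sorted(
            (position(f"{shard}-{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    def shard_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next shard position (wrapping around).
        idx = bisect.bisect_right(self.positions, position(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("admin@email.com"))
```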

Pros:

  • Adding or removing a shard only moves the keys adjacent to it on the ring, instead of forcing almost every key to be rehashed.

Cons:

  • Without enough virtual nodes per shard, the distribution across the ring can still end up uneven.

Use case: DynamoDB and Cassandra

4. Directory based:

Similar to range-based sharding, a lookup table is maintained to keep track of which shard holds which data, but the shard key here has to have low cardinality. (Low cardinality means the column has few unique values, e.g. a boolean.)
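
A minimal sketch of such a directory, assuming a low-cardinality region column as the shard key (the regions and shard names are made up):

```python
# Directory: maps each value of a low-cardinality shard key to the shard that holds it.
directory = {
    "eu": "shard-1",
    "us": "shard-2",
    "apac": "shard-3",
}

def shard_for(region: str) -> str:
    # The directory is consulted on every request, so it must stay small and be cached.
    return directory[region]

print(shard_for("apac"))  # -> shard-3
```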

Potential Problems:

  • Keeping shards balanced is important but difficult to achieve, especially with hot keys.
  • Rebalancing can be expensive, and doing it without incurring downtime is difficult. Growth should therefore be taken into account so that each shard has room to hold more records.
  • Fewer joins and more denormalisation: joins across shards are expensive, so an alternative is to denormalise data so that queries can be served by a single shard.
  • Challenges with referential integrity and consistency: since synchronisation across shards takes time (it is normally asynchronous), enforcing referential integrity such as foreign keys or uniqueness constraints is challenging without careful design. For example, an auto-incrementing ID is a bad idea due to potential conflicts between shards, and the same goes for timestamp-based IDs, since the clocks of different shards can drift (see the sketch below).
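
A tiny sketch of the ID conflict described above, with two shards that each keep their own auto-increment counter (the prefixing workaround at the end is a common pattern, not something prescribed here):

```python
import itertools

# Each shard keeps its own counter, so both hand out the same "unique" id.
shard_a_ids = itertools.count(1)
shard_b_ids = itertools.count(1)
print(next(shard_a_ids), next(shard_b_ids))  # 1 1 -> collision across shards

# Common workaround (illustrative): embed the shard id in the key so counters never collide.
SHARD_ID = 7
local_ids = itertools.count(1)
print(f"{SHARD_ID}-{next(local_ids)}")  # "7-1"
```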

That’s it! Happy Reading.