Partitioning and Sharding

Partitioning is a database technique that breaks a single table into smaller chunks, or partitions. It provides a foundation for faster queries and ingestion, and it makes extremely large tables easier to maintain. Splitting the data into smaller tables also helps align the schema with data access patterns. Vertical and horizontal partitioning are the two common approaches, chosen according to the needs of the workload.

Vertical Partitioning: With vertical partitioning, a table is split into two or more tables, each holding a different set of columns. Each resulting table can be tailored to include only the columns that support a specific query access pattern.

Partitioning and Sharding - image 1

After vertical partitioning, the table above is split into two separate tables, each with its own set of columns.

Partitioning and Sharding - image 2
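
As a rough illustration of the column split described above, here is a minimal Python sketch. The column names (order_id, order_date, amount, customer_notes) and the helper vertical_partition are hypothetical and chosen only for this example; they are not taken from the figures. Each resulting table keeps the key column so the original rows can be rejoined later.

```python
# Hypothetical sales-order rows; the column names are illustrative assumptions.
orders = [
    {"order_id": 1, "order_date": "2023-01-15", "amount": 120.0, "customer_notes": "Leave at door"},
    {"order_id": 2, "order_date": "2023-02-03", "amount": 75.5, "customer_notes": "Gift wrap"},
]


def vertical_partition(rows, key, column_groups):
    """Split rows into one table per column group; every table keeps the key column."""
    tables = []
    for columns in column_groups:
        tables.append([{key: row[key], **{c: row[c] for c in columns}} for row in rows])
    return tables


# One narrow table for frequently queried columns, one for the rarely read text column.
order_facts, order_notes = vertical_partition(
    orders, "order_id", [["order_date", "amount"], ["customer_notes"]]
)
```

Queries that only need dates and amounts can now read order_facts without touching the wider text column.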

Horizontal Partitioning: With horizontal partitioning, by contrast, a table is split into multiple tables that all share an identical schema, each holding a subset of the rows of the original table.

Partitioning and Sharding - image 3

With horizontal partitioning, the data is split into different partitions based on the sales order date column. This is shown below.

Partitioning and Sharding - image 4
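
The row split by sales order date can be sketched in Python as follows. This is a simplified illustration under assumptions: the sample rows, the monthly granularity, and the helper horizontal_partition are hypothetical, and a real database would declare the partitioning scheme in its DDL rather than in application code.

```python
from collections import defaultdict

# Hypothetical sales-order rows; every partition will share this same schema.
orders = [
    {"order_id": 1, "order_date": "2023-01-15", "amount": 120.0},
    {"order_id": 2, "order_date": "2023-01-28", "amount": 42.0},
    {"order_id": 3, "order_date": "2023-02-03", "amount": 75.5},
]


def horizontal_partition(rows, partition_key):
    """Group rows by the value partition_key(row) returns; rows are not modified."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)


# Partition by month, taken from the "YYYY-MM" prefix of the order date.
by_month = horizontal_partition(orders, lambda row: row["order_date"][:7])
```

Each value in by_month holds rows with the same columns as the original table, only fewer of them.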

Sharding typically refers to horizontal partitioning. With sharding, the partitions can be spread across multiple machines, which enables scale-out. Each machine then processes only a portion of the data, which can speed up data processing. A shard key determines how the data is split into these smaller partitions.
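
One common way to map a shard key to a machine is hash-based routing, sketched below. This is only one possible scheme (range-based and directory-based routing are alternatives), and the names shard_for, route, and customer_id are assumptions made for this example. In-memory lists stand in for separate machines.

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Map a shard key to a shard index using a stable hash (same key, same shard)."""
    digest = hashlib.sha256(str(shard_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(rows, key_column, num_shards):
    """Distribute rows across num_shards lists, each standing in for one machine."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[key_column], num_shards)].append(row)
    return shards


# Hypothetical rows sharded by customer_id across four shards.
orders = [{"order_id": i, "customer_id": 100 + i % 5} for i in range(20)]
shards = route(orders, "customer_id", 4)
```

A stable hash such as SHA-256 is used here instead of Python's built-in hash(), whose output for strings varies between interpreter runs. Note that simple modulo routing moves most keys when the shard count changes, which is one reason production systems often prefer other schemes.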

Partitioned or sharded data can then be organized efficiently using clustering, which keeps the data sorted according to a key. This works best when sequential access to sorted data is needed. Analytics platforms use partition pruning and clustering to avoid scanning or re-sorting data unnecessarily, which makes query processing more efficient. Partition manipulation can also be used to archive or remove older data, because wholesale partition operations are efficient compared to row-level deletes and updates. These techniques are widely used in large-scale analytics platforms.
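
The three ideas in the paragraph above (clustering, partition pruning, and dropping whole partitions) can be sketched together in Python. The data layout, the monthly partition keys, and the function names cluster, query, and drop_partitions_before are hypothetical; real analytics platforms implement these steps inside the storage engine and query planner.

```python
from bisect import bisect_left, bisect_right

# Hypothetical partitions keyed by month ("YYYY-MM"); rows are (order_date, amount).
partitions = {
    "2023-01": [("2023-01-20", 40.0), ("2023-01-05", 10.0)],
    "2023-02": [("2023-02-11", 25.0)],
    "2023-03": [("2023-03-02", 60.0), ("2023-03-01", 15.0)],
}


def cluster(partitions):
    """Clustering: keep each partition's rows sorted by the clustering key (order_date)."""
    for rows in partitions.values():
        rows.sort(key=lambda row: row[0])


def query(partitions, start_date, end_date):
    """Return rows with start_date <= order_date <= end_date and the partitions scanned."""
    scanned, results = [], []
    for month in sorted(partitions):
        # Partition pruning: skip partitions whose month lies outside the requested range.
        if not (start_date[:7] <= month <= end_date[:7]):
            continue
        scanned.append(month)
        rows = partitions[month]
        # Rows are sorted by date, so a binary search finds the matching slice.
        # (A real engine would keep an index rather than rebuild this list per query.)
        dates = [row[0] for row in rows]
        results.extend(rows[bisect_left(dates, start_date):bisect_right(dates, end_date)])
    return results, scanned


def drop_partitions_before(partitions, cutoff_month):
    """Remove whole partitions older than cutoff_month instead of deleting row by row."""
    for month in [m for m in partitions if m < cutoff_month]:
        del partitions[month]


cluster(partitions)
rows, scanned = query(partitions, "2023-01-10", "2023-02-28")
drop_partitions_before(partitions, "2023-02")
```

In this sketch the query never touches the March partition, and removing January is a single dictionary deletion rather than a pass over individual rows.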