ClickHouse vs Druid

A detailed comparison

This comparison looks at ClickHouse and Druid across architecture, scalability, performance, and use cases.

ClickHouse was originally developed at Yandex, the Russian search engine, as an OLAP engine for low-latency analytics. It was built as an on-premises solution with coupled compute and storage and a wide range of tuning options, including indexes and the MergeTree family of table engines. Its architecture is known for its focus on performance and low-latency queries. The tradeoff is that it is considered difficult to work with: its SQL dialect departs from standard SQL, and tuning and running it requires significant engineering resources.

Druid is an OLAP engine designed to provide fast real-time analytics. It adopts a clustered architecture in which servers host role-specific processes. These processes handle real-time and batch ingestion, indexing, and querying of historical and real-time data. Apache Druid can be deployed on virtual machines or as a Kubernetes-based cluster. Druid does not support a decoupled compute and storage architecture; object storage serves as deep storage, to which data is replicated.

The biggest differences among cloud data warehouses are whether they separate storage and compute, how much they isolate data and compute, and which clouds they can run on.

ClickHouse doesn’t offer any dedicated scaling features or mechanisms. While it can deliver linearly scalable performance for some types of queries, scaling itself has to be done manually. Hardware is self-managed, which means that to scale you have to provision a new cluster and migrate to it.

Druid can handle fast ingestion and high concurrency. Custom sizing and cluster tuning are required to balance the compute, memory, and storage needs of each process within Druid and to sustain high concurrency. Druid clusters can be grown by adding nodes, with the storage segments assigned to nodes rebalanced automatically.
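To make the idea of rebalancing segments after adding a node more concrete, here is a minimal Python sketch. It is not Druid's actual balancing algorithm: the `rebalance` function, the node names, and the segment ids are all hypothetical, and the only goal is to even out segment counts across nodes.

```python
def rebalance(assignment, nodes):
    """Even out segment counts across nodes (illustrative only, not Druid's algorithm).

    assignment: dict mapping node name -> list of segment ids.
    nodes: every node that should hold data, including newly added ones.
    Returns (new_assignment, moves), where moves is a list of (segment, src, dst).
    """
    balanced = {node: list(assignment.get(node, [])) for node in nodes}
    moves = []
    while True:
        fullest = max(nodes, key=lambda n: len(balanced[n]))
        emptiest = min(nodes, key=lambda n: len(balanced[n]))
        # Stop once no node holds more than one segment above any other.
        if len(balanced[fullest]) - len(balanced[emptiest]) <= 1:
            break
        segment = balanced[fullest].pop()
        balanced[emptiest].append(segment)
        moves.append((segment, fullest, emptiest))
    return balanced, moves


# Hypothetical two-node cluster that grows to three nodes.
cluster = {
    "historical-1": ["seg-0", "seg-1", "seg-2", "seg-3"],
    "historical-2": ["seg-4", "seg-5", "seg-6", "seg-7"],
}
after, moves = rebalance(cluster, ["historical-1", "historical-2", "historical-3"])
for segment, src, dst in moves:
    print(f"move {segment}: {src} -> {dst}")
```

In this toy run only two of the eight segments move, which illustrates why growing a cluster this way is cheaper than provisioning a new cluster and migrating everything.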

Three big differences among data warehouses and query engines determine how well they scale: decoupled storage and compute, dedicated resources, and continuous ingestion.

ClickHouse is famous for being one of the fastest local runtimes ever built for OLAP workloads. Its columnar storage, compression, and indexing capabilities make it a consistent leader in benchmarks. Its lack of support for standard SQL and lack of a query optimizer mean that it’s less suitable for traditional BI workloads and more suitable for engineering-managed workloads. While fast, it requires a lot of tuning and optimization.

Druid provides high performance through a columnar storage format, parallel processing, bitmap indexes, and roll-ups. However, Druid recommends a denormalized data model for best performance. Join operations in Druid are a relatively new feature with various limitations, especially when large datasets need to be joined.
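Roll-ups and bitmap indexes are easier to picture with a small example. The Python sketch below is a simplified illustration, not Druid's implementation: the event data, the hourly granularity, and the helper names `roll_up`, `build_bitmap_index`, and `rows_matching` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw events: (timestamp "YYYY-MM-DDTHH:MM", country, clicks)
events = [
    ("2024-01-01T10:01", "US", 1),
    ("2024-01-01T10:17", "US", 3),
    ("2024-01-01T10:42", "DE", 2),
    ("2024-01-01T11:05", "US", 5),
]


def roll_up(rows):
    """Pre-aggregate rows by (hour, country), summing the clicks metric."""
    totals = defaultdict(int)
    for ts, country, clicks in rows:
        hour = ts[:13]  # truncate the timestamp to hourly granularity
        totals[(hour, country)] += clicks
    return [(hour, country, clicks) for (hour, country), clicks in sorted(totals.items())]


def build_bitmap_index(rows, column):
    """Map each distinct column value to an integer bitmap; bit i set means row i has that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)


def rows_matching(bitmap, rows):
    """Return the rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


rolled = roll_up(events)                               # 4 raw events become 3 stored rows
country_index = build_bitmap_index(rolled, column=1)   # bitmap per country value
us_clicks = sum(row[2] for row in rows_matching(country_index["US"], rolled))
print(rolled)
print(us_clicks)
```

The roll-up shrinks what has to be stored and scanned, and the bitmap lets a filter such as country = "US" pick out matching rows without inspecting every row. The cost is that the individual raw events can no longer be queried once they have been rolled up.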

Performance is the biggest challenge with most data warehouses today.
While decoupled storage and compute architectures improved scalability and simplified administration, for most data warehouses they introduced two bottlenecks: storage and compute. Most modern cloud data warehouses fetch entire partitions over the network instead of fetching only the specific data needed for each query. While many invest in caching, most do not invest heavily in query optimization. Most vendors also have not improved continuous ingestion or semi-structured data analytics performance, both of which are needed for operational and customer-facing use cases.
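The cost of fetching whole partitions can be illustrated with a rough Python sketch. It assumes a partition split into blocks that each record the minimum and maximum key they contain. The `Block` class, the block sizes, and the query range are all hypothetical and do not describe any particular vendor's storage format.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_key: int      # smallest key stored in the block
    max_key: int      # largest key stored in the block
    size_bytes: int   # bytes transferred if the block is fetched


def bytes_fetched_full_partition(blocks):
    """Fetch every block in the partition, whatever the query needs."""
    return sum(b.size_bytes for b in blocks)


def bytes_fetched_with_pruning(blocks, lo, hi):
    """Fetch only blocks whose [min_key, max_key] range overlaps the query range [lo, hi]."""
    return sum(b.size_bytes for b in blocks if b.max_key >= lo and b.min_key <= hi)


# Hypothetical partition: 100 blocks of 64 KiB, each covering 1,000 consecutive keys.
partition = [Block(i * 1000, i * 1000 + 999, 64 * 1024) for i in range(100)]

full = bytes_fetched_full_partition(partition)
pruned = bytes_fetched_with_pruning(partition, lo=4200, hi=5100)
print(f"full partition: {full} bytes, pruned: {pruned} bytes")
```

With these made-up numbers, a query over keys 4200 to 5100 touches 2 of the 100 blocks, so pruning transfers 2% of the bytes a full-partition fetch would.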

ClickHouse was not designed to be a data warehouse, but rather a low-latency query execution runtime. Managing it typically requires significant engineering overhead. Hence, it’s a good fit for engineering-managed operational use cases and customer-facing data apps, where low latency matters. It is not a good fit for a general-purpose data warehouse, nor for ad hoc analytics or ELT.

Druid is designed as an OLAP engine that provides fast access to aggregations run against large volumes of data. It is typically used for customer-facing analytics and streaming data processing, often as an add-on alongside data warehousing products that are efficient at scaling, joining, and filtering large volumes of data. It is not a suitable replacement for a data warehouse.

There are a host of different analytics use cases that can be supported by a data warehouse. Look at your legacy technologies and their workloads, as well as the new possible use cases, and figure out which ones you will need to support in the next few years.
