## Architecture
The biggest differences among cloud data warehouses are whether they separate storage and compute, how much they isolate data and compute, and which clouds they can run on.
| Feature | ClickHouse | Databricks |
|---|---|---|
| Separation of storage and compute | Yes – the SharedMergeTree engine in ClickHouse Cloud enables full separation of storage and compute, with compute-compute separation through the Warehouses feature (introduced 2025), which lets multiple isolated compute services share the same data | Yes |
| Supported cloud infrastructure | AWS, GCP, Azure, cloud service and on-premises | AWS, Azure, GCP. Marketplaces and BYOC |
| Isolated tenancy – option for dedicated resources | • Multi-tenant metadata layer • Isolated tenancy for compute & storage per client in cloud | • Control plane in Databricks account • Data plane in customer VPC (optional) • Storage in customer VPC • Serverless SQL runs in Databricks account with private connectivity |
| Control vs abstraction of compute | Configurable cluster size and compute types in ClickHouse Cloud with granular control over nodes (1-128 nodes) and node characteristics. Warehouses feature enables multiple isolated read-only compute environments. | • Configurable clusters and instance types • Serverless SQL warehouses (GA 2025) run in Databricks account with private connectivity, no public IPs • Pro/Classic warehouses run in customer VPC |
| Self-hosted and hybrid deployment options | Self-managed deployments available with full control over infrastructure | • Databricks on customer cloud accounts • Unity Catalog for hybrid governance |
| ACID Compliance and Transactions | Limited ACID compliance with MergeTree engine family. | • ACID transactions with Delta Lake • Time travel and versioning • Concurrent read/write operations |
ClickHouse was originally developed at Yandex, the Russian search engine, as an OLAP engine for low-latency analytics. It was built as an on-premises solution with coupled storage and compute, and a large variety of tuning options in the form of indexes and merge trees. ClickHouse's architecture is known for its focus on performance and low-latency queries. The tradeoff is that it is considered very difficult to work with: SQL support is very limited, and tuning and running it requires significant engineering resources.
Databricks was built by the founders of Spark as an analytics platform to support machine learning use cases. It uses the Spark framework to process data residing in a data lake and is supported on AWS, GCP and Azure. Databricks coined the marketing term "Lakehouse" architecture to illustrate the unification of data lake and data warehouse use cases. Customers still manage Spark clusters that process data residing in Delta Lake, and data must be converted to the Delta Lake format to use Delta Lake functionality. Databricks SQL is a relatively new addition that simplifies access to data stored in a data lake.
## Scalability
Three differences among data warehouses and query engines have the biggest effect on scalability: whether storage and compute are decoupled, whether resources are dedicated, and how continuous ingestion is handled.
| Feature | ClickHouse | Databricks |
|---|---|---|
| Elasticity – Scaling for larger data volumes and faster queries | Automatic horizontal and vertical scaling in ClickHouse Cloud with SharedMergeTree architecture. Manual scaling for self-managed deployments with cluster rebalancing capabilities | Autoscaling clusters based on workload demand. Serverless SQL warehouses provide near-instant scaling (2-6 seconds startup) |
| Elasticity – Scaling for higher concurrency | Supports high concurrency with proper resource allocation and configuration. Vertical auto-scaling and horizontal manual scaling. Additional warehouses can idle to zero billing. Primary service always on in multi-warehouse configurations. | • 10 concurrent queries per cluster limit • Scales up to 40 clusters per warehouse (400 total concurrent queries) • Serverless SQL warehouses provide near-instant autoscaling • Pro/Classic warehouses take several minutes to provision new clusters • Real-world performance degradation typically occurs at 50-150 concurrent queries depending on complexity |
ClickHouse doesn't offer any dedicated scaling features or mechanisms. While it can deliver linearly scalable performance for some types of queries, scaling itself has to be done manually. Hardware is self-managed in ClickHouse. This means that to scale you would have to provision a cluster and migrate.
Databricks allows clusters to autoscale based on utilization. Query concurrency is capped at 10 per cluster, but the concurrency available to a SQL endpoint can be increased by adding clusters. Databricks also provides a choice of instance types.
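As a rough illustration of the concurrency arithmetic, the sketch below estimates how many clusters a target number of concurrent queries would need. It assumes the figures quoted in the table above (10 concurrent queries per cluster, up to 40 clusters per warehouse); the function and constant names are hypothetical, and the real limits may differ by warehouse type and release.

```python
import math

# Assumed limits, taken from the comparison table above. They are not
# read from any Databricks API and may change over time.
QUERIES_PER_CLUSTER = 10
MAX_CLUSTERS_PER_WAREHOUSE = 40


def clusters_needed(concurrent_queries: int) -> int:
    """Return the number of clusters needed to run the queries at the same time."""
    if concurrent_queries < 0:
        raise ValueError("concurrent_queries must be non-negative")
    return math.ceil(concurrent_queries / QUERIES_PER_CLUSTER)


def fits_in_one_warehouse(concurrent_queries: int) -> bool:
    """Check whether one warehouse can serve the load without queueing."""
    return clusters_needed(concurrent_queries) <= MAX_CLUSTERS_PER_WAREHOUSE


if __name__ == "__main__":
    for load in (25, 400, 401):
        print(load, clusters_needed(load), fits_in_one_warehouse(load))
```

Under these assumptions, 400 concurrent queries is the ceiling for a single warehouse, and anything beyond that either queues or needs a second warehouse. The estimate ignores query complexity, which the table notes can degrade performance well before the nominal limit.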
## Performance
Performance is the biggest challenge with most data warehouses today. While decoupled storage and compute architectures improved scalability and simplified administration, for most data warehouses they introduced two bottlenecks: storage and compute. Most modern cloud data warehouses fetch entire partitions over the network instead of fetching only the specific data needed for each query. While many invest in caching, most do not invest heavily in query optimization. Most vendors also have not improved continuous ingestion or semi-structured data analytics performance, both of which are needed for operational and customer-facing use cases.
| Feature | ClickHouse | Databricks |
|---|---|---|
| Indexes | • Primary indexes • Skipping indexes (minmax, set, bloom filters, ngrambf_v1, tokenbf_v1) • MergeTree indexes • Incremental Materialized views | None |
| Compute tuning | Configurable compute resources in cloud offering | Choice of cluster type, node types including SSD-optimized instances. Serverless provides automatic resource allocation with Intelligent Workload Management (IWM) |
| Storage format | Columnar, supports sorted, compressed, encoded & sparsely indexed files with native Apache Iceberg support. | • Delta Lake format with Liquid Clustering (February 2025 – replaces Z-ordering and traditional partitioning) • Cannot use Liquid Clustering alongside Z-ordering on same table • Allows for sorted data in Delta Lake • Requires Optimize to maintain ordering |
| Table-level partition & pruning techniques | Partitioning by date/time and custom partitions with MergeTree indexes. | • Table level partitioning • Liquid Clustering for improved query performance and reduced data skew (February 2025) • Z-ordering (legacy, replaced by Liquid Clustering) • Periodic optimization of storage required |
| Result cache | Yes, results cache with TTL and query condition cache. | Multi-layered caching: local in-memory cache per cluster plus remote result cache (serverless only) that persists across all warehouses in workspace |
| Warm cache (SSD) | Yes, at indexed data-range level granularity | Yes. Delta cache for data read by queries at file level granularity |
| Support for semi-structured data & JSON functions within SQL | Yes, including Lambda expressions and native JSON data type (GA in v25.3) | Yes |
| Vector Search and AI Capabilities | • Native vector search capabilities and embeddings • MCP Server for AI driven analytics • Natural Language to SQL • SQL based Inference | • MLflow integration and Databricks ML platform • Native vector search in Delta Lake (Vector Search) • AI and ML workloads optimized |
| Query Optimizations | • Primary indexes (ORDER BY) • Data skipping indexes (minmax, set, bloom filters, ngrambf_v1, tokenbf_v1) • Materialized views • Projections • PREWHERE optimization • Query analysis tools • Automatic global join reordering (v25.9) • Enhanced JSON query optimization • Streaming secondary indices | • Photon engine (C++ vectorized engine providing 3-8x average speedups, maximum speedups over 10x) • Automated stats collection (January 2025) enables cost-based optimization • Predictive I/O for faster point lookups and data updates • Liquid Clustering (February 2025) • Intelligent Workload Management (IWM) with AI-powered resource allocation • Delta cache • Materialized views support |
ClickHouse is famous for being one of the fastest local runtimes ever built for OLAP workloads. Its columnar storage, compression and indexing capabilities make it a consistent leader in benchmarks. Its lack of support for standard SQL and lack of a query optimizer mean that it is less suitable for traditional BI workloads and more suitable for engineering-managed workloads. While fast, it requires a lot of tuning and optimization.
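To make the "sparsely indexed" storage in the table more concrete, here is a simplified sketch of how a sparse primary index can narrow a range scan. It is not ClickHouse's implementation: the function names are hypothetical, the granule size is tiny for readability (ClickHouse's default is 8192 rows), and only a single integer key is handled.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # illustrative only; real granules are far larger


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the first key of every granule (one 'mark' per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]


def candidate_granules(marks, lo, hi):
    """Return the granule numbers that may hold keys in the range [lo, hi].

    Granule g holds keys between marks[g] and marks[g + 1], so the result is
    conservative: a returned granule may turn out to contain no match.
    """
    if not marks or lo > hi:
        return range(0)
    # First granule whose following mark is >= lo (it could still contain lo).
    first = max(bisect_left(marks, lo) - 1, 0)
    # Last granule whose first key is <= hi.
    last = bisect_right(marks, hi) - 1
    return range(first, last + 1)


if __name__ == "__main__":
    keys = list(range(0, 40, 2))  # sorted key column: 0, 2, ..., 38
    marks = build_sparse_index(keys)
    print(marks)
    print(list(candidate_granules(marks, 10, 17)))
```

The index stores one entry per granule rather than one per row, which keeps it small enough to hold in memory; the cost is that the engine still has to scan every candidate granule to find the exact matching rows.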
Databricks is designed to leverage the Spark framework for processing large volumes of data. It leverages compressed Parquet files in a Delta Lake. To reduce the amount of data processed, it uses data pruning on partitions and Parquet file metadata. Databricks does not provide any indexes.
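The pruning on file metadata described above can be sketched as a min/max overlap test per file. This is a minimal illustration, assuming each data file exposes the minimum and maximum value of the filtered column; the `FileStats` type, the file names, and the `prune_files` helper are hypothetical and do not correspond to a Delta Lake or Parquet API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file column statistics, as a query planner might see them."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


if __name__ == "__main__":
    files = [
        FileStats("part-0.parquet", 1, 100),
        FileStats("part-1.parquet", 101, 200),
        FileStats("part-2.parquet", 201, 300),
    ]
    # Predicate: column BETWEEN 150 AND 220
    for f in prune_files(files, 150, 220):
        print(f.path)
```

Pruning of this kind works best when the data is sorted or clustered on the filtered column. If values are spread randomly across files, every file's range overlaps most predicates and little can be skipped.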
## Use cases
There are a host of different analytics use cases that can be supported by a data warehouse. Look at your legacy technologies and their workloads, as well as the new possible use cases, and figure out which ones you will need to support in the next few years.
| Feature | ClickHouse | Databricks |
|---|---|---|
| Low-latency dashboards | • Sub-second load times at TB+ scale with proper indexing • ClickHouse Cloud reduces engineering overhead with managed service • Proven low-latency performance (120ms at 2500 QPS in benchmarks) • Purpose-built for low-latency OLAP and real-time analytics | • Sub-second to seconds load times at TB+ scale • Enhanced by Photon engine (3-8x average speedups) and Delta cache • Serverless SQL warehouses provide rapid startup (2-6 seconds) • Performance depends on cluster configuration |
| Enterprise BI | • Growing ecosystem with 50+ integrations including major BI tools • Native MySQL protocol support enables broad BI tool compatibility • Strong SQL compliance with PostgreSQL compatibility • Best suited for modern analytical workloads and engineering-managed use cases | • Strong for data science and ML workloads • Unified analytics platform approach • Growing traditional BI integrations • Serverless SQL warehouses improve accessibility • Delta sharing capabilities |
| Data Apps and AI Applications (Customer-facing low-latency high concurrency) | • Sub-second response times at TB+ scale • Supports 1000 concurrent users per replica • Strong price-performance on customer-facing applications • Native vector search and embeddings | • 10 concurrent queries per cluster, scaling to 400 total concurrent queries per warehouse • Real-world performance degradation typically occurs at 50-150 concurrent queries depending on workload complexity • Serverless provides near-instant autoscaling • Photon engine delivers 3-8x performance improvements • Strong ML and AI platform integration |
| Ad hoc | • Good for ad-hoc queries with ClickHouse Cloud's separated storage/compute architecture • Join optimizations enable more query complexity • Strong sampling capabilities (TABLESAMPLE) for exploratory analysis • Resource management through user quotas prevents query interference • Materialized views offer performance improvements for common aggregation patterns, ad-hoc users specify directly in SQL | • Excellent for ad-hoc with decoupled storage/compute • Serverless SQL warehouses provide instant provisioning • Intelligent Workload Management handles unpredictable workloads automatically • Strong for exploratory data analysis and ML workloads • Automated stats collection improves query planning |
ClickHouse was not designed to be a data warehouse, but rather a low-latency query execution runtime. Managing it typically requires significant engineering overhead. Hence, it is a good fit for engineering-managed operational use cases and customer-facing data apps, where low latency matters. It is not a good fit for a general-purpose data warehouse, nor for ad hoc analytics or ELT.
Databricks is a mature Spark-based platform proven for processing streaming data. It is widely used by data scientists for machine learning use cases through its integrated notebooks. From a low-latency query perspective, while it offers features like Delta Cache, it does not provide specialized indexes that can deliver low-latency queries.