Databricks was built by the founders of Spark as an analytics platform to support machine learning use cases. It leverages the Spark framework to process data residing in a data lake and is supported on AWS, GCP and Azure. Databricks coined the marketing term “Lakehouse” architecture to illustrate the unification of data lake and data warehouse use cases. Customers still manage the Spark clusters that process data residing in a Delta Lake, and data must be converted to the Delta Lake format to use Delta Lake functionality. Databricks SQL is a relatively new addition that simplifies access to data stored in a data lake.
Snowflake was one of the first data warehouses to decouple storage from compute, which gave it nearly unlimited compute scale, workload isolation, and horizontal user scalability. It runs on AWS, Azure and GCP. It is multi-tenant by nature, running over shared resources, and requires you to move data out of your VPC and into the Snowflake cloud. “Virtual Private Snowflake” (VPS) is its highest-priced tier and can run a dedicated, isolated version of Snowflake. Its virtual warehouses are T-shirt sized along an XS/S/M…/4XL axis, where each discrete size is bundled with fixed hardware properties that are abstracted from users.
Databricks allows clusters to autoscale based on utilization. Query concurrency is capped at 10 per cluster, but a SQL endpoint can serve more concurrent queries by adding clusters. Databricks also provides a choice of instance types.
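The concurrency arithmetic above can be sketched in a few lines. This is a minimal illustration, not Databricks' actual scaling logic: the function name `clusters_needed` and the constant `MAX_QUERIES_PER_CLUSTER` are hypothetical, and the only figure taken from the text is the limit of 10 concurrent queries per cluster.

```python
import math

# Per-cluster concurrency limit quoted in the text above.
MAX_QUERIES_PER_CLUSTER = 10


def clusters_needed(concurrent_queries: int,
                    per_cluster: int = MAX_QUERIES_PER_CLUSTER) -> int:
    """Return the minimum number of clusters needed to run the queries at once.

    Hypothetical helper for illustration only. At least one cluster is
    assumed to be running even when there are no queries.
    """
    if concurrent_queries < 0:
        raise ValueError("concurrent_queries must be non-negative")
    return max(1, math.ceil(concurrent_queries / per_cluster))


# 25 concurrent queries do not fit on two clusters (2 * 10 = 20), so three are needed.
print(clusters_needed(25))
```

Under these assumptions, a peak of 25 concurrent queries needs three clusters, while anything up to 10 fits on one.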
Snowflake scales very well for both data volumes and query concurrency. The decoupled storage/compute architecture supports resizing clusters without downtime and also supports horizontal auto-scaling for higher query concurrency during peak hours.
Databricks is designed to leverage the Spark framework for processing large volumes of data. It leverages compressed Parquet files in a Delta Lake. To reduce the amount of data processed, it uses data pruning on partitions and Parquet file metadata. Databricks does not provide any indexes.
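To make the idea of data pruning concrete, here is a minimal sketch of how partition values and per-file min/max statistics can be used to skip files before reading them. It is not Databricks' implementation: `FileStats`, `prune`, the file names, and the sample values are all hypothetical, and real Parquet metadata holds statistics per column and per row group rather than one pair per file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical summary of one data file: its partition and value range."""
    path: str
    partition: str   # partition value, e.g. a date
    min_value: int   # smallest value of the filtered column in this file
    max_value: int   # largest value of the filtered column in this file


def prune(files: List[FileStats], partition: str, lo: int, hi: int) -> List[FileStats]:
    """Keep only files in the given partition whose range overlaps [lo, hi]."""
    return [
        f for f in files
        if f.partition == partition      # partition pruning
        and f.min_value <= hi            # file metadata pruning:
        and f.max_value >= lo            # skip files that cannot contain a match
    ]


files = [
    FileStats("a.parquet", "2023-01-01", 0, 99),
    FileStats("b.parquet", "2023-01-01", 100, 199),
    FileStats("c.parquet", "2023-01-02", 0, 199),
]

# Query: partition 2023-01-01, column value between 150 and 180.
to_scan = prune(files, "2023-01-01", 150, 180)
print([f.path for f in to_scan])  # ['b.parquet']
```

Only one of the three files has to be read here. Without an index, the benefit depends on how well the data layout lines up with the filter: if every file spans the full value range, nothing is skipped.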
Snowflake typically comes out on top for most queries in public TPC-based benchmarks when compared to BigQuery and Redshift, but only marginally. Its micro-partition storage approach scans less data than approaches that rely on larger partitions. The ability to isolate workloads over the decoupled storage and compute architecture lets you avoid the competition for resources found in multi-tenant shared-resource solutions. Increasing the warehouse size can often improve performance (for a higher price), but not always linearly. Snowflake’s recently released “Search Optimization Service” delivers index-like behavior for point queries, but comes at an additional cost.
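The trade-off between warehouse size and price can be illustrated with a little arithmetic. The sketch below assumes that each step up the T-shirt axis doubles the hourly rate, with XS as 1 unit; that doubling, the full list of sizes, and the helper names `hourly_rate` and `query_cost` are assumptions made for illustration, and the runtimes are invented, not measured.

```python
# Assumed size axis; the text only names XS, S, M and 4XL explicitly.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]


def hourly_rate(size: str) -> int:
    """Relative hourly rate, assuming each size step doubles the rate (XS = 1)."""
    return 2 ** SIZES.index(size)


def query_cost(size: str, runtime_hours: float) -> float:
    """Relative cost of one query: hourly rate times the time it runs."""
    return hourly_rate(size) * runtime_hours


# Invented runtimes for the same query on two sizes.
baseline = query_cost("S", 1.0)         # 2 units/hour * 1.00 h
linear = query_cost("M", 0.5)           # 4 units/hour * 0.50 h, perfect 2x speedup
sublinear = query_cost("M", 0.75)       # 4 units/hour * 0.75 h, only 1.33x speedup

print(baseline, linear, sublinear)
```

When the speedup is perfectly linear, the larger warehouse finishes sooner at the same cost. When it is not, the query finishes sooner but costs more, which is the situation the paragraph above warns about.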
Databricks is a mature Spark-based platform proven for processing streaming data. It is widely used by data scientists for machine learning use cases through its integrated notebooks. For low-latency queries, it offers features like Delta Cache but does not provide the specialized indexes needed to deliver low latency.
Snowflake is a well-rounded, general-purpose cloud data warehouse that can also span beyond traditional BI and analytics use cases into ad hoc and ML use cases. Thanks to the flexible decoupled storage and compute architecture, which lets you isolate and control the amount of compute per workload, it can tackle a broad spectrum of workloads. However, like its close siblings Redshift and BigQuery, it struggles to deliver low-latency query performance at scale, making it a lesser fit for operational use cases and customer-facing data apps.