
Snowflake vs Databricks vs Firebolt

More and more, people are asking me “how do you compare Snowflake and Databricks?”

I have to admit, I’ve avoided the question a little. For one, it’s not an apples-to-apples comparison. Both support analytics, and they have increasingly been competing with each other, but they’re not quite the same thing. Snowflake is a data warehouse that now supports ELT. Databricks, which is built on Apache Spark, provides a data processing engine that many companies use alongside a data warehouse. They can also use Databricks as a data lakehouse by combining Databricks Delta Lake and Delta Engine.

But it’s a really important question, in part because many companies already use Databricks, Snowflake, or both. So here’s my best answer, which I’ve broken into two parts. 

  • First, how should you compare a data lakehouse and data warehouse, and when should you choose one or the other for a given use case?
  • Second, how do Databricks and Snowflake compare today? I’ve included a detailed comparison across more than 30 categories spanning architecture, performance, scalability, use cases, and cost based on the criteria in the data warehouse comparison guide.

Data lakehouse vs data warehouse

First let’s define a data lake, data lakehouse, and data warehouse, and their purpose.

A data warehouse is the database of choice for general-purpose analytics, including reporting, dashboards, ad hoc queries, and any other high-performance analytics. It gets its performance in part by sorting, pre-computing, and compressing data during ingestion for certain types of operations, including joins and aggregations. In part because the focus is on performance, a data warehouse generally holds only structured and, more recently, semi-structured data. A modern cloud data warehouse supports “data-lake scale,” meaning you can hold massive data sets and mostly pay only cloud storage costs. You do, however, have to access the data through the data warehouse engine, which adds compute costs as well.

A data lake is a data store (and only a data store) for any raw structured, semi-structured, and unstructured data, and it makes that data easily accessible to anyone. You can use it as a batch source for a data warehouse or any other workload. For example, it enables self-service ELT for any workload so that you can quickly add new data to a data warehouse.
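To make the “batch source” idea concrete, here is a minimal sketch of loading raw files from a data lake into a warehouse using Snowflake’s COPY INTO. The bucket, stage, and table names are hypothetical, and a real external stage would also need credentials or a storage integration.

```sql
-- Point an external stage at raw Parquet files sitting in the data lake.
CREATE STAGE raw_events_stage
  URL = 's3://example-data-lake/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Land the raw rows in a single VARIANT column for later ELT.
CREATE TABLE events_raw (payload VARIANT);

-- Batch-load the lake files into the warehouse table.
COPY INTO events_raw
  FROM @raw_events_stage
  FILE_FORMAT = (TYPE = PARQUET);
```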

A data lakehouse is often described as a new, open data management architecture that combines the best of a data lake with the best of a data warehouse. The goal is to deliver both sets of capabilities and to reduce complexity by moving more analytics directly against the data lake, thereby eliminating the need for multiple query engines.

But getting to this goal is tricky. A data lakehouse is usually implemented as a combination of a data lake with a federated query engine. That poses several challenges today:

  • It can limit access to the raw data to one query engine, which defeats the purpose of a data lake as a data store that makes raw data accessible to everyone.
  • The data in a data lake is raw data. Someone still needs to organize all of the data or make it easily discoverable.
  • Data lakehouses do not yet have as broad a set of integrations with ETL/ELT tools or BI tools as data warehouses.
  • Perhaps most importantly, data lakehouses do not have the performance of a data warehouse. Matching it requires recreating everything done in ETL/ELT and data warehouse ingestion, not to mention the other optimizations in the best data warehouse engines, and that takes time.

Despite all these challenges, I do recommend you consider offering a federated SQL engine as a default service with your data lake. Not only does it make it easy for new groups without their own tools to access the raw data; if you put together the right team, you will also deliver better, faster SQL access at lower cost for the other teams. One great service will be less expensive than several mediocre sets of tools spread out across groups.

I just can’t recommend a single engine that combines a data lake, a data warehouse, and other specialized analytics needs, because one does not exist today.

When should you choose a data warehouse or a data lakehouse? Eventually, you should end up with an architecture where your general-purpose analytics, especially your interactive analytics that need performance, are supported by a data warehouse. You should have all your raw data in a data lake. 

A data lakehouse can support reporting and dashboards, so long as you don’t need your queries to run in a few seconds or less every time. It can also be your go-to place for one-off analytics to answer specific questions with new data, or for prototyping.

There is a framework you can use to decide as well. Just answer these three questions. 

  1. How fast do the queries need to run? 
  2. Do you know the data you need to use in advance?
  3. Do you know the analytics you need to run in advance?

If you want all the details, I answered them in another blog post on how to choose.

You can also think about when you add each new technology over time as you modernize your data pipeline and overall data infrastructure. 

  1. Add a data lake the moment you need to store raw data and do not yet know how you will use it, or when you need to improve how you perform ELT into your data warehouse.
  2. Add a data lakehouse or federated query engine to a data lake the moment anyone needs SQL access to raw data for any workload including one-off reports or prototyping.
  3. For general-purpose reporting and dashboards, use a data warehouse. 
  4. Add even faster engines for interactive internal or external analytics and data applications as you need them.

A data lakehouse cannot be as fast as a high-performance data warehouse today because it does not structure the data in advance to improve query performance the same way.

Be ready to choose several engines. Do not resist it. With a modern data pipeline, you can choose the right engine for each workload when you need them because the value of each engine easily outweighs the cost. New deployments are much faster and lower cost with modern data pipelines, and then you pay as you go for compute. Even if you have multiple copies, storage is relatively cheap across all solutions. 

As an example, all of our customers run Firebolt side-by-side with Snowflake, BigQuery, Redshift, Athena, or Presto. They added Firebolt to improve performance and efficiency at scale for the fast analytics that weren’t working well. The rest was working well, so why replace it?

Databricks vs Snowflake vs Firebolt

Now for the second part: how do Databricks and Snowflake compare as vendors?

Many companies use them together; Databricks for data processing (ETL), and Snowflake as their data warehouse. But Databricks and Snowflake have been steadily moving into each other’s core markets - ETL and data processing, and data warehousing/lakehousing - for some time as they both try to become a data platform of choice for multiple workloads. 

If you compare Snowflake vs Databricks Delta Lake and Delta Engine today, at first glance they may seem pretty similar (see below). For example, they can both support reporting and dashboard workloads in the cloud.

Snowflake has marketed itself as a lakehouse for several years. Decoupled storage and compute means you can store any amount of data in Snowflake at cloud storage costs. Snowflake has also been steadily expanding its ELT capabilities, and many companies run their ELT with Snowflake. For example, some use Airflow to orchestrate data movement outside of Snowflake, and dbt or Snowflake Tasks and Streams to orchestrate SQL tasks inside Snowflake.
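As an illustration, here is a minimal sketch of that last pattern, using a Snowflake Stream to track new rows and a Task to run the transformation on a schedule. The table, warehouse, and task names are hypothetical.

```sql
-- Track new rows arriving in a raw table.
CREATE STREAM raw_orders_stream ON TABLE raw_orders;

-- Run the SQL transformation every 5 minutes, but only when new data exists.
CREATE TASK transform_orders
  WAREHOUSE = elt_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
AS
  INSERT INTO orders_clean
  SELECT order_id, customer_id, TRY_TO_NUMBER(amount) AS amount
  FROM raw_orders_stream;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK transform_orders RESUME;
```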

Databricks became a data lake vendor more recently. They first added Delta Lake, an open source data lake format built on their own versioned Parquet. Delta Lake requires integration to make it work with other engines; while that ecosystem isn’t as broad as a data warehouse’s yet, other engines can directly access the data, and the list of integrations has been growing. Then they added Delta Engine, a query engine that combines Databricks Photon, a specialized vectorized engine for SparkSQL, with a caching layer. They have also been steadily expanding their (Spark) SQL support so that you can do more of your ELT/ETL using SQL.
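For comparison, here is a minimal sketch of SQL-based ELT against Delta Lake tables in Databricks / Spark SQL. The table names and S3 path are hypothetical.

```sql
-- Register a Delta table over raw data in the lake.
CREATE TABLE raw_events
USING DELTA
LOCATION 's3://example-data-lake/delta/raw_events/';

-- Build a curated Delta table entirely in SQL.
CREATE TABLE events_clean
USING DELTA
AS
SELECT
  event_id,
  user_id,
  CAST(event_time AS TIMESTAMP) AS event_time
FROM raw_events
WHERE event_id IS NOT NULL;
```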

For data processing, if you need to go beyond SQL, Apache Spark or Databricks are great options. But for analytics use cases, there are really two major differences that lead you to choose one vendor over the other.

The first big difference is the types of data that can be stored and processed. Databricks Delta Lake is a data lake that can store raw unstructured, semi-structured, and structured data. When combined with Delta Engine it becomes a data lakehouse.

Snowflake supports semi-structured data, and is starting to add support for unstructured data as well. But today, if you really need a data lake for all your raw data, a dedicated data lake is the better option. Evaluate Delta Lake and Delta Engine against, for example, a data lake with Presto.
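If you go the data lake plus Presto route, that evaluation usually looks something like this minimal sketch, which assumes a Hive metastore and an S3 bucket are already wired up. The catalog, schema, table, and bucket names are hypothetical.

```sql
-- Expose raw Parquet files in the lake as a queryable table
-- (Presto/Trino Hive connector syntax).
CREATE TABLE hive.lake.page_views (
  view_id    varchar,
  user_id    varchar,
  view_time  timestamp
)
WITH (
  external_location = 's3://example-data-lake/page_views/',
  format = 'PARQUET'
);

-- Ad hoc SQL directly against the raw data.
SELECT user_id, count(*) AS views
FROM hive.lake.page_views
GROUP BY user_id;
```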

The second big difference is performance. If you need query performance for interactive queries, you will probably end up choosing a data warehouse like Snowflake, BigQuery, Redshift, or Firebolt.

Delta Lake has Delta Engine as a query engine optimized to run SparkSQL. In practice, Databricks has shown 2-4x acceleration of SparkSQL for deployments, and claims up to 60x performance improvements for specific queries.

This should not be too surprising. Delta Engine consists of a C++-based, vectorized SQL query optimization and execution engine (Photon) plus caching on top of Delta Lake’s versioned Parquet. Databricks and the broader Spark community know best how to optimize SparkSQL. Delta Engine can optimize for Apache Arrow or another internal format to avoid the cost of serialization and deserialization, and it adds hash aggregations to accelerate GROUP BY and other aggregation operators. The Databricks version of Delta Engine brings more advanced features like caching and data collocation (Z-Ordering). Delta Lake also has other features similar to a data warehouse, like vacuuming, to help with performance.
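To make those features concrete, here is a minimal sketch of the Delta Lake maintenance commands involved, in Databricks SQL. The table and column names are hypothetical.

```sql
-- Collocate related data (Z-Ordering) so data skipping works better for
-- queries that filter on these columns.
OPTIMIZE events_clean
ZORDER BY (user_id, event_time);

-- Vacuum old, unreferenced data files left behind by versioned writes.
VACUUM events_clean RETAIN 168 HOURS;
```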

But even with this performance acceleration, Delta Engine on Delta Lake will always struggle to match the performance of a data warehouse for more interactive query workloads, where caching doesn’t help as much, because a data warehouse optimizes storage for data access by the query engine during ingestion.

For example, Snowflake has cluster keys to sort data, and micro-partition pruning to reduce the number of partitions accessed and stored in the Snowflake virtual warehouse (compute) cache. Snowflake also adds search optimization, which improves the performance of returning small subsets of rows. Cluster keys, pruning, and search optimization all shrink the amount of data fetched and scanned compared to a federated query engine like Delta Engine that does not have these features. Given that Snowflake also has an optimized, vectorized query engine, you should expect Snowflake to be faster for any ad hoc or interactive queries that need to fetch data from remote storage.
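For reference, here is a minimal sketch of how those Snowflake features are enabled. The table and column names are hypothetical.

```sql
-- Cluster key: keeps micro-partitions sorted so pruning can skip most of them.
ALTER TABLE page_views CLUSTER BY (view_date, user_id);

-- Search optimization: speeds up highly selective lookups of small row subsets.
ALTER TABLE page_views ADD SEARCH OPTIMIZATION;
```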

Firebolt goes even further on indexing. Sparse (primary) indexes sort data based on multiple columns, in the order they’re listed in the index definition. This index shrinks data access down to exact data ranges that are much smaller than partitions. Aggregating indexes precompute aggregations during ingestion and replace compute and data access with results served from cache, without having to rewrite your queries. Join indexes do the same for joins.
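As a rough illustration, Firebolt index definitions look something like the sketch below. The exact DDL varies by Firebolt version, and the table, column, and index names here are hypothetical.

```sql
-- Sparse primary index: data is sorted by these columns, so queries read
-- only the exact ranges they need.
CREATE FACT TABLE page_views (
  view_date  DATE,
  user_id    BIGINT,
  page_id    BIGINT,
  duration   INT
)
PRIMARY INDEX view_date, user_id;

-- Aggregating index: precomputes these aggregations during ingestion so
-- matching GROUP BY queries are served from precomputed results.
CREATE AGGREGATING INDEX page_views_daily_agg ON page_views (
  view_date,
  page_id,
  SUM(duration),
  COUNT(*)
);
```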

The result is 10x or less data accessed and stored in compute. That translates into 10x or faster data access and 10x or faster query execution, since 10x less data is scanned and processed, even before you add the performance gains from using cached, precomputed results. This is how customers have been able to achieve 10x or faster performance and 10x lower cost in their own benchmarks.

The detailed comparison - Databricks vs Snowflake vs Firebolt

Now for the detailed comparison across more than 30 categories of Databricks vs Snowflake vs Firebolt, including: 

  • Architecture: elasticity, technology support, isolation and security
  • Scalability: elastic scale, scale efficiency, query and user concurrency
  • Performance: query, indexing, data type and storage optimization
  • Use cases: reporting, dashboards, ad hoc, operations and customer-facing analytics
  • Cost: administration, infrastructure resources, vendor pricing

I’ve included Firebolt because, well, I work for Firebolt. But these three vendors do cover the full range of analytics and data applications for both employees and customers, so the comparison should also help you understand the key criteria for each type of analytics and data app.

| | Databricks | Snowflake | Firebolt |
| --- | --- | --- | --- |
| Architecture (overall) | | | |
| Elasticity - separation of storage and compute | Yes | Yes | Yes |
| Supported cloud infrastructure | AWS, Azure, Google Cloud | AWS, Azure, Google Cloud | AWS only |
| Isolated tenancy - option for dedicated resources | Isolated compute and storage | Multi-tenant resources (only VPS is dedicated, on an isolated Snowflake account) | Multi-tenant dedicated resources |
| Compute - node types | Complete choice | 1-128 nodes, unknown types | 1-128 nodes, choice of any types |
| Data - internal/external, writable storage | Only queries Delta Lake tables | External tables supported | External tables (used for ingestion) |
| Security - data | Separate customer keys, encryption at rest, complete RBAC for clusters, pools, jobs and table-level. Supports secrets | Separate customer keys (only VPS is an isolated tenant), encryption at rest, RBAC | Dedicated resources for storage and compute. Encryption at rest, RBAC |
| Security - network | TLS, firewall whitelist/blacklist control, isolation | Firewall, SSL, PrivateLink, whitelist/blacklist control, isolated for VPS only | Firewall and WAF, SSL, PrivateLink, whitelist/blacklist control |
| Scalability | | | |
| Elasticity - (individual) query scalability | Autoscaling based on load (Databricks) | 1-click cluster resize, no choice of node size | 1-click cluster resize of node type, number of nodes |
| Elasticity - user (concurrent query) scalability | Autoscaling based on load (Databricks) | Autoscale up to 10 warehouses. Limited to 20 DML writes in queue per table | Unlimited manual scaling |
| Write scalability - batch | Strong | Strong | Strong. Multi-master parallel batch |
| Write scalability - continuous | Strong, with ACID support and versioning | Limited to 20 DML writes in queue per table. 1 minute or greater ingestion latency recommended | Multi-master continuous writes |
| Data scalability | No known limit | No specified storage limit. 4XL data warehouse (128 nodes) | No limit |
| Performance | | | |
| Indexes | Hash aggregations | None | Indexes for data access, joins, aggregation, search |
| Query optimization - performance | Cost-based optimization, vectorization | Cost-based optimization, vectorization | Index- and cost-based optimization, vectorization, JIT, pushdown optimization |
| Tuning | Choice of any node types and tuning options | Can only choose warehouse size, not node types | Choice of node types, indexing |
| Storage format | Optimized versioned Parquet, but not integrated with cache or optimized with indexing | Optimized micro-partition storage (S3), separate RAM | Optimized F3 storage (on S3). Data access integrated across disk, SSD, RAM |
| Ingestion performance | Strong batch and continuous ingestion with versioning | Batch-centric (micro-partition-level locking, limit of 20 queued writes per table) | Multi-master, lock-free, high-performance ingestion with unlimited scale for batch or continuous ingestion |
| Ingestion latency | Strong via Structured Streaming and/or Delta Live Tables into Delta Lake | Batch writes preferred (1+ minute interval). Requires rewrite of entire micro-partition | Immediately visible during ingestion |
| Partitioning | N/A. Depends on how you ingest or process data. By default it’s a data lake with raw data | Micro-partitions / pruning, cluster keys | F3 with sparse indexing |
| Caching | Caching as part of Delta Engine | Result cache, materialized views | F3 (cache) with aggregating, join, and search indexes |
| Semi-structured data - native JSON functions within SQL | Yes (SparkSQL) | Yes | Yes (Lambda) |
| Semi-structured data - native JSON storage type | Stored as a single field in Parquet, but you can optimize by converting using Spark | Limited VARIANT (single field) | Yes (nested array type, compact) |
| Semi-structured data - performance | Not fast by default, but you can optimize as part of (Spark) data processing. No indexing | Slow (can require full load into RAM, full scan) | Fast (array operations without full scans) |
| Use cases | | | |
| Reporting | Yes | Yes | Yes |
| Dashboards | Fixed views | Fixed views | Fixed views, dynamic / fast queries, changing data |
| Ad hoc | Seconds-minutes first-time query performance to load the cache | Seconds-minutes first-time query performance | Sub-second to seconds first-time query performance |
| Operational or customer-facing analytics (high concurrency, continuously updating / streaming data) | Scales, but the slowest performance for analytics workloads | Slower query performance. Limited continuous writes and concurrency, slow semi-structured data performance | Yes. Supports continuous writes at scale, fast semi-structured performance |
| Data processing engine (exports or publishes data) | The best (data lake) for batch-based extraction. Supports SQL extraction as well | Export query results or tables | Export query results |
| Data science/ML | Strong ML support | Spark, Arrow, Python connectors, integration with ML tools, export query results | Export query results |
| Cost | | | |
| Administration - deployment, management | The most tuning and administration required, including work on ingestion to Delta Lake. Delta Live Tables does make ETL easier | Easy to deploy and resize. Strong performance visibility, limited tuning | Easy to deploy and resize. Easy to add indexing, change instance types |
| Choice - provision different cluster types on same data | Yes | Choice of fixed-size warehouses | Choice of node types, engine sizes |
| Choice - provision different number of nodes | Yes | Yes | Yes |
| Choice - provision different node types | Yes | No | Yes |
| Pricing - compute | Open source option. See pricing (based on DBU compute) | $2-$4+ per node. Fast analytics need a large cluster ($16-32+/hour) or greater | Choose any node type. Compute costs range $1-10/hour with a 10x price-performance advantage |
| Pricing - storage | AWS pricing | All storage. $23/$40 per TB per month (on demand / up front) | $23/TB |
| Pricing - transferred data | None | None | None |



Summary

In short, Databricks Delta Lake and Delta Engine together are a lakehouse. You choose them as a data lake, and for lakehouse-based workloads including ELT for data warehouses, data science and machine learning, and even static reporting and dashboards if you don’t mind the performance difference and don’t have a data warehouse.

Most companies still choose a data warehouse like Snowflake, BigQuery, Redshift or Firebolt for general purpose analytics over a data lakehouse like Delta Lake and Delta Engine because they need performance.

But it doesn’t matter. You need more than one engine. Don’t fight it. You will end up with multiple engines for very good reasons. It’s just a matter of when. 

Eventually one vendor may combine all these capabilities into a single offering and make it much easier for more companies to build and manage data pipelines. But for now, choose several to get the best of lakehouses and data warehouses with your data pipeline.
